Patent 10437847 was granted by the United States Patent and Trademark Office and assigned to Trifacta in October 2019.
A system determines samples of datasets that are typically processed by big data analysis systems. The samples are used to develop and test transformations that preprocess the datasets in preparation for analysis by big data systems. The system receives one or more transform operations and input datasets for the transform operations. The system determines samples associated with the transform operations. According to one sampling strategy, the system determines samples that return at least a threshold number of records in the result set obtained by applying a transformation. According to another sampling strategy, the system receives criteria describing the result of the transform operations and determines sample sets that generate result sets satisfying the criteria when the transform operations are applied.
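The two sampling strategies can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not say how samples are searched for, so the growing random prefix used here is an assumption, and the names find_sample, at_least, Record, and Transform are hypothetical. The threshold strategy is treated as a special case of the criteria strategy, where the criterion is a minimum record count on the result set.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence

Record = Dict[str, int]                               # hypothetical record shape
Transform = Callable[[List[Record]], List[Record]]    # maps a sample to a result set
Criteria = Callable[[List[Record]], bool]             # predicate over the result set


def at_least(threshold: int) -> Criteria:
    """Criterion for the threshold strategy: result set has >= threshold records."""
    return lambda result: len(result) >= threshold


def find_sample(
    dataset: Sequence[Record],
    transform: Transform,
    criteria: Criteria,
    initial_size: int = 10,
    seed: int = 0,
) -> Optional[List[Record]]:
    """Return a random sample whose transformed result satisfies `criteria`.

    Assumed search: shuffle once, then try prefixes of doubling size.
    Returns None if even the full dataset does not satisfy the criteria.
    """
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    size = max(1, initial_size)
    while True:
        sample = shuffled[:size]
        if criteria(transform(sample)):
            return sample
        if size >= len(shuffled):
            return None                     # whole dataset tried, no match
        size = min(size * 2, len(shuffled))


# Usage: a filter transform that keeps only a tenth of the records.
records = [{"id": i} for i in range(1000)]


def keep_multiples_of_ten(rows: List[Record]) -> List[Record]:
    return [row for row in rows if row["id"] % 10 == 0]


sample = find_sample(records, keep_multiples_of_ten, at_least(5))
```

Any other predicate over the result set can be passed in place of at_least(5), which corresponds to the second strategy in which the criteria are supplied to the system. A uniform random sample of fixed size could return too few rows after a selective filter, which is the situation the threshold strategy is meant to avoid.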