Patent 11526785 was granted by the United States Patent and Trademark Office in December 2022 and is assigned to VMware.
Techniques for performing predictability-driven compression of training data sets used for machine learning (ML) are provided. In one set of embodiments, a computer system can receive a training data set comprising a plurality of data instances and can train an ML model using the plurality of data instances, the training resulting in a trained version of the ML model. The computer system can further generate prediction metadata for each data instance in the plurality of data instances using the trained version of the ML model and can compute a predictability measure for each data instance based on the prediction metadata, the predictability measure indicating a training value of the data instance. The computer system can then filter one or more data instances from the plurality of data instances based on the computed predictability measures, the filtering resulting in a compressed version of the training data set.
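The flow described in the abstract (train, generate prediction metadata, compute a predictability measure, filter) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the toy nearest-centroid model, the use of class probabilities as prediction metadata, the choice of "probability assigned to the true label" as the predictability measure, the rule that highly predictable instances have low training value and are dropped, and the hypothetical names `train_centroids`, `predict_proba`, `compress`, and `threshold`.

```python
import math
from collections import defaultdict


def train_centroids(X, y):
    """Train a toy nearest-centroid model: one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_proba(model, features):
    """Prediction metadata (assumed form): softmax over negative centroid distances."""
    scores = {label: -math.dist(features, c) for label, c in model.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def compress(X, y, threshold=0.9):
    """Keep only instances whose predictability falls below the threshold.

    Assumption: predictability is the probability the trained model assigns
    to the instance's true label, and a high value means low training value.
    """
    model = train_centroids(X, y)
    kept = []
    for features, label in zip(X, y):
        predictability = predict_proba(model, features)[label]
        if predictability < threshold:
            kept.append((features, label))
    return kept


# Two tight clusters plus one hard-to-predict instance per class near the boundary.
X = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [2.4, 2.4],
     [5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [2.6, 2.6]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

kept = compress(X, y)
print(f"kept {len(kept)} of {len(X)} instances")
```

In this toy data set the instances deep inside each cluster are predicted with high confidence and are filtered out, while the two instances near the class boundary are retained. A real system would likely use a different model, a different predictability measure, and a tuned threshold.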