Patent 10540328 was granted by the United States Patent and Trademark Office and assigned to Cohesity in January 2020.
Approaches for parallelized data deduplication. An instruction to perform data deduplication on a plurality of files is received. The plurality of files is organized into two or more work sets that each correspond to a subset of the plurality of files. Responsibility for performing each of said two or more work sets is assigned to a set of nodes in a cluster of nodes. The nodes may be physical nodes or virtual nodes. Each node in the set performs data deduplication on a different work set. In performing data deduplication, each node may store metadata describing where shared chunks of data are maintained in a distributed file system. The shared chunks of data are two or more sequences of bytes which appear in two or more of said plurality of files.
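The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the fixed-size chunking, the SHA-256 fingerprints, the round-robin split into work sets, the use of threads to stand in for cluster nodes, and the names `make_work_sets`, `dedupe_work_set`, `parallel_dedupe`, and `CHUNK_SIZE` are all assumptions made for illustration. For simplicity, the metadata is merged in one place after the workers finish, rather than being stored by each node in a distributed file system.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # hypothetical fixed chunk size, kept tiny for illustration


def make_work_sets(file_names, num_sets):
    """Split the files into work sets (round-robin is an assumed policy)."""
    sets = [[] for _ in range(num_sets)]
    for i, name in enumerate(sorted(file_names)):
        sets[i % num_sets].append(name)
    return [s for s in sets if s]  # drop empty work sets


def dedupe_work_set(work_set, files):
    """Work done by one 'node': chunk and fingerprint every file in its work set."""
    records = []
    for name in work_set:
        data = files[name]
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            records.append((fingerprint, name, offset, chunk))
    return records


def parallel_dedupe(files, num_nodes):
    """Run one worker per work set, then merge the resulting chunk metadata."""
    work_sets = make_work_sets(files, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        results = list(pool.map(lambda ws: dedupe_work_set(ws, files), work_sets))

    chunk_store = {}  # fingerprint -> chunk bytes, each distinct chunk kept once
    references = {}   # fingerprint -> list of (file name, offset) locations
    for records in results:
        for fingerprint, name, offset, chunk in records:
            chunk_store.setdefault(fingerprint, chunk)
            references.setdefault(fingerprint, []).append((name, offset))

    # A shared chunk is one that appears in two or more different files.
    shared = {
        fp: refs
        for fp, refs in references.items()
        if len({name for name, _ in refs}) >= 2
    }
    return chunk_store, shared


if __name__ == "__main__":
    files = {"a.txt": b"AAAABBBB", "b.txt": b"BBBBCCCC", "c.txt": b"DDDD"}
    store, shared = parallel_dedupe(files, num_nodes=2)
    print(len(store), "distinct chunks stored")
    for fp, refs in shared.items():
        print(fp[:12], "is shared by", refs)
```

In this sketch, each work set is processed independently, which is what makes the fingerprinting step parallelizable. Only the final merge needs a combined view of the chunk metadata.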