Patent attributes
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining the provenance of source code. One of the methods includes receiving a portion of a file occurring in a source code project. For each of a plurality of windows of characters in the portion of the file, a respective provenance signature is computed. An index that maps each provenance signature to occurrences of the provenance signature in one or more files of a plurality of projects is searched to identify one or more matching files that are each associated with at least one provenance signature computed for the portion of the file. Data identifying the one or more matching files is provided in response to receiving the portion of the file occurring in the source code project.