Similarity analysis of software is disclosed. An input file is received. Each pair consisting of the input file and a file included in a corpus is categorized as either a possible match or a mismatch. Those pairs categorized as possible matches are then analyzed using a pairwise component analysis.
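The two-stage flow described above (a coarse categorization of each pair, followed by a pairwise component analysis of the possible matches only) might be sketched as follows. This is a minimal illustration, not the disclosed method: the abstract does not say how pairs are categorized or what a component is. The sketch assumes Jaccard similarity over byte 4-grams as the coarse test, assumes non-empty lines as the components, and uses hypothetical names throughout (`shingles`, `categorize`, `component_analysis`, `analyze`, and the `threshold` value).

```python
def shingles(data: bytes, n: int = 4) -> set:
    """Set of n-byte substrings of data (an assumed, cheap fingerprint)."""
    return {data[i:i + n] for i in range(max(len(data) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def categorize(input_file: bytes, corpus: dict, threshold: float = 0.2) -> dict:
    """Stage 1: label each (input file, corpus file) pair.

    The threshold is an arbitrary illustrative value, not from the source.
    """
    signature = shingles(input_file)
    return {
        name: ("possible_match"
               if jaccard(signature, shingles(data)) >= threshold
               else "mismatch")
        for name, data in corpus.items()
    }


def components(data: bytes) -> list:
    """Split a file into components; non-empty lines are an assumption."""
    return [line for line in data.splitlines() if line.strip()]


def component_analysis(input_file: bytes, other: bytes) -> float:
    """Stage 2: compare every input component against every component of
    the other file, keep the best score per input component, and average."""
    comps_a, comps_b = components(input_file), components(other)
    if not comps_a or not comps_b:
        return 0.0
    best = [
        max(jaccard(shingles(x), shingles(y)) for y in comps_b)
        for x in comps_a
    ]
    return sum(best) / len(best)


def analyze(input_file: bytes, corpus: dict, threshold: float = 0.2) -> dict:
    """Run stage 2 only on the pairs that stage 1 labels as possible matches."""
    labels = categorize(input_file, corpus, threshold)
    return {
        name: component_analysis(input_file, corpus[name])
        for name, label in labels.items()
        if label == "possible_match"
    }
```

Calling `analyze(input_bytes, {"name": file_bytes, ...})` returns a score only for the corpus files that pass the coarse test. The point of the split is cost: the pairwise comparison grows with the product of the component counts, so under these assumptions it is reserved for pairs that the cheaper test has not already ruled out.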