Patent 10491622 was granted by the United States Patent and Trademark Office and assigned to Synack in November 2019.
An improved web crawler, an associated method of crawling the Internet, and automatic detection of changes in crawled webpages are provided. The method comprises: obtaining a first version and a second version of a webpage; generating a first simhash of the first version of the webpage and a second simhash of the second version of the webpage; calculating, using a similarity hashing function having small output perturbations for small input perturbations, a probability that there are no differences between the first version of the webpage and the second version of the webpage; providing, to one or more researcher computers, the first version of the webpage and the second version of the webpage; based on input identifying a change in the webpage, updating a count of changes associated with the webpage; and providing information about the change in the second version of the webpage relative to the first version of the webpage as feedback to the crawler.
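The simhash comparison described above can be illustrated with a minimal Python sketch. This is not the patented implementation: the tokenizer, the use of MD5 as the per-token hash, the 64-bit fingerprint width, and the function names `simhash`, `hamming`, and `similarity` are all assumptions chosen for illustration. The similarity score here is simply the fraction of matching fingerprint bits, used as a stand-in for the probability the abstract mentions.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width; the abstract does not specify one


def tokens(text):
    """Split text into lowercase word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=BITS):
    """Compute a simhash fingerprint: each token votes +1/-1 on every bit."""
    counts = [0] * bits
    for tok in tokens(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # a bit is set when the positive votes win
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=BITS):
    """Fraction of matching bits, in [0, 1]; 1.0 means identical fingerprints."""
    return 1.0 - hamming(a, b) / bits


if __name__ == "__main__":
    v1 = "<html><body><h1>Login</h1><p>Enter your password</p></body></html>"
    v2 = "<html><body><h1>Login</h1><p>Enter your passcode</p></body></html>"
    print(similarity(simhash(v1), simhash(v2)))
```

Because every token contributes to every bit, changing a few tokens usually flips only a few fingerprint bits, which is the "small output perturbations for small input perturbations" property. A crawler could compare the score against a threshold to decide whether a page is worth forwarding for review; any such threshold would need to be tuned and is not given in the source.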