US Patent 7627613 Duplicate document detection in a web crawler system

Patent 7627613 was granted by the United States Patent and Trademark Office in December 2009 and assigned to Google.

All edits

Edits on 21 Sep, 2023

"update inverses"

Golden AI

edited on 21 Sep, 2023

Edits made to:

Infobox (+1 properties)

Infobox

Patent Citations Received

US Patent 11763013 Transaction document management system and method

Edits on 17 Sep, 2023

"Add patent inventor(s)"

Golden AI

edited on 17 Sep, 2023

Edits made to:

Infobox (+4 properties)

Infobox

Patent Inventor Names

Alexandre A. Verstak

Daniel Dulitz

Jeffrey A. Dean

Sanjay Ghemawat

Edits on 16 Sep, 2023

"Add patent abstract"

Golden AI

edited on 16 Sep, 2023

Edits made to:

Article (+622 characters)

Article

Patent abstract

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
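
The flow in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the SHA-256 content fingerprint, the `score` field standing in for the query-independent metric, the `max_set_size` cap, and the "highest score wins" rule for picking the representative are all assumptions made for illustration, as are the names `Document` and `DuplicateIndex`.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    url: str
    content: str
    score: float  # hypothetical query-independent metric (assumed, higher is better)


@dataclass
class DuplicateIndex:
    """Groups crawled documents that share identical content (illustrative only)."""

    max_set_size: int = 3  # assumed cap on how many duplicates a set retains
    sets: dict[str, list[Document]] = field(default_factory=dict)

    @staticmethod
    def fingerprint(content: str) -> str:
        # Assumption: exact-content equality is approximated by a SHA-256 digest.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def add(self, doc: Document) -> Document:
        """Merge a newly crawled document into its duplicate set.

        Returns the representative document of the resulting set.
        """
        key = self.fingerprint(doc.content)
        existing = self.sets.get(key, [])  # set sharing the same content, if any
        # Merge: drop any stale entry for the same URL, then add the new document.
        merged = [d for d in existing if d.url != doc.url] + [doc]
        # Include or exclude duplicates based on the query-independent score.
        merged.sort(key=lambda d: d.score, reverse=True)
        kept = merged[: self.max_set_size]
        self.sets[key] = kept
        # Assumed "predefined condition": the highest-scoring document represents the set.
        return kept[0]
```

Calling `add` once per crawled document keeps each duplicate set bounded and always returns the current representative, so a caller could index only that document and skip the rest.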

Edits on 14 Jul, 2023

"Remove leading 0 from patent number"

Golden AI

edited on 14 Jul, 2023

Edits made to:

Infobox (+1/-1 properties)
