Common Crawl

Common Crawl is a nonprofit foundation dedicated to the open web.


Website: commoncrawl.org
Is a: Company, Organization

Company attributes

Industry: Generative AI, Open data, Web scraping, Artificial Intelligence (AI), Publishing, Semantic Web, Big data
Location: San Francisco
B2X: B2B
Founder: Gil Elbaz
Legal classification: 501(c)(3) organization
Number of Employees (Ranges): 1 – 101
Founded Date: 2007

Other attributes

Blog: commoncrawl.org/connect/blog/, commoncrawl.org/blog
Wikidata ID: Q12055316

Overview

The Common Crawl Foundation is a California 501(c)(3) registered nonprofit with the goal of democratizing access to web information by producing and maintaining a free, open repository of web crawl data that is universally accessible and analyzable. The Common Crawl corpus contains petabytes of data, consisting of web page data, metadata extracts, and text extracts. Common Crawl data is stored on Amazon Web Services' Public Data Sets and on multiple academic cloud platforms around the world. Users can access the corpus for free via Amazon's cloud platform or by downloading it, in whole or in part.
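
As an illustration of the download route, the sketch below (Python, using the requests library) fetches the list of WARC archives for a single monthly crawl over HTTPS. The crawl identifier CC-MAIN-2023-50 and the crawl-data/<crawl-id>/warc.paths.gz path layout are assumptions based on Common Crawl's published conventions rather than details taken from this profile.

import gzip

import requests

# Common Crawl's free HTTPS endpoint for the public corpus.
BASE_URL = "https://data.commoncrawl.org"

# Assumed crawl identifier; each monthly crawl publishes a warc.paths.gz file
# listing the relative paths of all WARC archives in that crawl.
CRAWL_ID = "CC-MAIN-2023-50"


def list_warc_paths(crawl_id, limit=5):
    """Return the first `limit` WARC file paths for the given crawl."""
    url = f"{BASE_URL}/crawl-data/{crawl_id}/warc.paths.gz"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    # The listing is gzip-compressed plain text, one relative path per line.
    paths = gzip.decompress(resp.content).decode("utf-8").splitlines()
    return paths[:limit]


if __name__ == "__main__":
    for path in list_warc_paths(CRAWL_ID):
        # Appending a listed path to BASE_URL gives a downloadable archive URL.
        print(f"{BASE_URL}/{path}")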

The Common Crawl Foundation was founded in 2007 by Gil Elbaz, who has been chairman since its inception. The foundation was started with the aim of democratizing data, allowing anyone to perform research and analysis. Common Crawl provides researchers, entrepreneurs, and developers with unrestricted access to data to explore, analyze, and create novel applications and services.

As of 2023, the Common Crawl repository contains over 240 billion pages spanning sixteen years, with between three and five billion new pages added each month. The organization's data has been cited in over 8,000 research papers. The foundation also regularly releases host- and domain-level web graphs of the crawl data and provides tools for users to construct and process web graphs themselves.

CCBot

The organization gathers data using CCBot, a Nutch-based web crawler built on the Apache Hadoop project. Common Crawl uses MapReduce to process its crawl database and extract crawl candidates. The candidate list is sorted by host (domain name) and then distributed to a set of crawler servers. CCBot can be prevented from crawling a website by including the following lines in the site's robots.txt file:

User-agent: CCBot
Disallow: /
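
To show how such a directive is interpreted, here is a minimal sketch using Python's standard urllib.robotparser to check whether a given URL may be fetched under the CCBot user agent; the example site is a placeholder, and this is not Common Crawl's own crawler code.

from urllib.robotparser import RobotFileParser

# Placeholder site; substitute any domain whose robots.txt you want to test.
SITE = "https://example.com"

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # download and parse the site's robots.txt

# If the file contains "User-agent: CCBot" followed by "Disallow: /",
# can_fetch() returns False and a compliant crawler skips the whole site.
allowed = parser.can_fetch("CCBot", f"{SITE}/some/page.html")
print(f"CCBot may crawl: {allowed}")
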
LLM training data

Training large language models (LLMs) requires large amounts of data, and many of the web-derived datasets used for this purpose originate from the Common Crawl repository. For example, OpenAI trained GPT-3 on a filtered version of Common Crawl, which made up 82% of the raw training tokens. Access to vast and diverse datasets through web crawlers enables LLMs to learn from a wide range of sources, producing more comprehensive and contextually aware models.
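
As a rough sketch of how text is typically pulled from the repository for language-model corpora, the snippet below streams a single WET (extracted-text) file with the open-source warcio library and yields plain-text records. The file URL is a hypothetical placeholder (real paths come from a crawl's wet.paths.gz listing), and production training pipelines add extensive filtering and deduplication on top of a step like this.

import requests
from warcio.archiveiterator import ArchiveIterator  # third-party: pip install warcio


def iter_wet_texts(wet_url, max_records=3):
    """Stream a Common Crawl WET file and yield (target_uri, text) pairs."""
    resp = requests.get(wet_url, stream=True, timeout=60)
    resp.raise_for_status()
    count = 0
    for record in ArchiveIterator(resp.raw):
        # In WET files, the extracted page text is stored in "conversion" records.
        if record.rec_type != "conversion":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        yield uri, text
        count += 1
        if count >= max_records:
            break


if __name__ == "__main__":
    # Hypothetical placeholder; take a real path from a crawl's wet.paths.gz listing.
    url = "https://data.commoncrawl.org/crawl-data/<crawl-id>/<segment>/wet/<file>.warc.wet.gz"
    for uri, text in iter_wet_texts(url):
        print(uri, len(text), "characters of extracted text")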

Timeline

No Timeline data yet.

Funding Rounds

Products

Acquisitions

SBIR/STTR Awards

Patents

Further Resources

No Further Resources data yet.
