Common Crawl

Common Crawl is a nonprofit foundation dedicated to the open web.


Website: commoncrawl.org
Is a: Company, Organization

Company attributes

Industry: Generative AI, Open data, Web scraping, Artificial Intelligence (AI), Publishing, Semantic Web, Big data
Location: San Francisco
B2X: B2B
Founder: Gil Elbaz
Legal classification: 501(c)(3) organization
Number of Employees (Ranges): 1 – 101
Founded Date: 2007

Other attributes

Blog: commoncrawl.org/connect/blog/, commoncrawl.org/blog
Wikidata ID: Q12055316

Overview

The Common Crawl Foundation is a California 501(c)(3) registered nonprofit with the goal of democratizing access to web information by producing and maintaining a free, open repository of web crawl data that is universally accessible and analyzable. The Common Crawl corpus contains petabytes of data, consisting of web page data, metadata extracts, and text extracts. Common Crawl data is stored on Amazon Web Services' Public Data Sets and on multiple academic cloud platforms around the world. Users can access the corpus for free via Amazon's cloud platform or by downloading it, in whole or in part.
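
As an illustration of the download route, the sketch below (Python, using the requests library) fetches the list of WARC archives for a single monthly crawl over HTTPS. The crawl identifier CC-MAIN-2023-50 and the crawl-data/<crawl-id>/warc.paths.gz path layout are assumptions based on Common Crawl's published conventions rather than details taken from this profile.

import gzip

import requests

# Common Crawl's free HTTPS endpoint for the public corpus.
BASE_URL = "https://data.commoncrawl.org"

# Assumed crawl identifier; each monthly crawl publishes a warc.paths.gz file
# listing the relative paths of all WARC archives in that crawl.
CRAWL_ID = "CC-MAIN-2023-50"


def list_warc_paths(crawl_id, limit=5):
    """Return the first `limit` WARC file paths for the given crawl."""
    url = f"{BASE_URL}/crawl-data/{crawl_id}/warc.paths.gz"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    # The listing is gzip-compressed plain text, one relative path per line.
    paths = gzip.decompress(resp.content).decode("utf-8").splitlines()
    return paths[:limit]


if __name__ == "__main__":
    for path in list_warc_paths(CRAWL_ID):
        # Appending a listed path to BASE_URL gives a downloadable archive URL.
        print(f"{BASE_URL}/{path}")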

The Common Crawl Foundation was founded in 2007 by Gil Elbaz, who has been chairman since its inception. The foundation was started with the aim of democratizing data, allowing anyone to perform research and analysis. Common Crawl provides researchers, entrepreneurs, and developers with unrestricted access to data to explore, analyze, and create novel applications and services.

As of 2023, the Common Crawl repository contains over 240 billion pages spanning sixteen years, with between three and five billion new pages added each month. The organization's data has been cited in over 8,000 research papers. The foundation also regularly releases host- and domain-level web graphs of the crawl data and provides tools for users to construct and process web graphs themselves.

CCBot

The organization gathers data using CCBot, a Nutch-based web crawler built on the Apache Hadoop project. Common Crawl uses MapReduce to process its crawl database and extract crawl candidates. The candidate list is sorted by host (domain name) and then distributed to a set of crawler servers. CCBot can be prevented from crawling a website by including the following lines in the site's robots.txt file:

User-agent: CCBot
Disallow: /
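
To show how such a directive is interpreted, here is a minimal sketch using Python's standard urllib.robotparser to check whether a given URL may be fetched under the CCBot user agent; the example site is a placeholder, and this is not Common Crawl's own crawler code.

from urllib.robotparser import RobotFileParser

# Placeholder site; substitute any domain whose robots.txt you want to test.
SITE = "https://example.com"

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # download and parse the site's robots.txt

# If the file contains "User-agent: CCBot" followed by "Disallow: /",
# can_fetch() returns False and a compliant crawler skips the whole site.
allowed = parser.can_fetch("CCBot", f"{SITE}/some/page.html")
print(f"CCBot may crawl: {allowed}")
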
LLM training data

Training large language models (LLMs) requires large amounts of data, and many of the web-derived datasets used for this purpose originate from the Common Crawl repository. For example, OpenAI trained GPT-3 on a filtered version of Common Crawl, which made up 82% of the raw training tokens. Access to vast and diverse datasets through web crawlers enables LLMs to learn from a wide range of sources, producing more comprehensive and contextually aware models.
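
As a rough sketch of how text is typically pulled from the repository for language-model corpora, the snippet below streams a single WET (extracted-text) file with the open-source warcio library and yields plain-text records. The file URL is a hypothetical placeholder (real paths come from a crawl's wet.paths.gz listing), and production training pipelines add extensive filtering and deduplication on top of a step like this.

import requests
from warcio.archiveiterator import ArchiveIterator  # third-party: pip install warcio


def iter_wet_texts(wet_url, max_records=3):
    """Stream a Common Crawl WET file and yield (target_uri, text) pairs."""
    resp = requests.get(wet_url, stream=True, timeout=60)
    resp.raise_for_status()
    count = 0
    for record in ArchiveIterator(resp.raw):
        # In WET files, the extracted page text is stored in "conversion" records.
        if record.rec_type != "conversion":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        yield uri, text
        count += 1
        if count >= max_records:
            break


if __name__ == "__main__":
    # Hypothetical placeholder; take a real path from a crawl's wet.paths.gz listing.
    url = "https://data.commoncrawl.org/crawl-data/<crawl-id>/<segment>/wet/<file>.warc.wet.gz"
    for uri, text in iter_wet_texts(url):
        print(uri, len(text), "characters of extracted text")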

Timeline

No Timeline data yet.

Funding Rounds

Products

Acquisitions

SBIR/STTR Awards

Patents

Further Resources

No Further Resources data yet.
