EleutherAI is a nonprofit AI research lab focusing on the interpretability and alignment of large AI models. EleutherAI has developed or had input on many publicly available models, including GPT‑J, GPT‑NeoX, BLOOM, VQGAN‑CLIP, Stable Diffusion, and OpenFold. The institute's models have been downloaded more than 25 million times, and its research has been published and presented at top machine learning and natural language processing conferences, such as NeurIPS, ACL, ICLR, and FAccT. EleutherAI's work is possible due to support from its donors and sponsors, such as CoreWeave, Hugging Face, Stability AI, Google TRC, Lambda Labs, and Nat Friedman.
The high costs and unusual skill set required to advance large-scale AI research mean that the field is dominated by a small number of large technology companies and start-ups. EleutherAI believes that the future of increasingly powerful AI models should not be restricted to a handful of companies developing them for profit, and that independent researchers should also be able to study them. EleutherAI's mission is defined by three statements:
- Advance research on interpretability and alignment of foundation models
- Ensure that the ability to study foundation models is not restricted to a handful of companies
- Educate people about the capabilities, limitations, and risks associated with these technologies
EleutherAI was founded in July 2020 by Connor Leahy, Sid Black, and Leo Gao. The research lab grew out of a Discord server discussing the release of OpenAI's large language model GPT-3. In March 2023, EleutherAI registered as a nonprofit research institute.
Originally, EleutherAI's work focused on training and releasing models to provide access to AI technologies and promote open science norms in natural language processing. With access to large-scale pre‑trained AI models becoming more widespread, the institute has shifted focus to researching AI interpretability and alignment. EleutherAI primarily operates through its public Discord server, coordinating research projects and discussing developments in the field. The research lab employs around two dozen full- and part-time research staff. These employees work alongside roughly another dozen regular volunteers and external collaborators. EleutherAI promotes an open and collaborative research model without strong differentiation between employees, volunteers, and collaborators.
EleutherAI started from a small group of AI enthusiasts on Shawn Presser's Discord server. In the summer of 2020, the server was discussing the recently released GPT-3 model from OpenAI. On July 2, 2020, Connor Leahy posted a paper about large-model training and suggested that the group try to build its own GPT-3-like model, stating:
Hey guys lets give OpenAI a run for their money like the good ol' days
Another user, Leo Gao, responded:
this but unironically
After discussing the project in the AI-related text channels of Shawn Presser's Discord server, Leahy, Gao, and fellow hobbyist Sid Black went on to form a new Discord server called "LibreAI" on July 7, 2020. Shortly after, on July 28, Leahy announced a new name, "EleutherAI," inspired by eleutheria, the Ancient Greek word for liberty. Leahy had access to Tensor Processing Units (TPUs) through Google's TPU Research Cloud (TRC) from a previous project. Leahy has stated that they didn't expect to get very far, but it was the height of the COVID-19 pandemic and they didn't have anything better to do. In a 2022 interview, Leahy described the start of EleutherAI:
It really was at first just a fun hobby project during lockdown times when we didn’t have anything better to do, but it quickly gained quite a bit of traction.
He described the founders' mindset as follows:
We consider ourselves descendants of the classic hacker culture of a few decades before, just in new fields, experimenting with technology out of curiosity and love of the challenge.
Initial research at EleutherAI focused on developing its own open-source version of OpenAI's GPT-3, called GPT-Neo. To use TPUs, the team had to work with an obscure library called Mesh TensorFlow. To train their models, the team began collecting their own large dataset called The Pile, which went live on New Year's Day 2021. The Pile is a free and publicly available 825GB dataset of diverse English text for language modeling. The next day, January 2, 2021, EleutherAI announced a collaboration with CoreWeave, a specialized cloud services provider for GPU-based workloads. With the CoreWeave partnership, EleutherAI researchers were freed from having to work with TPUs and TensorFlow, and they began work on a new codebase, with a new LLM, GPT-NeoX, following soon after.
On March 21, 2021, EleutherAI released its 1.3B and 2.7B GPT-Neo models as proof of concept. Trained on The Pile, the two models had been sitting in storage before being released. The release of these models drew attention to EleutherAI with articles in WIRED and other publications. While the code for GPT-NeoX could scale to 175B parameters and beyond, EleutherAI struggled to access the hardware they needed due to the global GPU shortage. While waiting, the researchers put their spare TPUs to work, training another model, GPT-J-6B, which was released on June 4, 2021.
In early 2022, EleutherAI released GPT-NeoX-20B, a 20-billion-parameter autoregressive English language model trained on the Pile. At the time of its release, it was the largest publicly available language model in the world.
As part of a two-and-a-half-year retrospective published on March 2, 2023, EleutherAI announced that it was forming a nonprofit research institute. Over the years, many contributors to the Discord server had to move on to focus on jobs or start their own companies. Funded by a mix of charitable donations and grants, the new nonprofit allows EleutherAI to employ over twenty of its regular contributors to work full-time on research. The organization plans to remain true to its open values, organizing itself through the public Discord server. The new institute will be run by Stella Biderman (Head of Research, Executive Director), Curtis Huebner (Head of Alignment), and Shivanshu Purohit (Head of Engineering), with guidance from a board of directors that includes founder Connor Leahy, University of North Carolina (UNC) Assistant Professor Colin Raffel, and Stability AI CEO Emad Mostaque.
EleutherAI's research aims to enable broader participation in AI research, using open science to increase transparency and reduce the potential harms of emerging AI technologies. As of early March 2023, EleutherAI members have authored twenty-eight papers.
EleutherAI's main research focus is on language models. The research lab has trained and released multiple series of large language models and the codebases used to train them. These models have gone on to be used in open-source research applications.
EleutherAI researchers work to understand the behavior of AI systems in order to predict or modify future models, ensure that systems are optimized for their intended metrics, and establish that they can be trusted.
EleutherAI performs research on the alignment of AI systems, studying how models fail and how to develop systems that behave more robustly and perform as intended.
While EleutherAI's main research focus is language models, the research lab also works on other modalities, including image and audio data.
The Pile is an 825 GB open-source language modeling dataset of diverse text. It consists of twenty-two smaller datasets, many of which are from academic or professional sources.
Evaluations find that models trained on the Pile show moderate improvements on traditional language modeling benchmarks and significant improvements in Pile BPB (bits per byte), where lower is better. Pile BPB serves as a benchmark of general, cross-domain text modeling ability for large language models, reflecting world knowledge and reasoning ability across numerous domains. To score well on Pile BPB, a model must understand disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. The Pile is known to contain profanity, lewd content, and other forms of abrasive language.
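As a rough illustration of how the metric relates to the usual training loss: per-token cross-entropy (in nats) can be converted to bits per byte by totaling the nats, converting to bits, and dividing by the byte count of the evaluated text. The token, byte, and loss figures below are made up purely for illustration:

```python
import math

def bits_per_byte(loss_per_token_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits per byte.

    Total nats = loss * n_tokens; divide by ln(2) to get bits,
    then divide by the number of UTF-8 bytes in the evaluated text.
    """
    total_bits = loss_per_token_nats * n_tokens / math.log(2)
    return total_bits / n_bytes

# Hypothetical numbers: 1M tokens of text at ~4.3 bytes per token,
# and a mean cross-entropy of 2.0 nats per token.
bpb = bits_per_byte(2.0, 1_000_000, 4_300_000)
print(round(bpb, 3))  # → 0.671
```

Because the denominator is raw bytes rather than tokens, BPB is comparable across models with different tokenizers, which is part of why it works as a cross-model benchmark.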
GPT-Neo is a series of LLMs trained on the Pile. GPT-Neo was EleutherAI's first attempt to produce a language model similar to GPT-3. In the initial release (March 21, 2021), EleutherAI published three decoder-only LLMs in 125M, 1.3B, and 2.7B parameter variants. The models were designed using EleutherAI's replication of the GPT-3 architecture. The 2.7B-parameter model was trained as an autoregressive (causally masked) language model for 420 billion tokens over 400,000 steps, using cross-entropy loss. The model learns a representation of the English language that can be used to extract features for downstream tasks. It was pre-trained to predict the next token, which allows it to generate text from a prompt. The model can produce offensive content without warning.
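The next-token objective described above can be sketched with toy numbers; the left-shift of the targets mirrors standard causal language model training, while the "model outputs" here are just random logits rather than anything produced by GPT-Neo:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 10, 5
tokens = rng.integers(0, vocab_size, size=seq_len)   # toy token ids
logits = rng.normal(size=(seq_len, vocab_size))      # toy "model outputs"

# Autoregressive training: the prediction at position t is scored
# against the token at position t + 1, so the targets are the input
# sequence shifted left by one.
shift_logits = logits[:-1]
shift_targets = tokens[1:]

# Log-softmax over the vocabulary, then mean negative log-likelihood.
log_probs = shift_logits - np.log(np.exp(shift_logits).sum(-1, keepdims=True))
loss = -log_probs[np.arange(seq_len - 1), shift_targets].mean()
print(f"cross-entropy: {loss:.3f} nats per token")
```

At generation time the same model is simply sampled one token at a time, feeding each prediction back in as input.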
At the same time as the release of the three GPT-Neo models, EleutherAI also released GPT-Neo Library—a library for training language models written in Mesh TensorFlow. It was used to train the GPT-Neo models but has since been replaced by the GPT-NeoX Library.
GPT-J is another publicly available GPT-3-style language model, released by EleutherAI on June 4, 2021. It is a six-billion-parameter transformer model trained using Ben Wang's Mesh Transformer JAX, a TPU-based library. Upon release, GPT-J was the largest publicly available GPT-3-style LLM in the world. The model consists of twenty-eight layers with a model dimension of 4096 and a feedforward dimension of 16384. It uses a tokenization vocabulary of 50257, with the same Byte-Pair Encoding (BPE) tokenizer as GPT-2 and GPT-3. GPT-J is not intended for use without fine-tuning, supervision, or moderation; it is not a product in and of itself, and EleutherAI states it should not be used for human-facing interactions. The model can generate harmful or offensive text. GPT-J-6B was trained on the Pile for 402 billion tokens over 383,500 steps.
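The stated dimensions are enough for a back-of-the-envelope check of the six-billion-parameter figure. The sketch below ignores biases and layer norms and assumes untied input and output embedding matrices; those simplifications are assumptions beyond what the text states:

```python
# Rough parameter count for GPT-J-6B from the dimensions stated above.
d_model, d_ff, n_layers, vocab = 4096, 16384, 28, 50257

attn = 4 * d_model * d_model        # Q, K, V, and output projections
ff = 2 * d_model * d_ff             # feedforward up- and down-projections
per_layer = attn + ff               # ~201M parameters per layer
embeddings = 2 * vocab * d_model    # input embedding + output (LM) head

total = n_layers * per_layer + embeddings
print(f"{total / 1e9:.2f}B parameters")  # → 6.05B
```

The estimate lands close to six billion, consistent with the model's name.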
CARP is a model, similar to OpenAI's CLIP, trained on text-critique pairs with the goal of learning the relationship between a passage of text and natural language feedback on that passage. EleutherAI released CARP on October 6, 2021. CARP provides a scalable method for performing zero-shot evaluation of stories and other passages.
First released on April 3, 2021, VQGAN-CLIP is a method for cheap text-to-image synthesis using pre-trained CLIP and VQGAN models. The paper describing VQGAN-CLIP was not released until April 2022; it details the method's approach to producing images of high visual quality from text prompts using a multimodal encoder to guide generation.
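The core idea, optimizing a latent so the generated image's embedding matches the prompt's embedding, can be illustrated with a toy stand-in. Here random linear maps replace the frozen VQGAN decoder and CLIP encoders (an illustrative assumption, not the real networks), and gradient ascent pushes the "image" embedding toward the "prompt" embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
# "encode" plays the role of decoding a latent with VQGAN and embedding
# the result with CLIP's image encoder; "target" plays the role of
# CLIP's embedding of the text prompt. Both are random toy stand-ins.
encode = rng.normal(size=(64, 16)) / 8.0
target = rng.normal(size=64)
target /= np.linalg.norm(target)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

z = rng.normal(size=16)                 # the latent being optimized
start = cosine(encode @ z, target)

lr = 0.1
for _ in range(300):
    e = encode @ z
    n = np.linalg.norm(e)
    grad_e = target / n - (e @ target) * e / n**3   # d cos(e, target) / d e
    z += lr * (encode.T @ grad_e)                   # gradient ascent on cosine

end = cosine(encode @ z, target)
print(f"cosine similarity: {start:.3f} -> {end:.3f}")
```

The real method works the same way in spirit: the VQGAN and CLIP weights stay frozen, and only the latent is updated, which is what makes the approach cheap.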
GPT-NeoX-20B is an open-source 20-billion-parameter autoregressive language model trained on the Pile using the GPT-NeoX library, a library for efficiently training large language models with tens of billions of parameters in a multi-machine distributed context. The model's architecture is almost identical to that of GPT-J. At the time of release, EleutherAI believed it to be the largest publicly accessible pre-trained general-purpose autoregressive language model. GPT-NeoX-20B was developed primarily for research purposes; like GPT-J, it is not intended as a product to be used for human-facing interactions.
CLOOB-Conditioned Latent Diffusion (CCLD) is a text-to-image model that can be trained without captioned images. The model was released by EleutherAI on December 15, 2022. CCLD takes a similar approach to CLIP-conditioned diffusion with a few key differences. CCLD is targeted at hobbyists, academics, and newcomers as the model is easy to set up and has low fine-tuning/training costs.
Pythia is an ongoing project at EleutherAI that combines interpretability analysis and scaling laws to understand how knowledge develops during the training of autoregressive transformers. Pythia is a suite of 16 models, each with 154 partially trained checkpoints, designed to enable controlled scientific research on openly accessible and transparently trained large language models.
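A sketch of the checkpoint schedule behind that count, as described in the Pythia paper (the specific steps are an assumption beyond this article, not stated in the text above): step 0, log-spaced steps up to 512, then a checkpoint every 1,000 steps to the final step 143,000:

```python
# Pythia checkpoint schedule (per the Pythia paper; hedged assumption):
# step 0, then log-spaced steps 1, 2, 4, ..., 512, then every 1,000
# steps from 1,000 through the final training step, 143,000.
log_spaced = [0] + [2**i for i in range(10)]   # 11 checkpoints
linear = list(range(1000, 143_001, 1000))      # 143 checkpoints
checkpoints = log_spaced + linear
print(len(checkpoints))  # → 154
```

The dense early checkpoints let researchers study how capabilities emerge in the first steps of training, when models change fastest.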