Mixture of experts (MoE) is a machine learning technique where multiple models, or experts, are trained to specialize in different parts of the input space.
Each expert makes a prediction based on the input, and these predictions are combined to produce the final output based on their confidence levels. With an MoE approach, the input space is partitioned into multiple regions, each handled by a different expert trained to specialize in that region. A gating network determines the weight given to each expert's prediction, allowing the model to leverage the strengths of each expert with the aim of improving overall performance.
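To make the gating idea concrete, here is a minimal sketch of a dense mixture of experts written in PyTorch. The class name, the use of small feed-forward experts and the layer sizes are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn


class MixtureOfExperts(nn.Module):
    """Dense MoE: every expert predicts on every input, and a gating
    network decides how much each expert's prediction contributes."""

    def __init__(self, input_dim: int, output_dim: int, num_experts: int, hidden_dim: int = 64):
        super().__init__()
        # Each expert is a small feed-forward network (an illustrative choice;
        # an expert can be any model that maps inputs to predictions).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, output_dim),
            )
            for _ in range(num_experts)
        ])
        # The gating network produces one score per expert for each input.
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gating weights sum to 1 across experts for each input.
        weights = torch.softmax(self.gate(x), dim=-1)        # (batch, num_experts)
        # Stack expert predictions: (batch, num_experts, output_dim).
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=1)
        # The final output is the weighted sum of the expert predictions.
        return torch.einsum("be,beo->bo", weights, expert_outputs)


# Hypothetical usage with made-up dimensions.
moe = MixtureOfExperts(input_dim=16, output_dim=4, num_experts=8)
y = moe(torch.randn(32, 16))  # -> shape (32, 4)
```

In this dense formulation every expert processes every input; large modern MoEs instead route each input to only a few experts, which is where the efficiency gains discussed below come from.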
MoE models can capture a wide range of patterns and relationships, making them particularly effective when the input space is large and complex. Typical applications of MoE models include image recognition, natural language processing, and recommendation systems.
One of the most important factors determining a model's quality is its scale. For a fixed computing budget, it is better to train a larger model for fewer steps than to train a smaller model for more steps. MoE enables artificial intelligence (AI) models to be pretrained with less compute, so the model or dataset can be scaled up within the same compute budget as a dense model. MoE also offers faster inference than a dense model with the same total number of parameters.
In the context of transformer models, MoE consists of two main elements: sparse MoE layers, which replace the dense feed-forward network layers, and a gate network, or router, that determines which tokens are sent to which expert.

MoEs have challenges with fine-tuning, which can lead to overfitting, and they require high VRAM (video RAM) because all experts must be loaded in memory.
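The sketch below shows how this can look inside a transformer block. It is a simplified illustration under assumed details (the names, dimensions and plain top-k routing loop are mine), not any particular model's implementation: a router picks the top-k expert feed-forward networks for each token, so only a fraction of the layer's parameters is active per token even though all experts sit in memory.

```python
import torch
import torch.nn as nn


class SparseMoELayer(nn.Module):
    """Sparse MoE layer for a transformer: a router selects the top-k
    experts per token, so only those experts' parameters do work."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Expert FFNs stand in for the usual dense feed-forward sublayer.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # Router: one logit per expert for each token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); batch and sequence dims are flattened for simplicity.
        logits = self.router(x)                                  # (tokens, experts)
        top_vals, top_idx = torch.topk(logits, self.top_k, dim=-1)
        top_weights = torch.softmax(top_vals, dim=-1)            # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Each expert processes only the tokens routed to it.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(10, 32)                  # 10 tokens with a hypothetical d_model of 32
layer = SparseMoELayer(d_model=32, d_ff=128)
print(layer(tokens).shape)                    # torch.Size([10, 32])
```

Because only top_k of the num_experts feed-forward networks run for any given token, the compute per token approaches that of a much smaller dense model, while the memory footprint still covers every expert, which is the VRAM cost noted above.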
MoEs date back to the 1991 paper, "Adaptive Mixtures of Local Experts," by Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. The original idea was to have a supervised procedure for a system composed of separate networks, each handling a different subset of the training cases. Between 2010 and 2015, two different research areas contributed to later MoE advancement.
This work led to the exploration of MoE for natural language processing, with Shazeer et al. publishing the paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" in 2017, in which they scaled the idea to a 137-billion-parameter LSTM (long short-term memory) model by introducing sparsity.
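The sparsity mechanism at the heart of that paper is noisy top-k gating. The rough sketch below conveys the idea only; the parameter names are mine, and details such as the load-balancing auxiliary losses are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyTopKGate(nn.Module):
    """Rough sketch of noisy top-k gating: add learned, input-dependent noise
    to the routing logits, keep only the k largest, and softmax over those
    survivors so most experts receive exactly zero weight (sparsity)."""

    def __init__(self, d_model: int, num_experts: int, k: int = 4):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        clean_logits = self.w_gate(x)
        # Input-dependent Gaussian noise encourages exploration during training.
        noise_std = F.softplus(self.w_noise(x))
        noisy_logits = clean_logits + torch.randn_like(clean_logits) * noise_std
        # Keep the k largest logits per input; everything below the k-th value
        # becomes -inf, so the softmax assigns it a weight of exactly zero.
        top_vals, _ = torch.topk(noisy_logits, self.k, dim=-1)
        threshold = top_vals[..., -1, None]
        masked = noisy_logits.masked_fill(noisy_logits < threshold, float("-inf"))
        return torch.softmax(masked, dim=-1)  # sparse gate weights, one row per input


gate = NoisyTopKGate(d_model=32, num_experts=16, k=4)
weights = gate(torch.randn(5, 32))
print((weights > 0).sum(dim=-1))              # about k nonzero weights per input
```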
Several open-source projects are available for training MoEs, and a number of open-access MoE models have been released.