What Is Mixture of Experts?
MoE architectures power many of today's strongest AI models, including Mixtral and, reportedly, GPT-4. For practitioners, this means larger models with better performance at reasonable cost. Understanding why a model can run efficiently despite an enormous parameter count helps you evaluate AI solutions and choose models suited to your requirements.
Mixture of Experts (MoE) is an architectural principle for neural networks that makes scaling large language models more efficient. Instead of using a single monolithic network, an MoE model consists of many specialized sub-networks, the "experts." A router network (also called a gating network) decides for each input token which few experts are activated, typically one or two out of eight or more. The rest remain inactive.
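To make the routing concrete, here is a minimal sketch of a top-k MoE layer in PyTorch. All names, sizes, and the simple loop-based dispatch are illustrative assumptions, not any particular model's implementation; production systems batch this dispatch far more efficiently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts layer (simplified sketch)."""

    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = self.router(x)                      # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)         # renormalize their weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    # Only the selected experts ever run; the rest stay idle.
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

layer = MoELayer(d_model=64)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

The key point the sketch shows: every token passes through the router, but only k of the n_experts feed-forward blocks actually execute for it, which is where the compute savings come from.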
This principle enables models with enormous total capacity at moderate computational cost. Google's Switch Transformer scaled to 1.6 trillion parameters while activating only a single expert per token. Mixtral by Mistral AI activates 2 of 8 experts per layer. GPT-4 reportedly also uses an MoE architecture. During training, the experts automatically specialize in different task types or knowledge domains.
For businesses, MoE means larger models become economically viable. An MoE model with 100 billion total parameters can serve each request with roughly the compute of a 20-billion-parameter dense model, because only a fraction of the parameters are used per token; note that all parameters must still fit in memory. Combined with state space models, hybrid architectures such as MoE-Mamba and Jamba emerge, pushing speed and cost efficiency further.
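A quick back-of-the-envelope calculation shows where that gap between total and active parameters comes from. The split between shared and expert parameters below is hypothetical, chosen only to match the 100B-total / 20B-active scenario above:

```python
# Hypothetical MoE parameter budget (illustrative numbers, no real model).
shared = 4e9            # parameters every token uses (attention, embeddings)
expert = 8e9            # parameters per expert
n_experts, k = 12, 2    # experts available vs. experts activated per token

total = shared + n_experts * expert   # what must fit in memory
active = shared + k * expert          # what is computed per token

print(f"total:  {total / 1e9:.0f}B parameters")   # total:  100B parameters
print(f"active: {active / 1e9:.0f}B parameters")  # active: 20B parameters
```

So the model stores 100 billion parameters of knowledge, but each request pays the compute bill of a 20-billion-parameter model.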