Mixture of Experts (MoE) Explained: How Sparse AI Models Work

  • A Mixture of Experts (MoE) model replaces the dense feed-forward layer of a transformer with many smaller “expert” sub-networks plus a router that activates only a few experts per token.
  • This is sparse activation: the model can hold a very large total parameter count, but each token only triggers a small fraction of it, so inference cost stays low.
  • Mixtral 8x7B (Mistral AI; released December 2023, paper published January 2024) has roughly 47B total parameters but uses only about 13B per token, routing each token to 2 of 8 experts.
  • Training MoE models requires load balancing so the router does not overload a few favourite experts and starve the rest — usually enforced with an auxiliary loss.
  • MoE is the architecture behind several frontier systems, including Mixtral and DeepSeek-V3/R1, as well as widely reported (though unconfirmed) MoE designs at the largest labs, because it decouples capacity from per-token compute.

What is a Mixture of Experts?

A Mixture of Experts is a neural-network design where a layer is split into many parallel sub-networks called experts, and a small router network decides which experts handle each input. Only the chosen experts run, so the model can have enormous total capacity while keeping the compute spent on any single token small. It has become one of the most widely used ways to scale modern language models without a matching rise in per-token compute.

The idea is decades old: the term traces to the 1991 paper “Adaptive Mixtures of Local Experts” by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton. It became central to large language models when researchers used it to replace the dense feed-forward block inside a transformer. To understand where that block sits, see our explainer on transformers. In a standard transformer, every token passes through the same large feed-forward network. In an MoE transformer, that single network is swapped for a set of smaller expert networks, and a router picks a handful of them per token.

Dense vs sparse

In a dense model, 100% of the parameters are used for every token — total parameters and active parameters are the same number. In a sparse MoE model they diverge: a model might hold hundreds of billions of total parameters but activate only tens of billions per token. Google’s Switch Transformer paper (Fedus, Zoph, and Shazeer, 2021) pushed this to an extreme, scaling to over a trillion parameters while keeping per-token compute roughly constant, and reported up to a 7x pre-training speedup over a dense baseline at equal compute.

The router (gating network)

The router, also called the gating network, is a small learned layer that scores every expert for a given token and selects the top few to run. It is the brain of an MoE model: its job is to send each token to the experts most likely to handle it well. The router is trained jointly with the experts, so routing patterns emerge from data rather than being hand-designed.

Concretely, the router computes a weight for each expert (typically via a softmax over a linear projection of the token), keeps the top-k highest-scoring experts, runs only those, and combines their outputs weighted by the router’s scores. Mixtral uses top-2 routing over 8 experts at every layer; the Switch Transformer simplified this further to top-1 routing to cut communication overhead. According to the Mixtral paper, “for every token, at each layer, a router network selects two experts to process the current state and combine their outputs,” and the selected pair can differ at every layer and every position.
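
The routing steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Mixtral's actual implementation: the names `route_top_k` and `moe_layer`, the toy router weights, and the toy "experts" (which just scale the token) are all assumptions made for the example. The softmax is taken over the selected top-k scores only, which gives the same weights as a softmax over all experts followed by renormalising the top-k.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token, router_weights, k=2):
    """Score every expert with a linear projection, keep the k best.

    Returns a list of (expert_index, gate_weight) pairs whose weights sum to 1.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gate = softmax([logits[i] for i in top])
    return list(zip(top, gate))

def moe_layer(token, router_weights, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, g in route_top_k(token, router_weights, k):
        y = experts[idx](token)  # the unselected experts are never called
        out = [o + g * v for o, v in zip(out, y)]
    return out

# Toy setup (hypothetical): 4 experts, 2-dimensional tokens.
# Expert s simply multiplies the token by s.
router_weights = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
experts = [lambda x, s=s: [s * v for v in x] for s in range(4)]
token = [1.0, 2.0]

chosen = route_top_k(token, router_weights)
output = moe_layer(token, router_weights, experts)
print(chosen)
print(output)
```

In this toy setup the router scores are 0, 1, 3 and 2, so experts 2 and 3 are selected and the other two never run. A real layer would batch this over many tokens and use learned weights.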

Why top-k and not all experts

Running only the top-k experts is the entire point: it keeps the expert FLOPs per token fixed no matter how many experts exist (only the small router projection grows with the expert count). Adding more experts grows the model’s knowledge capacity without meaningfully growing the cost of a forward pass. This is why MoE is so attractive for serving, and why it pairs naturally with the techniques in our AI inference optimization guide, since the active-parameter count is what drives latency and GPU memory bandwidth during generation.
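
A rough back-of-the-envelope calculation makes the point concrete. The sketch below assumes each expert is a gated feed-forward block with three weight matrices and counts 2 FLOPs per multiply-add; the layer sizes are illustrative values chosen for the example, and the function name is hypothetical.

```python
def moe_ffn_flops_per_token(d_model, d_hidden, num_experts, top_k):
    """Approximate FLOPs one token spends in a single MoE feed-forward layer.

    Assumes each expert holds three d_model x d_hidden matrices and that
    one multiply-add costs 2 FLOPs. Returns (expert_flops, router_flops).
    """
    expert_flops = top_k * 3 * 2 * d_model * d_hidden
    router_flops = 2 * d_model * num_experts  # one score per expert
    return expert_flops, router_flops

# Same top-2 routing, 8 experts versus 64 experts (illustrative sizes).
few_experts = moe_ffn_flops_per_token(4096, 14336, num_experts=8, top_k=2)
many_experts = moe_ffn_flops_per_token(4096, 14336, num_experts=64, top_k=2)

print("8 experts :", few_experts)
print("64 experts:", many_experts)
```

The expert cost is identical in both cases because it depends only on top-k. The router cost does rise with the number of experts, but under these assumptions it stays well below a tenth of a percent of the expert cost.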

Experts: what they actually learn

The experts in an MoE layer are themselves small feed-forward networks, identical in shape, that specialise during training. There is a common myth that one expert “becomes the biology expert” and another “the coding expert.” In practice, specialisation is far more subtle and token-level — experts tend to capture syntactic or positional patterns rather than clean human-readable topics.

The Mixtral authors investigated this directly and found that routing showed structure tied to syntax and token position more than to high-level domains. They reported that “the router does exhibit some structured syntactic behaviour,” with consecutive tokens often routed to the same experts, but did not find obvious topic specialisation. So when a model is called “8x7B,” that does not mean eight stand-alone 7B models bolted together — the experts share attention layers and embeddings, and only the feed-forward blocks are replicated.

Counting parameters: why 8x7B is not 56B

Mixtral 8x7B has roughly 47B total parameters, not 56B, because only the feed-forward experts are duplicated while attention and embedding layers are shared across all eight. Of those 47B, only about 13B are active per token under top-2 routing. This gap — 47B of capacity at 13B of compute — is the headline number that makes MoE compelling, and Mistral reported that Mixtral matched or beat the dense 70B Llama 2 and GPT-3.5 across benchmarks while activating far fewer parameters.
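
The 47B and 13B figures can be approximately reproduced by counting weights. The hyperparameters below are not stated in this article; they are assumed from Mixtral's published configuration, and the count also assumes no bias terms, three weight matrices per expert, grouped-query attention, and separate input and output embedding tables. Treat it as a sanity check rather than an official breakdown.

```python
# Assumed Mixtral 8x7B hyperparameters (from the published config, not this article).
D_MODEL = 4096      # hidden size
N_LAYERS = 32       # transformer layers
D_HIDDEN = 14336    # expert feed-forward width
N_HEADS = 32        # query heads
N_KV_HEADS = 8      # key/value heads (grouped-query attention)
HEAD_DIM = 128
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2

def count_params(experts_counted):
    """Count weights, including `experts_counted` experts per layer."""
    # Query and output projections, plus the narrower key and value projections.
    attn = 2 * D_MODEL * N_HEADS * HEAD_DIM + 2 * D_MODEL * N_KV_HEADS * HEAD_DIM
    expert = 3 * D_MODEL * D_HIDDEN   # three matrices per expert
    router = D_MODEL * N_EXPERTS      # one score per expert
    norms = 2 * D_MODEL               # two normalisation layers per block
    per_layer = attn + router + norms + experts_counted * expert
    embeddings = 2 * VOCAB * D_MODEL  # input embeddings + output head
    final_norm = D_MODEL
    return N_LAYERS * per_layer + embeddings + final_norm

total_params = count_params(N_EXPERTS)  # every expert stored
active_params = count_params(TOP_K)     # only the routed experts used

print(f"total  ~ {total_params / 1e9:.1f}B")
print(f"active ~ {active_params / 1e9:.1f}B")
```

Under these assumptions the count comes to about 46.7B total and 12.9B active, which is consistent with the rounded 47B and 13B figures. Almost all of the total sits in the experts; the shared attention, embedding, and router weights add up to well under 2B.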

Load balancing: the hard part

Load balancing is the central training challenge in MoE: left unchecked, the router learns to favour a few experts, which then get all the training signal and improve further, starving the rest in a runaway feedback loop. The fix is an auxiliary load-balancing loss that penalises uneven expert usage, nudging the router to spread tokens evenly across the available experts.
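
One common form of this auxiliary loss is the one used in the Switch Transformer: for each expert, multiply the fraction of tokens it received by the average router probability it was given, sum over experts, and scale by the number of experts. The sketch below follows that formulation for top-1 routing; the function name and the toy inputs are assumptions for illustration, and other systems use different variants.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss (before its tunable weight).

    router_probs: one probability distribution over experts per token.
    assignments:  the expert index each token was actually sent to.
    Perfectly uniform routing gives 1.0; concentrating tokens raises it.
    """
    num_tokens = len(assignments)
    frac_tokens = [assignments.count(e) / num_tokens for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in router_probs) / num_tokens
                 for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

# Balanced toy batch: 4 tokens split evenly across 2 experts.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)

# Collapsed toy batch: the router sends everything to expert 0.
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)

print(balanced, collapsed)
```

In training, this value would be multiplied by a small coefficient and added to the language-modelling loss, so the gradient pushes router probabilities away from over-used experts.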

Most MoE systems add this auxiliary loss alongside the main language-modelling objective, often with a tunable weight, so that experts receive roughly equal numbers of tokens within each batch. Implementations also set an expert capacity, a cap on how many tokens any one expert will accept per batch, and “drop” overflow tokens (or route them to a fallback) to keep computation balanced across GPUs. DeepSeek-V3 instead uses an auxiliary-loss-free balancing strategy that adjusts per-expert routing biases dynamically, aiming to avoid the small quality penalty that the auxiliary loss can introduce.
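
Expert capacity and token dropping can be illustrated with a short sketch. It assumes the common convention that capacity equals the even share of tokens per expert multiplied by a capacity factor, and that tokens are admitted in order until an expert is full; the function name and the default factor are hypothetical, and real frameworks differ in how they handle overflow.

```python
import math

def dispatch_with_capacity(assignments, num_experts, capacity_factor=1.25):
    """Assign tokens to experts, dropping any that exceed an expert's capacity.

    assignments: the expert index chosen by the router for each token.
    Returns (capacity, buckets, dropped), where buckets maps each expert to
    the token indices it accepted and dropped lists the overflow tokens.
    """
    capacity = math.ceil(len(assignments) / num_experts * capacity_factor)
    buckets = {e: [] for e in range(num_experts)}
    dropped = []
    for token_idx, expert in enumerate(assignments):
        if len(buckets[expert]) < capacity:
            buckets[expert].append(token_idx)
        else:
            dropped.append(token_idx)  # overflow: skipped or sent to a fallback
    return capacity, buckets, dropped

# 8 tokens, 2 experts, and a router that strongly prefers expert 0.
capacity, buckets, dropped = dispatch_with_capacity(
    [0, 0, 0, 0, 1, 0, 1, 0], num_experts=2, capacity_factor=1.0
)
print(capacity, buckets, dropped)
```

With a capacity factor of 1.0 each expert accepts at most 4 of the 8 tokens, so the last two tokens routed to expert 0 overflow. A larger factor drops fewer tokens at the cost of more padding and memory.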

Why balancing matters for hardware

Beyond model quality, balancing is a systems problem. Experts are usually sharded across many GPUs, so an imbalanced router creates stragglers — some GPUs sit idle while one is overloaded. Good balancing keeps every device busy, which is why MoE training frameworks treat capacity factors and communication patterns as first-class concerns. The same sparsity that helps at inference can complicate distributed training, since tokens must be shuffled to wherever their chosen experts live.
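
The straggler effect is easy to quantify under a simplifying assumption: if every GPU hosts one expert and a step finishes only when the busiest GPU is done, then average utilisation is the mean load divided by the maximum load. The function name and token counts below are illustrative only.

```python
def gpu_utilisation(tokens_per_gpu):
    """Fraction of time GPUs are busy if the step waits for the busiest one."""
    mean_load = sum(tokens_per_gpu) / len(tokens_per_gpu)
    return mean_load / max(tokens_per_gpu)

even = gpu_utilisation([100, 100, 100, 100])   # balanced routing
skewed = gpu_utilisation([250, 50, 50, 50])    # one overloaded expert

print(even, skewed)
```

Both batches contain 400 tokens, but in the skewed case the step takes 2.5 times longer and the GPUs are busy only 40% of the time on average.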

Real systems and the trade-offs

MoE is now mainstream at the frontier. Mixtral 8x7B (released in December 2023, with the paper following in January 2024) was one of the first widely accessible open-weight MoE models and demonstrated that sparse models could match much larger dense ones. DeepSeek-V3 and the DeepSeek-R1 reasoning model use a large MoE backbone with 671B total parameters but only about 37B active per token, and GPT-4 was widely reported, though never officially confirmed, to use an MoE design.

The trade-offs are real. MoE models use far more memory than their active-parameter count suggests, because all experts must be resident in GPU memory even though most are idle for any given token. This makes them cheaper to run per token but more expensive to host. They are also harder to train stably and to fine-tune. For teams that need a small footprint rather than raw capacity, a compact dense model — or a distilled one, as covered in our model distillation explainer — is often the better fit. MoE wins when you want maximum capability at a fixed serving budget; it loses when memory or simplicity is the binding constraint.
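
The hosting cost can be estimated from the parameter counts quoted earlier. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations and the key-value cache; the function name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

moe_resident_gb = weight_memory_gb(47_000_000_000)  # all experts must be loaded
active_only_gb = weight_memory_gb(13_000_000_000)   # what one token actually touches

print(moe_resident_gb, active_only_gb)
```

Under these assumptions a Mixtral-sized model needs roughly 94 GB for its weights even though each token only exercises about 26 GB of them, which is the gap between serving compute and hosting cost.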

Frequently asked questions

Does MoE make a model faster or just bigger?
Both, in a specific sense. For a fixed quality target, MoE reaches it with fewer active parameters per token than a dense model, so generation is cheaper in compute. For a fixed compute budget, MoE lets you pack in far more total parameters and thus more capability. What it does not reduce is memory: every expert must sit in GPU memory even when unused, so MoE models are light on per-token compute but heavy on memory, which shapes how and where they get deployed.

Is “8x7B” the same as eight separate 7-billion-parameter models?
No. In Mixtral 8x7B only the feed-forward expert blocks are replicated eight times; the attention layers, embeddings, and other components are shared across all experts. That is why the total is around 47B rather than 56B, and why only about 13B parameters are active per token. The experts also do not map to clean human topics: the Mixtral authors found that routing correlates more with syntax and token position than with subject matter.

Why do labs use MoE instead of just training a bigger dense model?
Because MoE decouples model capacity from per-token compute. A dense model that doubles its parameters roughly doubles the cost of every forward pass; an MoE model can double total parameters while keeping active parameters, and therefore inference FLOPs, roughly constant. At frontier scale, where serving cost dominates, that decoupling is decisive. The price is higher memory use, trickier training, and load-balancing machinery, which is why MoE is a deliberate engineering choice rather than a free win.

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.