Sparse Expert Architectures

Dense transformers activate every parameter for every token. As models become larger, this approach becomes increasingly expensive. A trillion-parameter dense model would require enormous compute for every forward pass, even if only part of the model is needed for a given token.

Sparse expert architectures address this problem by activating only a subset of parameters for each token. The most common form is the Mixture-of-Experts transformer, usually abbreviated as MoE.

In an MoE transformer, most transformer layers remain shared, but some feedforward layers are replaced by collections of specialized subnetworks called experts. A routing mechanism selects which experts process each token.

This allows the total parameter count to grow without a proportional increase in compute per token.

Dense Versus Sparse Computation

In a standard transformer feedforward layer:

$$ \text{FFN}(x)=W_2\phi(W_1x+b_1)+b_2. $$

Every token uses the same parameters.

In a sparse expert layer, we instead have multiple feedforward networks:

$$ E_1,E_2,\ldots,E_N. $$

A router chooses a subset of experts for each token.

If token $x_t$ uses experts $i$ and $j$, the computation becomes

$$ y_t = g_iE_i(x_t)+g_jE_j(x_t), $$

where $g_i$ and $g_j$ are routing weights.

Only the selected experts are evaluated.

Why Sparse Experts Matter

Sparse experts separate total model capacity from active compute.

Compare the parameters each model type uses per token:

| Model type | Parameters used per token |
|---|---|
| Dense transformer | All parameters |
| Sparse MoE | Small subset of parameters |

An MoE model may contain hundreds of billions of parameters while activating only a small fraction for each token.

This gives several advantages:

| Advantage | Explanation |
|---|---|
| Larger total capacity | More parameters overall |
| Lower active compute | Only selected experts run |
| Expert specialization | Different experts learn different patterns |
| Better scaling efficiency | More capacity per FLOP |

The key idea is conditional computation. Different inputs activate different pathways.

Structure of an MoE Layer

A standard MoE transformer layer contains:

  1. Shared attention layer.
  2. Router network.
  3. Multiple expert feedforward networks.
  4. Aggregation mechanism.

The attention layer is usually dense and shared across all tokens. The feedforward block becomes sparse.

The structure is:

$$ x \rightarrow \text{Attention} \rightarrow \text{Router} \rightarrow \text{Selected Experts} \rightarrow \text{Aggregation}. $$

Each expert is usually a standard feedforward network:

$$ E_i(x)=W_{2,i}\phi(W_{1,i}x+b_{1,i})+b_{2,i}. $$

Router Networks

The router decides which experts process each token.

For token representation $x_t$, the router computes expert scores:

$$ r_t = W_rx_t, $$

where

$$ r_t\in\mathbb{R}^{N}, $$

and $N$ is the number of experts.

A softmax converts scores into routing probabilities:

$$ p_t = \text{softmax}(r_t). $$

The router then selects the top $k$ experts.

For top-1 routing:

$$ i^* = \arg\max_i p_{t,i}. $$

For top-2 routing:

$$ i_1,i_2 = \text{TopK}(p_t,2). $$

The selected experts receive the token.
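The routing steps above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than a production router: the sizes (`d_model = 16`, `n_experts = 8`, `k = 2`) and the variable names are assumptions chosen for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

d_model, n_experts, k = 16, 8, 2      # illustrative sizes (assumed)

# W_r: one score per expert; bias omitted to match r_t = W_r x_t.
router = nn.Linear(d_model, n_experts, bias=False)

x = torch.randn(5, d_model)           # 5 token representations x_t

r = router(x)                         # scores r_t, shape [5, n_experts]
p = torch.softmax(r, dim=-1)          # routing probabilities p_t

# Top-k selection: the k largest probabilities and their expert indices,
# returned in descending order of probability.
topk_p, topk_idx = torch.topk(p, k, dim=-1)   # both [5, k]

# Top-1 routing is the special case k = 1 (equivalent to argmax).
top1_idx = p.argmax(dim=-1)           # [5]
```

The first column of `topk_idx` matches `top1_idx`, since the best expert under top-2 routing is also the expert top-1 routing would pick.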

Top-k Routing

Most MoE systems use top-$k$ routing.

| Routing type | Experts per token |
|---|---|
| Top-1 | 1 |
| Top-2 | 2 |
| Top-4 | 4 |

Top-1 routing is the cheapest option because each token activates only one expert. Top-2 routing roughly doubles the expert compute per token but often improves quality and training stability, because each token combines the outputs of two experts.

Suppose token $x_t$ selects experts $i$ and $j$. Then:

$$ y_t = g_iE_i(x_t) + g_jE_j(x_t), $$

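where the routing weights are typically the selected router probabilities, renormalized over the chosen experts so that they sum to one:

$$ g_i = \frac{p_{t,i}}{p_{t,i}+p_{t,j}}, \qquad g_j = \frac{p_{t,j}}{p_{t,i}+p_{t,j}}, \qquad g_i + g_j = 1. $$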

The router weights determine how strongly each expert contributes.
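One common way to obtain weights that sum to one is to renormalize the softmax probabilities of the selected experts. The sketch below does this in plain Python for a single token; the helper names (`softmax`, `top2_gates`) and the score values are made up for illustration, and some implementations skip the renormalization step.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top2_gates(scores):
    """Return [(expert_index, gate_weight), ...] for the two best experts."""
    probs = softmax(scores)
    # Indices of the two largest probabilities, best first.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    # Renormalize so that the two gate weights sum to 1.
    return [(i, probs[i] / norm) for i in top2]

gates = top2_gates([2.0, 0.5, 1.0, -1.0])   # made-up router scores
```

Here experts 0 and 2 are selected, and expert 0 receives the larger weight because its router score is higher.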

Token Dispatch

Once routing decisions are made, tokens must be grouped by expert.

Suppose:

| Token | Assigned expert |
|---|---|
| $x_1$ | $E_2$ |
| $x_2$ | $E_5$ |
| $x_3$ | $E_2$ |
| $x_4$ | $E_1$ |

The system gathers tokens for each expert:

Expert 1: [x4]
Expert 2: [x1, x3]
Expert 5: [x2]

Each expert processes its assigned tokens independently.

After computation, outputs are scattered back to their original token positions.

This gather-scatter operation is one of the main systems challenges in MoE training.
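The gather and scatter steps can be illustrated with the four-token example above. This is a plain-Python sketch of the bookkeeping only; `run_expert` is a hypothetical stand-in for an expert network, and real systems perform the same steps with batched tensor operations.

```python
from collections import defaultdict

# Routing decisions from the example: token position -> expert id.
tokens = ["x1", "x2", "x3", "x4"]
assignment = [2, 5, 2, 1]

# Gather: collect the token positions assigned to each expert.
groups = defaultdict(list)
for pos, expert_id in enumerate(assignment):
    groups[expert_id].append(pos)

def run_expert(expert_id, batch):
    # Hypothetical stand-in for E_i: tags each token with its expert.
    return [f"E{expert_id}({tok})" for tok in batch]

# Scatter: write each expert's outputs back to the original positions.
outputs = [None] * len(tokens)
for expert_id, positions in groups.items():
    results = run_expert(expert_id, [tokens[p] for p in positions])
    for pos, res in zip(positions, results):
        outputs[pos] = res
```

Keeping the original positions alongside each group is what makes the scatter step possible: the expert sees a compact batch, and the saved indices restore sequence order afterwards.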

Expert Specialization

Experts often develop specialized behavior.

Different experts may focus on:

| Specialization | Example |
|---|---|
| Syntax | Grammar and structure |
| Mathematics | Numerical reasoning |
| Code | Programming tokens |
| Languages | Different natural languages |
| Retrieval | Citation or memory tokens |
| Vision regions | Different spatial patterns |

Specialization is not manually assigned. It emerges from routing and optimization dynamics.

However, specialization is imperfect. Experts may overlap substantially or collapse into similar behavior if routing is poorly balanced.

Load Balancing

A major problem in MoE systems is expert imbalance.

Suppose one expert receives most tokens:

Expert 1: 90% of tokens
Others: very few tokens

Then:

| Problem | Consequence |
|---|---|
| Hot experts | Compute bottlenecks |
| Idle experts | Wasted parameters |
| Poor specialization | Reduced model diversity |
| Communication imbalance | Slower distributed training |

To avoid this, MoE systems use load-balancing losses.

A simplified balancing objective encourages:

  1. Similar routing probability across experts.
  2. Similar token counts across experts.

The total loss becomes

$$ L = L_{\text{task}} + \lambda L_{\text{balance}}. $$

Here $\lambda$ controls the balancing strength.
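One concrete choice for $L_{\text{balance}}$ is the auxiliary loss used with Switch-style top-1 routing, which multiplies the fraction of tokens sent to each expert by the mean routing probability for that expert. The text above does not commit to a specific formula, so treat the following as one possible instance; the function name and the toy inputs are assumptions.

```python
import torch

def load_balance_loss(probs: torch.Tensor, expert_idx: torch.Tensor) -> torch.Tensor:
    """probs: [T, N] routing probabilities; expert_idx: [T] top-1 choices."""
    n_tokens, n_experts = probs.shape
    # f_i: fraction of tokens dispatched to expert i (not differentiable).
    counts = torch.bincount(expert_idx, minlength=n_experts).to(probs.dtype)
    f = counts / n_tokens
    # P_i: mean routing probability assigned to expert i (differentiable).
    P = probs.mean(dim=0)
    # Equals 1 when both f and P are uniform; larger when routing
    # concentrates on a few experts.
    return n_experts * torch.sum(f * P)

# Balanced case: 4 tokens, 4 experts, uniform probabilities.
uniform = torch.full((4, 4), 0.25)
balanced = load_balance_loss(uniform, torch.tensor([0, 1, 2, 3]))

# Collapsed case: every token strongly prefers expert 0.
skewed = torch.tensor([[0.97, 0.01, 0.01, 0.01]] * 4)
collapsed = load_balance_loss(skewed, torch.tensor([0, 0, 0, 0]))
```

Gradients reach the router only through `P`, because the token counts in `f` come from a hard selection. In training, this value would be multiplied by $\lambda$ and added to the task loss.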

Capacity Limits

Experts usually have a maximum token capacity per batch.

For example:

| Expert | Maximum capacity |
|---|---|
| $E_i$ | 128 tokens |

If too many tokens are routed to one expert, extra tokens may be:

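| Strategy | Description |
|---|---|
| Dropped | Skip the expert for excess tokens; they pass through the residual connection unchanged |
| Re-routed | Send to another expert |
| Buffered | Process later |
| Dynamically resized | Expand expert capacity |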

Capacity limits prevent single experts from becoming overloaded.

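The capacity factor sets how much headroom each expert has above a perfectly even split of tokens. A factor of 1.0 leaves no slack; larger values tolerate more imbalance at the cost of extra memory and padded compute: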

$$ \text{capacity} = \text{capacity factor} \times \frac{\text{tokens}}{\text{experts}}. $$
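A small worked example makes the capacity formula and the "dropped" strategy concrete. The sketch below is plain Python under stated assumptions: capacity is rounded up to a whole number of tokens, tokens are admitted in arrival order, and the function names are illustrative.

```python
import math

def expert_capacity(n_tokens: int, n_experts: int, capacity_factor: float) -> int:
    # capacity = capacity_factor * tokens / experts, rounded up to whole tokens.
    return math.ceil(capacity_factor * n_tokens / n_experts)

def apply_capacity(assignment, n_experts, capacity):
    """Keep tokens in arrival order until an expert is full; drop the rest."""
    load = [0] * n_experts
    kept, dropped = [], []
    for pos, expert_id in enumerate(assignment):
        if load[expert_id] < capacity:
            load[expert_id] += 1
            kept.append(pos)
        else:
            dropped.append(pos)
    return kept, dropped

# 8 tokens, 4 experts, capacity factor 1.25 -> ceil(1.25 * 8 / 4) = 3 slots each.
capacity = expert_capacity(n_tokens=8, n_experts=4, capacity_factor=1.25)

# Expert 0 is oversubscribed: five tokens want it, only three fit.
assignment = [0, 0, 1, 0, 0, 2, 0, 3]
kept, dropped = apply_capacity(assignment, n_experts=4, capacity=capacity)
```

With these inputs the tokens at positions 4 and 6 overflow expert 0 and are dropped, while every other token is processed.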

Distributed Expert Parallelism

MoE systems are naturally distributed.

Different experts can be placed on different accelerators:

GPU 1: Experts 1-4
GPU 2: Experts 5-8
GPU 3: Experts 9-12

Tokens are routed across devices.

This is called expert parallelism.

Compared with dense tensor parallelism, expert parallelism has different tradeoffs:

| Dense parallelism | Expert parallelism |
|---|---|
| Split matrix computation | Split experts |
| Every device active for every token | Devices active only for routed tokens |
| Regular communication | Sparse communication |
| Predictable load | Routing imbalance possible |

Communication becomes a central challenge. Tokens must move between devices efficiently.
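To see how routing decisions translate into cross-device traffic, the sketch below maps experts to devices using the placement shown earlier (four experts per GPU) and counts how many tokens each device must receive. The routing decisions are invented for illustration, and `device_for_expert` is a hypothetical helper, not part of any library.

```python
from collections import Counter

EXPERTS_PER_DEVICE = 4   # matches the placement above: 4 experts per GPU

def device_for_expert(expert_id: int) -> int:
    """Map a 1-based expert id to a 1-based device id (contiguous blocks)."""
    return (expert_id - 1) // EXPERTS_PER_DEVICE + 1

# Hypothetical top-1 routing decisions for ten tokens (expert ids 1..12).
routed_experts = [1, 5, 5, 6, 9, 2, 5, 7, 12, 5]

# Number of tokens each device must receive in the dispatch step.
tokens_per_device = Counter(device_for_expert(e) for e in routed_experts)
```

In this example GPU 2 receives six of the ten tokens while GPUs 1 and 3 receive two each, which is the kind of imbalance that load-balancing losses and capacity limits are meant to reduce.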

Sparse Scaling Laws

MoE models often achieve better quality for a given active compute budget.

A sparse model may have:

| Metric | Value |
|---|---|
| Total parameters | 1 trillion |
| Active parameters per token | 50 billion |

The model behaves like a very large network in terms of capacity, but like a smaller network in terms of per-token compute.

This changes the scaling tradeoff:

| Dense scaling | Sparse scaling |
|---|---|
| Capacity tied to compute | Capacity partially decoupled |
| More parameters always cost more FLOPs | Extra inactive experts add memory but few FLOPs |
| Compute grows with total size | Compute grows with active experts |

Sparse scaling is therefore attractive when parameter memory is cheaper than compute.
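The gap between total and active parameters can be computed directly for a single MoE feedforward layer. The sketch below counts parameters for experts of the form $W_2\phi(W_1x+b_1)+b_2$ plus a bias-free router; the layer sizes are illustrative assumptions, not figures from any particular model.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    # W_1 (d_model x d_ff) + b_1 + W_2 (d_ff x d_model) + b_2
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, k: int):
    """Return (total, active) parameter counts for one MoE feedforward layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * n_experts            # W_r, no bias
    total = n_experts * per_expert + router
    active = k * per_expert + router        # only k experts run per token
    return total, active

# Illustrative sizes (assumed): 64 experts, top-2 routing.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, n_experts=64, k=2)
ratio = total / active
```

With these sizes the layer stores roughly 537 million parameters but uses under 17 million per token, a ratio of about 32. All experts still have to be held in memory, which is why the tradeoff depends on memory being cheaper than compute.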

Routing Instability

Routers can become unstable during training.

Common issues include:

| Problem | Description |
|---|---|
| Expert collapse | Few experts dominate |
| Routing oscillation | Tokens rapidly switch experts |
| Dead experts | Some experts unused |
| Noisy specialization | Experts fail to stabilize |
| Overconfident routing | Router entropy collapses |

Several stabilization techniques are common:

| Technique | Purpose |
|---|---|
| Auxiliary balancing loss | Spread tokens evenly |
| Noisy routing | Encourage exploration |
| Temperature scaling | Smooth routing probabilities |
| Capacity constraints | Prevent overload |
| Top-2 routing | Improve robustness |

Stable routing is essential for good expert utilization.
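Router entropy gives a simple way to observe overconfident routing and the effect of temperature scaling. The sketch below is plain Python with made-up scores; dividing the scores by a temperature greater than one before the softmax is one common form of temperature scaling, assumed here for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    scaled = [s / temperature for s in scores]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; 0 means the router is fully confident.
    return -sum(p * math.log(p) for p in probs if p > 0)

scores = [4.0, 1.0, 0.5, 0.0]          # made-up router scores

sharp = softmax(scores, temperature=1.0)
smooth = softmax(scores, temperature=4.0)

h_sharp = entropy(sharp)
h_smooth = entropy(smooth)
h_max = math.log(len(scores))          # entropy of a uniform distribution
```

A higher temperature moves the distribution toward uniform, so `h_smooth` is larger than `h_sharp` and closer to `h_max`. Tracking this quantity during training is one way to notice router entropy collapsing.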

Switch Transformers

Switch Transformers simplified MoE routing using top-1 routing.

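Instead of combining multiple experts, each token is processed only by its highest-scoring expert $i^*$:

$$ y_t = p_{t,i^*}E_{i^*}(x_t). $$

The output is still scaled by the router probability $p_{t,i^*}$, which keeps the router trainable by gradient descent. Compared with top-2 routing, this reduces communication and expert computation.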

Advantages:

| Benefit | Explanation |
|---|---|
| Simpler routing | One expert per token |
| Lower communication | Fewer transfers |
| Lower memory | Fewer active expert outputs |
| Faster training | Less aggregation overhead |

The tradeoff is lower routing flexibility.

Switch-style routing showed that very large sparse models could scale efficiently with simpler systems design.

Shared and Specialized Layers

Most MoE transformers are hybrid architectures.

| Component | Usually dense or sparse |
|---|---|
| Attention layers | Dense |
| Embeddings | Dense |
| Output head | Dense |
| Feedforward layers | Sparse experts |

Attention layers remain shared because every token must exchange information globally. Sparse experts mainly replace the computationally expensive feedforward blocks.

This hybrid design balances communication cost and specialization.

MoE During Inference

Inference introduces additional challenges.

The model must:

  1. Run the router.
  2. Dispatch tokens to experts.
  3. Gather outputs.
  4. Manage KV cache.
  5. Coordinate devices.

Latency can increase if routing creates communication bottlenecks.

Inference optimization techniques include:

| Technique | Purpose |
|---|---|
| Expert caching | Reuse loaded expert weights |
| Expert placement optimization | Reduce communication |
| Token batching | Improve utilization |
| Routing locality | Keep tokens near experts |
| Quantized experts | Reduce memory bandwidth |

Sparse inference efficiency depends heavily on systems engineering.

Sparse Experts Beyond Language

MoE ideas are also used in:

| Domain | Application |
|---|---|
| Vision | Specialized image experts |
| Multimodal systems | Experts per modality |
| Speech | Acoustic specialization |
| Robotics | Task-conditioned policies |
| Retrieval systems | Memory-aware routing |

The general principle is conditional computation: activate only the parts of the model needed for the current input.

MoE Versus Dense Models

| Property | Dense transformer | Sparse MoE transformer |
|---|---|---|
| Active parameters | All | Small subset |
| Compute scaling | Grows with total size | Grows with active experts |
| Communication | Simpler | More complex |
| Capacity per FLOP | Lower | Higher |
| Systems complexity | Lower | Higher |
| Specialization | Shared representation | Expert specialization |

Dense models are simpler and often more stable to train. Sparse models offer more capacity per unit of compute at very large sizes, in exchange for higher memory use and systems complexity.

A Minimal MoE Layer in PyTorch

A simplified educational MoE layer:

import torch
from torch import nn

class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()

        self.router = nn.Linear(d_model, n_experts)

        self.experts = nn.ModuleList([
            Expert(d_model, d_ff)
            for _ in range(n_experts)
        ])

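    def forward(self, x):
        # x: [B, T, D]; each token is routed to exactly one expert (top-1).
        scores = self.router(x)                 # [B, T, n_experts]
        probs = torch.softmax(scores, dim=-1)

        # Gate value and index of the highest-probability expert per token.
        gate, top_expert = probs.max(dim=-1)    # both [B, T]

        out = torch.zeros_like(x)

        for expert_id, expert in enumerate(self.experts):
            mask = (top_expert == expert_id)    # [B, T] boolean

            if mask.any():
                selected = x[mask]              # [n_selected, D]
                result = expert(selected)
                # Scale by the gate so the router receives gradients;
                # the argmax selection alone is not differentiable.
                out[mask] = gate[mask].unsqueeze(-1) * result

        return out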

This implementation is intentionally simple and inefficient. Real MoE systems use optimized grouped dispatch kernels and distributed expert execution.
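The same loop-over-experts pattern extends to top-2 routing with renormalized gate weights. The sketch below is equally simple and inefficient; the class name `Top2MoE` and the tensor sizes in the usage lines are assumptions for illustration, and it omits capacity limits and load-balancing losses.

```python
import torch
from torch import nn

class Top2MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: [B, T, D] -> flatten tokens to [B*T, D] for simpler indexing.
        B, T, D = x.shape
        flat = x.reshape(-1, D)

        probs = torch.softmax(self.router(flat), dim=-1)   # [B*T, N]
        gate, idx = torch.topk(probs, 2, dim=-1)           # both [B*T, 2]
        gate = gate / gate.sum(dim=-1, keepdim=True)       # weights sum to 1

        out = torch.zeros_like(flat)
        for expert_id, expert in enumerate(self.experts):
            # Token rows and top-2 slots where this expert was selected.
            token_pos, slot = torch.where(idx == expert_id)
            if token_pos.numel() > 0:
                weighted = gate[token_pos, slot].unsqueeze(-1) * expert(flat[token_pos])
                # index_add accumulates the two expert contributions per token.
                out = out.index_add(0, token_pos, weighted)

        return out.reshape(B, T, D)

torch.manual_seed(0)
layer = Top2MoE(d_model=8, d_ff=16, n_experts=4)
x = torch.randn(2, 3, 8)
y = layer(x)
y.sum().backward()   # gradients flow to both the experts and the router
```

Each token's output is the weighted sum of exactly two expert outputs, and the output shape matches the input so the layer can replace a dense feedforward block.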

Summary

Sparse expert architectures replace dense feedforward computation with conditionally activated experts. A router selects which experts process each token, allowing the total parameter count to grow far beyond the parameters active for any single token.

MoE systems improve scaling efficiency by separating model capacity from per-token FLOPs. They introduce new challenges including routing stability, load balancing, communication overhead, capacity limits, and distributed dispatch.

Modern sparse transformers combine dense shared attention with sparse expert feedforward layers. This architecture has become an important approach for scaling very large foundation models efficiently.