Mixture of experts (MoE)
Dense transformers spend the same compute on every token. A mixture-of-experts transformer swaps each feed-forward block for a set of parallel expert networks plus a learned router, and each token is processed by only the few experts the router selects.
Routing and sparse activation
The router is a small learned gate that scores every expert for each token and dispatches the token to the top few. Total parameter count grows with the number of experts while per-token FLOPs stay close to those of a small dense model, which is how qwen3.6-a3b reaches roughly 35B parameters of capacity at about 3B activated per forward pass. The architectural risks are routing collapse, where traffic concentrates on a few experts, and load imbalance across devices; production recipes counter both with auxiliary balancing losses.
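The routing step and a balancing loss can be sketched in a few lines of plain Python. This is a minimal illustration, not the recipe any particular model uses: the function names (softmax, route_top_k, load_balance_loss) are hypothetical, renormalizing the selected weights is one common choice among several, and the loss shown is the Switch Transformer style form (number of experts times the sum over experts of dispatch fraction times mean router probability), which is an assumption because the text above does not say which auxiliary loss is meant.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns [(expert_index, weight), ...]. Weights are renormalized over
    the selected experts (an assumption; some recipes skip this step).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def load_balance_loss(all_logits, k):
    """Switch-style auxiliary loss over a batch of per-token router logits.

    loss = num_experts * sum_i(f_i * P_i), where f_i is the fraction of
    dispatch slots sent to expert i and P_i is its mean router probability.
    """
    n_tokens = len(all_logits)
    n_experts = len(all_logits[0])
    counts = [0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in all_logits:
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / n_tokens
        for i, _ in route_top_k(logits, k):
            counts[i] += 1
    frac = [c / (n_tokens * k) for c in counts]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))


if __name__ == "__main__":
    # Toy batch: 8 tokens, 4 experts, top-1 routing.
    balanced = [[5.0 if e == t % 4 else 0.0 for e in range(4)] for t in range(8)]
    collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(8)]
    print("balanced :", load_balance_loss(balanced, k=1))
    print("collapsed:", load_balance_loss(collapsed, k=1))
```

In the toy batch, the loss is lowest when tokens spread evenly across experts and grows as traffic piles onto one expert, which is the pressure a training recipe adds (scaled by a small coefficient) to discourage routing collapse.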
Why MoE anchors the cheap serving tier
Small-activation MoE models deliver capability per dollar that dense peers cannot match at decode time: served through spotinference, qwen3.6-a3b solved 7 of 8 mined real-repository coding tasks inside a 600K-token budget (measured 2026-06-11). The operational cost surfaces as memory, since every expert must sit in VRAM regardless of how rarely it fires, and as quantization sharp edges, because expert weights tolerate compression differently than dense layers.
On those sharp edges, see How engines work: AWQ; early vLLM releases shipped with a known gap in AWQ support for MoE checkpoints.
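The memory-versus-compute trade-off described in this section can be made concrete with back-of-envelope arithmetic. The sketch below uses the approximate figures quoted above (35B total parameters, 3B activated); the helper names are hypothetical, the estimate covers weights only (no KV cache, activations, or quantization scale overhead), and the 2 FLOPs per active parameter per token figure is a common rule of thumb rather than a measured value.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Weights-only footprint in decimal gigabytes.

    Every expert counts, because all experts stay resident in VRAM
    regardless of how often the router selects them.
    """
    return total_params * bits_per_param / 8 / 1e9


def decode_flops_per_token(active_params):
    """Rough forward-pass cost: about 2 FLOPs per activated parameter."""
    return 2 * active_params


if __name__ == "__main__":
    TOTAL_PARAMS = 35e9   # approximate total capacity quoted in the text
    ACTIVE_PARAMS = 3e9   # approximate parameters activated per token

    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label}: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB of weights")

    print(f"compute: {decode_flops_per_token(ACTIVE_PARAMS):.1e} FLOPs per token")
```

Under these assumptions the compute bill tracks the 3B activated parameters while the memory bill tracks all 35B, which is why weight quantization matters so much for this serving tier.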