A mixture-of-experts model replaces the transformer's dense feed-forward block with many parallel expert networks and a learned router that activates only a few experts per token. Capacity scales far beyond what any one token touches: qwen3.6-a3b carries roughly 35B total parameters yet activates about 3B on each forward pass.

Mixture of experts (MoE)

Dense transformers spend the same compute on every token. A mixture-of-experts transformer swaps each feed-forward block for a set of parallel expert networks plus a learned router, and each token is processed by only the few experts the router selects.

Routing and sparse activation

The router is a small learned gate that scores every expert for each token and dispatches the token to the top few. Total parameter count grows with the number of experts while per-token FLOPs stay close to those of a small dense model, which is how qwen3.6-a3b reaches roughly 35B parameters of capacity at about 3B activated per forward pass. The architectural risks are routing collapse, where traffic concentrates on a few experts, and load imbalance across devices; production recipes counter both with auxiliary balancing losses.
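The routing step and a balancing loss can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the recipe any particular model uses: the gate is assumed to be a softmax over per-expert logits with the top-k weights renormalized, and the auxiliary loss follows the common "fraction dispatched times mean router probability" form. The names `route_top_k` and `load_balance_loss` are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are the softmax probabilities of the selected
    experts, renormalized so they sum to 1.
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def load_balance_loss(batch_logits: List[List[float]], k: int) -> float:
    """Auxiliary loss n_experts * sum_i(f_i * p_i) over a batch of tokens.

    f_i is the fraction of routed slots dispatched to expert i and p_i is
    the mean router probability for expert i. The value is 1.0 when routing
    is perfectly uniform and grows as traffic concentrates.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    dispatch = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / n_tokens
        for i, _ in route_top_k(logits, k):
            dispatch[i] += 1.0 / (n_tokens * k)
    return n_experts * sum(f * p for f, p in zip(dispatch, mean_prob))


# Four experts, top-1 routing. Each token prefers a different expert...
balanced = [[5.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# ...versus every token preferring expert 0 (routing collapse).
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]

print(route_top_k([2.0, 0.5, 1.0, -1.0], k=2))
print(load_balance_loss(balanced, k=1), load_balance_loss(collapsed, k=1))
```

Only the selected experts run for a given token, which is where the compute saving comes from. The loss is lowest when dispatch is even, so adding it to the training objective pushes the router away from the collapsed case.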

Why MoE anchors the cheap serving tier

Small-activation MoE models deliver capability per dollar that dense peers cannot match at decode time: served through spotinference, qwen3.6-a3b solved 7 of 8 mined real-repository coding tasks inside a 600K-token budget (measured 2026-06-11). The operational cost surfaces as memory, since every expert must sit in VRAM regardless of how rarely it fires, and as quantization sharp edges, because expert weights tolerate compression differently than dense layers.

On those sharp edges, see How engines work: AWQ; early vLLM releases shipped a known AWQ support gap on MoE checkpoints.

Cost and reliability implications

Sparse activation is what makes a cheap serving tier viable: the 3B-active qwen3.6-a3b posted the 7-of-8 result cited above (600K-token budget, measured 2026-06-11) at a fraction of dense-model cost per token. The bill arrives as memory instead: all 35B parameters must sit in VRAM even though only about 3B fire per token, so MoE strains GPU memory capacity before it strains arithmetic throughput.
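The memory side of that trade can be estimated with back-of-the-envelope arithmetic. The sketch below uses the rough 35B-total and 3B-active figures from the text. The bit widths are illustrative assumptions, and the estimate covers weights only, ignoring KV cache, activations, and runtime overhead. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Weight storage in decimal gigabytes (weights only, no KV cache)."""
    return n_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 35e9   # rough total from the text; all of it must be resident
ACTIVE_PARAMS = 3e9   # rough per-token activated parameters from the text

for bits in (16, 8, 4):  # assumed precisions, for illustration only
    resident = weight_memory_gb(TOTAL_PARAMS, bits)
    touched = weight_memory_gb(ACTIVE_PARAMS, bits)
    print(f"{bits:>2}-bit: {resident:.1f} GB resident, "
          f"{touched:.1f} GB of weights touched per token")

print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

At 16 bits per parameter the weights alone come to about 70 GB, while each token touches under a tenth of them. That gap is why memory capacity, not arithmetic, sets the hardware floor.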
