Model behavior

What a model is (its architecture, its numeric precision, its decoding strategy) moves serving cost as much as any infrastructure choice. These pages cover the model-side levers: sparse mixture-of-experts routing, quantisation formats, and draft-model speculation.

What a model is, its architecture, its numeric precision, and its decoding strategy, moves serving cost as much as any infrastructure choice. A sparse mixture-of-experts activates a fraction of its parameters per token; a quantised checkpoint halves or quarters the memory per weight; a draft model amortises the memory traffic of decoding.

These model-side levers decide whether a model fits on one GPU or demands a tensor-parallel rack, and how fast it decodes once it does. The pages below cover the formats and techniques, and the research bibliography that grounds them in the original papers.

This pillar collects every page on the topic. Each one below opens with the answer; follow a link for the full treatment, or use the rail to cross into a neighbouring pillar.

Research and field notes

The bibliography and original-telemetry articles behind the model-side claims. Read more.

Quantisation

Storing weights at lower precision than the training format. Read more.

FP8 quantization

8-bit floating-point weights executed natively by Hopper tensor cores. Read more.

Speculative decoding

Draft-model token proposals verified in one target-model pass. Read more.

Mixture of experts (MoE)

Sparse expert routing: large total capacity, small per-token compute. Read more.

AWQ (Lin et al., MLSys 2024)

Activation-aware INT4 weight quantisation via per-channel pre-scaling. Read more.

GPTQ (Frantar et al., ICLR 2023)

Inverse-Hessian-guided post-training INT4 weight quantisation at 175B scale. Read more.

Speculative decoding (Leviathan / Chen, 2023)

Draft-and-verify decoding; 2-3x lower latency at zero quality cost. Read more.

Every page in this pillar describes the system running behind one endpoint. Point an OpenAI SDK at spotinference when ready: swap the base URL and the key, and the first request answers from the same fleet these pages measure.