Quantisation is the practice of storing a trained model's weights at lower numeric precision than the BF16 or FP16 used in training, with FP8, INT8, and INT4 the dominant production formats. Every halving of precision halves VRAM per parameter, deciding whether a model fits one GPU and how much memory remains for KV cache.

Quantisation

Quantisation stores the parameters of a trained neural network at a numeric precision lower than the BF16 or FP16 format used during training, with FP8, INT8, and INT4 the dominant production formats for transformer LLMs.

Every halving of weight precision halves the VRAM cost per parameter, which decides whether a model fits on one GPU and frees the memory budget that the serving engine then spends on KV cache and concurrent requests.

For the longer treatment, see How engines work: quantisation.

Cost and reliability implications

Quantisation is the single largest lever on dollars per token: INT4 weights fit a model on a quarter of the VRAM, often downgrading the required fleet tier from multi-GPU to single-GPU. The risk is silent quality loss; accuracy regressions do not throw errors, and KV-cache dtype mismatches on hybrid architectures can corrupt output silently, so eval gates and conservative cache settings are mandatory.

Part of Model behavior on the learn hub.

See also

References

Frantar, Ashkboos, Hoefler, Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (ICLR 2023). Layer-wise reconstruction-error minimisation with an inverse-Hessian-guided greedy update; the foundational paper for production INT4 weight quantisation of LLMs.
Lin, Tang, Tang, et al. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (MLSys 2024). Identifies the salient weight channels by activation magnitude and preserves them at higher effective precision; tends to outperform GPTQ on outlier-sensitive tasks.
vLLM issue #37554. Silent output corruption with non-auto KV cache dtype on hybrid Gated-DeltaNet architectures. Documents the silent-corruption failure mode that motivates leaving the KV cache at native precision for hybrid GDN models even when weights are quantised.

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.