

Quantisation

Quantisation stores the parameters of a trained neural network at a numeric precision lower than the BF16 or FP16 format used during training, with FP8, INT8, and INT4 the dominant production formats for transformer LLMs.
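As a rough illustration of what storing weights at lower precision means, the sketch below applies symmetric round-to-nearest INT8 quantisation with a single per-tensor scale. This is one simple scheme chosen for clarity, not necessarily what any particular engine or format uses; the function names `quantise_int8` and `dequantise` are hypothetical.

```python
from typing import List, Tuple


def quantise_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale.

    Assumption: symmetric, per-tensor, absmax scaling. Production schemes
    often use per-channel or per-group scales instead.
    """
    absmax = max((abs(w) for w in weights), default=0.0)
    scale = absmax / 127 if absmax > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


if __name__ == "__main__":
    original = [0.5, -1.27, 0.01, 0.0]  # made-up example weights
    q, scale = quantise_int8(original)
    restored = dequantise(q, scale)
    for w, r in zip(original, restored):
        print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.6f})")
```

Each restored weight differs from the original by at most half a quantisation step (`scale / 2`), which is the rounding error that accumulates into the quality loss discussed later in this entry.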

Every halving of weight precision halves the VRAM cost per parameter, which decides whether a model fits on one GPU and frees the memory budget that the serving engine then spends on KV cache and concurrent requests.
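The arithmetic behind that claim can be sketched in a few lines. The parameter count and GPU memory size below are hypothetical, and the estimate covers the weights only: it ignores the small overhead of quantisation scales, activations, and runtime buffers, so treat the result as a lower bound.

```python
# Bits stored per weight for the formats named in this entry.
BITS_PER_WEIGHT = {"BF16": 16, "FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_vram_gib(n_params: float, fmt: str) -> float:
    """Approximate VRAM taken by the weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


def remaining_gib(gpu_gib: float, n_params: float, fmt: str) -> float:
    """Memory left over for KV cache and everything else (negative = does not fit)."""
    return gpu_gib - weight_vram_gib(n_params, fmt)


if __name__ == "__main__":
    n_params = 70e9  # hypothetical 70B-parameter model
    gpu_gib = 80.0   # hypothetical single-GPU memory budget
    for fmt in ("BF16", "FP8", "INT4"):
        weights = weight_vram_gib(n_params, fmt)
        left = remaining_gib(gpu_gib, n_params, fmt)
        verdict = "fits" if left > 0 else "does not fit"
        print(f"{fmt}: weights {weights:.1f} GiB, remaining {left:.1f} GiB ({verdict})")
```

Under these assumed numbers the 16-bit weights exceed the single-GPU budget, while the 8-bit and 4-bit weights fit, and the 4-bit variant leaves the most memory for KV cache.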

For the longer treatment, see How engines work: quantisation.

Cost and reliability implications

Quantisation is the single largest lever on dollars per token: INT4 weights need roughly a quarter of the VRAM that BF16 or FP16 weights do, which often moves the required fleet tier from multi-GPU down to single-GPU. The risk is silent quality loss. Accuracy regressions do not throw errors, and KV-cache dtype mismatches on hybrid architectures can corrupt output without any warning, so eval gates and conservative cache settings are mandatory.
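Because regressions do not raise errors, an eval gate has to compare scores explicitly. The following is a minimal sketch of such a gate, assuming per-task accuracy scores in the range 0 to 1 have already been computed for the baseline and the quantised model. The function name `eval_gate_failures`, the task names, the scores, and the 0.01 threshold are all hypothetical.

```python
from typing import Dict, List


def eval_gate_failures(
    baseline: Dict[str, float],
    quantised: Dict[str, float],
    max_drop: float = 0.01,
) -> List[str]:
    """Return the tasks where the quantised model fails the gate.

    A task fails if its score is missing for the quantised model or if it
    drops by more than max_drop relative to the baseline.
    """
    failures = []
    for task, base_score in baseline.items():
        if task not in quantised:
            failures.append(task)  # a missing result is treated as a failure
        elif base_score - quantised[task] > max_drop:
            failures.append(task)
    return failures


if __name__ == "__main__":
    baseline = {"qa": 0.80, "code": 0.60}    # made-up baseline scores
    quantised = {"qa": 0.795, "code": 0.52}  # made-up quantised scores
    failed = eval_gate_failures(baseline, quantised)
    if failed:
        raise SystemExit(f"Eval gate failed for: {', '.join(failed)}")
    print("Eval gate passed")
```

Treating a missing task as a failure is a deliberate choice here: a gate that skips absent results silently would reintroduce the same class of silent failure it is meant to catch.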


