Quantisation
Quantisation stores the parameters of a trained neural network at a numeric precision lower than the BF16 or FP16 format used during training, with FP8, INT8, and INT4 the dominant production formats for transformer LLMs.
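Every halving of weight precision halves the VRAM cost per parameter: 2 bytes at BF16 or FP16, 1 byte at FP8 or INT8, and half a byte at INT4, before the small overhead of quantisation scales. That saving can decide whether a model fits on one GPU, and it frees memory that the serving engine can then spend on KV cache and concurrent requests.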
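The per-parameter arithmetic above can be sketched in a few lines. This is a rough illustration, not a sizing tool: the 70B parameter count and the 80 GB GPU are hypothetical figures chosen for the example, the helper names `weight_gb` and `kv_budget_gb` are made up here, and the estimate ignores quantisation scales and zero-points, activations, and runtime overhead.

```python
# Bits per weight for the formats named above.
BITS_PER_WEIGHT = {"BF16": 16, "FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_gb(num_params: int, fmt: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Ignores per-group scales and zero-points, so real quantised
    checkpoints are somewhat larger than this estimate.
    """
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def kv_budget_gb(num_params: int, fmt: str, gpu_gb: float) -> float:
    """Memory left after loading weights; negative means they do not fit."""
    return gpu_gb - weight_gb(num_params, fmt)


if __name__ == "__main__":
    params = 70_000_000_000  # hypothetical 70B-parameter model (assumption)
    gpu_gb = 80.0            # hypothetical 80 GB GPU (assumption)
    for fmt in ("BF16", "FP8", "INT4"):
        print(
            f"{fmt}: weights {weight_gb(params, fmt):.0f} GB, "
            f"left over {kv_budget_gb(params, fmt, gpu_gb):.0f} GB"
        )
```

Under these assumptions the weights take about 140 GB at BF16, 70 GB at FP8, and 35 GB at INT4, so only the 8-bit and 4-bit variants fit on the single hypothetical GPU, and the 4-bit variant leaves the most room for KV cache.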
For the longer treatment, see How engines work: quantisation.