GPTQ (Frantar et al., ICLR 2023)
The ICLR 2023 paper that brought 3- and 4-bit weight quantisation of 175-billion-parameter transformers down to roughly four hours on a single GPU, using an inverse-Hessian-guided greedy column sweep that minimises layer-output reconstruction error on a small calibration set.
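The column sweep can be illustrated for a single linear layer. What follows is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`make_grid`, `round_to_grid`, `gptq_sweep`), the per-row min-max grid and the 1% damping default are illustrative choices, and the block-wise lazy updates, group-wise grids and column reordering used in practical implementations are left out.

```python
import numpy as np


def make_grid(W, bits=4):
    """Per-row asymmetric min-max grid (an illustrative choice of quantiser)."""
    levels = 2 ** bits - 1
    lo = W.min(axis=1, keepdims=True)
    hi = W.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    zero = np.round(-lo / scale)
    return scale, zero, levels


def round_to_grid(w, scale, zero, levels):
    """Round to the nearest grid point and return the dequantised values."""
    q = np.clip(np.round(w / scale) + zero, 0, levels)
    return scale * (q - zero)


def gptq_sweep(W, X, bits=4, damp=0.01):
    """Quantise W (rows x cols) one column at a time.

    X is a (cols x n_samples) matrix of calibration inputs to the layer.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    scale, zero, levels = make_grid(W, bits)

    # Hessian of the layer-output squared error, damped for stability.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(cols)
    # Upper-triangular U with U.T @ U == inv(H).
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for i in range(cols):
        w = W[:, i:i + 1]
        q = round_to_grid(w, scale, zero, levels)
        Q[:, i:i + 1] = q
        # Spread the scaled rounding error over the not-yet-quantised columns.
        err = (w - q) / U[i, i]
        W[:, i:] -= err @ U[i:i + 1, i:]
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mix = np.eye(16) + 0.5                      # makes input features correlated
    X = mix @ rng.normal(size=(16, 256))
    W = rng.normal(size=(8, 16))
    rtn = round_to_grid(W, *make_grid(W))       # plain round-to-nearest baseline
    swept = gptq_sweep(W, X)
    print("round-to-nearest error:", np.sum(((W - rtn) @ X) ** 2))
    print("column-sweep error:    ", np.sum(((W - swept) @ X) ** 2))
```

When the calibration features are uncorrelated the Hessian is diagonal, no error is propagated, and the sweep reduces to plain round-to-nearest. The compensation step only matters when inputs are correlated, which is the usual case for transformer activations.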
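GPTQ shrank the multi-GPU node needed to host an FP16 175B-class checkpoint (about 350 GB of weights, so at least five 80-gigabyte accelerators) to a single 80-gigabyte accelerator at 3 bits per weight; 4-bit weights alone come to roughly 87.5 GB. This helped make open-weight serving economically tractable across 2023 and 2024.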
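The memory argument is simple arithmetic and can be checked directly. The sketch below counts weight storage only, assuming decimal gigabytes and ignoring quantisation metadata (scales and zero points), activations and the KV cache. The helper names are hypothetical.

```python
import math


def weight_gb(n_params, bits_per_weight):
    """Weight storage in decimal gigabytes, ignoring quantisation metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def accelerators_needed(n_params, bits_per_weight, capacity_gb=80):
    """Minimum number of accelerators whose combined memory holds the weights."""
    return math.ceil(weight_gb(n_params, bits_per_weight) / capacity_gb)


if __name__ == "__main__":
    n_params = 175_000_000_000
    for bits in (16, 4, 3):
        print(bits, weight_gb(n_params, bits), accelerators_needed(n_params, bits))
```

Under these assumptions a 175B-parameter model needs 350 GB at 16 bits (five 80 GB devices), 87.5 GB at 4 bits (two devices) and about 65.6 GB at 3 bits (one device).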
For the longer treatment in narrative context, see How engines work: quantisation.