GPTQ is a post-training quantisation algorithm that compresses transformer weights to INT4 one layer at a time, using an inverse-Hessian-guided column sweep to minimise reconstruction error against a small calibration set. Published at ICLR 2023, it made quantising a 175-billion-parameter model a single-GPU, few-hour procedure.

GPTQ (Frantar et al., ICLR 2023)

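GPTQ is the ICLR 2023 method that made 3- and 4-bit weight quantisation of 175-billion-parameter transformers a roughly four-hour job on a single GPU. It quantises each layer's weight matrix one column at a time in a fixed order shared by all rows. After each column it uses the inverse Hessian of the layer-output reconstruction error, estimated from a small calibration set, to adjust the columns not yet quantised so that they absorb the error just introduced.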
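The column sweep can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `gptq_layer` and `quantize_rtn`, the symmetric per-row grid, and the 1% damping default are choices made here for clarity, and the sketch omits the block-wise lazy updates and grouping that practical implementations use for speed.

```python
import numpy as np


def quantize_rtn(w, scale, bits=4):
    """Round to the nearest point on a symmetric signed-integer grid."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def gptq_layer(W, X, bits=4, damp=0.01):
    """Quantise W (rows x cols) against calibration inputs X (cols x samples).

    Hypothetical helper for illustration; returns the dequantised weights Q
    and the per-row scales.
    """
    W = W.astype(np.float64).copy()
    rows, cols = W.shape

    # Per-row scale fixed from the original weights (an assumption of this sketch).
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax
    scale[scale == 0] = 1.0

    # Hessian of the layer-output squared error, with diagonal damping
    # so that it stays safely invertible.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(cols)

    # Upper Cholesky factor U of the inverse Hessian: H^-1 = U.T @ U.
    # Row i of U carries what the sweep needs once columns < i are fixed.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for i in range(cols):
        # Quantise column i for every row at once (same order for all rows).
        Q[:, i] = quantize_rtn(W[:, i], scale, bits)
        # Scaled quantisation error of this column.
        err = (W[:, i] - Q[:, i]) / U[i, i]
        # Push the error onto the columns not yet quantised.
        W[:, i:] -= np.outer(err, U[i, i:])
    return Q, scale
```

To see the effect, one would compare the layer-output error `np.linalg.norm(W @ X - Q @ X)` against the same quantity for plain round-to-nearest. On correlated calibration inputs the swept result is usually lower, although this toy version makes no guarantee of that.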

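In FP16, a 175B-class checkpoint holds roughly 350 GB of weights and has to be sharded across a multi-GPU server, often an eight-GPU node. At 3 bits per weight, a setting the paper also evaluates, the same weights shrink to roughly 66 GB and fit on a single 80-gigabyte accelerator; at 4 bits they come to about 88 GB. That reduction helped make open-weight serving economically tractable across 2023 and 2024.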
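The footprint arithmetic behind that claim is easy to check. The sketch below counts weight storage only; it assumes decimal gigabytes and ignores the KV cache, activations, and the small overhead of quantisation scales and zero-points. The helper names are illustrative.

```python
import math


def weight_gigabytes(n_params, bits_per_weight):
    """Weight storage in decimal gigabytes, ignoring all other memory use."""
    return n_params * bits_per_weight / 8 / 1e9


def min_gpus(n_params, bits_per_weight, gpu_gb=80):
    """Fewest GPUs of the given capacity that can hold the weights alone."""
    return math.ceil(weight_gigabytes(n_params, bits_per_weight) / gpu_gb)


for bits in (16, 4, 3):
    gb = weight_gigabytes(175e9, bits)
    print(f"{bits:>2}-bit: {gb:.1f} GB, at least {min_gpus(175e9, bits)} x 80 GB GPU(s)")
```

For 175 billion parameters this gives 350 GB at 16 bits, 87.5 GB at 4 bits, and about 65.6 GB at 3 bits, so only the 3-bit weights fit on one 80 GB device before any runtime memory is counted.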

For the longer treatment in narrative context, see How engines work: quantisation.



References

Frantar, Ashkboos, Hoefler, Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023. The primary source. Sections 3 and 4 derive the inverse-Hessian update; Section 5 reports the OPT-175B and BLOOM-176B numbers.
Frantar, Singh, Alistarh. Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. NeurIPS 2022. The OBQ predecessor, which GPTQ scales to the 100-billion-parameter regime by replacing OBQ's per-row greedy quantisation order with a fixed column order shared by all rows and applied in blocks.
Lin, Tang, Tang, et al. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024. The activation-aware follow-on that outperforms GPTQ on outlier-sensitive benchmarks; both schemes coexist in current serving stacks.
