
GPTQ is a post-training quantisation algorithm that compresses transformer weights to INT4 one layer at a time, using an inverse-Hessian-guided column sweep to minimise reconstruction error against a small calibration set. Published at ICLR 2023, it made quantising a 175-billion-parameter model a single-GPU, few-hour procedure.

GPTQ (Frantar et al., ICLR 2023)

The ICLR 2023 paper that made INT4 weight quantisation of 175-billion-parameter transformers a single-GPU procedure taking a few hours, using an inverse-Hessian-guided greedy column sweep that minimises layer-output reconstruction error on a small calibration set.
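The column sweep can be illustrated for a single linear layer with a minimal NumPy sketch. It is a simplification under stated assumptions, not the reference implementation: it omits the blocked (lazy-batch) updates and group-wise grids, uses a plain per-row min-max 4-bit grid, and the names `make_grid`, `quantise`, `gptq_quantise`, `rtn_quantise`, and `layer_error` are all hypothetical.

```python
import numpy as np


def make_grid(W, bits=4):
    """Per-row asymmetric min-max grid (an assumed, simple quantiser)."""
    qmax = 2**bits - 1
    wmin = np.minimum(W.min(axis=1), 0.0)
    wmax = np.maximum(W.max(axis=1), 0.0)
    scale = (wmax - wmin) / qmax
    scale[scale == 0] = 1.0          # guard against all-zero rows
    zero = np.round(-wmin / scale)
    return scale, zero, qmax


def quantise(w, scale, zero, qmax):
    """Round onto the integer grid, clip, and map back to floats."""
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return scale * (q - zero)


def rtn_quantise(W, bits=4):
    """Round-to-nearest baseline: every weight rounded independently."""
    scale, zero, qmax = make_grid(W, bits)
    return quantise(W, scale[:, None], zero[:, None], qmax)


def gptq_quantise(W, X, bits=4, damp=0.01):
    """Quantise W (rows x cols) column by column.

    X holds calibration inputs with shape (cols, n_samples).
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    scale, zero, qmax = make_grid(W, bits)

    # Layer-wise Hessian of the squared output error, with diagonal damping.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(cols)

    # Upper Cholesky factor U of the inverse Hessian (Hinv = U.T @ U).
    Hinv = np.linalg.inv(H)
    Hinv = (Hinv + Hinv.T) / 2.0     # remove floating-point asymmetry
    U = np.linalg.cholesky(Hinv).T

    Q = np.zeros_like(W)
    for i in range(cols):
        q = quantise(W[:, i], scale, zero, qmax)
        Q[:, i] = q
        # Spread this column's rounding error over the columns not yet
        # quantised, weighted by row i of the Cholesky factor.
        err = (W[:, i] - q) / U[i, i]
        W[:, i:] -= np.outer(err, U[i, i:])
    return Q


def layer_error(W, Q, X):
    """Squared Frobenius error of the layer output on the calibration set."""
    return float(np.sum((W @ X - Q @ X) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((1, 512))
    X = 0.8 * latent + 0.6 * rng.standard_normal((32, 512))  # correlated inputs
    W = rng.standard_normal((64, 32))
    print("round-to-nearest error:", layer_error(W, rtn_quantise(W), X))
    print("column-sweep error:    ", layer_error(W, gptq_quantise(W, X), X))
```

The comparison against round-to-nearest is the point of the sketch: when calibration inputs are correlated, pushing each column's rounding error onto the columns still to be quantised tends to lower the layer-output error, while the first column is quantised identically in both schemes.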

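GPTQ collapsed the eight-GPU server needed to host an FP16 175B-class checkpoint (roughly 350 GB of weights) to a single 80-gigabyte accelerator at 3-bit precision, making open-weight serving economically tractable across 2023 and 2024. At 4 bits the weights alone come to about 88 GB, which slightly exceeds a single 80-gigabyte card.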
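The footprint arithmetic behind that claim can be checked with a back-of-envelope calculation. It counts weights only, in decimal gigabytes, and ignores quantisation scales and zero points, activations, and the KV cache, so real deployments need headroom beyond these figures. The helper name `weight_gigabytes` is hypothetical.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Weight storage in decimal GB, ignoring all per-group metadata."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 175e9
    for bits in (16, 4, 3):
        print(f"{bits:>2}-bit: {weight_gigabytes(n_params, bits):.1f} GB")
```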

For the longer treatment in narrative context, see How engines work: quantisation.


