AWQ is a post-training quantisation method that stores LLM weights at INT4 by protecting the one percent of weight channels with the largest activation magnitudes, rescaling them before a uniform quantiser. Published at MLSys 2024, it is the dominant 4-bit weight-only format on GPUs without FP8 tensor cores.

AWQ (Lin et al., MLSys 2024)

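The MLSys 2024 paper that introduced activation-aware weight quantisation: measure per-channel activation magnitude on a small calibration set, protect the salient one percent or so of weight channels by scaling them up before quantisation (the inverse scale is applied to the matching activations, so the full-precision layer output is unchanged), and store the result at uniform INT4 with no mixed-precision kernel.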
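The scaling step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the llm-awq reference code: the function names `quantize_int4` and `awq_scale_search`, the group size of 32, the mean-squared-error objective, and the synthetic data are all hypothetical choices made for the example. It follows the paper's idea of setting the per-channel scale to the activation magnitude raised to an exponent alpha and grid-searching alpha in [0, 1]; alpha = 0 is plain round-to-nearest, so the searched result can never be worse than that baseline on the calibration data.

```python
import numpy as np

def quantize_int4(w, group_size=32):
    """Asymmetric round-to-nearest INT4 with one (scale, zero) pair per group
    of `group_size` input channels. Returns (dequantised weights, integer codes)."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / 15.0   # 16 levels: codes 0..15
    zero = np.round(-w_min / scale)
    codes = np.clip(np.round(g / scale) + zero, 0, 15)
    deq = (codes - zero) * scale
    return deq.reshape(out_f, in_f), codes.reshape(out_f, in_f).astype(np.uint8)

def awq_scale_search(w, x, n_grid=20, group_size=32):
    """Grid-search the exponent alpha for per-input-channel scales s = act_mag**alpha.
    w: (out_features, in_features) weights; x: (n_samples, in_features) calibration activations.
    Returns (best_alpha, best_scales, best_mse, rtn_mse)."""
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-8)  # per-channel activation magnitude
    ref = x @ w.T                                       # full-precision layer output
    results = []
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha
        # Scale weight columns up, quantise, and divide activations by the same s.
        w_deq, _ = quantize_int4(w * s, group_size)
        err = float(np.mean(((x / s) @ w_deq.T - ref) ** 2))
        results.append((err, float(alpha), s))
    rtn_mse = results[0][0]                             # alpha = 0 means s = 1 (no scaling)
    best_mse, best_alpha, best_s = min(results, key=lambda r: r[0])
    return best_alpha, best_s, best_mse, rtn_mse

# Synthetic calibration data (an assumption): two input channels carry outlier activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 128))
x[:, :2] *= 30.0
w = rng.normal(size=(64, 128)) * 0.05

alpha, s, awq_mse, rtn_mse = awq_scale_search(w, x)
print(f"alpha={alpha:.2f}  RTN mse={rtn_mse:.3e}  scaled mse={awq_mse:.3e}")
```

Because `(x / s) @ (w * s).T` equals `x @ w.T` before quantisation, the scaling changes only where the rounding error lands. At inference the division by `s` is typically folded into the preceding operation, so the stored layer is still a plain uniform INT4 matrix.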

AWQ became the dominant 4-bit weight-only scheme on Ampere and older accelerators, where FP8 tensor cores are unavailable, and AWQ checkpoints load directly in vLLM and Hugging Face transformers.

For the longer treatment in narrative context, see How engines work: quantisation.

See also

GPTQ

References

Lin, Tang, Tang, Yang, Chen, Wang, Xiao, Dang, Gan, Han. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (MLSys 2024). The original paper. Introduces the activation-magnitude saliency criterion and the per-channel scaling trick that collapses a mixed-precision idea into a uniform INT4 kernel.
llm-awq reference implementation (MIT Han Lab). The authors' reference code for calibration, scale search, and INT4 packing; also home to the TinyChat kernels used as the original inference path.
Frantar, Ashkboos, Hoefler, Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (ICLR 2023). The prior-art baseline AWQ is measured against; layer-wise Hessian-guided reconstruction with no activation-aware saliency.
vLLM AWQ quantisation support. Production AWQ inference path in vLLM; the dominant deployment target for INT4 weight-only checkpoints on Ampere-class accelerators.
