
AWQ (activation-aware weight quantisation) is a post-training quantisation method that stores LLM weights at INT4. It protects the roughly one percent of weight channels that see the largest activation magnitudes by rescaling them before a uniform quantiser is applied. Published at MLSys 2024, it is the dominant 4-bit weight-only format on GPUs without FP8 tensor cores.

AWQ (Lin et al., MLSys 2024)

The MLSys 2024 paper that introduced activation-aware weight quantisation: rank channels by activation magnitude on a small calibration set, protect the salient one percent through per-channel pre-scaling, and store the result at uniform INT4 with no mixed-precision kernel.
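The steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the function names (`quantize_int4`, `awq_search`), the group size, the grid resolution, and the synthetic data are all hypothetical. It assumes the per-channel scale is the mean activation magnitude raised to an exponent `alpha`, with `alpha` chosen by grid search to minimise the layer's output error on the calibration batch. Salient channels therefore receive the largest scales, and `alpha = 0` reduces to plain round-to-nearest.

```python
import numpy as np

def quantize_int4(w, group_size=32):
    """Uniform asymmetric INT4 round-to-nearest, one step/zero-point per group.

    w has shape (out_features, in_features); in_features must be divisible
    by group_size. Returns the 4-bit codes and the dequantised weights.
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    step = np.maximum(w_max - w_min, 1e-8) / 15.0   # 16 levels: 0..15
    zero = np.round(-w_min / step)
    codes = np.clip(np.round(g / step) + zero, 0, 15)
    deq = (codes - zero) * step
    return codes.reshape(out_f, in_f).astype(np.uint8), deq.reshape(out_f, in_f)

def awq_search(w, x, n_grid=11, group_size=32):
    """Pick per-input-channel scales s = act_mag**alpha by grid search.

    Weights are multiplied by s before quantisation and activations are
    divided by s, so the product is unchanged in exact arithmetic.
    """
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-4)  # rank channels by activation
    ref = x @ w.T                                       # full-precision output
    best_err, best_alpha, best_s = None, None, None
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())              # keep scales centred around 1
        _, w_deq = quantize_int4(w * s, group_size)
        err = np.mean(((x / s) @ w_deq.T - ref) ** 2)
        if best_err is None or err < best_err:
            best_err, best_alpha, best_s = err, alpha, s
    return best_err, best_alpha, best_s

# Synthetic calibration batch: 2 of 128 input channels carry large activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 128))
x[:, :2] *= 20.0
w = rng.normal(size=(64, 128)) * 0.02

# Baseline: round-to-nearest INT4 with no activation-aware scaling.
_, w_rtn = quantize_int4(w)
err_rtn = np.mean((x @ w_rtn.T - x @ w.T) ** 2)

err_awq, alpha, scales = awq_search(w, x)
print(f"RTN output MSE: {err_rtn:.3e}")
print(f"AWQ output MSE: {err_awq:.3e} (alpha={alpha:.1f})")
```

Because the grid includes `alpha = 0`, the searched error can never exceed the round-to-nearest baseline on the calibration batch. In a real deployment the division by `s` is typically folded into the preceding layer so that inference still runs a single uniform INT4 kernel.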

AWQ became the dominant 4-bit weight-only scheme on Ampere and older GPUs, which lack FP8 tensor cores, and is supported in vLLM, llama.cpp, and Hugging Face transformers.

For the longer treatment in narrative context, see How engines work: quantisation.


