AWQ (Lin et al., MLSys 2024)
The MLSys 2024 paper that introduced activation-aware weight quantisation: rank weight channels by the activation magnitude observed on a small calibration set, protect the roughly one percent that are most salient through per-channel pre-scaling rather than by keeping them in higher precision, and store every weight at uniform INT4, so no mixed-precision kernel is needed.
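The pre-scaling step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the per-output-channel symmetric INT4 rounding, the 11-point grid over the exponent `alpha`, and the synthetic calibration data are all assumptions made for illustration. Each input channel of the weight matrix is multiplied by a scale derived from its mean activation magnitude, the inverse scale is applied to the activations, and the exponent that minimises output error on the calibration set is kept.

```python
import numpy as np


def quantize_int4(w):
    """Symmetric round-to-nearest INT4, one step size per output channel (column)."""
    step = np.abs(w).max(axis=0, keepdims=True) / 7.0
    step = np.where(step == 0, 1.0, step)          # avoid division by zero
    codes = np.clip(np.round(w / step), -8, 7)     # signed 4-bit range
    return codes, step


def awq_scales(x, alpha):
    """Per-input-channel scale: mean |activation| raised to alpha, then normalised."""
    mag = np.abs(x).mean(axis=0)
    s = np.clip(mag ** alpha, 1e-4, None)
    return s / np.sqrt(s.max() * s.min())          # keeps scales centred around 1


def output_error(x, w, s):
    """Mean squared error of (x / s) @ Q(w * s) against the full-precision x @ w."""
    codes, step = quantize_int4(w * s[:, None])    # scale weight rows up
    w_hat = codes * step                           # dequantised weights
    # In a real model the 1/s factor would be folded into the preceding layer.
    return float(np.mean((x @ w - (x / s) @ w_hat) ** 2))


def search_alpha(x, w, n_grid=11):
    """Grid-search alpha in [0, 1]; alpha = 0 is plain round-to-nearest."""
    errs = {float(a): output_error(x, w, awq_scales(x, a))
            for a in np.linspace(0.0, 1.0, n_grid)}
    best = min(errs, key=errs.get)
    return best, errs


# Synthetic calibration set: 64 input channels, one of them with large activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, 0] *= 20.0                                    # the "salient" channel
w = rng.normal(size=(64, 32))

best_alpha, errs = search_alpha(x, w)
print("baseline MSE:", errs[0.0], "best alpha:", best_alpha, "MSE:", errs[best_alpha])
```

Because `alpha = 0` is part of the grid, the selected scales can never do worse on the calibration set than unscaled round-to-nearest; how much they help depends on how skewed the activation magnitudes are.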
AWQ became the dominant 4-bit weight-only scheme on Ampere and older accelerators, where FP8 tensor cores are unavailable, and ships first-class in vLLM, llama.cpp, and Hugging Face Transformers.
For the longer treatment in narrative context, see How engines work: quantisation.