FP8 quantization
FP8 is the 8-bit floating-point pair standardised by NVIDIA, Arm, and Intel: E4M3 for weights and activations, E5M2 where dynamic range matters more than precision. Hopper-class tensor cores execute both natively, so an FP8 checkpoint runs without a dequantization step in the matmul path.
FP8 versus AWQ versus compressed-tensors
The three names answer different questions. FP8 is a numeric format the silicon executes directly. AWQ is an INT4 quantization algorithm that preserves the activation-salient weight channels, the tightest mainstream memory footprint. compressed-tensors is a checkpoint container produced by llm-compressor that packages INT8, FP8, and sparse layouts for vLLM to load directly.
Picking between them starts with the GPU generation: Hopper and later execute FP8 natively, while Ampere has no FP8 tensor cores and is served better by an integer format it can run on its INT8 path.
What the format is worth on real tiers
Measured on spotinference's fleet at short context, the h100x2 tier decodes 118.9 tokens per second with FP8 weights while the a100x2 tier decodes 106.6 tokens per second with a compressed-tensors checkpoint. The gap is the price of matching format to silicon across two GPU generations; mismatching them costs far more, because a format the hardware cannot execute natively dequantizes on the fly.
For the concept behind all of these formats, see How engines work: quantisation and the AWQ deep-dive.