spotinference Sign in with GitHub

FP8 quantization stores model weights, and often activations, in an 8-bit floating-point format (E4M3 or E5M2) that Hopper-class tensor cores execute natively. Against integer formats such as AWQ INT4 or compressed-tensors INT8, FP8 keeps floating-point dynamic range, halving VRAM versus FP16 with near-native accuracy and no dequantization step.

FP8 quantization

FP8 is the 8-bit floating-point pair standardised by NVIDIA, Arm, and Intel: E4M3 for weights and activations, E5M2 where dynamic range matters more than precision. Hopper-class tensor cores execute both natively, so an FP8 checkpoint runs without a dequantization step in the matmul path.

FP8 versus AWQ versus compressed-tensors

The three names answer different questions. FP8 is a numeric format the silicon executes directly. AWQ is an INT4 quantization algorithm that preserves the activation-salient weight channels, the tightest mainstream memory footprint. compressed-tensors is a checkpoint container produced by llm-compressor that packages INT8, FP8, and sparse layouts for vLLM to load directly.

Picking between them starts with the GPU generation: Hopper and later execute FP8 natively, while Ampere has no FP8 tensor cores and is served better by an integer format it can run on its INT8 path.

What the format is worth on real tiers

Measured on spotinference's fleet at short context, the h100x2 tier decodes 118.9 tokens per second with FP8 weights while the a100x2 tier decodes 106.6 tokens per second with a compressed-tensors checkpoint. The gap is the price of matching format to silicon across two GPU generations; mismatching them costs far more, because a format the hardware cannot execute natively dequantizes on the fly.

For the concept behind all of these formats, see How engines work: quantisation and the AWQ deep-dive.

Cost and reliability implications

Format choice is hardware arithmetic, and it shows up directly in dollars per token: measured on spotinference's tiers, short-context decode runs 118.9 tokens per second on h100x2 with FP8 against 106.6 tokens per second on a100x2 with compressed-tensors, because Ampere lacks FP8 tensor cores and needs an integer path. Pick the format the silicon executes natively; anything else dequantizes on the fly and quietly burns the speedup.

Part of Model behavior on the learn hub.

See also
References

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.