spotinference

GPU hardware for LLM inference

Most production LLM serving lands on three NVIDIA architectures: Hopper (H100, H200), Ampere (A100, A40), and Ada Lovelace (L40, L40S). Memory bandwidth sets the ceiling for memory-bound decode, and native FP8 support on Hopper and Ada Lovelace roughly doubles peak Tensor Core throughput relative to FP16 for models that tolerate the lower precision.

Rough single-device numbers for a memory-bound decode workload:

  • H100 SXM5: 80 GB HBM3, around 3.35 TB/s, native FP8. The default part for serving 70B-class dense models at high throughput. At FP8 the weights alone take about 70 GB, so a single device leaves little headroom for KV cache.
  • A100 SXM4: 80 GB HBM2e, around 2.0 TB/s, no native FP8. Mature and abundant, often cheaper per hour, and throughput per dollar can beat an H100 on quantised workloads.
  • L40 / L40S: 48 GB GDDR6, around 864 GB/s, PCIe form factor. Best for smaller models (≤30B at FP8) and smaller batch sizes. It's less suited to long-context dense decode, but it's cheap to leave idle.

The NVIDIA architecture whitepapers (Hopper Tensor Core, A100 datasheet) are the authoritative source for these figures. Real-world inference benchmarks tend to fall well below these peaks on memory-bound decode, which is part of why the PagedAttention work cited on the economics page matters so much.
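
As a rough sketch of why bandwidth sets the ceiling: if every generated token has to stream all of the weights from memory once, the best-case tokens per second for a single sequence is bandwidth divided by weight bytes. The function and table names below are illustrative, and the model deliberately ignores KV-cache reads, batching, and kernel overhead, so the result is an optimistic upper bound and not a benchmark.

```python
# Bandwidth-bound upper limit on single-sequence decode speed.
# Assumption: each generated token reads every weight from device memory once.
# KV-cache traffic, batching, and kernel overheads are ignored.

# Peak memory bandwidth in GB/s, taken from the rough figures listed above.
PEAK_BANDWIDTH_GBPS = {
    "H100 SXM5": 3350,
    "A100 SXM4": 2000,
    "L40S": 864,
}

# Storage cost per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[fmt]


def decode_ceiling_tokens_per_s(gpu: str, params_billion: float, fmt: str) -> float:
    """Upper bound on tokens/s for one sequence: bandwidth / bytes read per token."""
    return PEAK_BANDWIDTH_GBPS[gpu] / weight_gb(params_billion, fmt)


if __name__ == "__main__":
    # Hypothetical 13B dense model stored in FP16 (26 GB of weights).
    for gpu in PEAK_BANDWIDTH_GBPS:
        ceiling = decode_ceiling_tokens_per_s(gpu, 13, "fp16")
        print(f"{gpu}: at most {ceiling:.1f} tokens/s per sequence")
```

Batching raises aggregate throughput because one pass over the weights serves many sequences, which is why the single-sequence ceiling understates what a well-batched server can deliver.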

Multi-GPU topology

For models that don't fit on a single device, two parallelism strategies dominate. Tensor parallelism shards each layer's weights across multiple GPUs and synchronises activations on every layer. It works well over NVLink and gets painful over PCIe. Pipeline parallelism splits the model into sequential layer groups across devices. It's less bandwidth-hungry, but it introduces bubble inefficiency at low batch sizes. The glossary defines both: tensor parallelism, pipeline parallelism.
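
The tradeoff can be made concrete with two textbook estimates, sketched below under stated assumptions. The pipeline bubble uses the GPipe-style idle fraction (p - 1) / (m + p - 1) for p stages and m microbatches. The tensor-parallel estimate assumes a ring all-reduce and two all-reduces per transformer layer in the forward pass (a Megatron-style layout). Both are assumptions for illustration, not a description of any particular serving stack, and the function names are hypothetical.

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be >= 1")
    return (stages - 1) / (microbatches + stages - 1)


def ring_allreduce_bytes_per_gpu(message_bytes: float, gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce: 2 * (n - 1) / n * message size."""
    if gpus < 1:
        raise ValueError("gpus must be >= 1")
    return 2 * (gpus - 1) / gpus * message_bytes


def tp_bytes_per_token_per_gpu(
    hidden_size: int,
    layers: int,
    gpus: int,
    bytes_per_value: int = 2,       # FP16/BF16 activations (assumption)
    allreduces_per_layer: int = 2,  # Megatron-style forward pass (assumption)
) -> float:
    """Interconnect traffic per decoded token, per GPU, for tensor parallelism."""
    message = hidden_size * bytes_per_value
    return layers * allreduces_per_layer * ring_allreduce_bytes_per_gpu(message, gpus)


if __name__ == "__main__":
    # Pipeline parallelism: the bubble shrinks as more microbatches are in flight.
    for m in (1, 4, 32):
        print(f"4 stages, {m} microbatches: {pipeline_bubble_fraction(4, m):.0%} idle")

    # Tensor parallelism: illustrative shape (hidden size 8192, 80 layers) on 8 GPUs.
    traffic = tp_bytes_per_token_per_gpu(hidden_size=8192, layers=80, gpus=8)
    print(f"about {traffic / 1e6:.1f} MB sent per GPU per decoded token")
```

The per-token byte count is modest. What hurts over PCIe is that the traffic arrives as many small, latency-sensitive all-reduces on the critical path of every layer.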

Quantisation tradeoffs

Each architecture supports a different set of quantised inference formats. Hopper and Ada Lovelace have native FP8 (E4M3 and E5M2). Ampere and earlier rely on integer formats: INT8 is mature and widely supported, INT4 via GPTQ or AWQ is newer and preserves accuracy better than naive round-to-nearest 4-bit quantisation, and 4-bit GGUF covers CPU-fallback flows.
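
To illustrate why 4-bit formats need more careful methods than 8-bit ones, here is a minimal sketch of naive symmetric round-to-nearest quantisation with one scale per tensor. It is not GPTQ or AWQ, which add calibration and finer-grained scaling on top of this idea. The function names and the sample weights are made up for the example.

```python
def quantize_absmax(values, bits):
    """Symmetric round-to-nearest quantisation with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Worst-case reconstruction error after a quantise/dequantise round trip."""
    codes, scale = quantize_absmax(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


if __name__ == "__main__":
    weights = [0.02, -0.5, 0.31, 1.27, -0.88]  # made-up sample values
    for bits in (8, 4):
        print(f"INT{bits}: max abs error {max_abs_error(weights, bits):.4f}")
```

With a single scale, the largest weight dictates the step size, so small weights lose most of their resolution at 4 bits. Finer-grained scales and calibration are among the ways 4-bit methods recover that accuracy.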