Original telemetry articles built on the fleet's own measurements, followed by a short, opinionated bibliography on LLM inference serving, GPU memory management, and the production-engineering style that shapes systems in this space. Each link leads to a deep page.
Articles that state their methodology, carry the fleet's measured numbers, and mark every pending measurement as pending instead of estimating it.
- H100 vs A100: measured decode throughput. Dual H100s at 118.9 tok/s vs dual A100s at 106.6, short-context decode through the production gateway.
- Anatomy of a GPU wake: the 8-minute budget. Hibernate-restore decomposed: VM restore, weight load, compile cache, CUDA graphs, first token.
- Invoice-true cost per million tokens. Minute-level billing snapshots over per-request token counts: an honest $0.30 to $0.70 per Mtok band.
- Orca: continuous batching (Yu et al., OSDI 2022). Iteration-level scheduling and selective batching for transformer serving.
- PagedAttention (Kwon et al., SOSP 2023). Paged KV cache; 2-4x throughput by eliminating fragmentation.
- vLLM (the serving engine). Reference open-source LLM serving engine; PagedAttention plus continuous batching.
- AWQ (Lin et al., MLSys 2024). Activation-aware INT4 weight quantisation via per-channel pre-scaling.
- GPTQ (Frantar et al., ICLR 2023). Inverse-Hessian-guided post-training INT4 weight quantisation at 175B scale.
- Speculative decoding (Leviathan / Chen, 2023). Draft-and-verify decoding; 2-3x lower latency at zero quality cost.
- Lean-stack engineering. One binary, one box; auditability as the deliverable.
- NVIDIA Hopper architecture: Tensor Core whitepapers and H100 reference. Authoritative numbers for HBM bandwidth, FP8 throughput, NVLink topology.
- NVIDIA A100 datasheet and L40 product page. Comparison points cited on the hardware page.