Research

Original telemetry articles built on the fleet's own measurements, followed by a short, opinionated bibliography on LLM inference serving, GPU memory management, and the production-engineering style that shapes systems in this space. Each link leads to a deep page.

Measured telemetry

Articles that state their methodology, carry the fleet's measured numbers, and mark every pending measurement as pending instead of estimating it.

H100 vs A100: measured decode throughput. Dual H100s at 118.9 tok/s vs dual A100s at 106.6, short-context decode through the production gateway.
Anatomy of a GPU wake: the 8-minute budget. Hibernate-restore decomposed: VM restore, weight load, compile cache, CUDA graphs, first token.
Invoice-true cost per million tokens. Minute-level billing snapshots over per-request token counts: an honest $0.30 to $0.70 per Mtok band.
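
The cost-per-Mtok figure in the last entry is a ratio of billed dollars to tokens served over the same window. A minimal sketch of that arithmetic follows, assuming one billing snapshot per minute and one token count per request. The record types, field names, and sample numbers are hypothetical illustrations, not the fleet's schema or data.

```python
from dataclasses import dataclass


@dataclass
class BillingSnapshot:
    """One minute of billed spend (hypothetical record shape)."""
    minute: int        # minute index within the measurement window
    billed_usd: float  # dollars billed for that minute


@dataclass
class RequestRecord:
    """Tokens served by one request (hypothetical record shape)."""
    minute: int  # minute in which the request completed
    tokens: int  # total tokens counted for the request


def cost_per_mtok(snapshots, requests):
    """Return billed dollars per million tokens over the whole window."""
    total_usd = sum(s.billed_usd for s in snapshots)
    total_tokens = sum(r.tokens for r in requests)
    if total_tokens == 0:
        # Idle minutes are still billed, but no per-token cost is defined.
        raise ValueError("no tokens served in the window")
    return total_usd / total_tokens * 1_000_000


if __name__ == "__main__":
    # Illustrative numbers only: one hour at $0.05 per minute,
    # 100,000 tokens served in each minute.
    snapshots = [BillingSnapshot(minute=m, billed_usd=0.05) for m in range(60)]
    requests = [RequestRecord(minute=m, tokens=100_000) for m in range(60)]
    print(f"${cost_per_mtok(snapshots, requests):.2f} per Mtok")
```

Summing the billed minutes first, instead of pricing each request, means idle time is charged to the tokens that were actually served, which is what keeps the figure consistent with the invoice.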

Serving systems and the KV cache

Orca: continuous batching (Yu et al., OSDI 2022). Iteration-level scheduling and selective batching for transformer serving.
PagedAttention (Kwon et al., SOSP 2023). Paged KV cache; 2-4x throughput by eliminating fragmentation.
vLLM (the serving engine). Reference open-source LLM serving engine; PagedAttention plus continuous batching.

Post-training quantisation

AWQ (Lin et al., MLSys 2024). Activation-aware INT4 weight quantisation via per-channel pre-scaling.
GPTQ (Frantar et al., ICLR 2023). Inverse-Hessian-guided post-training INT4 weight quantisation at 175B scale.

Decoding strategies

Speculative decoding (Leviathan / Chen, 2023). Draft-and-verify decoding; 2-3x lower latency at zero quality cost.

Engineering practice

Lean-stack engineering. One binary, one box; auditability as the deliverable.

Hardware datasheets

NVIDIA Hopper architecture: Tensor Core whitepapers and H100 reference. Authoritative numbers for HBM bandwidth, FP8 throughput, NVLink topology.
NVIDIA A100 datasheet and L40 product page. Comparison points cited on the hardware page.