Performance and latency

LLM serving speed is set by memory and scheduling, not raw arithmetic: the KV cache decides how many requests fit on a GPU, and iteration-level batching decides how busy the GPU stays. These pages walk the engine mechanics, the hardware, and the throughput numbers measured on this fleet.

LLM serving speed is set by memory and scheduling, not raw arithmetic. The first forward pass over a prompt is compute-bound and decides time to first token; every step after it is memory-bandwidth-bound and decides the steady decode rate. The two regimes have different bottlenecks, so the tuning that helps one is largely orthogonal to the other.

Aggregate throughput is a capacity question: how many sequences fit on the GPU at once. That ceiling is set by the key-value cache, which grows with context length, so the techniques that matter most are the ones that pack the cache and schedule the batch densely. The pages below walk the engine mechanics, the hardware that hosts them, and the numbers measured on this fleet.

This pillar collects every page on the topic. Each one below opens with the answer; follow a link for the full treatment, or use the rail to cross into a neighbouring pillar.

How LLM inference engines work

One long argument from TTFT through the KV cache to vLLM. Read more.

GPU hardware for LLM inference

Hopper, Ampere, and Ada Lovelace: bandwidth, FP8, and parallelism. Read more.

Continuous batching

Iteration-level scheduling where requests join and leave every decode step. Read more.

KV cache

Per-sequence attention tensors retained between decode steps. Read more.

PagedAttention

Page-based KV cache allocator modeled on OS virtual memory. Read more.

Pipeline parallelism (PP)

Layer-group stages assigned one per GPU, micro-batches streaming forward. Read more.

Tensor parallelism (TP)

Intra-layer sharding of weight matrices across N GPUs. Read more.

TTFT, time to first token

End-to-end latency from request submit to first streamed response byte. Read more.

Prefill vs decode

The compute-bound prompt pass and the memory-bound token loop every request splits into. Read more.

Orca: continuous batching (Yu et al., OSDI 2022)

Iteration-level scheduling and selective batching for transformer serving. Read more.

PagedAttention (Kwon et al., SOSP 2023)

Paged KV cache; 2-4x throughput by eliminating fragmentation. Read more.

vLLM (the serving engine)

Reference open-source LLM serving engine; PagedAttention plus continuous batching. Read more.

H100 vs A100: measured decode throughput

Dual H100s at 118.9 tok/s vs dual A100s at 106.6, short-context decode through the production gateway. Read more.

How the KV cache sets LLM API latency and cost

Prefill against decode, the memory arithmetic that caps concurrency, and what an API caller can do about both. Read more.

Every page in this pillar describes the system running behind one endpoint. Point an OpenAI SDK at spotinference when ready: swap the base URL and the key, and the first request answers from the same fleet these pages measure.