The KV cache is the per-sequence store of key and value attention tensors that an LLM retains between decode steps, so each new token costs one forward pass instead of recomputing attention over the whole prior context. Its GPU memory footprint, not arithmetic, usually caps how many requests a server can run concurrently.

KV cache

The KV cache is the per-sequence store of Key and Value attention tensors retained between decode steps so each new token costs one forward pass rather than a recomputation over the entire prior context.

Cache memory, not arithmetic, sets the ceiling on how many sequences fit on a GPU concurrently, which makes KV occupancy the binding constraint on serving throughput.

For the longer treatment, see How engines work: the KV cache.

Cost and reliability implications

Cache memory is the budget line that decides concurrency: a GPU whose VRAM is exhausted by resident sequences rejects or queues new work long before its arithmetic saturates. Quantising weights frees VRAM for cache and lifts throughput per dollar, while an overcommitted cache shows up operationally as preemption, recomputation, and sudden tail-latency spikes.

Part of Performance and latency on the learn hub.

See also

References

Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. Reports 20-40% KV cache utilisation under contiguous allocation and the path to near-100% with block paging.
Ainslie et al. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023. Introduces grouped-query attention, the n_kv_heads shrink that bounds the per-token formula in modern Transformers.

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.