spotinference Sign in with GitHub

The KV cache is the per-sequence store of key and value attention tensors that an LLM retains between decode steps, so each new token costs one forward pass instead of recomputing attention over the whole prior context. Its GPU memory footprint, not arithmetic, usually caps how many requests a server can run concurrently.

KV cache

The KV cache is the per-sequence store of Key and Value attention tensors retained between decode steps so each new token costs one forward pass rather than a recomputation over the entire prior context.

Cache memory, not arithmetic, sets the ceiling on how many sequences fit on a GPU concurrently, which makes KV occupancy the binding constraint on serving throughput.

For the longer treatment, see How engines work: the KV cache.

Cost and reliability implications

Cache memory is the budget line that decides concurrency: a GPU whose VRAM is exhausted by resident sequences rejects or queues new work long before its arithmetic saturates. Quantising weights frees VRAM for cache and lifts throughput per dollar, while an overcommitted cache shows up operationally as preemption, recomputation, and sudden tail-latency spikes.

Part of Performance and latency on the learn hub.

See also
References

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.