PagedAttention

PagedAttention is a page-based allocator for the transformer key-value cache, modeled on the demand-paged virtual memory of an operating system: logical token positions map through a per-sequence block table to non-contiguous physical pages of GPU memory.
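The mapping can be sketched in a few lines of Python. This is an illustrative model, not the vLLM implementation: the names PageAllocator, Sequence, append_token and locate are hypothetical, and the page size of 16 tokens is an assumed value. It shows pages being allocated on demand and a logical position being translated to a physical page and offset.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 16  # tokens per page; an assumed value for illustration


class PageAllocator:
    """Hands out physical page ids from a free list (hypothetical helper)."""

    def __init__(self, num_pages: int) -> None:
        # Reversed so that pop() yields page 0 first, then 1, 2, ...
        self.free = list(range(num_pages - 1, -1, -1))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("no free KV pages")
        return self.free.pop()

    def release(self, page: int) -> None:
        self.free.append(page)


@dataclass
class Sequence:
    allocator: PageAllocator
    # Index = logical block number, value = physical page id.
    block_table: list[int] = field(default_factory=list)
    length: int = 0  # tokens stored so far

    def append_token(self) -> tuple[int, int]:
        """Reserve a slot for one more token, allocating a page only when needed."""
        if self.length % PAGE_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        position = self.length
        self.length += 1
        return self.locate(position)

    def locate(self, position: int) -> tuple[int, int]:
        """Translate a logical token position to (physical page, offset in page)."""
        return self.block_table[position // PAGE_SIZE], position % PAGE_SIZE
```

When two sequences grow in an interleaved order, their block tables end up pointing at non-adjacent physical pages, while each sequence still sees a contiguous range of logical positions.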

Pages eliminate the external fragmentation of contiguous KV buffers and confine internal fragmentation to the last, partially filled page of each sequence, lifting cache utilisation from roughly 20-40 percent toward 100 percent and unlocking the concurrency that continuous batching depends on.
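A rough worked example makes the utilisation gap concrete. The sequence lengths, the 1024-token reservation and the 16-token page size below are made-up values, and the contiguous model only counts over-reservation inside each buffer; it ignores external fragmentation between buffers, so it is a simplification rather than a measurement.

```python
import math


def contiguous_utilisation(lengths: list[int], max_len: int) -> float:
    """Fraction of reserved slots holding tokens when each sequence reserves max_len."""
    return sum(lengths) / (len(lengths) * max_len)


def paged_utilisation(lengths: list[int], page_size: int) -> float:
    """Fraction of allocated slots holding tokens when memory is granted page by page."""
    pages = sum(math.ceil(n / page_size) for n in lengths)
    return sum(lengths) / (pages * page_size)


lengths = [120, 300, 75, 512, 40]  # hypothetical mixed-length batch

print(f"contiguous: {contiguous_utilisation(lengths, max_len=1024):.2f}")
print(f"paged:      {paged_utilisation(lengths, page_size=16):.2f}")
```

With these assumed numbers the contiguous layout uses about 20 percent of what it reserves, while the paged layout uses about 98 percent; the only remaining waste is the unfilled tail of each sequence's last page.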

For the longer treatment, see How engines work: PagedAttention.

Cost and reliability implications

Fragmentation is wasted rent: every unused byte inside an over-provisioned contiguous KV buffer is VRAM paid for but never serving tokens. Page-level allocation converts that waste into admitted requests; the original paper reports a 2-4x throughput gain, which translates directly into more tokens served per GPU-hour on mixed-length traffic. The trade is allocator complexity inside the engine, where a block-table bug surfaces as cross-sequence corruption rather than a clean crash.
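Because that failure mode is silent, a debug-time consistency check is one way to catch it early. The sketch below is a hypothetical helper, not part of any engine's API: it assumes block tables are available as a mapping from sequence id to physical page ids, and that the allocator keeps a per-page reference count (as copy-on-write sharing requires).

```python
from collections import Counter


def check_block_tables(
    block_tables: dict[str, list[int]],
    refcounts: dict[int, int],
) -> list[str]:
    """Return a list of invariant violations; an empty list means consistent."""
    # Count how many block-table entries point at each physical page.
    seen = Counter(page for table in block_tables.values() for page in table)
    problems = []
    for page, uses in sorted(seen.items()):
        expected = refcounts.get(page, 0)
        if uses != expected:
            # More references than the refcount means unintended sharing.
            problems.append(f"page {page}: referenced {uses}x, refcount {expected}")
    for page, count in sorted(refcounts.items()):
        if count > 0 and page not in seen:
            # A positive refcount with no referencing table is a leaked page.
            problems.append(f"page {page}: refcount {count} but unreferenced")
    return problems
```

A page that two sequences reference while its count says one is exactly the aliasing that would let one request overwrite another's keys and values, so a check like this turns silent corruption into an explicit error in test builds.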

References

Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). Original paper; introduces the block table and copy-on-write sharing, and reports the 2-4x throughput result.
vLLM project repository. Canonical open-source implementation.