PagedAttention is the memory-management algorithm, introduced at SOSP 2023, that maps each sequence's KV cache through a block table onto fixed-size, non-contiguous pages of GPU memory, much as an operating system pages virtual memory. By nearly eliminating fragmentation, it raised KV-cache utilisation from roughly 20-40 percent to near 100 percent and delivered two to four times the serving throughput.

PagedAttention (Kwon et al., SOSP 2023)

A SOSP 2023 paper that reframed the transformer key-value cache as an operating-system-style paged address space, reporting two to four times higher sustained throughput on the same hardware by eliminating the fragmentation losses of contiguous KV allocation.
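The paging idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not vLLM's implementation: the names `BlockAllocator`, `Sequence`, `locate` and `wasted_slots` are hypothetical, and the block size of 16 tokens is chosen only for illustration. It shows a per-sequence block table translating logical token positions into (physical block, offset) pairs, with waste confined to the tail of the last block.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative assumption)


class BlockAllocator:
    """Hands out physical block ids from a free list (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        # Reversed so that pop() yields block ids 0, 1, 2, ...
        self.free = list(range(num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    allocator: BlockAllocator
    num_tokens: int = 0
    # Block table: index is the logical block, value is the physical block id.
    block_table: list[int] = field(default_factory=list)

    def append_token(self) -> None:
        # A new physical block is allocated only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, pos: int) -> tuple[int, int]:
        """Translate a logical token position to (physical block, offset)."""
        if not 0 <= pos < self.num_tokens:
            raise IndexError(pos)
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def wasted_slots(self) -> int:
        # Only the tail of the last block can be unused.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


alloc = BlockAllocator(num_blocks=8)
a, b = Sequence(alloc), Sequence(alloc)
# Interleave growth so each sequence ends up with non-contiguous blocks.
for _ in range(20):
    a.append_token()
    b.append_token()

print(a.block_table, b.block_table)  # [0, 2] [1, 3]
print(a.locate(17))                  # (2, 1)
print(a.wasted_slots())              # 12
```

In this sketch a 20-token sequence wastes at most 12 slots, whereas a contiguous allocator that reserved room for, say, 2048 tokens up front would leave 2028 slots idle for the same sequence.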

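The block-table indirection lets an engine allocate and free KV blocks one at a time as sequences join and leave the batch, which is what unlocks aggressive rebalancing under continuous batching. PagedAttention and iteration-level scheduling therefore compose naturally, and the pairing is now table stakes for production serving engines.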
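A toy scheduler makes the interaction concrete. This is a minimal sketch under assumptions, not any engine's real scheduling policy: `run` and `blocks_needed` are hypothetical names, the pool tracks only a count of free blocks, and when the pool runs dry the sketch simply raises where a real engine would preempt or swap. It shows a waiting request joining the batch as soon as a finished sequence returns its blocks.

```python
from collections import deque

BLOCK_SIZE = 16  # tokens per KV block (illustrative assumption)


def blocks_needed(num_tokens: int) -> int:
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division


def run(requests: list[tuple[str, int, int]], num_blocks: int) -> list[tuple[int, str]]:
    """Toy iteration-level scheduler over a shared pool of KV blocks.

    requests: (name, prompt_tokens, tokens_to_generate) in arrival order.
    Returns (step, name) pairs in the order the sequences finish.
    """
    free = num_blocks
    waiting = deque(requests)
    running: dict[str, list[int]] = {}  # name -> [tokens held, tokens left]
    finished: list[tuple[int, str]] = []
    step = 0
    while waiting or running:
        # Admit queued requests whenever the pool can hold their prompt.
        while waiting and blocks_needed(waiting[0][1]) <= free:
            name, prompt, gen = waiting.popleft()
            free -= blocks_needed(prompt)
            running[name] = [prompt, gen]
        if not running:
            raise MemoryError("head-of-queue prompt does not fit in the pool")
        step += 1
        # One decode iteration: every running sequence emits one token.
        for name in list(running):
            tokens, left = running[name]
            grow = blocks_needed(tokens + 1) - blocks_needed(tokens)
            if grow > free:
                raise MemoryError("pool exhausted; a real engine would preempt here")
            free -= grow
            tokens, left = tokens + 1, left - 1
            if left == 0:
                free += blocks_needed(tokens)  # return every block at once
                del running[name]
                finished.append((step, name))
            else:
                running[name] = [tokens, left]
    return finished


order = run([("a", 30, 2), ("b", 20, 4), ("c", 20, 1)], num_blocks=4)
print(order)  # [(2, 'a'), (3, 'c'), (4, 'b')]
```

Request `c` cannot fit at first, but it is admitted at step 3, right after `a` releases its two blocks, and it finishes while `b` is still decoding.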

For the longer treatment in narrative context, see How engines work: PagedAttention.



References

Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. The original paper. Introduces the block table and copy-on-write sharing, and reports KV utilisation rising from 20-40% to near 100% and 2-4x throughput on OPT and LLaMA models at unchanged latency.
vLLM project repository. Reference open-source implementation that shipped alongside the paper and became the canonical serving engine.
