
PagedAttention is the memory-management algorithm, introduced at SOSP 2023, that maps each sequence's KV cache through a block table onto fixed-size, non-contiguous pages of GPU memory, much as an operating system pages virtual memory. By largely eliminating fragmentation, it lifted KV-cache utilisation from roughly 20 to 40 percent to nearly 100 percent and raised serving throughput two to four times.

PagedAttention (Kwon et al., SOSP 2023)

A SOSP 2023 paper that reframed the transformer key-value cache as an operating-system-style paged address space, reporting two-to-four times higher sustained throughput on a fixed GPU by eliminating fragmentation losses from contiguous KV allocation.
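The block-table idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `BlockAllocator`, `Sequence`, and the block size of 16 tokens are hypothetical names and values chosen for the example, and no real KV tensors are stored.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


@dataclass
class BlockAllocator:
    """Hypothetical free-list allocator over a fixed pool of physical blocks."""
    num_blocks: int
    free: list[int] = field(init=False)

    def __post_init__(self) -> None:
        # Stored in reverse so that pop() hands out block 0 first.
        self.free = list(range(self.num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()


@dataclass
class Sequence:
    """Hypothetical per-sequence state: a token count and a block table."""
    allocator: BlockAllocator
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical block, offset)."""
        position = self.num_tokens
        if position % BLOCK_SIZE == 0:
            # The last block is full (or none exists yet): take any free block.
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
        return self.lookup(position)

    def lookup(self, position: int) -> tuple[int, int]:
        """Translate a logical token position into (physical block, offset)."""
        if not 0 <= position < self.num_tokens:
            raise IndexError(position)
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE
```

Interleaving two sequences on one allocator shows the non-contiguity: if sequence `a` fills its first block, sequence `b` then takes the next free block, and `a` grows by one more token, `a` ends up with a block table that skips over `b`'s block. In this sketch, the only unused slots are those in each sequence's last, partially filled block.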

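Because the block table lets blocks be allocated and freed independently of one another, the scheduler can admit, preempt, and evict sequences at every iteration under continuous batching. PagedAttention and iteration-level scheduling are therefore complementary, and together they are now standard in most production serving engines.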
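One way to picture how block-level allocation interacts with iteration-level scheduling is the sketch below. It is an illustration under assumptions rather than any engine's actual scheduler: `Pool`, `Seq`, `preempt`, and `step` are hypothetical names, the policy (evict the most recently admitted sequence and recompute its cache later) is just one possible choice, and the block size is arbitrary.

```python
from collections import deque

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


class Pool:
    """Hypothetical pool of physical KV blocks tracked by a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free: deque[int] = deque(range(num_blocks))


class Seq:
    """Hypothetical sequence state: a name, a token count, and a block table."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.num_tokens = 0
        self.block_table: list[int] = []


def preempt(seq: Seq, waiting: deque, pool: Pool) -> None:
    """Return all of a sequence's blocks to the pool and requeue it."""
    pool.free.extend(seq.block_table)
    seq.block_table.clear()
    seq.num_tokens = 0  # assumed policy: the cache is recomputed on readmission
    waiting.appendleft(seq)


def step(running: list, waiting: deque, pool: Pool) -> None:
    """One decode iteration: each running sequence gains one token."""
    i = 0
    while i < len(running):
        seq = running[i]
        if seq.num_tokens % BLOCK_SIZE == 0:  # last block is full, need another
            # Evict the newest sequences until a block is free.
            while not pool.free and running[-1] is not seq:
                preempt(running.pop(), waiting, pool)
            if not pool.free:
                # seq is itself the newest sequence, so it is the one evicted.
                preempt(running.pop(), waiting, pool)
                continue
            seq.block_table.append(pool.free.popleft())
        seq.num_tokens += 1
        i += 1
```

With a two-block pool and two running sequences, both decode freely for 16 iterations. On the 17th, the older sequence needs a second block, the newer one is preempted, and its freed block is reused immediately. No copying or compaction is needed, because the block table is the only thing that changes.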

For the longer treatment in narrative context, see How engines work: PagedAttention.


