

PagedAttention

PagedAttention is a page-based allocator for the transformer key-value (KV) cache, modeled on the demand-paged virtual memory of an operating system: logical token positions map through a per-sequence block table to fixed-size physical pages of GPU memory, and those pages need not be contiguous.
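The mapping can be sketched as a toy allocator. This is a minimal illustration, not any engine's actual implementation: the class name `PagedKVAllocator`, its methods, and the page size of 16 tokens are all assumptions made for the example, and it tracks only page ids rather than real GPU tensors.

```python
from __future__ import annotations

from dataclasses import dataclass, field

PAGE_SIZE = 16  # tokens per page; an assumed value for illustration


@dataclass
class PagedKVAllocator:
    """Toy allocator (hypothetical): hands out physical page ids from a free list."""

    num_pages: int
    free_pages: list[int] = field(init=False)
    block_tables: dict[str, list[int]] = field(default_factory=dict)
    seq_lens: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Pages are popped from the end, so the highest ids are handed out first.
        self.free_pages = list(range(self.num_pages))

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for the next token; return (physical_page, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        pos = self.seq_lens.get(seq_id, 0)
        if pos % PAGE_SIZE == 0:  # no page yet, or the last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.seq_lens[seq_id] = pos + 1
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE

    def lookup(self, seq_id: str, pos: int) -> tuple[int, int]:
        """Translate a logical token position to (physical_page, offset)."""
        if not 0 <= pos < self.seq_lens.get(seq_id, 0):
            raise IndexError(pos)
        return self.block_tables[seq_id][pos // PAGE_SIZE], pos % PAGE_SIZE

    def free(self, seq_id: str) -> None:
        """Return all pages of a finished sequence to the free list."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_pages=4)
for _ in range(PAGE_SIZE):
    alloc.append_token("a")   # fills sequence a's first page
alloc.append_token("b")       # sequence b takes the next free page
alloc.append_token("a")       # a's second page is not adjacent to its first
```

Because the two sequences grow in interleaved order, the block table for `a` ends up pointing at pages that are not adjacent, yet `lookup` still resolves every logical position with one division and one table read.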

Paging removes the external fragmentation of contiguous KV buffers and confines internal fragmentation to the last, partially filled page of each sequence. That lifts cache utilisation from roughly 20-40 percent toward 100 percent and frees the memory for the larger concurrent batches that continuous batching depends on.
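A small worked calculation shows where the utilisation gap comes from. The sequence lengths, the 2048-token reservation, and the 16-token page size below are made-up numbers chosen for illustration, and the model covers only the waste from reserving each request up to a maximum length; it ignores external fragmentation between buffers.

```python
import math


def contiguous_utilisation(lengths: list[int], max_len: int) -> float:
    """Each request reserves max_len token slots up front."""
    return sum(lengths) / (len(lengths) * max_len)


def paged_utilisation(lengths: list[int], page_size: int) -> float:
    """Each request holds only the pages it has touched; waste is confined
    to the unused tail of its last page."""
    reserved = sum(math.ceil(n / page_size) * page_size for n in lengths)
    return sum(lengths) / reserved


lengths = [120, 300, 700, 900]  # hypothetical tokens actually cached per request

contiguous = contiguous_utilisation(lengths, max_len=2048)  # 2020 / 8192, about 0.25
paged = paged_utilisation(lengths, page_size=16)            # 2020 / 2048, about 0.99
print(f"contiguous: {contiguous:.2f}, paged: {paged:.2f}")
```

Under these assumptions the paged layout wastes at most 15 token slots per sequence, whereas the contiguous layout wastes everything between the actual length and the reserved maximum.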

For the longer treatment, see How engines work: PagedAttention.

Cost and reliability implications

Fragmentation is wasted rent: every unused byte inside an over-provisioned contiguous KV buffer is VRAM that is paid for but never serves tokens. Page-level allocation converts that waste into admitted requests, raising throughput per GPU-hour severalfold on mixed-length traffic. The trade-off is allocator complexity inside the engine, where a block-table bug surfaces as cross-sequence corruption rather than a clean crash.


