PagedAttention
PagedAttention is a page-based allocator for the transformer key-value cache, modeled on the demand-paged virtual memory of an operating system: logical token positions map through a per-sequence block table to non-contiguous physical pages of GPU memory.
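The block-table mapping can be sketched in a few lines of Python. This is a minimal illustration, not any engine's actual implementation: the names BlockAllocator, Sequence, append_token, and lookup are hypothetical, and the page size of 16 tokens is an assumed value.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per page; assumed value for illustration only


class BlockAllocator:
    """Hypothetical free-list allocator that hands out physical page ids."""

    def __init__(self, num_pages: int) -> None:
        # Reversed so that pop() returns page 0 first, then 1, 2, ...
        self.free = list(range(num_pages - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV pages")
        return self.free.pop()

    def release(self, page: int) -> None:
        self.free.append(page)


@dataclass
class Sequence:
    """Per-sequence state: block_table[i] is the physical page of logical block i."""

    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> tuple[int, int]:
        # Demand paging: a new page is claimed only when every page is full.
        if self.num_tokens == len(self.block_table) * BLOCK_SIZE:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1
        return self.lookup(self.num_tokens - 1)

    def lookup(self, position: int) -> tuple[int, int]:
        # Translate a logical token position to (physical page, offset in page).
        if not 0 <= position < self.num_tokens:
            raise IndexError(position)
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE


# Two sequences growing in lockstep end up with interleaved physical pages.
allocator = BlockAllocator(num_pages=8)
seq_a, seq_b = Sequence(), Sequence()
for _ in range(BLOCK_SIZE + 1):
    seq_a.append_token(allocator)
    seq_b.append_token(allocator)
```

After the loop, each sequence owns two pages that are not adjacent in physical memory, yet lookup still resolves any logical position with one division and one table read.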
Paging eliminates the external fragmentation of contiguous KV buffers and confines internal fragmentation to the last, partially filled page of each sequence, lifting cache utilisation from roughly 20-40 percent toward 100 percent and unlocking the concurrency that continuous batching depends on.
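The utilisation gap can be illustrated with a little arithmetic. The sketch below compares a contiguous scheme that reserves a maximum-length buffer per sequence against a paged scheme that reserves whole pages on demand. The sequence lengths, the 2048-token maximum, and the 16-token page size are assumed values chosen for illustration, not measurements, and the model covers reservation waste only.

```python
import math


def contiguous_utilisation(lengths: list[int], max_len: int) -> float:
    """Fraction of reserved slots holding real tokens when every
    sequence reserves max_len slots up front."""
    return sum(lengths) / (len(lengths) * max_len)


def paged_utilisation(lengths: list[int], block_size: int) -> float:
    """Fraction of reserved slots holding real tokens when each sequence
    reserves only the pages it has touched; waste is limited to the
    unfilled tail of its last page."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return sum(lengths) / reserved


lengths = [200, 700, 1500]  # assumed current sequence lengths, in tokens
print(f"contiguous: {contiguous_utilisation(lengths, max_len=2048):.2f}")
print(f"paged:      {paged_utilisation(lengths, block_size=16):.2f}")
```

With these assumed inputs the contiguous scheme fills about 39 percent of what it reserves, while the paged scheme fills about 99 percent, because each sequence wastes at most one partially filled page.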
For the longer treatment, see How engines work: PagedAttention.