PagedAttention (Kwon et al., SOSP 2023)
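A SOSP 2023 paper that recast the transformer key-value (KV) cache as an operating-system-style paged address space, storing it in fixed-size blocks that need not be contiguous in GPU memory. The authors report two to four times higher throughput at comparable latency than prior systems such as FasterTransformer and Orca on the same hardware, largely by removing the fragmentation that contiguous KV allocation causes.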
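Because a per-sequence block table maps logical KV blocks to physical ones, the scheduler can admit, preempt, and evict sequences at block granularity under continuous batching. PagedAttention and iteration-level scheduling therefore compose naturally, and the pairing is now standard in most production serving engines.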
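
To make the block-table indirection concrete, here is a minimal Python sketch of the bookkeeping only, not the attention kernel and not vLLM's actual code. The names BlockPool and Sequence, the method names, and the block size of 16 tokens are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; an illustrative value, not a requirement


class BlockPool:
    """Hypothetical free-list allocator over fixed-size physical KV blocks."""

    def __init__(self, num_blocks: int) -> None:
        # Reversed so that pop() hands out block ids 0, 1, 2, ... in order.
        self.free = list(range(num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            # A real scheduler would preempt or swap a sequence here.
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids: list[int]) -> None:
        self.free.extend(block_ids)


@dataclass
class Sequence:
    """One request's logical KV cache, mapped to physical blocks on demand."""

    pool: BlockPool
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)  # logical -> physical

    def append_token(self) -> None:
        # A new physical block is claimed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx: int) -> tuple[int, int]:
        # Translate a logical token position to (physical block, offset).
        if not 0 <= token_idx < self.num_tokens:
            raise IndexError(token_idx)
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def free(self) -> None:
        # Blocks go straight back to the pool for any other sequence to reuse.
        self.pool.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


pool = BlockPool(num_blocks=8)
a, b = Sequence(pool), Sequence(pool)
for _ in range(20):          # two sequences decoding in lockstep
    a.append_token()
    b.append_token()

print(a.block_table)         # [0, 2]: physical blocks are interleaved with b's
print(b.block_table)         # [1, 3]
print(a.physical_slot(17))   # (2, 1)
a.free()                     # a finishes; its blocks are reusable immediately
```

In this sketch no sequence reserves space for its maximum possible length up front, so unused capacity is confined to the tail of each sequence's last block (12 slots for a 20-token sequence here). When one sequence finishes, its blocks can be handed to another without moving any data, which is the property a continuous-batching scheduler relies on.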
For the longer treatment in narrative context, see How engines work: PagedAttention.