spotinference Sign in with GitHub

vLLM is the open-source LLM serving engine that shipped with the PagedAttention paper and productionised paged KV management together with Orca-style continuous batching in one scheduler. Apache 2.0 licensed and community governed, it is the most-deployed open-source implementation of the OpenAI-compatible chat completions API.

vLLM (UC Berkeley Sky Lab, 2023)

The open-source serving engine that shipped alongside the PagedAttention SOSP 2023 paper, productionised Orca-style continuous batching in the same scheduler, and became the default benchmark target plus the most-deployed open-source OpenAI-compatible chat-completions server in 2026.

Chunked prefill, multi-backend coverage (CUDA, ROCm, TPU), and a rich quantisation matrix (FP8, AWQ, GPTQ, compressed-tensors) make vLLM the reference implementation against which new engines are measured.

For the longer treatment in narrative context, see How engines work: vLLM as the reference implementation.

Part of Performance and latency on the learn hub.

See also
References

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.