vLLM is the open-source LLM serving engine that shipped with the PagedAttention paper and productionised paged KV-cache management together with Orca-style continuous batching in one scheduler. Apache 2.0-licensed and community-governed, it is the most-deployed open-source implementation of the OpenAI-compatible chat completions API.

vLLM (UC Berkeley Sky Computing Lab, 2023)

The open-source serving engine released alongside the PagedAttention paper (SOSP 2023). It productionised Orca-style continuous batching in the same scheduler as paged KV-cache management and, as of 2026, is both the default benchmark target and the most-deployed open-source OpenAI-compatible chat-completions server.
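How the two ideas fit into one scheduler can be sketched with a toy model. This is an illustration only, not vLLM's code: `BlockAllocator`, `Request`, `run`, the block size of 4 tokens, and the `max_running` limit are all hypothetical names and values chosen for the example. The KV cache is treated as a pool of fixed-size blocks handed out on demand (the paging idea), and requests join or leave the running batch at every iteration rather than waiting for the whole batch to drain (the continuous-batching idea).

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (toy value, not a vLLM default)


class BlockAllocator:
    """Fixed pool of KV-cache blocks, handed out one at a time."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            # A real engine would preempt or swap; the toy just gives up.
            raise MemoryError("toy sketch has no preemption")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)


@dataclass
class Request:
    name: str
    prompt_len: int
    max_new_tokens: int
    generated: int = 0
    block_table: list = field(default_factory=list)  # logical -> physical blocks

    @property
    def num_tokens(self):
        return self.prompt_len + self.generated


def blocks_needed(num_tokens):
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division


def run(requests, num_blocks, max_running=2):
    """Simulate iteration-level scheduling; return (finish steps, free blocks)."""
    allocator = BlockAllocator(num_blocks)
    waiting, running, finished_at = deque(requests), [], {}
    step = 0
    while waiting or running:
        step += 1
        # Admission happens every iteration, not once per batch.
        while waiting and len(running) < max_running:
            need = blocks_needed(waiting[0].prompt_len)
            if need > len(allocator.free):
                break
            req = waiting.popleft()
            req.block_table = [allocator.alloc() for _ in range(need)]
            running.append(req)
        if not running:
            raise MemoryError("head-of-line request does not fit in the pool")
        # Each running request produces one token this iteration.
        for req in list(running):
            req.generated += 1
            # Grow the block table only when the sequence crosses a block edge.
            if blocks_needed(req.num_tokens) > len(req.block_table):
                req.block_table.append(allocator.alloc())
            if req.generated == req.max_new_tokens:
                allocator.release(req.block_table)
                running.remove(req)
                finished_at[req.name] = step
    return finished_at, len(allocator.free)


if __name__ == "__main__":
    reqs = [Request("A", 5, 3), Request("B", 2, 6), Request("C", 3, 2)]
    print(run(reqs, num_blocks=8, max_running=2))
```

In this run, request C is admitted as soon as A finishes, while B is still decoding. Under static batching it would have had to wait for the slowest member of the first batch.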

Chunked prefill, multi-backend coverage (CUDA, ROCm, TPU), and a rich quantisation matrix (FP8, AWQ, GPTQ, compressed-tensors) make vLLM the reference implementation against which new engines are measured.
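The chunked-prefill idea can be illustrated with a small, self-contained sketch. It assumes a simplified policy: each iteration has a fixed token budget, requests already decoding get one token each first, and whatever budget remains is spent on a slice of a pending prompt. The names `schedule_step`, `simulate`, and `token_budget` are hypothetical and do not mirror vLLM's internal scheduler or its flags.

```python
def schedule_step(decoding, prefilling, token_budget):
    """Plan one iteration: return {request_name: tokens_processed_this_step}.

    decoding   -- names of requests in the decode phase (1 token each)
    prefilling -- dict of name -> prompt tokens still to prefill (mutated)
    """
    plan = {}
    budget = token_budget
    # Decodes go first so that a long prompt cannot stall ongoing generation.
    for name in decoding:
        if budget == 0:
            break
        plan[name] = 1
        budget -= 1
    # Leftover budget is spent on prompt chunks, splitting prompts that do not fit.
    for name, remaining in list(prefilling.items()):
        if budget == 0:
            break
        chunk = min(remaining, budget)
        plan[name] = chunk
        prefilling[name] = remaining - chunk
        budget -= chunk
    return plan


def simulate(decoding, prefilling, token_budget):
    """Run iterations until every pending prompt has been fully prefilled."""
    decoding = list(decoding)
    prefilling = dict(prefilling)
    plans = []
    while prefilling:
        plans.append(schedule_step(decoding, prefilling, token_budget))
        # A request whose prompt is fully processed moves to the decode phase.
        for name in [n for n, left in prefilling.items() if left == 0]:
            del prefilling[name]
            decoding.append(name)
    return plans


if __name__ == "__main__":
    for i, plan in enumerate(simulate(["d1", "d2"], {"long": 10}, token_budget=6), 1):
        print(i, plan)
```

Here a 10-token prompt is spread over three iterations while the two decoding requests keep receiving one token per step, which is the latency benefit chunking is meant to provide.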

For the longer treatment in narrative context, see "How engines work: vLLM as the reference implementation".

Part of Performance and latency on the learn hub.

References

vLLM project repository (vllm-project/vllm). Source of record. Apache 2.0 licence; governance moved from the original UC Berkeley group to the community-run vllm-project GitHub organisation through 2024.
vLLM documentation. Canonical reference for supported models, quantisation backends, scheduler flags (including chunked prefill), and hardware targets across CUDA, ROCm, and TPU.
Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). The paper that introduced PagedAttention and shipped alongside the first public vLLM release.
Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022). The continuous-batching design that vLLM's scheduler implements and then extends with chunked prefill.
