spotinference

Sign in with GitHub
How LLM inference engines work

Modern LLM inference engines turn a stream of incoming chat requests into a stream of generated tokens against a fixed GPU budget. The interesting design choices are about memory and scheduling, not raw compute.

This page is one long argument, not a glossary. It starts at the metric a reader actually perceives (time to the first visible token) and the data structure that decides how many concurrent users a GPU can hold (the key-value cache). From there it walks the two storage and scheduling techniques that lifted open-source serving throughput by an order of magnitude in 2022 and 2023: PagedAttention and continuous batching.

The second half covers the quantisation work that decides whether a model fits on the available hardware at all (GPTQ and AWQ), the parallelism strategies invoked when one GPU is not enough, the decoding trick that amortises memory bandwidth across multiple tokens per forward pass, and the reference implementation that ties the bundle together (vLLM). The aim is a single coherent model of how an inference engine actually spends its cycles.

What an engine is measured on

Three numbers describe most of the user-visible behaviour of an inference engine. Time to first token (TTFT) is the elapsed wall-clock time between a client submitting a request and that client receiving the first response byte that carries generated content. Tokens per second is the steady-state per-request decode rate after the first token has been emitted, sometimes reported as inter-token latency (ITL). Concurrent sequences is the count of requests the engine can keep in flight at once on a single GPU.

The first two are latency metrics on the same request; the third is a capacity metric across requests. An engine optimises all three under the constraint of a fixed hardware budget, and the trade-offs between them shape every interesting design decision below.

TTFT and tokens-per-second describe two different halves of the response curve because they reflect two different computations. The first forward pass through the model after a request arrives is a prefill step: the engine reads every prompt token in one wide pass, computes attention over the prompt, and populates the per-layer key and value tensors. Prefill is arithmetic-intensive and scales roughly linearly with prompt length once the prompt exceeds the tile size of the attention kernel.

Every step after the first is a decode step: the model consumes the single new token from the previous step, computes its query, attends to the cached keys and values of every prior position, and emits one new token. Decode is memory-bandwidth-bound and runs at a near-fixed cost per token regardless of position. A 1,000-token prompt followed by a 100-token reply spends most of its TTFT inside that one prefill pass and most of its remaining wall clock inside 99 decode passes.

The two regimes have different bottlenecks, which is why the same engine tuning that improves TTFT (chunked prefill, prefix caching) is largely orthogonal to the tuning that improves tokens-per-second (paged KV, speculative decoding).

The KV cache and why it dominates memory

Transformer self-attention computes, at every layer and head, a query vector for the current token and the corresponding key and value vectors for every position in the context. During generation the model emits one token per decode step. The key and value tensors for every prior token are identical from one step to the next; only the new token contributes fresh entries.

The key-value cache is the data structure that retains those keys and values across decode steps and reads them back instead of recomputing them. Without caching, generating a sequence of length T from a prompt of length P costs O((P+T)^2) attention work, because every step would rerun attention over every prior position. With caching, the per-step cost collapses to a single forward pass through the model plus the constant work of appending one (K, V) pair per layer per head.

Each cached token consumes a fixed amount of memory determined by the architecture and the storage precision:

KV bytes per token = 2 × n_layers × n_kv_heads × d_head × dtype_bytes
Total KV for a sequence = (KV bytes per token) × seq_len

The leading factor of two accounts for storing both K and V. The product over layers and heads reflects that every attention layer keeps its own independent cache and each head within a layer keeps its own slot. The dtype factor is two bytes for FP16 or BF16, one byte for FP8 or INT8, and a half byte for INT4.

A worked example: a 70-billion-parameter dense transformer with 80 layers, 8 KV heads (after grouped-query reduction from 64 query heads), head dimension 128, and FP16 cache precision stores 2 × 80 × 8 × 128 × 2 = 327,680 bytes per token, roughly 327 KB. At a 128K context that single sequence holds 43 GB of cache.

A single H100 with 80 GB of HBM can therefore hold the weights and at most one such sequence before exhausting memory. The same model at 4K context consumes 1.3 GB per sequence, leaving room for dozens of concurrent users.

Two architectural choices push the cost down. Grouped-query attention (GQA, Ainslie et al. EMNLP 2023) shrinks the cache by replacing per-head K and V projections with a smaller number of shared projections, so the n_kv_heads factor in the formula is the GQA-reduced count rather than the query head count. A model with 64 query heads and 8 KV heads stores one eighth of the multi-head cache without measurable quality loss on standard benchmarks.

Hybrid architectures interleave conventional attention layers with linear-recurrence layers such as Mamba state-space blocks or Gated DeltaNet blocks. The recurrent layers hold a fixed-size internal state and contribute no token-indexed cache at all, so the n_layers factor collapses to the attention-layer count. An 80-layer hybrid with the repeating pattern of three recurrent blocks followed by one attention block keeps attention in only twenty layers, and the KV footprint at long context falls accordingly.

The arithmetic cost of decoding one token is roughly fixed per model and per hardware tier; the variable that determines aggregate throughput is how many sequences fit on the GPU at once, and that ceiling is set by KV memory rather than by compute.

PagedAttention: virtualizing the cache

A naive serving engine allocates a contiguous KV buffer for each incoming request, sized for the worst-case sequence length the engine advertises. Two forms of waste follow. Internal fragmentation: most requests finish well before the worst-case length, so the tail of every reserved buffer is allocated but never written. External fragmentation: when a finished request frees its buffer, the resulting hole is rarely the right shape for the next incoming request, and the allocator either splits new work across smaller holes or leaves the hole idle.

Kwon and colleagues at UC Berkeley instrumented production traces and reported in their SOSP 2023 paper that under realistic request-length distributions only twenty to forty percent of the reserved KV memory ever held live tokens. The remaining sixty to eighty percent was reserved-but-unused, and the wasted bytes bounded the number of concurrent sequences the GPU could hold.

PagedAttention borrows the demand-paged virtual-memory pattern from operating systems. The KV cache of each running sequence is partitioned into fixed-size blocks (typically holding the keys and values for sixteen tokens of one attention head), and a per-sequence block table maps the logical token positions of the sequence to the physical blocks that hold the cached entries. The attention kernel is rewritten to consult the block table on every step, so the physical blocks need not be contiguous in device memory and need not be allocated up front.

When a new sequence arrives the scheduler allocates zero blocks; it appends a new block to the sequence's entry only when the previous one fills. When a sequence finishes, every block it owned is returned to a free list and is immediately available to other sequences.

Just as an operating system can overcommit physical memory because most processes touch only a fraction of their address space, a serving engine using PagedAttention can pack many more concurrent sequences onto a GPU because each sequence consumes only as many blocks as the tokens it has actually produced.

With PagedAttention the live fraction of the KV cache rose from the twenty-to-forty percent typical of contiguous allocation to near one hundred percent. The original paper reported two-to-four times higher sustained throughput at the same latency budget across model sizes from OPT-13B through LLaMA-65B, achieved entirely by fitting more concurrent sequences into the same KV memory. Per-token decode latency was unchanged because the block-table indirection added negligible overhead inside the existing memory-load pipeline.

A second consequence of the indirection is that two sequences can share a physical block by pointing to the same frame. The scheduler enforces copy-on-write semantics: when one sequence writes to a shared block, the engine copies the block first and updates only the writer's block table. This makes parallel sampling strategies such as beam search and parallel decoding nearly free in memory, because the common prefix occupies one set of blocks no matter how many branches are active and only the divergent suffixes consume additional blocks.

Continuous batching: scheduling at iteration level

Where PagedAttention is a memory-allocation improvement, continuous batching is a scheduling improvement, and the two are independent in concept though they almost always ship together.

Continuous batching is a request-scheduling discipline for transformer inference servers in which the scheduler considers batch composition at every decoder iteration rather than at every request boundary. A request becomes eligible to join the running batch the moment its prefill completes; a request leaves the batch the moment it emits its end-of-sequence token or hits its max-token cap. The batch shape therefore changes one step at a time.

The static-batching baseline it replaced groups N requests into a fixed batch and runs the model on the whole batch in lockstep. Every request advances by one token per iteration. The batch retires only when every member has finished.

Generation lengths are heterogeneous (one prompt may want twenty tokens of output, the next two thousand), so under static batching the short request finishes first and then waits, holding its slot but contributing no useful work, until the longest-running peer in the batch finishes. The GPU spends those iterations doing matmul work for a batch whose effective size has shrunk to the long-tail straggler. New requests queued behind that batch cannot start. Throughput collapses to whatever the slowest request in each batch dictates.

Iteration-level scheduling fixes this directly: admission is opportunistic, eviction is immediate, and the next iteration sees a batch that already includes a replacement.

Orca, where this came from

The technique was introduced and named in the Orca paper (Yu et al., OSDI 2022, Seoul National University). Orca calls the underlying primitive iteration-level scheduling; the operational shorthand continuous batching has since become the more common label across production engines.

The Orca paper pairs iteration-level scheduling with a second mechanism it calls selective batching: most operators in a transformer decoder (per-token projections, the MLP, the layer norms) are shape-agnostic and work fine on a flattened tensor of tokens regardless of which request each token belongs to, while the attention operator is the exception because queries must attend to keys and values from the same sequence and only up to the current position.

Orca runs the shape-agnostic operators batched across all requests and runs attention separately per request, breaking the training-time assumption that the entire forward pass must operate on a tensor of uniform shape.

Across the workloads tested, Orca reported throughput improvements ranging from roughly 1.4x to 23x over the FasterTransformer baseline, with the largest gains under skewed output-length distributions (the empirical norm for production traffic) and the smallest under near-uniform output lengths. A second-order finding concerns tail latency: under static batching, a short request's completion time is bounded below by the longest peer in its batch; under iteration-level scheduling there is no shared retirement barrier, so median and tail latencies improve together rather than trading off.

Continuous batching composes cleanly with PagedAttention because both treat the batch as a fluid object. With contiguous KV allocation, eviction between iterations would require copying or compacting contiguous regions, which limits how aggressively a scheduler can rebalance. With paged KV, eviction is a free-list operation on blocks, so the scheduler can rebalance freely. The two techniques together are now table stakes for any production inference engine.

Quantisation: shrinking weights to fit more in VRAM

A trained transformer ships from its training pipeline with each weight encoded as a 16-bit floating-point number, typically BF16. Post-training quantisation converts those weights into a narrower representation (8 bits, 4 bits, or, on Blackwell-class hardware, 4-bit floating point) together with a small amount of per-tensor or per-channel metadata (a scale and, for asymmetric formats, a zero point) that lets the inference kernel reconstruct an approximation of the original value at matmul time.

The conversion is performed once after training, and the quantised checkpoint replaces the high-precision one on disk and in GPU memory. Every halving of weight precision halves the VRAM cost per parameter: a 70-billion-parameter dense model occupies roughly 140 GB at FP16, 70 GB at FP8 or INT8, and 35 GB at INT4.

That arithmetic decides whether a model fits on one GPU, on a tensor-parallel pair, or not at all on a given accelerator class, and the freed memory translates directly into more KV cache and therefore more concurrent users.

On Hopper-class hardware a second axis matters: tensor-core throughput is higher at FP8 than at FP16, because the same silicon processes twice as many multiply-accumulate operations per clock when each operand is half the width. A well-calibrated FP8 checkpoint on an H100 frequently runs faster than the FP16 checkpoint of the same model, in addition to fitting in less memory. FP8 is native to Hopper and Ada generation accelerators (H100, H200, L40, L40S).

INT8 has broader hardware support, including older Ampere parts that lack the FP8 tensor-core path. INT4, produced by GPTQ or AWQ, delivers a roughly four-fold VRAM reduction against FP16; quality loss is measurable on hard reasoning benchmarks (MMLU, GSM8K, code generation) but is typically acceptable for open-ended chat workloads and is the format of choice when a model must fit on a single GPU that would otherwise demand tensor parallelism. FP4 is the newest entrant, supported by Blackwell parts (B100, B200, GB200).

GPTQ

GPTQ (Frantar, Ashkboos, Hoefler, and Alistarh, ICLR 2023) made INT4 quantisation of 175-billion-parameter transformers a practical single-GPU-hour procedure rather than a research curiosity. The algorithm treats each linear layer in isolation: given the layer's weight matrix and a small calibration set of activations, find a quantised weight matrix that minimises the squared error between the original layer output and the quantised layer output.

The Hessian of this quadratic loss with respect to the weights is computed once per layer from the calibration activations, and the inverse Hessian directs a greedy column-by-column quantisation in which the residual error from each rounded column is propagated forward to the not-yet-quantised columns.

On OPT-175B and BLOOM-176B the produced INT4 checkpoints sat within 0.1 to 0.4 perplexity points of the FP16 baseline on WikiText-2 and C4, and the quantisation procedure itself completed in roughly four GPU-hours on a single A100. The headline practical consequence: a 175-billion-parameter model compressed from 350 GB at FP16 to 88 GB at INT4 fits on a single 80-gigabyte accelerator, instead of demanding an eight-GPU tensor-parallel rack.

AWQ

AWQ (Lin, Tang, Tang, and colleagues, MLSys 2024, Han Lab at MIT) takes an activation-aware approach. The observation that motivates the method is that not all weights matter equally: a small fraction, roughly the top one percent, carry most of the layer's expressive capacity, and those weights are identifiable not by their own magnitude but by the magnitude of the activations that pass through them.

AWQ preserves the salient weight channels at higher effective precision by rescaling activations and weights together before applying a uniform INT4 quantiser. If a weight column is multiplied by a per-channel scalar before quantisation and the corresponding activation channel is divided by the same scalar after quantisation, the matmul result is unchanged in full precision and the salient weights effectively occupy a smaller portion of the quantisation grid. The scale is absorbed into the preceding LayerNorm or linear projection, so no runtime rescaling is needed.

AWQ checkpoints tend to outperform GPTQ on tasks sensitive to outlier behaviour and on bit widths below INT4, and the calibration is cheaper because it requires no Hessian estimate. Both algorithms ship as first-class formats in vLLM today, and the choice of which to use is typically a per-checkpoint decision made by the model author.

Parallelism strategies when one GPU is not enough

Tensor parallelism (TP) is a form of intra-layer model parallelism, introduced for transformer training by Shoeybi et al. in the Megatron-LM paper (arXiv:1909.08053) and applied at inference time without modification. Each layer's weight matrices are partitioned across N GPUs along a chosen axis: the attention projections shard across the head dimension, the first feed-forward linear shards along its output dimension, and the second feed-forward linear shards along its input dimension.

After the attention sublayer and after the MLP sublayer the partial results are combined via an all-reduce, in which every GPU contributes its partial output and every GPU receives the full sum. A model with L blocks therefore incurs 2L all-reduces per forward pass, with the communicated tensor scaling as batch size times sequence length times hidden dimension. TP is bandwidth-intensive and latency-sensitive: each all-reduce moves the full activation across the interconnect, and the synchronisation barrier means the slowest link sets the step time.

Within a single node connected by NVLink, where aggregate bidirectional bandwidth is measured in hundreds of gigabytes per second and per-hop latency is sub-microsecond, TP=2 doubles the weight capacity available to the model at a throughput tax in the fifteen-to-twenty-five percent range. Across nodes, even over InfiniBand HDR or NDR, the all-reduces dominate step time and throughput collapses. Multi-node tensor parallelism is rarely deployed at inference for this reason.

Pipeline parallelism

Pipeline parallelism (PP), formalised for large-scale training by Huang and colleagues in GPipe (NeurIPS 2019), is the alternative axis. It partitions the transformer into contiguous groups of layers called stages, assigns each stage to one GPU, and streams activations forward from stage to stage. At steady state with a sufficiently full pipeline, each of the K stages is busy on a different micro-batch every step, and the only inter-stage traffic is one point-to-point send of one activation tensor per micro-batch (no all-reduce, no broadcast).

The activation volume per stage boundary is dwarfed by what tensor parallelism would impose on the same boundary, which is why pipeline parallelism is the conventional choice when the parallelism cuts across a slow interconnect. The cost is the pipeline bubble: the pipeline only reaches full utilisation when at least K micro-batches are in flight, and the bubble fraction for a synchronous schedule with M micro-batches and K stages is approximately (K-1)/(M+K-1). Interactive workloads with low queue depth and short responses cannot afford a deep pipeline.

The conventional production layout combines both axes: tensor parallelism within a single node to exploit NVLink, pipeline parallelism across nodes to tolerate the slower cross-node fabric. A 64-GPU deployment might run as 8-way tensor parallelism per node across 8 pipeline stages, with the total device count being the product of the two degrees.

Speculative decoding: amortising memory bandwidth

Autoregressive decoding from a transformer on contemporary GPUs is memory-bandwidth-bound rather than compute-bound. Each decode step must load the full set of model weights from HBM into the compute units to produce a single output token, and the arithmetic intensity of that pass (floating-point operations per byte of weight traffic) is grossly below the hardware's peak ratio. The tensor cores sit largely idle while the memory subsystem moves weights.

The asymmetry is structural: a forward pass that produces one token reads roughly the same bytes from memory as a forward pass that produces a hundred tokens at adjacent positions, because the weights are loaded once and the per-position activations are comparatively tiny. Prefill exploits this directly. Decode does not, because each step depends on the previous step's sampled token and so cannot be widened without changing the semantics.

Speculative decoding, introduced concurrently by Leviathan, Kalman, and Matias at Google Research (ICML 2023) and by Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper at DeepMind (also 2023), restores the missing width by introducing a second much smaller draft model. The draft is run autoregressively for K steps, producing K candidate tokens together with their probabilities under the draft distribution.

The large target model is then run once on the prompt concatenated with the K draft tokens; in that single forward pass it emits K+1 sets of next-token probabilities, one at each prefix length. A probabilistic acceptance rule walks those probabilities left to right and accepts the draft token at position k with probability min(1, p_target(k)/p_draft(k)); on the first rejection a corrected token is sampled from a recalibrated residual distribution and the round ends. The accepted prefix is committed and the next round begins.

Both papers prove that the token sequence produced by this procedure is distributed identically to a sequence produced by sampling from the target model one token at a time. There is no approximation, no quality knob, no temperature distortion. The speedup is two-to-three times on common benchmarks at zero quality cost, with the largest gains on decode-dominated workloads (short prompts, long outputs) and the smallest on prefill-dominated ones.

The technique now ships in vLLM as a first-class mode under the --speculative-config flag, and has spawned a derivative line of work (Medusa, EAGLE, tree-shaped speculation) that experiments with alternative draft mechanisms while keeping the acceptance rule unchanged.

vLLM as the reference implementation

vLLM is the open-source serving engine, released under the Apache 2.0 licence, that productionised PagedAttention alongside Orca-style continuous batching and became the default benchmark target for new serving systems. It originated in the Sky Computing Lab at UC Berkeley in 2023 as the reference implementation that shipped with the PagedAttention paper. Woosuk Kwon and Zhuohan Li, the first two listed authors of the SOSP 2023 paper, drove the initial release.

Governance moved from the original Berkeley group to a community-run GitHub organisation, vllm-project, through 2024, and the repository remains the source of record. The scheduler implements iteration-level scheduling and paged KV in one engine, plus an extension the original Orca formulation did not have: chunked prefill, which splits the prefill of a long prompt into multiple smaller compute steps and interleaves decode tokens of the existing batch between them.

The arriving prompt still finishes prefill in the same total compute, but the decoder no longer sees a single long stall that would have caused visible time-between-token regressions for sequences already in flight.

The engine is CUDA-first and remains most-tested on NVIDIA GPUs, with ROCm support for AMD MI-series accelerators landing during 2024 as a tier-1 backend and TPU support contributed through 2024 and 2025 over an XLA backend. Quantisation coverage tracks what the formats and the hardware allow: FP8 weight-and-activation quantisation runs natively on Hopper and Blackwell parts, AWQ and GPTQ four-bit weight-only quantisation runs on Ampere and older parts where FP8 is unavailable, and the compressed-tensors format is the recent common ground used to ship quantised checkpoints across backends.

Speculative decoding is exposed as a first-class mode. The OpenAI-compatible HTTP server (vllm serve exposing /v1/chat/completions and /v1/completions) is the most-deployed open-source implementation of the chat-completions API surface in 2026, and most published quantised checkpoints on the HuggingFace Hub annotate vLLM compatibility before they mention any other engine. The engine is the place where every concept on this page meets actual silicon.

References