Prefill and decode are the two phases of serving an LLM request. Prefill ingests the whole prompt in one parallel, compute-bound pass and builds the KV cache; decode then generates one token per step in a memory-bound loop. The split explains why first-token latency scales with prompt length while per-token speed stays roughly flat.

Prefill vs decode

The two phases stress different hardware. Prefill processes every prompt token in parallel and saturates compute, which is why time to first token grows with prompt length. Decode emits one token per step and saturates memory bandwidth, reading the model weights and the whole KV cache each iteration. Because the fixed cost of the weight read usually dominates, per-token speed stays roughly flat however long the answer runs. The engine-level treatment is at How engines work: TTFT.
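The compute-bound versus memory-bound split can be made concrete with a back-of-envelope roofline check. The sketch below is illustrative only. It assumes roughly 2 FLOPs per parameter per token, 2-byte weights read once per forward pass, a single request per pass, and a hypothetical accelerator with 300 TFLOP/s of compute and 1.5 TB/s of memory bandwidth. It ignores KV-cache and activation traffic, and none of the numbers come from a real device.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumption: about 2 FLOPs per parameter per token, and the weights are
    read from memory once per pass. KV-cache and activation traffic are ignored.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


def bound_by(tokens_per_pass: int, ridge_flops_per_byte: float) -> str:
    """Classify a pass against the accelerator's ridge point (peak FLOP/s / bandwidth)."""
    if arithmetic_intensity(tokens_per_pass) >= ridge_flops_per_byte:
        return "compute"
    return "memory bandwidth"


# Hypothetical accelerator: 300 TFLOP/s peak compute, 1.5 TB/s memory bandwidth.
RIDGE = 300e12 / 1.5e12  # FLOPs the chip can do per byte it can read

prefill_limit = bound_by(2048, RIDGE)  # a 2048-token prompt in one pass
decode_limit = bound_by(1, RIDGE)      # one new token per pass

print(f"prefill is limited by {prefill_limit}")
print(f"decode is limited by {decode_limit}")
```

Under these assumptions a long prompt performs far more arithmetic per byte read than the chip's ridge point, while a single decode step performs far less, which is the asymmetry the paragraph above describes. Batching many decodes into one pass raises the tokens processed per weight read and narrows that gap.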

Schedulers that interleave the phases inherit an interference problem: a large prefill admitted into a running batch stalls every in-flight decode for its duration, surfacing as a tail-latency spike on otherwise healthy traffic. Chunked prefill, now standard in production engines, slices prompt ingestion into bounded pieces so decode steps keep flowing between them.
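A toy model shows why bounding the prefill slice protects in-flight decodes. This is a minimal sketch, not any engine's scheduler. It assumes each iteration runs one prefill slice plus one decode step for the running batch, and the per-token prefill cost and decode step cost are invented for illustration. The function name and parameters are hypothetical.

```python
from typing import List, Optional


def decode_gaps_ms(
    prompt_tokens: int,
    chunk_size: Optional[int],
    prefill_ms_per_token: float = 0.1,  # assumed cost, not a measurement
    decode_step_ms: float = 20.0,       # assumed cost, not a measurement
) -> List[float]:
    """Time between consecutive decode steps while one new prompt is ingested.

    chunk_size=None models unchunked prefill: the whole prompt is processed
    in a single iteration, so in-flight decodes wait for all of it.
    """
    if chunk_size is None:
        chunk_size = prompt_tokens
    gaps = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        # Each iteration: one bounded prefill slice, then one decode step.
        gaps.append(chunk * prefill_ms_per_token + decode_step_ms)
        remaining -= chunk
    return gaps


unchunked = decode_gaps_ms(8000, chunk_size=None)
chunked = decode_gaps_ms(8000, chunk_size=512)

# Worst stall an in-flight decode sees under each policy.
print(f"unchunked: {len(unchunked)} iteration, worst gap {max(unchunked):.0f} ms")
print(f"chunked:   {len(chunked)} iterations, worst gap {max(chunked):.0f} ms")
```

With these assumed costs the unchunked prompt produces a single stall of roughly 820 ms, while 512-token chunks keep every gap well under 100 ms. The chunk size is the knob: smaller chunks give smoother decode latency but take more iterations to finish the prompt.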

For an API caller the split yields one rule each way: trim what gets prefilled (standing prompts, replayed history) to protect first-token latency, and budget generation length against a roughly constant decode rate, the arithmetic the KV-cache field note and the retries guide both build on.
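The caller-side arithmetic can be sketched as a two-term estimate: prompt length divided by a prefill rate, plus output length divided by a decode rate. The rates below (5,000 prompt tokens per second, 50 generated tokens per second) are placeholder assumptions and should be replaced with figures measured against the endpoint in use. Both function names are hypothetical.

```python
from typing import Tuple


def estimate_latency_s(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float = 5000.0,  # assumed rate, measure your own
    decode_tokens_per_s: float = 50.0,     # assumed rate, measure your own
) -> Tuple[float, float]:
    """Return (time to first token, total latency) in seconds."""
    ttft = prompt_tokens / prefill_tokens_per_s
    generation = output_tokens / decode_tokens_per_s
    return ttft, ttft + generation


def max_output_tokens(
    budget_s: float,
    prompt_tokens: int,
    prefill_tokens_per_s: float = 5000.0,
    decode_tokens_per_s: float = 50.0,
) -> int:
    """Largest generation length that fits the latency budget after prefill."""
    ttft = prompt_tokens / prefill_tokens_per_s
    return max(0, int((budget_s - ttft) * decode_tokens_per_s))


ttft, total = estimate_latency_s(prompt_tokens=10_000, output_tokens=500)
print(f"TTFT {ttft:.1f} s, total {total:.1f} s")

# Trimming the prompt frees budget for generation under the same deadline.
print(max_output_tokens(budget_s=10.0, prompt_tokens=5000))
print(max_output_tokens(budget_s=10.0, prompt_tokens=2500))
```

The model is deliberately crude: it ignores queueing, network time, and any slowdown as context grows. It is still enough to show that prompt trimming mostly moves the first-token term, while the generation cap moves the total.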

Cost and reliability implications

Because prefill saturates compute and decode saturates memory bandwidth, a fleet's realised throughput depends on how the scheduler interleaves the two. Mixed workloads fail in a characteristic way: one large prefill stalls every in-flight decode, which surfaces as a sudden tail-latency spike on otherwise healthy traffic.

References

Agrawal et al. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (arXiv, 2023). Names the prefill-decode interference problem and introduces chunked prefill, the mitigation production engines adopted.
Yu et al. Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022). The iteration-level scheduler that made interleaving prefills and decodes the production norm.
