Prefill vs decode
The two phases stress different hardware. Prefill processes every prompt token in parallel and saturates compute, which is why time to first token grows with prompt length. Decode emits one token per step and saturates memory bandwidth, rereading the model weights and the whole KV cache each iteration. At typical context lengths the weight read dominates that traffic, which is why per-token speed stays roughly flat however long the answer runs, drifting down only gradually as the cache grows. The engine-level treatment is at How engines work: TTFT.
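The split can be written down as a first-order latency model. The sketch below is illustrative only: the PhaseRates type and the throughput figures are hypothetical placeholders, not measurements from any engine. It also treats prefill as linear in prompt length and decode as perfectly constant, which real deployments only approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseRates:
    """Hypothetical per-request throughputs; measure these on your own deployment."""
    prefill_tokens_per_s: float  # prompt tokens ingested per second (compute-bound)
    decode_tokens_per_s: float   # output tokens emitted per second (bandwidth-bound)


def time_to_first_token(prompt_tokens: int, rates: PhaseRates) -> float:
    """TTFT in seconds: grows linearly with prompt length in this model."""
    return prompt_tokens / rates.prefill_tokens_per_s


def total_latency(prompt_tokens: int, output_tokens: int, rates: PhaseRates) -> float:
    """End-to-end seconds: prefill time plus a constant-rate decode phase."""
    decode_time = output_tokens / rates.decode_tokens_per_s
    return time_to_first_token(prompt_tokens, rates) + decode_time


# Assumed example rates, chosen only to make the arithmetic easy to follow.
rates = PhaseRates(prefill_tokens_per_s=4000.0, decode_tokens_per_s=50.0)

short_prompt_ttft = time_to_first_token(2000, rates)   # 0.5 s
long_prompt_ttft = time_to_first_token(8000, rates)    # 2.0 s
short_total = total_latency(2000, 100, rates)          # 0.5 s + 2.0 s = 2.5 s
```

Under these assumed rates, quadrupling the prompt quadruples time to first token, while the decode term depends only on how many tokens are generated.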
Schedulers that interleave the phases inherit an interference problem: a large prefill admitted into a running batch stalls every in-flight decode for its duration, surfacing as a tail-latency spike on otherwise healthy traffic. Chunked prefill, now standard in production engines, slices prompt ingestion into bounded pieces so decode steps keep flowing between them.
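A toy model makes the effect of chunking concrete. The sketch below assumes prefill cost is linear in tokens and that in-flight decodes get a step between consecutive chunks; the function names and the per-token cost are hypothetical, and production schedulers typically batch a chunk together with decode tokens rather than strictly alternating.

```python
def split_prefill(prompt_tokens: int, chunk_size: int) -> list[int]:
    """Slice a prompt into pieces of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    full_chunks, remainder = divmod(prompt_tokens, chunk_size)
    return [chunk_size] * full_chunks + ([remainder] if remainder else [])


def worst_decode_stall_ms(prompt_tokens: int, chunk_size: int,
                          prefill_ms_per_token: float) -> float:
    """Longest wait between decode steps while the prompt is ingested.

    In this toy model decodes resume after every chunk, so the stall is
    bounded by the cost of the largest single chunk.
    """
    chunks = split_prefill(prompt_tokens, chunk_size)
    if not chunks:
        return 0.0
    return max(chunks) * prefill_ms_per_token


# Assumed cost of 0.25 ms per prefilled token, for illustration only.
unchunked_stall = worst_decode_stall_ms(8000, chunk_size=8000, prefill_ms_per_token=0.25)
chunked_stall = worst_decode_stall_ms(8000, chunk_size=512, prefill_ms_per_token=0.25)
```

With these assumed numbers the unchunked prefill holds decodes for 2000 ms, while 512-token chunks cap the stall at 128 ms. Total prefill work is unchanged; only the longest uninterrupted piece shrinks.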
For an API caller the split yields one rule each way: trim what gets prefilled (standing prompts, replayed history) to protect first-token latency, and budget generation length against a roughly constant decode rate. The KV-cache field note and the retries guide both build on that arithmetic.
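The budgeting rule can be sketched as a small calculation. This is a minimal example under stated assumptions: max_output_tokens is a hypothetical helper, the rates are placeholder values rather than figures for any real API, and queueing and network time are ignored.

```python
def max_output_tokens(budget_s: float, prompt_tokens: int,
                      prefill_tokens_per_s: float,
                      decode_tokens_per_s: float) -> int:
    """Largest generation length that fits in a latency budget.

    Prefill spends part of the budget up front; whatever remains is
    converted to tokens at an assumed constant decode rate.
    """
    remaining_s = budget_s - prompt_tokens / prefill_tokens_per_s
    if remaining_s <= 0:
        return 0  # the prompt alone already exhausts the budget
    return int(remaining_s * decode_tokens_per_s)


# Assumed rates: 4000 prompt tokens/s prefill, 50 output tokens/s decode.
with_full_history = max_output_tokens(10.0, 8000, 4000.0, 50.0)     # 8.0 s left
with_trimmed_history = max_output_tokens(10.0, 2000, 4000.0, 50.0)  # 9.5 s left
```

Under these assumptions, trimming the prompt from 8000 to 2000 tokens frees 1.5 s of the 10 s budget, which raises the affordable generation length from 400 to 475 tokens.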