Continuous batching

Continuous batching is an iteration-level scheduling discipline for autoregressive generation: a new request joins the in-flight batch as soon as its prefill completes, and a finished request frees its slot in the same step that it emits its end-of-sequence token.

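The technique removes the static-batch retirement barrier, under which a batch returns only when its longest sequence finishes, so every request's completion time is coupled to that of its longest-running peer and short requests hold idle slots in the meantime. Refilling those slots at every iteration is why production engines report multi-fold throughput gains over the lockstep baseline.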
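The difference between the two disciplines can be illustrated with a toy step counter. This is a minimal sketch under simplifying assumptions, not any engine's scheduler: prefill cost is ignored, every request is already queued, each decode iteration costs one step regardless of batch occupancy, and the function names and output lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        # The whole batch is held until its longest sequence is done.
        steps += max(batch)
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when a freed slot is refilled at the next iteration."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before this iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every in-flight request emits one token.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Requests that just emitted their last token free their slots now.
        running = [remaining for remaining in running if remaining > 0]
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 3, 8, 2, 2, 3, 2]  # hypothetical output lengths, in tokens
    print(static_batching_steps(lengths, batch_size=4))      # 11
    print(continuous_batching_steps(lengths, batch_size=4))  # 8
```

On this hypothetical workload the static schedule spends 8 steps on the first batch and 3 on the second, while the continuous schedule finishes in 8 steps, which is also the floor set by the longest request.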

For the longer treatment, see How engines work: continuous batching.

Cost and reliability implications

Iteration-level scheduling is the difference between paying for one GPU and paying for several: the multi-fold throughput gain over static batching comes from the same hardware, so the dollars-per-million-tokens figure falls roughly in inverse proportion to the throughput gain. The failure mode is interference, where a heavy prefill stalls in-flight decodes and inflates tail latency for every concurrent request, so admission control still matters under load.
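The cost relationship is simple arithmetic and can be sketched as follows. The GPU price and throughput figures are hypothetical placeholders chosen for illustration, not measurements, and the helper name is an assumption of this example.

```python
def dollars_per_million_tokens(gpu_dollars_per_hour, tokens_per_second):
    """Cost of generating one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    gpu_price = 2.00          # hypothetical hourly GPU price, in dollars
    static_tps = 500          # hypothetical throughput with static batching
    continuous_tps = 2000     # hypothetical 4x gain on the same hardware

    static_cost = dollars_per_million_tokens(gpu_price, static_tps)
    continuous_cost = dollars_per_million_tokens(gpu_price, continuous_tps)
    print(f"static: ${static_cost:.2f} per million tokens")
    print(f"continuous: ${continuous_cost:.2f} per million tokens")
```

Because the hourly price is fixed, a 4x throughput gain divides the per-token cost by 4. Real deployments would see a smaller effective gain if interference forces the scheduler to admit fewer requests to protect tail latency.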

References

Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) introduces iteration-level scheduling and reports a 36.9x throughput improvement over NVIDIA FasterTransformer at the same level of latency on a GPT-3 175B model.