
Orca is the OSDI 2022 serving system that introduced iteration-level scheduling: new requests join the running batch at the next iteration boundary and leave the moment they finish, instead of waiting for a static batch to drain. The paper reported 1.4x to 23x throughput gains and established the technique that production engines now call continuous batching.

Orca: continuous batching (Yu et al., OSDI 2022)

The OSDI 2022 paper that named and validated iteration-level scheduling plus selective batching for transformer inference, reporting throughput gains from 1.4x to 23x over the FasterTransformer baseline on GPT-3-class models.
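To make the difference between static batching and iteration-level scheduling concrete, here is a minimal toy simulation. It is a sketch under simplifying assumptions, not Orca's scheduler: every iteration costs one time unit regardless of batch size, prefill counts as one of a request's iterations, and the names `Request`, `static_batching`, and `iteration_level` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    arrival: int  # iteration index at which the request arrives
    length: int   # iterations the request needs before it is finished


def static_batching(requests, max_batch):
    """Request-level scheduling: a batch runs until its longest member finishes."""
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    now, finish = 0, {}
    while pending:
        now = max(now, pending[0].arrival)  # idle until the next arrival
        batch = []
        while pending and len(batch) < max_batch and pending[0].arrival <= now:
            batch.append(pending.popleft())
        now += max(r.length for r in batch)  # the batch drains as a unit
        for r in batch:
            finish[r.rid] = now  # short requests wait for the longest one
    return finish


def iteration_level(requests, max_batch):
    """Iteration-level scheduling: admit and retire requests every iteration."""
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    running = {}  # rid -> remaining iterations
    now, finish = 0, {}
    while pending or running:
        # Admit waiting requests into free slots at the iteration boundary.
        while pending and len(running) < max_batch and pending[0].arrival <= now:
            r = pending.popleft()
            running[r.rid] = r.length
        if not running:
            now = pending[0].arrival  # nothing to run; skip ahead
            continue
        now += 1  # run one iteration for every request in the batch
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # the slot is free for the next boundary
                finish[rid] = now
    return finish


requests = [Request(0, 0, 2), Request(1, 0, 8), Request(2, 1, 2)]
print(static_batching(requests, max_batch=2))  # {0: 8, 1: 8, 2: 10}
print(iteration_level(requests, max_batch=2))  # {0: 2, 2: 4, 1: 8}
```

In this toy workload the short request 0 returns after 2 iterations instead of 8, and request 2 takes over the freed slot immediately rather than waiting for the long request to finish.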

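Every production serving engine in 2026 schedules at the iteration boundary and runs attention separately from the operators that need uniformly shaped inputs: linear layers, layer norm, and activations are batched over the flattened tokens of all requests, while attention is computed per request. Both design choices trace back to this paper.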
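Selective batching can be illustrated with a short NumPy sketch. It is illustrative only and makes several assumptions that are not in the paper's implementation: a single attention head, no KV cache (each request is processed as a full sequence), and hypothetical names `attention` and `selective_batch_layer`. The point it shows is that the projections run once over all tokens, while attention runs per request because sequence lengths differ.

```python
import numpy as np


def attention(q, k, v):
    """Causal single-head attention for one request; q, k, v are [tokens, d]."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # positions after i
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def selective_batch_layer(inputs, w_qkv, w_out):
    """inputs: list of [tokens_i, d] arrays with differing tokens_i."""
    lengths = [x.shape[0] for x in inputs]
    bounds = np.cumsum(lengths)[:-1]  # split points between requests

    # Shape-uniform path: flatten all requests into one [sum(tokens), d] matrix
    # so the QKV projection is a single batched matmul.
    flat = np.concatenate(inputs, axis=0)
    qkv = flat @ w_qkv

    # Attention path: split back per request, since lengths are not uniform.
    outs = []
    for chunk in np.split(qkv, bounds, axis=0):
        q, k, v = np.split(chunk, 3, axis=-1)
        outs.append(attention(q, k, v))

    # Merge again so the output projection is batched over all tokens.
    merged = np.concatenate(outs, axis=0) @ w_out
    return np.split(merged, bounds, axis=0)


rng = np.random.default_rng(0)
d = 4
w_qkv = rng.standard_normal((d, 3 * d))
w_out = rng.standard_normal((d, d))
inputs = [rng.standard_normal((n, d)) for n in (1, 3, 5)]
outputs = selective_batch_layer(inputs, w_qkv, w_out)
print([o.shape for o in outputs])  # [(1, 4), (3, 4), (5, 4)]
```

No padding is needed to batch requests of length 1, 3, and 5 together, which is what lets a scheduler mix requests at different stages of generation in the same iteration.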

For the longer treatment in narrative context, see How engines work: continuous batching.

Part of Performance and latency on the learn hub.


The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.