Orca is the OSDI 2022 serving system that introduced iteration-level scheduling: requests join the running batch at the next iteration boundary and leave the moment they finish, instead of waiting for a static batch to drain. The paper reports up to 36.9x higher throughput than FasterTransformer at the same latency on GPT-3 175B, and production engines now call the technique continuous batching.

Orca: continuous batching (Yu et al., OSDI 2022)

The OSDI 2022 paper that named and validated iteration-level scheduling plus selective batching for transformer inference, reporting up to 36.9x higher throughput than the FasterTransformer baseline at the same latency on GPT-3 175B.
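
As a rough illustration of why scheduling at the iteration boundary helps, the sketch below counts model iterations for one workload under static batching and under iteration-level scheduling. It is a toy simulation, not Orca's scheduler. The names `Request`, `run_static`, and `run_iteration_level` are hypothetical, every iteration is assumed to emit exactly one token per running request (prefill is folded into the first iteration), and output lengths are assumed known up front only so the simulation can terminate.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    tokens_to_generate: int  # assumption: known in advance for the simulation only
    generated: int = 0

    @property
    def done(self) -> bool:
        return self.generated >= self.tokens_to_generate


def run_static(lengths, max_batch):
    """Request-level scheduling: each batch runs until its longest request ends."""
    iterations = 0
    for i in range(0, len(lengths), max_batch):
        iterations += max(lengths[i:i + max_batch])
    return iterations


def run_iteration_level(lengths, max_batch):
    """Iteration-level scheduling: re-form the batch before every iteration."""
    waiting = deque(Request(f"r{i}", n) for i, n in enumerate(lengths))
    running = []
    iterations = 0
    while waiting or running:
        # Admit queued requests into free slots at the iteration boundary.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One model iteration: every running request produces one token.
        for req in running:
            req.generated += 1
        iterations += 1
        # Finished requests leave immediately, freeing slots for the next iteration.
        running = [r for r in running if not r.done]
    return iterations


lengths = [2, 8, 3, 1, 4, 2]  # tokens each request will generate
static_iters = run_static(lengths, max_batch=2)
continuous_iters = run_iteration_level(lengths, max_batch=2)
print(static_iters, continuous_iters)
```

With two slots, static batching needs 15 iterations because short requests hold their slot idle until the longest one in the batch finishes. The iteration-level loop needs 10 because a freed slot is refilled at the next boundary.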

Mainstream production serving engines in 2026 schedule at the iteration boundary and run attention per request, separately from the operators that can be batched over a flattened, shape-uniform token dimension; both ideas trace back to this paper.
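
Selective batching can be sketched in a few lines. The toy layer below flattens the tokens of all requests into one matrix for a shape-uniform operator (a linear projection), then splits the result back per request for attention, where sequence lengths differ. All names here (`linear`, `attention`, `selective_batch_layer`) are hypothetical, and the layer is deliberately simplified: it uses pure-Python lists, a single head, no causal mask, no KV cache, and one shared projection for queries, keys, and values, none of which reflects a real engine's kernels.

```python
import math


def linear(rows, weight):
    """Apply one [hidden_in, hidden_out] weight to every token row."""
    cols = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in cols] for row in rows]


def attention(q, k, v):
    """Single-head softmax attention over one request's tokens (toy, unmasked)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * vj[d] for e, vj in zip(exps, v))
                    for d in range(len(v[0]))])
    return out


def selective_batch_layer(requests, weight):
    # 1. Flatten tokens from all requests into one [total_tokens, hidden] matrix.
    flat = [row for req in requests for row in req]
    lengths = [len(req) for req in requests]
    # 2. Shape-uniform operator: one batched call covers every token.
    projected = linear(flat, weight)
    # 3. Attention is not batched: split by request and run each separately.
    outputs, start = [], 0
    for n in lengths:
        chunk = projected[start:start + n]
        outputs.append(attention(chunk, chunk, chunk))
        start += n
    return outputs


# Two requests with different token counts share one linear call.
requests = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # 3 tokens
    [[0.5, 0.5]],                          # 1 token
]
identity = [[1.0, 0.0], [0.0, 1.0]]
outputs = selective_batch_layer(requests, identity)
print([len(o) for o in outputs])
```

The point of the split is that the linear operator only needs a uniform hidden dimension, so requests of any length can share it, while attention depends on each request's own sequence and is handled one request at a time.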

For the longer treatment in narrative context, see How engines work: continuous batching.

References

Yu, Jeong, Kim, Kim, Chun. Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022. The primary source. Introduces iteration-level scheduling and selective batching; reports the throughput numbers cited above.
USENIX OSDI 2022 proceedings. The venue where the Orca paper was published.
