

Pipeline parallelism (PP)

Pipeline parallelism splits a deep network along the layer dimension into sequential stages, places each stage on a different GPU, and streams activations forward so that several micro-batches occupy the pipeline at once.
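To make the layer-wise split concrete, here is a minimal sketch of one possible way to assign contiguous layer ranges to stages. The function name `partition_layers` and the even-split policy are assumptions made for illustration; real engines may balance stages by measured compute or memory rather than by layer count.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous stages.

    Hypothetical helper: balances by layer count only, which assumes
    every layer costs roughly the same to run.
    """
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# 10 layers over 4 stages gives stage sizes 3, 3, 2, 2.
for gpu, layers in enumerate(partition_layers(10, 4)):
    print(f"stage {gpu}: layers {list(layers)}")
```

Each stage would then live on its own GPU, with the output activations of one range sent to the GPU holding the next.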

The cross-stage hop is a point-to-point send rather than a collective, which is why PP tolerates slow interconnects that tensor parallelism cannot, at the cost of a fill-and-drain bubble that only amortises when the in-flight micro-batch count is large.
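The way the bubble amortises can be sketched with the usual idealised model: with p stages and m micro-batches that each take the same time per stage, a simple fill-and-drain schedule runs for m + p - 1 time slots, of which p - 1 are bubble on each GPU. The function below is an illustrative assumption rather than any engine's API, and it ignores communication time and uneven stage costs.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle share of each GPU's time under an idealised fill-and-drain schedule.

    Assumes equal time per stage per micro-batch and free communication.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("stages and micro-batches must both be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# With 4 stages, the idle share shrinks as more micro-batches are in flight.
for m in (1, 4, 16, 64):
    print(f"micro-batches={m:>2}  bubble={bubble_fraction(4, m):.1%}")
```

Under these assumptions a 4-stage pipeline with 4 micro-batches idles for 3/7 of the time, while 64 micro-batches bring that down to 3/67.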

For the longer treatment, see How engines work: pipeline parallelism.

Cost and reliability implications

Pipeline parallelism buys model capacity over a cheaper interconnect: stages communicate by point-to-point sends, so a pipeline can span nodes where tensor parallelism is impractical. The cost is the bubble: GPU cycles that sit idle during pipeline fill and drain but are still billed by the provider. At low request rates utilisation sags, so the technique pays off mainly at sustained batch depth.
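As a rough illustration of how billed idle time inflates cost, the sketch below divides an hourly GPU price by pipeline utilisation. It assumes the idealised schedule in which p stages and m equal-cost micro-batches give utilisation m / (m + p - 1); the function name and the price of 2.00 per GPU-hour are hypothetical values chosen only for the example.

```python
def cost_per_useful_gpu_hour(
    price_per_gpu_hour: float, num_stages: int, num_microbatches: int
) -> float:
    """Effective price of one hour of useful work on one GPU.

    Hypothetical model: utilisation = m / (m + p - 1), i.e. equal stage
    times, free communication, and billing for idle time at full price.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("stages and micro-batches must both be >= 1")
    utilisation = num_microbatches / (num_microbatches + num_stages - 1)
    return price_per_gpu_hour / utilisation


# Hypothetical list price of 2.00 per GPU-hour on a 4-stage pipeline.
for m in (1, 4, 64):
    print(f"micro-batches={m:>2}  effective price={cost_per_useful_gpu_hour(2.0, 4, m):.2f}")
```

In this model a shallow batch of 4 micro-batches makes each useful GPU-hour cost 1.75 times the list price, which is the sense in which the technique pays off mainly at sustained batch depth.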


