Pipeline parallelism (PP)

Pipeline parallelism splits a deep network along the layer dimension into sequential stages, places each stage on a different GPU, and streams activations forward so that several micro-batches occupy the pipeline at once.
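As a rough illustration of the stage split and the streaming of micro-batches, the sketch below partitions layer indices into contiguous stages and prints which micro-batch each stage would hold at each step. It is a toy model, not an engine implementation: the function names `partition_layers` and `forward_schedule` are hypothetical, the pass is forward-only, and every stage is assumed to take exactly one time step per micro-batch.

```python
from __future__ import annotations


def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def forward_schedule(num_stages: int, num_microbatches: int) -> list[list[int | None]]:
    """schedule[t][s] is the micro-batch stage s works on at step t, or None if idle.

    Assumption: each stage takes one step per micro-batch, so micro-batch i
    reaches stage s at step i + s.
    """
    steps = num_stages + num_microbatches - 1
    return [
        [t - s if 0 <= t - s < num_microbatches else None for s in range(num_stages)]
        for t in range(steps)
    ]


if __name__ == "__main__":
    for stage, layers in enumerate(partition_layers(10, 4)):
        print(f"stage {stage}: layers {list(layers)}")
    for t, row in enumerate(forward_schedule(3, 4)):
        print(f"step {t}: {row}")
```

The `None` entries at the start and end of the schedule are the fill and drain phases; in the middle steps every stage holds a different micro-batch.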

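Each cross-stage hop is a point-to-point send rather than a collective, which is why PP tolerates slower interconnects than tensor parallelism does. The price is a fill-and-drain bubble, which amortises only when many micro-batches are in flight: under an idealised schedule with p equal-cost stages and m micro-batches, each stage sits idle for p - 1 of its m + p - 1 time slots.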
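To make the amortisation concrete, the sketch below computes the idle share of a simple fill-and-drain schedule. It assumes p stages of equal cost and m micro-batches, so each stage is busy for m of m + p - 1 time slots; real schedules with uneven stages or interleaving will differ. The function name `bubble_fraction` is illustrative, not taken from any library.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle share of each stage's timeline in an idealised fill-and-drain schedule.

    Assumption: all stages take the same time per micro-batch, so the
    pipeline runs for (m + p - 1) slots and each stage idles for (p - 1).
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("stages and micro-batches must both be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    stages = 4
    for m in (1, 4, 16, 64):
        print(f"p={stages} m={m:>2}  bubble={bubble_fraction(stages, m):.1%}")
```

With four stages, the bubble falls from three quarters of the timeline at one micro-batch to under a tenth once a few dozen micro-batches are in flight, which is the sense in which a large in-flight count amortises it.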

For the longer treatment, see How engines work: pipeline parallelism.

Cost and reliability implications

Pipeline parallelism buys model capacity over cheaper interconnect: because stages communicate by point-to-point sends, a pipeline can span nodes across links too slow for tensor parallelism. The cost is the bubble: GPUs sit idle during pipeline fill and drain, and the provider still bills for those cycles. At low request rates too few micro-batches are in flight to hide the bubble and utilisation sags, so the technique pays off mainly under sustained batch depth.
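The billing effect can be estimated with a back-of-the-envelope calculation. The sketch below assumes the same idealised schedule (equal-cost stages, utilisation m / (m + p - 1)) and a hypothetical hourly GPU price; the function names and the price of 2.0 are placeholders for illustration, not figures from any provider.

```python
def pipeline_utilisation(num_stages: int, num_microbatches: int) -> float:
    """Busy share of each GPU under an idealised fill-and-drain schedule."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("stages and micro-batches must both be >= 1")
    return num_microbatches / (num_microbatches + num_stages - 1)


def price_per_useful_gpu_hour(
    billed_price: float, num_stages: int, num_microbatches: int
) -> float:
    """Billed price divided by utilisation: what an hour of real work costs."""
    return billed_price / pipeline_utilisation(num_stages, num_microbatches)


if __name__ == "__main__":
    hourly_price = 2.0  # hypothetical price per GPU-hour, in arbitrary currency
    for m in (1, 4, 32):
        cost = price_per_useful_gpu_hour(hourly_price, num_stages=4, num_microbatches=m)
        print(f"micro-batches in flight={m:>2}  cost per useful GPU-hour={cost:.2f}")
```

Under these assumptions a four-stage pipeline with a single micro-batch in flight pays four times the list price per hour of useful work, and the premium shrinks towards zero as batch depth grows.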

References

Huang et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (NeurIPS 2019). The foundational paper; introduces synchronous pipeline parallelism with micro-batching and quantifies the bubble penalty.
Narayanan et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021). Develops the interleaved 1F1B pipeline schedule and the composition of pipeline parallelism with tensor parallelism that production stacks now follow.