Tensor parallelism (TP)

Tensor parallelism shards each transformer layer's weight matrices across N GPUs and synchronises the per-layer activations through an all-reduce after every sublayer, so the N devices jointly execute the forward pass of a model whose weights do not fit on any single one.
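The sharding described above can be sketched in plain Python for a single feed-forward sublayer. This is a minimal illustration, not any engine's implementation. It assumes the column-then-row split from the Megatron-LM paper listed in the references, uses ReLU as a stand-in nonlinearity, simulates the N devices as list entries, and replaces the real collective with an elementwise sum. All function names are hypothetical.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Elementwise nonlinearity (stand-in for the model's real activation)."""
    return [[max(0.0, v) for v in row] for row in m]

def split_columns(m, n):
    """Split a matrix into n equal column blocks, one per simulated device."""
    width = len(m[0]) // n
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n)]

def split_rows(m, n):
    """Split a matrix into n equal row blocks, one per simulated device."""
    height = len(m) // n
    return [m[i * height:(i + 1) * height] for i in range(n)]

def all_reduce_sum(partials):
    """Stand-in for an all-reduce: elementwise sum of every device's partial."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def mlp_reference(x, w1, w2):
    """Unsharded feed-forward sublayer: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)

def mlp_tensor_parallel(x, w1, w2, n_devices):
    """Each device holds one column block of w1 and the matching row block of w2."""
    w1_shards = split_columns(w1, n_devices)
    w2_shards = split_rows(w2, n_devices)
    # Every device computes a full-shape partial output from its own shards only.
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    # One all-reduce at the end of the sublayer recovers the full output.
    return all_reduce_sum(partials)

def max_abs_diff(a, b):
    """Largest elementwise difference between two same-shape matrices."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))

rng = random.Random(0)
x = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]   # 2 tokens, hidden size 4
w1 = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]  # 4 x 8 up-projection
w2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]  # 8 x 4 down-projection

sharded = mlp_tensor_parallel(x, w1, w2, n_devices=4)
print("max difference vs unsharded:", max_abs_diff(mlp_reference(x, w1, w2), sharded))
```

The column split works because the nonlinearity is elementwise, so each device can apply it to its own slice without communicating. The only synchronisation point in the sublayer is the final sum.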

The pattern depends on NVLink-class intra-node bandwidth: across slower interconnects the per-block all-reduces dominate step time and the scaling collapses, which is why TP is conventionally confined to a single node.
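A rough way to see why the interconnect matters is a bandwidth-only estimate of the all-reduce time per step. The sketch below assumes a ring all-reduce, in which each device sends about 2(N-1)/N times the message size, and ignores per-message latency and any overlap with compute. It also assumes two all-reduces per transformer block in the forward pass, as in the Megatron-LM paper listed in the references. The model shape and both bandwidth figures are illustrative placeholders, not vendor specifications.

```python
def ring_all_reduce_seconds(message_bytes, n_devices, link_bytes_per_second):
    """Bandwidth-only estimate for a ring all-reduce (latency ignored)."""
    if n_devices < 2:
        return 0.0  # a single device has nothing to synchronise
    # Each device sends (and receives) 2 * (N - 1) / N of the message.
    traffic_bytes = 2 * (n_devices - 1) / n_devices * message_bytes
    return traffic_bytes / link_bytes_per_second

def tp_comm_seconds_per_step(n_layers, all_reduces_per_layer, batch_tokens,
                             hidden_size, bytes_per_value, n_devices,
                             link_bytes_per_second):
    """Total all-reduce time for one forward pass over all layers."""
    # Each all-reduce moves one activation tensor: tokens x hidden size.
    message_bytes = batch_tokens * hidden_size * bytes_per_value
    per_call = ring_all_reduce_seconds(message_bytes, n_devices, link_bytes_per_second)
    return n_layers * all_reduces_per_layer * per_call

# Hypothetical model shape and link speeds, chosen only for illustration.
shape = dict(n_layers=32, all_reduces_per_layer=2, batch_tokens=2048,
             hidden_size=4096, bytes_per_value=2, n_devices=8)
fast_link = 300e9  # assumed bytes/s for an NVLink-class link
slow_link = 25e9   # assumed bytes/s for a PCIe-class link

fast = tp_comm_seconds_per_step(link_bytes_per_second=fast_link, **shape)
slow = tp_comm_seconds_per_step(link_bytes_per_second=slow_link, **shape)
print(f"fast link: {fast * 1e3:.2f} ms of all-reduce per step")
print(f"slow link: {slow * 1e3:.2f} ms of all-reduce per step")
```

In this model, communication time scales inversely with link bandwidth while per-device compute time does not change, so the slower the link, the larger the share of each step spent waiting on collectives.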

For the longer treatment, see How engines work: tensor parallelism.

Cost and reliability implications

Tensor parallelism is how a model too large for one GPU still serves with single-node latency, but the all-reduce after every sublayer makes it interconnect-bound: on PCIe-class links the collectives dominate step time and the cost per token rises instead of falling. Multi-GPU NCCL also adds a hang-class failure mode, in which a single misconfigured environment variable can stall the whole replica.
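The cost effect can be made concrete with a deliberately simple model. It assumes compute time divides evenly across the GPUs, communication time is added on top with no overlap, and the replica is billed per GPU-hour. The function name, prices, and timings are hypothetical inputs, not measurements.

```python
def dollars_per_million_tokens(n_gpus, gpu_dollars_per_hour, tokens_per_step,
                               compute_seconds_one_gpu, comm_seconds_per_step):
    """Cost of one million tokens for a tensor-parallel replica (toy model)."""
    # Assumption: compute splits evenly across GPUs; communication does not overlap it.
    step_seconds = compute_seconds_one_gpu / n_gpus + comm_seconds_per_step
    tokens_per_second = tokens_per_step / step_seconds
    # The whole replica is billed, whether the GPUs are computing or waiting.
    replica_dollars_per_second = n_gpus * gpu_dollars_per_hour / 3600
    return replica_dollars_per_second / tokens_per_second * 1e6

# Same hypothetical 8-GPU replica on two interconnects; only the
# communication time per step differs between the two calls.
common = dict(n_gpus=8, gpu_dollars_per_hour=2.0, tokens_per_step=2048,
              compute_seconds_one_gpu=0.40)
on_fast_link = dollars_per_million_tokens(comm_seconds_per_step=0.005, **common)
on_slow_link = dollars_per_million_tokens(comm_seconds_per_step=0.060, **common)
print(f"fast link: ${on_fast_link:.3f} per million tokens")
print(f"slow link: ${on_slow_link:.3f} per million tokens")
```

Because every GPU in the replica is paid for while it waits on a collective, any time added by communication raises the cost per token in direct proportion to the longer step.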

References

Shoeybi, Patwary, Puri, LeGresley, Casper, Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (arXiv:1909.08053, 2019). Introduces the canonical attention-head and feed-forward sharding pattern, with two all-reduces per transformer block, that every subsequent tensor-parallel implementation has inherited.
NVIDIA NVLink and NVSwitch product overview. Aggregate bidirectional bandwidth and topology details for the SXM-form-factor interconnect that sets the practical ceiling on intra-node tensor-parallel scaling.
