

Tensor parallelism (TP)

Tensor parallelism shards each transformer layer's weight matrices across N GPUs and synchronises the per-layer activations through an all-reduce after every sublayer, so the N devices jointly execute the forward pass of a model whose weights do not fit on any single one.
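The sharding described above can be sketched on a single machine. The toy below is a minimal illustration, not any engine's implementation: plain Python lists stand in for GPU tensors, the first MLP weight is split by columns and the second by rows (one common way to arrange the shards), and `all_reduce_sum` is a hypothetical stand-in for the collective that a library such as NCCL would run across devices. All function names are illustrative.

```python
# Toy tensor-parallel MLP sublayer: Y = relu(X @ A) @ B, sharded across n "devices".
# Pure Python lists stand in for GPU tensors; every name here is illustrative.

def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def split_columns(w, n):
    """Shard a matrix into n equal column blocks, one per device."""
    step = len(w[0]) // n
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(n)]

def split_rows(w, n):
    """Shard a matrix into n equal row blocks, one per device."""
    step = len(w) // n
    return [w[i * step:(i + 1) * step] for i in range(n)]

def all_reduce_sum(partials):
    """Stand-in for the collective: elementwise sum of every device's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tp_mlp(x, a, b, n):
    a_shards = split_columns(a, n)  # each device holds 1/n of A's columns
    b_shards = split_rows(b, n)     # and the matching 1/n of B's rows
    # Each device computes a full-shape but partial output from its own shards.
    partials = [matmul(relu(matmul(x, a_i)), b_i)
                for a_i, b_i in zip(a_shards, b_shards)]
    # One all-reduce after the sublayer combines the partial outputs.
    return all_reduce_sum(partials)

x = [[1.0, -2.0], [3.0, 0.5]]
a = [[1.0, -1.0, 2.0, 0.0], [0.5, 1.0, -1.0, 2.0]]
b = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [1.0, 1.0]]

reference = matmul(relu(matmul(x, a)), b)  # unsharded result
sharded = tp_mlp(x, a, b, n=2)             # two-way tensor parallel
assert sharded == reference
```

No device ever holds the full `a` or `b`, yet the summed result matches the unsharded computation. The only cross-device traffic in this sketch is the single sum at the end of the sublayer.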

The pattern depends on NVLink-class intra-node bandwidth: across slower interconnects the per-sublayer all-reduces dominate step time and scaling collapses, which is why TP is conventionally confined to a single node.
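A back-of-envelope model shows why link bandwidth matters so much. The sketch below assumes a ring all-reduce, in which each GPU sends and receives roughly 2(N-1)/N times the message size, and counts two all-reduces per layer (one after attention, one after the MLP). It ignores per-collective latency, which can matter more than bandwidth for small messages. The bandwidth figures and model dimensions are placeholder assumptions chosen for illustration, not measurements of any particular hardware.

```python
def ring_all_reduce_seconds(message_bytes, n_gpus, link_bytes_per_s):
    """Bandwidth term of a ring all-reduce (latency term ignored)."""
    if n_gpus < 2:
        return 0.0  # a single GPU has nothing to synchronise
    return 2 * (n_gpus - 1) / n_gpus * message_bytes / link_bytes_per_s

def comm_seconds_per_forward(n_layers, batch_tokens, hidden, bytes_per_elem,
                             n_gpus, link_bytes_per_s):
    """Total all-reduce time for one forward pass under the assumptions above."""
    # The activation being reduced has shape (batch_tokens, hidden).
    message_bytes = batch_tokens * hidden * bytes_per_elem
    # Two sublayers per layer, one all-reduce after each.
    return 2 * n_layers * ring_all_reduce_seconds(message_bytes, n_gpus, link_bytes_per_s)

# Hypothetical workload: 80 layers, 4096 tokens in flight, hidden size 8192, 2-byte activations.
workload = dict(n_layers=80, batch_tokens=4096, hidden=8192, bytes_per_elem=2, n_gpus=8)

# Placeholder link speeds in bytes per second (assumptions, not vendor specifications).
FAST_LINK = 300e9
SLOW_LINK = 25e9

fast = comm_seconds_per_forward(link_bytes_per_s=FAST_LINK, **workload)
slow = comm_seconds_per_forward(link_bytes_per_s=SLOW_LINK, **workload)
print(f"fast link: {fast * 1e3:.1f} ms of all-reduce per forward pass")
print(f"slow link: {slow * 1e3:.1f} ms of all-reduce per forward pass")
```

Under this model the communication time scales inversely with link bandwidth while the compute per GPU stays fixed, so on the slower link the collectives take a growing share of each step.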

For the longer treatment, see How engines work: tensor parallelism.

Cost and reliability implications

Tensor parallelism is how a model too large for one GPU still serves with single-node latency, but the all-reduce after every sublayer makes it interconnect-bound: on PCIe-class links the collectives dominate step time and dollars-per-token rises instead of falling. Multi-GPU NCCL also adds a hang-class failure mode, where a single misconfigured environment variable stalls the whole replica.
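The cost effect is simple arithmetic: a replica bills for every GPU it holds, so doubling the GPU count only lowers cost per token if throughput more than doubles. The sketch below uses made-up prices and throughputs purely to show the direction of the effect; none of the numbers are benchmarks.

```python
def dollars_per_million_tokens(n_gpus, gpu_dollars_per_hour, tokens_per_second):
    """Cost of generating one million tokens on a replica of n_gpus."""
    replica_dollars_per_second = n_gpus * gpu_dollars_per_hour / 3600
    return replica_dollars_per_second / tokens_per_second * 1_000_000

GPU_PRICE = 2.0  # hypothetical dollars per GPU-hour

# Hypothetical throughputs (tokens per second) for the same model.
tp2_baseline = dollars_per_million_tokens(2, GPU_PRICE, tokens_per_second=1000)
tp4_fast_link = dollars_per_million_tokens(4, GPU_PRICE, tokens_per_second=2200)
tp4_slow_link = dollars_per_million_tokens(4, GPU_PRICE, tokens_per_second=1200)

# With these inputs the fast-link replica is cheaper per token than the baseline,
# while the slow-link replica is more expensive despite having twice the GPUs.
assert tp4_fast_link < tp2_baseline < tp4_slow_link
```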


