Tensor parallelism (TP)
Tensor parallelism shards each transformer layer's weight matrices across N GPUs, so that each device holds a slice of every layer and computes its share of that layer's output. An all-reduce after every sublayer synchronises the activations, letting the N devices jointly execute the forward pass of a model whose weights do not fit on any single device.
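As a rough illustration of the sharding described above, the following sketch simulates one MLP sublayer on a single machine with NumPy. The function names, the shapes, and the column-then-row split of the two weight matrices are assumptions made for the example, not something this entry prescribes; the all-reduce is modelled as a plain sum over the per-device partial outputs.

```python
import numpy as np

def mlp_reference(x, w1, w2):
    """Unsharded two-matmul MLP with a ReLU in between."""
    return np.maximum(x @ w1, 0.0) @ w2

def mlp_tensor_parallel(x, w1, w2, n_devices):
    """Simulate the same MLP with its weights sharded over n_devices."""
    # Assumed layout: split w1 by columns and w2 by rows along the hidden dimension.
    w1_shards = np.split(w1, n_devices, axis=1)
    w2_shards = np.split(w2, n_devices, axis=0)
    # Each simulated device computes a partial output from its own shards only.
    partials = [np.maximum(x @ a, 0.0) @ b for a, b in zip(w1_shards, w2_shards)]
    # Stand-in for the all-reduce: sum the partial outputs across devices.
    return sum(partials)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))     # 4 tokens, model width 8 (hypothetical sizes)
w1 = rng.standard_normal((8, 32))   # up-projection
w2 = rng.standard_normal((32, 8))   # down-projection

out_ref = mlp_reference(x, w1, w2)
out_tp = mlp_tensor_parallel(x, w1, w2, n_devices=4)
print(np.allclose(out_ref, out_tp))
```

Because the ReLU is elementwise, the column split of the first matrix needs no communication; only the final sum does, which is where the single all-reduce for this sublayer comes from.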
The pattern depends on NVLink-class intra-node bandwidth: across slower interconnects the per-sublayer all-reduces dominate step time and the scaling collapses, which is why TP is conventionally confined to a single node.
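The bandwidth sensitivity can be sketched with a back-of-envelope estimate. The snippet below assumes a ring all-reduce, in which each device moves 2(N-1)/N of the message, and ignores latency and any overlap with compute. The function name, the workload size, and both bandwidth figures are hypothetical placeholders chosen for illustration, not measurements of any real interconnect.

```python
def ring_allreduce_seconds(message_bytes, n_devices, link_bytes_per_s):
    """Bandwidth-only estimate: each device moves 2*(N-1)/N of the message."""
    if n_devices < 2:
        return 0.0  # a single device has nothing to reduce
    return 2 * (n_devices - 1) / n_devices * message_bytes / link_bytes_per_s

# Hypothetical activation tensor: 4096 tokens x 8192 hidden size, 2 bytes each.
message_bytes = 4096 * 8192 * 2

# Illustrative placeholder bandwidths in bytes per second (assumptions).
fast_link = 300e9
slow_link = 25e9

t_fast = ring_allreduce_seconds(message_bytes, 8, fast_link)
t_slow = ring_allreduce_seconds(message_bytes, 8, slow_link)

print(f"fast link: {t_fast * 1e3:.3f} ms per all-reduce")
print(f"slow link: {t_slow * 1e3:.3f} ms per all-reduce")
print(f"slowdown:  {t_slow / t_fast:.1f}x")
```

Under this model the cost of each all-reduce scales inversely with link bandwidth, and it is paid after every sublayer of every layer, so a slower link multiplies directly into step time.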
For the longer treatment, see How engines work: tensor parallelism.