Speculative decoding (Leviathan / Chen, 2023)
Two concurrent 2023 formulations (Leviathan, Kalman, Matias at Google Research; Chen, Borgeaud, Irving, Lespiau, Sifre, Jumper at DeepMind) introduced a draft-and-verify decoding loop. A small draft model proposes K tokens, a single forward pass of the large target model scores all K positions in parallel, and a rejection-sampling acceptance rule keeps or replaces each proposal so that the output follows the target sampling distribution exactly.
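The acceptance rule can be sketched in a few lines. This is a minimal illustration, not either paper's reference implementation: the names `verify_draft` and `sample` are hypothetical, distributions are plain Python lists over a toy vocabulary, and the draft and target probabilities are assumed to be already computed. A draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q is the draft distribution. On the first rejection, a replacement is drawn from the normalized residual max(p - q, 0) and the step ends. If all K drafts are accepted, one bonus token is drawn from the target's distribution at position K+1.

```python
import random


def sample(dist, rng):
    """Draw one index from a list of non-negative weights."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def verify_draft(p_target, q_draft, draft_tokens, rng):
    """Apply the speculative acceptance rule to one batch of K draft tokens.

    p_target:     K+1 target distributions (one per draft position, plus one
                  for the position after the last draft token).
    q_draft:      K draft distributions, one per draft position.
    draft_tokens: K token ids, each sampled from the matching q_draft row.
    Returns between 1 and K+1 emitted token ids.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        # q[tok] > 0 because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(p - q, 0).
        # rng.choices normalizes the weights, so no explicit division is needed.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        emitted.append(sample(residual, rng))
        return emitted  # everything after the first rejection is discarded
    # All K drafts accepted: take one extra token from the target for free.
    emitted.append(sample(p_target[len(draft_tokens)], rng))
    return emitted


# Toy usage with K = 1 and a three-token vocabulary (made-up numbers).
rng = random.Random(0)
p = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
q = [[0.2, 0.5, 0.3]]
draft = [sample(q[0], rng)]
print(verify_draft(p, q, draft, rng))
```

Every call emits at least one token, so the loop never makes slower progress per target pass than ordinary decoding. What it costs is the extra draft-model work.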
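The technique raises arithmetic intensity on bandwidth-bound decode workloads, because each load of the target weights scores up to K+1 positions instead of one. On common chat traffic it delivers roughly 2x to 3x lower per-token latency at zero quality cost, since the output distribution is unchanged.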
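A back-of-the-envelope model shows where a latency gain of this size can come from. The sketch below follows the simplified analysis in Leviathan et al., which assumes each draft token is accepted independently with the same probability alpha. That is an idealization, and the function and parameter names here are hypothetical. Under that assumption the expected number of tokens emitted per target pass is (1 - alpha^(K+1)) / (1 - alpha). Dividing by the relative cost of one step (K draft passes at `cost_ratio` each, plus one target pass) gives an estimated speedup.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target forward pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha (a modelling simplification, not a measured quantity).
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, k, cost_ratio):
    """Estimated latency speedup over plain autoregressive decoding.

    cost_ratio is the time of one draft-model pass divided by the time of
    one target-model pass (an assumed input; measure it on real hardware).
    """
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1.0)


# Illustrative numbers only: 80% acceptance, K = 4, draft 20x cheaper.
print(estimated_speedup(alpha=0.8, k=4, cost_ratio=0.05))
```

With these illustrative inputs the estimate comes out near 2.8x. The model ignores batching, scheduling overhead, and the fact that acceptance rates vary with position and prompt, so treat it as a rough guide rather than a prediction.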
For the longer treatment in narrative context, see How engines work: speculative decoding.