Speculative decoding is a draft-and-verify technique published concurrently by Google Research and DeepMind in 2023: a small draft model proposes several tokens, one forward pass of the large target model verifies them, and a probabilistic acceptance rule keeps the output distribution exactly equal to that of standard decoding while cutting per-token latency by a factor of two to three.

Speculative decoding (Leviathan / Chen, 2023)

Two concurrent 2023 formulations (Leviathan, Kalman, Matias at Google Research; Chen, Borgeaud, Irving, Lespiau, Sifre, Jumper at DeepMind) introduced a draft-and-verify decoding loop: a small draft model proposes K tokens, a single forward pass of the large target model verifies them, and a probabilistic acceptance rule preserves the target sampling distribution exactly.
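The verify step can be sketched in a few lines. This is a minimal illustration, not either paper's reference code: the function name `verify_draft` and the list-of-lists probability layout are assumptions made for the example, and real engines work on batched logits on the accelerator. The acceptance test (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(p - q, 0)) is the rule both papers describe.

```python
import random


def verify_draft(p_target, q_draft, drafted, rng):
    """Apply the speculative-sampling acceptance rule to K drafted tokens.

    p_target: K + 1 target-model distributions (lists of probabilities), one per
              drafted position plus one for the position after the last draft.
    q_draft:  K draft-model distributions that the drafted tokens were sampled from.
    drafted:  K token ids proposed by the draft model.
    Returns the tokens to emit: between 1 and K + 1 of them.
    (Hypothetical helper for illustration; names and layout are assumptions.)
    """
    emitted = []
    for i, tok in enumerate(drafted):
        p, q = p_target[i], q_draft[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(p - q, 0) and stop.
        # random.choices normalizes the weights for us.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every draft was accepted: the same target pass also yields one extra token.
    bonus = p_target[len(drafted)]
    emitted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return emitted
```

When the draft and target distributions agree, every drafted token is accepted and the pass emits K + 1 tokens; when they disagree, the residual resampling is what keeps the emitted token distributed exactly as the target model would have sampled it.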

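The technique restores arithmetic intensity on bandwidth-bound decode workloads, because each read of the target model's weights now scores up to K + 1 positions rather than one, and delivers roughly two to three times lower per-token latency at zero quality cost on common chat traffic.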
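A rough way to see where a figure like two to three times comes from is the expected-tokens formula in Leviathan et al. The sketch below assumes, as that analysis does, that each drafted token is accepted independently with the same probability `alpha`, and that one draft-model step costs a fixed fraction `cost_ratio` of one target-model pass. The function names and the example numbers are illustrative assumptions, not measurements.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target pass with k drafted tokens.

    Assumes i.i.d. acceptance with probability alpha (a simplification).
    Closed form: (1 - alpha**(k + 1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, k, cost_ratio):
    """Estimated wall-clock speedup over plain decoding.

    cost_ratio is the (assumed) cost of one draft step relative to one
    target pass; one iteration costs k draft steps plus one target pass.
    """
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    # Illustrative inputs only: 80% acceptance, 4 drafted tokens, draft at 5% cost.
    print(round(expected_tokens_per_pass(0.8, 4), 4))  # 3.3616
    print(round(estimated_speedup(0.8, 4, 0.05), 2))   # 2.8
```

With these illustrative inputs the target model emits about 3.36 tokens per pass, for an estimated speedup of about 2.8 times. Lower acceptance rates or a more expensive draft model shrink the gain, which is why the benefit depends on the traffic.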

For the longer treatment in narrative context, see How engines work: speculative decoding.

References

Leviathan, Kalman, Matias. Fast Inference from Transformers via Speculative Decoding (ICML 2023). Google Research formulation; introduces the speculative-sampling acceptance rule that preserves the target distribution exactly.
Chen, Borgeaud, Irving, Lespiau, Sifre, Jumper. Accelerating Large Language Model Decoding with Speculative Sampling (2023). Concurrent DeepMind formulation, posted within weeks of Leviathan et al.; same acceptance rule, separate derivation, broader empirical evaluation on Chinchilla-class models.