
Speculative decoding is a draft-and-verify technique published concurrently by Google Research and DeepMind in 2023: a small draft model proposes several tokens, one forward pass of the large target model verifies them, and a probabilistic acceptance rule keeps the output distribution exactly equal to that of standard decoding while cutting per-token latency by a factor of two to three.

Speculative decoding (Leviathan / Chen, 2023)

Two concurrent 2023 formulations (Leviathan, Kalman, Matias at Google Research; Chen, Borgeaud, Irving, Lespiau, Sifre, Jumper at DeepMind) introduced a draft-and-verify decoding loop in which a small model proposes K tokens, a single forward pass of the large target verifies them, and a probabilistic acceptance rule preserves the target sampling distribution exactly.
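The acceptance rule is compact enough to sketch. The Python below is a minimal illustration, not any engine's actual implementation: the function names (sample, residual, verify) are hypothetical, and distributions are plain lists of probabilities over a toy vocabulary. Each drafted token x is kept with probability min(1, p(x)/q(x)), where q is the draft distribution and p the target distribution at that position. On the first rejection, a replacement is drawn from the normalized positive part of p - q and the loop stops. If all K drafts survive, one extra token is drawn from the target's distribution at position K+1.

```python
import random

def sample(dist, rng):
    """Draw one token id from a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def residual(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:  # p == q, so a rejection cannot actually occur
        return list(p)
    return [d / total for d in diff]

def verify(draft_tokens, q_dists, p_dists, rng):
    """Check K drafted tokens against the target.

    draft_tokens: K token ids, the i-th sampled from q_dists[i]
    q_dists:      K draft distributions
    p_dists:      K + 1 target distributions (one target forward pass)
    Returns the accepted prefix plus one token drawn from the target side.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # q[tok] > 0 because tok was sampled from q
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted: keep the drafted token
        else:
            out.append(sample(residual(p, q), rng))  # rejected: corrected token
            return out
    # every draft accepted: take a bonus token from the target's next position
    out.append(sample(p_dists[len(draft_tokens)], rng))
    return out

# Toy usage: the same p and q are reused at every position, which a real
# model would not do (its distributions depend on the prefix).
rng = random.Random(0)
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
drafts = [sample(q, rng) for _ in range(3)]
tokens = verify(drafts, [q] * 3, [p] * 4, rng)
```

Each call returns between 1 and K+1 tokens, so the loop always makes progress even when the first draft is rejected. Resampling from the residual rather than from p itself is what makes the emitted tokens follow p exactly.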

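Autoregressive decoding is memory-bandwidth-bound, so scoring K drafted tokens in one target pass costs about the same as scoring a single token. The technique therefore raises arithmetic intensity on decode workloads and delivers roughly two to three times lower per-token latency at zero quality cost on common chat traffic.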
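Where a figure in that range comes from can be estimated with the closed-form expression from the Leviathan et al. analysis. The sketch below assumes each drafted token is accepted independently with the same probability alpha, that gamma tokens are drafted per step (the K above), and that one draft-model step costs a fraction c of one target-model pass. The function names and the example inputs are illustrative, not measurements.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per target pass: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha >= 1.0:  # every draft accepted, plus the bonus token
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_speedup(alpha, gamma, c):
    """Expected wall-time improvement over plain decoding.

    One speculative step costs gamma draft steps (each c target passes)
    plus one target pass, and emits expected_tokens(alpha, gamma) tokens.
    """
    return expected_tokens(alpha, gamma) / (gamma * c + 1.0)

# Hypothetical operating point: 80% acceptance, 4 drafted tokens,
# draft model at 5% of the target's per-pass cost.
tokens_per_pass = expected_tokens(0.8, 4)
speedup = expected_speedup(0.8, 4, 0.05)
```

At that hypothetical operating point the formula gives about 3.36 tokens per target pass and a speedup of roughly 2.8x. Lowering the acceptance rate or making the draft model more expensive pulls the number down quickly, which is why the gain depends on how well the draft model matches the traffic.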

For the longer treatment in narrative context, see How engines work: speculative decoding.



The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.