Speculative decoding
Autoregressive decoding is memory-bandwidth-bound at small batch sizes: generating each new token requires reading every model weight from GPU memory once, however little arithmetic that token needs. Speculative decoding attacks that bound by having one pass of the large model verify several candidate tokens at once instead of producing just one.
How verification keeps outputs exact
A small draft model proposes a short run of tokens. The target model scores all of them in a single parallel forward pass and accepts each proposal with a probability that corrects for the difference between the two distributions. The first rejection truncates the run, and a token resampled from the target model's distribution, adjusted to remove the probability mass the draft already covered, takes its place. If every proposal is accepted, the same pass yields one extra token from the target model. The rejection-sampling construction guarantees the output distribution is identical to decoding with the target model alone, so speculation is a pure systems optimisation, not a quality trade.
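The accept-or-resample step can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard speculative-sampling rule (accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resample from the renormalised residual max(0, p - q)). The function name verify_draft and the list-of-lists inputs are hypothetical; a real engine works on logit tensors on the GPU.

```python
import random


def verify_draft(draft_tokens, q_probs, p_probs, rng=None):
    """Return the tokens kept after one verification pass (illustrative only).

    draft_tokens: token ids proposed by the draft model.
    q_probs[i]:   draft distribution (list of floats) at position i.
    p_probs[i]:   target distribution at position i; needs
                  len(draft_tokens) + 1 entries so a bonus token can be
                  sampled when every draft token is accepted.
    """
    rng = rng or random.Random()
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q).
        # rng.choices normalises the weights for us.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out  # the first rejection truncates the run
    # Every draft token accepted: sample one extra token from the target.
    last = p_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

Each call returns between one and len(draft_tokens) + 1 tokens, which is why a single target pass never does worse than plain decoding in tokens produced, only in time spent if the draft is poor.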
When a draft model pays
The speedup grows with the acceptance rate and the draft length and shrinks with the cost of running the draft: roughly, it is the expected number of tokens produced per target pass divided by the combined time of that pass and the draft steps that fed it. A well-matched draft on predictable text yields multi-fold decode gains; a mismatched draft burns VRAM that could hold KV cache and can leave the deployment slower than its plain baseline. On spotinference's h100x2 FP8 tier that baseline is 118.9 tokens per second of short-context decode, the number any speculative configuration has to beat.
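A back-of-the-envelope estimate makes the trade-off concrete. The sketch below assumes each drafted token is accepted independently with the same probability alpha, which is a simplification, and that one draft step costs a fixed fraction cost_ratio of one target pass. Under those assumptions the expected tokens per target pass for draft length gamma is (1 - alpha^(gamma+1)) / (1 - alpha). The function and parameter names are hypothetical and the inputs are not measurements.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target pass, counting the final
    resampled or bonus token, assuming i.i.d. acceptance probability alpha."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Estimated speedup over plain decoding.

    cost_ratio: time of one draft step divided by time of one target pass
                (an assumed, deployment-specific number).
    """
    # One round costs gamma draft steps plus one target pass.
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    # Illustrative inputs only: a well-matched and a mismatched draft.
    for alpha, cost_ratio in [(0.8, 0.05), (0.2, 0.2)]:
        print(alpha, cost_ratio, round(expected_speedup(alpha, 4, cost_ratio), 2))
```

A result below 1.0 means the configuration would land under the plain baseline, here 118.9 tokens per second, before even counting the VRAM the draft model takes from the KV cache.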
For the longer treatment, see How engines work: speculative decoding.