Speculative decoding accelerates autoregressive generation by letting a small draft model propose several tokens ahead, which the large target model verifies in one parallel forward pass. Accepted tokens ship as a batch; the first rejection is replaced by a corrected sample from the target model, so the output distribution is provably identical to standard decoding.

Speculative decoding

Autoregressive decoding is memory-bandwidth-bound: each new token requires streaming every weight through the GPU once. Speculative decoding attacks that bound by making one pass of the large model verify several candidate tokens at once instead of producing just one.
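A back-of-the-envelope sketch of that bound follows. It assumes single-stream decode where every token streams all weights once, and it ignores KV-cache traffic and compute time. The function names and the example figures (70 GB of weights, 3 TB/s of memory bandwidth, three tokens per pass) are hypothetical round numbers chosen for illustration, not measurements of any tier.

```python
def decode_bound_tokens_per_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if each token streams all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


def verified_bound_tokens_per_s(
    weight_bytes: float, bandwidth_bytes_per_s: float, tokens_per_pass: float
) -> float:
    """Same bound when one pass of the large model yields several tokens.

    Assumes a verification pass streams the weights once, like a normal
    decode step, so the pass rate is unchanged and only the yield grows.
    """
    return tokens_per_pass * decode_bound_tokens_per_s(weight_bytes, bandwidth_bytes_per_s)


if __name__ == "__main__":
    weights = 70e9      # hypothetical: 70 GB of weights
    bandwidth = 3e12    # hypothetical: 3 TB/s of memory bandwidth
    print(decode_bound_tokens_per_s(weights, bandwidth))
    print(verified_bound_tokens_per_s(weights, bandwidth, tokens_per_pass=3.0))
```

The point of the sketch is that the number of passes per second stays fixed by bandwidth, so the only lever is how many tokens each pass yields.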

How verification keeps outputs exact

A small draft model proposes a short run of tokens. The target model scores all of them in a single parallel forward pass, accepting each proposal with probability min(1, p/q), where p and q are the probabilities the target and the draft assign to that token. The first rejection truncates the run, and a token resampled from the residual distribution (the positive part of the target's probabilities minus the draft's, renormalised) takes its place. If every proposal is accepted, the same pass yields one extra token from the target. The rejection-sampling construction guarantees the output distribution is identical to decoding with the target model alone, so speculation is a pure systems optimisation, not a quality trade.
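The verification step can be sketched in a few lines. This is a minimal illustration of the standard speculative-sampling rule, not any particular engine's implementation: each proposal is accepted with probability min(1, p/q), and the first rejection is replaced by a draw from the renormalised positive part of p minus q. The names `sample` and `verify` and the dict-based distributions are assumptions made for readability; a real engine works on logit tensors.

```python
import random


def sample(dist, rng):
    """Draw one token id from a dict mapping token id -> non-negative weight."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights)[0]


def verify(draft_tokens, q_dists, p_dists, rng):
    """Return the tokens emitted by one verification pass.

    draft_tokens: k token ids proposed by the draft model
    q_dists:      k draft distributions, one per proposed position
    p_dists:      k + 1 target distributions from the single parallel pass
    """
    emitted = []
    for tok, q, p in zip(draft_tokens, q_dists, p_dists):
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            emitted.append(tok)
            continue
        # First rejection: resample from the positive part of (p - q).
        # rng.choices normalises the weights, so no explicit division is needed.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        emitted.append(sample(residual, rng))
        return emitted
    # Every proposal accepted: the same pass supplies one extra target token.
    emitted.append(sample(p_dists[len(draft_tokens)], rng))
    return emitted
```

Sampling many single-token runs from a mismatched draft and tallying the emitted tokens is a quick way to check that their frequencies follow the target distribution rather than the draft's.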

When a draft model pays

The speedup grows with acceptance rate and draft length and shrinks with the cost of running the draft. A well-matched draft on predictable text yields multi-fold decode gains; a mismatched draft burns VRAM that could hold KV cache and can leave the deployment slower than its plain baseline. On spotinference's h100x2 FP8 tier that baseline is 118.9 tokens per second of short-context decode, the number any speculative configuration has to beat.
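One common way to put numbers on this trade-off is the idealised model from the speculative decoding literature, which assumes each proposed token is accepted independently with the same probability. The sketch below uses that model; `alpha` (acceptance rate), `gamma` (draft length), and `draft_cost` (one draft step as a fraction of one target pass) are illustrative parameters, and the two configurations are hypothetical, not measured on any spotinference tier. Only the 118.9 tokens-per-second baseline comes from the text.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target pass, assuming i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every proposal accepted, plus the extra token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Tokens per pass divided by the relative cost of one draft-and-verify round."""
    return expected_tokens_per_pass(alpha, gamma) / (gamma * draft_cost + 1.0)


BASELINE_TOK_S = 118.9  # plain short-context decode on the h100x2 FP8 tier

if __name__ == "__main__":
    # Hypothetical configurations: a well-matched cheap draft, then a
    # mismatched and relatively expensive one.
    for alpha, gamma, cost in [(0.8, 4, 0.05), (0.2, 4, 0.30)]:
        s = estimated_speedup(alpha, gamma, cost)
        print(f"alpha={alpha} gamma={gamma} cost={cost}: "
              f"{s:.2f}x, about {s * BASELINE_TOK_S:.0f} tok/s")
```

Under these assumptions the first configuration lands well above the baseline and the second falls below it, which is the failure mode described above. Real acceptance is neither constant nor independent across positions, so treat the output as a planning estimate only.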

For the longer treatment, see How engines work: speculative decoding.

Cost and reliability implications

Speculation multiplies a baseline the operator must already know: spotinference's h100x2 FP8 tier measures 118.9 tokens per second of plain short-context decode, and a draft model pays for itself only while the tokens it gets accepted outweigh the VRAM and compute it occupies. When the workload drifts away from the draft model's distribution, acceptance collapses and the deployment runs slower than the unassisted baseline while still paying for two models.

Part of Model behavior on the learn hub.

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.