Spot vs on-demand GPU inference: cost against latency
Spot capacity cuts the GPU bill substantially, often by more than half, in exchange for the provider's right to reclaim the machine. On-demand buys certainty at full rate. For inference the strong position is usually neither extreme but a priority order: spot first, on-demand as fallback, on an architecture that makes interruption survivable instead of catastrophic.
The trade, restated as a price
The glossary entry defines the two markets; this page prices the choice. Spot's discount is not free money: it converts a fixed cost (the on-demand rate) into a variable one (a lower rate plus a stochastic interruption tax). The tax has three components: requests in flight when the machine disappears, the cold start to restore capacity, and the engineering needed to keep both small. Whether the discount beats the tax depends almost entirely on what an interruption costs the workload, which is a latency question before it is a dollar question.
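One way to write that conversion down, as a sketch and not this fleet's accounting: every symbol below is an assumption introduced for illustration. Here r_spot and r_od are the hourly spot and on-demand rates, lambda is the interruption rate per instance-hour, c_flight and c_cold are the dollar costs of lost in-flight requests and of the cold start per interruption, and e is the engineering effort amortised per instance-hour.

```latex
\[
C_{\text{spot}} \;=\; r_{\text{spot}} \;+\; \lambda\,\bigl(c_{\text{flight}} + c_{\text{cold}}\bigr) \;+\; e,
\qquad
\text{spot is the cheaper choice when } C_{\text{spot}} < r_{\text{od}} .
\]
```

The latency question enters through c_flight and c_cold: for a workload that tolerates a minutes-long excursion both are close to zero, while for a workload with a tight latency contract they can dominate the whole expression.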
What reclamation does to a serving fleet
When the provider takes a machine back, every request on it fails mid-decode and every byte of GPU state evaporates, including the model weights in VRAM and the KV cache of in-flight sequences. Notice periods are short where they exist at all: the best-known spot market gives a two-minute warning, which is enough to drain the instance out of a load balancer but not enough to finish multi-minute generations gracefully. Three design responses determine whether this is an outage or a blip:
- Persistent weights. If model weights and compile caches live on a volume that survives the instance, recovery is a disk mount plus an engine start instead of a multi-gigabyte re-download. This single decision moves recovery from tens of minutes to single-digit minutes; it is the same mechanism that bounds planned hibernation, per Reliability: hibernation.
- A capacity fallback. A priority-ordered fleet walks down a list: spot tier first, on-demand tier when spot is gone. Customers see a slower request, not a refused one.
- An honest client contract. Interrupted requests fail with retryable errors, and clients that follow the retry guidance reissue them; chat completions are stateless, so a reissued request loses nothing but time.
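The second and third responses can be sketched in a few lines. This is a minimal illustration under stated assumptions, not this fleet's implementation: `Tier`, `RetryableError`, `pick_tier`, and `call_with_retry` are hypothetical names, and exponential backoff is an assumed retry policy, since the actual retry guidance is not reproduced on this page.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class RetryableError(Exception):
    """Hypothetical error for a request that died with its machine."""


@dataclass(frozen=True)
class Tier:
    name: str           # illustrative label, e.g. "spot" or "on-demand"
    has_capacity: bool  # whether the tier can accept requests right now


def pick_tier(tiers: Sequence[Tier]) -> Optional[Tier]:
    """Walk the priority-ordered list and return the first tier with capacity."""
    for tier in tiers:
        if tier.has_capacity:
            return tier
    return None  # no capacity anywhere: the caller must refuse or queue


def call_with_retry(
    send: Callable[[str], str],
    request: str,
    max_attempts: int = 4,       # assumed value, not from the source
    base_delay_s: float = 0.5,   # assumed value, not from the source
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Reissue a stateless request when it fails with a retryable error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return send(request)
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay_s * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    raise AssertionError("unreachable")
```

Because the request is stateless, the retry needs no bookkeeping beyond the original payload; the only thing the client loses is the time spent on the failed attempt and the backoff.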
The latency cost, bounded
The recovery path after an interruption is the same machinery as a scale-to-zero wake: restore or replace the machine, mount the weights, start the engine. On this fleet that path is budgeted at 8 minutes end to end with a 10-minute hard cap; the wake-anatomy article traces the decomposition. That bound is what makes spot capacity governable: an interruption costs, at worst, a bounded number of minutes of elevated latency on one tier while the fallback absorbs traffic. Without the bound (weights on ephemeral disk, no fallback tier), the same interruption is an open-ended outage, and no discount prices that correctly.
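A recovery can be checked against that budget mechanically. The sketch below assumes per-stage timings are available as a mapping from stage name to seconds; the stage names and the `classify_recovery` helper are hypothetical, and only the 8-minute budget and 10-minute cap come from the text above.

```python
from typing import Mapping

BUDGET_S = 8 * 60     # end-to-end recovery budget stated above
HARD_CAP_S = 10 * 60  # hard cap stated above


def classify_recovery(stage_seconds: Mapping[str, float]) -> str:
    """Compare a recovery's total duration with the budget and the hard cap."""
    total = sum(stage_seconds.values())
    if total <= BUDGET_S:
        return "within_budget"
    if total <= HARD_CAP_S:
        return "over_budget"   # slower than planned, still inside the bound
    return "cap_exceeded"      # the bound that prices the risk did not hold


# Illustrative timings only; these are not measurements from this fleet.
example = {"restore_machine": 200.0, "mount_weights": 60.0, "start_engine": 150.0}
status = classify_recovery(example)
```

Treating "cap_exceeded" as an incident instead of a slow wake keeps the bound honest: the pricing argument depends on it holding.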
The arithmetic: when spot wins
Spot wins when the discount collected on every hour adds up to more than the interruption tax paid on a few of them. The qualitative break-even is easy to state: workloads tolerant of occasional minutes-long latency excursions (batch processing, internal tools, anything queued) should run on spot essentially always, because the discount is certain and the tax is rare and bounded. Workloads with hard per-request latency contracts tighter than the recovery bound need warm on-demand capacity in the path, either as the primary or as a standing fallback. The blended position, spot-first with fallback, captures most of the discount while capping the worst case; that stance is one of the inputs behind this fleet's published cost of goods of $0.30 to $0.70 per million output tokens at good utilisation, accounted in the invoice-true article.
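The break-even can also be put in numbers once a dollar cost is assigned to one interruption. The sketch below is an assumption-laden illustration: the function names are hypothetical, the cost of an interruption is treated as a single constant, and the rates in the example are placeholders, not quoted prices.

```python
def break_even_interruptions_per_hour(
    on_demand_rate: float,         # $/hour at full rate
    spot_rate: float,              # $/hour at the discounted rate
    cost_per_interruption: float,  # $ lost per reclamation (assumed constant)
) -> float:
    """Interruption rate at which spot's hourly discount is fully consumed."""
    if cost_per_interruption <= 0:
        raise ValueError("cost_per_interruption must be positive")
    return (on_demand_rate - spot_rate) / cost_per_interruption


def spot_wins(
    on_demand_rate: float,
    spot_rate: float,
    cost_per_interruption: float,
    interruptions_per_hour: float,
) -> bool:
    """True when the expected tax per hour is smaller than the discount per hour."""
    tax_per_hour = interruptions_per_hour * cost_per_interruption
    return spot_rate + tax_per_hour < on_demand_rate


# Placeholder numbers: $4.00/h on-demand, $1.50/h spot, $5.00 per interruption.
threshold = break_even_interruptions_per_hour(4.00, 1.50, 5.00)
```

With these placeholder inputs the threshold is 0.5 interruptions per instance-hour. A tolerant workload has a small cost per interruption, which pushes the threshold up; a latency-sensitive workload has a large one, which pulls it down towards zero.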
A decision table
- Batch, queued, or internal traffic: spot, without much further analysis.
- Interactive traffic with seconds-level patience: spot-first with an on-demand fallback and a bounded recovery path.
- Hard sub-second SLOs on every request: warm on-demand or reserved capacity; spot only as overflow.
- Unknown or spiky demand: spot-first again; the same persistent-weights design that survives reclamation also enables scale-to-zero between bursts, per Reliability: scale to zero.
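The same table can be encoded as a lookup, which is useful when placement is decided by configuration and not by hand. This is a sketch: the `Traffic` categories and `Plan` fields are hypothetical names that mirror the four rows above and add nothing to them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Traffic(Enum):
    BATCH_QUEUED_OR_INTERNAL = "batch_queued_or_internal"
    INTERACTIVE_SECONDS_PATIENCE = "interactive_seconds_patience"
    HARD_SUBSECOND_SLO = "hard_subsecond_slo"
    UNKNOWN_OR_SPIKY = "unknown_or_spiky"


@dataclass(frozen=True)
class Plan:
    primary: str              # tier tried first
    secondary: Optional[str]  # fallback or overflow tier, if any
    note: str


PLANS = {
    Traffic.BATCH_QUEUED_OR_INTERNAL: Plan(
        "spot", None, "no further analysis needed"),
    Traffic.INTERACTIVE_SECONDS_PATIENCE: Plan(
        "spot", "on-demand", "requires a bounded recovery path"),
    Traffic.HARD_SUBSECOND_SLO: Plan(
        "on-demand or reserved", "spot", "spot only as overflow"),
    Traffic.UNKNOWN_OR_SPIKY: Plan(
        "spot", "on-demand", "persistent weights also enable scale-to-zero"),
}


def recommend(traffic: Traffic) -> Plan:
    """Return the placement plan for a traffic category from the table above."""
    return PLANS[traffic]
```

Keeping the mapping exhaustive over the enum means a new traffic category fails loudly with a KeyError until someone decides where it belongs.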
What this page does not claim
Interruption frequency varies by provider, region, GPU class, and season, and published third-party numbers age quickly; this page deliberately cites none. The honest planning stance is to treat interruption rate as unknown but nonzero, engineer the recovery bound, and then measure the realised rate in production. A measured interruption-frequency series for this fleet is pending and will be published when it exists, not estimated before then.
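Measuring the realised rate needs little more than a log of spot runs. The sketch below assumes each run records its start time, its end time, and whether the provider ended it; `SpotRun` and `realised_interruption_rate` are hypothetical names, and no figures for this fleet are implied.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class SpotRun:
    started: datetime
    ended: datetime
    reclaimed: bool  # True if the provider ended the run, False if we did


def realised_interruption_rate(runs: Sequence[SpotRun]) -> Optional[float]:
    """Reclamations per instance-hour, or None when no hours were observed."""
    hours = sum((r.ended - r.started).total_seconds() for r in runs) / 3600.0
    if hours <= 0:
        return None  # nothing observed yet: the rate stays unknown, not zero
    reclaimed = sum(1 for r in runs if r.reclaimed)
    return reclaimed / hours
```

Returning None for an empty log, not 0.0, matches the planning stance above: with no observations the rate is unknown but nonzero, and a zero would claim more than the data supports.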