
A GPU cold start costs twice: minutes of user-facing latency for the request that triggers it, and warm-up GPU-minutes billed while nothing serves. With per-minute billing, hibernating into gaps longer than the wake usually wins on dollars; the binding constraint is the latency contract, which is why the worst case must be bounded.

What a cold start actually costs

A GPU cold start costs twice. Once in latency: minutes of wall clock between a request and its first token, budgeted on this fleet at 8 minutes with a 10-minute hard cap. And once in dollars: the machine bills from power-on while it loads weights and serves nothing. Whether scale-to-zero saves money is arithmetic between those two costs and the idle gaps in the traffic, and the arithmetic usually favors it more strongly than intuition suggests.

Two costs, kept separate

Discussions of cold starts blur a user-facing cost and an operator-facing one. The user-facing cost is waiting: the cold-start interval sits in the latency path of whichever request was unlucky enough to arrive first. The operator-facing cost is warm-up spend: on per-minute billing, a wake that takes minutes bills those minutes at the full GPU rate for zero tokens of output. The two costs respond to different engineering, and a deployment decision needs both on the table.

The latency side

An LLM cold start takes minutes, not seconds, and the size is structural: tens of gigabytes of weights move from disk to VRAM, the engine initialises, and graph capture and compilation warm up. The phase-by-phase decomposition, with the external measurements that calibrate each phase, is in the wake-anatomy article; the narrative treatment is at Reliability: cold starts. What matters for cost analysis is the shape of the worst case. An unbounded cold start (weights re-downloaded over the network on every start) makes scale-to-zero unsellable; a bounded one is a latency line item that can be priced. This fleet's bound is the 8-minute budget with its 10-minute cap: past the cap, the gateway fails the request honestly with a 504 rather than holding it open, and clients handle that code with one delayed retry, per the retries guide.
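On the client side, the "one delayed retry" behaviour can be sketched as follows. This is a minimal illustration, not the retries guide's implementation: the function name, the injected `send` callable, and the 30-second default delay are all hypothetical, and the real delay should come from the retries guide.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, body); hypothetical shape


def request_with_one_retry(
    send: Callable[[], Response],
    retry_delay_s: float = 30.0,  # assumed value, not taken from the retries guide
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Send once; on a 504 from the gateway, wait and send exactly once more."""
    status, body = send()
    if status != 504:
        return status, body
    # The first request hit the cold path and timed out at the cap.
    # Wait, then retry once; the second response is returned whatever it is.
    sleep(retry_delay_s)
    return send()
```

Passing `send` and `sleep` in as parameters keeps the sketch independent of any particular HTTP library and makes the retry policy easy to test without real waiting.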

The dollar side

Always-on capacity pays for every idle hour at full rate; that is the baseline scale-to-zero competes against. The challenger pays differently: each wake burns warm-up minutes of GPU time producing nothing, and each hibernation risks a wake soon after. On per-minute billing the pure dollar break-even is short: a wake costing around 8 minutes of GPU time beats staying warm through any idle gap meaningfully longer than 8 minutes. In dollars alone, hibernating into any hour-scale gap wins by a wide margin; overnight gaps are not close. The real constraint is almost never the bill; it is the latency contract. Every wake puts one user request through the cold path, so the decision is how often the traffic pattern makes someone wait, and whether the product can spend that wait. The scale-to-zero arithmetic, gap distributions included, is laid out at Reliability: scale to zero.
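The break-even can be written out as a short calculation. The sketch below assumes per-minute billing, an 8-minute wake billed at the full rate, and a hibernated machine that bills nothing for GPU time (storage charges are ignored). The function names are hypothetical, and the rate defaults to 1.0 so that results read as "GPU-minutes saved" rather than dollars.

```python
def warm_cost(gap_minutes: float, rate_per_minute: float = 1.0) -> float:
    """Cost of staying warm through an idle gap: every idle minute bills."""
    return gap_minutes * rate_per_minute


def hibernate_cost(wake_minutes: float = 8.0, rate_per_minute: float = 1.0) -> float:
    """Cost of hibernating into the gap: only the warm-up minutes bill.

    Assumption: a hibernated machine bills nothing for GPU time.
    """
    return wake_minutes * rate_per_minute


def savings(gap_minutes: float, wake_minutes: float = 8.0,
            rate_per_minute: float = 1.0) -> float:
    """Positive when hibernating is cheaper than staying warm."""
    return (warm_cost(gap_minutes, rate_per_minute)
            - hibernate_cost(wake_minutes, rate_per_minute))


# Gaps of 5 minutes, 8 minutes, 1 hour, and a 10-hour overnight window.
for gap in (5, 8, 60, 600):
    print(f"gap={gap:>3} min  GPU-minutes saved by hibernating: {savings(gap):g}")
```

The sign flips at a gap equal to the wake time, and the savings grow linearly from there. This calculation covers dollars only; it says nothing about the request that waits through each wake.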

What keeps the worst case bounded

The bound is engineered, not hoped for. The decisive choice is where weights and compile caches live: on a volume that survives hibernation, a wake reads tens of gigabytes from local disk; on ephemeral storage, it re-downloads them across the network first, and the bound dissolves. Secondary levers (load-format tuning, pinned graph capture, cache reuse) shave further minutes, and the anatomy article attributes external measurements to each. The cap completes the design: a bound that fails open is not a bound, so past 10 minutes the request fails with a named error instead of degrading into an unbounded hang.
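A fail-closed cap of this kind can be sketched as a deadline loop. This is an illustration under stated assumptions, not the fleet's gateway code: `wait_for_ready`, `WakeTimeout`, the `is_ready` probe, and the 5-second poll interval are all hypothetical; only the 10-minute cap comes from the text.

```python
import time
from typing import Callable

WAKE_CAP_S = 10 * 60  # hard cap from the text: 10 minutes


class WakeTimeout(Exception):
    """Named error for a wake that exceeded the cap (surfaced as HTTP 504)."""


def wait_for_ready(
    is_ready: Callable[[], bool],   # hypothetical readiness probe
    cap_s: float = WAKE_CAP_S,
    poll_s: float = 5.0,            # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until the backend is ready; return elapsed seconds.

    Raises WakeTimeout once the cap is reached, so the caller never hangs.
    """
    start = clock()
    deadline = start + cap_s
    while True:
        if is_ready():
            return clock() - start
        if clock() >= deadline:
            raise WakeTimeout(f"wake exceeded {cap_s:.0f}s cap")
        # Never sleep past the deadline.
        sleep(min(poll_s, deadline - clock()))
```

Using a monotonic clock keeps the deadline immune to wall-clock adjustments, and raising a dedicated exception gives the caller one obvious place to map the failure to a 504.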

When always-on is still right

Cold starts price scale-to-zero out of some products entirely. Traffic with no gaps long enough to hibernate into never collects the savings; hard sub-second latency contracts on every request cannot spend a multi-minute worst case, whatever the savings. There is also a subtler failure in the other direction: an always-on fleet sized for peak but utilised at a few percent walks out of any honest cost band. This fleet's published $0.30 to $0.70 per million output tokens is conditioned on good utilisation, and the invoice-true article is explicit about that condition; scale-to-zero is one of the mechanisms that defends it. The spot-vs-on-demand comparison covers the related decision on the capacity side.
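The utilisation condition is easy to see in a rough model where cost per token is the hourly rate divided by the tokens actually served in that hour. The sketch below is illustrative only: the $2.00 hourly rate and the 2,000 tokens-per-second peak throughput are hypothetical inputs, not this fleet's figures, and the function name is invented for the example.

```python
def cost_per_million_tokens(hourly_rate: float,
                            peak_tokens_per_second: float,
                            utilisation: float) -> float:
    """Dollars per million output tokens for an always-on machine.

    utilisation is the fraction of peak throughput actually served (0 < u <= 1).
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilisation
    return hourly_rate / tokens_per_hour * 1_000_000


# Hypothetical inputs, chosen only to show the shape of the curve.
for u in (1.0, 0.5, 0.05):
    cost = cost_per_million_tokens(2.00, 2000, u)
    print(f"utilisation {u:>4.0%}: ${cost:.2f} per million tokens")
```

Cost per token scales as one over utilisation, so a machine at 5% utilisation pays twenty times the per-token cost of the same machine fully loaded, whatever the underlying rate.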

Measured vs pending

The 8-minute budget and 10-minute cap are the fleet's operating numbers, enforced at the gateway. The per-phase wake timings cited in the anatomy article are attributed external measurements; this fleet's own per-tier wake-duration distribution is pending and will be published when the harness produces it, not estimated before.

Methodology

The 8-minute budget and the 10-minute cap are the fleet's operating numbers, enforced at the gateway and documented on the reliability page. Billing granularity comes from the provider's public per-minute pricing. The break-even statements are derived from those two published quantities by the arithmetic stated in the text. This fleet's measured wake-duration distribution is still pending and is labelled as pending wherever it is mentioned.

Part of Reliability on the learn hub.


The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.