What a cold start actually costs
A GPU cold start costs twice. Once in latency: minutes of wall clock between a request and its first token, bounded on this fleet at 8 minutes with a 10-minute hard cap. And once in dollars: the machine bills from power-on while it loads weights and serves nothing. Whether scale-to-zero saves money is arithmetic between those two costs and the idle gaps in the traffic, and the arithmetic usually favours it more strongly than intuition suggests.
Two costs, kept separate
Discussions of cold starts blur a user-facing cost and an operator-facing one. The user-facing cost is waiting: the cold-start interval sits in the latency path of whichever request was unlucky enough to arrive first. The operator-facing cost is warm-up spend: on per-minute billing, a wake that takes minutes bills those minutes at the full GPU rate for zero tokens of output. The two costs respond to different engineering, and a deployment decision needs both on the table.
The latency side
An LLM cold start is minutes, not seconds, and the size is structural: tens of gigabytes of weights move from disk to VRAM, the engine initialises, and graph capture and compilation warm up. The phase-by-phase decomposition, with the external measurements that calibrate each phase, is in the wake-anatomy article; the narrative treatment is at Reliability: cold starts. What matters for cost analysis is the shape of the worst case. An unbounded cold start (weights re-downloaded over the network on every start) makes scale-to-zero unsellable; a bounded one is a latency line item that can be priced. This fleet's bound is the 8-minute budget with its 10-minute cap, after which the gateway fails the request honestly with 504 rather than holding it open, and clients handle that code with one delayed retry per the retries guide.
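The client side of that contract can be sketched in a few lines. This is a minimal illustration, not the fleet's client: `send` is a hypothetical callable that performs the request and returns an HTTP status code, and the 30-second delay is an assumed placeholder, since the actual delay is defined in the retries guide.

```python
import time
from typing import Callable

GATEWAY_TIMEOUT = 504  # status the gateway returns once the wake cap is exceeded


def call_with_one_retry(
    send: Callable[[], int],
    retry_delay_s: float = 30.0,  # assumed placeholder, not a fleet value
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send a request; on 504, wait once and retry exactly once."""
    status = send()
    if status != GATEWAY_TIMEOUT:
        return status
    # The first attempt hit the cold path and timed out at the gateway.
    # One delayed retry gives the wake time to finish; a second 504 is
    # returned to the caller rather than retried again.
    sleep(retry_delay_s)
    return send()
```

Injecting `sleep` keeps the sketch testable without waiting; a real client would leave the default in place.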
The dollar side
Always-on capacity pays for every idle hour at full rate; that is the baseline scale-to-zero competes against. The challenger pays differently: each wake burns warm-up minutes of GPU time producing nothing, and each hibernation risks a wake soon after. On per-minute billing the pure dollar break-even is short: a wake costing around 8 minutes of GPU time beats staying warm through any idle gap meaningfully longer than those minutes. In dollars alone, hibernating into any hour-scale gap wins by a wide margin; overnight gaps are not close. The real constraint is almost never the bill but the latency contract: every wake puts one user request through the cold path, so the decision is how often the traffic pattern makes someone wait, and whether the product can spend that wait. The scale-to-zero arithmetic, gap distributions included, is laid out at Reliability: scale to zero.
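The break-even can be written down directly. The sketch below is a simplified model under stated assumptions: per-minute billing, a wake that costs the full 8-minute budget, and no charge for the hibernated machine beyond the wake itself (persistent-volume storage is ignored). The function names and the hourly rate passed in are illustrative, not fleet values.

```python
def hibernation_saving_minutes(gap_minutes: float, wake_minutes: float = 8.0) -> float:
    """GPU-minutes saved by hibernating through an idle gap.

    Staying warm bills the whole gap; hibernating bills only the wake
    that follows it. A positive result means hibernation is cheaper.
    """
    return gap_minutes - wake_minutes


def hibernation_saving_dollars(
    gap_minutes: float,
    rate_per_hour: float,  # hypothetical GPU price, supplied by the caller
    wake_minutes: float = 8.0,
) -> float:
    """The same saving converted to dollars at a per-minute-billed hourly rate."""
    return hibernation_saving_minutes(gap_minutes, wake_minutes) * rate_per_hour / 60.0
```

Under these assumptions a one-hour gap saves 52 of the 60 billable minutes and a twelve-hour overnight gap saves 712 of 720, while a gap shorter than the wake comes out negative. That is the sense in which the dollar question is rarely close.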
What keeps the worst case bounded
The bound is engineered, not hoped for. The decisive choice is where weights and compile caches live: on a volume that survives hibernation, a wake reads tens of gigabytes from local disk; on ephemeral storage, it re-downloads them across the network first, and the bound dissolves. Secondary levers (load-format tuning, pinned graph capture, cache reuse) shave further minutes, and the anatomy article attributes external measurements to each. The cap completes the design: a bound that fails open is not a bound, so past 10 minutes the request fails with a named error instead of degrading into an unbounded hang.
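A cap that fails closed can be sketched with a plain timeout. This is an illustration of the idea rather than the gateway's code: `wake_and_serve` is a hypothetical coroutine function standing in for "wake the machine, then serve the request", and the error string is a placeholder.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

WAKE_CAP_S = 10 * 60  # the 10-minute hard cap, in seconds


async def serve_through_wake(
    wake_and_serve: Callable[[], Awaitable[str]],
    cap_s: float = WAKE_CAP_S,
) -> Tuple[int, str]:
    """Await the wake-then-serve work, failing with 504 once the cap passes."""
    try:
        body = await asyncio.wait_for(wake_and_serve(), timeout=cap_s)
        return 200, body
    except asyncio.TimeoutError:
        # Past the cap the request fails explicitly instead of hanging.
        return 504, "wake exceeded cap"
```

`asyncio.wait_for` cancels the pending work when the timeout fires, so the request is released rather than left open behind the error.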
When always-on is still right
Cold starts price scale-to-zero out of some products entirely. Traffic with no gaps long enough to hibernate into never collects the savings; hard sub-second latency contracts on every request cannot spend a multi-minute worst case, whatever the savings. There is also a subtler failure: an always-on fleet sized for peak but utilised at a few percent walks out of any honest cost band. This fleet's published $0.30 to $0.70 per million output tokens is conditioned on good utilisation, and the invoice-true article is explicit about that condition; scale-to-zero is one of the mechanisms that defends it. The spot-vs-on-demand comparison covers the related decision on the capacity side.
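The utilisation condition is easy to see in arithmetic. The sketch below assumes a machine billed at a flat hourly rate that produces tokens at a fixed throughput only while busy; both inputs are hypothetical parameters supplied by the caller, not measurements from this fleet.

```python
def cost_per_million_tokens(
    rate_per_hour: float,      # hypothetical GPU price in dollars per hour
    tokens_per_second: float,  # hypothetical output throughput while busy
    utilisation: float,        # fraction of billed time spent serving, in (0, 1]
) -> float:
    """Dollars per million output tokens for an always-on machine."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    # Idle time is billed but produces nothing, so it shrinks the token count
    # that the hourly bill is spread over.
    tokens_per_hour = tokens_per_second * 3600.0 * utilisation
    return rate_per_hour / tokens_per_hour * 1_000_000.0
```

Whatever rate and throughput go in, the cost scales with the inverse of utilisation: a machine busy 5% of the time pays 20 times the per-token cost of the same machine fully busy.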
Measured vs pending
The 8-minute budget and 10-minute cap are the fleet's operating numbers, enforced at the gateway. The per-phase wake timings cited in the anatomy article are attributed external measurements; this fleet's own per-tier wake-duration distribution is pending and will be published when the harness produces it, not estimated before.