Cold-start wake latency is the operational bound on the time a scale-to-zero inference fleet may take between the first request reaching a hibernated tier and that tier serving tokens. Where cold start names the general phenomenon, wake latency is the number an operator budgets, measures, and enforces with a hard timeout.

Cold-start wake latency

The general phenomenon is cold start: a fresh inference process takes minutes to load weights before it can serve. Wake latency is the operational version of that fact, the bound a scale-to-zero fleet publishes and enforces for the path from first request to first served token on a hibernated tier.

What the budget buys

A stated wake budget converts an unpredictable failure into a priced trade: the operator knows exactly what the first request after an idle period costs in latency, and can decide per tier whether that cost beats paying for an always-warm replica. Without the budget, scale-to-zero is a reliability hazard; with it, scale-to-zero is a pricing feature.
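The per-tier decision can be sketched as simple arithmetic. The example below is a rough illustration, not spotinference's pricing model: the hourly rate, active hours, and wake count are hypothetical inputs, and it assumes wake time is billed at the same rate as serving time. Only the 8-minute wake figure comes from this page.

```python
# Hypothetical cost comparison for one tier. All inputs except the
# 8-minute wake bound are illustrative assumptions.

WAKE_MINUTES = 8  # measured wake bound quoted on this page


def daily_cost_always_warm(hourly_rate: float) -> float:
    """Cost of keeping one replica running for a full day."""
    return 24 * hourly_rate


def daily_cost_scale_to_zero(
    hourly_rate: float,
    active_hours: float,
    wakes_per_day: int,
    wake_minutes: float = WAKE_MINUTES,
) -> float:
    """Cost when the tier hibernates while idle.

    Assumes each wake is billed at the serving rate for up to
    `wake_minutes`, and that billed time never exceeds 24 hours.
    """
    wake_hours = wakes_per_day * wake_minutes / 60
    billed_hours = min(24.0, active_hours + wake_hours)
    return hourly_rate * billed_hours


def scale_to_zero_is_cheaper(
    hourly_rate: float, active_hours: float, wakes_per_day: int
) -> bool:
    """True when hibernating costs less than staying warm all day."""
    return daily_cost_scale_to_zero(
        hourly_rate, active_hours, wakes_per_day
    ) < daily_cost_always_warm(hourly_rate)
```

The comparison captures only the money side. The latency side of the trade is the wake budget itself: the first request after each idle period waits up to that bound, so the operator has to judge whether that wait is acceptable for the tier in question.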

Bounding the wake

The bound holds only if nothing on the wake path is unbounded. Persisting model weights and the torch.compile cache on a volume that survives hibernation removes the two worst offenders, the multi-gigabyte download and the recompilation, leaving only VM restore plus engine start. Measured on spotinference's fleet, the full path from hibernate-restore through engine start runs 8 minutes or less. A 10-minute hard cap returns HTTP 504 rather than letting a stuck wake bill indefinitely.
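A hard cap of this kind can be sketched with a plain timeout around the wake call. This is a minimal illustration, not spotinference's implementation: `wake_with_cap` and the `wake` coroutine it receives are hypothetical names, and the only values taken from this page are the 10-minute cap and the 504 status.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

WAKE_HARD_CAP_SECONDS = 10 * 60  # the 10-minute hard cap described above


async def wake_with_cap(
    wake: Callable[[], Awaitable[None]],
    cap_seconds: float = WAKE_HARD_CAP_SECONDS,
) -> Tuple[int, str]:
    """Run a wake coroutine under a hard timeout.

    `wake` is a hypothetical coroutine function that restores the VM and
    starts the engine. Returns an (HTTP status, message) pair.
    """
    try:
        # wait_for cancels the wake task once the cap elapses, so a stuck
        # wake cannot keep running (and billing) past the cap.
        await asyncio.wait_for(wake(), timeout=cap_seconds)
    except asyncio.TimeoutError:
        return 504, "wake exceeded hard cap"
    return 200, "tier awake"


async def _demo() -> None:
    async def quick_wake() -> None:
        await asyncio.sleep(0.01)  # stands in for restore plus engine start

    async def stuck_wake() -> None:
        await asyncio.sleep(60)  # stands in for a wake that never finishes

    print(await wake_with_cap(quick_wake, cap_seconds=1.0))
    print(await wake_with_cap(stuck_wake, cap_seconds=0.05))


if __name__ == "__main__":
    asyncio.run(_demo())
```

Cancelling on timeout is the design point: the cap only prices the worst case if the stuck work is actually stopped, not merely abandoned by the caller.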

For the fleet-level policy, see Reliability: wake budgets.

Cost and reliability implications

spotinference's measured bound is 8 minutes or less from hibernate-restore through engine start to first served token. The bound is achievable because model weights and the torch.compile cache persist on an attached volume across hibernation, which reduces a cold start to a disk mount plus engine boot. A 10-minute hard cap converts a stuck wake into an explicit 504 rather than an open-ended bill, so the worst case stays priced.

Part of Reliability on the learn hub.

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.