Cold-start wake latency
The general phenomenon is cold start: a fresh inference process takes minutes to load weights before it can serve. Wake latency is the operational version of that fact: the time from first request to first served token on a hibernated tier, which a scale-to-zero fleet bounds, publishes, and enforces.
What the budget buys
A stated wake budget converts an unpredictable failure into a priced trade: the operator knows the worst case the first request after an idle period costs in latency, and can decide per tier whether that cost beats paying for an always-warm replica. Without the budget, scale-to-zero is a reliability hazard; with it, scale-to-zero is a pricing feature.
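The per-tier decision can be sketched as a small comparison. This is a minimal illustration, not the fleet's actual policy: the Tier fields, the function name, and every number are hypothetical, and the savings figure ignores whatever the wake itself bills.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    # All fields are hypothetical inputs an operator would supply per tier.
    name: str
    max_first_request_wait_s: float  # longest first-request wait the tier tolerates
    idle_hours_per_month: float      # hours per month with no traffic
    replica_cost_per_hour: float     # price of keeping one replica warm


def scale_to_zero_decision(tier: Tier, wake_budget_s: float) -> tuple[bool, float]:
    """Return (allowed, monthly_savings) for hibernating this tier when idle.

    Hibernation is allowed only if the published wake budget fits inside the
    wait the tier tolerates; the savings are the idle hours that no longer
    pay for a warm replica.
    """
    allowed = wake_budget_s <= tier.max_first_request_wait_s
    savings = tier.idle_hours_per_month * tier.replica_cost_per_hour if allowed else 0.0
    return allowed, savings
```

With a 600-second budget, a tier that tolerates a 900-second first-request wait can hibernate and keeps its idle-hour savings, while a tier that tolerates only 5 seconds stays always-warm. The comparison is only possible because the budget is a stated number.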
Bounding the wake
The bound holds only if nothing on the wake path is unbounded. Persisting model weights and the torch.compile cache on a volume that survives hibernation removes the two worst offenders, the multi-gigabyte download and the recompilation, leaving only VM restore plus engine start. Measured on spotinference's fleet, the full path from hibernate-restore through engine start takes 8 minutes or less. A 10-minute hard cap returns HTTP 504 (Gateway Timeout) rather than letting a stuck wake bill forever.
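The hard cap can be sketched as a deadline wrapped around the wake path. This is a minimal sketch under assumptions, not spotinference's implementation: wake_with_cap and its wake argument are hypothetical names, and the only values taken from the text are the 10-minute cap and the 504 status.

```python
import asyncio

WAKE_HARD_CAP_S = 10 * 60  # 10-minute hard cap, as stated in the text


async def wake_with_cap(wake, cap_s=WAKE_HARD_CAP_S):
    """Run a wake coroutine under a hard deadline.

    `wake` is a hypothetical zero-argument coroutine function covering
    VM restore plus engine start. Returns (200, result) if it finishes
    within the cap, or (504, None) if the cap is exceeded.
    """
    try:
        # wait_for cancels the wake when the deadline passes, so a stuck
        # wake stops running instead of continuing in the background.
        result = await asyncio.wait_for(wake(), timeout=cap_s)
    except asyncio.TimeoutError:
        return 504, None
    return 200, result
```

A caller would run it as asyncio.run(wake_with_cap(restore_and_start)), where restore_and_start stands in for the real wake path. Cancelling on timeout is the point of the design: the request fails with a definite status, and the wake stops consuming billed time.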
For the fleet-level policy, see Reliability: wake budgets.