Anatomy of a GPU wake: inside the 8-minute budget
When a request arrives for a hibernated tier, the gateway wakes the machine, waits for the engine to report healthy, then forwards the request. The whole wake is budgeted at 8 minutes and hard-capped at 10 minutes, at which point the gateway stops waiting and returns 504.
The budget and the cap
Scale-to-zero economics only work if the wake is bounded. The fleet's operating target is 8 minutes from hibernate-restore to first token. The gateway enforces a 10-minute hard cap: past that point it fails the request with wake_timeout (HTTP 504) rather than holding the connection open indefinitely. The narrative version lives at Reliability: wake budgets; the glossary defines cold-start wake latency.
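The wait-then-fail behaviour can be sketched as a polling loop. This is a minimal illustration, not the gateway's actual code: `wait_for_wake`, `WakeResult`, the 5-second poll interval and the injected `start_wake` / `is_healthy` callables are all hypothetical names chosen for the example. Only the 8-minute budget, the 10-minute cap, the 504 status and the wake_timeout error come from this page.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

WAKE_BUDGET_S = 8 * 60     # operating target; exceeding it is reported, not fatal
WAKE_HARD_CAP_S = 10 * 60  # the gateway stops waiting at this point


@dataclass
class WakeResult:
    ready: bool                 # True: the engine is healthy, forward the request
    http_status: Optional[int]  # 504 on timeout, None when the request can be forwarded
    error: Optional[str]        # "wake_timeout" on timeout
    waited_s: float
    over_budget: bool           # woke (or timed out) later than the 8-minute target


def wait_for_wake(
    start_wake: Callable[[], None],
    is_healthy: Callable[[], bool],
    *,
    budget_s: float = WAKE_BUDGET_S,
    cap_s: float = WAKE_HARD_CAP_S,
    poll_s: float = 5.0,  # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> WakeResult:
    """Wake the machine, then poll health until it is ready or the cap is hit."""
    started = clock()
    start_wake()
    while True:
        waited = clock() - started
        if is_healthy():
            return WakeResult(True, None, None, waited, waited > budget_s)
        if waited >= cap_s:
            # Fail the request instead of holding the connection open.
            return WakeResult(False, 504, "wake_timeout", waited, True)
        # Never sleep past the cap.
        sleep(min(poll_s, cap_s - waited))
```

Injecting `clock` and `sleep` keeps the loop testable without waiting ten real minutes. A wake that lands between the budget and the cap still succeeds and is only flagged as over budget.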
Where the time goes
A wake is a provider hibernate-restore followed by an engine cold start. The phases, in order:
- VM restore. The provider resumes the instance, in practice a clean reboot. GPU memory does not survive hibernation, so every wake pays a full model reload; the design accepts that cost and bounds it instead of pretending it away.
- Service start. The inference engine comes up on boot with no operator action.
- Weight load. The engine reads the checkpoint from the machine's persistent volume. Local-disk loads run in the tens of seconds: on a 70B-class checkpoint, Tensorfuse measured 33 seconds, falling to 18 seconds after load-format tuning.
- torch.compile. A cache hit when the artifact directory persists; the vLLM docs state the whole cache directory can be copied between deployments, and on this fleet it simply lives on the persistent volume.
- CUDA graph capture. Not covered by that cache. Tensorfuse measured 54 seconds at default settings and 7 seconds with capture pinned to the batch sizes actually served.
- Engine init and first token. Scheduler and KV-cache allocation, then the first forward pass.
External calibration for the total: Microsoft's cold-start study measured roughly 176 seconds of pure engine startup for vLLM serving Llama 3.1 8B on an A100, constant no matter how fast the container image arrived. Minutes, not seconds, is the honest unit for a vLLM cold start, which is why the budget is 8 minutes and not 80 seconds.
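The comparison behind that last sentence is simple arithmetic, worked here as a small sketch. The 176-second figure is the external measurement quoted above for an 8B model on an A100, not a number from this fleet. The remaining headroom has to cover everything that figure excludes, such as the VM restore, and larger checkpoints.

```python
BUDGET_S = 8 * 60              # 480 s operating target
HARD_CAP_S = 10 * 60           # 600 s gateway cap
MEASURED_ENGINE_STARTUP_S = 176  # external measurement quoted above, not fleet data
NAIVE_BUDGET_S = 80            # the "80 seconds" the text argues against

share = MEASURED_ENGINE_STARTUP_S / BUDGET_S
headroom_s = BUDGET_S - MEASURED_ENGINE_STARTUP_S
fits_naive = MEASURED_ENGINE_STARTUP_S <= NAIVE_BUDGET_S

print(f"engine startup uses {share:.0%} of the budget, leaving {headroom_s} s")
print(f"fits an 80 s budget: {fits_naive}")
# engine startup uses 37% of the budget, leaving 304 s
# fits an 80 s budget: False
```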
Why the persistent volume bounds the wake
Ephemeral mounts are wiped on hibernate-restore, so model weights and the compile cache live on a persistent per-tier volume that survives hibernation with its data intact. A wake reads tens of gigabytes from local disk instead of re-downloading them from Hugging Face, which is the difference between a bounded wake and an unbounded one. More at Reliability: cold starts.
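Because the bound depends on both directories actually living on the persistent volume, a startup check along these lines can catch a misconfigured path before it turns into a re-download. This is a sketch under assumptions: the function names and every path shown are hypothetical, and nothing here describes the fleet's real layout.

```python
from pathlib import Path


def is_under(path: Path, root: Path) -> bool:
    """True if path, after resolving symlinks, sits inside root."""
    try:
        path.resolve().relative_to(root.resolve())
        return True
    except ValueError:
        return False


def paths_off_volume(required: dict[str, Path], volume_root: Path) -> list[str]:
    """Names of required directories that are NOT on the persistent volume."""
    return sorted(
        name for name, path in required.items() if not is_under(path, volume_root)
    )


# Hypothetical layout: the weights are on the volume, the compile cache is not.
offenders = paths_off_volume(
    {
        "weights": Path("/mnt/persist/models/checkpoint"),
        "compile_cache": Path("/tmp/compile-cache"),
    },
    volume_root=Path("/mnt/persist"),
)
print(offenders)
```

Resolving symlinks first matters: a cache directory that looks persistent but links into an ephemeral mount would otherwise pass the check.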
Measured vs pending
The 8-minute budget and the 10-minute cap are the fleet's operating numbers, enforced in code. The per-phase split above leans on the attributed external measurements; a per-tier timing sweep of this fleet's own wakes, weight load against graph capture against first token, is pending. This page will carry those numbers when the harness produces them, and not before.
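One possible shape for the per-phase measurement is a small timer that wraps each phase of a wake. This illustrates the idea only and is not the pending harness: `PhaseTimer` and the phase names are hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase of a single wake."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.durations: Dict[str, float] = {}

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record the phase even if it raised, so failed wakes are still timed.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self) -> float:
        return sum(self.durations.values())

    def over_budget(self, budget_s: float = 8 * 60) -> bool:
        return self.total() > budget_s
```

Usage would wrap each step, for example `with timer.phase("weight_load"): ...`, then log `timer.durations` per tier at the end of the wake. The names simply mirror the split named above: weight load, graph capture, first token.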