
A wake on this fleet is a provider hibernate-restore followed by an engine cold start, budgeted at 8 minutes end to end and hard-capped at 10 minutes, after which the gateway returns 504. The budget holds because model weights and the torch.compile cache live on a persistent volume, so a wake reads disk instead of re-downloading.

Anatomy of a GPU wake: inside the 8-minute budget

When a request arrives for a hibernated tier, the gateway wakes the machine, waits for the engine to report healthy, then forwards the request. The whole wake is budgeted at 8 minutes and hard-capped at 10 minutes, at which point the gateway stops waiting and returns 504.

The budget and the cap

Scale-to-zero economics only work if the wake is bounded. The fleet's operating target is 8 minutes from hibernate-restore to first token. The gateway enforces a 10-minute hard cap: once it is exceeded, the gateway fails the request with 504 wake_timeout rather than holding the connection open indefinitely. The narrative version lives at Reliability: wake budgets; the glossary defines cold-start wake latency.
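A minimal sketch of the wait-with-cap behavior described above, not the gateway's actual code. The function name, the polling interval, and the injected `is_healthy`, `clock`, and `sleep` callables are all assumptions made for illustration; only the 8-minute budget, the 10-minute cap, the 504 status, and the wake_timeout reason come from the text.

```python
import time
from typing import Callable

WAKE_BUDGET_S = 8 * 60   # operating target quoted in the text
WAKE_CAP_S = 10 * 60     # hard cap quoted in the text


def wait_for_wake(
    is_healthy: Callable[[], bool],          # hypothetical health probe
    cap_s: float = WAKE_CAP_S,
    poll_s: float = 2.0,                     # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str, float]:
    """Poll until the engine reports healthy or the cap elapses.

    Returns (http_status, reason, elapsed_seconds).
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if is_healthy():
            return 200, "ready", elapsed
        if elapsed >= cap_s:
            # Stop waiting instead of holding the connection open.
            return 504, "wake_timeout", elapsed
        # Never sleep past the cap.
        sleep(min(poll_s, cap_s - elapsed))
```

Injecting the clock and the sleep function keeps the loop testable without waiting ten real minutes. A caller could also compare the returned elapsed time against `WAKE_BUDGET_S` to flag wakes that succeeded but overran the operating target.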

Where the time goes

A wake is a provider hibernate-restore followed by an engine cold start. The phases, in order:

  1. VM restore. The provider resumes the instance, which in practice amounts to a clean reboot. GPU memory does not survive hibernation, so every wake pays a full model reload; the design accepts that cost and bounds it rather than pretending it away.
  2. Service start. The inference engine comes up on boot with no operator action.
  3. Weight load. The engine reads the checkpoint from the machine's persistent volume. Local-disk loads run in the tens of seconds: Tensorfuse measured 33 seconds falling to 18 after load-format tuning on a 70B-class checkpoint.
  4. torch.compile. A cache hit when the artifact directory persists; the vLLM docs state the whole cache directory can be copied between deployments, and on this fleet it simply lives on the persistent volume.
  5. CUDA graph capture. Not covered by that cache. Tensorfuse measured 54 seconds at default settings and 7 seconds with capture pinned to the batch sizes actually served.
  6. Engine init and first token. Scheduler and KV-cache allocation, then the first forward pass.

External calibration for the total: Microsoft's cold-start study measured roughly 176 seconds of pure engine startup for vLLM serving Llama 3.1 8B on an A100, constant no matter how fast the container image arrived. Minutes, not seconds, is the honest unit for a vLLM cold start, which is why the budget is 8 minutes and not 80 seconds.
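The arithmetic behind that comparison can be sketched as follows. The figures are the external measurements quoted above, taken on other systems and checkpoints, so the sums are a rough calibration only and not a prediction for this fleet; the dictionary layout and function names are illustrative.

```python
BUDGET_S = 8 * 60  # wake budget from the text, in seconds

# External measurements quoted in the text (seconds). They come from
# different models and hardware, so they are not additive in any strict sense.
TENSORFUSE_S = {
    "weight_load": {"default": 33, "tuned": 18},
    "cuda_graph_capture": {"default": 54, "tuned": 7},
}
MICROSOFT_ENGINE_STARTUP_S = 176  # vLLM, Llama 3.1 8B, A100


def measured_total(setting: str) -> int:
    """Sum the two phases that have a quoted measurement."""
    return sum(phase[setting] for phase in TENSORFUSE_S.values())


def headroom(spent_s: int, budget_s: int = BUDGET_S) -> int:
    """Seconds of budget left for the phases without a quoted measurement."""
    return budget_s - spent_s


print(measured_total("default"))             # 87
print(measured_total("tuned"))               # 25
print(headroom(MICROSOFT_ENGINE_STARTUP_S))  # 304
```

Even the larger external figure leaves roughly five minutes of the budget for VM restore, service start, and the first forward pass, none of which has a quoted measurement here.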

Why the persistent volume bounds the wake

Ephemeral mounts are wiped on hibernate-restore, so model weights and the compile cache live on a persistent per-tier volume that survives hibernation with its data intact. A wake reads tens of gigabytes from local disk instead of re-downloading them from Hugging Face, which is the difference between a bounded wake and an unbounded one. More at Reliability: cold starts.
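One way to guard that invariant is a startup check that every wake-critical directory sits under the persistent mount. This is a sketch under assumptions: the mount point and directory paths in the usage note are hypothetical, and the check is purely lexical, so it does not resolve symlinks or `..` segments.

```python
from pathlib import PurePosixPath


def off_volume(paths: dict[str, str], volume_mount: str) -> list[str]:
    """Return the names of configured paths that are NOT under the mount.

    An empty list means weights and caches would survive a hibernate-restore,
    assuming the mount itself is the persistent per-tier volume.
    """
    mount = PurePosixPath(volume_mount)
    return sorted(
        name
        for name, path in paths.items()
        if not PurePosixPath(path).is_relative_to(mount)
    )
```

For example, with a hypothetical mount at `/mnt/tier-volume`, a weights directory of `/mnt/tier-volume/models` passes, while a compile cache left at `/tmp/compile-cache` is reported and would be lost on the next wake.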

Measured vs pending

The 8-minute budget and the 10-minute cap are the fleet's operating numbers, enforced in code. The per-phase split above leans on the attributed external measurements; a per-tier timing sweep of this fleet's own wakes, weight load against graph capture against first token, is pending. This page will carry those numbers when the harness produces them, and not before.

Methodology

The decomposition follows the wake path itself plus attributed external measurements. Each phase boundary is independently observable: machine state transitions from the provider's API, engine start and ready events from the machine's service logs, and the first post-wake request's time to first token from the gateway's per-request timing. The 10-minute cap is enforced in the gateway, which fails the request with 504 wake_timeout when the cap is exceeded.
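Given timestamps for those boundaries, the per-phase split is a series of differences. The sketch below assumes ISO 8601 timestamps with an explicit UTC offset; the event names are hypothetical labels for the three sources listed above, not identifiers from the provider's API, the service logs, or the gateway.

```python
from datetime import datetime

# Hypothetical event names, in the order the wake path produces them.
EVENT_ORDER = [
    "wake_requested",   # gateway receives the request
    "machine_started",  # provider API state transition
    "engine_started",   # service log: engine process up
    "engine_ready",     # service log: engine reports healthy
    "first_token",      # gateway per-request timing
]


def phase_durations(events: dict[str, str]) -> dict[str, float]:
    """Seconds elapsed between each pair of consecutive wake events."""
    stamps = [datetime.fromisoformat(events[name]) for name in EVENT_ORDER]
    return {
        f"{EVENT_ORDER[i]}->{EVENT_ORDER[i + 1]}":
            (stamps[i + 1] - stamps[i]).total_seconds()
        for i in range(len(EVENT_ORDER) - 1)
    }


def total_wake_seconds(events: dict[str, str]) -> float:
    """End-to-end wake time, comparable against the budget and the cap."""
    return sum(phase_durations(events).values())
```

Because each boundary comes from a different source, clock skew between the provider, the machine, and the gateway would show up as error in the individual phases, though not in the end-to-end total when both endpoints are gateway-side timestamps.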

Part of Reliability on the learn hub.


The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.