spotinference

Sign in with GitHub
Reliability

A fleet that hibernates when idle is cheaper than an always-on one only when the worst case is bounded: here a wake completes in eight minutes or less measured, with a ten-minute hard cap that fails honestly. These pages cover cold starts, wake budgets, and the engineering that keeps the bound real.

A fleet that hibernates when idle is cheaper than an always-on one, but only when the worst case is bounded. The request that lands on a parked tier pays the wake, so the wake interval is the single most important number a scale-to-zero operator publishes: it is the worst-case first-byte latency a customer will observe under the policy.

Here a wake completes in eight minutes or less measured, with a ten-minute hard cap that returns an explicit error rather than hanging the connection. The pages below decompose the cold start that the wake budget bounds, the hibernation that shrinks it, and the lean-stack engineering that keeps the whole system inspectable for a small operator.

This pillar collects every page on the topic. Each one below opens with the answer; follow a link for the full treatment, or use the rail to cross into a neighbouring pillar.

Reliability and operations for GPU inference fleets

Scale-to-zero arithmetic, cold-start anatomy, hibernation, wake budgets. Read more.

Cold start

Time from a fresh process initialising to its first served request. Read more.

Cold-start wake latency

The budgeted SLO for waking a hibernated tier into service. Read more.

Lean-stack engineering

One binary, one box; auditability as the deliverable. Read more.

Anatomy of a GPU wake: the 8-minute budget

Hibernate-restore decomposed: VM restore, weight load, compile cache, CUDA graphs, first token. Read more.

What a cold start actually costs

Two costs, separated: minutes of user-facing latency and warm-up GPU-minutes billed for zero tokens. Read more.

Every page in this pillar describes the system running behind one endpoint. Point an OpenAI SDK at spotinference when ready: swap the base URL and the key, and the first request answers from the same fleet these pages measure.