An inference fleet that idles most hours of the day is cheaper per token than one that runs continuously, but only when the idle path is fast, the wake path is bounded, and the operator can reason about every minute of compute that lands on the bill.
This page walks through the operational primitives that make scale-to-zero serving viable for large language models: the duty-cycle arithmetic that justifies the pattern, the anatomy of a cold start that threatens to undo the savings, and the snapshot-based hibernation that shrinks the wake to seconds.
It continues through the wake budget that bounds first-byte latency for the request unlucky enough to land on a parked replica, and the lean-stack engineering tenets that keep the whole system inspectable for a small operator. The page closes with the failure modes of that lean stack, named so readers whose problem lies elsewhere can self-select out.
Scale-to-zero economics
An accelerator left online twenty-four hours a day is rented for every one of those hours. At a representative spot rate of three dollars eighty cents per hour for a single H100, a continuous month costs roughly two thousand seven hundred thirty-six dollars. A workload that actually uses the same accelerator for four hours a day, paid at the same rate, costs four hundred fifty-six dollars over the same month.
The ratio is the duty cycle: useful seconds divided by paid seconds. Continuous serving that is billed around the clock but busy for only four hours has a duty cycle of 4/24, or about 0.17; a replica billed only while it works has a duty cycle close to one. The arithmetic is unambiguous and the savings are large, but only on the condition that idle hours genuinely cost nothing.
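The arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes a thirty-day billing month; the function names are illustrative, and the rate is the representative figure quoted in the text rather than a price from any particular provider.

```python
HOURLY_RATE_USD = 3.80   # representative H100 spot rate quoted in the text
DAYS_PER_MONTH = 30      # assumption: a 30-day billing month


def monthly_cost(paid_hours_per_day: float) -> float:
    """Cost of renting one accelerator for the given number of hours each day."""
    return round(paid_hours_per_day * DAYS_PER_MONTH * HOURLY_RATE_USD, 2)


def duty_cycle(useful_hours: float, paid_hours: float) -> float:
    """Useful time divided by paid time; 1.0 means no paid hour is wasted."""
    return useful_hours / paid_hours


continuous = monthly_cost(24)  # always-on replica, billed around the clock
bursty = monthly_cost(4)       # replica billed only for its four busy hours

print(f"continuous:    ${continuous:,.2f}")
print(f"scale-to-zero: ${bursty:,.2f}")
print(f"duty cycle of the always-on replica: {duty_cycle(4, 24):.3f}")
```

The gap between the two monthly figures is exactly the money spent on idle hours, which is why the rest of the page concentrates on making those hours free without making the wake intolerable.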
Per-token hosted APIs hide the duty-cycle dimension behind a blended markup. The provider amortises idle GPU time across the entire customer base, takes a margin on top, and quotes a flat per-million-tokens rate. That rate is convenient and predictable, and for low-volume traffic it is almost always the right answer.
For high-volume traffic on a bursty schedule, the same rate is several multiples of the underlying compute cost, because the provider has to defend against the worst-case duty cycle across every tenant. Self-operating the same workload shifts the duty-cycle risk onto the operator, and rewards the operator who can prove the idle hours are truly idle.
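One way to see how the duty cycle feeds into a per-token price is to compute the effective cost of a million tokens on a self-operated accelerator. The sketch below is illustrative only: the throughput of 1,000 tokens per second is a hypothetical placeholder, not a measured figure, and the function name is an assumption.

```python
def cost_per_million_tokens(hourly_rate_usd: float,
                            tokens_per_second: float,
                            duty_cycle: float) -> float:
    """Effective cost of one million useful tokens on a rented accelerator.

    duty_cycle is useful seconds divided by paid seconds, so a lower value
    means more paid time produces no tokens at all.
    """
    if not 0 < duty_cycle <= 1:
        raise ValueError("duty_cycle must be in (0, 1]")
    useful_tokens_per_paid_hour = tokens_per_second * 3600 * duty_cycle
    return hourly_rate_usd / useful_tokens_per_paid_hour * 1_000_000


RATE = 3.80          # hourly rate quoted in the text
THROUGHPUT = 1000.0  # hypothetical aggregate tokens per second while busy

for dc in (1.0, 0.5, 1 / 6):
    cost = cost_per_million_tokens(RATE, THROUGHPUT, dc)
    print(f"duty cycle {dc:.2f}: ${cost:.2f} per million tokens")
```

The cost scales with the inverse of the duty cycle, so an operator who pays for idle hours at a duty cycle of one sixth pays six times the fully utilised price per token.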
Cold-start anatomy
Cold start is the wall-clock interval between a fresh inference process beginning to initialise and the same process accepting its first request. It is paid once per process lifetime, regardless of how many requests follow, and on a large language model the interval is dominated by reading the weights from persistent storage into accelerator memory. A reasonable decomposition for a seventy-billion-parameter transformer on a modern GPU host runs as follows.
VM provision is the first phase when the underlying box is not already allocated. A fresh virtual machine takes from one to several minutes to come up at the cloud control plane, get its network interfaces, and finish first-boot setup.
Image or weight download follows. A seventy-billion-parameter checkpoint in half precision is roughly one hundred forty gigabytes; pulling it from object storage at a sustained one gigabit per second takes about nineteen minutes (1,120 gigabits at line rate), and over a slower link or a contended object store the same read can stretch past thirty minutes.
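The download estimate follows directly from checkpoint size and link speed. The sketch below assumes decimal gigabytes, two bytes per parameter for half precision, and an ideal link with no protocol overhead unless an efficiency factor is supplied; the function names are illustrative.

```python
def checkpoint_gigabytes(params: float, bytes_per_param: int = 2) -> float:
    """Checkpoint size in decimal gigabytes; half precision is 2 bytes per parameter."""
    return params * bytes_per_param / 1e9


def transfer_minutes(size_gb: float, link_gbit_per_s: float,
                     efficiency: float = 1.0) -> float:
    """Minutes to move size_gb over a link, scaled by an achieved-throughput factor."""
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    gigabits = size_gb * 8
    return gigabits / (link_gbit_per_s * efficiency) / 60


size_gb = checkpoint_gigabytes(70e9)
for link in (1, 10, 25):  # link speeds in Gbit/s, chosen for illustration
    print(f"{link:>2} Gbit/s: {transfer_minutes(size_gb, link):.1f} min")
```

Line rate is an upper bound on throughput, so real transfers land at or above these figures; passing an efficiency below one models a contended object store.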
CUDA context and driver setup is a fixed cost on the order of seconds: the first CUDA call creates a context, loads the kernel image, and enumerates the device topology, with an additional NCCL handshake on multi-GPU configurations.
Kernel compilation is the next phase. Engines that emit fused operators via torch.compile, Triton, or a similar just-in-time path pay the full compilation cost on the first cold start of a fresh host, which runs from tens of seconds to several minutes depending on architecture and the number of fused shapes. Subsequent cold starts that find a warm compile cache on persistent storage pay only the read cost for the cached artifacts, which is back in the seconds range.
KV-cache allocation reserves the block-table memory the engine uses to hold attention keys and values for in-flight sequences. The reservation itself is fast but must complete before the engine accepts traffic, because the block table is the address space the attention kernel walks. The first forward pass finishes any lazy initialisation, typically costs a few hundred milliseconds, and is usually absorbed by a synthetic warmup request before the engine is marked ready.
Summed end-to-end, a cold start on a seventy-billion-class model with a fresh VM and no cached weights lands between five and fifteen minutes when the checkpoint arrives over a multi-gigabit link, and longer over a gigabit-class link, where the download alone approaches or exceeds that range. In either case the weight read accounts for the majority of the budget. A cold start on a provisioned host with persistent weights and a warm compile cache lands an order of magnitude faster, which is the architectural premise that makes scale-to-zero serving compatible with interactive latencies at all.
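The phase-by-phase decomposition can be written down as a simple budget. Every duration below is an illustrative placeholder chosen to sit inside the ranges described above, not a measurement, and the phase names are assumptions made for this sketch.

```python
# Hypothetical phase durations in seconds for a fresh VM with no cached weights.
COLD_PHASES = {
    "vm_provision": 120.0,
    "weight_download": 420.0,
    "cuda_context": 5.0,
    "kernel_compile": 90.0,
    "kv_cache_alloc": 2.0,
    "warmup_forward": 0.5,
}

# Hypothetical durations on a provisioned host with local weights
# and a warm compile cache on persistent storage.
WARM_PHASES = {
    "weight_read_local": 47.0,
    "cuda_context": 5.0,
    "compile_cache_read": 5.0,
    "kv_cache_alloc": 2.0,
    "warmup_forward": 0.5,
}


def total_seconds(phases: dict) -> float:
    """Sum of all phase durations, assuming the phases run back to back."""
    return sum(phases.values())


def dominant_phase(phases: dict) -> str:
    """Name of the phase that contributes the most wall-clock time."""
    return max(phases, key=phases.get)


for label, phases in (("cold", COLD_PHASES), ("warm", WARM_PHASES)):
    total = total_seconds(phases)
    top = dominant_phase(phases)
    share = phases[top] / total
    print(f"{label}: {total / 60:.1f} min, dominated by {top} ({share:.0%})")
```

Keeping the budget as data rather than prose makes it easy to replace each placeholder with a measured value and to see which phase is worth optimising first.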
Hibernation versus full cold start
Hibernation is a hypervisor primitive that snapshots a virtual machine's memory image to persistent storage and parks the underlying hardware. A wake restores the snapshot, the guest kernel resumes from the point of suspension, and any process that was running, including a fully initialised inference engine with loaded weights and allocated KV cache, comes back ready to serve.
The weights are still resident in the kernel page cache when the process resumes, and the engine's internal state, kernel compilation cache included, is exactly where it was when the snapshot was taken.
The effect is an order-of-magnitude reduction in cold-start latency. A full cold start on a seventy-billion-class model takes minutes; a wake from hibernation on the same model takes tens of seconds, bounded by the time to read the snapshot back into memory and resume the guest.
Hibernation is not free: the parked VM continues to bill at a fraction of the active rate, typically for the underlying disk and reserved address space rather than the GPU itself. The savings are real only when the idle interval exceeds the round-trip cost of hibernating and waking, which depends on the workload's traffic shape and the provider's price for parked state.
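The break-even condition can be made concrete with a small cost model. The sketch below assumes that the time spent snapshotting and the time spent waking are both billed at the active rate, and that the parked rate is ten percent of the active rate; both are hypothetical simplifications, and real provider billing may differ.

```python
def stay_active_cost(idle_s: float, active_rate: float) -> float:
    """Dollars spent keeping the replica up through an idle interval."""
    return idle_s * active_rate / 3600


def hibernate_cost(idle_s: float, active_rate: float, parked_rate: float,
                   hibernate_s: float, wake_s: float) -> float:
    """Dollars spent hibernating through the same idle interval.

    Assumption: snapshot and wake time are billed at the active rate, and
    the wake happens after the idle interval ends, when a request arrives.
    """
    active_s = hibernate_s + wake_s
    parked_s = max(idle_s - hibernate_s, 0.0)
    return (active_s * active_rate + parked_s * parked_rate) / 3600


def break_even_idle_s(active_rate: float, parked_rate: float,
                      hibernate_s: float, wake_s: float) -> float:
    """Idle length at which both policies cost the same under this model."""
    return ((active_rate * (hibernate_s + wake_s) - parked_rate * hibernate_s)
            / (active_rate - parked_rate))


ACTIVE = 3.80   # hourly rate quoted in the text
PARKED = 0.38   # hypothetical parked rate: 10% of the active rate

t = break_even_idle_s(ACTIVE, PARKED, hibernate_s=30, wake_s=30)
print(f"hibernation pays off once the idle interval exceeds {t:.0f} s")
```

Under these assumptions the break-even is barely longer than the round trip itself, which is why the dollar cost rarely decides the policy; the wake latency imposed on the next request usually does.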
For interactive workloads with quiet hours measured in tens of minutes or longer, hibernation dominates both continuous serving and full cold-start scale-to-zero on the cost-versus-latency Pareto frontier.
Wake budgets and request queueing
The request that arrives at a hibernated replica is the request that pays the wake. A wake taking thirty seconds means a thirty-second first-byte latency for the request that triggered it. A wake taking ten minutes means a ten-minute first-byte latency, which is comfortably past every reasonable HTTP timeout and most user patience budgets. The wake interval is the single most important number a scale-to-zero operator publishes, because it is the worst-case latency a customer will observe under the policy.
Two engineering primitives make the wake survivable. The first is a hard upper bound enforced at the gateway: the wake is given a budget, and the request is failed with a known error code if the budget elapses. Without the bound, an unhealthy hibernated replica pins a connection forever and confuses every downstream timeout.
The second is a streaming heartbeat over the open HTTP connection. Server-sent events or chunked responses that send a keepalive byte every few seconds prevent intermediaries from collapsing the connection while the wake is in flight, and let well-behaved clients distinguish "still working" from "dead". Once the wake completes and the first request returns, every subsequent request enters at steady-state latency until the next quiet interval triggers another hibernation.
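The two primitives compose naturally in an asynchronous gateway handler. The following is a minimal sketch using only the standard library: `stream_with_wake`, `WakeTimeout`, and the `wake` callable are hypothetical names, and the keepalive uses a server-sent-events comment line, which clients ignore. A real gateway would translate the timeout into its chosen error code.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


class WakeTimeout(Exception):
    """Raised when the replica does not become ready within the wake budget."""


async def stream_with_wake(wake: Callable[[], Awaitable[None]],
                           budget_s: float,
                           heartbeat_s: float) -> AsyncIterator[bytes]:
    """Yield keepalive chunks while the wake runs, then a ready marker.

    Raises WakeTimeout once budget_s elapses without the wake completing.
    """
    loop = asyncio.get_running_loop()
    task = asyncio.create_task(wake())
    deadline = loop.time() + budget_s
    try:
        while True:
            remaining = deadline - loop.time()
            if remaining <= 0:
                raise WakeTimeout(f"wake exceeded {budget_s} s budget")
            # Wait for the wake, but never longer than one heartbeat interval.
            done, _ = await asyncio.wait({task}, timeout=min(heartbeat_s, remaining))
            if done:
                task.result()  # re-raise if the wake itself failed
                yield b"data: ready\n\n"
                return
            yield b": keepalive\n\n"  # SSE comment line keeps intermediaries from closing
    finally:
        if not task.done():
            task.cancel()  # do not leave an orphaned wake behind a failed request


if __name__ == "__main__":
    async def fake_wake() -> None:
        await asyncio.sleep(0.05)  # stand-in for the real snapshot restore

    async def demo() -> None:
        async for chunk in stream_with_wake(fake_wake, budget_s=1.0, heartbeat_s=0.02):
            print(chunk)

    asyncio.run(demo())
```

Bounding each wait by the smaller of the heartbeat interval and the remaining budget is what lets one loop serve both purposes: the client sees regular bytes, and the request still fails at a known deadline.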
Lean-stack engineering, after Hanov and Ryer
An inference fleet for a small operator is a Hanov-shaped problem. Hanov argues that a single virtual private server running a single statically linked binary handles tens of thousands of requests per second on modest hardware, and that vertical scaling postpones every distributed-systems problem until measurement proves the postponement cannot continue.
The monthly bill on a lean stack is itself an observability signal: when the line item is small and constant, any unexpected change in it maps to a specific cause an operator can trace. In Hanov's account, SQLite with write-ahead logging enabled outperforms a remote Postgres on workloads that fit on one disk, and it removes a network hop from the failure surface. Static compilation over interpreted runtimes collapses the deploy step to scp followed by systemctl restart.
"The goal is to serve requests, not to maintain infrastructure. When you have one server, you know exactly where the logs are, exactly why it crashed, and exactly how to restart it."
Ryer's essay on writing HTTP services in Go after thirteen years supplies the complementary axis: given a single binary, how should the inside be organised so it stays auditable as it grows? Ryer argues that a NewServer(deps...) constructor should take every dependency as an argument, with no package-level globals and no init-time side effects, and that handlers should be closure factories rather than struct methods, so the dependency surface of every endpoint is visible at the route table.
The same essay wires context cancellation through every layer, so every request-scoped goroutine exits when the request does. main contains no magic: the underlying run() function accepts arguments, environment lookup, and standard streams as parameters so the program is testable end-to-end through httptest. sync.Once guards deferred expensive setup so startup is not blocked on resources most requests will never touch.
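The essay's examples are in Go; the shape of two of its ideas, constructor-injected dependencies with closure-factory handlers and a run function that receives its environment as parameters, can be transliterated into a short Python sketch. Everything below is hypothetical illustration: the routes, the `Handler` type, and the `APP_VERSION` variable are assumptions, and the sketch does not cover context cancellation or deferred setup.

```python
import io
from typing import Callable, Dict, Optional, Sequence, TextIO

Handler = Callable[[str], str]  # hypothetical: request body in, response body out


def handle_echo(log: TextIO) -> Handler:
    """Closure factory: the handler's only dependency is named in the signature."""
    def handler(body: str) -> str:
        log.write(f"echo {len(body)} bytes\n")
        return body
    return handler


def handle_health(version: str) -> Handler:
    """Closure factory for a health endpoint that reports the injected version."""
    def handler(_body: str) -> str:
        return f"ok {version}"
    return handler


def new_server(log: TextIO, version: str) -> Dict[str, Handler]:
    """Every dependency arrives as an argument; nothing is read from module globals."""
    return {
        "/echo": handle_echo(log),
        "/healthz": handle_health(version),
    }


def run(args: Sequence[str],
        getenv: Callable[[str], Optional[str]],
        stdout: TextIO) -> int:
    """Program entry point with arguments, environment, and output injected."""
    version = getenv("APP_VERSION") or "dev"
    routes = new_server(log=stdout, version=version)
    path = args[1] if len(args) > 1 else "/healthz"
    body = args[2] if len(args) > 2 else ""
    handler = routes.get(path)
    if handler is None:
        stdout.write(f"no route {path}\n")
        return 1
    stdout.write(handler(body) + "\n")
    return 0


if __name__ == "__main__":
    import os
    import sys
    sys.exit(run(sys.argv, os.environ.get, sys.stdout))
```

Because run takes its inputs as parameters, a test can drive the whole program with a dictionary lookup in place of the environment and an in-memory buffer in place of standard output, with no patching of globals.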
The two axes compose. Hanov's tenets keep the perimeter of the system small, which is what makes the bill legible and the deployment auditable. Ryer's tenets keep the inside of the binary inspectable as it grows toward that perimeter, which is what prevents a small system from becoming an unmaintainable one as features accrete. The result is a fleet that a single person can read end-to-end in an evening, deploy from a laptop, and explain to a customer line-item by line-item, on the same day it serves production traffic.
Where the lean-stack pattern breaks
The lean stack is not universal. Past the load a single vertically scaled box can serve, the system has to grow horizontally and pays the distributed-systems tax it had been postponing. Multi-tenant compliance regimes such as HIPAA, PCI-DSS, and FedRAMP require infrastructure controls, including network segmentation, key custody, and audit-log retention, that a single-box stack cannot supply without bolting on most of what the lean stack rejected.
Hard real-time systems with safety-critical deadlines rely on specialised runtimes and formal scheduling analyses outside the lean toolkit. Problems whose structure is intrinsically horizontal, including large-scale crawling, video transcoding fleets, and training runs on tens of accelerators, have parallelism in their shape and are not well served by one binary on one box regardless of operator preference.
Readers whose workload sits in any of those categories will spend their attention better on a different pattern; the value of naming the failure modes is letting the self-selection happen up front.
References
- Steve Hanov. Running multiple ten-thousand-dollar monthly-revenue companies on a twenty-dollar tech stack. Available at stevehanov.ca/blog/?id=199. The headline essay for the lean-stack philosophy: one VPS, one statically linked binary, SQLite with write-ahead logging, transparent billing as the operator's first observability signal.
- Mat Ryer. Writing HTTP services in Go after thirteen years. Published February 2024 on the Grafana Labs blog. The complementary essay on internal organisation: explicit dependency injection through NewServer, closure-factory handlers, context cancellation at every layer, end-to-end testing via httptest, and sync.Once for deferred setup.
- Kwon et al. Efficient memory management for large language model serving with PagedAttention. SOSP 2023. The foundational paper on block-table KV-cache management, relevant background for the KV-allocation phase of cold start.
- Yu et al. Orca: a distributed serving system for transformer-based generative models. OSDI 2022. Introduced iteration-level continuous batching, which determines how a warmed-up engine schedules concurrent decode steps and therefore sets the steady-state baseline against which cold-start and wake penalties are measured.
- Knative and similar serverless frameworks document scale-to-zero with a configurable minimum-replica floor, which is the industry's prevailing answer to cold-start sensitivity when a zero floor is operationally unacceptable.