Cold start is the wall-clock time from a fresh inference process beginning to initialise to that process serving its first request. For LLM serving the interval is dominated by loading multi-gigabyte model weights from disk into GPU memory, so it runs minutes rather than the milliseconds typical of CPU function cold starts.

Cold start

Cold start is the elapsed wall-clock time between a fresh inference process beginning to initialise and that same process serving its first request, dominated by the disk-to-VRAM transfer of multi-gigabyte model weights.

The interval upper-bounds the latency contract for any scale-to-zero deployment, which is why an LLM cold start two or three orders of magnitude larger than a CPU function cold start reshapes the operational regime around hibernation, warm pools, and persistent weight caches.

For the longer treatment, see Reliability: cold-start anatomy.

Cost and reliability implications

Every cold start is billed GPU time that serves nobody: a multi-minute weight load on an on-demand H100 costs real dollars before the first token ships. Scale-to-zero economics only work when weights and the torch.compile cache live on a persistent volume, turning a fresh download into a disk read and bounding wake latency to a predictable single-digit-minute window.

Part of Reliability on the learn hub.

See also

References

AWS Lambda Operator Guide: Understanding Lambda function scaling and cold starts. Canonical write-up of cold-start economics in the serverless model from which the term entered general infrastructure usage; the LLM regime inherits the framing but operates at two or three orders of magnitude larger latencies because weight loading dominates.

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.