spotinference Sign in with GitHub

Cold start is the wall-clock time from a fresh inference process beginning to initialise to that process serving its first request. For LLM serving the interval is dominated by loading multi-gigabyte model weights from disk into GPU memory, so it runs minutes rather than the milliseconds typical of CPU function cold starts.

Cold start

Cold start is the elapsed wall-clock time between a fresh inference process beginning to initialise and that same process serving its first request, dominated by the disk-to-VRAM transfer of multi-gigabyte model weights.

The interval upper-bounds the latency contract for any scale-to-zero deployment, which is why an LLM cold start two or three orders of magnitude larger than a CPU function cold start reshapes the operational regime around hibernation, warm pools, and persistent weight caches.

For the longer treatment, see Reliability: cold-start anatomy.

Cost and reliability implications

Every cold start is billed GPU time that serves nobody: a multi-minute weight load on an on-demand H100 costs real dollars before the first token ships. Scale-to-zero economics only work when weights and the torch.compile cache live on a persistent volume, turning a fresh download into a disk read and bounding wake latency to a predictable single-digit-minute window.

Part of Reliability on the learn hub.

See also
References

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.