spotinference Sign in with GitHub

Spot GPUs are surplus accelerator capacity sold at steep discounts but reclaimable by the provider on short notice; on-demand GPUs cost several times more and stay allocated until released. The split sets an inference fleet's cost floor, and discounted capacity only pays off when serving survives interruption without losing model state.

Spot vs on-demand GPU

Every cloud sells the same accelerator two ways: on-demand, where the instance is yours until released, and spot (or preemptible), where the provider sells spare capacity at a discount and reserves the right to take it back. For GPU inference the discount is large enough to dominate fleet economics, often more than half the on-demand rate.

The economics

An inference fleet's bill is GPU hours, and most of those hours are idle tail: capacity held for traffic that has not arrived yet. On-demand pricing charges full rate for that tail. Spot pricing discounts it heavily but adds a new failure mode, reclamation mid-request, that the serving layer has to absorb without losing customer traffic or model state.

Interruption as a design input

The reliability answer is to make interruption equivalent to a planned hibernate: keep model weights and the torch.compile cache on a persistent volume that survives the VM, so any restore is a disk mount plus an engine boot rather than a fresh multi-gigabyte download. Measured on spotinference's fleet, that bounds the wake from hibernate-restore to a serving engine at 8 minutes or less, which turns reclaimable capacity from a gamble into a latency budget.

For how the fleet exploits this, see Reliability: scale to zero and hibernation.

Cost and reliability implications

Hibernation turns the spot trade into a schedule rather than a gamble: an idle VM stops billing GPU hours, and because model weights and the torch.compile cache live on a persistent volume, a wake from hibernate-restore to a serving engine completes within 8 minutes measured on spotinference's fleet, under a 10-minute hard cap that returns 504 instead of billing an unbounded restore.

Part of Inference economics on the learn hub.

See also
References
  • Amazon EC2 Spot Instances. The canonical spot model: spare-capacity pricing with a two-minute interruption notice, the contract every spot-tolerant architecture designs against.
  • Google Cloud: Spot VMs. Second-vendor confirmation of the same economics: 60-91% discounts against on-demand in exchange for preemptibility.
  • Hyperstack: Hibernating a Virtual Machine. The hibernate-restore primitive that lets a fleet release GPU spend while idle and resume on the same volume.

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.