Spot vs on-demand GPU
Every cloud sells the same accelerator two ways: on-demand, where the instance is yours until released, and spot (or preemptible), where the provider sells spare capacity at a discount and reserves the right to take it back. For GPU inference the discount is large enough to dominate fleet economics, often more than half the on-demand rate.
The economics
An inference fleet's bill is GPU hours, and most of those hours are idle tail: capacity held for traffic that has not arrived yet. On-demand pricing charges full rate for that tail. Spot pricing discounts it heavily but adds a new failure mode, reclamation mid-request, that the serving layer has to absorb without losing customer traffic or model state.
Interruption as a design input
The reliability answer is to make interruption equivalent to a planned hibernate: keep model weights and the torch.compile cache on a persistent volume that survives the VM, so any restore is a disk mount plus an engine boot rather than a fresh multi-gigabyte download. Measured on spotinference's fleet, that bounds the wake from hibernate-restore to a serving engine at 8 minutes or less, which turns reclaimable capacity from a gamble into a latency budget.
For how the fleet exploits this, see Reliability: scale to zero and hibernation.