spotinference Sign in with GitHub

Cost of goods on this fleet runs roughly $0.30 to $0.70 per million output tokens on an H100-class tier at good utilisation serving 4-bit weights. The band is invoice-true: the provider's billing API is snapshotted once per minute, every request's tokens land in a usage log, and cost per token is the quotient of the two.

Invoice-true cost per million tokens: spot GPU economics without hand-waving

Most published cost-per-token numbers are list price divided by theoretical throughput. The number on this site is built the other way around: from the provider's own billing data, snapshotted once per minute, divided by the tokens actually served.

The band

Cost of goods runs roughly $0.30 to $0.70 per million output tokens on an H100-class tier at good utilisation, serving 4-bit quantised weights. It is a band and not a point because utilisation dominates: the same GPU-hour spread over fewer tokens costs more per token, and no amount of hardware enthusiasm changes that arithmetic.

The accounting

Two inputs carry the whole story. The provider's billing is sampled once per minute, recording each resource's current rate and cumulative bill. Every request's prompt and completion tokens are counted. Spend for a window is a subtraction between billing samples; cost per million tokens is that delta divided by the tokens served in the same window.

There is deliberately no local estimate multiplying rates by elapsed time. Every published figure folds over the provider's own billing data, so the spend number cannot drift from the invoice. An idle fleet under healthy billing data honestly reads $0 spend; billing data going dark is treated as lost observability, not as zero.

Failing closed on lost observability

The daily budget gate fails closed: when billing observability is lost, the gateway stops serving rather than spending blind. Serving without knowing the spend is treated as an outage, the same as any other broken invariant.

Why hibernation shapes the band

The fleet buys spot-priced and on-demand GPU capacity and hibernates idle tiers, so an idle hour costs storage cents rather than GPU dollars, while the persistent volume keeps the weights so a wake stays bounded (the wake anatomy article walks through that budget). Scale-to-zero is what pulls real utilisation toward the cheap end of the band.

What the band leaves out

The band covers GPU compute and the per-tier storage volume. It excludes the fixed gateway VPS and egress bandwidth, which do not scale with tokens served; they are real costs, but they amortise across the whole fleet rather than attaching to a tier. Pricing built on the band lives on the pricing page.

Methodology

Two inputs, joined by time window. The provider's billing is sampled once per minute, recording each resource's current rate and cumulative bill, so spend for any window is a subtraction between samples rather than a locally integrated estimate. Prompt and completion tokens are counted for every request. Dollars per million output tokens for a window is the spend delta divided by the summed completion tokens. The daily budget gate fails closed when the sampling stops, so the quotient is never computed from stale spend.

Part of Inference economics on the learn hub.

See also
References

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.