Invoice-true cost per million tokens: spot GPU economics without hand-waving

Most published cost-per-token numbers are list price divided by theoretical throughput. The number on this site is built the other way around: from the provider's own billing data, snapshotted once per minute, divided by the tokens actually served.

The band

Cost of goods runs roughly $0.30 to $0.70 per million output tokens on an H100-class tier at good utilisation, serving 4-bit quantised weights. It is a band and not a point because utilisation dominates: the same GPU-hour spread over fewer tokens costs more per token, and no amount of hardware enthusiasm changes that arithmetic.
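The utilisation arithmetic can be made concrete with a minimal sketch. The hourly rate and peak throughput below are hypothetical placeholders, not figures from this site's billing data; the function name and parameters are likewise illustrative.

```python
def cost_per_million_tokens(gpu_usd_per_hour: float,
                            peak_tokens_per_sec: float,
                            utilisation: float) -> float:
    """Cost per million tokens for one GPU-hour at a given utilisation.

    utilisation is the fraction of the billed hour spent serving at peak
    throughput (0 < utilisation <= 1).
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilisation
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs: a $2.50/hour GPU serving 2,500 tokens/s at peak.
for u in (1.0, 0.5, 0.25):
    usd = cost_per_million_tokens(2.50, 2500, u)
    print(f"utilisation {u:.0%}: ${usd:.2f} per million tokens")
```

Halving utilisation doubles the per-token cost because the GPU-hour is billed either way, which is why the figure is a band rather than a point.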

The accounting

Two inputs carry the whole story: billing samples and token counts. The provider's billing is sampled once per minute, recording each resource's current rate and cumulative bill, and every request's prompt and completion tokens are counted. Spend for a window is the difference in cumulative bill between the billing samples at its two ends; cost per million tokens is that delta divided by the tokens served in the same window, scaled to a million.
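As a minimal sketch of that subtraction, assuming each billing sample carries a resource identifier and the provider's cumulative bill: the BillingSample type, its field names, and the sample values are hypothetical and are not the site's actual schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BillingSample:
    minute: int            # sample time, minutes since an arbitrary epoch
    resource: str          # provider resource identifier (hypothetical)
    cumulative_usd: float  # provider-reported running bill for the resource


def window_spend(samples: Iterable[BillingSample], start: int, end: int) -> float:
    """Sum, over resources, of last-minus-first cumulative bill in [start, end]."""
    first: dict = {}
    last: dict = {}
    for s in sorted(samples, key=lambda s: s.minute):
        if start <= s.minute <= end:
            first.setdefault(s.resource, s.cumulative_usd)
            last[s.resource] = s.cumulative_usd
    return sum(last[r] - first[r] for r in last)


def cost_per_million(spend_usd: float, tokens_served: int) -> Optional[float]:
    """Window spend divided by tokens served, scaled to a million tokens."""
    if tokens_served == 0:
        return None  # no tokens served: the ratio is undefined, not zero
    return spend_usd / tokens_served * 1_000_000


samples = [
    BillingSample(0, "gpu-a", 10.0),
    BillingSample(60, "gpu-a", 12.5),
]
spend = window_spend(samples, 0, 60)
print(spend, cost_per_million(spend, 5_000_000))
```

Nothing here multiplies a rate by elapsed time; the only arithmetic on money is a difference between two numbers the provider reported.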

There is deliberately no local estimate multiplying rates by elapsed time. Every published figure is computed from the provider's own billing data, so the spend number cannot drift from the invoice. An idle fleet with healthy billing data honestly reads $0 spend; billing data going dark is treated as lost observability, not as zero.
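One way to keep "idle" and "dark" from collapsing into the same number is to give them different types. The sketch below is an illustration under assumptions: the staleness threshold, the function name, and the tuple layout are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Assumed threshold: how old the newest billing sample may be before the
# data counts as dark. The real value is not stated in this article.
MAX_STALENESS_MIN = 3


def observed_spend(samples: Sequence[Tuple[int, float]],
                   now_min: int) -> Optional[float]:
    """Return spend over the sampled window, or None if billing is dark.

    samples: (minute, cumulative_usd) pairs for one resource, sorted by minute.
    """
    if len(samples) < 2:
        return None  # not enough data to take a difference
    newest_minute = samples[-1][0]
    if now_min - newest_minute > MAX_STALENESS_MIN:
        return None  # samples stopped arriving: unknown, not zero
    return samples[-1][1] - samples[0][1]


idle = observed_spend([(0, 4.0), (1, 4.0), (2, 4.0)], now_min=2)
dark = observed_spend([(0, 4.0), (1, 4.0)], now_min=30)
print(idle, dark)
```

Returning None rather than 0.0 forces every caller to decide what to do about missing data instead of silently treating it as free.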

Failing closed on lost observability

The daily budget gate fails closed: when billing observability is lost, the gateway stops serving rather than spending blind. Serving without knowing the spend is treated as an outage, the same as any other broken invariant.
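A fail-closed gate is small enough to sketch in full. The function and parameter names below are hypothetical, and the convention that unknown spend is represented as None is an assumption of the sketch rather than a description of the gateway's code.

```python
from typing import Optional


def may_serve(spend_today_usd: Optional[float], daily_budget_usd: float) -> bool:
    """Decide whether the gateway may accept a request.

    spend_today_usd is None when billing observability is lost.
    """
    if spend_today_usd is None:
        return False  # fail closed: unknown spend is never treated as zero
    return spend_today_usd < daily_budget_usd


print(may_serve(0.0, 20.0))   # idle fleet, healthy billing data
print(may_serve(None, 20.0))  # billing data dark
print(may_serve(25.0, 20.0))  # budget exhausted
```

The important property is the first branch: the default on missing information is to refuse, so a billing outage can cost availability but cannot cost unbounded money.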

Why hibernation shapes the band

The fleet buys both spot and on-demand GPU capacity and hibernates idle tiers, so an idle hour costs storage cents rather than GPU dollars. The persistent volume keeps the weights across hibernation, so a wake stays bounded (the wake anatomy article walks through that budget). Scale-to-zero raises the utilisation of the GPU-hours actually billed, which is what pulls real cost toward the cheap end of the band.
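The effect of hibernation on a tier's daily bill can be sketched with a few lines of arithmetic. The rates and the six active hours below are hypothetical, and the model ignores wake time and any billing granularity the provider applies.

```python
def daily_cost(active_hours: float,
               gpu_usd_per_hour: float,
               storage_usd_per_hour: float,
               hibernate_when_idle: bool) -> float:
    """Daily cost of one tier: GPU for billed hours, storage around the clock."""
    if not 0 <= active_hours <= 24:
        raise ValueError("active_hours must be within a day")
    gpu_hours = active_hours if hibernate_when_idle else 24.0
    return gpu_hours * gpu_usd_per_hour + 24.0 * storage_usd_per_hour


# Hypothetical tier: busy 6 hours a day, $2.50/hour GPU, $0.03/hour volume.
always_on = daily_cost(6, 2.50, 0.03, hibernate_when_idle=False)
hibernating = daily_cost(6, 2.50, 0.03, hibernate_when_idle=True)
print(f"always on: ${always_on:.2f}/day, hibernating: ${hibernating:.2f}/day")
```

Both cases serve the same tokens, so the ratio of the two daily costs is also the ratio of their costs per token.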

What the band leaves out

The band covers GPU compute and the per-tier storage volume. It excludes the fixed gateway VPS and egress bandwidth, which do not scale with tokens served; they are real costs, but they amortise across the whole fleet rather than attaching to a tier. Pricing built on the band lives on the pricing page.