Inference economics

Self-served GPU inference runs $0.30 to $0.70 per million output tokens on an H100-class tier at good utilisation, several times below hosted API list prices. These pages show where that number comes from, what it includes, and the spot-capacity trade-off that makes it possible.

Inference cost is a unit-economics question: dollars billed by the GPU provider divided by tokens served. Everything else (the duty cycle, the quantisation format, the hardware tier, the idle policy) is an input to that ratio.
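As a minimal sketch of that ratio, the function below divides provider dollars by tokens served and scales the result to a per-million figure. The function name and the sample figures are illustrative assumptions, not values from the live fleet.

```python
def cost_per_million_tokens(dollars_billed: float, tokens_served: int) -> float:
    """Dollars billed by the GPU provider divided by tokens served, per million tokens."""
    if tokens_served <= 0:
        raise ValueError("tokens_served must be positive")
    return dollars_billed / tokens_served * 1_000_000


# Hypothetical trailing window: $1,200 billed while serving 2.4 billion output tokens.
window_cost = cost_per_million_tokens(1200.0, 2_400_000_000)
print(f"${window_cost:.2f} per Mtok")
```

Duty cycle, quantisation format, hardware tier, and idle policy never appear as arguments here; they only change the two totals that go in.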

The decisive lever is utilisation. A self-served accelerator bills for every paid hour whether or not it is working, so the realised cost per million tokens tracks how busy the fleet stays, not a best-case rate. That is why the headline figure here is invoice-true and blended over a trailing window rather than quoted at full utilisation.
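The effect of utilisation can be sketched with a simple model: the hourly price is fixed, and only the busy fraction of each paid hour produces tokens. The hourly price and peak throughput below are hypothetical placeholders chosen for illustration, not measured figures.

```python
def cost_per_mtok_at_utilisation(
    hourly_price: float,
    peak_tokens_per_second: float,
    utilisation: float,
) -> float:
    """Cost per million tokens when only `utilisation` of each paid hour serves traffic."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_paid_hour = peak_tokens_per_second * 3600 * utilisation
    return hourly_price / tokens_per_paid_hour * 1_000_000


# Hypothetical accelerator: $2.50 per hour, 2,500 output tokens per second at peak.
for utilisation in (1.0, 0.5, 0.25):
    cost = cost_per_mtok_at_utilisation(2.50, 2500.0, utilisation)
    print(f"utilisation {utilisation:.0%}: ${cost:.2f} per Mtok")
```

In this model, halving utilisation doubles the cost per million tokens, which is why a rate quoted at full utilisation understates what the invoice shows.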

This pillar collects every page on the topic. Each one below opens with the answer; follow a link for the full treatment, or use the rail to cross into a neighbouring pillar.

Economics: the realised cost per million tokens

The live invoice-true number, blended across the fleet, updated daily. Read more.

Pricing: illustrative pre-launch bands

Both sides published: the cost of goods and the bands built on it. Read more.

Spot vs on-demand GPU

Discounted reclaimable GPU capacity versus paid-for-certainty allocation. Read more.

Invoice-true cost per million tokens

Minute-level billing snapshots divided by per-request token counts: an honest $0.30 to $0.70 per Mtok band. Read more.

OpenAI-compatible gateway vs self-hosting vLLM

Labor plus idle GPU hours against a per-token price; utilisation decides it. Read more.

Spot vs on-demand GPU inference: cost against latency

Spot's discount against the interruption tax; the strong position is spot-first with a bounded recovery path. Read more.
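One rough way to frame the spot trade-off is to treat the interruption tax as the fraction of paid spot hours lost to reclaims and recovery, then compare the cost of a useful hour on each side. This is a simplified, assumed model with hypothetical prices; it ignores the latency impact of an interruption, which the linked pages treat separately.

```python
def effective_spot_hourly(
    on_demand_hourly: float,
    spot_discount: float,
    lost_fraction: float,
) -> float:
    """Cost of one useful spot hour after paying for hours lost to interruptions."""
    if not 0 <= spot_discount < 1:
        raise ValueError("spot_discount must be in [0, 1)")
    if not 0 <= lost_fraction < 1:
        raise ValueError("lost_fraction must be in [0, 1)")
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    return spot_hourly / (1 - lost_fraction)


def spot_is_cheaper(on_demand_hourly: float, spot_discount: float, lost_fraction: float) -> bool:
    """True when spot still undercuts on-demand once the interruption tax is counted."""
    return effective_spot_hourly(on_demand_hourly, spot_discount, lost_fraction) < on_demand_hourly


# Hypothetical: $4.00 on-demand, 60% spot discount, 10% of spot hours lost to reclaims.
print(f"${effective_spot_hourly(4.00, 0.60, 0.10):.2f} per useful spot hour")
print(spot_is_cheaper(4.00, 0.60, 0.10))
```

Under these assumptions spot stays cheaper as long as the lost fraction is smaller than the discount, which is one way to read why a bounded recovery path matters: it caps the lost fraction.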

Every page in this pillar describes the system running behind one endpoint. Point an OpenAI SDK at spotinference when ready: swap the base URL and the key, and the first request answers from the same fleet these pages measure.