spotinference

Sign in with GitHub
Abstract

spotinference serves open-weights models behind the standard chat completions API, fronting a GPU fleet that hibernates when idle. An existing OpenAI SDK points at it unchanged. Throughput is measured through the production gateway, and the published cost per million tokens is the provider invoice divided by tokens served.

Key takeaways
  • The endpoint speaks the OpenAI chat completions shape, so an existing SDK works after swapping the base URL and the key; nothing else changes.
  • Idle tiers hibernate and the next request wakes them, so the bill runs only while a GPU is actually serving.
  • Throughput is measured through the production gateway, not quoted from datasheets: 118.9 tok/s on a pair of H100s and 106.6 on a pair of A100s at short context (experimental, 2026-04).
  • Scope limitation: this page is the orientation; the economics, performance, and reliability claims summarised here are derived and bounded on their own pages.
An OpenAI-compatible inference API

spotinference serves open-weights models behind the standard chat completions API. Point an existing OpenAI SDK at the endpoint, swap the base URL and the key, and requests stream back tokens. No new SDK, no new request shape, nothing to relearn.

Behind the endpoint is a fleet of rented GPU machines that hibernate when idle and wake on demand. The fleet management is invisible from the outside: the endpoint answers, and the bill only runs while a GPU is actually working.

Measured, not promised

The throughput numbers are measured through the production gateway, not quoted from datasheets. A pair of H100s sustains 118.9 tokens per second at short context; a pair of A100s sustains 106.6. The gateway adds less than 5 ms at P95 over a direct engine call.

The reliability promise is a bounded worst case: a request that lands on a sleeping machine waits for the wake, eight minutes or less measured, with a hard cap at ten. Warm requests stream at steady-state latency. The reliability page walks the full anatomy.

The operating model

GPUs are rented by the minute and hibernate when nobody is using them, and the provider's invoice passes straight through. The published cost per million tokens is that invoice divided by the tokens served; holding it down is built into how the service runs, not something a customer manages.

Where to go next

The economics page publishes the realised cost per million tokens, updated daily, and the pricing page shows the bands built on it. Everything educational is organised on the learn hub: the economics behind that number, performance, reliability, the API standard, and model behavior, one hop from every guide, glossary term, and reference on this site. Signing in with GitHub mints an API key once signup opens.