spotinference

Sign in with GitHub
How it works and who runs it

spotinference is one Go binary operated by one person: a gateway that walks a priority-ordered lineup of GPU virtual machines, wakes a hibernated machine when a request needs one, and proxies to vLLM.

This page covers the request path in plain terms, the security posture, the uptime promise the design supports, the measured numbers behind the throughput claims, and who is accountable when something breaks.

The request path, in plain terms

A request arrives over TLS carrying a bearer key. The gateway checks the key, then walks its tier lineup in priority order for a GPU machine that can serve the model. If the best machine is hibernated, the gateway wakes it, waits for vLLM to report healthy, and forwards the request. The response streams back through the gateway, and a usage row records the prompt and completion token counts.

Machines hibernate after fifteen idle minutes, so quiet hours cost storage rather than GPU rental. The wake path and its failure modes are the subject of the reliability page; the engine underneath is the subject of the engines page.

Security posture

API keys are random secrets stored only as SHA-256 hashes; a database read cannot recover a key. Dashboard access rides GitHub sign-in with cookie sessions, and every session issuance writes a permanent audit row. The inference endpoint accepts nothing without a valid bearer key.

Every request that reaches a model writes a usage row: tokens in, tokens out, latency, status. Every fleet mutation (a machine created, woken, hibernated, or deleted) lands in an operations journal. The audit trail is not an add-on; it is the same data the published cost numbers fold over.

Uptime and the wake budget

The honest reliability statement is a bounded worst case, not a nines figure. A request that lands on a hibernated tier waits for the wake: eight minutes or less measured, from hibernate-restore through vLLM cold start. A hard cap at ten minutes returns an explicit 504 rather than hanging the connection. Warm requests skip all of this and stream at steady-state latency; the full anatomy is on the reliability page.

Measured numbers

Throughput on the two benchmarked tiers, measured April 2026 with canned traffic through the gateway at short context: a pair of H100s serving an FP8-quantised checkpoint sustained 118.9 tokens per second, and a pair of A100s on compressed-tensors quantisation sustained 106.6. The gateway added less than 5 ms at P95 over a direct vLLM call.

On cost, the published spot band sits one third to one half below the standard band for output tokens: $2 against $3 is 33 percent less, $2 against $4 is 50 percent. That is arithmetic over the illustrative bands on the pricing page, not a measurement; the trade behind it is explained at spot vs on-demand GPU.

Who runs this

One operator, no team. That cuts both ways: there is no support rotation, and there is also nothing a sales deck can hide behind. The entire stack is a single statically linked Go binary, standard library plus three pinned dependencies, small enough to read end-to-end in an evening.

The cost accounting is invoice-truth: the provider's billing API is snapshotted once per minute into a local table, and every published dollar figure folds over that table. The economics page shows the realised number; /economics/method publishes the exact SQL so the fold can be checked rather than trusted.