spotinference is a developer-facing inference service built on a deliberately small, auditable stack. It runs open models on a priority-ordered GPU fleet and bills against the underlying provider invoice rather than adding a per-token markup.
The engineering posture is lean by design: one statically linked Go binary serves the gateway, operates the fleet from the command line, and runs the idle-hibernate daemon on the VM. The whole system favours operational transparency over distributed cleverness, so the billing ledger reconciles against the provider invoice line by line.
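As a rough illustration of how one binary can cover all three roles, the sketch below dispatches on its first argument. The mode names (`gateway`, `fleet`, `hibernate`) and the `dispatch` helper are assumptions made for this example, not the project's actual command-line interface, and each mode only prints a placeholder.

```go
package main

import (
	"fmt"
	"os"
)

// runFunc is the entry point for one role of the binary.
type runFunc func(args []string) error

// modes maps a hypothetical subcommand name to its role.
// The real binary's subcommands may be named and structured differently.
func modes() map[string]runFunc {
	return map[string]runFunc{
		"gateway": func(args []string) error {
			fmt.Println("gateway: would serve inference requests")
			return nil
		},
		"fleet": func(args []string) error {
			fmt.Println("fleet: would run an operator command", args)
			return nil
		},
		"hibernate": func(args []string) error {
			fmt.Println("hibernate: would watch for idleness on the VM")
			return nil
		},
	}
}

// dispatch selects a role from the first argument and hands it the rest.
// With no arguments it prints a usage line and succeeds.
func dispatch(args []string) error {
	if len(args) == 0 {
		fmt.Println("usage: <gateway|fleet|hibernate> [args]")
		return nil
	}
	run, ok := modes()[args[0]]
	if !ok {
		return fmt.Errorf("unknown mode %q", args[0])
	}
	return run(args[1:])
}

func main() {
	if err := dispatch(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Keeping the roles behind one dispatch table means there is a single artifact to build, ship, and audit, which fits the lean posture described above.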
How the service is run
The fleet serves a single region today, scales to zero when idle, and wakes a tier on the next request within a bounded budget. The operational details, the request path, the security posture, and the measured throughput are published openly rather than kept behind a sales call.
- How it works: The request path from key to streamed response, in plain terms.
- What inference actually costs: The realised cost per million tokens, traced to the invoice.
- The background reading: Five pillars covering the economics, performance, and reliability of GPU inference.
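For readers who want the arithmetic behind a realised cost per million tokens, a minimal sketch follows. It assumes the figure is simply the invoice total for a period divided by the tokens served in that period, scaled to one million. The function name and the input numbers are illustrative assumptions, not measured values or the project's own code.

```go
package main

import (
	"errors"
	"fmt"
)

// costPerMillionTokens spreads an invoice total (in USD) over the tokens
// served in the same billing period and scales the result to one million tokens.
func costPerMillionTokens(invoiceUSD float64, tokens int64) (float64, error) {
	if tokens <= 0 {
		return 0, errors.New("no tokens served in the period")
	}
	if invoiceUSD < 0 {
		return 0, errors.New("negative invoice total")
	}
	return invoiceUSD * 1e6 / float64(tokens), nil
}

func main() {
	// Illustrative inputs only: a $120.00 invoice and 48 million tokens served.
	cost, err := costPerMillionTokens(120.00, 48_000_000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("$%.2f per million tokens\n", cost) // prints "$2.50 per million tokens"
}
```

Because a scale-to-zero fleet still incurs some cost while waking or idling, a figure computed this way includes that overhead, which is what distinguishes a realised cost from a list price.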
Source and contact
The project is developed in the open. The code and its history are available on GitHub.