spotinference

Sign in with GitHub
Learn LLM inference

Every educational page on this site, organised into five clusters: what inference costs, what makes it fast, what keeps it reliable, how the standard API works, and how the model itself shapes all four.

Each cluster below opens with the answer, then links to its pages: the narrative treatments, the glossary terms, and the research references. Every deep page links back here, so wherever you land, the rest of its cluster is one hop away.

Inference economics

Self-served GPU inference runs $0.30 to $0.70 per million output tokens on an H100-class tier at good utilisation, several times below hosted API list prices. These pages show where that number comes from, what it includes, and the spot-capacity trade that makes it possible.

Performance and latency

LLM serving speed is set by memory and scheduling, not raw arithmetic: the KV cache decides how many requests fit on a GPU, and iteration-level batching decides how busy the GPU stays. These pages walk the engine mechanics, the hardware, and the throughput numbers measured on this fleet.

Reliability

A fleet that hibernates when idle is cheaper than an always-on one only when the worst case is bounded: here a wake completes in eight minutes or less measured, with a ten-minute hard cap that fails honestly. These pages cover cold starts, wake budgets, and the engineering that keeps the bound real.

OpenAI-compatible integration

An OpenAI-compatible endpoint means existing SDKs, agent frameworks, and tools work after swapping the base URL and the key; nothing else changes. These pages cover the contract itself: the chat completions shape, SSE streaming, tool calls, and structured output.

Model behavior

What a model is (its architecture, its numeric precision, its decoding strategy) moves serving cost as much as any infrastructure choice. These pages cover the model-side levers: sparse mixture-of-experts routing, quantisation formats, and draft-model speculation.

All of this runs behind one endpoint. When you have read enough, point an OpenAI SDK at spotinference: swap the base URL and the key, and the first request answers from the same fleet these pages measure.