Learn LLM inference

Every educational page on this site, organised into five clusters: what inference costs, what makes it fast, what keeps it reliable, how the standard API works, and how the model itself shapes all four.

Each cluster below opens with the answer, then links to its pages: the narrative treatments, the glossary terms, and the research references. Every deep page links back here, so wherever you land, the rest of its cluster is one hop away.

Inference economics

Self-served GPU inference runs $0.30 to $0.70 per million output tokens on an H100-class tier at good utilisation, several times below hosted API list prices. These pages show where that number comes from, what it includes, and the spot-capacity trade that makes it possible.

Economics: the realised cost per million tokensThe live invoice-true number, blended across the fleet, updated daily.
Pricing: illustrative pre-launch bandsBoth sides published: the cost of goods and the bands built on it.
Spot vs on-demand GPUDiscounted reclaimable GPU capacity versus paid-for-certainty allocation.
Invoice-true cost per million tokensMinute-level billing snapshots over per-request token counts: an honest $0.30 to $0.70 per Mtok band.
OpenAI-compatible gateway vs self-hosting vLLMLabor plus idle GPU hours against a per-token price; utilisation decides it.
Spot vs on-demand GPU inference: cost against latencySpot's discount against the interruption tax; the strong position is spot-first with a bounded recovery path.

Performance and latency

LLM serving speed is set by memory and scheduling, not raw arithmetic: the KV cache decides how many requests fit on a GPU, and iteration-level batching decides how busy the GPU stays. These pages walk the engine mechanics, the hardware, and the throughput numbers measured on this fleet.

How LLM inference engines workOne long argument from TTFT through the KV cache to vLLM.
GPU hardware for LLM inferenceHopper, Ampere, and Ada Lovelace: bandwidth, FP8, and parallelism.
Continuous batchingIteration-level scheduling where requests join and leave every decode step.
KV cachePer-sequence attention tensors retained between decode steps.
PagedAttentionPage-based KV cache allocator modeled on OS virtual memory.
Pipeline parallelism (PP)Layer-group stages assigned one per GPU, micro-batches streaming forward.
Tensor parallelism (TP)Intra-layer sharding of weight matrices across N GPUs.
TTFT, time to first tokenEnd-to-end latency from request submit to first streamed response byte.
Prefill vs decodeThe compute-bound prompt pass and the memory-bound token loop every request splits into.
Orca: continuous batching (Yu et al., OSDI 2022)Iteration-level scheduling and selective batching for transformer serving.
PagedAttention (Kwon et al., SOSP 2023)Paged KV cache; 2-4x throughput by eliminating fragmentation.
vLLM (the serving engine)Reference open-source LLM serving engine; PagedAttention plus continuous batching.
H100 vs A100: measured decode throughputDual H100s at 118.9 tok/s vs dual A100s at 106.6, short-context decode through the production gateway.
How the KV cache sets LLM API latency and costPrefill against decode, the memory arithmetic that caps concurrency, and what an API caller can do about both.

Reliability

A fleet that hibernates when idle is cheaper than an always-on one only when the worst case is bounded: here a wake completes in eight minutes or less measured, with a ten-minute hard cap that fails honestly. These pages cover cold starts, wake budgets, and the engineering that keeps the bound real.

Reliability and operations for GPU inference fleetsScale-to-zero arithmetic, cold-start anatomy, hibernation, wake budgets.
Cold startTime from a fresh process initialising to its first served request.
Cold-start wake latencyThe budgeted SLO for waking a hibernated tier into service.
Lean-stack engineeringOne binary, one box; auditability as the deliverable.
Anatomy of a GPU wake: the 8-minute budgetHibernate-restore decomposed: VM restore, weight load, compile cache, CUDA graphs, first token.
What a cold start actually costsTwo costs, separated: minutes of user-facing latency and warm-up GPU-minutes billed for zero tokens.

OpenAI-compatible integration

An OpenAI-compatible endpoint means existing SDKs, agent frameworks, and tools work after swapping the base URL and the key; nothing else changes. These pages cover the contract itself: the chat completions shape, SSE streaming, tool calls, and structured output.

The OpenAI-compatible chat completions APIHow the shape became the standard, and what the contract looks like.
Structured outputServer-enforced response shape matching a caller-supplied JSON Schema.
Tool calls (function calling)Model returns structured function invocations instead of free text.
OpenAI-compatible APIThe drop-in /v1/chat/completions contract for swapping inference providers.
Context windowThe shared token budget a request's prompt, history, tools, and answer all draw from.
Migrating from OpenAI to an OpenAI-compatible APISwap base_url and the key, keep the SDK; audit model names, token counts, defaults, and the cold path.
Retries, timeouts, and backoff for LLM API callsThree timeouts, a failure taxonomy decided before retrying, and jittered backoff sized to token budgets.
Streaming LLM responses over SSEThe wire format, the parsing rules that survive production, and the middleboxes that buffer streams back into batch.
Structured output and tool calling in practiceDeclare the shape and let the server enforce it; the failures that remain are schema design and truncation.

Model behavior

What a model is (its architecture, its numeric precision, its decoding strategy) moves serving cost as much as any infrastructure choice. These pages cover the model-side levers: sparse mixture-of-experts routing, quantisation formats, and draft-model speculation.

QuantisationStoring weights at lower precision than the training format.
FP8 quantization8-bit floating-point weights executed natively by Hopper tensor cores.
Speculative decodingDraft-model token proposals verified in one target-model pass.
Mixture of experts (MoE)Sparse expert routing: large total capacity, small per-token compute.
AWQ (Lin et al., MLSys 2024)Activation-aware INT4 weight quantisation via per-channel pre-scaling.
GPTQ (Frantar et al., ICLR 2023)Inverse-Hessian-guided post-training INT4 weight quantisation at 175B scale.
Speculative decoding (Leviathan / Chen, 2023)Draft-and-verify decoding; 2-3x lower latency at zero quality cost.

All of this runs behind one endpoint. When you have read enough, point an OpenAI SDK at spotinference: swap the base URL and the key, and the first request answers from the same fleet these pages measure.