Inference on GPUs

LLM inference is a GPU loading a multi-gigabyte weight file and running matrix multiplies for every token it generates. The cost of that GPU, per hour, is now public and rentable.

The interesting question stopped being "which hosted API is cheapest" and became "what does an hour of an H100 or an A100 actually cost, and how much of that hour does a real workload keep the card busy." Per-token list prices roll up those two numbers and add a margin. The underlying rates are legible.

What changed 2022-2026

Open-weights models caught up to the closed frontier on the workloads most teams actually run. Qwen3, DeepSeek, and Llama-class checkpoints now serve coding, reasoning, and tool-using agents at quality that was hosted-API-only two years ago. The weights are downloadable; the serving stack (vLLM, SGLang, TGI) is open source.

Quantisation moved from research to default. FP8 on Hopper, AWQ and compressed-tensors on Ampere, GPTQ across the board: 70B-class models now fit on a single H100 or a pair of A100s with negligible quality loss. At the same time, hourly GPU markets (Hyperstack, Lambda, RunPod) made per-hour rental costs legible. Per-token hosted-API pricing is now a markup on per-hour rental, not a necessity.

The wedge

I run inference on rented GPUs by the minute, idle them when nobody is talking to them, and pass the invoice through.

The result is a standard OpenAI-compatible endpoint billed against the actual upstream invoice from the GPU provider, not a per-token markup. Idle hours cost nothing because idle GPUs hibernate.

The rest of this site

A working engineer's tour of inference. The economics page publishes the realised cost per million tokens, blended across the fleet over the trailing seven days. The hardware page covers what separates Hopper, Ampere, and Ada Lovelace cards for production serving. The API page walks through the OpenAI Chat Completions request and response contract. The glossary defines the working vocabulary (TTFT, KV cache, PagedAttention, continuous batching, quantisation), and the research page tracks the papers and engineering write-ups that shape modern inference serving.