
On a short-context decode benchmark run through the production gateway on 2026-04-19, a dual-H100 tier serving FP8 weights sustained 118.9 tokens per second and a dual-A100 tier serving compressed-tensors weights sustained 106.6 tokens per second. That gap, roughly 12 percent, is far smaller than the datasheet gap between the two GPUs.

H100 vs A100: measured decode throughput on a priority-lineup gateway

Two GPU tiers, one benchmark, no extrapolation: on 2026-04-19 the dual-H100 tier decoded 118.9 tokens per second and the dual-A100 tier 106.6, short context, measured end to end through the production gateway rather than against a bare engine port.

The result

| Tier | Checkpoint | Short-context decode |
|---|---|---|
| Dual H100 | FP8 | 118.9 tok/s |
| Dual A100 | compressed-tensors | 106.6 tok/s |
| Dual L40 | 4-bit compressed-tensors | not yet measured |

Both measured tiers serve the same model behind the same OpenAI-compatible API, split across two GPUs with tensor parallelism. The H100 advantage on this workload lands at roughly 12 percent. Datasheet ratios between the two parts are far larger, and that difference is the interesting part of the result.

What drives the gap, and what shrinks it

Short-context decode streams the model weights from GPU memory once per generated token, so memory bandwidth sets the ceiling. FP8 quantisation runs on the H100's native FP8 tensor cores; the A100 predates hardware FP8 and serves a compressed-tensors checkpoint instead. The narrative version of that trade lives at Engines: quantisation.
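
The bandwidth ceiling can be put into numbers with a back-of-envelope sketch. It assumes every generated token reads the full set of weights exactly once and that tensor parallelism lets each GPU read only its own shard. The checkpoint size (16 GB) is a hypothetical placeholder, and the bandwidth figures are the SXM datasheet values for the H100 and the 80 GB A100; the page does not state which variants the tiers run, so treat all three inputs as assumptions.

```python
def decode_ceiling_tok_s(weight_gb: float, bandwidth_gb_s: float, n_gpus: int = 1) -> float:
    """Upper bound on single-stream decode rate when each token streams all weights once.

    With tensor parallelism every GPU reads only its own shard, so the
    aggregate bandwidth is n_gpus * bandwidth_gb_s.
    """
    if weight_gb <= 0 or bandwidth_gb_s <= 0 or n_gpus < 1:
        raise ValueError("weight_gb and bandwidth_gb_s must be positive, n_gpus >= 1")
    return n_gpus * bandwidth_gb_s / weight_gb


# Assumptions for illustration only, not the fleet's configuration:
WEIGHT_GB = 16.0        # hypothetical checkpoint size after quantisation
H100_BW_GB_S = 3350.0   # H100 SXM datasheet HBM bandwidth (assumed variant)
A100_BW_GB_S = 2039.0   # A100 80GB SXM datasheet HBM bandwidth (assumed variant)

h100_ceiling = decode_ceiling_tok_s(WEIGHT_GB, H100_BW_GB_S, n_gpus=2)
a100_ceiling = decode_ceiling_tok_s(WEIGHT_GB, A100_BW_GB_S, n_gpus=2)

print(f"H100x2 ceiling: {h100_ceiling:.1f} tok/s")
print(f"A100x2 ceiling: {a100_ceiling:.1f} tok/s")
print(f"ceiling ratio:  {h100_ceiling / a100_ceiling:.2f}x")
```

Under this model the checkpoint size cancels out of the ratio when both tiers move the same number of bytes per token, so the ceiling ratio is simply the bandwidth ratio. Two different checkpoints, as served here, would shift it.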

The measured gap is narrower than the hardware gap because a served token pays for more than matrix multiplies. Scheduling, sampling, detokenisation, and the HTTP hop ride along on every token, and none of them get faster on Hopper. Quantisation also compresses the difference: both checkpoints shrink the weights that move per token, which spends part of the H100's bandwidth advantage before the benchmark starts.
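
One way to see how fixed per-token work compresses the gap is a two-term model: each token costs a part that scales with the hardware plus a part that does not. The sketch below fits that model to the two measured rates. The hardware ratio of 1.64 is an assumption (roughly the datasheet bandwidth ratio), and the model ignores the difference between the two checkpoints, so the split it produces is illustrative and not a measurement.

```python
def split_token_time(tok_s_slow: float, tok_s_fast: float, hw_ratio: float) -> tuple[float, float]:
    """Split the slower tier's per-token time into (scaled_ms, fixed_ms).

    Model: t_slow = scaled + fixed
           t_fast = scaled / hw_ratio + fixed
    Subtracting the two gives scaled = (t_slow - t_fast) / (1 - 1 / hw_ratio).
    """
    if hw_ratio <= 1:
        raise ValueError("hw_ratio must be greater than 1")
    t_slow_ms = 1000.0 / tok_s_slow
    t_fast_ms = 1000.0 / tok_s_fast
    scaled_ms = (t_slow_ms - t_fast_ms) / (1.0 - 1.0 / hw_ratio)
    fixed_ms = t_slow_ms - scaled_ms
    return scaled_ms, fixed_ms


ASSUMED_HW_RATIO = 1.64  # assumption: roughly the datasheet bandwidth ratio

# Measured rates from the 2026-04-19 run: A100 tier 106.6 tok/s, H100 tier 118.9 tok/s.
scaled_ms, fixed_ms = split_token_time(106.6, 118.9, ASSUMED_HW_RATIO)

print(f"hardware-scaled part of an A100 token: {scaled_ms:.2f} ms")
print(f"fixed part paid on both tiers:         {fixed_ms:.2f} ms")
```

The larger the fixed term relative to the scaled term, the less of the hardware ratio survives into served tokens per second.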

The price side

The hourly cost figures, labelled estimates and not invoice numbers, put the dual-H100 tier at $3.80 and the dual-A100 tier at $2.80: a price gap of roughly 36 percent for a 12 percent throughput edge. Per token of single-stream decode the A100 tier is the cheaper rung, which is why the lineup keeps both, and why the cost band on the pricing page is dominated by utilisation rather than by GPU generation.
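
The per-token comparison follows directly from the figures above. This short calculation uses only the estimated hourly prices and the measured single-stream rates; it assumes the tier is busy for the full hour with one stream, which is exactly the utilisation caveat the paragraph raises.

```python
def usd_per_million_tokens(usd_per_hour: float, tok_per_s: float) -> float:
    """Cost of one million decoded tokens at a steady single-stream rate."""
    tokens_per_hour = tok_per_s * 3600.0
    return usd_per_hour / tokens_per_hour * 1_000_000


h100_cost = usd_per_million_tokens(3.80, 118.9)  # estimated price, measured rate
a100_cost = usd_per_million_tokens(2.80, 106.6)  # estimated price, measured rate

price_gap = 3.80 / 2.80 - 1.0          # how much more the H100 tier costs per hour
throughput_edge = 118.9 / 106.6 - 1.0  # how much faster it decodes

print(f"H100 tier: ${h100_cost:.2f} per million tokens")
print(f"A100 tier: ${a100_cost:.2f} per million tokens")
print(f"price gap {price_gap:.0%}, throughput edge {throughput_edge:.0%}")
```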

What the gateway adds

These numbers include the proxy. The gateway is held to a budget of less than 5 ms added p95 latency over a direct engine call at the same token count; anything above that budget is treated as a defect to investigate, not as overhead to accept.
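
A budget like this can be checked mechanically. The sketch below is one possible reading: it compares the p95 of durations measured through the gateway with the p95 of direct engine calls at the same token count, using a nearest-rank percentile. The function names and the sample latencies are hypothetical, and the fleet's actual check may define the added latency differently.

```python
BUDGET_MS = 5.0  # budget stated on this page: less than 5 ms added p95


def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def gateway_overhead_ms(via_gateway_ms, direct_ms):
    """Added p95 latency of the gateway path over the direct engine path."""
    return p95(via_gateway_ms) - p95(direct_ms)


def within_budget(via_gateway_ms, direct_ms, budget_ms=BUDGET_MS):
    return gateway_overhead_ms(via_gateway_ms, direct_ms) < budget_ms


# Hypothetical samples: 20 direct calls, and the same calls with 3 ms of proxy time.
direct = [100.0 + i for i in range(20)]
via_gateway = [d + 3.0 for d in direct]

overhead = gateway_overhead_ms(via_gateway, direct)
print(f"added p95: {overhead:.1f} ms, within budget: {within_budget(via_gateway, direct)}")
```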

Not yet measured

There is no L40 throughput number yet, and no TTFT-by-quantisation curve. A dual-L40 tier exists in the lineup, but publishing a figure before the harness produces one would defeat the point of measuring. The table will carry the L40 row's number once that tier runs the same workload.

Methodology

Numbers come from the fleet's bench harness: it fires a canned set of short-context chat completions through the production gateway at one tier, tagging every request with a workload label. A scorecard over those requests reports average tokens per second, p50 and p95 duration, and average time to first token. The 2026-04-19 run compared the dual-H100 FP8 tier against the dual-A100 compressed-tensors tier on the same scenario set. Figures are short-context decode averages, not peak batched throughput.
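
The scorecard step can be sketched as a small aggregation over per-request records. Everything named here (`RequestRecord`, its fields, the `scorecard` function, the workload labels) is hypothetical, since the harness's real schema is not shown on this page. The sketch also assumes that "average tokens per second" means the mean of per-request rates; an aggregate of total tokens over total time would give a different number.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    workload: str            # label attached by the bench harness
    completion_tokens: int   # tokens generated for this request
    duration_s: float        # end-to-end duration through the gateway
    ttft_s: float            # time to first token


def nearest_rank(values, pct: int) -> float:
    """Nearest-rank percentile (pct in 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct/100 * n)
    return ordered[rank - 1]


def scorecard(records, workload: str) -> dict:
    """Summarise all requests that carry the given workload label."""
    rows = [r for r in records if r.workload == workload]
    if not rows:
        raise ValueError(f"no requests tagged {workload!r}")
    durations = [r.duration_s for r in rows]
    return {
        "requests": len(rows),
        "avg_tok_s": mean(r.completion_tokens / r.duration_s for r in rows),
        "p50_duration_s": nearest_rank(durations, 50),
        "p95_duration_s": nearest_rank(durations, 95),
        "avg_ttft_s": mean(r.ttft_s for r in rows),
    }


# Hypothetical records; only the first two carry the label being scored.
records = [
    RequestRecord("bench-short", 100, 1.0, 0.10),
    RequestRecord("bench-short", 120, 1.0, 0.20),
    RequestRecord("other", 500, 9.0, 0.90),
]
card = scorecard(records, "bench-short")
print(card)
```

Filtering on the label before aggregating keeps benchmark traffic separate from everything else that passes through the same gateway.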

Part of Performance and latency on the learn hub.

References
  • NVIDIA H100 Tensor Core GPU. Datasheet context for the Hopper generation: HBM bandwidth and the native FP8 support that the h100x2 tier's FP8 checkpoint exercises.
  • NVIDIA A100 Tensor Core GPU. Datasheet context for the Ampere generation; the a100x2 tier serves a compressed-tensors checkpoint because Ampere predates hardware FP8.
  • vLLM quantisation support. Which quantisation backends run on which compute capability; the constraint that shapes the per-tier checkpoint choices.
  • spotinference economics. The cost side of the same accounting; the throughput figures here and the dollar figures there come from the same per-request counts.
