H100 vs A100: measured decode throughput on a priority-lineup gateway

Two GPU tiers, one benchmark, no extrapolation: on 2026-04-19 the dual-H100 tier decoded 118.9 tokens per second and the dual-A100 tier 106.6 on the same short-context workload. Both figures were measured end to end through the production gateway rather than against a bare engine port.

The result

Tier	Checkpoint	Short-context decode
Dual H100	FP8	118.9 tok/s
Dual A100	compressed-tensors	106.6 tok/s
Dual L40	4-bit compressed-tensors	not yet measured

Both measured tiers serve the same model behind the same OpenAI-compatible API, split across two GPUs with tensor parallelism. The H100 advantage on this workload lands at roughly 12 percent (118.9 / 106.6 ≈ 1.12). Datasheet ratios between the two parts, in both memory bandwidth and peak compute, are far larger, and that difference is the interesting part of the result.
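The percentage can be reproduced from the two table rows. A minimal sketch; the function and constant names are illustrative and not part of any harness described here.

```python
# Measured short-context decode throughput from the table above (tok/s).
H100_TOK_S = 118.9
A100_TOK_S = 106.6


def relative_advantage(fast: float, slow: float) -> float:
    """Fractional throughput advantage of `fast` over `slow`."""
    return fast / slow - 1.0


advantage = relative_advantage(H100_TOK_S, A100_TOK_S)
print(f"H100 advantage: {advantage:.1%}")
```

The unrounded figure is about 11.5 percent, which the text rounds to 12.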

What drives the gap, and what shrinks it

Short-context decode streams the model weights from GPU memory once per generated token, so memory bandwidth sets the ceiling. The H100 tier serves an FP8 checkpoint on its native FP8 tensor cores; the A100 predates hardware FP8 and serves a compressed-tensors checkpoint instead. The narrative version of that trade-off lives at Engines: quantisation.

The measured gap is narrower than the hardware gap because a served token pays for more than matrix multiplies. Scheduling, sampling, detokenisation, and the HTTP hop ride along on every token, and none of them gets faster on Hopper. Quantisation also compresses the difference: both checkpoints shrink the weights that move per token, so the bandwidth-bound share of each token's time is smaller on both tiers and the fixed costs weigh more. Part of the H100's bandwidth advantage is spent before the benchmark starts.
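The compression effect can be illustrated with a toy model in which each token costs one pass over the weights plus a fixed overhead that does not depend on the GPU. This is a sketch under stated assumptions, not the benchmark's own model: every constant below is a hypothetical placeholder, not a measurement or a datasheet figure.

```python
def decode_tok_s(weight_bytes: float, bandwidth_bytes_s: float, overhead_s: float) -> float:
    """Tokens/s when each token streams the weights once and pays a fixed overhead."""
    seconds_per_token = weight_bytes / bandwidth_bytes_s + overhead_s
    return 1.0 / seconds_per_token


# Hypothetical placeholders, chosen only to make the arithmetic easy to follow.
FAST_BW = 3e12  # bytes/s, stands in for the faster part
SLOW_BW = 2e12  # bytes/s, stands in for the slower part


def speedup(weight_bytes: float, overhead_s: float) -> float:
    """Throughput ratio fast/slow for a given weight size and fixed per-token overhead."""
    fast = decode_tok_s(weight_bytes, FAST_BW, overhead_s)
    slow = decode_tok_s(weight_bytes, SLOW_BW, overhead_s)
    return fast / slow


for overhead_ms in (0.0, 2.0, 5.0):
    ratio = speedup(10e9, overhead_ms / 1000)
    print(f"overhead {overhead_ms:.0f} ms -> speedup {ratio:.2f}x")
```

With zero overhead the ratio equals the bandwidth ratio (1.5x in this toy setup). Adding fixed overhead pulls it toward 1, and so does shrinking `weight_bytes`, which is the role quantisation plays in the paragraph above.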

The price side

The hourly cost figures, labelled estimates and not invoice numbers, put the dual-H100 tier at $3.80 and the dual-A100 tier at $2.80: a price gap of roughly 36 percent for a 12 percent throughput edge. Per token of single-stream decode the A100 tier is the cheaper rung, which is why the lineup keeps both, and why the cost band on the pricing page is dominated by utilisation rather than by GPU generation.
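The per-token comparison follows from the hourly estimates and the measured throughput. A minimal sketch: the `utilisation` parameter is an assumption added for illustration, and the default of 1.0 models a single stream decoding continuously for the whole hour.

```python
def usd_per_million_tokens(usd_per_hour: float, tok_per_s: float, utilisation: float = 1.0) -> float:
    """Cost of one million decoded tokens at a given hourly price and throughput."""
    tokens_per_hour = tok_per_s * 3600 * utilisation
    return usd_per_hour / tokens_per_hour * 1_000_000


# Hourly estimates and measured throughput quoted in this article.
h100_cost = usd_per_million_tokens(3.80, 118.9)
a100_cost = usd_per_million_tokens(2.80, 106.6)

print(f"dual H100: ${h100_cost:.2f} per million tokens")
print(f"dual A100: ${a100_cost:.2f} per million tokens")
```

At full utilisation this works out to about $8.88 per million tokens on the H100 tier and about $7.30 on the A100 tier. Halving `utilisation` doubles either figure, a larger swing than the difference between the two tiers.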

What the gateway adds

These numbers include the proxy. The gateway is held to a budget of less than 5 ms added p95 latency over a direct engine call at the same token count; anything above that budget is treated as a defect to investigate, not as overhead to accept.
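A budget check of this kind could be sketched as below. The nearest-rank percentile, the definition of "added" as the difference between the two p95 values, and all function names are assumptions for illustration; the production harness may define them differently.

```python
BUDGET_MS = 5.0  # budget stated above: less than 5 ms added p95 latency


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def added_p95_ms(gateway_ms: list[float], direct_ms: list[float]) -> float:
    """p95 through the gateway minus p95 against the engine, same token count."""
    return p95(gateway_ms) - p95(direct_ms)


def within_budget(gateway_ms: list[float], direct_ms: list[float]) -> bool:
    """True when the added p95 latency stays under the budget."""
    return added_p95_ms(gateway_ms, direct_ms) < BUDGET_MS


# Hypothetical latency samples in milliseconds, not measurements.
direct = [float(x) for x in range(100, 120)]
gateway = [x + 3.0 for x in direct]
print(added_p95_ms(gateway, direct), within_budget(gateway, direct))
```

A failing result is the signal the paragraph describes: a defect to investigate rather than overhead to accept.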

Not yet measured

There is no L40 throughput number yet, and no TTFT-by-quantisation curve. A dual-L40 tier exists in the lineup, but publishing a figure before the harness produces one would defeat the point of measuring. The table will carry the L40 row's number once that tier runs the same workload.