OpenAI-compatible gateway vs self-hosting vLLM

Self-hosting vLLM wins when utilisation is high and steady and the team has engineering time to own a serving stack. A gateway wins when traffic is bursty, the team is small, or the model menu is standard open weights. The honest comparison is labor plus idle GPU hours against a per-token price, and the deciding variable is utilisation.

The criteria, stated up front

Four axes decide this, and they should be weighed in this order: realised cost per million tokens at the workload's actual utilisation (not the datasheet's); operational labor, both setup and the permanent tax; the latency profile including cold paths; and control, meaning custom models, fine-tunes, and data-path requirements. Vendors on either side of the build-or-buy line tend to compare only the axis they win; this page tries to price all four.

What self-hosting actually involves

vLLM itself installs in minutes; the serving stack around it is the real project. The recurring work, drawn from the engine's own documentation: sourcing and holding GPU capacity (and re-bidding it if spot instances are in play); choosing per-model quantisation and parallelism settings; tracking an engine release cadence that ships multiple versions per quarter, against a CUDA-driver compatibility matrix; storing and versioning tens of gigabytes of weights per model; and wiring health checks, capacity alarms, and a wake-or-keep-warm policy. None of it is exotic, all of it is permanent. The engine's flag surface also has real teeth: settings exist that silently degrade output quality or multiply decode latency, which is why engine-configuration knowledge has to live somewhere in the team. The vLLM page covers what the engine does well, which is exactly why it is the default choice on both sides of this comparison.

What a gateway abstracts

A gateway sells the same engine behind a stable contract: an endpoint, a key, the OpenAI-compatible surface, and a published error contract for the cold path. Capacity management, engine upgrades, wake-on-demand, and hardware fallbacks happen behind the curtain; the integration work shrinks to the two-field change in the migration guide. The buyer gives up flag-level control and custom-model freedom, and gains a fixed per-token price for which someone else holds the pager.

The utilisation math

A GPU bills by the hour whether it decodes or idles, so realised cost per token is the rented rate divided by tokens actually produced. Two consequences follow. First, per-request decode speed is not fleet throughput: a single stream decodes at this fleet's measured 118.9 tokens per second on dual H100s, but continuous batching multiplies aggregate tokens per GPU-hour well beyond one stream's worth, so cost models built on single-stream arithmetic overstate cost badly. Second, idle hours dominate bursty workloads: a self-hosted GPU serving traffic ten percent of the time costs ten times its busy-hour rate per token served. This fleet's published cost of goods, $0.30 to $0.70 per million output tokens on an H100-class tier, is stated explicitly at good utilisation, and the invoice-true accounting article shows the measurement. A self-hosting estimate deserves the same honesty: compute the expected idle fraction first, because that fraction is the price of admission.

Latency, including the cold paths

An always-on self-hosted GPU has no wake path and wins every cold-path comparison; it pays for that with every idle hour. The moment self-hosting adopts scale-to-zero to control idle cost, it inherits the same cold-start physics a gateway manages: minutes of weight loading and engine start, the anatomy traced in the wake article. The difference is who engineers the bound. This fleet budgets a wake at 8 minutes, enforces a 10-minute hard cap, and fails honestly with 504 past it, per Reliability: scale to zero; a self-hosted stack needs the same persistent-weights and cache design to hit comparable numbers, or accepts cold starts measured in tens of minutes when weights re-download.

When self-hosting wins

Sustained high utilisation. A fleet that is busy around the clock amortises both the hardware and the ops labor; per-token cost approaches the physical floor.
Custom models and fine-tunes. Serving weights no one else hosts is the clearest forcing function.
Data-path requirements. Residency, audit, or no-third-party rules that a hosted endpoint cannot satisfy.
An existing platform team. Where GPU operations already exist, the marginal labor cost of one more service drops sharply.

When a gateway wins

Bursty or unproven traffic. Paying per token converts the idle-hour problem into the provider's problem.
Small teams. The permanent ops tax of a serving stack is the same size for a two-person team as for a fifty-person one; only one of them can afford it.
Standard open models. When the menu is shared, hosting it privately duplicates work without differentiating the product.
Cost predictability. A per-token price with published cost of goods is easier to budget than a GPU bill plus an engineer's calendar.

The crossover, honestly

Run the comparison with the workload's own numbers: expected tokens per month, expected idle fraction, the local cost of an engineer-hour, and a provider's per-token price. Self-hosting crosses over when utilisation is high enough that GPU rent per token undercuts the per-token price by more than the ops labor costs, and not before. Most workloads discover they are burstier than they believed; measuring the gap between peak and median demand before committing to hardware is the cheapest experiment in this whole decision.