Migrating from OpenAI to an OpenAI-compatible API
The mechanical migration touches two fields: point the SDK's
base_url at the new endpoint and swap the API key. The request
shape, the streaming framing, and the tool-call envelope stay exactly as
they are. The real work is auditing everything that was silently tied to
one vendor: model names, tokenizer-dependent token counts, sampling
defaults, and the latency profile of the new provider.
The two-line change
Every official OpenAI SDK accepts a base_url override, and
the major frameworks (LangChain, LlamaIndex, the Vercel AI SDK) expose the
same knob. Pointing the Python SDK at an OpenAI-compatible API looks
like this:
from openai import OpenAI

# Only base_url and api_key change; the rest of the client code is untouched.
client = OpenAI(
    base_url="https://spotinference.com/v1",
    api_key="YOUR_KEY",
)

resp = client.chat.completions.create(
    model="qwen3-coder",  # a model name from the new provider's catalogue
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
The same call as plain HTTP, useful as a smoke test before touching application code:
curl https://spotinference.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_KEY" \
  -d '{"model": "qwen3-coder",
       "messages": [{"role": "user", "content": "Say hello."}]}'
Authentication is unchanged: the standard
Authorization: Bearer header, in the same place the OpenAI SDK
already puts it. Request and response bodies follow the
Chat Completions shape that every
compatible server shares.
What actually needs auditing
Model names
The model field is the one request parameter with no
cross-vendor meaning. gpt-4o does not exist on any other
provider; the new endpoint serves its own catalogue under its own names.
Centralise the mapping in one configuration value rather than scattering
model strings through the codebase, because the second migration (or the
first A/B test between providers) reuses that seam.
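One way to build that seam, as a minimal sketch: a single table keyed by provider, with the active provider chosen by an environment variable. The LLM_PROVIDER variable, the "chat" role, and the provider keys are hypothetical names for illustration; only the model strings come from the examples above.

```python
import os

# Hypothetical mapping: application role -> model name, per provider.
# Only this table changes when a provider is added or swapped.
MODEL_MAP = {
    "openai": {"chat": "gpt-4o"},
    "spotinference": {"chat": "qwen3-coder"},
}


def resolve_model(role, env=None):
    """Return the model name for an application role on the active provider."""
    env = os.environ if env is None else env
    # LLM_PROVIDER is an assumed variable name, not a standard one.
    provider = env.get("LLM_PROVIDER", "spotinference")
    try:
        return MODEL_MAP[provider][role]
    except KeyError:
        raise ValueError(f"no model configured for {provider!r}/{role!r}") from None
```

Application code then asks for resolve_model("chat") and never mentions a vendor's model string directly, so an A/B test between providers is a configuration change.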
Token counts and tokenizers
Token counts come from the served model's tokenizer, not from the
client. Code that estimates budgets with tiktoken is counting
in the wrong currency after the switch: a prompt that measured 900 tokens
under one tokenizer can be 700 or 1,100 under another. Trust the
usage object in each response (prompt_tokens,
completion_tokens) for accounting, and re-derive any
context-window guardrails against the new model's limits instead of
porting hard-coded thresholds.
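A small sketch of what server-side accounting can look like. The UsageLedger class and fits_context helper are hypothetical names; the only fields taken from the API are prompt_tokens and completion_tokens, and the context window is deliberately an input that has to be looked up for the new model.

```python
from dataclasses import dataclass


@dataclass
class UsageLedger:
    """Running totals built only from server-reported usage."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, prompt_tokens, completion_tokens):
        # Values come from the usage object of each response.
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    @property
    def total_tokens(self):
        return self.prompt_tokens + self.completion_tokens


def fits_context(prompt_tokens, max_completion_tokens, context_window):
    """True if the prompt plus the requested completion budget fits the window.

    context_window is a parameter rather than a constant so that no
    threshold from the old provider is carried over by accident.
    """
    return prompt_tokens + max_completion_tokens <= context_window
```

With the Python SDK, each call would be followed by something like ledger.record(resp.usage.prompt_tokens, resp.usage.completion_tokens).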
Sampling defaults
Defaults for temperature, top_p, and the
penalty fields are server-side and vary across serving stacks and model
families. A workload tuned against one vendor's implicit defaults can shift
in tone or determinism after migration with no code change at all. Pin
every sampling parameter explicitly in the request; explicit values travel,
defaults do not.
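Pinning can be as simple as one dictionary merged into every request. This is a sketch: the numeric values are placeholders, not recommendations, and build_request is a hypothetical helper. The parameter names themselves are the standard Chat Completions fields.

```python
# Placeholder values; choose them for the workload, then keep them explicit.
PINNED_SAMPLING = {
    "temperature": 0.2,
    "top_p": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
}


def build_request(model, messages, **overrides):
    """Build request kwargs with every sampling parameter set explicitly.

    Per-call overrides win over the pinned values, but nothing is ever
    left to a server-side default.
    """
    params = {**PINNED_SAMPLING, **overrides}
    return {"model": model, "messages": messages, **params}
```

The result can be passed straight through, for example client.chat.completions.create(**build_request("qwen3-coder", messages)).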
Feature edges
The core contract (messages in, choices out, SSE streaming, tool calls,
structured output) is broadly portable. The edges are not:
logprobs, seed, n > 1,
multimodal content parts, and vendor-specific extensions need a support
check before the cutover. Compatible servers differ in how they reject
what they do not implement: some ignore unknown fields, others return 400.
Send one probe request per feature in use and read the actual behavior.
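The probe results are easier to compare when they are classified the same way for every feature. The sketch below assumes the probe response has already been fetched and parsed; classify_probe and its three labels are hypothetical, and the "evidence path" is whatever part of the response proves the feature took effect.

```python
def classify_probe(status, body, evidence_path):
    """Classify one feature probe from its HTTP status and parsed JSON body.

    evidence_path is a sequence of dict keys and list indexes that only
    resolves to a non-null value if the server honored the feature.
    """
    if 400 <= status < 500:
        return "rejected"  # the server refused the field outright
    if status != 200:
        return "error"  # not a verdict on the feature; probe again
    node = body
    for key in evidence_path:
        if isinstance(node, dict):
            node = node.get(key)
        elif isinstance(node, list) and isinstance(key, int) and key < len(node):
            node = node[key]
        else:
            node = None
        if node is None:
            return "ignored"  # 200 OK, but the feature left no trace
    return "supported"
```

For logprobs the evidence path would be ("choices", 0, "logprobs"); for n > 1 with n=2 it would be ("choices", 1). A feature such as seed leaves no single field to inspect and needs a repeat-and-compare probe instead.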
Tool calls and structured output
The tool-call envelope is standard; the quality of emission is model-specific. A prompt-and-schema combination that produced clean invocations from one model family needs a re-test against the new one, especially under streaming where arguments arrive as fragments. The same applies to structured output: confirm which enforcement mechanism the server provides and whether it guarantees schema-valid output or merely encourages it. The practical patterns are in the structured output and tool calling guide.
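For the streaming re-test, it helps to reassemble the argument fragments and check that they parse. The sketch below works on the delta objects of streamed chunks as plain dictionaries, following the usual Chat Completions streaming shape (tool_calls entries carrying index, id, and function.name / function.arguments). assemble_tool_calls is a hypothetical helper, and it assumes a missing index means a single call.

```python
import json


def assemble_tool_calls(deltas):
    """Merge streamed tool-call fragments and check each result parses as JSON."""
    calls = {}
    for delta in deltas:
        for frag in delta.get("tool_calls") or []:
            # Assumption: servers that omit index are emitting a single call.
            slot = calls.setdefault(
                frag.get("index", 0),
                {"id": None, "name": None, "arguments": ""},
            )
            if frag.get("id"):
                slot["id"] = frag["id"]
            fn = frag.get("function") or {}
            if fn.get("name"):
                slot["name"] = fn["name"]
            # Arguments arrive as string fragments, to be concatenated in order.
            slot["arguments"] += fn.get("arguments") or ""

    result = []
    for index in sorted(calls):
        slot = calls[index]
        try:
            slot["parsed"] = json.loads(slot["arguments"])
            slot["valid_json"] = True
        except json.JSONDecodeError:
            slot["parsed"] = None
            slot["valid_json"] = False
        result.append(slot)
    return result
```

Counting how often valid_json is false across the golden prompts gives a concrete tool-call success rate to compare between the old and new model.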
Error handling and rate limits
Most compatible servers keep the OpenAI error body shape (an
error object with message, type,
code), but status-code semantics and rate-limit headers
differ. Read the new provider's published error contract and wire the
retry policy to it; the
retries and timeouts
guide covers a classification that ports across providers.
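As an illustration of wiring a retry policy to a provider's contract, here is one common classification; it is an assumption for the sketch, not necessarily the one in the linked guide or in any provider's documentation. The retryable status set, the delay values, and the retry_decision helper are all hypothetical, and only a numeric Retry-After header is handled.

```python
# Assumed retryable statuses; replace with the provider's published contract.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


def retry_decision(status, headers, attempt, max_attempts=4,
                   base_delay=1.0, max_delay=30.0):
    """Return seconds to wait before retrying, or None to give up.

    attempt is 1-based: the attempt that just failed. headers is expected
    to use lower-case keys.
    """
    if status not in RETRYABLE_STATUSES or attempt >= max_attempts:
        return None
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        try:
            # Honor the server's hint, bounded by our own ceiling.
            return min(float(retry_after), max_delay)
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    # Exponential backoff: base_delay, then 2x, 4x, ... up to max_delay.
    return min(base_delay * 2 ** (attempt - 1), max_delay)
```

Keeping the status set and delays as data makes the per-provider differences a configuration change instead of a code change.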
The first request after idle
Hosted APIs running on always-warm fleets never show a cold path.
Providers that scale GPU capacity to zero do: the first request to an idle
tier can trigger a hardware wake measured in minutes, not milliseconds. On
this fleet a wake is budgeted at 8 minutes and hard-capped at 10 minutes,
after which the request fails honestly with 504 rather than
hanging forever; the budget and its enforcement are documented at
Reliability: wake budgets. Whichever
provider the migration ends on, find this number before the first
production timeout finds it for you.
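A sketch of sizing a client timeout from that number. The 10-minute hard cap is the figure quoted above; the warm-path latency and the safety margin are assumed inputs that have to be measured for the workload, and request_timeout_s is a hypothetical helper.

```python
# Hard cap on a hardware wake, from the figures quoted above (10 minutes).
WAKE_HARD_CAP_S = 10 * 60


def request_timeout_s(warm_p99_s, wake_cap_s=WAKE_HARD_CAP_S, margin_s=30.0):
    """Client timeout that outlasts a worst-case wake plus a normal request.

    warm_p99_s is the measured tail latency of a warm request and
    margin_s is an arbitrary safety margin; both are assumptions.
    """
    return wake_cap_s + warm_p99_s + margin_s
```

The value can be handed to the SDK's timeout setting, for example OpenAI(base_url=..., api_key=..., timeout=request_timeout_s(20.0)). A client timeout shorter than the wake cap turns every cold start into a failure that the server would otherwise have served.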
A cutover checklist
- Smoke-test with curl: one completion, one streamed completion, one tool call against the new endpoint.
- Swap base_url and key in a staging configuration; leave model-name mapping behind a single config value.
- Pin all sampling parameters explicitly.
- Replace client-side token estimates with server-reported usage.
- Probe each non-core feature in use (logprobs, seed, multimodal) and record the result.
- Re-run the application's golden prompts and compare outputs, token counts, and tool-call rates side by side.
- Size timeouts to the new latency profile, including the cold path.
Verifying parity with a golden set
A migration is verified with data, not vibes: a fixed set of 20 to 50 representative prompts, run against both providers, with outputs, token counts, latencies, and tool-call success rates recorded. Throughput belongs in that comparison too. As a calibration point, this fleet's published short-context decode measurements are 118.9 tokens per second on a dual-H100 tier and 106.6 on a dual-A100 tier; the measured-throughput article describes the harness behind those numbers, and the same canned traffic pattern works as a migration check: same prompts, both endpoints, compare the tail.
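The comparison itself can be a few lines once each run is recorded. In this sketch the record fields (latency_s, completion_tokens, expected_tool_call, tool_call_ok) are hypothetical names for whatever the harness logs, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(runs):
    """Summarize one provider's golden-set runs for side-by-side comparison."""
    if not runs:
        raise ValueError("no runs recorded")
    latencies = [r["latency_s"] for r in runs]
    # Only prompts that should produce a tool call count toward the rate.
    tool_runs = [r for r in runs if r["expected_tool_call"]]
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "mean_completion_tokens": sum(r["completion_tokens"] for r in runs) / len(runs),
        "tool_call_success_rate": (
            sum(1 for r in tool_runs if r["tool_call_ok"]) / len(tool_runs)
            if tool_runs else None
        ),
    }
```

Running summarize over the records from each endpoint and printing the two dictionaries next to each other puts the tail latency, token counts, and tool-call rate in one view. With 20 to 50 prompts the p95 rests on only a handful of samples, so treat small differences with caution.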