Migrating from OpenAI to an OpenAI-compatible API

The mechanical migration is two fields: point the SDK's base_url at the new endpoint and swap the API key. For the core contract, the request shape, the streaming framing, and the tool-call envelope stay as they are. The real work is auditing everything that was silently tied to one vendor: model names, tokenizer-dependent token counts, sampling defaults, and the latency profile of the new provider.

The two-line change

The official OpenAI SDKs accept a base-URL override (base_url in Python, baseURL in Node), and the major frameworks (LangChain, LlamaIndex, the Vercel AI SDK) expose the same knob. Pointing the Python SDK at an OpenAI-compatible API looks like this:

from openai import OpenAI

client = OpenAI(
    base_url="https://spotinference.com/v1",
    api_key="YOUR_KEY",
)

resp = client.chat.completions.create(
    model="qwen3-coder",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)

The same call as plain HTTP, useful as a smoke test before touching application code:

curl https://spotinference.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_KEY" \
  -d '{"model": "qwen3-coder",
       "messages": [{"role": "user", "content": "Say hello."}]}'

Authentication is unchanged: the standard Authorization: Bearer header, in the same place the OpenAI SDK already puts it. Request and response bodies follow the Chat Completions shape that every compatible server shares.
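For the streamed smoke test, it helps to see what the framing looks like on the wire. The following is a minimal sketch, assuming the server emits one JSON chunk per data: line and ends the stream with the [DONE] sentinel, as OpenAI's streaming does; the helper names iter_sse_payloads and collect_text are hypothetical, not part of any SDK.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end of stream; anything after it is ignored
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the content deltas of the first choice into one string."""
    parts = []
    for chunk in iter_sse_payloads(lines):
        choices = chunk.get("choices") or []
        if choices:
            delta = choices[0].get("delta") or {}
            parts.append(delta.get("content") or "")
    return "".join(parts)
```

Feeding it the lines of a captured curl response is enough to confirm that the new endpoint frames its stream the way the application expects.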

What actually needs auditing

Model names

The model field is the one request parameter with no cross-vendor meaning. A name like gpt-4o means nothing to an endpoint that serves a different catalogue; the new endpoint serves its own models under its own names. Centralise the mapping in one configuration value rather than scattering model strings through the codebase, because the second migration (or the first A/B test between providers) reuses that seam.
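One way to build that seam is a small lookup keyed by provider and logical role. This is a sketch under assumptions: the role name "chat", the provider labels, and the LLM_PROVIDER environment variable are all hypothetical; only the model names gpt-4o and qwen3-coder come from this article.

```python
import os
from typing import Optional

# Hypothetical mapping: logical role -> model name, per provider.
MODEL_MAP = {
    "openai": {"chat": "gpt-4o"},
    "compatible": {"chat": "qwen3-coder"},
}


def resolve_model(role: str, provider: Optional[str] = None) -> str:
    """Return the provider-specific model name for a logical role."""
    # LLM_PROVIDER is an illustrative variable name, not a standard one.
    provider = provider or os.environ.get("LLM_PROVIDER", "openai")
    try:
        return MODEL_MAP[provider][role]
    except KeyError as exc:
        raise KeyError(
            f"no model mapped for role={role!r} on provider={provider!r}"
        ) from exc
```

Application code then asks for resolve_model("chat") and never mentions a vendor's model string directly, so an A/B test is a configuration change rather than a code change.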

Token counts and tokenizers

Token counts come from the served model's tokenizer, not from the client. Code that estimates budgets with tiktoken is counting in the wrong currency after the switch: a prompt that measured 900 tokens under one tokenizer can be 700 or 1,100 under another. Trust the usage object in each response (prompt_tokens, completion_tokens) for accounting, and re-derive any context-window guardrails against the new model's limits instead of porting hard-coded thresholds.
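A minimal sketch of server-side accounting, assuming the usage object has been parsed into a dict with the prompt_tokens and completion_tokens fields named above (the SDK exposes the same fields as attributes). UsageLedger, fits_context, and the context_limit value are hypothetical; the real limit has to come from the new provider's model documentation.

```python
from dataclasses import dataclass


@dataclass
class UsageLedger:
    """Running totals built only from server-reported usage."""

    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, usage: dict) -> None:
        """Add one response's usage object to the totals."""
        self.prompt_tokens += usage["prompt_tokens"]
        self.completion_tokens += usage["completion_tokens"]

    @property
    def total(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def fits_context(prompt_tokens: int, max_completion_tokens: int,
                 context_limit: int) -> bool:
    """True when the prompt plus the completion budget fits in the window."""
    return prompt_tokens + max_completion_tokens <= context_limit
```

With an illustrative 1,024-token window and a 200-token completion budget, the same prompt passes the guardrail at 700 tokens and fails it at 900, which is why thresholds need re-deriving rather than porting.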

Sampling defaults

Defaults for temperature, top_p, and the penalty fields are server-side and vary across serving stacks and model families. A workload tuned against one vendor's implicit defaults can shift in tone or determinism after migration with no code change at all. Pin every sampling parameter explicitly in the request; explicit values travel, defaults do not.
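Pinning can be enforced at one choke point instead of at every call site. The sketch below assumes request bodies are built as plain dicts; the specific values in PINNED_SAMPLING are illustrative placeholders, not recommendations, and should be replaced with whatever the workload was actually tuned against.

```python
# Illustrative values only; pick the ones the workload was tuned with.
PINNED_SAMPLING = {
    "temperature": 0.2,
    "top_p": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
}


def with_pinned_sampling(request: dict) -> dict:
    """Return a copy of the request with every sampling field set explicitly.

    Values already present in the request win over the pinned defaults.
    """
    return {**PINNED_SAMPLING, **request}
```

Because the function returns a new dict, the caller's request is left unmodified and the pinned values are visible in logs next to everything else that was sent.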

Feature edges

The core contract (messages in, choices out, SSE streaming, tool calls, structured output) is broadly portable. The edges are not: logprobs, seed, n > 1, multimodal content parts, and vendor-specific extensions need a support check before the cutover. Compatible servers differ in how they reject what they do not implement: some ignore unknown fields, others return 400. Send one probe request per feature in use and read the actual behavior.
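The probe results are easier to record if each one is reduced to a label. This is a sketch, assuming the probe response is available as a status code plus parsed JSON body; classify_probe, the PROBES table, and the label strings are hypothetical, and treating 422 as a rejection alongside 400 is an assumption about some serving stacks.

```python
from typing import Callable


def classify_probe(status: int, body: dict,
                   honoured: Callable[[dict], bool]) -> str:
    """Reduce one probe response to a label worth writing down."""
    if status in (400, 422):  # 422 is an assumption; some stacks use it
        return "rejected"
    if status != 200:
        return "inconclusive"  # auth, rate limit, or server error: retry probe
    return "supported" if honoured(body) else "silently_ignored"


# Extra request fields to send, and a check that the response honoured them.
PROBES = {
    "logprobs": (
        {"logprobs": True},
        lambda b: b["choices"][0].get("logprobs") is not None,
    ),
    "n>1": (
        {"n": 2},
        lambda b: len(b["choices"]) == 2,
    ),
}
```

The silently_ignored case is the dangerous one: the request succeeds, so nothing fails loudly, and only a check on the response body reveals that the feature was dropped.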

Tool calls and structured output

The tool-call envelope is standard; the quality of emission is model-specific. A prompt-and-schema combination that produced clean invocations from one model family needs a re-test against the new one, especially under streaming where arguments arrive as fragments. The same applies to structured output: confirm which enforcement mechanism the server provides and whether it guarantees schema-valid output or merely encourages it. The practical patterns are in the structured output and tool calling guide.
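To re-test streamed tool calls, the fragments have to be reassembled before the arguments can be validated. The following is a minimal sketch, assuming chunks have been parsed into dicts in the Chat Completions streaming shape (delta.tool_calls entries keyed by index, with arguments arriving as string pieces); assemble_tool_calls is a hypothetical helper, and it assumes the function name arrives whole rather than in pieces.

```python
import json
from typing import Iterable, List


def assemble_tool_calls(chunks: Iterable[dict]) -> List[dict]:
    """Merge streamed tool-call fragments and parse the final arguments."""
    calls = {}
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta") or {}
        for frag in delta.get("tool_calls") or []:
            slot = calls.setdefault(
                frag["index"], {"id": None, "name": None, "arguments": ""}
            )
            if frag.get("id"):
                slot["id"] = frag["id"]
            fn = frag.get("function") or {}
            if fn.get("name"):
                slot["name"] = fn["name"]  # assumed to arrive in one piece
            slot["arguments"] += fn.get("arguments") or ""

    assembled = []
    for index in sorted(calls):
        slot = calls[index]
        assembled.append({
            "id": slot["id"],
            "name": slot["name"],
            # json.loads raises ValueError if the model emitted invalid JSON,
            # which is exactly the failure a migration re-test should count.
            "arguments": json.loads(slot["arguments"] or "{}"),
        })
    return assembled
```

Counting how often the final json.loads raises, per model, gives a concrete emission-quality number to compare across providers.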

Error handling and rate limits

Most compatible servers keep the OpenAI error body shape (an error object with message, type, code), but status-code semantics and rate-limit headers differ. Read the new provider's published error contract and wire the retry policy to it; the retries and timeouts guide covers a classification that ports across providers.
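As a starting point before the provider's contract has been read, a generic retry policy might look like the sketch below. It is not the classification from the retries and timeouts guide: the RETRYABLE set, the backoff constants, and the function names are assumptions, and only the seconds form of the standard Retry-After header is handled.

```python
import random
from typing import Callable, Optional

# Assumed set of statuses worth retrying; adjust to the provider's contract.
RETRYABLE = {408, 429, 500, 502, 503, 504}


def should_retry(status: int) -> bool:
    """True for statuses that usually indicate a transient condition."""
    return status in RETRYABLE


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # server-provided delay wins
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    # Exponential backoff with full jitter, capped.
    return min(cap, base * 2 ** attempt) * rng()
```

Client errors such as 400, 401, and 404 fall outside the set on purpose: resending the same request cannot fix them, and retrying only delays the real error.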

The first request after idle

Hosted APIs running on always-warm fleets never show a cold path. Providers that scale GPU capacity to zero do: the first request to an idle tier can trigger a hardware wake measured in minutes, not milliseconds. On this fleet a wake is budgeted at 8 minutes and hard-capped at 10 minutes, after which the request fails honestly with 504 rather than hanging forever; the budget and its enforcement are documented at Reliability: wake budgets. Whichever provider the migration ends on, find this number before the first production timeout finds it for you.
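Client timeouts can be sized from that number instead of guessed. The sketch below takes the 10-minute hard cap from the figures above; the warm timeout, the idle threshold, and the safety margin are assumptions chosen for illustration, and request_timeout is a hypothetical helper.

```python
from typing import Optional

WAKE_HARD_CAP_S = 10 * 60   # from the article: server answers 504 after this
MARGIN_S = 30.0             # assumption: slack so the 504 arrives, not a client timeout
WARM_TIMEOUT_S = 60.0       # assumption: budget for a request to a warm tier
IDLE_AFTER_S = 15 * 60      # assumption: idle time after which a wake is possible


def request_timeout(seconds_since_last_success: Optional[float]) -> float:
    """Pick a timeout based on whether the tier may have scaled to zero."""
    if (seconds_since_last_success is None
            or seconds_since_last_success >= IDLE_AFTER_S):
        # Possibly cold: wait past the hard cap so the server's own 504 is seen.
        return WAKE_HARD_CAP_S + MARGIN_S
    return WARM_TIMEOUT_S
```

The returned value can be supplied wherever the HTTP client takes a timeout; the OpenAI Python SDK, for example, accepts a timeout argument on the client constructor. Waiting slightly longer than the server's cap means a failed wake surfaces as the provider's 504 rather than as an ambiguous client-side timeout.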

A cutover checklist

Smoke-test with curl: one completion, one streamed completion, one tool call against the new endpoint.
Swap base_url and key in a staging configuration; leave model-name mapping behind a single config value.
Pin all sampling parameters explicitly.
Replace client-side token estimates with server-reported usage.
Probe each non-core feature in use (logprobs, seed, multimodal) and record the result.
Re-run the application's golden prompts and compare outputs, token counts, and tool-call rates side by side.
Size timeouts to the new latency profile, including the cold path.

Verifying parity with a golden set

A migration is verified with data, not vibes: a fixed set of 20 to 50 representative prompts, run against both providers, with outputs, token counts, latencies, and tool-call success rates recorded. Throughput belongs in that comparison too. As a calibration point, this fleet's published short-context decode measurements are 118.9 tokens per second on a dual-H100 tier and 106.6 on a dual-A100 tier; the measured-throughput article describes the harness behind those numbers, and the same canned traffic pattern works as a migration check: same prompts, both endpoints, compare the tail.
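The comparison itself can be a few lines once each run is recorded. This is a minimal sketch, assuming every golden-set run is stored as a dict with the hypothetical keys latency_s, completion_tokens, tool_expected, and tool_ok; the percentile uses the nearest-rank method, which is one of several reasonable definitions.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("percentile of an empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs: List[dict]) -> dict:
    """Collapse one provider's golden-set runs into comparable numbers."""
    latencies = [r["latency_s"] for r in runs]
    tool_runs = [r for r in runs if r["tool_expected"]]
    tool_ok = sum(1 for r in tool_runs if r["tool_ok"])
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "completion_tokens": sum(r["completion_tokens"] for r in runs),
        "tool_call_success": tool_ok / len(tool_runs) if tool_runs else None,
    }


def compare(baseline: List[dict], candidate: List[dict]) -> dict:
    """Pair each metric as (baseline, candidate) for side-by-side reading."""
    a, b = summarize(baseline), summarize(candidate)
    return {key: (a[key], b[key]) for key in a}
```

With only 20 to 50 prompts the p95 is effectively one of the slowest few runs, so it is worth reading the raw latencies next to the summary rather than trusting the single number.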