Retries, timeouts, and backoff for LLM API calls
An LLM API call is a long-lived, expensive POST whose normal tail latency would page an on-call engineer anywhere else in the stack. Generic REST retry policy mishandles it in both directions: flat 30-second timeouts kill healthy requests, and naive retry loops double the bill. The fix is three separate timeouts, a failure taxonomy decided before the first retry, and exponential backoff with jitter.
Why generic retry policy fails here
Three properties separate inference calls from ordinary API traffic. Duration: a 2,000-token completion takes tens of seconds even on fast hardware, so any flat timeout tuned for microservices fires on healthy requests. Cost: a retried completion bills its tokens twice, so retry count is a budget decision, not a resilience freebie. Recovery shape: when a GPU backend comes back from a fault it warms up, and a thundering herd of synchronized retries arrives exactly when the backend is least able to absorb it. Chat completions are stateless, which is the one mercy: a failed request can always be reissued from scratch without server-side cleanup.
Classify the failure before retrying
Each failure class gets its own policy. A useful classification, using this gateway's published error contract as the worked example:
- Connection errors, DNS failures, TLS resets. Retry with backoff. The request never reached the server, so no tokens were billed and no duplicate work exists.
- 429 rate limited. Retry after waiting. Honor a Retry-After header when the server sends one; back off exponentially when it does not.
- 502 upstream_unreachable. The gateway accepted the request but the GPU backend did not answer. Transient by construction: retry with backoff.
- 503 no_upstream. No tier is configured to serve the requested model. This is configuration, not weather; retrying in a tight loop converts one mistake into a log flood. Alert instead.
- 504 wake_timeout. The request arrived while the fleet was waking hardware and the wake exceeded the cap. The machine is likely warm by the next attempt, so a single delayed retry is the right move. The wake itself is budgeted at 8 minutes with a 10-minute hard cap, documented at Reliability: wake budgets.
- 400, 401, 403, 422. Never retry. The request is malformed, unauthenticated, or oversized, and will fail identically every time.
- Mid-stream disconnect. A severed SSE stream cannot be resumed; the protocol has no cursor. Discard the partial output and reissue the whole request. The streaming guide covers how to make that replacement invisible in a UI.
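The taxonomy above can be sketched as a small decision table in Go. This is an illustration, not the gateway's client library: the names Action, classify, and parseRetryAfter are hypothetical, and treating unlisted 5xx codes as transient is an assumption the list does not state.

```go
package main

import (
	"net/http"
	"strconv"
	"time"
)

// Action is what the client does after one attempt.
type Action int

const (
	ReturnToCaller   Action = iota // success or a permanent 4xx: never retry
	RetryWithBackoff               // connection errors, 429, 502
	RetryOnceDelayed               // 504 wake_timeout: one retry after a pause
	AlertNoRetry                   // 503 no_upstream: configuration problem
)

// classify maps a transport error or HTTP status to an Action.
// Callers should check ctx.Err() first: a cancelled context also surfaces
// as a transport error and should not be retried.
func classify(status int, transportErr error) Action {
	if transportErr != nil {
		return RetryWithBackoff // the request never reached the server
	}
	switch status {
	case http.StatusTooManyRequests, http.StatusBadGateway:
		return RetryWithBackoff
	case http.StatusServiceUnavailable:
		return AlertNoRetry
	case http.StatusGatewayTimeout:
		return RetryOnceDelayed
	}
	if status >= 500 {
		return RetryWithBackoff // assumption: unlisted 5xx codes are transient
	}
	return ReturnToCaller
}

// parseRetryAfter reads a Retry-After value in either form HTTP allows:
// a whole number of seconds or an HTTP date. ok is false when the value is
// empty or unparseable, in which case the caller falls back to backoff.
func parseRetryAfter(value string) (wait time.Duration, ok bool) {
	if secs, err := strconv.Atoi(value); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second, true
	}
	if at, err := http.ParseTime(value); err == nil {
		if d := time.Until(at); d > 0 {
			return d, true
		}
		return 0, true // the date is already in the past: retry immediately
	}
	return 0, false
}
```

A caller would pass resp.Header.Get("Retry-After") to parseRetryAfter on a 429 and use the result in place of the computed backoff when ok is true.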
Three timeouts, not one
Connect timeout. Small and strict: seconds. TCP and TLS setup does not get slower because the model is large.
Time-to-first-token timeout. Covers queueing, prompt
prefill, and any cold path. This is the timeout that scale-to-zero
providers stress: a first request to an idle tier can legitimately take
minutes while hardware wakes, where a warm request answers in under a
second. Size it from the provider's published worst case, not from
optimism. On this fleet the number to design against is the 8-minute wake
budget; the gateway gives up with 504 wake_timeout at 10
minutes, so a client first-token timeout slightly above that cap never
races the server.
TTFT is the metric to log here.
Total (read) timeout. Derive it from output length and
decode speed instead of guessing. At this fleet's measured short-context
decode rate of 118.9 tokens per second on the dual-H100 tier, a
4,000-token completion runs about 34 seconds of pure decode; a flat
30-second read timeout would kill it at 90 percent complete and bill every
token it discarded. The honest formula is
max_tokens / decode_rate plus first-token headroom.
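As a rough sketch of how the three timeouts might map onto Go's net/http: the function names totalTimeout and newClient are illustrative. ResponseHeaderTimeout stands in for the first-token timeout only for streamed responses, and only on the assumption that the server sends its headers when the first token is ready; for a non-streaming call the headers arrive with the finished body, so that limit would have to cover the whole generation.

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// totalTimeout applies max_tokens / decode_rate plus first-token headroom.
func totalTimeout(maxTokens int, decodeTokPerSec float64, headroom time.Duration) time.Duration {
	decode := time.Duration(float64(maxTokens) / decodeTokPerSec * float64(time.Second))
	return decode + headroom
}

// newClient builds a client with separate connect and first-byte limits.
func newClient(connect, firstToken time.Duration) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			// Connect timeout: TCP dial and TLS handshake, each capped.
			DialContext:         (&net.Dialer{Timeout: connect}).DialContext,
			TLSHandshakeTimeout: connect,
			// Time from request written to response headers received.
			ResponseHeaderTimeout: firstToken,
		},
		// Client.Timeout stays zero: it would cap the whole exchange,
		// body reads included, with one flat number. The total limit is
		// per request and belongs on the request context.
	}
}
```

Per request, the total deadline would then go on the context, for example context.WithTimeout(ctx, totalTimeout(4000, 118.9, headroom)), which at this fleet's 118.9 tokens per second comes to roughly 33.6 seconds plus the headroom.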
Backoff that behaves in a fleet
Exponential backoff with full jitter is the standard for a reason: it spreads synchronized retries across time so a recovering backend sees a trickle instead of a wave. Start around one second, double per attempt, cap the sleep at 30 to 60 seconds, and cap attempts at three to five. Budget retries in dollars as well as attempts: five retries of a 10,000-token request is a different decision from five retries of a 50-token one. For sustained failure, a circuit breaker that fails fast and probes occasionally beats a queue of patient clients all holding connections open.
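A minimal sketch of the sleep calculation described above, assuming Go 1.22 or later for math/rand/v2. The constant values are picked from the ranges in the text and the names are illustrative.

```go
package main

import (
	"math/rand/v2"
	"time"
)

const (
	baseSleep = time.Second      // first retry ceiling
	sleepCap  = 30 * time.Second // upper bound on any single sleep
)

// fullJitter returns the sleep before retry number attempt (0-based):
// a uniform random duration in [0, min(maxSleep, base*2^attempt)).
func fullJitter(attempt int, base, maxSleep time.Duration) time.Duration {
	ceiling := base
	// Stop doubling once the cap is reached so large attempt counts
	// cannot overflow the duration.
	for i := 0; i < attempt && ceiling < maxSleep; i++ {
		ceiling *= 2
	}
	if ceiling > maxSleep {
		ceiling = maxSleep
	}
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(rand.Int64N(int64(ceiling)))
}
```

Because the draw starts at zero, some retries fire almost immediately. That is the point of full jitter: the spread across the whole window is what keeps clients from arriving together.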
A reference implementation
Stdlib Go, with the classification and jitter applied:
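package main

import (
	"bytes"
	"context"
	"errors"
	"math/rand/v2" // Go 1.22 or later
	"net/http"
	"time"
)

var apiKey string // assumed to be loaded at startup, e.g. from the environment

func completeWithRetry(ctx context.Context, c *http.Client, body []byte) (*http.Response, error) {
	const maxAttempts = 4
	backoff := time.Second
	for attempt := 1; ; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost,
			"https://spotinference.com/v1/chat/completions",
			bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Authorization", "Bearer "+apiKey)
		req.Header.Set("Content-Type", "application/json")
		resp, err := c.Do(req)
		if err == nil {
			// Retry 429 and 5xx, except 503 no_upstream, which is configuration.
			retryable := resp.StatusCode == http.StatusTooManyRequests ||
				(resp.StatusCode >= 500 && resp.StatusCode != http.StatusServiceUnavailable)
			if !retryable {
				return resp, nil // 2xx, non-retryable 4xx, and 503 go to the caller
			}
			resp.Body.Close()
		}
		if attempt == maxAttempts {
			return nil, errors.New("retry budget exhausted") // no sleep after the last attempt
		}
		// Full jitter: sleep a uniform random duration in [0, backoff).
		sleep := time.Duration(rand.Int64N(int64(backoff)))
		backoff *= 2
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}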
The Python OpenAI SDK ships most of this: connection errors, 429, and 5xx are retried with backoff out of the box. That default includes 503, so alert on no_upstream separately instead of relying on the SDK to stop. The two knobs worth setting explicitly are the retry count and the timeout split:
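import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://spotinference.com/v1",
    api_key=key,  # key: the API key, loaded elsewhere
    max_retries=3,
    # httpx applies read to each wait for data, not to the whole response.
    # A cold wake longer than 120 s trips it unless the server sends bytes
    # while waking; raise it above the wake cap to wait cold starts out.
    timeout=httpx.Timeout(connect=10.0, read=120.0,
                          write=10.0, pool=10.0),
)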
Neither snippet retries mid-stream failures; that path needs the reissue-and-replace handling described in the streaming guide.
What to log
Per request: status code, error code string, attempt number, TTFT,
total duration, and token counts from usage. Retry policy is
tuned from this log, not from defaults: the difference between a healthy
p99 and a wake event is obvious in TTFT, invisible in averages. When the
provider publishes its own numbers (this one publishes decode rates and
the wake budget), the log doubles as a check that production behavior
matches the brochure.
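One possible shape for that per-request record, sketched with Go's log/slog (Go 1.21 or later). The struct, the function names, and the key names are all illustrative; the token fields assume an OpenAI-style usage object with prompt and completion counts.

```go
package main

import (
	"log/slog"
	"time"
)

// attemptLog holds the per-request fields listed above, one record per attempt.
type attemptLog struct {
	Status           int           // HTTP status code, 0 for transport errors
	ErrorCode        string        // e.g. "wake_timeout"; empty on success
	Attempt          int           // 1 for the first try
	TTFT             time.Duration // time to first token
	Total            time.Duration // full request duration
	PromptTokens     int           // from usage
	CompletionTokens int           // from usage
}

// attrs flattens the record into slog attributes.
func (a attemptLog) attrs() []any {
	return []any{
		slog.Int("status", a.Status),
		slog.String("error_code", a.ErrorCode),
		slog.Int("attempt", a.Attempt),
		slog.Duration("ttft", a.TTFT),
		slog.Duration("total", a.Total),
		slog.Int("prompt_tokens", a.PromptTokens),
		slog.Int("completion_tokens", a.CompletionTokens),
	}
}

// logAttempt writes one structured line per attempt.
func logAttempt(l *slog.Logger, a attemptLog) {
	l.Info("llm_request", a.attrs()...)
}
```

Logging every attempt, not only the final outcome, is what makes retry cost visible: the discarded attempts are where the doubled token bills show up.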