Retries, timeouts, and backoff for LLM API calls

An LLM API call is a long-lived, expensive POST whose normal tail latency would page an on-call engineer anywhere else in the stack. Generic REST retry policy mishandles it in both directions: flat 30-second timeouts kill healthy requests, and naive retry loops double the bill. The fix is three separate timeouts, a failure taxonomy decided before the first retry, and exponential backoff with jitter.

Why generic retry policy fails here

Three properties separate inference calls from ordinary API traffic. Duration: a 2,000-token completion takes tens of seconds even on fast hardware, so any flat timeout tuned for microservices fires on healthy requests. Cost: a retried completion can bill its tokens twice, since an attempt that reached the backend may be charged even if its response never arrived, so retry count is a budget decision, not a resilience freebie. Recovery shape: a GPU backend coming back from a fault needs time to warm up, and a thundering herd of synchronized retries arrives exactly when the backend is least able to absorb it. Chat completions are stateless, which is the one mercy: a failed request can always be reissued from scratch without server-side cleanup.

Classify the failure before retrying

Each failure class gets its own policy. A useful classification, using this gateway's published error contract as the worked example:

Connection errors, DNS failures, TLS resets. Retry with backoff. The request never reached the server, so no tokens were billed and no duplicate work exists.
429 rate limited. Retry after waiting. Honor a Retry-After header when the server sends one; back off exponentially when it does not.
502 upstream_unreachable. The gateway accepted the request but the GPU backend did not answer. Transient by construction: retry with backoff.
503 no_upstream. No tier is configured to serve the requested model. This is configuration, not weather; retrying in a tight loop converts one mistake into a log flood. Alert instead.
504 wake_timeout. The request arrived while the fleet was waking hardware and the wake exceeded the cap. The machine is likely warm by the next attempt, so a single delayed retry is the right move. The wake itself is budgeted at 8 minutes with a 10-minute hard cap, documented at Reliability: wake budgets.
400, 401, 403, 422. Never retry. The request is malformed, unauthenticated, forbidden, or oversized, and will fail identically every time.
Mid-stream disconnect. A severed SSE stream cannot be resumed; the protocol has no cursor. Discard the partial output and reissue the whole request. The streaming guide covers how to make that replacement invisible in a UI.
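The classification above can be sketched as a small lookup. This is a minimal illustration, not the gateway's client library: the Action names and the classify function are hypothetical, and treating every other 5xx as transient is an assumption the list does not state.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY_BACKOFF = "retry with backoff"
    RETRY_AFTER = "wait, honoring Retry-After if present"
    RETRY_ONCE = "single delayed retry"
    ALERT = "alert, do not retry"
    FAIL = "return the error to the caller"


def classify(status: Optional[int]) -> Action:
    """Map a failed attempt to a policy. status=None means a
    connection-level failure (DNS, TCP, TLS) with no HTTP response."""
    if status is None:
        return Action.RETRY_BACKOFF
    if status < 400:
        raise ValueError("not a failure status")
    if status == 429:
        return Action.RETRY_AFTER
    if status == 503:  # no_upstream: configuration, not weather
        return Action.ALERT
    if status == 504:  # wake_timeout: hardware is likely warm next time
        return Action.RETRY_ONCE
    if status < 500:  # 400, 401, 403, 422 and other client errors
        return Action.FAIL
    # 502 upstream_unreachable; assumption: other 5xx are also transient
    return Action.RETRY_BACKOFF
```

Keeping the decision in one function means the retry loop itself stays free of status-code special cases, and the table can be unit-tested against the error contract.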

Three timeouts, not one

Connect timeout. Small and strict: seconds. TCP and TLS setup does not get slower because the model is large.

Time-to-first-token timeout. Covers queueing, prompt prefill, and any cold path. This is the timeout that scale-to-zero providers stress: a first request to an idle tier can legitimately take minutes while hardware wakes, where a warm request answers in under a second. Size it from the provider's published worst case, not from optimism. On this fleet the number to design against is the 8-minute wake budget; the gateway gives up with 504 wake_timeout at 10 minutes, so a client first-token timeout slightly above that cap never races the server. Time to first token (TTFT) is the metric to log here.

Total (read) timeout. Derive it from output length and decode speed instead of guessing. At this fleet's measured short-context decode rate of 118.9 tokens per second on the dual-H100 tier, a 4,000-token completion needs about 34 seconds of pure decode (4,000 / 118.9); a flat 30-second read timeout would kill it roughly 89 percent of the way through and bill every token it discarded. The honest formula is max_tokens / decode_rate plus first-token headroom.
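The arithmetic behind that formula, as a short sketch. The function name and the headroom values are illustrative assumptions; only the 118.9 tokens-per-second rate and the 10-minute cap come from the text above.

```python
def read_timeout_s(max_tokens: int, decode_rate_tps: float,
                   first_token_headroom_s: float) -> float:
    """Total timeout = pure decode time + allowance for the first token."""
    if decode_rate_tps <= 0:
        raise ValueError("decode rate must be positive")
    return max_tokens / decode_rate_tps + first_token_headroom_s


# Pure decode time for the worked example (no headroom).
decode_s = read_timeout_s(4000, 118.9, 0.0)
print(round(decode_s, 1))            # 33.6
# Fraction of the completion finished when a flat 30 s timeout fires.
print(round(30 / decode_s * 100))    # 89
# Same request, allowing for a cold start up to the 10-minute cap.
print(round(read_timeout_s(4000, 118.9, 600.0), 1))  # 633.6
```

How much headroom to allow is a policy choice: a warm-only path can use a few seconds, while a path that must survive a wake needs the full cap.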

Backoff that behaves in a fleet

Exponential backoff with full jitter is the standard for a reason: it spreads synchronized retries across time so a recovering backend sees a trickle instead of a wave. Start around one second, double per attempt, cap the sleep at 30 to 60 seconds, and cap attempts at three to five. Budget retries in dollars as well as attempts: five retries of a 10,000-token request is a different decision from five retries of a 50-token one. For sustained failure, a circuit breaker that fails fast and probes occasionally beats a queue of patient clients all holding connections open.
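A minimal sketch of the delay calculation, assuming a one-second base and a 30-second cap from the ranges above. The function name and parameters are hypothetical, and letting a server-supplied Retry-After value override the computed delay is one reasonable reading of the 429 policy, not the only one.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 30.0,
                  retry_after_s: Optional[float] = None,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to sleep before retry number `attempt` (0-based).

    Full jitter: pick uniformly between 0 and the capped exponential
    ceiling, so clients that failed together do not retry together.
    """
    if retry_after_s is not None:
        return retry_after_s  # the server's instruction wins
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng() * ceiling


# Ceilings for the first six attempts: 1, 2, 4, 8, 16, then capped at 30.
for attempt in range(6):
    print(attempt, round(backoff_delay(attempt), 2))
```

The rng parameter exists only so the schedule can be tested deterministically; production callers would leave the default.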

A reference implementation

Stdlib Go, with the classification and jitter applied:

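import (
    "bytes"
    "context"
    "errors"
    "math/rand/v2" // Go 1.22+
    "net/http"
    "strconv"
    "time"
)

const maxAttempts = 4

func completeWithRetry(ctx context.Context, c *http.Client, body []byte) (*http.Response, error) {
    backoff := time.Second
    for attempt := 0; attempt < maxAttempts; attempt++ {
        req, err := http.NewRequestWithContext(ctx, http.MethodPost,
            "https://spotinference.com/v1/chat/completions",
            bytes.NewReader(body))
        if err != nil {
            return nil, err
        }
        req.Header.Set("Authorization", "Bearer "+apiKey) // apiKey: package-level string, set elsewhere
        req.Header.Set("Content-Type", "application/json")

        resp, err := c.Do(req)
        if err == nil && !retryable(resp.StatusCode) {
            return resp, nil // 2xx, non-retryable 4xx, and 503 no_upstream go to the caller
        }
        // Full jitter: sleep a uniform random duration in [0, backoff).
        sleep := time.Duration(rand.Int64N(int64(backoff)))
        if resp != nil {
            // Honor Retry-After (seconds form) when the server sends one.
            if s, perr := strconv.Atoi(resp.Header.Get("Retry-After")); perr == nil && s > 0 {
                sleep = time.Duration(s) * time.Second
            }
            resp.Body.Close()
        }
        if attempt == maxAttempts-1 {
            break // no point sleeping after the final attempt
        }
        backoff = min(backoff*2, 30*time.Second)
        select {
        case <-time.After(sleep):
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }
    return nil, errors.New("retry budget exhausted")
}

// retryable reports whether a status is transient: 429 and every 5xx except 503.
func retryable(status int) bool {
    return status == http.StatusTooManyRequests ||
        (status >= 500 && status != http.StatusServiceUnavailable)
}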

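The Python OpenAI SDK ships most of this: connection errors, 429, and 5xx are retried with backoff out of the box. That default does not distinguish 503 no_upstream from the transient 5xx codes, so alert on it separately. The two knobs worth setting explicitly are the retry count and the timeout split: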

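import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://spotinference.com/v1",
    api_key=key,
    max_retries=3,
    # read sits just above the gateway's 10-minute wake cap, so the
    # client does not give up before 504 wake_timeout can arrive
    timeout=httpx.Timeout(connect=10.0, read=630.0,
                          write=10.0, pool=10.0),
)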

Neither snippet retries mid-stream failures; that path needs the reissue-and-replace handling described in the streaming guide.
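For orientation, the discard-and-reissue rule for severed streams can be sketched in a few lines. This is an illustration under stated assumptions, not the streaming guide's implementation: open_stream is a hypothetical callable that issues a fresh request and yields text chunks, and the built-in ConnectionError stands in for whatever disconnect exception the HTTP client actually raises.

```python
from typing import Callable, Iterable, List, Optional


def collect_with_reissue(open_stream: Callable[[], Iterable[str]],
                         max_attempts: int = 3) -> str:
    """Read a whole streamed completion, restarting from scratch on disconnect."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        chunks: List[str] = []  # fresh buffer: partial output is never reused
        try:
            for chunk in open_stream():
                chunks.append(chunk)
            return "".join(chunks)
        except ConnectionError as exc:
            last_error = exc  # stream severed; discard chunks and reissue
    raise RuntimeError("stream failed after reissues") from last_error
```

The sketch omits the sleep between reissues for brevity; in practice each reissue would wait using the same jittered backoff as any other retry.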

What to log

Per request: status code, error code string, attempt number, TTFT, total duration, and token counts from usage. Retry policy is tuned from this log, not from defaults: the difference between a healthy p99 and a wake event is obvious in TTFT, invisible in averages. When the provider publishes its own numbers (this one publishes decode rates and the wake budget), the log doubles as a check that production behavior matches the brochure.
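One way to capture those fields, sketched with the standard library only. The AttemptLog record and the timed_stream helper are hypothetical names; prompt_tokens and completion_tokens mirror the usage fields of OpenAI-compatible responses, and the injectable clock is there purely to make the timing testable.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class AttemptLog:
    status: Optional[int]        # None for connection-level failures
    error_code: Optional[str]    # e.g. "wake_timeout"
    attempt: int
    ttft_s: Optional[float]      # None if no token ever arrived
    total_s: float
    prompt_tokens: Optional[int]
    completion_tokens: Optional[int]

    def to_json_line(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def timed_stream(chunks: Iterable[str],
                 clock: Callable[[], float] = time.monotonic
                 ) -> Tuple[str, Optional[float], float]:
    """Consume a stream, returning (text, ttft_s, total_s)."""
    start = clock()
    ttft: Optional[float] = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = clock() - start  # first chunk marks time to first token
        parts.append(chunk)
    return "".join(parts), ttft, clock() - start
```

Logging TTFT and total duration as separate fields is what makes a wake event distinguishable from a slow decode when the log is reviewed later.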