Streaming LLM responses over SSE

Set stream: true and a chat completion arrives as Server-Sent Events: one small JSON chunk per token batch instead of one document at the end. Parsing takes three rules (read data: lines, append delta fragments, stop at [DONE]), and the payoff is that perceived latency collapses from total generation time to time to first token.

What streaming buys, in numbers

Decode speed is finite and visible. At this fleet's measured short-context rate of 118.9 tokens per second on the dual-H100 tier, a 500-token answer takes a little over four seconds of decode; a 2,000-token answer takes around 17 seconds. Without streaming the user stares at a spinner for all of it. With streaming the first words land as soon as prompt prefill finishes, and reading speed, not decode speed, becomes the experienced latency. The metric that captures the difference is time to first token; the engine mechanics behind it are covered at How engines work: TTFT.
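
The decode figures above are plain division. A minimal sketch of that arithmetic, using the 118.9 tokens-per-second rate quoted above and assuming a constant decode rate with prefill and network time ignored (decode_seconds is an illustrative helper, not part of any API):

```python
# Decode time = output tokens / decode rate.
# Assumption: the rate is constant and prefill/network time is ignored.
DECODE_RATE_TPS = 118.9  # tokens per second, the rate quoted in the text


def decode_seconds(output_tokens: int, rate_tps: float = DECODE_RATE_TPS) -> float:
    """Seconds spent decoding `output_tokens` at a constant rate."""
    return output_tokens / rate_tps


for tokens in (500, 2000):
    print(f"{tokens} tokens: {decode_seconds(tokens):.1f} s")
# prints:
# 500 tokens: 4.2 s
# 2000 tokens: 16.8 s
```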

The wire format, exactly

The response carries Content-Type: text/event-stream. Each event is a data: line holding one chat-completion-chunk object, terminated by a blank line. A trimmed transcript of the standard shape:

data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}

data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}

data: {"choices":[{"index":0,"delta":{"content":" there"}}]}

data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Three structural facts matter. The first delta carries the role and usually no content. Content arrives as fragments that follow tokenizer boundaries, so words and even multi-byte characters can split across chunks; fragments concatenate into the final string and mean nothing individually. The terminator is the literal sentinel data: [DONE], not a closed connection: a connection that closes without the sentinel is a failed stream, not a finished one.
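
A minimal sketch of those three facts in code, assuming the data: payloads have already been separated out of the byte stream. The names assemble and IncompleteStream are illustrative, and the sketch follows only the choice at index 0:

```python
import json


class IncompleteStream(Exception):
    """Raised when the payloads run out before the [DONE] sentinel."""


def assemble(data_payloads):
    """Concatenate content fragments from an iterable of data: payload strings.

    Returns (text, finish_reason). Raises IncompleteStream if [DONE] never arrives.
    """
    parts, finish_reason = [], None
    for payload in data_payloads:
        if payload == "[DONE]":
            return "".join(parts), finish_reason
        chunk = json.loads(payload)
        for choice in chunk.get("choices") or []:
            if choice.get("index", 0) != 0:
                continue  # assumption: this sketch tracks only the first choice
            fragment = (choice.get("delta") or {}).get("content")
            if fragment:
                parts.append(fragment)  # appended verbatim, no trimming
            finish_reason = choice.get("finish_reason") or finish_reason
    # The iterable ended without the sentinel: a failed stream, not a finished one.
    raise IncompleteStream("stream ended before data: [DONE]")


transcript = [
    '{"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    '{"choices":[{"index":0,"delta":{"content":"Hello"}}]}',
    '{"choices":[{"index":0,"delta":{"content":" there"}}]}',
    '{"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "[DONE]",
]
print(assemble(transcript))  # prints ('Hello there', 'stop')
```

Raising on a missing sentinel, rather than returning the partial text, keeps a truncated answer from being mistaken for a complete one.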

Parsing rules that survive production

Buffer by event, not by read. TCP gives no guarantee that one read returns one event; a chunk can arrive split across reads or glued to its neighbor. Accumulate bytes, split on blank lines, then parse.
Append fragments verbatim. No trimming, no whitespace normalisation; the spaces at fragment boundaries are content.
Key on index, and check the array first. Chunks carry a choices array; with n = 1 it has one entry, but reading choices[0] without checking that the array is non-empty is the classic crash when a final usage chunk arrives with an empty choices list.
Accumulate tool-call deltas. Streamed tool calls deliver function.arguments as string fragments keyed by tool-call index; the JSON only parses after the last fragment, so collect first and parse at finish_reason.
Ignore unknown fields. Servers add chunk fields over time; a strict parser is a self-inflicted outage.
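
The rules above can be sketched for a hand-rolled client roughly as follows. This is an illustration under stated assumptions, not a complete SSE implementation: SSEBuffer and accumulate_tool_calls are hypothetical names, only data: fields are handled, and an empty arguments string is treated as an empty JSON object.

```python
import json


class SSEBuffer:
    """Accumulates raw bytes and returns complete data: payloads."""

    def __init__(self):
        self._buf = b""

    def feed(self, data: bytes):
        """Add the bytes from one network read; return payloads now complete."""
        # Normalise CRLF on the whole buffer so a pair split across reads is caught.
        self._buf = (self._buf + data).replace(b"\r\n", b"\n")
        # Everything before the last blank line is complete; the tail stays buffered.
        *events, self._buf = self._buf.split(b"\n\n")
        payloads = []
        for event in events:
            data_lines = []
            # Decode only whole events, so a multi-byte character split
            # across reads is never decoded in halves.
            for line in event.decode("utf-8").split("\n"):
                if line.startswith("data:"):
                    value = line[5:]
                    # One optional space follows the colon; keep the rest verbatim.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if data_lines:
                payloads.append("\n".join(data_lines))
        return payloads


def accumulate_tool_calls(chunks):
    """Collect tool-call fragments per tool-call index; parse once all chunks are in."""
    calls = {}
    for chunk in chunks:
        for choice in chunk.get("choices") or []:  # empty on a usage chunk
            for tc in (choice.get("delta") or {}).get("tool_calls") or []:
                slot = calls.setdefault(tc["index"], {"name": "", "arguments": ""})
                fn = tc.get("function") or {}
                slot["name"] += fn.get("name") or ""
                slot["arguments"] += fn.get("arguments") or ""
    # Assumption: an empty arguments string stands for "no arguments".
    return {i: (c["name"], json.loads(c["arguments"] or "{}")) for i, c in calls.items()}
```

Fields the code does not ask for are simply never read, which is what keeps the parser tolerant of chunk fields added later.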

With the OpenAI SDK most of this is handled, and the loop reduces to:

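import os
from openai import OpenAI

# Base URL and key placeholder match the curl example in this guide.
client = OpenAI(base_url="https://spotinference.com/v1", api_key=os.environ["YOUR_KEY"])
messages = [{"role": "user", "content": "Count to ten."}]

stream = client.chat.completions.create(
    model="qwen3-coder",
    messages=messages,
    stream=True,
)
for chunk in stream:
    # choices can be empty (for example on a final usage chunk), so check it first.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)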

For a raw smoke test, curl -N disables curl's own buffering and prints events as they arrive:

curl -N https://spotinference.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_KEY" \
  -d '{"model": "qwen3-coder", "stream": true,
       "messages": [{"role": "user", "content": "Count to ten."}]}'

Where streams get buffered anyway

The most common streaming bug is not in the parser: it is a middlebox collecting the whole response and delivering it in one lump, which silently converts streaming back into batch with extra steps. The usual suspects, in order of frequency: a reverse proxy buffering upstream responses (Nginx needs either proxy_buffering off in its configuration or an X-Accel-Buffering: no header on the upstream response); a CDN or load balancer configured to buffer; response compression applied to the event stream; a serverless platform that buffers entire function responses by design; and client HTTP libraries that read to completion unless explicitly asked to stream. The diagnostic is always the same: time-stamp chunk arrivals at the client. Smooth millisecond spacing means the path is clean; one burst after seconds of silence means something between the engine and the client is hoarding bytes. The contract-level treatment of this gotcha is at the API page on streaming.
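
The time-stamping diagnostic can be sketched as below. The helper names and the thresholds (a 1 ms median gap, at least five chunks) are assumptions chosen for illustration, not measured values; tune them against a path known to be clean.

```python
import statistics
import time


def arrival_gaps(chunks, clock=time.monotonic):
    """Consume an iterable of chunks and return the seconds between arrivals.

    The first entry is the wait before the first chunk.
    """
    gaps, last = [], clock()
    for _ in chunks:
        now = clock()
        gaps.append(now - last)
        last = now
    return gaps


def looks_buffered(gaps, burst_gap_s=0.001, min_chunks=5):
    """Heuristic: chunks arriving back to back after the first wait suggest a lump."""
    if len(gaps) < min_chunks:
        return False  # too few chunks to judge either way
    # Ignore the first gap (prefill or wake time) and look at the spacing after it.
    return statistics.median(gaps[1:]) < burst_gap_s
```

Passing the SDK stream object or a raw line iterator as chunks is enough; the function never inspects chunk contents, only when each one arrives.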

Timeouts and disconnects

Streams need a different timeout shape than blocking calls: a first-token timeout sized to the provider's cold path, then an inter-chunk timeout rather than a total deadline. A healthy stream at around 100 tokens per second never goes quiet for more than a fraction of a second, so multi-second inter-chunk silence is a strong failure signal long before a total timeout would fire. One caveat applies on scale-to-zero fleets: the first request after idle can spend minutes in hardware wake before its first token. This fleet bounds that wake at 8 minutes with a 10-minute hard cap and fails honestly with 504 past the cap, numbers documented at Reliability: wake budgets. A severed stream cannot be resumed; the recovery is to reissue the request and replace the partial output, and the retry mechanics (which failures are worth reissuing, with what backoff) are in the retries and timeouts guide.
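
One way to express that timeout shape, sketched with asyncio over any async iterator of chunks. The names with_stream_timeouts and StreamStalled are hypothetical, and the timeout values are left to the caller:

```python
import asyncio


class StreamStalled(Exception):
    """Raised when the stream stays silent longer than the active timeout."""

    def __init__(self, phase, timeout_s):
        super().__init__(f"{phase} timeout after {timeout_s} s")
        self.phase = phase


async def with_stream_timeouts(chunks, first_token_s, inter_chunk_s):
    """Yield chunks from an async iterable, enforcing two separate timeouts.

    first_token_s bounds the wait for the first chunk; inter_chunk_s bounds
    every later gap. There is deliberately no total deadline.
    """
    iterator = chunks.__aiter__()
    phase, timeout_s = "first-token", first_token_s
    while True:
        try:
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout_s)
        except StopAsyncIteration:
            return  # upstream finished normally
        except asyncio.TimeoutError:
            raise StreamStalled(phase, timeout_s) from None
        yield chunk
        # After the first chunk, switch to the tighter inter-chunk budget.
        phase, timeout_s = "inter-chunk", inter_chunk_s
```

On a scale-to-zero fleet the first-token budget would need to cover the wake path, for example something above the 10-minute hard cap quoted above, while the inter-chunk budget can stay at a few seconds.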

Usage accounting in streams

A plain stream ends without token counts. Request stream_options: {"include_usage": true} and the server appends a final chunk before [DONE] whose usage object carries prompt_tokens and completion_tokens and whose choices array is empty (the crash case from the parsing rules above). For metering, prefer these server-reported counts over client-side token estimates for the reasons the migration guide covers: the server's tokenizer is the one whose counts are billed.
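
A minimal sketch of reading that final chunk, assuming the chunks have already been decoded into dicts. The helper name split_usage is illustrative, and the request body reuses the model and prompt from the examples above:

```python
# Request body with usage reporting switched on.
request_body = {
    "model": "qwen3-coder",
    "stream": True,
    "stream_options": {"include_usage": True},
    "messages": [{"role": "user", "content": "Count to ten."}],
}


def split_usage(chunks):
    """Return (text, usage) from an iterable of decoded chunk dicts.

    usage stays None if the server never sent a usage object.
    """
    parts, usage = [], None
    for chunk in chunks:
        # Ordinary chunks may carry "usage": null; only a real object is kept.
        if chunk.get("usage"):
            usage = chunk["usage"]
        # The usage chunk has an empty choices list, so this loop just skips it.
        for choice in chunk.get("choices") or []:
            fragment = (choice.get("delta") or {}).get("content")
            if fragment:
                parts.append(fragment)
    return "".join(parts), usage
```

Returning None when no usage object arrived lets the caller distinguish "not reported" from zero tokens before writing anything to a meter.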