Streaming LLM responses over SSE
Set stream: true and a chat completion
arrives as Server-Sent Events: one small JSON chunk per token batch
instead of one document at the end. Parsing takes three rules (read
data: lines, append delta fragments, stop at
[DONE]), and the payoff is that perceived latency collapses
from total generation time to time to first token.
What streaming buys, in numbers
Decode speed is finite and visible. At this fleet's measured short-context rate of 118.9 tokens per second on the dual-H100 tier, a 500-token answer takes a little over four seconds of decode; a 2,000-token answer takes around 17. Without streaming the user stares at a spinner for all of it. With streaming the first words land as soon as prompt prefill finishes, and reading speed, not decode speed, becomes the experienced latency. The metric that captures the difference is time to first token; the engine mechanics behind it are covered at How engines work: TTFT.
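The arithmetic behind those figures is a single division. A minimal sketch, assuming a steady decode rate and ignoring prefill; the function name is illustrative, and the default rate is the 118.9 tokens per second quoted above:

```python
def decode_seconds(tokens: int, tokens_per_second: float = 118.9) -> float:
    """Seconds of pure decode for `tokens` output tokens at a steady rate.

    Assumption: constant decode speed and no prefill time, so this is a
    floor on total generation time rather than a prediction.
    """
    return tokens / tokens_per_second


for n in (500, 2000):
    print(f"{n} tokens: {decode_seconds(n):.1f} s of decode")
```

Without streaming, the user waits for all of this before seeing anything; with streaming, only prefill stands between the request and the first words.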
The wire format, exactly
The response carries Content-Type: text/event-stream.
Each event is a data: line holding one chat-completion-chunk
object, terminated by a blank line. A trimmed transcript of the standard
shape:
data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}
data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}
data: {"choices":[{"index":0,"delta":{"content":" there"}}]}
data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Three structural facts matter. The first delta carries the
role and usually no content. Content arrives as fragments
that follow tokenizer boundaries, so words and even multi-byte characters
can split across chunks; fragments concatenate into the final string and
mean nothing individually. The terminator is the literal sentinel
data: [DONE], not a closed connection: a connection that
closes without the sentinel is a failed stream, not a finished one.
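The three facts can be checked against the transcript itself. The following is a rough sketch rather than a production parser: it assumes the whole transcript is already in memory as text and a single choice (n = 1), and the names TRANSCRIPT and assemble are illustrative.

```python
import json

# The transcript above, with the blank line that terminates each event.
TRANSCRIPT = (
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{"content":" there"}}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}\n\n'
    'data: [DONE]\n\n'
)


def assemble(transcript: str) -> str:
    """Concatenate content fragments; raise if the [DONE] sentinel never arrives."""
    parts = []
    for event in transcript.split("\n\n"):
        for line in event.splitlines():
            if not line.startswith("data:"):
                continue  # not a data line; nothing to parse
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                return "".join(parts)
            chunk = json.loads(payload)
            for choice in chunk.get("choices", []):
                fragment = choice.get("delta", {}).get("content")
                if fragment:  # the role-only first delta and the final delta carry none
                    parts.append(fragment)
    # Running out of events without the sentinel is a failed stream.
    raise RuntimeError("stream closed without [DONE]")


print(assemble(TRANSCRIPT))  # Hello there
```

Dropping the last event from the transcript makes assemble raise instead of returning a partial string, which is the behaviour the sentinel rule asks for.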
Parsing rules that survive production
- Buffer by event, not by read. TCP gives no guarantee that one read returns one event; a chunk can arrive split across reads or glued to its neighbor. Accumulate bytes, split on blank lines, then parse.
- Append fragments verbatim. No trimming, no whitespace normalisation; the spaces at fragment boundaries are content.
- Key on index. Chunks carry a choices array; with n = 1 it has one entry, but keying on choices[0] without checking the array is the classic crash when a final usage chunk arrives with an empty choices list.
- Accumulate tool-call deltas. Streamed tool calls deliver function.arguments as string fragments keyed by tool-call index; the JSON only parses after the last fragment, so collect first and parse at finish_reason.
- Ignore unknown fields. Servers add chunk fields over time; a strict parser is a self-inflicted outage.
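For clients that read the socket directly, the buffering, verbatim-append, index, and unknown-field rules might be sketched as below. This is one possible shape under stated assumptions, not a reference implementation: SSEBuffer and collect are hypothetical names, each event is assumed to carry a single data: line as in the transcript above, and the sample bytes are made up for the demonstration.

```python
import json
from typing import Dict, Iterable, List


class SSEBuffer:
    """Accumulates raw bytes and returns the data: payloads of complete events."""

    def __init__(self) -> None:
        self._buf = b""

    def feed(self, data: bytes) -> List[str]:
        # Normalise CRLF on the whole buffer so a pair split across reads still matches.
        self._buf = (self._buf + data).replace(b"\r\n", b"\n")
        payloads = []
        while b"\n\n" in self._buf:
            raw, self._buf = self._buf.split(b"\n\n", 1)
            # Decode per complete event: a multi-byte character split across
            # reads is whole again by the time the blank line has arrived.
            for line in raw.decode("utf-8").split("\n"):
                if line.startswith("data:"):
                    payload = line[5:]
                    payloads.append(payload[1:] if payload.startswith(" ") else payload)
        return payloads


def collect(reads: Iterable[bytes]) -> Dict[int, str]:
    """Assemble content per choice index from arbitrarily split reads."""
    buf = SSEBuffer()
    text: Dict[int, List[str]] = {}
    done = False
    for read in reads:
        for payload in buf.feed(read):
            if payload == "[DONE]":
                done = True
                continue
            chunk = json.loads(payload)
            # `choices` may be empty (usage chunk); unknown fields are simply not read.
            for choice in chunk.get("choices") or []:
                fragment = (choice.get("delta") or {}).get("content")
                if fragment:
                    text.setdefault(choice["index"], []).append(fragment)  # verbatim
    if not done:
        raise RuntimeError("stream closed without [DONE]")
    return {index: "".join(parts) for index, parts in text.items()}


wire = (
    'data: {"choices":[{"index":0,"delta":{"content":"caf"}}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{"content":"é au"}}]}\n\n'
    'data: {"choices":[],"usage":{"prompt_tokens":3,"completion_tokens":2}}\n\n'
    'data: [DONE]\n\n'
).encode("utf-8")

cut = wire.index("é".encode("utf-8")) + 1  # split in the middle of a two-byte character
print(collect([wire[:10], wire[10:cut], wire[cut:]]))  # {0: 'café au'}
```

The same input fed one byte at a time produces the same result, which is the property the buffer-by-event rule is after.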
With the OpenAI SDK most of this is handled, and the loop reduces to:
stream = client.chat.completions.create(
    model="qwen3-coder",
    messages=messages,
    stream=True,
)
for chunk in stream:
    # Guard against chunks whose choices list is empty (e.g. a final usage chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
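That loop prints content only. Streamed tool calls need the collect-then-parse rule from the list above, which could look roughly like this. The sketch works on chunks as plain dicts in the standard chunk shape and assumes a single choice; accumulate_tool_calls, the get_weather tool, and the call_1 id are hypothetical and exist only for the demonstration.

```python
import json


def accumulate_tool_calls(chunks):
    """Collect streamed tool-call fragments, then parse arguments once complete."""
    calls = {}  # tool-call index -> {"id", "name", "arguments"}
    finished = False
    for chunk in chunks:
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            for tc in delta.get("tool_calls") or []:
                slot = calls.setdefault(
                    tc["index"], {"id": None, "name": None, "arguments": ""}
                )
                if tc.get("id"):
                    slot["id"] = tc["id"]
                fn = tc.get("function") or {}
                if fn.get("name"):
                    slot["name"] = fn["name"]
                # Argument fragments are not valid JSON on their own: just append.
                slot["arguments"] += fn.get("arguments") or ""
            if choice.get("finish_reason"):
                finished = True
    if not finished:
        raise RuntimeError("stream ended before finish_reason")
    # Parse only now, when every fragment has arrived.
    return [
        {"id": c["id"], "name": c["name"], "arguments": json.loads(c["arguments"] or "{}")}
        for _, c in sorted(calls.items())
    ]


chunks = [
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": ""}}]}}]},
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '{"city": "Par'}}]}}]},
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": 'is"}'}}]}}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "tool_calls"}]},
]

print(accumulate_tool_calls(chunks))
```

Calling json.loads on the second chunk's fragment alone would fail, which is why the parse waits for finish_reason.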
For a raw smoke test, curl -N disables curl's own
buffering and prints events as they arrive:
curl -N https://spotinference.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_KEY" \
  -d '{"model": "qwen3-coder", "stream": true,
       "messages": [{"role": "user", "content": "Count to ten."}]}'
Where streams get buffered anyway
The most common streaming bug is not in the parser: it is a middlebox
collecting the whole response and delivering it in one lump, which
silently converts streaming back into batch with extra steps. The usual
suspects, in order of frequency: a reverse proxy buffering upstream
responses (Nginx needs either an X-Accel-Buffering: no response header
from the upstream or the proxy_buffering off directive); a CDN or load
balancer configured to
buffer; response compression applied to the event stream; a serverless
platform that buffers entire function responses by design; and client
HTTP libraries that read to completion unless explicitly asked to stream.
The diagnostic is always the same: time-stamp chunk arrivals at the
client. Smooth millisecond spacing means the path is clean; one burst
after seconds of silence means something between the engine and the
client is hoarding bytes. The contract-level treatment of this gotcha is
at the API page on streaming.
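The time-stamping diagnostic fits in a few lines. A minimal sketch: timestamped wraps any chunk iterator, and looks_buffered applies a deliberately crude heuristic. Both names and both thresholds (20 chunks, 50 ms) are illustrative assumptions, not values from this fleet.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def timestamped(
    stream: Iterable, clock: Callable[[], float] = time.monotonic
) -> Iterator[Tuple[float, object]]:
    """Yield (seconds since iteration began, chunk) for each chunk as it arrives."""
    start = clock()
    for chunk in stream:
        yield clock() - start, chunk


def looks_buffered(
    arrivals: List[float], min_chunks: int = 20, burst_window: float = 0.05
) -> bool:
    """Heuristic: many chunks all landing within `burst_window` seconds is one lump.

    Assumption: at roughly 100 tokens per second, 20 genuinely streamed
    chunks would be spread over far more than 50 ms.
    """
    if len(arrivals) < min_chunks:
        return False  # too short to tell a burst from a brief answer
    return (arrivals[-1] - arrivals[0]) < burst_window


# Illustrative arrival times in seconds, not measurements.
clean = [0.3 + i * 0.0084 for i in range(50)]  # steady spacing after first token
lumped = [4.2 + i * 0.0001 for i in range(50)]  # silence, then everything at once
print(looks_buffered(clean), looks_buffered(lumped))  # False True
```

In practice the arrival list would come from something like `[t for t, _ in timestamped(stream)]`, collected at the client end of the path under test.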
Timeouts and disconnects
Streams need a different timeout shape than blocking calls: a
first-token timeout sized to the provider's cold path, then an
inter-chunk timeout rather than a total deadline. A healthy stream at
around 100 tokens per second never goes quiet for more than a fraction of
a second, so multi-second inter-chunk silence is a strong failure signal
long before a total timeout would fire. One caveat applies on
scale-to-zero fleets: the first request after idle can spend minutes in
hardware wake before its first token. This fleet bounds that wake at
8 minutes with a 10-minute hard cap and fails honestly with
504 past the cap; both numbers are documented at
Reliability: wake budgets. A
severed stream cannot be resumed; the recovery is to reissue the request
and replace the partial output, and the retry mechanics (which failures
are worth reissuing, with what backoff) are in the
retries and timeouts
guide.
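The two-stage timeout shape can be approximated around any blocking chunk iterator. What follows is a sketch under assumptions, not the SDK's own timeout mechanism: with_stream_timeouts and StreamStalled are hypothetical names, and a background thread does the reading so the consumer can give up on a quiet stream.

```python
import queue
import threading
from typing import Iterable, Iterator


class StreamStalled(TimeoutError):
    """Raised when the stream stays quiet for longer than allowed."""


def with_stream_timeouts(
    stream: Iterable, first_timeout: float, gap_timeout: float
) -> Iterator:
    """Yield chunks, enforcing a first-chunk timeout and then an inter-chunk timeout."""
    q: queue.Queue = queue.Queue()

    def pump() -> None:
        # Runs in a background thread: forwards chunks, end-of-stream, or an error.
        try:
            for chunk in stream:
                q.put(("chunk", chunk))
            q.put(("end", None))
        except Exception as exc:
            q.put(("error", exc))

    threading.Thread(target=pump, daemon=True).start()

    timeout = first_timeout  # generous: covers prefill and any cold start
    while True:
        try:
            kind, value = q.get(timeout=timeout)
        except queue.Empty:
            raise StreamStalled(f"no chunk within {timeout:.1f} s") from None
        if kind == "end":
            return
        if kind == "error":
            raise value
        yield value
        timeout = gap_timeout  # tight: a healthy stream never goes quiet for long


# Hypothetical usage: 600 s mirrors the 10-minute hard cap above; 5 s is illustrative.
# for chunk in with_stream_timeouts(stream, first_timeout=600.0, gap_timeout=5.0):
#     ...
```

On a stall the wrapper only stops waiting; the caller still has to close the underlying connection so the reader thread can exit, and then reissue the request as described above.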
Usage accounting in streams
A plain stream ends without token counts. Request
stream_options: {"include_usage": true} and the server
appends a final chunk before [DONE] whose
usage object carries prompt_tokens and
completion_tokens and whose choices array is
empty (the crash case from the parsing rules above). For metering,
prefer these server-reported counts over client-side token estimates for
the reasons the
migration guide covers:
the server's tokenizer is the one that gets billed.
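Picking the counts out of the stream is a small extension of the content loop. A minimal sketch over chunks as plain dicts; consume is a hypothetical helper and the token counts in the sample are invented. With the OpenAI Python SDK the request side would, as far as this sketch assumes, be stream_options={"include_usage": True} passed alongside stream=True.

```python
def consume(chunks):
    """Return (text, usage) from an iterable of chunk dicts; usage is None if absent."""
    parts, usage = [], None
    for chunk in chunks:
        # Some servers send "usage": null on every chunk but the last; skip those.
        if chunk.get("usage"):
            usage = chunk["usage"]
        # The usage chunk has an empty choices list, so never index choices[0] blindly.
        for choice in chunk.get("choices") or []:
            fragment = (choice.get("delta") or {}).get("content")
            if fragment:
                parts.append(fragment)
    return "".join(parts), usage


chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]},
    {"choices": [{"index": 0, "delta": {"content": "Hello"}}]},
    {"choices": [{"index": 0, "delta": {"content": " there"}}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
    {"choices": [], "usage": {"prompt_tokens": 9, "completion_tokens": 2}},
]

text, usage = consume(chunks)
print(text, usage)
```

A stream requested without the option leaves usage as None, which a metering path can treat as "count unavailable" rather than falling back silently to a client-side estimate.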