The OpenAI Chat Completions API became the de facto standard for LLM inference in 2023 and stayed there; every modern serving stack speaks it, and most extensions are still backward-compatible with the original 2023 shape.
The Chat Completions shape
OpenAI's Chat Completions endpoint, documented at
developers.openai.com/api/reference/resources/chat,
defines a small, regular request body. A model field names
the target weights. A messages array carries the
conversation as a sequence of turns, each with a role
(system, user, assistant, or
tool) and a content string. A handful of
decoding-control fields ride alongside: temperature for
sampling sharpness, top_p for nucleus sampling,
max_tokens for the response cap, stop for
early termination strings, and stream for incremental
delivery.
The non-streaming response is a JSON object with a
choices array. The generated text lives at
choices[0].message.content; the role is fixed to
assistant, and a finish_reason field reports
whether the generation stopped on an end-of-sequence token, the
max_tokens cap, a stop string, or a tool-call invocation. A
usage object reports prompt_tokens,
completion_tokens, and total_tokens for
accounting.
The shape is small enough to implement in a weekend, which is a
large part of why it spread. Path dependence and SDK gravity did the
rest. By 2026, shipping a non-standard inference API has a real cost: a
new vendor has to ship its own SDK, document its own quirks, and
explain why the OpenAI SDK cannot just point at a different
base_url. Open-source serving engines (vLLM, TGI,
llama.cpp, Ollama, TensorRT-LLM) all expose at least a compatible
subset, and most commercial competitors do the same.
Streaming and Server-Sent Events
Setting stream: true in the request body switches the
response from a single JSON document to a Server-Sent Events stream.
The HTTP response carries Content-Type: text/event-stream;
each event is a line prefixed data: followed by a
chat-completion-chunk JSON object, with a blank line terminating the
event. Each chunk carries a delta field containing the new
content fragment, and a client reconstructs the full response by
concatenating deltas as they arrive. A final data: [DONE]
sentinel signals the end of the stream.
One practical gotcha: SSE responses are normally buffered by reverse
proxies and CDNs. Production deployments need an explicit no-buffer
header (Nginx X-Accel-Buffering: no) and a server that
flushes after every chunk write. Without those, tokens arrive in clumps
and the latency advantage of streaming disappears.
Tool calls (function calling)
Tool calling, sometimes called function calling, is a protocol convention layered on top of a chat-completion request. The caller advertises a catalogue of functions the model is permitted to invoke; the model chooses between answering in natural language or returning one or more structured invocations that name a function and supply an arguments object conforming to that function's declared JSON Schema.
The serving layer delivers those invocations to the caller, which executes the functions, captures the results, and feeds them back into a follow-up turn so the model can continue the conversation with grounded data. The model itself does not run the functions; the protocol is a contract for structured delegation.
The request carries an optional tools field. Each
element describes one callable function with a name, a
human-readable description, and a parameters
object that is a full JSON Schema document specifying the argument
payload. An accompanying tool_choice field controls
whether a tool call is allowed, forced, or restricted to a specific
named tool.
When the model elects to invoke a function, the assistant
message in the response carries a tool_calls array. Each
entry has a unique id, a type of
function, and a function object containing
the chosen name and an arguments string. To
return a result, the next request appends a message with
role: "tool", a tool_call_id matching the
assistant's invocation id, and a content field carrying
the result text.
OpenAI shipped the original format in mid-2023 as a single
function_call field on the assistant message. A 2024
revision generalised the shape to a list of tool_calls so
one response could request multiple parallel invocations, and
introduced the symmetric role: "tool" message for
returning results. The 2024 shape is what stabilised across the
ecosystem. vLLM, Hugging Face TGI, and llama.cpp's server mode all
expose the OpenAI-compatible tools and
tool_calls fields on their Chat Completions endpoints.
Different model families emit tool calls in different on-the-wire
syntaxes. Some families wrap invocations in custom XML tags (for
example <tool_call> with a JSON body inside).
Hermes-style fine-tunes use a tagged JSON convention; Mistral's
instruction-tuned models use a bracketed token sequence; the Llama 3
instruct line emits a JSON object with a fixed top-level shape; Qwen3
coder variants split between an XML form and a JSON form depending on
the sub-family.
The serving engine takes raw model output, detects
which syntax was emitted, and translates it into the canonical OpenAI
tool_calls shape via a pluggable component known as the
tool-call parser. vLLM and TGI ship a small library of
parsers named for the family each understands: hermes,
mistral, llama3_json,
qwen3_coder, qwen3_xml, and others.
Parser selection is consequential: a Hermes-trained model paired with a
Mistral parser produces invocations the parser cannot recognise, and
the raw text leaks through as ordinary prose or, worse, emits malformed
tool_calls objects with truncated arguments.
Structured output
Structured output is the generalisation of tool calling. The caller supplies, alongside the prompt, a schema that the response must satisfy. The schema can be a JSON Schema document, a regular expression, a context-free grammar in EBNF or Lark, or a Pydantic-style class definition compiled down to one of the above. The server returns a string that, when parsed, is guaranteed to validate against the supplied schema. The guarantee is provided by the server, not the client.
The contrast is with the unconstrained mode in which the
prompt asks the model to "respond in JSON," the model emits a string
that looks like JSON, and the client wraps json.loads in
a try/except to recover from missing commas, hallucinated fields,
smart quotes, and trailing prose.
The constraint can be enforced at two points in the generation pipeline. Constrained decoding at the token level intercepts the logits vector at each decode step: a grammar engine, given the prefix generated so far, computes the set of vocabulary tokens whose addition would keep the prefix consistent with at least one accepting path through the grammar, and logits for every other token are masked to negative infinity before the sampler runs.
The dominant open-source libraries are
xgrammar,
outlines, and lm-format-enforcer; vLLM exposes them through the
guided_json, guided_regex, and
guided_grammar request fields. The xgrammar paper reports
per-token mask construction times in the low-microsecond range for
typical JSON Schema workloads, which keeps overhead well below the
per-step decode cost of the model itself.
Retry on violation is the fallback: the model emits free text, the server parses against the schema, and a validation failure triggers a regeneration, optionally with the parser's error message fed back into the prompt as a hint. Integration cost is near zero, but runtime cost grows with the violation rate.
Tool calling is a specific case of this broader machinery. The schema is fixed by the function-call envelope (a function name plus a JSON-Schema-typed arguments object), and the model is constrained to emit either a normal text response or a well-formed tool-call object. General structured output generalises the pattern to arbitrary caller-supplied schemas, and most serving stacks share a single implementation across both features: the tool-call parser is a thin adapter on top of the structured-output engine.
The structural-violation rate drops to zero by construction under constrained decoding; only the semantic error rate (the schema is satisfied but the values are wrong) remains, and that error class is amenable to evaluation in a way that "the response was not parseable" is not.
References
- OpenAI. Chat Completions API reference. The canonical request and response shape for the endpoint that became the industry standard.
- OpenAI.
Function calling guide.
The original protocol description for the
toolsandtool_callsfields and therole: "tool"response message. - OpenAI.
Structured Outputs guide.
Describes
response_formatwith JSON Schema and the constrained-decoding guarantee that the response will validate. - vLLM project.
Tool calling and parser plugins.
The catalogue of per-family tool-call parsers
(
hermes,mistral,llama3_json,qwen3_coder,qwen3_xml) and the--tool-call-parserlaunch flag. - Dong, Ruan, Cai, Lai, Chen. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. arXiv 2411.15100, 2024. The token-mask construction algorithm behind low-microsecond grammar enforcement at decode time.