Structured output and tool calling in practice
The reliable way to get machine-parseable output from a
chat endpoint is to declare the shape and let the server enforce it: a
tools catalogue when the model should decide whether and what
to call, schema-constrained output when every response must match one
shape. Done right, the remaining failures are schema design and token
truncation, not JSON parsing.
Two mechanisms, one decision
Tool calling hands the model a menu and lets it choose: answer in prose, or emit one or more structured invocations for the caller to execute. It fits agents, routers, and any flow where calling is conditional. Structured output removes the choice: the response always validates against a caller-supplied schema. It fits extraction, classification, and every pipeline stage whose consumer is a parser rather than a person. The two share machinery server-side, and the deciding question is simply whether "no call, just text" is a valid outcome. The contract-level reference for both lives at the API page; this guide is the practice layer on top.
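As a rough illustration of that deciding question, the two request shapes can be sketched side by side. The `response_format` layout below follows the common OpenAI-compatible `json_schema` convention, which is an assumption about this endpoint rather than something this guide specifies; the helper names and the ticket schema are hypothetical.

```python
from typing import Any

# Hypothetical schema for an extraction stage whose consumer is a parser.
TICKET_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "normal", "high"]},
        "summary": {"type": "string"},
    },
    "required": ["priority", "summary"],
    "additionalProperties": False,
}


def tool_request(messages: list, tools: list, model: str = "qwen3-coder") -> dict[str, Any]:
    """Tool calling: 'no call, just text' remains a valid outcome."""
    return {"model": model, "messages": messages, "tools": tools}


def structured_request(messages: list, schema: dict, name: str,
                       model: str = "qwen3-coder") -> dict[str, Any]:
    """Structured output: every response must validate against `schema`.

    The response_format shape is assumed, not confirmed for this server.
    """
    return {
        "model": model,
        "messages": messages,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema, "strict": True},
        },
    }
```

Either dictionary would be passed as keyword arguments to the chat completions call; which one to build depends only on whether a prose answer is acceptable.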
A complete tool-call loop
The protocol is a two-request conversation: the model asks for an
invocation, the caller executes it and returns the result as a
tool message, and the model answers with grounded data.
import json

# Assumes `client` is an already-configured OpenAI-compatible client and
# `lookup_weather` is the caller's own function returning JSON-serializable data.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
            "additionalProperties": False,
        },
    },
}]
messages = [{"role": "user", "content": "How warm is Lisbon right now?"}]

# Request 1: the model decides whether to call the tool.
first = client.chat.completions.create(
    model="qwen3-coder", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]  # happy path: exactly one call
args = json.loads(call.function.arguments)

# Append the assistant message verbatim, then the matching tool result.
messages.append(first.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": json.dumps(lookup_weather(args["city"])),
})

# Request 2: the model answers using the tool result.
final = client.chat.completions.create(
    model="qwen3-coder", messages=messages, tools=tools)
print(final.choices[0].message.content)
Two production details the happy path hides. First,
tool_calls is a list: models may request several invocations
in one turn, and each needs its own tool reply matched by
tool_call_id. Second, the assistant message containing the
invocation must be appended to the history verbatim before the tool
result; dropping it breaks the pairing and produces confused second
turns.
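Both details can be handled in one small helper. The following is a minimal sketch that assumes messages are plain dictionaries in the standard envelope; `answer_tool_calls` and the `handlers` registry are hypothetical names, and reporting failures back to the model as an `error` object is one possible policy, not a documented requirement.

```python
import json
from typing import Any, Callable


def answer_tool_calls(assistant_message: dict[str, Any],
                      handlers: dict[str, Callable[..., Any]]) -> list[dict[str, Any]]:
    """Return the history entries for one tool-calling turn.

    The assistant message comes first, unmodified, followed by exactly one
    tool reply per requested invocation, matched by tool_call_id.
    """
    entries: list[dict[str, Any]] = [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        try:
            args = json.loads(call["function"]["arguments"])
            result = handlers[name](**args)
        except Exception as exc:  # unknown tool, bad JSON, or handler failure
            result = {"error": f"{type(exc).__name__}: {exc}"}
        entries.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return entries
```

The caller extends its history with the returned list and issues the second request; every invocation gets a reply even when its handler fails, so the pairing is never broken.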
Schema design that parses on the first try
Most "the model returned bad JSON" reports are schema problems wearing a trench coat. Field-tested rules:
- Flat beats nested. Every level of nesting multiplies the ways a generation can wander; two levels is a sensible ceiling for tool arguments.
- Enums for closed sets. A status field typed as a five-value enum cannot drift into synonyms; a free string can and will.
- Mark everything required and set additionalProperties: false. Optional fields invite omission; open objects invite invention.
- Short, concrete descriptions. The description strings are prompt text; one sentence stating units, format, and an example outperforms a paragraph.
- Avoid anyOf tangles. Union types are where both constrained decoders and models go to struggle; if a field can be two shapes, split it into two fields.
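Several of these rules are mechanical enough to check before a schema ships. The sketch below is a hypothetical linter, not part of any library: `lint_schema` and the two example schemas are illustrative, and it covers only nesting depth, required fields, closed objects, and union keywords. It cannot tell whether a free string should have been an enum.

```python
from typing import Any


def lint_schema(schema: dict[str, Any], max_depth: int = 2,
                _depth: int = 1, _path: str = "$") -> list[str]:
    """Return human-readable rule violations for a JSON Schema fragment."""
    problems: list[str] = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if _depth > max_depth:
            problems.append(f"{_path}: nested deeper than {max_depth} levels")
        if schema.get("additionalProperties") is not False:
            problems.append(f"{_path}: additionalProperties is not false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{_path}: not marked required: {missing}")
        for name, sub in props.items():
            problems += lint_schema(sub, max_depth, _depth + 1, f"{_path}.{name}")
    items = schema.get("items")
    if isinstance(items, dict):  # array elements are checked at the same depth
        problems += lint_schema(items, max_depth, _depth, f"{_path}[]")
    if "anyOf" in schema or "oneOf" in schema:
        problems.append(f"{_path}: union type; consider splitting into two fields")
    return problems


GOOD = {
    "type": "object",
    "properties": {
        "status": {"type": "string",
                   "enum": ["open", "pending", "blocked", "resolved", "closed"],
                   "description": "Ticket state."},
        "amount_eur": {"type": "number",
                       "description": "Amount in euros, e.g. 12.50."},
    },
    "required": ["status", "amount_eur"],
    "additionalProperties": False,
}

BAD = {
    "type": "object",
    "properties": {
        "status": {"type": "string"},
        "target": {"anyOf": [{"type": "string"}, {"type": "integer"}]},
    },
    "required": ["status"],
}
```

Running `lint_schema(BAD)` reports the open object, the optional `target` field, and the union; `lint_schema(GOOD)` returns an empty list.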
Truncation: the silent killer
A generation that hits the max_tokens cap stops mid-string
and the arguments object simply ends, unparseable, with
finish_reason: "length" as the only tell. Check that field
before parsing, always. Budget generously for structured payloads and keep
them small by design: return identifiers and references, not document
bodies, through tool arguments. Decode speed makes the second point
concrete: at this fleet's measured 106.6 tokens per second on the
dual-A100 tier, a 1,000-token arguments object spends over nine seconds
just being generated; the same reference passed as an ID costs a dozen
tokens. Generation-side constraints execute inside the engine's decode
loop (the mechanics are at How engines
work: vLLM), so schema enforcement adds little latency, but no
enforcement mechanism can save an output the token budget cut in
half.
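The check-before-parse rule and the decode-time arithmetic can be sketched in a few lines. This assumes a response choice represented as a plain dictionary in the standard envelope; `TruncatedOutput`, `parse_tool_arguments`, and `decode_seconds` are hypothetical names, and the default rate is simply the 106.6 tokens per second figure quoted above.

```python
import json
from typing import Any


class TruncatedOutput(Exception):
    """Raised when generation stopped at the token cap."""


def parse_tool_arguments(choice: dict[str, Any]) -> dict[str, Any]:
    """Parse the first tool call's arguments, refusing truncated output."""
    if choice.get("finish_reason") == "length":
        # The arguments string ends mid-value; parsing it would only
        # surface as a confusing JSON error further down.
        raise TruncatedOutput("generation hit max_tokens; arguments are incomplete")
    call = choice["message"]["tool_calls"][0]
    return json.loads(call["function"]["arguments"])


def decode_seconds(tokens: int, tokens_per_second: float = 106.6) -> float:
    """Time spent purely generating `tokens` at the given decode rate."""
    return tokens / tokens_per_second
```

At the quoted rate, `decode_seconds(1000)` is about 9.4 seconds, while `decode_seconds(12)` is about 0.11 seconds, which is the gap between passing a document body and passing its identifier.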
Streaming tool calls
Under stream: true, tool-call invocations arrive as
deltas: the function name early, then arguments as string
fragments keyed by tool-call index. The fragments are not JSON until the
last one lands, so the streaming-safe pattern is accumulate, watch
finish_reason, then parse once. The general fragment-handling
rules (buffer by event, key on index, tolerate empty
choices) are in the
streaming guide.
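The accumulate-then-parse pattern might look like the sketch below. It assumes each streamed chunk has already been decoded into a plain dictionary with the usual `choices[0].delta.tool_calls` layout and that the function name arrives whole in a single delta; `accumulate_tool_calls` is a hypothetical helper name.

```python
import json
from typing import Any, Iterable


def accumulate_tool_calls(chunks: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect streamed tool-call fragments and parse them once at the end."""
    calls: dict[int, dict[str, Any]] = {}  # tool-call index -> partial call
    finish_reason = None
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:  # tolerate chunks with empty choices
            continue
        choice = choices[0]
        for delta in (choice.get("delta") or {}).get("tool_calls") or []:
            slot = calls.setdefault(delta["index"],
                                    {"id": None, "name": "", "arguments": ""})
            if delta.get("id"):
                slot["id"] = delta["id"]
            fn = delta.get("function") or {}
            if fn.get("name"):
                slot["name"] = fn["name"]
            # Argument fragments are raw string pieces, not JSON; concatenate only.
            slot["arguments"] += fn.get("arguments") or ""
        finish_reason = choice.get("finish_reason") or finish_reason
    if finish_reason == "length":
        raise ValueError("stream was truncated; arguments are incomplete")
    return [
        {"id": slot["id"], "name": slot["name"],
         "arguments": json.loads(slot["arguments"] or "{}")}
        for _, slot in sorted(calls.items())
    ]
```

Nothing is parsed until the stream has ended and `finish_reason` has been inspected, so a half-delivered arguments string never reaches `json.loads`.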
When output still fails to parse
Three failure modes cover nearly everything seen in practice.
- Family mismatch: models emit tool calls in family-specific wire syntaxes that the server translates to the standard envelope; when that translation misfires, invocations leak into content as tagged text. The fix is server-side configuration, not client parsing; flag it to the provider rather than regexing the leakage.
- Truncation: covered above; check finish_reason first.
- Semantic misses: the JSON validates but the values are wrong (a city where a country was expected). Schema enforcement cannot fix meaning; the remedies are tighter enums, better descriptions, and a validate-and-reprompt loop that feeds the validator's error back as a correction.
Log the violation rate per schema: it is the one metric that tells schema problems apart from model problems, and after a provider migration it is the first number worth re-baselining.
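A validate-and-reprompt loop with per-schema violation counting could be sketched as follows. Everything here is illustrative: `generate` stands in for whatever function sends the history to the model and returns its raw text, `validate` is a caller-supplied semantic check returning an error string or `None`, and the in-memory counters are a placeholder for a real metrics backend.

```python
import json
from collections import Counter
from typing import Any, Callable, Optional

attempts: Counter = Counter()    # generations per schema name
violations: Counter = Counter()  # rejected generations per schema name


def generate_validated(schema_name: str,
                       messages: list[dict[str, str]],
                       generate: Callable[[list[dict[str, str]]], str],
                       validate: Callable[[Any], Optional[str]],
                       max_retries: int = 2) -> Any:
    """Generate, validate, and feed validator errors back as corrections."""
    history = list(messages)
    for _ in range(max_retries + 1):
        attempts[schema_name] += 1
        raw = generate(history)
        try:
            payload = json.loads(raw)
            error = validate(payload)
        except json.JSONDecodeError as exc:
            payload, error = None, f"invalid JSON: {exc}"
        if error is None:
            return payload
        violations[schema_name] += 1
        history += [
            {"role": "assistant", "content": raw},
            {"role": "user",
             "content": f"The previous output was rejected: {error}. "
                        "Return corrected JSON only."},
        ]
    raise ValueError(f"{schema_name}: no valid output after {max_retries + 1} attempts")


def violation_rate(schema_name: str) -> float:
    """Share of generations for this schema that failed validation."""
    total = attempts[schema_name]
    return violations[schema_name] / total if total else 0.0
```

Tracking the rate by schema name rather than globally is the design choice that matters: a single schema with a high rate points at its design, while a rise across all schemas points at the model or the server.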