
To get machine-parseable output from a chat endpoint, declare the shape server-side: a tools catalogue when calling is conditional, a response schema when every answer must validate. The complete loop is two requests (invoke, then return a tool message). Remaining failure modes are schema design, max_tokens truncation, and family-specific emission quirks, not JSON parsing.

Structured output and tool calling in practice

The reliable way to get machine-parseable output from a chat endpoint is to declare the shape and let the server enforce it: a tools catalogue when the model should decide whether and what to call, schema-constrained output when every response must match one shape. Done right, the remaining failures are schema design and token truncation, not JSON parsing.

Two mechanisms, one decision

Tool calling hands the model a menu and lets it choose: answer in prose, or emit one or more structured invocations for the caller to execute. It fits agents, routers, and any flow where calling is conditional. Structured output removes the choice: the response always validates against a caller-supplied schema. It fits extraction, classification, and every pipeline stage whose consumer is a parser rather than a person. The two share machinery server-side, and the deciding question is simply whether "no call, just text" is a valid outcome. The contract-level reference for both lives at the API page; this guide is the practice layer on top.
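For the structured-output side, a request could be assembled roughly as in the sketch below. It assumes the endpoint accepts the OpenAI-style response_format field with type json_schema; the schema name ticket_triage, its fields, and the build_request helper are illustrative and not taken from the API page.

```python
import json

# Hypothetical schema for a classification stage; field names are illustrative.
TRIAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "question"]},
        "urgent": {"type": "boolean"},
    },
    "required": ["category", "urgent"],
    "additionalProperties": False,
}


def build_request(user_text: str) -> dict:
    """Build chat-completion kwargs that constrain the reply to TRIAGE_SCHEMA.

    Assumes the server honours the OpenAI-style response_format field.
    """
    return {
        "model": "qwen3-coder",
        "messages": [{"role": "user", "content": user_text}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "ticket_triage",
                "strict": True,
                "schema": TRIAGE_SCHEMA,
            },
        },
    }


request = build_request("I was charged twice this month.")
print(json.dumps(request["response_format"], indent=2))
```

With an OpenAI-compatible SDK client, the dictionary would be passed as client.chat.completions.create(**request) and the reply read with json.loads on the message content. No tools catalogue is sent, because "no call, just text" is not a valid outcome here.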

A complete tool-call loop

The protocol is a two-request conversation: the model asks for an invocation, the caller executes it and returns the result as a tool message, and the model answers with grounded data.

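import json

from openai import OpenAI

# The SDK reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment;
# point the base URL at the OpenAI-compatible endpoint.
client = OpenAI()


def lookup_weather(city):
    # Placeholder for the caller's real lookup; the values are made up.
    return {"city": city, "temp_c": 21}


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
            "additionalProperties": False,
        },
    },
}]

messages = [{"role": "user", "content": "How warm is Lisbon right now?"}]

# Request 1: the model decides whether to call the tool.
first = client.chat.completions.create(
    model="qwen3-coder", messages=messages, tools=tools)

# Happy path: assume exactly one tool call came back.
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# Append the assistant turn verbatim, then the tool result paired by id.
messages.append(first.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": json.dumps(lookup_weather(args["city"])),
})

# Request 2: the model answers using the tool result.
final = client.chat.completions.create(
    model="qwen3-coder", messages=messages, tools=tools)
print(final.choices[0].message.content)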

Two production details the happy path hides. First, tool_calls is a list: models may request several invocations in one turn, and each needs its own tool reply matched by tool_call_id. Second, the assistant message containing the invocation must be appended to the history verbatim before the tool result; dropping it breaks the pairing and produces confused second turns.
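A minimal sketch of the first detail, handling every entry in tool_calls rather than only the first. It works on plain dictionaries in the standard envelope shape so it runs without a network call; the run_tool_calls helper, the registry, the call ids, and the weather values are all hypothetical. Returning an error payload for an unknown tool name is one possible choice, not something the contract prescribes.

```python
import json
from typing import Any, Callable


def run_tool_calls(
    tool_calls: list[dict[str, Any]],
    registry: dict[str, Callable[..., Any]],
) -> list[dict[str, str]]:
    """Execute every requested call and return one tool message per call."""
    replies = []
    for call in tool_calls:
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])
        handler = registry.get(name)
        if handler is None:
            # Report the problem to the model instead of dropping the reply,
            # so every tool_call_id still gets a matching tool message.
            result = {"error": f"unknown tool: {name}"}
        else:
            result = handler(**args)
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return replies


# Two invocations requested in a single assistant turn (made-up ids and values).
requested = [
    {"id": "call_a", "type": "function",
     "function": {"name": "get_weather", "arguments": '{"city": "Lisbon"}'}},
    {"id": "call_b", "type": "function",
     "function": {"name": "get_weather", "arguments": '{"city": "Porto"}'}},
]
registry = {"get_weather": lambda city: {"city": city, "temp_c": 21}}

tool_messages = run_tool_calls(requested, registry)
for message in tool_messages:
    print(message["tool_call_id"], message["content"])
```

In the loop above, the returned messages would be appended with messages.extend(tool_messages) directly after the assistant message, which keeps the pairing by tool_call_id intact.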

Schema design that parses on the first try

Most "the model returned bad JSON" reports are schema problems wearing a trench coat. Field-tested rules:

  • Flat beats nested. Every level of nesting multiplies the ways a generation can wander; two levels is a sensible ceiling for tool arguments.
  • Enums for closed sets. A status field typed as a five-value enum cannot drift into synonyms; a free string can and will.
  • Mark everything required and set additionalProperties: false. Optional fields invite omission; open objects invite invention.
  • Short, concrete descriptions. The description strings are prompt text; one sentence stating units, format, and an example outperforms a paragraph.
  • Avoid anyOf tangles. Union types are where both constrained decoders and models go to struggle; if a field can be two shapes, split it into two fields.
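Several of these rules can be checked mechanically before a schema ships. The lint_schema helper below is a hypothetical sketch, not part of any SDK: it flags union types, optional fields, open objects, and nesting beyond a configurable depth. It reads "two levels" as the root object plus one nested object, which is an interpretation. It cannot tell whether a free string should have been an enum; that still needs a human.

```python
from typing import Any


def lint_schema(schema: dict[str, Any], max_depth: int = 2) -> list[str]:
    """Return human-readable warnings for a JSON Schema object definition."""
    warnings: list[str] = []

    def walk(node: dict[str, Any], path: str, depth: int) -> None:
        if "anyOf" in node or "oneOf" in node:
            warnings.append(f"{path}: union type; consider splitting into two fields")
        if node.get("type") == "array" and isinstance(node.get("items"), dict):
            walk(node["items"], f"{path}[]", depth)
        if node.get("type") != "object":
            return
        if depth > max_depth:
            warnings.append(f"{path}: nested deeper than {max_depth} levels")
        props = node.get("properties", {})
        missing = sorted(set(props) - set(node.get("required", [])))
        if missing:
            warnings.append(f"{path}: not required: {', '.join(missing)}")
        if node.get("additionalProperties") is not False:
            warnings.append(f"{path}: additionalProperties is not false")
        for name, child in props.items():
            walk(child, f"{path}.{name}", depth + 1)

    walk(schema, "$", 1)
    return warnings


# The get_weather parameters from the loop above pass cleanly.
strict = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
    "additionalProperties": False,
}

# An illustrative schema that breaks three rules at once.
loose = {
    "type": "object",
    "properties": {
        "status": {"type": "string"},
        "target": {"anyOf": [{"type": "string"}, {"type": "integer"}]},
    },
    "required": ["status"],
}

print(lint_schema(strict))
for warning in lint_schema(loose):
    print(warning)
```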

Truncation: the silent killer

A generation that hits the max_tokens cap stops mid-string and the arguments object simply ends, unparseable, with finish_reason: "length" as the only tell. Check that field before parsing, always. Budget generously for structured payloads and keep them small by design: pass identifiers and references through tool arguments, not document bodies. Decode speed makes the second point concrete: at this fleet's measured 106.6 tokens per second on the dual-A100 tier, a 1,000-token arguments object spends over nine seconds (1,000 / 106.6 ≈ 9.4 s) just being generated, while an ID that references the same content costs about a dozen tokens. Generation-side constraints execute inside the engine's decode loop (the mechanics are at How engines work: vLLM), so schema enforcement adds little latency, but no enforcement mechanism can save an output the token budget cut in half.
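The check-before-parse rule can be wrapped in a small guard. The sketch below is one way to do it; parse_arguments and TruncatedOutput are hypothetical names, and the inputs are the finish_reason and arguments strings read from the first choice of a response.

```python
import json
from typing import Any


class TruncatedOutput(Exception):
    """Raised when generation stopped at the token cap before the JSON closed."""


def parse_arguments(finish_reason: str, arguments: str) -> dict[str, Any]:
    """Parse tool-call arguments only if the generation was not cut off."""
    if finish_reason == "length":
        raise TruncatedOutput(
            f"hit max_tokens after {len(arguments)} characters; "
            "raise the cap or shrink the payload"
        )
    return json.loads(arguments)


# A complete call parses normally.
print(parse_arguments("tool_calls", '{"city": "Lisbon"}'))

# A truncated call is reported as truncation, not as a JSON syntax error.
try:
    parse_arguments("length", '{"city": "Lis')
except TruncatedOutput as exc:
    print("truncated:", exc)
```

Raising a dedicated exception keeps truncation out of the generic JSON-error bucket, so logs show a budget problem rather than a parsing problem.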

Streaming tool calls

Under stream: true, tool calls arrive as deltas: the function name early, then the arguments as string fragments keyed by tool-call index. The fragments are not valid JSON until the last one lands, so the streaming-safe pattern is accumulate, watch finish_reason, then parse once. The general fragment-handling rules (buffer by event, key on index, tolerate empty choices) are in the streaming guide.

When output still fails to parse

Three failure modes cover nearly everything seen in practice:

  • Family mismatch. Models emit tool calls in family-specific wire syntaxes that the server translates to the standard envelope; when that translation misfires, invocations leak into content as tagged text. The fix is server-side configuration, not client parsing; flag it to the provider rather than regexing the leakage.
  • Truncation. Covered above; check finish_reason first.
  • Semantic misses. The JSON validates but the values are wrong (a city where a country was expected). Schema enforcement cannot fix meaning; the remedies are tighter enums, better descriptions, and a validate-and-reprompt loop that feeds the validator's error back as a correction.

Log the violation rate per schema: it is the one metric that tells schema problems apart from model problems, and after a provider migration it is the first number worth re-baselining.
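A validate-and-reprompt loop can be sketched as below. The generate callable stands in for a chat-completion request and validate for the caller's own semantic check; both are hypothetical, as are the function name, the correction wording, and the country list. The loop also counts rejections, which is the raw input for a per-schema violation rate.

```python
import json
from typing import Any, Callable, Optional


def generate_validated(
    generate: Callable[[list[dict[str, str]]], str],
    validate: Callable[[Any], Optional[str]],
    messages: list[dict[str, str]],
    max_attempts: int = 3,
) -> tuple[Any, int]:
    """Return (parsed_output, violations), re-prompting with the error text."""
    history = list(messages)
    violations = 0
    for _ in range(max_attempts):
        raw = generate(history)
        try:
            parsed = json.loads(raw)
            error = validate(parsed)
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc}"
        if error is None:
            return parsed, violations
        violations += 1
        # Feed the rejected output and the validator's message back as a correction.
        history.append({"role": "assistant", "content": raw})
        history.append({
            "role": "user",
            "content": f"That output was rejected: {error}. Return corrected JSON only.",
        })
    raise ValueError(f"no valid output after {max_attempts} attempts")


# Stand-in model: a city where a country was expected, then a corrected answer.
canned = iter(['{"country": "Lisbon"}', '{"country": "Portugal"}'])
KNOWN_COUNTRIES = {"Portugal", "Spain"}


def fake_generate(history: list[dict[str, str]]) -> str:
    return next(canned)


def check_country(value: Any) -> Optional[str]:
    if value.get("country") not in KNOWN_COUNTRIES:
        return f"country must be one of {sorted(KNOWN_COUNTRIES)}"
    return None


result, violations = generate_validated(
    fake_generate, check_country,
    [{"role": "user", "content": "Which country is Lisbon in?"}])
print(result, violations)
```

Summing violations and attempts per schema name gives the violation rate; a cap on attempts keeps a persistently failing schema from looping forever.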

Methodology

The tool-call loop follows the envelope documented in the references, the same contract the production gateway serves. The latency arithmetic for oversized payloads uses the fleet's published 106.6 tokens per second dual-A100 decode rate from the 2026-04-19 run. Schema-design rules are stated as field guidance, not measurements; no violation-rate figure is claimed.


