Structured output is a request mode that constrains a model's response to match a formal grammar, typically a caller-supplied JSON Schema, with the serving stack enforcing the constraint during token generation. Enforcement at the decoder drives the structural-violation rate to zero by construction, replacing client-side parsing retries with a single validated response.

Structured output

Structured output is a request shape that constrains a model's response to match a formal grammar, typically a JSON Schema; the serving stack enforces the constraint during generation rather than leaving the client to parse free text and recover from violations.

Server-side enforcement drives the structural-violation rate to zero by construction, replacing client-side try/except parsing and schema-repair retries with a single validated response.

For the longer treatment, see API: structured output.

Cost and reliability implications

Constrained decoding trades a small per-step masking overhead for the elimination of an entire failure class: malformed JSON that breaks downstream parsers. The retry loop it replaces is the real cost line, since every schema-repair round trip re-bills prefill tokens and adds seconds of latency; enforcement at the decoder makes the marginal cost of validity near zero.

Part of OpenAI-compatible integration on the learn hub.

See also

tool-calls

References

Dong et al., XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models (arXiv 2411.15100, 2024). Describes the token-mask construction algorithm that keeps grammar-constrained decoding within the per-step decode budget; the reference implementation underpins vLLM's guided-decoding fast path.
Outlines: structured text generation library. Open-source library that compiles JSON Schema, regex, and EBNF grammars into the per-step logit masks consumed by inference engines including vLLM and TGI.
OpenAI: Structured Outputs guide. Reference documentation for the JSON-Schema-typed response_format API; useful as the de-facto contract that open-source servers target for compatibility.

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.