spotinference

Sign in with GitHub

The OpenAI Chat Completions API became the de facto standard for LLM inference in 2023 and stayed there; every modern serving stack speaks it, and most extensions are still backward-compatible with the original 2023 shape.

The Chat Completions shape

OpenAI's Chat Completions endpoint, documented at developers.openai.com/api/reference/resources/chat, defines a small, regular request body. A model field names the target weights. A messages array carries the conversation as a sequence of turns, each with a role (system, user, assistant, or tool) and a content string. A handful of decoding-control fields ride alongside: temperature for sampling sharpness, top_p for nucleus sampling, max_tokens for the response cap, stop for early termination strings, and stream for incremental delivery.

The non-streaming response is a JSON object with a choices array. The generated text lives at choices[0].message.content; the role is fixed to assistant, and a finish_reason field reports whether the generation stopped on an end-of-sequence token, the max_tokens cap, a stop string, or a tool-call invocation. A usage object reports prompt_tokens, completion_tokens, and total_tokens for accounting.

The shape is small enough to implement in a weekend, which is a large part of why it spread. Path dependence and SDK gravity did the rest. By 2026, shipping a non-standard inference API has a real cost: a new vendor has to ship its own SDK, document its own quirks, and explain why the OpenAI SDK cannot just point at a different base_url. Open-source serving engines (vLLM, TGI, llama.cpp, Ollama, TensorRT-LLM) all expose at least a compatible subset, and most commercial competitors do the same.

Streaming and Server-Sent Events

Setting stream: true in the request body switches the response from a single JSON document to a Server-Sent Events stream. The HTTP response carries Content-Type: text/event-stream; each event is a line prefixed data: followed by a chat-completion-chunk JSON object, with a blank line terminating the event. Each chunk carries a delta field containing the new content fragment, and a client reconstructs the full response by concatenating deltas as they arrive. A final data: [DONE] sentinel signals the end of the stream.

One practical gotcha: SSE responses are normally buffered by reverse proxies and CDNs. Production deployments need an explicit no-buffer header (Nginx X-Accel-Buffering: no) and a server that flushes after every chunk write. Without those, tokens arrive in clumps and the latency advantage of streaming disappears.

Tool calls (function calling)

Tool calling, sometimes called function calling, is a protocol convention layered on top of a chat-completion request. The caller advertises a catalogue of functions the model is permitted to invoke; the model chooses between answering in natural language or returning one or more structured invocations that name a function and supply an arguments object conforming to that function's declared JSON Schema.

The serving layer delivers those invocations to the caller, which executes the functions, captures the results, and feeds them back into a follow-up turn so the model can continue the conversation with grounded data. The model itself does not run the functions; the protocol is a contract for structured delegation.

The request carries an optional tools field. Each element describes one callable function with a name, a human-readable description, and a parameters object that is a full JSON Schema document specifying the argument payload. An accompanying tool_choice field controls whether a tool call is allowed, forced, or restricted to a specific named tool.

When the model elects to invoke a function, the assistant message in the response carries a tool_calls array. Each entry has a unique id, a type of function, and a function object containing the chosen name and an arguments string. To return a result, the next request appends a message with role: "tool", a tool_call_id matching the assistant's invocation id, and a content field carrying the result text.

OpenAI shipped the original format in mid-2023 as a single function_call field on the assistant message. A 2024 revision generalised the shape to a list of tool_calls so one response could request multiple parallel invocations, and introduced the symmetric role: "tool" message for returning results. The 2024 shape is what stabilised across the ecosystem. vLLM, Hugging Face TGI, and llama.cpp's server mode all expose the OpenAI-compatible tools and tool_calls fields on their Chat Completions endpoints.

Different model families emit tool calls in different on-the-wire syntaxes. Some families wrap invocations in custom XML tags (for example <tool_call> with a JSON body inside). Hermes-style fine-tunes use a tagged JSON convention; Mistral's instruction-tuned models use a bracketed token sequence; the Llama 3 instruct line emits a JSON object with a fixed top-level shape; Qwen3 coder variants split between an XML form and a JSON form depending on the sub-family.

The serving engine takes raw model output, detects which syntax was emitted, and translates it into the canonical OpenAI tool_calls shape via a pluggable component known as the tool-call parser. vLLM and TGI ship a small library of parsers named for the family each understands: hermes, mistral, llama3_json, qwen3_coder, qwen3_xml, and others.

Parser selection is consequential: a Hermes-trained model paired with a Mistral parser produces invocations the parser cannot recognise, and the raw text leaks through as ordinary prose or, worse, emits malformed tool_calls objects with truncated arguments.

Structured output

Structured output is the generalisation of tool calling. The caller supplies, alongside the prompt, a schema that the response must satisfy. The schema can be a JSON Schema document, a regular expression, a context-free grammar in EBNF or Lark, or a Pydantic-style class definition compiled down to one of the above. The server returns a string that, when parsed, is guaranteed to validate against the supplied schema. The guarantee is provided by the server, not the client.

The contrast is with the unconstrained mode in which the prompt asks the model to "respond in JSON," the model emits a string that looks like JSON, and the client wraps json.loads in a try/except to recover from missing commas, hallucinated fields, smart quotes, and trailing prose.

The constraint can be enforced at two points in the generation pipeline. Constrained decoding at the token level intercepts the logits vector at each decode step: a grammar engine, given the prefix generated so far, computes the set of vocabulary tokens whose addition would keep the prefix consistent with at least one accepting path through the grammar, and logits for every other token are masked to negative infinity before the sampler runs.

The dominant open-source libraries are xgrammar, outlines, and lm-format-enforcer; vLLM exposes them through the guided_json, guided_regex, and guided_grammar request fields. The xgrammar paper reports per-token mask construction times in the low-microsecond range for typical JSON Schema workloads, which keeps overhead well below the per-step decode cost of the model itself.

Retry on violation is the fallback: the model emits free text, the server parses against the schema, and a validation failure triggers a regeneration, optionally with the parser's error message fed back into the prompt as a hint. Integration cost is near zero, but runtime cost grows with the violation rate.

Tool calling is a specific case of this broader machinery. The schema is fixed by the function-call envelope (a function name plus a JSON-Schema-typed arguments object), and the model is constrained to emit either a normal text response or a well-formed tool-call object. General structured output generalises the pattern to arbitrary caller-supplied schemas, and most serving stacks share a single implementation across both features: the tool-call parser is a thin adapter on top of the structured-output engine.

The structural-violation rate drops to zero by construction under constrained decoding; only the semantic error rate (the schema is satisfied but the values are wrong) remains, and that error class is amenable to evaluation in a way that "the response was not parseable" is not.

References