spotinference Sign in with GitHub

The context window is the maximum number of tokens a model can attend over in one request: prompt, conversation history, tool results, and the generated answer all share it. It is a hard server-side limit set by the served model; requests that exceed it fail with a validation error rather than degrading silently.

Context window

The window is a property of the served model and its serving configuration, and it is shared: system prompt, conversation history, tool schemas, tool results, and the tokens the model generates all draw from the same budget. A request whose prompt plus max_tokens exceeds the window fails validation at the Chat Completions boundary; nothing is generated and nothing useful is charged.

Practical budgeting runs on two numbers: the window size of the served model, and the token counts the server reports in each response's usage object. Client-side estimates from a different vendor's tokenizer drift from the real count, which is one of the audit items in the migration guide. The standing fix is structural: keep the system prompt lean, reserve max_tokens explicitly, and summarise or truncate history instead of replaying whole transcripts every turn.

The window is also paid for, not just enforced: every token in it is prefilled per request and held in the KV cache for the request's lifetime. The cost mechanics live at How engines work: the KV cache and, in field-note form, in how the KV cache sets latency and cost.

Cost and reliability implications

Long windows are paid for twice: every token in the window is prefilled on each request, growing time to first token, and the KV cache holds the whole window in GPU memory for the request's lifetime, crowding out concurrent work. Replaying entire conversation histories each turn is the most common silent cost multiplier in production integrations.

Part of OpenAI-compatible integration on the learn hub.

See also
References

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.