Context window
The window is a property of the served model and its serving
configuration, and it is shared: system prompt, conversation history,
tool schemas, tool results, and the tokens the model generates all draw
from the same budget. A request whose prompt token count plus
max_tokens exceeds the window fails validation at the Chat Completions
boundary: nothing is generated, and no meaningful usage is charged.
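A minimal sketch of that validation rule, run client-side before sending. The function names and the 8192-token window are illustrative assumptions, not part of any server API, and the prompt count is assumed to come from a tokenizer that matches the served model or from a previous response's reported usage.

```python
def fits_window(prompt_tokens: int, max_tokens: int, window: int) -> bool:
    """True when prompt plus reserved output does not exceed the window."""
    return prompt_tokens + max_tokens <= window


def completion_budget(prompt_tokens: int, window: int) -> int:
    """Largest max_tokens that still fits; 0 if the prompt alone overflows."""
    return max(0, window - prompt_tokens)


WINDOW = 8192  # assumed window size, for illustration only

if not fits_window(prompt_tokens=7900, max_tokens=512, window=WINDOW):
    # Shrink the reservation rather than letting the request fail validation.
    max_tokens = completion_budget(7900, WINDOW)
```

Clamping max_tokens is only one option; when the remaining budget is too small for a useful answer, trimming the prompt is the better fix.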
Practical budgeting runs on two numbers: the window size of the served
model, and the token counts the server reports in each response's
usage object. Client-side estimates from a different vendor's tokenizer
drift from the server's count; that drift is one of the audit items in
the migration guide. The standing fix is structural: keep the system
prompt lean,
reserve max_tokens explicitly, and summarise or truncate
history instead of replaying whole transcripts every turn.
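One way the truncation step might look, as a sketch under stated assumptions: each message carries a hypothetical tokens field that the client fills in from server-reported counts, the usage dict uses the prompt_tokens and completion_tokens keys of the Chat Completions usage object, and trim_history and remaining_after are illustrative names rather than library functions.

```python
from typing import Dict, List

# Assumed shape: {"role": str, "content": str, "tokens": int}
Message = Dict[str, object]


def remaining_after(usage: Dict[str, int], window: int) -> int:
    """Tokens left in the window after a request, per server-reported usage."""
    return window - usage["prompt_tokens"] - usage["completion_tokens"]


def trim_history(messages: List[Message], window: int, max_tokens: int) -> List[Message]:
    """Keep system messages, then the newest turns that fit beside max_tokens."""
    budget = window - max_tokens  # what the prompt may use
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(int(m["tokens"]) for m in system)
    kept: List[Message] = []
    # Walk newest to oldest so the most recent turns survive.
    for m in reversed(turns):
        cost = int(m["tokens"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

This drops old turns outright. Replacing them with a short summary message follows the same budget arithmetic, with the summary's token count taking the place of the dropped turns.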
The window is also paid for, not just enforced: every token in it is
prefilled per request and held in the KV cache for the request's
lifetime. The cost mechanics are covered in "How engines work: the KV
cache" and, in field-note form, in "how the KV cache sets latency and
cost".