
Time to first token (TTFT) is the elapsed wall-clock interval between a client submitting an inference request and receiving the first response byte that carries generated content. It is the latency a human perceives in streaming chat, and it bundles queueing, any cold start or wake, prefill compute, and the first decode step.

TTFT, time to first token

Time to first token is the elapsed wall-clock interval between a client submitting an inference request and that client receiving the first response byte carrying generated content.

TTFT is the latency dimension a human reader perceives in interactive workloads, so it is the primary tuning target for chat and agent traffic where the response streams back token by token.
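As a rough illustration of the definition above, the sketch below times the gap between submitting a request and receiving the first chunk that carries generated content. The names `measure_ttft`, `start_stream`, and `fake_stream` are hypothetical and not part of any SDK. The stream is simulated with `time.sleep` so the example runs without a network. In real use, `start_stream` would wrap whatever call opens the streaming response.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_ttft(start_stream: Callable[[], Iterable[str]]) -> Optional[float]:
    """Return seconds from submission to the first non-empty chunk, or None."""
    submitted = time.perf_counter()  # the clock starts when the request is submitted
    for chunk in start_stream():
        if chunk:  # skip chunks that carry no generated content
            return time.perf_counter() - submitted
    return None  # the stream ended without producing any content


def fake_stream() -> Iterator[str]:
    """Simulated stream (an assumption for this sketch, not a real endpoint)."""
    time.sleep(0.05)  # stands in for queueing, any wake, and prefill
    yield ""          # an empty chunk that should not stop the clock
    yield "Hello"
    time.sleep(0.05)  # later tokens do not affect TTFT
    yield " world"


if __name__ == "__main__":
    ttft = measure_ttft(fake_stream)
    print(f"TTFT: {ttft:.3f} s")
```

The measurement is taken on the client side, so it includes network time as well as everything the server does before the first token. That matches the wall-clock definition rather than a server-side timer.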

For the longer treatment, see "How engines work: what an engine is measured on".

Cost and reliability implications

TTFT is where infrastructure economics become visible to the end user: a hibernated fleet trades idle GPU spend for a wake penalty measured in minutes, and a request that arrives cold pays the full restore plus engine start before prefill even begins. Sizing the wake timeout, and deciding what spend level justifies a warm replica, is a budget decision expressed as latency.
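The trade-off can be made concrete with a little arithmetic. The sketch below adds up the stages a request passes through before its first token, checks a wake against a timeout, and expresses warm-replica spend as dollars per avoided cold start. Every name and number here is a made-up assumption for illustration, not a measured or published figure. It also assumes the wake timeout covers only restore plus engine start.

```python
from dataclasses import dataclass


@dataclass
class RequestPath:
    """Hypothetical per-stage latencies, in seconds, for one request."""
    queue_s: float
    restore_s: float       # restoring a hibernated replica; 0.0 when warm
    engine_start_s: float  # engine start after restore; 0.0 when warm
    prefill_s: float
    first_decode_s: float

    def wake_s(self) -> float:
        """Time spent waking before prefill can begin."""
        return self.restore_s + self.engine_start_s

    def ttft_s(self) -> float:
        """Sum of all stages up to and including the first decode step."""
        return self.queue_s + self.wake_s() + self.prefill_s + self.first_decode_s


def wake_fits(path: RequestPath, wake_timeout_s: float) -> bool:
    """True if the wake finishes within the timeout (assumed to cover wake only)."""
    return path.wake_s() <= wake_timeout_s


def idle_cost_per_avoided_cold_start(
    gpu_hourly_usd: float, idle_hours_per_day: float, cold_requests_per_day: float
) -> float:
    """Daily idle spend on a warm replica divided by the cold starts it avoids."""
    if cold_requests_per_day <= 0:
        raise ValueError("no cold requests to avoid; a warm replica buys nothing")
    return gpu_hourly_usd * idle_hours_per_day / cold_requests_per_day


# Illustrative, made-up numbers.
cold = RequestPath(queue_s=0.5, restore_s=90.0, engine_start_s=45.0,
                   prefill_s=0.75, first_decode_s=0.25)
warm = RequestPath(queue_s=0.5, restore_s=0.0, engine_start_s=0.0,
                   prefill_s=0.75, first_decode_s=0.25)

if __name__ == "__main__":
    print(f"cold TTFT: {cold.ttft_s()} s, warm TTFT: {warm.ttft_s()} s")
    print(f"fits 120 s timeout: {wake_fits(cold, 120.0)}")
    print(f"USD per avoided cold start: "
          f"{idle_cost_per_avoided_cold_start(2.0, 20.0, 40.0)}")
```

With these invented inputs the wake dominates cold TTFT, and a 120-second timeout would fail the request before prefill starts. The last figure frames the budget question as a price per request that no longer waits for a wake.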


