Time to first token (TTFT) is the elapsed wall-clock interval between a client submitting an inference request and receiving the first response byte that carries generated content. It is the latency a human perceives in streaming chat, and it bundles queueing, any cold start or wake, prefill compute, and the first decode step.

TTFT, time to first token

Time to first token is the elapsed wall-clock interval between a client submitting an inference request and that client receiving the first response byte carrying generated content.

TTFT is the latency dimension a human reader perceives in interactive workloads, so it is the primary tuning target for chat and agent traffic where the response streams back token by token.
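A minimal sketch of measuring TTFT from the client side, assuming the response arrives as an iterable of text chunks. The names `measure_ttft` and `start_stream` are hypothetical: `start_stream` stands in for whatever streaming call a client SDK exposes. Empty chunks model frames that carry no generated content. Timing at chunk granularity only approximates the "first response byte" in the definition above.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def measure_ttft(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], str]:
    """Return (ttft_seconds, full_text) for one streamed request.

    start_stream is a hypothetical callable that submits the request and
    returns an iterable of text chunks. ttft_seconds is None when no chunk
    carried generated content.
    """
    submitted = clock()  # the client submits the request here
    ttft: Optional[float] = None
    parts: List[str] = []
    for chunk in start_stream():
        if chunk and ttft is None:
            # First chunk that carries generated content.
            ttft = clock() - submitted
        parts.append(chunk)
    return ttft, "".join(parts)


def fake_stream() -> Iterable[str]:
    """Stand-in for a real streaming response (illustrative delays only)."""
    time.sleep(0.05)  # queueing plus prefill, simulated
    yield ""          # a frame with no generated content
    time.sleep(0.05)  # first decode step, simulated
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    ttft, text = measure_ttft(fake_stream)
    print(f"TTFT: {ttft:.3f}s, text: {text!r}")
```

The clock is injectable so the measurement can be tested without real delays. The timer starts before the request is submitted, so queueing and any wake time count toward the result, as the definition requires.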

For the longer treatment, see How engines work: what an engine is measured on.

Cost and reliability implications

TTFT is where infrastructure economics become visible to the end user: a hibernated fleet trades idle GPU spend for a wake penalty measured in minutes, and a request that arrives cold pays the full restore plus engine start before prefill even begins. Sizing the wake timeout, and deciding what spend level justifies a warm replica, is a budget decision expressed as latency.
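The trade-off can be made concrete with a small calculation. This is a sketch under stated assumptions: every number below is hypothetical, and the names `TtftBudget` and `warm_cost_per_cold_start_avoided` are introduced only for illustration. It sums the TTFT components for a warm and a cold request, then divides the idle spend of a warm replica by the number of cold starts it avoids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtftBudget:
    """Components of one request's TTFT in seconds (hypothetical values)."""
    queue_s: float
    restore_s: float       # restore on wake; 0 for a warm replica
    engine_start_s: float  # engine start on wake; 0 for a warm replica
    prefill_s: float
    first_decode_s: float

    def total(self) -> float:
        return (self.queue_s + self.restore_s + self.engine_start_s
                + self.prefill_s + self.first_decode_s)


def warm_cost_per_cold_start_avoided(
    gpu_hourly_usd: float,
    idle_hours_per_day: float,
    cold_starts_per_day: float,
) -> float:
    """Idle spend of a warm replica divided by the cold starts it avoids."""
    if cold_starts_per_day <= 0:
        raise ValueError("cold_starts_per_day must be positive")
    return gpu_hourly_usd * idle_hours_per_day / cold_starts_per_day


if __name__ == "__main__":
    # Hypothetical figures, chosen only to illustrate the arithmetic.
    warm = TtftBudget(0.05, 0.0, 0.0, 0.30, 0.03)
    cold = TtftBudget(0.05, 60.0, 30.0, 0.30, 0.03)
    penalty = cold.total() - warm.total()
    cost = warm_cost_per_cold_start_avoided(2.0, 20.0, 40)
    print(f"wake penalty: {penalty:.1f}s per cold request")
    print(f"warm replica: ${cost:.2f} of idle spend per cold start avoided")
```

Read this way, the decision is whether removing the wake penalty from one request is worth the idle spend attributed to it. Any wake timeout chosen for clients has to exceed the cold total with some margin.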

References

Yu, Jeong, Kim, Kim, Chun. Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022). Introduces iteration-level scheduling, which bounds TTFT inflation under concurrent load by admitting new requests at the next iteration boundary instead of waiting for the current batch to drain.
Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). vLLM's foundational paper; the paged KV cache is what allows continuous batching to keep prefill and decode interleaved without memory fragmentation.
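
The scheduling effect described in the Orca entry can be illustrated with a toy simulation. This is not Orca's or vLLM's implementation: it is a sketch that assumes every request costs one scheduler iteration per output token, that prefill plus the first token fit in the iteration that admits the request, and that memory is never the limit. The function name `simulate_ttft` and the `(arrival_step, output_tokens)` request shape are hypothetical.

```python
from typing import Dict, List, Tuple

Request = Tuple[int, int]  # (arrival_step, output_tokens), output_tokens >= 1


def simulate_ttft(
    requests: List[Request],
    iteration_level: bool,
    max_batch: int = 8,
) -> List[int]:
    """Return each request's TTFT, measured in scheduler iterations."""
    if any(tokens < 1 for _, tokens in requests):
        raise ValueError("every request must generate at least one token")
    # Indices of requests not yet admitted, ordered by arrival.
    pending = sorted(range(len(requests)), key=lambda i: requests[i][0])
    remaining: Dict[int, int] = {}  # admitted request -> tokens still to emit
    ttft = [0] * len(requests)
    step = 0
    while pending or remaining:
        # Request-level batching admits only once the batch has drained;
        # iteration-level scheduling admits at every iteration boundary.
        if iteration_level or not remaining:
            while (pending and len(remaining) < max_batch
                   and requests[pending[0]][0] <= step):
                i = pending.pop(0)
                remaining[i] = requests[i][1]
                # The first token appears at the end of this iteration.
                ttft[i] = step - requests[i][0] + 1
        # Run one iteration: each running request emits one token.
        for i in list(remaining):
            remaining[i] -= 1
            if remaining[i] == 0:
                del remaining[i]
        step += 1
    return ttft


if __name__ == "__main__":
    # A long request at step 0, then a short one arriving at step 2.
    workload = [(0, 10), (2, 5)]
    print("request-level:  ", simulate_ttft(workload, iteration_level=False))
    print("iteration-level:", simulate_ttft(workload, iteration_level=True))
```

In this toy workload the second request waits for the first to finish all ten tokens under request-level batching, while iteration-level scheduling admits it at the next iteration. Real engines add prefill cost and memory limits that this model ignores.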
