info snack
Time to first token
How long the user waits before streaming begins.
companion card
Time to first token is how long the user waits before streaming begins; retrieval, queueing, and model startup all affect it.
What it means
Time to first token (TTFT) measures the delay between the start of a request and the first streamed output token. It includes all work done before generation begins, such as retrieval and queueing.
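One way to take this measurement is sketched below. It is a minimal illustration, not a reference implementation: `measure_stream`, `start_stream`, and `fake_stream` are hypothetical names, and the token source is assumed to be any Python iterable that yields tokens as they arrive. The timer starts before the stream is opened, so pre-generation work counts toward TTFT.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def measure_stream(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, List[str]]:
    """Return (ttft_seconds, total_seconds, tokens) for one streamed request.

    `start_stream` is a hypothetical callable that issues the request and
    returns an iterable of tokens. ttft_seconds is None if no token arrives.
    """
    start = clock()  # request start: taken before any pre-generation work
    ttft: Optional[float] = None
    tokens: List[str] = []
    for token in start_stream():
        if ttft is None:
            ttft = clock() - start  # delay until the first streamed token
        tokens.append(token)
    total = clock() - start  # total response time, reported separately
    return ttft, total, tokens


def fake_stream() -> Iterable[str]:
    """Stand-in for a real streaming client (assumption, for illustration)."""
    time.sleep(0.05)  # simulates retrieval, queueing, and model startup
    for token in ["Hello", ",", " world"]:
        yield token
        time.sleep(0.01)  # simulates per-token generation time


if __name__ == "__main__":
    ttft, total, tokens = measure_stream(fake_stream)
    print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s, tokens: {len(tokens)}")
```

Keeping the two numbers apart matters because a long answer can have a large total time and still feel responsive if the first token arrives quickly.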
Why product teams care
Retrieval, queueing, cold starts, long prompts, model size, and safety checks can all increase TTFT. Streaming hides some latency, but not this first wait.
Understudy angle
Understudy can track TTFT by workload so routing choices account for perceived responsiveness, not only final quality.
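Per-workload tracking could look roughly like the sketch below. This is not Understudy's API: `TTFTTracker`, `record`, and `summary` are hypothetical names, the workload labels are made up, and the p95 uses a simple nearest-rank rule chosen for brevity.

```python
import math
from collections import defaultdict
from statistics import median
from typing import Dict, List


class TTFTTracker:
    """Hypothetical helper that groups TTFT samples by workload label."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def record(self, workload: str, ttft_seconds: float) -> None:
        """Store one TTFT sample (in seconds) under a workload label."""
        self._samples[workload].append(ttft_seconds)

    def summary(self) -> Dict[str, Dict[str, float]]:
        """Return count, median (p50), and nearest-rank p95 per workload."""
        result: Dict[str, Dict[str, float]] = {}
        for workload, samples in self._samples.items():
            ordered = sorted(samples)
            # Nearest-rank percentile: smallest sample with at least 95%
            # of the samples at or below it.
            p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
            result[workload] = {
                "count": len(ordered),
                "p50": median(ordered),
                "p95": ordered[p95_index],
            }
        return result


if __name__ == "__main__":
    tracker = TTFTTracker()
    for sample in (0.2, 0.4, 0.3):
        tracker.record("chat", sample)  # illustrative workload label
    for sample in (1.0, 3.0):
        tracker.record("agent", sample)  # illustrative workload label
    for workload, stats in tracker.summary().items():
        print(workload, stats)
```

Reporting a tail percentile alongside the median is a design choice: routing decisions based only on typical TTFT can miss the slow requests that users notice most.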
take this with you
TTFT largely determines whether a system feels responsive before the final answer is ready.
Measure TTFT separately from total response time for one agent workflow.