info snack
Tokens per second
How quickly a model generates after output begins.
companion card
Tokens per second measures generation speed after output starts; it depends on model size, hardware, batching, and output length.
What it means
Tokens per second measures output generation speed once decoding starts. It is affected by model architecture, parameter count, GPU memory bandwidth, batching, quantization, and serving software.
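As a rough illustration of the definition above, TPS can be computed from the arrival time of each output token. This is a minimal sketch under one common convention, which the card itself does not spell out: the first token is left out of the count because its delay belongs to TTFT. The function name `decode_tps` and the sample timestamps are made up for the example.

```python
def decode_tps(token_times):
    """Return tokens per second over the decode phase, or None.

    token_times: arrival timestamps in seconds, one per output token,
    in order. The first token is excluded from the count because its
    delay is what TTFT measures (an assumed convention).
    """
    if len(token_times) < 2:
        return None  # not enough tokens to measure a rate
    elapsed = token_times[-1] - token_times[0]
    if elapsed <= 0:
        return None  # timestamps too coarse to measure a rate
    return (len(token_times) - 1) / elapsed


# Hypothetical request: first token at 0.40 s, then one token every 0.05 s.
times = [0.40, 0.45, 0.50, 0.55, 0.60]
print(decode_tps(times))  # roughly 20 tokens per second
```

Other conventions exist, such as dividing all output tokens by total request time. That version folds TTFT into the number, so it helps to state which one a dashboard uses.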
Why product teams care
Low tokens per second (TPS) makes long answers feel slow even when time to first token (TTFT) is good. High TPS helps chat, coding, and agent workflows, where users watch the output as it streams.
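A quick back-of-the-envelope calculation shows why. The sketch below assumes a simple model in which total response time is TTFT plus output tokens divided by TPS. The numbers (0.3 s TTFT, 800 output tokens, 20 versus 100 TPS) are illustrative only, not measurements.

```python
def estimated_response_time(ttft_s, output_tokens, tps):
    """Approximate wall-clock seconds until the last token arrives.

    Simplified model: a fixed wait for the first token, then a
    constant generation rate for the rest of the answer.
    """
    return ttft_s + output_tokens / tps


# Same TTFT and answer length, two different generation speeds.
for tps in (20, 100):
    total = estimated_response_time(ttft_s=0.3, output_tokens=800, tps=tps)
    print(f"{tps} TPS: about {total:.1f} s")
# prints:
# 20 TPS: about 40.3 s
# 100 TPS: about 8.3 s
```

In both cases the first token appears after 0.3 s, so a TTFT-only metric would rate the two routes the same.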
Understudy angle
Understudy can compare route candidates on quality, TTFT, TPS, and total cost instead of relying on one latency number.
take this with you
TPS is the throughput side of latency: TTFT covers the wait before output starts, and TPS covers how fast output arrives after that.
Record TTFT and TPS for the same request so you can see where time goes.
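One way to capture both numbers for a single request is to time a streaming response as it is consumed. This is a minimal sketch, not a specific client library's API. It assumes `stream` is any lazy iterable that yields one token per item and that the request is sent when iteration begins. Many real APIs yield multi-token chunks, so the count would need adjusting. `measure_stream` and `fake_stream` are hypothetical names.

```python
import time


def measure_stream(stream, clock=time.perf_counter):
    """Consume a token stream and return TTFT and decode TPS for it."""
    start = clock()
    first = last = None
    count = 0
    for _ in stream:
        now = clock()
        if first is None:
            first = now  # arrival of the first token
        last = now
        count += 1
    if first is None:
        return {"ttft_s": None, "tps": None, "tokens": 0}
    decode_s = last - first
    # Rate over the tokens that followed the first one.
    tps = (count - 1) / decode_s if count > 1 and decode_s > 0 else None
    return {"ttft_s": first - start, "tps": tps, "tokens": count}


def fake_stream(n_tokens, first_delay_s, gap_s):
    """Stand-in for a model response: sleeps, then yields tokens."""
    time.sleep(first_delay_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(gap_s)
        yield f"tok{i}"


print(measure_stream(fake_stream(n_tokens=5, first_delay_s=0.02, gap_s=0.01)))
```

Logging the returned values together, keyed by request, shows whether a slow response came from the wait before output or from slow generation.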