info snack

Tokens per second

How quickly a model generates tokens once output begins.

companion card

Tokens per second measures generation speed after output starts; it depends on model size, hardware, batching, and output length.

What it means

Tokens per second (TPS) measures how fast output tokens are generated once decoding starts. It is affected by model architecture, parameter count, GPU memory bandwidth, batching, quantization, and serving software.
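A minimal sketch of computing TPS from per-token arrival timestamps, for example ones pulled from request logs. It assumes one timestamp per output token and treats the first token as the start of decoding, so the wait for that token (time to first token, TTFT) is excluded. Other tools may divide all output tokens by total time instead, which gives a different number. `decode_tps` is a hypothetical helper name.

```python
def decode_tps(token_times: list[float]) -> float:
    """Output tokens per second over the decode phase.

    token_times: arrival time in seconds of each output token, in order.
    The first token marks the start of decoding, so the wait before it
    is excluded: count the remaining tokens over the time since the
    first one arrived.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    elapsed = token_times[-1] - token_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return (len(token_times) - 1) / elapsed


# Five tokens arriving 0.5 s apart.
print(decode_tps([2.0, 2.5, 3.0, 3.5, 4.0]))  # 2.0
```

Dividing all five tokens by the same 2 s would report 2.5 TPS, so check which convention a tool uses before comparing its numbers with another's.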

Why product teams care

Low TPS makes long answers feel slow even if time to first token (TTFT) is good. High TPS helps chat, coding, and agent workflows where users watch the output stream in.
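The arithmetic behind this: a streamed answer takes roughly TTFT plus output length divided by TPS. This sketch assumes a constant decode rate (real rates drift with batch load and context length), and the 0.5 s TTFT and 500-token answer are illustrative numbers, not measurements.

```python
def total_response_time(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Rough wall-clock time for a streamed answer.

    Assumes a constant decode rate. Counting every output token at `tps`
    is a small overestimate, since the first token arrives within TTFT.
    """
    return ttft_s + output_tokens / tps


for tps in (25.0, 100.0):
    seconds = total_response_time(ttft_s=0.5, output_tokens=500, tps=tps)
    print(f"{tps:.0f} TPS -> {seconds:.1f} s")
# 25 TPS -> 20.5 s
# 100 TPS -> 5.5 s
```

With the same TTFT, the 25 TPS route takes almost four times as long to finish the answer.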

Understudy angle

Understudy can compare route candidates on quality, TTFT, TPS, and total cost instead of relying on one latency number.
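To illustrate why a single latency number can mislead, here is a sketch that filters candidate routes on a quality floor and then ranks them by estimated total time, breaking ties on cost. The `Candidate` fields, route names, and numbers are hypothetical and are not Understudy's actual data model or selection logic; cost here covers output tokens only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical fields for illustration, not a real Understudy schema.
    name: str
    quality: float          # e.g. eval pass rate in [0, 1]
    ttft_s: float           # time to first token, seconds
    tps: float              # decode tokens per second
    usd_per_1k_out: float   # price per 1,000 output tokens


def est_seconds(c: Candidate, output_tokens: int) -> float:
    # Assumes a constant decode rate after the first token.
    return c.ttft_s + output_tokens / c.tps


def est_cost(c: Candidate, output_tokens: int) -> float:
    return c.usd_per_1k_out * output_tokens / 1000


def rank(candidates: list[Candidate], output_tokens: int, min_quality: float) -> list[Candidate]:
    # Drop routes below the quality floor, then prefer faster, then cheaper.
    eligible = [c for c in candidates if c.quality >= min_quality]
    return sorted(
        eligible,
        key=lambda c: (est_seconds(c, output_tokens), est_cost(c, output_tokens)),
    )


candidates = [
    Candidate("fast-start", quality=0.82, ttft_s=0.2, tps=30.0, usd_per_1k_out=0.6),
    Candidate("fast-stream", quality=0.84, ttft_s=0.6, tps=120.0, usd_per_1k_out=0.8),
    Candidate("cheap-small", quality=0.70, ttft_s=0.1, tps=200.0, usd_per_1k_out=0.1),
]

by_ttft = sorted((c for c in candidates if c.quality >= 0.8), key=lambda c: c.ttft_s)
print([c.name for c in by_ttft])  # ['fast-start', 'fast-stream']
print([c.name for c in rank(candidates, output_tokens=600, min_quality=0.8)])  # ['fast-stream', 'fast-start']
```

Ranked by TTFT alone, `fast-start` looks better; for a 600-token answer, `fast-stream` finishes about 15 seconds sooner.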

take this with you

TPS is the throughput side of latency: TTFT covers the wait before output appears, and TPS governs how long the rest of the answer takes.

Record TTFT and TPS for the same request so you can see where time goes.
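A minimal sketch of recording both numbers from one streamed request. `timed_request` and its arguments are hypothetical: `make_stream` stands in for whatever call sends the request and returns an iterator of tokens, and `clock` can be swapped for a fake clock in tests. It assumes each yielded item is one token; many streaming APIs send chunks that hold several tokens, in which case the count should come from the API's usage data or a tokenizer.

```python
import time
from typing import Callable, Iterable


def timed_request(
    make_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Send one request and return its TTFT, decode TPS, and total time."""
    t_request = clock()                  # taken just before the request is sent
    first = last = None
    count = 0
    for _token in make_stream():         # assumes one item per token
        now = clock()
        if first is None:
            first = now                  # first output token arrived
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_s = last - first
    # Tokens after the first, over the time since the first arrived.
    tps = (count - 1) / decode_s if count > 1 and decode_s > 0 else float("nan")
    return {
        "ttft_s": first - t_request,
        "tps": tps,
        "output_tokens": count,
        "total_s": last - t_request,
    }


# Usage with a fake clock so the result is predictable.
ticks = iter([0.0, 0.5, 0.75, 1.0, 1.25, 1.5])
stats = timed_request(lambda: iter(["The", " answer", " is", " 42", "."]), clock=lambda: next(ticks))
print(stats)  # {'ttft_s': 0.5, 'tps': 4.0, 'output_tokens': 5, 'total_s': 1.5}
```

Because both numbers come from the same request, a slow answer can be attributed to a long wait for the first token, a slow decode, or both.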