Demo

KV cache conveyor

Watch a repeated prefix become cached keys and values, then see a later generation continue from that saved state instead of rereading the whole prompt.

1 / prefix

Fresh input enters the model token by token.

The first request still pays to read the shared instructions and policy context. Each token produces internal attention state: the keys and values that later tokens attend to.
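The prefix step can be sketched in a few lines of Python. This is a toy illustration, not a real model: `ToyKVCache`, `toy_project`, and `prefill` are hypothetical names, and the projection is a stand-in that depends only on the token text, whereas in a real transformer a token's keys and values also depend on the tokens before it. The sketch only shows the bookkeeping: every token processed appends one key and one value, and a later turn appends to the saved state instead of starting over.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Holds one (key, value) pair per token that has been processed."""
    tokens: list = field(default_factory=list)
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)


def toy_project(token: str) -> tuple:
    # Hypothetical stand-in for a model's key/value projections.
    h = sum(ord(c) for c in token)
    return (h % 7) / 7.0, (h % 11) / 11.0


def prefill(tokens: list, cache: ToyKVCache) -> int:
    """Process tokens one by one, appending their keys and values to the cache.

    Returns the number of tokens that had to be computed in this call.
    """
    computed = 0
    for tok in tokens:
        k, v = toy_project(tok)
        cache.tokens.append(tok)
        cache.keys.append(k)
        cache.values.append(v)
        computed += 1
    return computed


# The 12 prefix tokens shown on the conveyor.
PREFIX = ["system", "answer", "with", "policy", "refunds", "within",
          "30", "days", "receipt", "manager", "review", "$500+"]

cache = ToyKVCache()
first_pass_work = prefill(PREFIX, cache)      # the first request pays for the whole prefix

new_turn = ["can", "I", "return", "this"]     # hypothetical follow-up question
second_pass_work = prefill(new_turn, cache)   # only the new tokens are computed

print(first_pass_work, second_pass_work, len(cache.keys))
```

After the first pass the cache holds state for all 12 prefix tokens, so the follow-up turn computes only its own tokens while still having the prefix's keys and values available to attend to.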

Prefix conveyor

0 / 12 cached

Prefix tokens: t1 system, t2 answer, t3 with, t4 policy, t5 refunds, t6 within, t7 30, t8 days, t9 receipt, t10 manager, t11 review, t12 $500+

Conveyor stages after the prefix: new turn, generated continuation.

Economics

Cached input is not free, but repeated prefixes stop dominating the bill.

The numbers below are relative units, not a provider price sheet. They show the shape of the trade: first pass still builds the cache, later continuations pay much less for the same prefix.
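One way to picture the shape of this trade is a three-term cost function. The sketch below is an assumption, not the demo's actual formula: the function `relative_cost`, the per-token rates, and the token counts are all hypothetical and are chosen only to make the arithmetic easy to follow.

```python
def relative_cost(fresh_tokens, cached_tokens, output_tokens,
                  fresh_rate=1.0, cached_rate=0.25, output_rate=4.0):
    """Relative cost in arbitrary units.

    All three rates are hypothetical; the only property that matters here is
    that cached input is cheaper than fresh input but not free.
    """
    return (fresh_tokens * fresh_rate
            + cached_tokens * cached_rate
            + output_tokens * output_rate)


# Hypothetical scenario: a long shared prefix, a short question, a short answer.
prefix, question, output = 1000, 50, 100

# Every turn rereads the prefix as fresh input.
fresh_every_turn = relative_cost(prefix + question, 0, output)

# The prefix is read from cache; only the question is fresh input.
from_cache = relative_cost(question, prefix, output)

print(fresh_every_turn, from_cache)
```

With these made-up rates the cached turn is cheaper but not free: the prefix still contributes at the cached rate, and the output costs the same in both cases.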

fresh every turn
input work: 1,924 tokens
relative cost: 2,488u

Prefix, new question, and output are all recomputed.

latency: 688 ms

The long prefix sits on the critical path again.

continue from cache
fresh input work: 74 tokens
relative cost: 916u

The prefix is billed as cached input in this simplified model.

latency: 529 ms

The model still attends to context, but avoids most prefix recompute.
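The figures in the two panels can be compared with a few lines of arithmetic. The sketch below uses only the numbers shown above. It assumes both panels describe the same request, so the difference in fresh input work is the size of the shared prefix.

```python
# Figures copied from the two panels (relative units, not provider prices).
fresh = {"input_tokens": 1924, "relative_cost": 2488, "latency_ms": 688}
cached = {"input_tokens": 74, "relative_cost": 916, "latency_ms": 529}

# Assumption: both panels describe the same request, so the gap in fresh
# input work is the prefix that was served from cache.
prefix_tokens = fresh["input_tokens"] - cached["input_tokens"]

# Percentage saved by continuing from cache, rounded to one decimal place.
cost_saving_pct = round(
    100 * (1 - cached["relative_cost"] / fresh["relative_cost"]), 1)
latency_saving_pct = round(
    100 * (1 - cached["latency_ms"] / fresh["latency_ms"]), 1)

print(prefix_tokens, cost_saving_pct, latency_saving_pct)
```

In this simplified model the relative cost falls by roughly 63% while latency falls by roughly 23%. The smaller latency gain is consistent with the note above: the model still attends to the cached context and still has to generate the output.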