Demo

KV cache conveyor

Watch a repeated prefix become cached keys and values, then see a later generation continue from that saved state instead of rereading the whole prompt.

1 / prefix

Fresh input enters the model token by token.

The first request still pays to read the shared instructions and policy context. Each token produces keys and values, the internal attention state that the cache will hold.
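As a rough illustration of this first pass, the sketch below walks the twelve prefix tokens from the conveyor and stores one key and one value vector per token. The `embed` and `project` helpers, the hidden size `D`, and the scaling factors are all hypothetical stand-ins for learned model weights; only the token list comes from the demo.

```python
import math

D = 4  # toy hidden size (assumption, real models use thousands)

PREFIX = ["system", "answer", "with", "policy", "refunds", "within",
          "30", "days", "receipt", "manager", "review", "$500+"]


def embed(token: str) -> list[float]:
    """Hypothetical deterministic embedding built from character codes."""
    total = sum(ord(c) for c in token)
    return [math.sin(total * (i + 1)) for i in range(D)]


def project(x: list[float], scale: float) -> list[float]:
    """Stand-in for a learned linear projection (assumption: plain scaling)."""
    return [scale * v for v in x]


def prefill(tokens: list[str]) -> dict[str, list[list[float]]]:
    """Read the prefix token by token, saving a key and a value for each."""
    cache = {"keys": [], "values": []}
    for tok in tokens:
        x = embed(tok)
        cache["keys"].append(project(x, 0.5))    # toy key projection
        cache["values"].append(project(x, 2.0))  # toy value projection
    return cache


cache = prefill(PREFIX)
print(f"{len(cache['keys'])} / {len(PREFIX)} cached")  # prints "12 / 12 cached"
```

The point of the sketch is only that the first request does real work per token; the saving comes later, when that stored state is reused.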

Prefix conveyor

0 / 12 cached

t1 system, t2 answer, t3 with, t4 policy, t5 refunds, t6 within, t7 30, t8 days, t9 receipt, t10 manager, t11 review, t12 $500+

new turn

generated continuation

Economics

Cached input is not free, but repeated prefixes stop dominating the bill.

The numbers below are relative units, not a provider price sheet. They show the shape of the trade: the first pass still builds the cache, and later continuations pay much less for the same prefix.
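The arithmetic behind the two cards in this section can be sketched with a small cost function. The rates here are assumptions chosen because they happen to reproduce the displayed figures (2,488u and roughly 916u); the demo does not state its formula. The 1,850-token prefix is also an inference: 1,924 total input tokens minus the 74 fresh tokens. Amounts are kept in hundredths of a unit so the arithmetic stays exact.

```python
# All rates are assumptions, in hundredths of a relative unit.
FRESH_INPUT_RATE = 100   # 1.00u per uncached input token (assumed)
CACHED_INPUT_RATE = 15   # 0.15u per cached input token (assumed)
OUTPUT_COST = 56_400     # 564u flat for the generated output (assumed)

PREFIX_TOKENS = 1_850    # inferred: 1,924 input tokens minus 74 fresh ones
NEW_TOKENS = 74          # the new question, always read fresh


def turn_cost(prefix_cached: bool) -> int:
    """Cost of one turn in hundredths of a unit."""
    prefix_rate = CACHED_INPUT_RATE if prefix_cached else FRESH_INPUT_RATE
    return (PREFIX_TOKENS * prefix_rate
            + NEW_TOKENS * FRESH_INPUT_RATE
            + OUTPUT_COST)


fresh = turn_cost(prefix_cached=False)
cached = turn_cost(prefix_cached=True)
print(fresh / 100, cached / 100)  # prints "2488.0 915.5"
```

Under these assumed rates the cached prefix still costs something (277.5u here), which matches the claim that cached input is not free; it just stops being the largest term.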

fresh every turn

input work: 1,924 tokens

relative cost

2,488u

Prefix, new question, and output are all recomputed.

latency

688 ms

The long prefix sits on the critical path again.

continue from cache

fresh input work: 74 tokens

relative cost

916u

The prefix is billed as cached input in this simplified model.

latency

529 ms

The model still attends to context, but avoids most prefix recompute.
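To make the last point concrete, here is a minimal single-layer sketch comparing a full recompute with a continuation from cache. `ToyLayer`, `vec`, and the scaling projections are hypothetical; the new token is used directly as its own query, which is a simplification. Both paths attend over all 13 positions, but the cached path projects only the one new token.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


class ToyLayer:
    """Hypothetical layer that counts how many tokens it projects."""

    def __init__(self):
        self.projections = 0

    def kv(self, x):
        self.projections += 1
        return [0.5 * v for v in x], [2.0 * v for v in x]


def vec(i, d=4):
    """Deterministic toy input vector for position i (assumption)."""
    return [math.sin((i + 1) * (j + 1)) for j in range(d)]


prefix = [vec(i) for i in range(12)]
new_token = vec(12)

# Path A: fresh every turn, so prefix and new token are all projected again.
fresh = ToyLayer()
pairs = [fresh.kv(x) for x in prefix + [new_token]]
out_fresh = attend(new_token, [k for k, _ in pairs], [v for _, v in pairs])

# Path B: the prefix was projected on an earlier request and saved.
cached = ToyLayer()
cache = [cached.kv(x) for x in prefix]  # paid once, on the first request
cached.projections = 0                  # count only the later turn
cache.append(cached.kv(new_token))
out_cached = attend(new_token, [k for k, _ in cache], [v for _, v in cache])

print(fresh.projections, cached.projections)  # prints "13 1"
```

The two outputs match, while the attention step itself still touches every cached position. That is why latency in the demo drops without falling to zero.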