When StreamingLLM beats us — and why
At K/T = 1/16, the dumbest policy in the comparison ('always keep the most recent tokens, no exceptions') beats our learned policy on perplexity by 17 percentage points of ΔPPL. We dug in to figure out where each policy was spending its budget.
You build a learned eviction policy. You sweep budgets. You watch it win on P@K at every budget below K/T = 1/4. You feel good.
Then you measure perplexity. And at the tightest budget — the one you cared about most, K/T = 1/16, the on-device deployment regime — your learned policy loses to StreamingLLM by 17 percentage points of ΔPPL.
StreamingLLM, to be clear, is hardcoded. It is: "keep the four attention sinks at the front, then keep the most-recent K−4 tokens, evict everything in between." There is no learning. There are no features. There isn't really a policy. And on next-token PPL at K=128 in a 2K-token context, it beats us.
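The rule quoted above fits in a few lines. This is a minimal sketch of the sink-plus-recent selection as described in this post, not the reference StreamingLLM implementation; the function name `streaming_llm_keep` and its parameters are made up for illustration.

```python
def streaming_llm_keep(seq_len: int, budget: int, num_sinks: int = 4) -> list[int]:
    """Return the sorted position indices a sink-plus-recent rule keeps.

    Hypothetical helper: keeps the first `num_sinks` positions and the
    most recent `budget - num_sinks` positions, evicting everything between.
    """
    if budget >= seq_len:
        return list(range(seq_len))  # budget covers the context, nothing to evict
    num_sinks = min(num_sinks, budget)
    num_recent = budget - num_sinks
    sinks = list(range(num_sinks))
    recent = list(range(seq_len - num_recent, seq_len))
    return sinks + recent


kept = streaming_llm_keep(seq_len=2048, budget=128)
print(len(kept), kept[:4], kept[4], kept[-1])  # 128 [0, 1, 2, 3] 1924 2047
```

At K=128 in a 2,048-token context, that is 4 sinks plus positions 1924 through 2047.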
| Policy at K=128 in 2K context | PPL | ΔPPL vs full-KV |
|---|---|---|
| Full KV (baseline) | 4.26 | — |
| StreamingLLM | 5.82 | +33.7% |
| TemporalKV | 6.27 | +50.6% |
| SnapKV (proxy) | 44.15 | +1,111% |
| H2O (proxy) | 409.53 | +11,776% |
| KeyDiff | 571.21 | +17,499% |
We needed to know what was going on. So we instrumented both policies to record, for every dialogue and every step, the position indices each one was keeping. Then we asked the simple question: how many of those kept slots are within the last 64 tokens?
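The per-step measurement can be sketched roughly as follows. The helper name `recent_fraction` and the example kept set are assumptions for illustration; the real instrumentation records these indices for every dialogue and step and then aggregates.

```python
def recent_fraction(kept: list[int], seq_len: int, window: int = 64) -> float:
    """Fraction of kept slots whose position falls in the last `window` tokens."""
    if not kept:
        return 0.0
    cutoff = seq_len - window  # first position that counts as "recent"
    in_window = sum(1 for pos in kept if pos >= cutoff)
    return in_window / len(kept)


# Hypothetical kept set: 4 sinks plus the last 124 positions of a 2,048-token context.
kept_example = list(range(4)) + list(range(2048 - 124, 2048))
frac = recent_fraction(kept_example, seq_len=2048)
print(f"{frac:.1%}")  # 50.0%
```

A sink-plus-recent policy at K=128 always contains the full 64-token window, so its fraction is 64/128 regardless of the dialogue.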
The 18-token gap
Here is what the instrumentation showed. Across 10 test dialogues, the fraction of the kept budget that lands in the last 64 positions looks like this:
| Budget | StreamingLLM | TemporalKV | Difference |
|---|---|---|---|
| K=128 | 50.0% | 36.3% | StreamingLLM keeps ~18 more recent tokens |
| K=256 | 25.0% | 23.2% | nearly tied (both saturate the window) |
| K=512 | 12.5% | 12.4% | tied |
| K=1024 | 6.2% | 6.2% | tied |
That 18-token gap at K=128 is the PPL story. Those 18 stolen recent slots carry more next-token-prediction signal than the 18 scattered "smart picks" TemporalKV substitutes for them. At K=256 the budget is large enough for both policies to cover nearly all of the 64-token window (23.2% of 256 is about 59 of the 64 slots for TemporalKV), the gap mostly evaporates, and TemporalKV's smart-elsewhere picks start to pay off. That's the crossover: we measured it, and it's at the expected location.
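To see where "18 tokens" comes from, multiply the difference in recent-window fraction by the budget. The snippet below only redoes that arithmetic on the rounded percentages from the table; the variable names are ours.

```python
# (StreamingLLM fraction, TemporalKV fraction) in the last 64 positions,
# copied from the table above (rounded to one decimal place there).
rows = {
    128: (0.500, 0.363),
    256: (0.250, 0.232),
    512: (0.125, 0.124),
    1024: (0.062, 0.062),
}

gaps = {}
for budget, (stream_frac, temporal_frac) in rows.items():
    # Difference in fraction times budget = difference in kept recent tokens.
    gaps[budget] = (stream_frac - temporal_frac) * budget
    print(f"K={budget}: gap of about {gaps[budget]:.1f} recent tokens")
```

At K=128 this gives about 17.5 tokens, which rounds to the 18 quoted above. At K=256 the gap is already under 5 tokens.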
What PPL is measuring vs what P@K is measuring
This is where you have to stop and think about what your evaluation loss is actually rewarding.
P@K asks: of the K tokens the attention oracle would attend to most at step t, how many did you keep? It's a ranking-agreement score. At K/T = 1/16 in a 2K-token dialogue, the oracle picks 128 tokens across the full 2,048-token context. TemporalKV identifies them better than StreamingLLM does, by 17 points of P@K. That number is real.
PPL asks: given the kept cache, what's the next-token NLL? And next-token prediction is dominated by attention to the immediately preceding tokens. The recent-64 window is where the largest attention mass lives, by a wide margin, for almost any prompt.
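For reference, perplexity is the exponential of the mean per-token NLL. A minimal sketch, assuming the NLLs are in nats (natural log) and averaged uniformly over tokens; how the evaluation here aggregates across dialogues is not specified in this post.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), NLLs assumed in nats."""
    if not token_nlls:
        raise ValueError("need at least one token NLL")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every correct token has PPL of about 4.
uniform_nlls = [math.log(4.0)] * 5
ppl = perplexity(uniform_nlls)
print(round(ppl, 6))
```

Because the NLL at each step depends on what the model can still attend to, evicting a token that carries a lot of attention mass shows up directly in this number.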
So you can construct two policies where one wins on P@K and the other wins on PPL, by varying their budget allocation between "recent window" and "elsewhere":
- StreamingLLM goes all-in on recent. It captures the high-mass region perfectly and loses every token outside it (apart from the four sinks).
- TemporalKV picks tokens by predicted-attention rank. It catches some recent tokens (the soft recency feature does contribute), but trades part of the window for scattered correct picks in the older region.
P@K says: "ranking agreement with the oracle is better for the scattered picks." PPL says: "but the local attention mass you missed was bigger than the mass you found." Both are right. They're scoring different things.
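A toy example makes the disagreement concrete. The attention weights below are invented for illustration, and "retained attention mass" is used as a stand-in for the local-mass effect described above, not as the actual PPL computation.

```python
def precision_at_k(attn: list[float], kept: set[int], k: int) -> float:
    """Share of the oracle's top-k positions (by attention weight) that were kept."""
    top_k = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)[:k]
    return sum(1 for i in top_k if i in kept) / k


def retained_mass(attn: list[float], kept: set[int]) -> float:
    """Share of total attention weight that falls on kept positions."""
    return sum(attn[i] for i in kept) / sum(attn)


# Made-up unnormalized attention over 16 positions; the last 4 are "recent".
attn = [20.0] * 16
attn[15], attn[14], attn[13], attn[12] = 400.0, 50.0, 40.0, 30.0  # recent window
attn[2], attn[5], attn[9] = 100.0, 90.0, 80.0                     # older peaks

recent_only = {12, 13, 14, 15}  # all-in on the recent window
scattered = {2, 5, 9, 14}       # mostly older picks, one recent token

for name, kept_set in [("recent_only", recent_only), ("scattered", scattered)]:
    p = precision_at_k(attn, kept_set, k=4)
    m = retained_mass(attn, kept_set)
    print(f"{name}: P@4={p:.2f}, retained mass={m:.2f}")
```

Here the scattered policy matches three of the oracle's top four positions while the recent-only policy matches one, yet the recent-only policy retains more total attention mass because it holds the single dominant token.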
This is on us as a community. Eviction papers that report only PPL miss the structural-retention story. Papers that report only P@K miss the local-mass story. The honest answer is "report both, plus the fraction of budget that lands in the recent window as a sanity check." We'll be doing that going forward.
The fix is obvious in hindsight
Add a hard recency floor to TemporalKV. Always keep the last 64 tokens. Use the MLP to pick the remaining K−64. You get StreamingLLM's PPL guarantee at extreme tight K and TemporalKV's smart scattering at K ≥ 256.
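In selection terms, the floor looks roughly like this. It is a sketch, not the shipped code: `hybrid_keep` is a hypothetical name, `scores` stands in for the MLP's predicted-attention scores, and whether the real hybrid also pins attention sinks is not covered here.

```python
def hybrid_keep(scores: list[float], budget: int, floor: int = 64) -> list[int]:
    """Keep the last `floor` positions, then fill the rest of the budget by score.

    `scores[i]` is a placeholder for a learned per-position importance score.
    """
    seq_len = len(scores)
    if budget >= seq_len:
        return list(range(seq_len))  # nothing to evict
    floor = min(floor, budget)
    recent = set(range(seq_len - floor, seq_len))          # hard recency floor
    older = [i for i in range(seq_len) if i not in recent]
    older.sort(key=lambda i: scores[i], reverse=True)      # best-scored first
    chosen = older[: budget - floor]                       # remaining K - floor slots
    return sorted(recent | set(chosen))


# Hypothetical scores: two older positions stand out, everything else is flat.
scores = [0.0] * 2048
scores[10], scores[500] = 0.9, 0.8
kept_hybrid = hybrid_keep(scores, budget=128)
print(len(kept_hybrid), 10 in kept_hybrid, 500 in kept_hybrid)  # 128 True True
```

By construction the recent-window fraction can never drop below floor/K, which is the property the soft recency feature could not guarantee.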
We did this. The hybrid policy is called TKv-H. It is dominant across every (dataset, budget) cell we tested: it never loses to either parent on either metric. The numbers for the head-to-head are in the dataroom.
The mechanism story for why this works is in the insights doc; the cross-dataset replication (the pattern holds on UltraChat, OASST2, and MT-Eval) is in the same doc.
What was harder to admit
For a few weeks, the K/T = 1/16 PPL number sat in a separate file labelled "regression to investigate." We had a narrative ("TemporalKV wins below K/T = 1/8") and the number disagreed with it. The temptation was to attribute the loss to noise or to the proxy issue (more on which below), or to scope the headline to K/T = 1/8 and skip the 1/16 column.
The actual answer — "we have a soft recency feature, the problem rewards a hard one, here's the architectural fix" — was better than the headline we'd have kept. It also gave us the hybrid, which is the deployment artifact we now ship.
One caveat on the comparison
H2O and SnapKV in the PPL table above use label-derived attention proxies, not raw attention scores. That makes their +11,776% and +1,111% numbers worse than the real policies would achieve with a proper attention buffer. We flag this honestly: the "TemporalKV vastly better than SnapKV/H2O on PPL" framing has a proxy caveat. The headline of this post (StreamingLLM beats TemporalKV at 1/16) doesn't have that caveat: StreamingLLM is fully evaluated.
The open question on extracting real attention for H2O/SnapKV is flagged in the open questions doc.