The win grows with conversation depth
Same prompt. Same K=128 budget. Same hardware (a Pixel-class on-device runtime). Three policies. SlidingWindow forgot subprime mortgages. Plain TemporalKV forgot 2008. The hybrid kept both. Here's the trace, and why.
The first four posts have been about averages. Bulk numbers across hundreds of dialogues. AUC. Mean ΔPPL. Bootstrapped CIs. Those are the right summary statistics, but they don't feel like anything — they don't show you what the policies actually do to the text the model produces.
So here is one dialogue. Picked because it's the cleanest example of a phenomenon that shows up consistently in the bench: at K/T = 1/8, in a multi-turn conversation with a callback question, the three budgeted policies disagree on what to remember, and you can read the disagreement directly off the generated text.
The dialogue
This is from OASST2, deep enough into the conversation that any K=128 policy has been forced to evict roughly 80% of the prefilled cache.
The prompt
Can you analyze the causes and effects of the 2008 financial crisis, including the role of mortgage-backed securities, the subprime mortgage market, and the regulatory environment?
Certainly! The 2008 financial crisis was a severe global economic downturn... [the model explains MBS, subprime, regulatory environment in detail. ~600 tokens.]
Ok, thanks for the answer. Can you please summarize it in a way that can be explained in a conference/talk setup for undergraduates. Please format it in a way that would allow for dramatic pauses and emotional elements.
The interesting word is "it". The model has to resolve what "it" refers to after the budgeted policy has run its eviction. At K=128 the policy has 128 cache slots, and the prefilled context substantially exceeds that, so something gets dropped.
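To make the budget arithmetic concrete, here is a minimal sketch. The 680-token prefill is an assumed figure for illustration (the post only gives "~600 tokens" for the first answer), and `evicted_fraction` is a hypothetical helper, not something from the bench.

```python
def evicted_fraction(prefill_tokens: int, budget: int) -> float:
    """Fraction of prefilled cache entries a budget-K policy must drop."""
    if prefill_tokens <= budget:
        return 0.0  # everything fits, nothing is evicted
    return 1.0 - budget / prefill_tokens

# Assumed prefill: the ~600-token answer plus the two user turns.
print(round(evicted_fraction(680, 128), 2))  # 0.81
# The K/T = 1/8 regime, for comparison.
print(evicted_fraction(1024, 128))           # 0.875
```

Under that assumption the result is consistent with the "roughly 80%" figure quoted for this dialogue.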
What each policy did
Where each piece came from
The cache animation below traces what happens in TKv-H's cache at the moment the model generates its response. The conversation is laid out left-to-right by token position. Each cell is one prefilled token. Eviction has run; the green cells survived and are visible to the model. Watch the three structural retention zones light up.
The mechanism — what we got right and what we got wrong
When we first wrote up this head-to-head we attributed TKv-H's "subprime mortgage market" recovery to the MLP's topic-relevance feature: the row containing "subprime mortgage market" must have scored high cosine similarity to the recent "summarize it" query, so the MLP kept it. Story made sense. We wrote it down.
Then we read our own code and discovered the bench harness was passing nullptr for the embedding function — meaning topic_relevance was identically zero on every entry in that run. The recovery was real. The mechanism we'd attributed it to was mechanically impossible.
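The failure mode is easy to reproduce in miniature. The sketch below is a Python analogue, not the harness code: `topic_relevance`, its silent fallback, and the keyword-count `toy_embed` are all assumptions made for illustration.

```python
import math
from typing import Callable, Optional, Sequence

Embedder = Callable[[str], Sequence[float]]

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def topic_relevance(entry: str, query: str, embed: Optional[Embedder]) -> float:
    # Assumed fallback: with no embedder wired in, the feature silently
    # collapses to 0 for every entry instead of raising an error.
    if embed is None:
        return 0.0
    return cosine(embed(entry), embed(query))

def toy_embed(text: str) -> Sequence[float]:
    # Hypothetical stand-in embedder: counts of a few finance keywords.
    words = text.lower().split()
    return [float(words.count(w)) for w in ("subprime", "mortgage", "crisis")]

entry = "the subprime mortgage market"
query = "summarize the mortgage crisis"
print(topic_relevance(entry, query, None))       # 0.0
print(topic_relevance(entry, query, toy_embed))  # 0.5 (up to float rounding)
```

One cheap guard, if the real harness allows it, would be to check that a feature has nonzero variance across entries before attributing any behavior to it.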
The actual mechanism, the one our weight analysis from post 4 already implied: the MLP's recency × importance AND-gate systematically keeps turn 0 / turn 1 of any K=128 deployment. The gate fires for user-source tokens at moderate recency, which is exactly the slot the early-turn user question lives in. "Subprime mortgage market" got kept because it was in the user's opening question, and the MLP's h2 unit votes to keep user-typed content. Topic relevance had nothing to do with it.
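As a rough picture of how one hidden unit can act as an AND-gate over two features, consider the toy ReLU unit below. The weights, the bias, and the importance values are illustrative assumptions, not the trained parameters from post 4.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def and_gate_unit(recency: float, importance: float) -> float:
    # Illustrative weights (1, 1) and bias (-1), not the trained values:
    # the unit is positive only when recency + importance exceeds 1.
    return relu(recency + importance - 1.0)

# Assumed importance values: 1.0 for user tokens, 0.3 for assistant tokens.
print(and_gate_unit(0.5, 1.0))  # 0.5  user token, moderate recency: fires
print(and_gate_unit(0.5, 0.3))  # 0.0  assistant token, same recency: silent
print(and_gate_unit(0.0, 1.0))  # 0.0  user token with no recency: silent
```

Neither feature alone is enough to activate the unit, which is the sense in which the gate is an AND rather than a sum.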
The qualitative finding stands; the attribution doesn't. The bench output we observed is real, reproducible, and matches the §5 simulation prediction. Our initial account of why it happened was written without checking which code path actually ran. We've replaced the topic-relevance narrative with the structural retention story it should always have been. The full correction is in the dataroom results doc.
Why the win grows with depth
This is the deployment-relevant axis. In a real deployment the cache budget is fixed and the conversation keeps growing; users don't show up with different-sized caches. A policy whose advantage grows with conversation length is the right shipping target.
TemporalKV's recency feature is 0.995^(t − last_attended) — relative decay. Importance is per-source, constant per token. Neither depends on absolute T. As the conversation grows, the MLP's features stay calibrated. Sliding-window-based heuristics don't — "the most-recent 124 tokens" is a fundamentally different content set at depth 0.1 (almost all of the conversation) than at depth 0.9 (the tail of a long thread, with all the structural anchors gone).
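A small sketch of the contrast, using the 0.995 decay and the 124-token window quoted above. The conversation lengths are assumed values chosen for illustration, and `window_coverage` is a hypothetical helper.

```python
def recency(t: int, last_attended: int, decay: float = 0.995) -> float:
    """Relative decay: depends only on the gap t - last_attended."""
    return decay ** (t - last_attended)

def window_coverage(total_tokens: int, window: int = 124) -> float:
    """Share of the conversation a most-recent-`window` policy can still see."""
    return min(1.0, window / total_tokens)

# The same 50-step gap gives the same feature value at any absolute position.
assert recency(200, 150) == recency(5000, 4950)

# The window's share of the conversation shrinks as the conversation grows.
for total in (140, 500, 2000):  # assumed conversation lengths in tokens
    print(total, round(window_coverage(total), 2))
# 140 0.89
# 500 0.25
# 2000 0.06
```

The recency feature means the same thing at every length, while the window covers a shrinking and differently composed slice of the thread.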
The numerical version of this, as P@K at each conversation depth:
| Conversation depth | SlidingWindow | SnapKV | H2O | TemporalKV |
|---|---|---|---|---|
| 0.1 (early) | 85% | 87% | 77% | 94% |
| 0.3 | 54% | 59% | 37% | 75% |
| 0.5 | 55% | 56% | 29% | 74% |
| 0.7 | 51% | 51% | 24% | 71% |
| 0.9 (late) | 52% | 53% | 17% | 70% |
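To read the trend off the table, the sketch below computes TemporalKV's margin over the best baseline in each row. The numbers are copied from the table; the helper name is ours.

```python
# Percentages copied from the table above, keyed by conversation depth.
table = {
    0.1: {"SlidingWindow": 85, "SnapKV": 87, "H2O": 77, "TemporalKV": 94},
    0.3: {"SlidingWindow": 54, "SnapKV": 59, "H2O": 37, "TemporalKV": 75},
    0.5: {"SlidingWindow": 55, "SnapKV": 56, "H2O": 29, "TemporalKV": 74},
    0.7: {"SlidingWindow": 51, "SnapKV": 51, "H2O": 24, "TemporalKV": 71},
    0.9: {"SlidingWindow": 52, "SnapKV": 53, "H2O": 17, "TemporalKV": 70},
}

def margin_over_best_baseline(row: dict) -> int:
    """TemporalKV's lead, in points, over the strongest other policy."""
    baselines = [v for name, v in row.items() if name != "TemporalKV"]
    return row["TemporalKV"] - max(baselines)

margins = {depth: margin_over_best_baseline(row) for depth, row in table.items()}
print(margins)  # {0.1: 7, 0.3: 16, 0.5: 18, 0.7: 20, 0.9: 17}
```

The lead more than doubles between depth 0.1 and every later depth, peaking at 20 points at depth 0.7 and settling at 17 points at depth 0.9.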
The structural-retention point applies here too: what's keeping the late-conversation P@K from collapsing is the MLP's preservation of turn 0/1 user-source tokens, not its semantic discrimination. Which is honestly the better story — it means the deployment design is robust to even mediocre embedding quality on the topic-relevance feature, because the policy isn't really using that feature.
What we'd want to test next
This whole walkthrough is one cell. Replicating it at scale, with quantitative "specific-fact preservation" metrics across hundreds of multi-turn dialogues, is on the to-do list. RULER and LOCA-bench are the obvious targets; both are blocked on GPU access at the time of writing.
For now: the bench data is in the dataroom results doc; 412 runs across the four policies. The "win grows with depth" insight with the full quintile breakdown is in the insights doc.
Five posts in
The thing that surprised me most across this whole project is how much of the "learned" policy's behavior is structural. We trained an MLP on four features expecting it to learn semantic discrimination. It learned a recency × user-source AND-gate. That gate, hooked into a sinks + recency-floor hybrid, is the deployment artifact — 49 parameters of which roughly five matter, hardcoded heuristics doing the rest of the work, and a position-indexed cache primitive making the whole thing fast.
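For readers who want the shape of that hybrid in code, here is a minimal sketch under stated assumptions: the sink count, the recency-floor size, and the function name are hypothetical, and `scores` stands in for whatever the gate outputs per token. It is not the shipped implementation.

```python
from typing import List, Sequence

def hybrid_keep(scores: Sequence[float], budget: int,
                n_sinks: int = 4, recency_floor: int = 32) -> List[int]:
    """Pick which cache positions survive eviction under a fixed budget."""
    n = len(scores)
    if n <= budget:
        return list(range(n))
    # Hardcoded zones: the first few positions (sinks) and the newest tail.
    forced = set(range(min(n_sinks, n))) | set(range(max(0, n - recency_floor), n))
    remaining = budget - len(forced)
    if remaining < 0:
        raise ValueError("budget is smaller than sinks + recency floor")
    # Learned zone: fill the leftover slots with the highest-scoring positions.
    candidates = sorted((i for i in range(n) if i not in forced),
                        key=lambda i: scores[i], reverse=True)
    return sorted(forced | set(candidates[:remaining]))

# Toy run: 20 positions, budget 8, three mid-conversation positions score high.
scores = [0.1] * 20
scores[5], scores[7], scores[9] = 0.9, 0.8, 0.7
print(hybrid_keep(scores, budget=8, n_sinks=2, recency_floor=3))
# [0, 1, 5, 7, 9, 17, 18, 19]
```

The design point this illustrates is that the scored component only competes for the slots the hardcoded zones leave over.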
None of those four pieces, on its own, would have shipped. Together they hit a regime — K/T ≤ 1/8, multi-turn, on-device — that nothing in the literature was quite measuring. That's the paper draft. These five posts are the parts that were hard to fit there but useful to keep around.
Thanks for reading.