Brandon Miller

The first four posts have been about averages. Bulk numbers across hundreds of dialogues. AUC. Mean ΔPPL. Bootstrapped CIs. Those are the right summary statistics, but they don't feel like anything — they don't show you what the policies actually do to the text the model produces.

So here is one dialogue. Picked because it's the cleanest example of a phenomenon that shows up consistently in the bench: at K/T = 1/8, in a multi-turn conversation with a callback question, the three budgeted policies disagree on what to remember, and you can read the disagreement directly off the generated text.

The dialogue

This is from OASST2, deep enough into the conversation that any K=128 policy has been forced to evict roughly 80% of the prefilled cache.

Figure 1. Non-trivial P@K at K=128 in 2K context, binned by conversation depth (0.0 = early, 1.0 = late). Every policy starts strong while the cache hasn't filled. Once the cache saturates and eviction starts firing in earnest, the gap opens. By depth 0.9: TemporalKV leads SnapKV by +17pts, H2O by +53pts. The chartreuse line stays nearly flat — the policy's relative-step features stay calibrated as the conversation grows.

The prompt

Turn 0 · user

Can you analyze the causes and effects of the 2008 financial crisis, including the role of mortgage-backed securities, the subprime mortgage market, and the regulatory environment?

Turn 1 · assistant

Certainly! The 2008 financial crisis was a severe global economic downturn... [the model explains MBS, subprime, regulatory environment in detail. ~600 tokens.]

Turn 2 · user (the callback)

Ok, thanks for the answer. Can you please summarize it in a way that can be explained in a conference/talk setup for undergraduates. Please format it in a way that would allow for dramatic pauses and emotional elements.

The interesting word is "it". To resolve it, the model needs the referent to survive the policy's eviction. At K=128, the policy has 128 cache slots and a prefilled context that substantially exceeds that, so something gets dropped.

What each policy did

SlidingWindow · K=128 · sinks + last 124 tokens

"Certainly, here's a summary in a conference/talk format for undergraduates: [Audience silence] Speaker: 'Good afternoon, everyone. Today, I want to talk about a topic that has been on the minds of many of you: the financial crisis of 2008. This was a time of great turmoil and uncertainty...'"

✓ Topic anchor recovered. The phrase "financial crisis of 2008" sits in the recency window from turn 2's mention of "it" — SlidingWindow keeps it.
× Specific mechanism lost. "Subprime", "mortgage-backed securities", and "regulatory environment" lived in the middle of turn 0 / turn 1. FIFO evicted them.
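
The "sinks + last 124 tokens" budget can be sketched as a plain index selection. This is a minimal illustration, not the bench's implementation: the function name `sliding_window_keep` is hypothetical, the 4 sink slots are inferred from K=128 minus the 124-token window, and the 700-token context length is made up for the example.

```python
def sliding_window_keep(num_tokens: int, budget: int = 128, num_sinks: int = 4) -> list[int]:
    """Return the token positions a sinks + recency-window policy would keep.

    Hypothetical sketch: num_sinks=4 is inferred from 128 - 124, not taken
    from the bench code.
    """
    if num_tokens <= budget:
        # Cache has not filled yet, so nothing is evicted.
        return list(range(num_tokens))
    window = budget - num_sinks
    sinks = list(range(num_sinks))                         # first few positions
    tail = list(range(num_tokens - window, num_tokens))    # most recent positions
    return sinks + tail


# Assumed context length of 700 tokens, purely for illustration.
kept = sliding_window_keep(num_tokens=700)
middle_survivors = [i for i in kept if 4 <= i < 700 - 124]
print(len(kept), middle_survivors)
```

Everything between the sinks and the tail is gone regardless of content, which is why tokens from the middle of the conversation cannot survive under this policy.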

Plain TemporalKV · K=128 · no recency floor

"Certainly! Here's a script that could be used in a conference or talk setup for undergraduates: [The stage is set with a dramatic pause. The lights dim, and the stage is illuminated by a spotlight. The audience is quiet...]"

× Topic lost entirely. No mention of 2008. No mention of the crisis. No mention of finance at all. The MLP picked rows that matched the "conference/talk/dramatic" tokens from turn 2 but lost the anchor for what the talk is supposed to be about.

TemporalKV-H (hybrid) · K=128 · sinks + recency floor + MLP middle picks

"Certainly! Here's a way to summarize the causes and effects of the 2008 financial crisis in a way that would be appropriate for a conference or talk setup for undergraduates: 'Imagine a world where the foundations of trust and stability in the financial system were crumbling. It was 2008, and the world was facing a crisis that would change everything. The causes of this crisis were complex and multifaceted, but they all came down to one thing: the subprime mortgage market.'"

✓ Both topic anchor and specific mechanism. The recency floor keeps "2008" alive in the kept window. The MLP's recency × importance ReLU systematically retains user-source tokens from turn 0 — which is where "subprime mortgage market" was first introduced.
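
The three-part budget (sinks, recency floor, scored middle picks) can be sketched as follows. This is an illustration under stated assumptions, not the shipped policy: `hybrid_keep` is a hypothetical name, the sink count and recency-floor size are placeholder values the post does not specify, and `scores` stands in for whatever the MLP outputs per token.

```python
def hybrid_keep(scores: list[float], budget: int = 128,
                num_sinks: int = 4, recency_floor: int = 32) -> list[int]:
    """Pick positions to keep: sinks + recency floor + top-scored middle tokens.

    num_sinks and recency_floor are assumed values for illustration only.
    """
    if num_sinks + recency_floor > budget:
        raise ValueError("reserved zones exceed the cache budget")
    n = len(scores)
    if n <= budget:
        return list(range(n))
    # Zones that are kept unconditionally.
    reserved = set(range(num_sinks)) | set(range(n - recency_floor, n))
    middle_budget = budget - len(reserved)
    middle = [i for i in range(n) if i not in reserved]
    # Highest score first; ties broken toward earlier positions.
    top = sorted(middle, key=lambda i: (-scores[i], i))[:middle_budget]
    return sorted(reserved | set(top))


# Toy scores: one early token (position 20) gets a high score, the rest are 0.
scores = [0.0] * 700
scores[20] = 1.0
kept = hybrid_keep(scores)
print(len(kept), 20 in kept, 699 in kept)
```

The point of the sketch is only the layout: a token far from both ends of the context survives if, and only if, its score earns one of the middle slots.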

Figure 2. Three policies on the same prompt, at the same K=128 budget, on the same model (Qwen2.5-1.5B-Instruct Q4_K_M on llama.cpp). The text is verbatim from the bench output, not edited. Highlighted phrases are tokens that the policy successfully preserved across eviction.

Where each piece came from

The cache animation traces what happens in the cache of TemporalKV-H (TKv-H for short) at the moment the model generates its response. The conversation is laid out left-to-right by token position. Each cell is one prefilled token. Eviction has run; the green cells survived and are visible to the model. Watch the three structural retention zones light up.

Legend: kept · evicted · contains "2008" · contains "subprime"

Figure 3. Cache occupancy at the moment of decoding turn 3. SlidingWindow's contiguous tail captures the recent callback but loses both anchors. Plain TemporalKV scatters its picks by score and happens to miss both. TKv-H's three retention zones — sinks at the front, recency floor at the end, MLP-rescued middle tokens — cover all of: (a) the "2008" mention in turn 0, (b) the "subprime mortgage market" mention in turn 1, (c) the "summarize it" callback in turn 2.

The mechanism — what we got right and what we got wrong

When we first wrote up this head-to-head we attributed TKv-H's "subprime mortgage market" recovery to the MLP's topic-relevance feature: the row containing "subprime mortgage market" must have scored high cosine similarity to the recent "summarize it" query, so the MLP kept it. Story made sense. We wrote it down.

Then we read our own code and discovered the bench harness was passing nullptr for the embedding function — meaning topic_relevance was identically zero on every entry in that run. The recovery was real. The mechanism we'd attributed it to was mechanically impossible.
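
A Python analogue of that failure mode, as a hedged sketch: the real harness is not shown in this post, so every name here (`topic_relevance`, `toy_embed`, `feature_is_dead`) is hypothetical, and the fallback-to-zero behaviour is assumed from the description above. The toy embedder is a bag-of-letters stand-in, not a real embedding model.

```python
import math
from collections import Counter
from typing import Callable, Optional, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_relevance(entry: str, query: str, embed: Optional[Embedder]) -> float:
    # Assumed behaviour: with no embedder, the feature silently becomes 0.0.
    if embed is None:
        return 0.0
    return cosine(embed(entry), embed(query))


def toy_embed(text: str) -> list[float]:
    # Hypothetical stand-in for an embedding model: letter counts.
    counts = Counter(text.lower())
    return [float(counts[c]) for c in "abcdefghijklmnopqrstuvwxyz"]


def feature_is_dead(values: Sequence[float]) -> bool:
    # A feature that is zero on every entry carries no information.
    return all(v == 0.0 for v in values)


entries = ["subprime mortgage market", "dramatic pauses"]
query = "summarize it"
without_embedder = [topic_relevance(e, query, None) for e in entries]
with_embedder = [topic_relevance(e, query, toy_embed) for e in entries]
print(feature_is_dead(without_embedder), feature_is_dead(with_embedder))
```

A check like `feature_is_dead` run over each feature column at the end of a bench run is one cheap way to catch this class of mistake before writing up an attribution.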

The actual mechanism, the one our weight analysis from post 4 already implied: the MLP's recency × importance AND-gate systematically keeps turn 0 / turn 1 of any K=128 deployment. The gate fires for user-source tokens at moderate recency, which is exactly the slot the early-turn user question lives in. "Subprime mortgage market" got kept because it was in the user's opening question, and the MLP's h2 unit votes to keep user-typed content. Topic relevance had nothing to do with it.
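
To make the AND-gate idea concrete, here is a minimal sketch of how a single ReLU unit can act as a soft AND over two features. The weights, bias, and per-source importance values are invented for illustration; they are not the trained values from post 4, and `and_gate` is a hypothetical name.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def and_gate(recency: float, importance: float,
             w_recency: float = 1.0, w_importance: float = 1.0,
             bias: float = -1.2) -> float:
    """One ReLU unit over two features in [0, 1].

    With these (made-up) weights the pre-activation is positive only when
    both inputs are high, so the unit behaves like a soft AND.
    """
    return relu(w_recency * recency + w_importance * importance + bias)


# Assumed importance values: 1.0 for user-source tokens, 0.2 for others.
user_recent = and_gate(recency=0.9, importance=1.0)   # both high
other_recent = and_gate(recency=0.9, importance=0.2)  # importance too low
user_stale = and_gate(recency=0.1, importance=1.0)    # recency too low
print(user_recent > 0.0, other_recent, user_stale)
```

Neither feature alone clears the bias; only the combination does, which is the sense in which the gate "votes to keep" user-typed content that is still within reach of the decay.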

The qualitative finding stands; the attribution doesn't. The bench output we observed is real, reproducible, and matches the §5 simulation prediction. Our first explanation of why it happened was wrong because we had not checked which code path actually ran. We've replaced the topic-relevance narrative with the structural-retention story it should always have been. The full correction is in the dataroom results doc.

Why the win grows with depth

This is the deployment-relevant axis. Real users hold long conversations against one fixed cache budget; they don't sweep across different cache sizes. A policy whose advantage grows with conversation length is the right shipping target.

TemporalKV's recency feature is 0.995^(t − last_attended) — relative decay. Importance is per-source, constant per token. Neither depends on absolute T. As the conversation grows, the MLP's features stay calibrated. Sliding-window-based heuristics don't — "the most-recent 124 tokens" is a fundamentally different content set at depth 0.1 (almost all of the conversation) than at depth 0.9 (the tail of a long thread, with all the structural anchors gone).
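
The "relative, not absolute" property is easy to check numerically. A small sketch, assuming `t` is the current decode step and `last_attended` is the step at which the token was last attended to (both as integer step indices; the function name `recency` is illustrative):

```python
import math

DECAY = 0.995


def recency(t: int, last_attended: int, decay: float = DECAY) -> float:
    """Recency feature: decay raised to the number of steps since last attention."""
    return decay ** (t - last_attended)


# Same gap of 100 steps, early versus late in a conversation.
early = recency(t=200, last_attended=100)
late = recency(t=5000, last_attended=4900)

# Steps until the feature halves, derived from the decay constant alone.
half_life = math.log(0.5) / math.log(DECAY)
print(early == late, round(half_life, 1))
```

The feature depends only on the gap, so a token attended 100 steps ago scores the same at step 200 as at step 5000. With a decay of 0.995 the feature halves roughly every 138 steps.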

The numerical version of this (non-trivial P@K at K=128 in 2K context, the same quantity plotted in Figure 1):

Conversation depth	SlidingWindow	SnapKV	H2O	TemporalKV
0.1 (early)	85%	87%	77%	94%
0.3	54%	59%	37%	75%
0.5	55%	56%	29%	74%
0.7	51%	51%	24%	71%
0.9 (late)	52%	53%	17%	70%
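
The gaps quoted in the Figure 1 caption can be recomputed directly from the table. The snippet below only restates the table's numbers; the helper names `lead` and `drop` are illustrative.

```python
# P@K in percent, copied from the table above.
p_at_k = {
    0.1: {"SlidingWindow": 85, "SnapKV": 87, "H2O": 77, "TemporalKV": 94},
    0.3: {"SlidingWindow": 54, "SnapKV": 59, "H2O": 37, "TemporalKV": 75},
    0.5: {"SlidingWindow": 55, "SnapKV": 56, "H2O": 29, "TemporalKV": 74},
    0.7: {"SlidingWindow": 51, "SnapKV": 51, "H2O": 24, "TemporalKV": 71},
    0.9: {"SlidingWindow": 52, "SnapKV": 53, "H2O": 17, "TemporalKV": 70},
}


def lead(depth: float, baseline: str) -> int:
    """TemporalKV's lead over a baseline at a given depth, in points."""
    return p_at_k[depth]["TemporalKV"] - p_at_k[depth][baseline]


def drop(policy: str) -> int:
    """Points lost between the earliest and latest depth bins."""
    return p_at_k[0.1][policy] - p_at_k[0.9][policy]


late_leads = {b: lead(0.9, b) for b in ("SlidingWindow", "SnapKV", "H2O")}
drops = {p: drop(p) for p in p_at_k[0.1]}
print(late_leads)
print(drops)
```

At depth 0.9 this gives a lead of 17 points over SnapKV and 53 over H2O, matching the caption. TemporalKV also loses the fewest points from the first bin to the last (24, against 33, 34, and 60 for the others).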

The structural-retention point applies here too: what's keeping the late-conversation P@K from collapsing is the MLP's preservation of turn 0/1 user-source tokens, not its semantic discrimination. Which is honestly the better story — it means the deployment design is robust to even mediocre embedding quality on the topic-relevance feature, because the policy isn't really using that feature.

What we'd want to test next

This whole walkthrough is one cell. Replicating it at scale, with quantitative "specific-fact preservation" metrics across hundreds of multi-turn dialogues, is on the to-do list. RULER and LOCA-bench are the obvious targets; both are blocked on GPU access at the time of writing.

For now, the bench data (412 runs across the four policies) is in the dataroom results doc. The "win grows with depth" insight, with the full quintile breakdown, is in the insights doc.

Five posts in

The thing that surprised me most across this whole project is how much of the "learned" policy's behavior is structural. We trained an MLP on four features expecting it to learn semantic discrimination. It learned a recency × user-source AND-gate. That gate, hooked into a sinks + recency-floor hybrid, is the deployment artifact — 49 parameters of which roughly five matter, hardcoded heuristics doing the rest of the work, and a position-indexed cache primitive making the whole thing fast.

None of those four pieces, on its own, would have shipped. Together they hit a regime — K/T ≤ 1/8, multi-turn, on-device — that nothing in the literature was quite measuring. That's the paper draft. These five posts are the parts that were hard to fit there but useful to keep around.

Thanks for reading.