Brandon Miller

When we started reporting results from our policy, we made the mistake almost everyone makes in this field: we reported numbers at a single cache size, said "TemporalKV beats SnapKV by 16.6 points of P@K", and moved on.

Then we ran the same experiment at twice the context length, and the K/T = 1/4 result, which had been a clean win, collapsed. At 1K context with K=256 (K/T = 1/4) we beat the best baseline by +15.4. At 2K context with K=512 (also K/T = 1/4) we beat it by only +1.4. Same ratio, roughly a tenth of the gap.

But K=256 in 2K context (K/T = 1/8) gave us +17.8, almost exactly matching K=128 in 1K context (also K/T = 1/8) at +16.6. And K=128 in 2K context (K/T = 1/16) gave +17.4.

The picture cleared up when we re-plotted everything against K/T, the ratio of cache budget to context length, and forgot about either variable on its own.

Figure 1. Non-trivial P@K (agreement with the attention oracle on which tokens are actually attended to) as the cache budget is squeezed, plotted against K/T for TemporalKV, SnapKV, StreamingLLM, KeyDiff, and H2O. The chartreuse line is TemporalKV; the colored bars at each K/T show the per-policy lift. The cliff sits at K/T = 1/4.
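To make the metric concrete, here is a minimal sketch of a plain P@K computation: the fraction of the K tokens a policy keeps that also appear in the oracle's top-K by attention. The function name `precision_at_k` and the toy scores are illustrative assumptions, and this sketch does not attempt the "non-trivial" filtering, which is not spelled out here.

```python
def precision_at_k(policy_scores, oracle_attention, k):
    """Fraction of the policy's k kept tokens that are also in the oracle's top-k."""
    def top_k(values):
        # Indices of the k largest values; ties go to the lower index.
        order = sorted(range(len(values)), key=lambda i: (-values[i], i))
        return set(order[:k])

    kept = top_k(policy_scores)
    oracle = top_k(oracle_attention)
    return len(kept & oracle) / k


# Toy example (made-up numbers): 8 cached tokens, budget K = 4.
oracle_attention = [0.30, 0.05, 0.20, 0.01, 0.25, 0.02, 0.15, 0.02]
policy_scores = [0.90, 0.10, 0.20, 0.80, 0.70, 0.05, 0.60, 0.01]

p_at_4 = precision_at_k(policy_scores, oracle_attention, k=4)
print(p_at_4)  # 0.75: the policy keeps {0, 3, 4, 6}, the oracle wants {0, 2, 4, 6}
```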

The +17pt cliff

Here are the consolidated numbers across both runs. The pattern that wasn't obvious from any single context length jumps out once you stack them by ratio:

K/T	1K Δ vs best baseline	2K Δ vs best baseline
1/16	—	+17.4
1/8	+16.6	+17.8
1/4	+15.4	+1.4
1/2	+1.1	−1.8

The K/T = 1/8 row is so flat across context lengths that for a while we thought it was a coincidence. Add the 1/16 row and you get three independent measurements at K/T ≤ 1/8 (1K K=128, 2K K=256, 2K K=128) that all land between +16 and +18 points.

The K/T = 1/4 row is the interesting one. Same ratio, 10× difference. We'll come back to that.
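The regrouping itself is a few lines. The sketch below keys the deltas from the table above by an exact K/T fraction. It assumes 1K and 2K mean T = 1024 and T = 2048, and it infers the K values for the 1/2 row (512 and 1024) from the ratio, since only the ratio is given for that row.

```python
from collections import defaultdict
from fractions import Fraction

# (context length T, cache budget K, delta vs best baseline in P@K points).
# T = 1024 / 2048 and the K values at K/T = 1/2 are assumptions.
runs = [
    (1024, 128, 16.6),
    (1024, 256, 15.4),
    (1024, 512, 1.1),
    (2048, 128, 17.4),
    (2048, 256, 17.8),
    (2048, 512, 1.4),
    (2048, 1024, -1.8),
]

# Group by exact ratio so 256/1024 and 512/2048 land in the same bucket.
by_ratio = defaultdict(dict)
for context_len, budget, delta in runs:
    by_ratio[Fraction(budget, context_len)][context_len] = delta

# Spread across context lengths at each ratio: small means K/T explains the result.
spread = {
    ratio: max(deltas.values()) - min(deltas.values())
    for ratio, deltas in by_ratio.items()
}

for ratio in sorted(by_ratio):
    print(f"K/T = {ratio}: {by_ratio[ratio]} (spread {spread[ratio]:.1f})")
```

Sorted by ratio, the 1/8 bucket shows a spread of about 1 point while the 1/4 bucket shows about 14, which is the anomaly the rest of the post tries to explain.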

Why K/T is the right unit

K/T is roughly a measure of how much eviction pressure the policy must absorb. At K/T = 1/2 the cache still holds half of the prompt, so even a sliding window has room to catch the important tokens by accident. At K/T = 1/16 you have to be surgical: every kept slot has to earn its place.

The heuristic policies have failure modes that don't trigger when there's slack:

SlidingWindow loses any token older than its window. Catastrophic when the window is small relative to context.
H2O keeps tokens with high cumulative attention. At loose budgets this is fine; at tight budgets it locks onto early-popular tokens that may no longer be relevant.
SnapKV uses recent-window attention to score older tokens, and the score itself gets noisy at tight K because the ranking near the boundary is determined by ~zero attention weights.
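The first two failure modes are easy to reproduce on a toy example. The sketch below is deliberately simplified and is not the published H2O or sliding-window implementations: `sliding_window_keep` keeps the last K positions, and `cumulative_attention_keep` keeps the K positions with the highest attention summed over all query steps. The attention matrix is made up for illustration.

```python
def sliding_window_keep(num_tokens, k):
    """Keep only the k most recent token positions."""
    return set(range(max(0, num_tokens - k), num_tokens))


def cumulative_attention_keep(attention_rows, k):
    """Keep the k positions with the highest attention summed over all query steps."""
    num_tokens = len(attention_rows[0])
    totals = [sum(row[j] for row in attention_rows) for j in range(num_tokens)]
    order = sorted(range(num_tokens), key=lambda j: (-totals[j], j))
    return set(order[:k])


# Toy attention: 4 query steps (rows) over 6 cached tokens (columns).
# Token 0 is popular early and ignored later; token 1 only matters at the latest step.
attention_rows = [
    [0.9, 0.0, 0.1, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.1, 0.8, 0.0],
    [0.0, 0.6, 0.0, 0.0, 0.2, 0.2],
]

window = sliding_window_keep(num_tokens=6, k=2)
heavy = cumulative_attention_keep(attention_rows, k=2)
print(window, heavy)
```

With K = 2, the window keeps tokens 4 and 5, the cumulative score keeps tokens 0 and 4, and neither keeps token 1, which receives the most attention at the latest step. With K = 4 both policies have enough slack that the miss is much less likely.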

A learned policy with relative-step features (recency = 0.995^(t−last_attended); importance = source-type) stays calibrated because the features don't care about absolute T. At a fixed K, the longer the context, the more eviction pressure, and the bigger the gap.
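A quick sketch of why the recency feature is indifferent to absolute position: it depends only on the number of steps since a token was last attended. The function name and the example positions are illustrative assumptions. The importance feature is left out because its values per source type are not given here.

```python
def recency(t, last_attended, decay=0.995):
    """Recency feature: decay raised to the number of steps since last attention."""
    return decay ** (t - last_attended)


# Same 100-step gap at two different absolute positions.
early = recency(t=1000, last_attended=900)
late = recency(t=2000, last_attended=1900)
print(early, late)  # identical values, a bit above 0.60

# A token attended at the current step scores 1.0.
fresh = recency(t=2000, last_attended=2000)
```

Because the gap is all that enters the formula, a policy trained at one context length sees the same feature distribution for "attended 100 steps ago" at any other length.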

The K/T = 1/4 puzzle

Here's where the story needs more care. At K/T = 1/4 the same ratio gives wildly different results: +15.4 in the 1K run, +1.4 in the 2K run. So K/T isn't sufficient on its own; absolute K matters too.

The mechanism is label noise at the eviction boundary. When K is small (=128), the 128th-ranked token in the oracle has meaningfully more attention than the 129th. The "kept" / "evicted" label is informative. When K is large (=512), the 512th token has near-zero attention — so does the 600th, and the 800th. The boundary is essentially random in the underlying attention scores, and no policy can clearly distinguish "top-K" from "next-K" because there's nothing to distinguish.
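The shrinking margin can be illustrated with a synthetic attention profile. The sketch below assumes attention falls off as 1/rank (a Zipf-like shape) over 2048 tokens. That shape is an assumption chosen for illustration, not measured attention from our runs, and `boundary_margin` is a hypothetical helper.

```python
def boundary_margin(attention, k):
    """Gap between the k-th and (k+1)-th largest attention weights."""
    ranked = sorted(attention, reverse=True)
    return ranked[k - 1] - ranked[k]


# Synthetic, normalized attention over T tokens with a 1/rank falloff (assumed shape).
T = 2048
raw = [1.0 / rank for rank in range(1, T + 1)]
total = sum(raw)
attention = [weight / total for weight in raw]

margin_128 = boundary_margin(attention, 128)
margin_512 = boundary_margin(attention, 512)
print(margin_128 / margin_512)  # roughly 16x under this assumed falloff
```

Under this assumed falloff the gap at the K=128 boundary is roughly 16 times the gap at the K=512 boundary, so the kept/evicted label at large K is separated by a far smaller difference in attention. Real attention distributions will differ in detail, but any heavy-tailed profile shows the same direction of effect.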

This is a methodological point worth keeping. Eviction benchmarks that report only at a single absolute K are undersampling the design space. A policy's "win at K/T = 1/4" depends on whether you tested at K=256 or K=512. Future work should sweep both axes.

The deployment regime sits squarely in the cliff zone. Desktop llama.cpp on 8 GB consumer GPUs gives ~8–16K tokens of cache budget, while multi-turn conversations reach 32–64K (K/T ≈ 1/4 to 1/8). Cloud agents push 100K–500K against 128K limits (K/T → 1/8). Projected on-device assistants land at K/T ∈ [1/32, 1/8]. The +17pt regime is precisely where production wants to run.

What about PPL?

P@K (agreement with the attention oracle) is one metric. Next-token perplexity (PPL) is another, and the two tell different stories in different regimes. At K/T ≤ 1/8 TemporalKV wins on P@K by +17pts and matches full-KV PPL within 1.3%, the strongest possible "shippable" combination.

At K/T = 1/16, P@K still says TemporalKV wins by +17pts. PPL says StreamingLLM wins. That's the next post.