The ReLU is doing all the work
A linear scorer on the same features scores AUC 0.859. Add one hidden layer of 8 ReLU units — 49 parameters total — and AUC jumps to 0.900. We opened up the trained weights to see what the nonlinearity actually bought us. It was not what we expected.
The TemporalKV scorer is small enough that you can write down all its weights on one screen. So we did.
The architecture takes four input features per cached token: recency (decayed time since last attended), importance (a scalar source-type score for user / system / assistant), topic relevance (cosine similarity to a query embedding), and key diversity (how different this K-vector is from its neighbors). These project through a hidden layer of 8 ReLU units, then down to one sigmoid output. Forty-nine parameters. Four hundred and twelve cycles of inference per token on a Pixel 9.
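A minimal sketch of that forward pass and the parameter arithmetic, in plain Python. It assumes biases on both layers, which is what makes the count come out to 49; the function names are illustrative, not the shipped code.

```python
import math

def relu(x):
    return x if x > 0.0 else 0.0

def sigmoid(z):
    # Numerically stable form: avoids overflow for large negative logits.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mlp_score(features, W1, b1, W2, b2):
    """Keep-score for one cached token.

    features: n_in floats (recency, importance, topic, diversity)
    W1: n_hidden rows of n_in weights; b1: n_hidden biases
    W2: n_hidden output weights; b2: scalar output bias
    """
    hidden = [relu(sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(W1, b1)]
    logit = sum(w * h for w, h in zip(W2, hidden)) + b2
    return sigmoid(logit)

def param_count(n_in, n_hidden=8):
    # input weights + hidden biases + output weights + output bias
    return n_in * n_hidden + n_hidden + n_hidden + 1

def linear_param_count(n_in):
    # one weight per feature + one bias, no hidden layer
    return n_in + 1

print(param_count(4), param_count(2), param_count(1), linear_param_count(4))
```

The same arithmetic gives the parameter counts for the smaller ablation variants discussed later in the post.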
It outperforms a linear scorer on the same inputs by a clean +4 points of AUC, reproducible at every budget. We spent a while believing the obvious story — "the ReLU lets the MLP learn AND-gates like recent AND on-topic, both must fire for the keep score to activate." Plausible. Architecturally consistent. Wrong.
The decision surface
Three things we found by looking at the weights
1. Five of the eight hidden units do the same thing
Here are the trained weights for the budget-128 MLP (the deployment target):
| hidden | W1.recency | W1.importance | W1.topic | W1.diversity | W2 |
|---|---|---|---|---|---|
| h0 | −1.880 | −0.004 | −0.014 | −0.013 | −44.95 |
| h1 | −2.781 | −0.005 | −0.012 | −0.012 | −48.67 |
| h2 | +0.324 | −0.483 | −0.889 | −0.865 | +8.99 |
| h3 | −2.559 | +0.001 | −0.031 | −0.024 | −41.12 |
| h4 | −2.754 | −0.004 | −0.011 | −0.011 | −47.33 |
| h5 | −0.431 | −0.403 | −0.165 | −0.505 | +0.25 |
| h6 | −0.354 | +0.205 | +0.006 | −0.395 | +0.23 |
| h7 | −3.231 | +0.007 | +0.001 | −0.012 | −47.82 |
Five units (h0, h1, h3, h4, h7) have W1.recency between roughly −1.9 and −3.2, near-zero weights on every other feature, and a massive negative W2 between −41 and −49. They're functionally identical: each fires only when the recency feature is low, and when they fire, they slam the output toward zero.
Wait. Toward zero?
Yes — what they actually encode is "this token is old, so push the keep-score toward zero." The MLP is asymmetric. The default keep-score is high, and these five units learned to suppress old tokens. ReLU units don't just gate things on; they gate things off, and five units doing that in slightly different shapes is more expressive than one big linear weight.
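The grouping can be recovered mechanically from the weight table. The sketch below copies the table rows and flags units whose first-layer weight mass sits almost entirely on recency and whose W2 is strongly negative. The two thresholds (95% share, W2 at or below −40) are our own illustrative choices, not values from training.

```python
# Rows copied from the budget-128 weight table:
# name -> ((recency, importance, topic, diversity), W2)
UNITS = {
    "h0": ((-1.880, -0.004, -0.014, -0.013), -44.95),
    "h1": ((-2.781, -0.005, -0.012, -0.012), -48.67),
    "h2": ((+0.324, -0.483, -0.889, -0.865), +8.99),
    "h3": ((-2.559, +0.001, -0.031, -0.024), -41.12),
    "h4": ((-2.754, -0.004, -0.011, -0.011), -47.33),
    "h5": ((-0.431, -0.403, -0.165, -0.505), +0.25),
    "h6": ((-0.354, +0.205, +0.006, -0.395), +0.23),
    "h7": ((-3.231, +0.007, +0.001, -0.012), -47.82),
}

def recency_share(w1):
    """Fraction of a unit's absolute W1 mass that sits on the recency input."""
    return abs(w1[0]) / sum(abs(w) for w in w1)

def recency_suppressors(units, min_share=0.95, max_w2=-40.0):
    """Units that read almost only recency and push the output down hard."""
    return [name for name, (w1, w2) in units.items()
            if recency_share(w1) >= min_share and w2 <= max_w2]

print(recency_suppressors(UNITS))  # ['h0', 'h1', 'h3', 'h4', 'h7']
```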
2. One unit gates on importance × recency (the actual moat)
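The interesting unit is h2. It's the only one with a sizeable positive W2 (+8.99; h5 and h6 are also positive, but below 0.3), and the only one that pairs appreciable weights on multiple features with an output weight large enough to matter: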
h2.pre_activation = 0.505 + 0.324·recency − 0.483·importance
− 0.889·topic − 0.865·diversity
h2.fires_when = relu(h2.pre_activation) > 0
h2.contribution = +8.99 · relu(h2.pre_activation)
For typical mid-conversation tokens with importance ≈ 0.7 (model-generated), this expression is negative; h2 is at the ReLU floor and contributes 0. Drop importance to 0 (user-typed token) and the expression flips positive; h2 fires; the keep-score gets a positive shove.
This is the AND-gate. But it's not "recent AND on-topic." It's "recent AND user-source." The model learned that user-typed tokens tend to anchor the conversation thread, and it wants to preserve them even after they've aged out of the recency-suppression units' zone.
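To make the flip concrete, here is the h2 expression evaluated at one made-up operating point. The coefficients are the ones quoted above; the recency, topic, and diversity values (0.5, 0.2, 0.2) are hypothetical and chosen only for illustration, so the exact numbers should not be read as measurements.

```python
def h2_pre_activation(recency, importance, topic, diversity):
    # Coefficients quoted in the post for hidden unit h2.
    return (0.505 + 0.324 * recency - 0.483 * importance
            - 0.889 * topic - 0.865 * diversity)

def h2_contribution(recency, importance, topic, diversity):
    # ReLU, then the unit's output weight W2 = +8.99.
    pre = h2_pre_activation(recency, importance, topic, diversity)
    return 8.99 * max(0.0, pre)

# Hypothetical feature values; only importance differs between the two tokens.
assistant_token = dict(recency=0.5, importance=0.7, topic=0.2, diversity=0.2)
user_token = dict(assistant_token, importance=0.0)

for label, tok in [("assistant", assistant_token), ("user", user_token)]:
    print(label, h2_pre_activation(**tok), h2_contribution(**tok))
```

At this point the assistant-style token leaves h2 at the ReLU floor, while setting importance to 0 pushes the pre-activation positive and h2 adds to the logit.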
3. Topic-relevance and key-diversity are inert at inference
This is the finding that re-arranged our story. Look at the W1 columns for topic and diversity in the table above. In five of the eight units they're essentially zero. The units where they carry non-trivial weight (h2 and h5, plus h6 for diversity alone) weight them negatively, so both features must be small for the unit to fire. In other words, the MLP punishes high topic/diversity, the opposite of what we'd naively expect.
We checked this empirically on-device. Three queries to the same 8-turn prefilled cache, deliberately picked to maximize semantic difference: "tell me about food", "tell me about temples", "purple elephants in outer space dance". The kept positions inside each turn of the conversation? Identical across all three queries. Down to the individual token index.
Topic-relevance varies with the query (we logged the values; they swing by 0.1+ between queries). It just doesn't propagate to the final score. The recency-suppression units dominate, h2 fires or doesn't based on recency × importance, and the topic input might as well have been zero.
| Variant | Params | AUC @ K=128 | AUC @ K=256 |
|---|---|---|---|
| recency only (1-feature) | 25 | 0.4985 | 0.4980 |
| importance only | 25 | 0.4985 | 0.4980 |
| topic only | 25 | 0.5010 | 0.5001 |
| diversity only | 25 | 0.4985 | 0.4980 |
| recency + importance (2-feature) | 33 | 0.9565 | 0.9412 |
| recency + topic (2-feature) | 33 | 0.9581 | 0.9431 |
| full (4-feature, shipping) | 49 | 0.9598 | 0.9426 |
| linear (no ReLU, 4-feature) | 5 | 0.9285 | 0.9237 |
The 2-feature MLPs score within about 0.3 AUC points of the 4-feature variant (0.9565 and 0.9581 against 0.9598 at K=128; 0.9412 and 0.9431 against 0.9426 at K=256). The 4-feature design is, empirically, unnecessary. We trained it that way because the optimizer needs degrees of freedom to find the recency × importance nonlinearity reliably (single-feature variants get stuck in a degenerate minimum near AUC 0.5), but the extra 16 weights add no measurable inference signal.
Bootstrap CIs, the honest answer
Before we ship a "2 features are enough" claim, the right thing to do is check the confidence interval. The point estimates from the ablation made it look like dropping topic & diversity might even improve things at K=128, but with n=10 dialogues the CIs overlap heavily:
| Variant (UltraChat, K=128) | ΔPPL % | 95% CI |
|---|---|---|
| 4-feature TKv-H | +19.55% | [+9.12, +31.50] |
| 2-feature (recency + importance only) | +11.24% | [+4.94, +19.22] |
The 8-point point-estimate gap is not statistically significant at n=10. The defensible claim is equivalence, not improvement. What the data does support, with disjoint CIs, is that both variants strictly beat StreamingLLM at K=128 by a wide margin.
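For readers who want to reproduce this kind of interval, a minimal percentile-bootstrap sketch follows. The per-dialogue ΔPPL values in it are invented placeholders (we are not reproducing the raw data here), and the resample count and seed are arbitrary. The overlap check at the end uses the two intervals reported in the table.

```python
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n
                   for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def intervals_overlap(a, b):
    """True when closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical per-dialogue ΔPPL % values, n=10, for illustration only.
fake_delta_ppl = [4.1, 22.0, 9.5, 15.2, 30.8, 7.3, 18.9, 12.4, 25.6, 10.7]
print(bootstrap_ci(fake_delta_ppl))

# Intervals reported in the table above (4-feature vs 2-feature, K=128).
print(intervals_overlap((9.12, 31.50), (4.94, 19.22)))  # True
```

With only ten dialogues the resampled means spread widely, which is why the reported intervals are so broad.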
AUC is not feature attribution. Reading the 4-feature AUC win over the 2-feature variants as "all features must contribute" was the wrong inference from the right number. AUC measures joint discrimination; it doesn't tell you which features the model uses. The architectural test is a feature-zeroing inference counterfactual — a five-line ablation we should have run on day one and didn't until month four.
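The counterfactual really is only a few lines. A sketch, assuming the scorer is exposed as a function from a 4-feature row to a score: `toy_score` below is a hypothetical stand-in that reads only recency and importance, not the trained TemporalKV weights, and the rows are made up.

```python
def zeroing_delta(score_fn, rows, feature_index):
    """Largest absolute score change when one input feature is forced to 0."""
    worst = 0.0
    for row in rows:
        ablated = list(row)
        ablated[feature_index] = 0.0
        worst = max(worst, abs(score_fn(row) - score_fn(ablated)))
    return worst

def toy_score(row):
    # Hypothetical scorer: uses recency and importance, ignores the rest.
    recency, importance, topic, diversity = row
    return max(0.0, 0.5 + 0.3 * recency - 0.5 * importance)

FEATURES = ["recency", "importance", "topic", "diversity"]
ROWS = [  # made-up feature rows
    (0.9, 0.0, 0.4, 0.1),
    (0.2, 0.7, 0.8, 0.6),
    (0.5, 0.7, 0.1, 0.9),
]

deltas = {name: zeroing_delta(toy_score, ROWS, i)
          for i, name in enumerate(FEATURES)}
print(deltas)
```

A feature whose delta stays at zero across a representative set of rows is not being used at inference, whatever the joint AUC says.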
What this changes
The paper's mechanism story used to be "a 4-feature MLP combining recency, importance, topic-relevance, and key-diversity learns to preserve semantically-relevant tokens." That sentence is wrong in detail and overstated in scope.
The honest version: TemporalKV-H is a 2-input × 8-hidden × 1-output ReLU MLP over recency and importance, wrapped in a sinks-plus-recency-floor hybrid. The recency × importance ReLU AND-gate captures all the eviction-relevant signal we can measure. Adding semantic features doesn't help. The architectural moat is the nonlinearity, but only because that nonlinearity expresses an AND-gate the linear model can't represent — not because it integrates four features that the linear model integrates worse.
This is a more interesting finding than the one we set out to write up. A learned eviction policy needs only structural features (position, role) to capture the bulk of the achievable lift. Semantic features (topic, diversity) — which our original architecture treated as co-equal inputs — turn out to be either redundant or unused.
The full re-interpretation, with bootstrap CIs and the on-device diagnostic logs showing that topic-relevance is inert at inference, is in the dataroom insights doc.