Brandon Miller

You write a paper. The paper says: here is a learned KV-cache eviction policy. It beats heuristic baselines on next-token PPL across a swept range of budgets. Our numbers are gathered with HuggingFace Transformers and a PyTorch attention mask trick.

Then you take it on-device, port the policy to the production inference runtime, and the first benchmark run measures: 3.40 tokens per second at K=128. The no-eviction baseline runs at 34.4 tok/s. Your "speedup" is a 10× regression.

This is what happened to us when we moved off the offline pipeline onto llama.cpp on an x86_64 server (with the same binary cross-compiled to arm64 for the Pixel 9). And it taught us something about how production caches actually work that I haven't seen written down anywhere.

Two ways to "remove" a token from a KV cache

Every published eviction paper models the operation the same way. You have a cache tensor of shape [layers, heads, T, d]. You pick the indices you want to keep. You slice:

# PyTorch StreamingLLM / H2O / SnapKV
kv_cache = kv_cache[keep_indices]
position_ids = arange(len(keep_indices))   # re-number 0..N-1

That second line — renumbering the positions — is the critical one. The K vectors in a transformer cache have already been rotated by RoPE for whichever absolute position the token occupied when it was first written. Once you compact, the surviving K's are sitting in slots [0, 1, 2, …] with rotations meant for their old positions. Attention math against future Q's now gives the wrong relative phase. So you re-rotate. We call this compact-and-shift.

Naively, you do the same thing on llama.cpp. The API even has two functions that look like they map cleanly:

llama_memory_seq_rm(mem, seq, p0, p1);        // remove positions in [p0, p1)
llama_memory_seq_add(mem, seq, p0, p1, -d);   // shift positions down by d

You wire your policy up, run the bench, and watch decode latency collapse.

Figure 1. Each cell is one cached token; the number inside is its stored llama_pos. The yellow halo on the right indicates the per-decode RoPE re-rotation pass triggered by any seq_add call. Compact-and-shift renumbers the survivors and triggers build_graph_shift() on the next decode — ~135 ms/token on Qwen2.5-1.5B Q4_K_M. Sparse just marks cells inactive; nothing else moves.

The cache is position-indexed, not slot-indexed

Here's the thing we didn't expect. In llama.cpp, the cache cells don't have stable indices. Each cell carries its own llama_pos, written at insertion time. Attention iterates over the set of active cells and reads their stored positions to compute Q·K relative phase. There's nothing to defragment. Gaps in position numbers are the native data model.

Once you internalize that, the API makes sense:

Call	Cost	What it actually does
`seq_rm(p0, p1)`	metadata only	Marks cells in [p0, p1) inactive. Their K and V buffers stay allocated; attention just stops iterating over them.
`seq_add(p0, p1, Δ)`	metadata + RoPE rebuild next decode	Sets `do_shift = true`. Next `llama_decode` runs `build_graph_shift()`, a dedicated compute graph that re-rotates every kept K vector across every layer for the new position.

For a 1.5B Q4_K_M model, that shift graph costs roughly 135 ms per token. It runs every decode step that follows an eviction. That's the 5× slowdown. And it's completely unnecessary — the K's that survived already have the correct rotation for their original positions, and that rotation gives correct relative phase against future Q's at whatever positions they end up at.

The shift, in other words, is solving a problem the PyTorch runtime has (slot indices are the only positions) that the llama.cpp runtime doesn't have (positions are stored per-cell, independent of slot).

The numbers from a 412-run sweep

This was measured head-to-head, identical policy logic, identical configs, only the cache primitive changed:

Strategy	tok/s @ K=128	vs no-eviction
`seq_rm` + `seq_add` (compact & shift)	3.40	5.06× slower
No eviction (baseline)	34.37	1.00×
`seq_rm` only (sparse)	38.05	1.11× faster

Drop the seq_add call and decoding gets faster than the no-eviction baseline. The expected attention-compute savings show up: attention iterates over 128 active rows instead of 800–2000 rows of full context.

Decoded text quality stays identical. We verified this with Jaccard overlap and reading hundreds of outputs side-by-side — the compact-and-shift path and the sparse path produce the same tokens. The shift is pure overhead.

So why does PyTorch make everyone do compact-and-shift?

Look at what PyTorch is willing to expose. A KV cache is a torch.Tensor indexed by slot. There is no seq_rm. "Remove positions [p0, p1)" means literally:

cache = torch.cat([cache[:p0], cache[p1:]])

After that, the surviving rows are at slot indices p0, p0+1, … — which means their RoPE rotation was applied for their old positions, and the new slot index is wrong. The shift is real. You actually need it.

The StreamingLLM, H2O, and SnapKV authors weren't wrong. Their runtimes required compact-and-shift. They just didn't know that the right answer on llama.cpp is different, because nobody benchmarked their policies on llama.cpp.

The lesson generalises. Before you publish an "X% latency reduction" number for any KV-cache method, ask: is the cache I'm measuring against position-indexed or slot-indexed? The answer determines whether you need a shift step — and the shift step can dominate your benchmark.

What about MLC, ExecuTorch, MediaPipe LLM?

The on-device runtimes we've looked at have a mix. Some expose position-indexed primitives, some don't. Some have a position-indexed underlying cache but no public API to write into the gaps. When evaluating an inference runtime for cache-eviction research, the first question is: can I mark a cell inactive without renumbering its neighbors? If yes, you have a fast deployment story. If not, you're back to compact-and-shift and you'll pay for it.

Our v0.1.5 demo shipped on LiteRT-LM, which has the opposite problem — the executor compiles a static computation graph and doesn't expose any cache mutation at all. We pivoted off LiteRT-LM specifically because eviction was structurally impossible there.

What this changes for the paper

Section 6 (deployment) used to be a one-paragraph appendix. It's now its own results section. The headline is no longer "TemporalKV preserves quality at K=128"; it's "the right primitive for TemporalKV on llama.cpp is sparse seq_rm, and it's an 11% free speed win on top of the quality preservation."

The full ablation, including the failed compact-and-shift baseline that triggered this whole investigation, is in the dataroom. The deeper insight on why sparse is the correct primitive is in a separate insights doc.

Next post: a tour of when learned eviction actually helps. The short version is, only when your cache budget is tight enough that the heuristics start picking the wrong tokens — and that threshold lives at K/T = 1/8.