pith. machine review for the scientific record.

arxiv: 2605.09490 · v1 · submitted 2026-05-10 · 💻 cs.CL · cs.AR · cs.LG

Recognition: 2 theorem links · Lean Theorem

Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.AR · cs.LG
keywords LLM reasoning · KV cache · memory hierarchy · attention scoring · offloading · chain-of-thought · GPU memory optimization

The pith

LLM reasoning accuracy depends only on the fraction of tokens permanently evicted, not on how many remain in GPU HBM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that chain-of-thought reasoning in LLMs loses accuracy only when tokens are thrown away forever, not when they are moved out of scarce GPU memory and later restored exactly. A tiered memory system scores tokens by cumulative attention, sends low-scoring ones to CPU or compressed storage, and prefetches them back at full precision right before each attention step so the computation is identical to full residency. Experiments that independently vary HBM size and eviction ratio across model scales and benchmarks show accuracy stays high whenever eviction stays low. This decouples memory reduction from the accuracy collapse that occurs with standard eviction methods.

Core claim

The authors formalize zero-approximation-error offloading, in which tokens outside HBM contribute exactly the same attention terms as if they had never left. Under this condition, a controlled 3x3 grid of HBM occupancy versus eviction ratio demonstrates that performance is governed solely by the permanent eviction ratio. At 3 percent eviction the hierarchy recovers 91 percent of full-cache accuracy on GSM8K and 71 percent on MATH-500; at 14B scale it matches the uncompressed baseline while using half the HBM.
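
For readers who want the condition spelled out, here is a minimal formalization of zero-approximation-error offloading in standard attention notation; the index-set partition and symbols are editorial shorthand, not the paper's own notation.

```latex
% Partition the generated token indices at decode step t into
%   R (resident in HBM), O (offloaded to DDR or compressed storage), E (permanently evicted).
% Zero-approximation-error offloading requires bit-exact restoration of offloaded entries,
%   \tilde{k}_i = k_i, \qquad \tilde{v}_i = v_i \qquad \text{for all } i \in O,
% so that the attention output at step t over the non-evicted tokens is
\[
  o_t \;=\; \operatorname{softmax}\!\left(\frac{q_t \, K_{R \cup O}^{\top}}{\sqrt{d}}\right) V_{R \cup O},
\]
% exactly what it would be if every token in R \cup O sat in HBM. The only deviation from the
% full cache is dropping E, so accuracy can depend only on the eviction ratio
% |E| / (|R| + |O| + |E|), not on how the non-evicted tokens are split between R and O.
```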

What carries the argument

A semantics-aware four-tier memory hierarchy (HBM, DDR, compressed storage, evicted) driven by cumulative attention scoring, with exact on-demand prefetching that guarantees identical numerical results to in-HBM residency.
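
A minimal sketch of how cumulative-attention scoring and four-tier assignment could work, assuming per-step attention weights are available; the running-sum score and the tier fractions below are illustrative stand-ins, not the paper's exact formula or cutoffs.

```python
import numpy as np

def update_scores(cum_scores, attn_weights):
    """Add this decode step's attention mass onto each past token's running total.

    cum_scores   : (n_tokens,) cumulative attention received so far
    attn_weights : (n_heads, n_tokens) attention weights from the new query
    """
    return cum_scores + attn_weights.sum(axis=0)

def assign_tiers(cum_scores, hbm_frac=0.50, ddr_frac=0.35, comp_frac=0.12):
    """Rank tokens by cumulative attention and split them into four tiers.

    The fractions are illustrative assumptions; whatever remains
    (~3% with these numbers) falls into the permanently evicted tier.
    """
    n = len(cum_scores)
    order = np.argsort(-cum_scores)            # highest-scoring tokens first
    n_hbm, n_ddr, n_comp = (int(round(f * n)) for f in (hbm_frac, ddr_frac, comp_frac))
    tiers = np.full(n, "evicted", dtype=object)
    tiers[order[:n_hbm]] = "hbm"
    tiers[order[n_hbm:n_hbm + n_ddr]] = "ddr"
    tiers[order[n_hbm + n_ddr:n_hbm + n_ddr + n_comp]] = "compressed"
    return tiers

# toy usage: 1000 generated tokens, 200 decode steps of random attention history
rng = np.random.default_rng(0)
scores = np.zeros(1000)
for _ in range(200):
    scores = update_scores(scores, rng.random((8, 1000)))
print(dict(zip(*np.unique(assign_tiers(scores), return_counts=True))))
```

With the illustrative fractions above, roughly 3 percent of tokens land in the evicted tier, which mirrors the low-eviction regime the results describe.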

If this is right

  • Only 3 percent permanent eviction retains 91 percent of full-cache GSM8K accuracy and 71 percent on MATH-500.
  • At 14B scale the method matches baseline accuracy while halving HBM occupancy.
  • A real GPU-CPU prototype incurs 5-7 percent transfer overhead.
  • The same budget yields 0-32 percent accuracy under the prior SOTA eviction method R-KV.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same eviction-versus-offload distinction could be tested on non-reasoning tasks where long contexts also strain memory.
  • Combining the hierarchy with existing quantization or sparsity methods might yield further HBM reductions without additional eviction (a quantization sketch follows this list).
  • Production systems could allocate saved HBM to larger batch sizes or longer context windows while keeping the same eviction budget.
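
On the quantization point above: Figure 4 refers to 8-bit quantization of offloaded tokens, so here is a generic sketch of symmetric int8 quantize/dequantize for a KV block moved to a compressed tier. The scheme, scale choice, and shapes are assumptions rather than the paper's compression format, and quantized restoration is of course no longer bit-exact.

```python
import numpy as np

def quantize_kv_int8(kv_fp16):
    """Symmetric per-tensor int8 quantization for a KV block sent to the compressed tier.

    Illustrative scheme only; the paper does not spell out its compression format here.
    """
    scale = float(np.abs(kv_fp16).max()) / 127.0 + 1e-12
    q = np.clip(np.round(kv_fp16 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv_int8(q, scale):
    """Restore an approximate fp16 KV block when the token is prefetched back."""
    return (q.astype(np.float32) * scale).astype(np.float16)

kv = np.random.default_rng(1).standard_normal((128, 64)).astype(np.float16)
q, s = quantize_kv_int8(kv)
restored = dequantize_kv_int8(q, s)
print("max abs error:", float(np.abs(kv.astype(np.float32) - restored.astype(np.float32)).max()))
print("bytes: fp16", kv.nbytes, "-> int8", q.nbytes)
```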

Load-bearing premise

That cumulative attention scores can identify tokens whose removal never changes the reasoning trajectory, and that prefetching those tokens from CPU or compressed storage always arrives in time with full precision and no bandwidth penalty.
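
A minimal sketch of the timing side of this premise: double-buffered prefetching, where the next layer's offloaded KV is requested while the current layer's attention runs. The thread pool stands in for an asynchronous host-to-device copy over PCIe; the function names and layer-granularity scheduling are assumptions, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_offloaded(layer_idx, store):
    """Stand-in for an async host-to-device copy of a layer's offloaded KV blocks."""
    return store[layer_idx]            # real system: pinned-memory copy over PCIe

def attention_step(resident_kv, restored_kv):
    """Placeholder for exact attention over resident plus restored (non-evicted) tokens."""
    return len(resident_kv) + len(restored_kv)

def decode_step(n_layers, hbm_store, cpu_store):
    """Double-buffered prefetch: layer i+1's offloaded KV is fetched while layer i computes."""
    out = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_offloaded, 0, cpu_store)
        for i in range(n_layers):
            restored = pending.result()              # must arrive before attention needs it
            if i + 1 < n_layers:
                pending = pool.submit(fetch_offloaded, i + 1, cpu_store)
            out.append(attention_step(hbm_store[i], restored))
    return out

hbm = {i: list(range(700)) for i in range(4)}        # tokens resident in HBM
cpu = {i: list(range(300)) for i in range(4)}        # tokens offloaded to CPU memory
print(decode_step(4, hbm, cpu))
```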

What would settle it

Repeat the 3x3 grid experiment but replace exact prefetching with delayed or approximated tokens; if accuracy then varies with HBM size instead of eviction ratio alone, the central claim is false.
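
A sketch of what that harness could look like: sweep HBM occupancy and eviction ratio independently with tier thresholds held fixed, then test whether accuracy is flat along the HBM axis. The grid values and the run_benchmark hook are hypothetical; only the 3 percent eviction point and the 3x3 shape come from the paper.

```python
import itertools

def grid_sweep(run_benchmark, hbm_fracs=(0.25, 0.50, 0.75),
               evict_fracs=(0.03, 0.10, 0.50), exact_prefetch=True):
    """3x3 grid over HBM occupancy x eviction ratio, tier thresholds held fixed across cells.

    run_benchmark(hbm_frac, evict_frac, exact_prefetch) is a user-supplied evaluation hook
    (hypothetical). The invariance claim predicts accuracy is constant along the hbm_frac
    axis for a fixed evict_frac; rerunning with exact_prefetch=False (delayed or
    approximated tokens) is the falsification test described above.
    """
    return {(h, e): run_benchmark(h, e, exact_prefetch)
            for h, e in itertools.product(hbm_fracs, evict_fracs)}

def rows_are_flat(results, tol=0.02):
    """Check the prediction: for each eviction ratio, accuracy varies by < tol across HBM sizes."""
    by_evict = {}
    for (h, e), acc in results.items():
        by_evict.setdefault(e, []).append(acc)
    return all(max(v) - min(v) < tol for v in by_evict.values())

# toy usage with a dummy evaluator whose accuracy depends only on the eviction ratio
print(rows_are_flat(grid_sweep(lambda h, e, exact: 1.0 - e)))
```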

Figures

Figures reproduced from arXiv: 2605.09490 by Aojie Yuan, Dajun Zhang, Tianqi Shen.

Figure 1: Attention importance distribution of reasoning tokens (DeepSeek-R1-Distill-Qwen-7B, …).
Figure 2: Streaming eviction accuracy vs. KV-cache budget. Attention-based scoring outperforms …
Figure 3: Memory hierarchy vs. pure eviction. At equivalent eviction ratios, the hierarchy (which …
Figure 4: Eviction ratio sweep. The sweet spot is 3–5% eviction. 8-bit quantization of offloaded …
Figure 5: PCIe transfer latency across all 28 KV layers (RTX 5080, PCIe Gen5 x16, fp16 KV cache).
Figure 6: PCIe bandwidth utilization. Throughput saturates above 64 tokens at …
Figure 7: Latency breakdown for hierarchy configurations. Transfer overhead (GPU …
Figure 8: Distribution of rank correlations between attention-based and gradient-based importance …
Figure 9: Token tier distribution during generation. The hierarchy dynamically assigns tokens to tiers …
Figure 10: End-to-end accuracy and throughput comparison. Hierarchy configurations (green) achieve …
Figure 11: Latency breakdown per sample (averaged). Transfer overhead (GPU …
Full-size figures are available at the arXiv source.
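
Figures 5-7 report PCIe transfer behavior and the 5-7 percent overhead figure. As a back-of-envelope companion, here is a sketch of per-prefetch transfer time from KV size and link bandwidth; the head count, head dimension, and achievable bandwidth below are assumptions, not the paper's measurements (only the 28-layer count and fp16 KV format appear in the captions).

```python
def kv_bytes_per_token_per_layer(n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """fp16 key + value per token per layer; head counts are illustrative (e.g. a GQA-style 7B)."""
    return 2 * n_kv_heads * head_dim * bytes_per_elem

def transfer_time_ms(n_tokens, n_layers=28, bandwidth_gb_s=40.0):
    """Time to move n_tokens worth of offloaded KV across all layers over PCIe.

    bandwidth_gb_s is an assumed achievable PCIe Gen5 x16 figure, not a measured one.
    """
    total_bytes = n_tokens * n_layers * kv_bytes_per_token_per_layer()
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3

# e.g. prefetching 2,000 offloaded tokens across 28 layers
print(f"{transfer_time_ms(2000):.2f} ms")
```
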
Original abstract

Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV -- the current SOTA eviction method -- on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest -- 5-7% transfer overhead -- and scaling analysis projects 2-48 GB HBM savings at production batch sizes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a semantics-aware four-tier memory hierarchy (HBM, DDR, compressed, evicted) for KV caches in LLM chain-of-thought reasoning. Tokens are ranked by cumulative attention scores; low-importance tokens are offloaded rather than evicted and prefetched back at full precision before each attention step, yielding zero-approximation-error offloading. The central empirical claim is that final accuracy is determined exclusively by the permanent eviction ratio and is invariant to HBM occupancy for non-evicted tokens; this is tested via a controlled 3x3 grid over HBM and eviction ratios on 7B–32B models across GSM8K, MATH-500 and two other benchmarks, with a head-to-head reproduction of R-KV and a GPU-CPU prototype reporting 5–7% overhead.

Significance. If the zero-error prefetch and the eviction-ratio invariance hold, the work offers a practical route to large HBM savings (projected 2–48 GB at production batches) while preserving reasoning accuracy that collapses under pure eviction. The explicit 3x3 isolation of the eviction ratio, the external R-KV baseline, and the working prototype constitute concrete strengths that move the field beyond eviction-only heuristics.

major comments (1)
  1. [3x3 grid experiment] § on the 3x3 grid (and associated methods): the invariance claim is load-bearing and rests on the grid varying HBM occupancy while holding cumulative-attention tier thresholds fixed. The manuscript must state the precise HBM occupancy percentages and eviction ratios for each of the nine cells and confirm that the four-tier thresholds were not re-tuned per cell; without these values the isolation of eviction ratio as the sole determinant cannot be fully verified.
minor comments (2)
  1. [Abstract and results] The abstract reports 91% retention at 3% eviction on GSM8K and 71% on MATH-500 (n=200); the main text should tabulate the corresponding full-cache baselines and all three model scales for direct comparison.
  2. [System prototype] The prototype section states 5–7% transfer overhead; a per-layer or per-batch-size breakdown, together with the exact prefetch timing window relative to attention, would clarify whether the zero-error assumption holds under realistic bandwidth constraints.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the constructive comment on the 3x3 grid experiment. We address the point below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [3x3 grid experiment] § on the 3x3 grid (and associated methods): the invariance claim is load-bearing and rests on the grid varying HBM occupancy while holding cumulative-attention tier thresholds fixed. The manuscript must state the precise HBM occupancy percentages and eviction ratios for each of the nine cells and confirm that the four-tier thresholds were not re-tuned per cell; without these values the isolation of eviction ratio as the sole determinant cannot be fully verified.

    Authors: We agree that explicit numerical values are required for independent verification of the invariance result. In the revised manuscript we will insert a table (or expanded methods paragraph) that lists the exact HBM occupancy percentages and eviction ratios used for each of the nine cells. The cumulative-attention tier thresholds were held strictly fixed across the entire grid; token ranking and tier assignment follow the same sorted cumulative-attention procedure and fixed ratio cut-offs in every cell, with no per-cell re-tuning of thresholds. Consequently, the only controlled variables are HBM occupancy and the permanent eviction ratio, directly supporting the claim that accuracy depends solely on the latter. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central claim that accuracy depends solely on the permanent eviction ratio (and is invariant to HBM occupancy for non-evicted tokens) is established by an explicit 3x3 experimental grid that independently varies both quantities while holding cumulative-attention tiering fixed, across three model scales and four benchmarks. This is further supported by a real-system prototype measuring 5-7% transfer overhead and by direct reproduction of the external R-KV baseline. The zero-approximation-error offloading is realized through actual GPU-CPU prefetching rather than assumed by definition, and no load-bearing step reduces to a self-referential equation, fitted input renamed as prediction, or self-citation chain; the derivation remains self-contained against the reported empirical controls.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption of perfect prefetching and on the empirical observation from the 3x3 grid; no new physical entities are introduced, but the tier cutoffs and attention scoring function are system parameters whose exact values are not detailed in the abstract.

free parameters (1)
  • cumulative attention scoring thresholds for HBM/DDR/compressed/evicted tiers
    The paper sorts tokens into four tiers using cumulative attention; specific cutoff values or scoring formula are required to implement the hierarchy but are not stated in the abstract.
axioms (1)
  • domain assumption: Prefetching offloaded tokens back to GPU at full precision produces identical attention outputs and model behavior as if the tokens had remained in HBM throughout
    This is the explicit basis for the zero-approximation-error claim and is invoked whenever offloaded tokens participate in attention.

pith-pipeline@v0.9.0 · 5633 in / 1383 out tokens · 54618 ms · 2026-05-12T05:23:35.174898+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 5 internal anchors

  1. [1]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024

  2. [2]

    R-KV: Redundancy-aware kv cache compression for reasoning models

    Xiaoxin Cai, Yijun Xu, Haotian Chen, Yiqi Gu, Siyuan Huang, and Hongxia Xu. R-KV: Redundancy-aware kv cache compression for reasoning models. In NeurIPS, 2025

  3. [3]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  4. [4]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  5. [5]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  6. [6]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS, 2021

  7. [7]

    GEAR: An efficient kv cache compression recipe for near-lossless generative inference of LLM

    Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. GEAR: An efficient kv cache compression recipe for near-lossless generative inference of LLM. In NeurIPS, 2024

  8. [8]

    InfiniGen: Efficient generative inference of large language models with dynamic kv cache management

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Hwisoo Sim. InfiniGen: Efficient generative inference of large language models with dynamic kv cache management. In OSDI, 2024

  9. [9]

    ArkVale: Efficient generative LLM inference with recallable key-value eviction

    Renze Li, Shi Chen, Jian Li, Chenguang Wang, and Zhaozhuo Xu. ArkVale: Efficient generative LLM inference with recallable key-value eviction. arXiv preprint arXiv:2404.14484, 2024

  10. [10]

    SnapKV: LLM knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. In NeurIPS, 2024

  11. [11]

    MiniCache: KV cache compression in depth dimension for large language models

    Akide Liu, Jing Zhao, Nan Lu, Kai Dang, Shuang Chen, Chenghao Yan, Hai-Tao Xie, Zhi-Hong Wu, and Jian Gao. MiniCache: KV cache compression in depth dimension for large language models. In NeurIPS, 2024

  12. [12]

    ScissorHands: Exploiting the persistence of importance hypothesis for LLM kv cache compression at test time

    Zichang Liu, Aashiq Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. ScissorHands: Exploiting the persistence of importance hypothesis for LLM kv cache compression at test time. In NeurIPS, 2024

  13. [13]

    KIVI: A tuning-free asymmetric 2bit quantization for kv cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for kv cache. In ICML, 2024

  14. [14]

    TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

    Weian Mao, Yifei Xu, Xinlei Huang, Jiachen Chen, and Wenqiang Zhang. TriAttention: Efficient long reasoning with trigonometric kv compression. arXiv preprint arXiv:2604.04921, 2026

  15. [15]

    Learning to reason with LLMs

    OpenAI. Learning to reason with LLMs. OpenAI Blog, 2024

  16. [16]

    Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters

    Zhiyuan Park, Jinhyuk Song, Sangmin Bae, and Joonhyuk Lee. Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters. In EMNLP, 2024

  17. [17]

    HeadInfer: Memory-efficient LLM inference by head-wise offloading

    Cheng Sun, Xinlei Huang, Yuanbo Chang, Yifan Gao, Zhi Wang, and Hai Luo. HeadInfer: Memory-efficient LLM inference by head-wise offloading. arXiv preprint arXiv:2502.12574, 2025

  18. [18]

    ScoutAttention: Small kernels with large effective receptive fields for efficient kv cache offloading

    Ke Tang, Ziteng Wu, Yi Xu, Yilong Zhan, Chengruidong Li, and Xuming Chen. ScoutAttention: Small kernels with large effective receptive fields for efficient kv cache offloading. arXiv preprint arXiv:2502.17606, 2026

  19. [19]

    Hold onto that thought: Assessing kv cache compression on reasoning

    Sam Weston, Alice Chen, and Parth Shah. Hold onto that thought: Assessing kv cache compression on reasoning. arXiv preprint arXiv:2512.12008, 2025

  20. [20]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In ICLR, 2024

  21. [21]

    ThinKV: Thought-adaptive KV cache compression for reasoning models

    Keqin Xu, Huanqi Zhang, Lu Hou, Lifeng Shang, Xin Jiang, and Qun Liu. ThinKV: Thought-adaptive kv cache compression for reasoning models. arXiv preprint arXiv:2505.00675, 2025

  22. [22]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In NeurIPS, 2023