pith. machine review for the scientific record.

arxiv: 2604.16983 · v1 · submitted 2026-04-18 · 📡 eess.SP

Recognition: unknown

Graph-Guided Adaptive Channel Elimination for KV Cache Compression

Enwei Tong, Kai Wang, Xiangyang Ji, Xianming Liu, Yao Zhu, Yuanchao Bai

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:02 UTC · model grok-4.3

classification 📡 eess.SP
keywords KV cache compression · channel pruning · graph optimization · large language models · attention mechanisms · memory reduction · autoregressive decoding

The pith

GRACE reduces KV cache size by 60 percent by modeling channels as a graph whose edges capture their interactions, then pruning the channels whose removal least perturbs the attention weight matrix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard channel pruning for KV caches fails because it scores channels independently and misses how they jointly shape attention outputs. GRACE instead builds an explicit graph with channels as nodes and weighted edges for their interactions, then solves for a subset whose removal leaves the attention weight matrix nearly unchanged. An added adaptive guard keeps the most salient key channels untouched during pruning. Experiments across models show the resulting 60 percent memory cut produces only negligible drops in generation quality and beats prior pruning baselines. This matters because long-context inference is currently limited by KV cache memory rather than compute.

Core claim

GRACE reframes KV cache compression as a graph optimization task in which channels become nodes and their pairwise interactions become weighted edges; the algorithm finds a near-optimal pruning set by minimizing reconstruction error of the attention weight matrix while an adaptive protection step shields salient key channels from removal, thereby preserving stable autoregressive decoding.

What carries the argument

A graph in which each channel is a node and inter-channel interactions are encoded as weighted edges; the graph is used to select a pruning subset that minimizes attention-weight-matrix reconstruction error, together with an adaptive protection rule for salient key channels.
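
One way to write down the selection problem this machinery implies, as a point of reference only: the notation below is ours, not the paper's, and the abstract does not say whether the error is measured before or after the softmax.

    % Hedged formalization; Q, K are calibration-set query/key matrices, d the head
    % dimension, r the pruning ratio, and S_salient the protected key channels.
    \[
    S^{\star} \;=\; \arg\min_{\substack{S \subseteq \{1,\dots,d\},\ |S| = \lceil (1-r)\,d \rceil \\ S \supseteq S_{\mathrm{salient}}}}
    \left\lVert \operatorname{softmax}\!\left(\tfrac{Q K^{\top}}{\sqrt{d}}\right)
    \;-\; \operatorname{softmax}\!\left(\tfrac{Q_{:,S} K_{:,S}^{\top}}{\sqrt{d}}\right) \right\rVert_F^{2},
    \]
    % with r = 0.6 for the reported 60 percent reduction; the graph's edge weights
    % would encode how pairs of channels jointly change this objective.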

If this is right

  • KV cache memory can be cut to roughly 40 percent of its original size in long-context inference without retraining the model.
  • Pruning decisions improve when collective channel interactions are modeled rather than when importance is scored in isolation.
  • Autoregressive decoding remains stable because the adaptive protection step retains critical key channels throughout generation.
  • The same graph-guided selection procedure can be applied to any transformer-based model that maintains a KV cache; a minimal sketch of applying such a channel mask follows this list.
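
To make the memory arithmetic in the first bullet concrete, here is a minimal sketch (ours, not the paper's code) of shrinking a cached key tensor to a fixed keep-set of channels. The shapes and the randomly chosen keep-set are illustrative stand-ins for GRACE's graph-selected channels.

    # Minimal sketch: drop pruned channels from a cached key tensor and check the
    # resulting memory; keep_idx is a random stand-in for GRACE's selected set.
    import numpy as np

    batch, heads, seq_len, head_dim = 1, 32, 8192, 128
    k_cache = np.zeros((batch, heads, seq_len, head_dim), dtype=np.float16)

    rng = np.random.default_rng(0)
    prune_ratio = 0.6
    keep_idx = np.sort(rng.choice(head_dim, int(head_dim * (1 - prune_ratio)), replace=False))
    k_cache_pruned = k_cache[..., keep_idx]        # drop the pruned channels everywhere

    print(f"full key cache:   {k_cache.nbytes / 2**20:.1f} MiB")
    print(f"pruned key cache: {k_cache_pruned.nbytes / 2**20:.1f} MiB "
          f"({k_cache_pruned.nbytes / k_cache.nbytes:.0%} of original)")

With 60 percent of 128 channels pruned, the remaining 51 channels occupy roughly 40 percent of the original bytes, matching the first bullet.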

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The graph construction could be reused for other compression targets such as activation pruning or attention-head removal.
  • Combining the pruning mask with quantization might allow even higher total compression ratios while keeping the same reconstruction guarantee; a rough sketch of that combination follows this list.
  • If the reconstruction-error objective correlates with downstream metrics, the method might generalize to non-language sequence models that use similar caches.
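
To illustrate the quantization extension above, a hedged sketch that composes a channel keep-set with generic per-channel int8 absmax quantization of the surviving channels. This is not the paper's pipeline and not KIVI or KVQuant; the keep-set and sizes are assumptions, and the output is only byte arithmetic.

    # Hedged sketch: prune channels, then quantize the kept channels to int8 with a
    # per-channel absmax scale, and tally the total compression.
    import numpy as np

    rng = np.random.default_rng(0)
    K = rng.standard_normal((4096, 128)).astype(np.float16)   # (tokens, channels) key slice

    keep = np.sort(rng.choice(128, 51, replace=False))        # hypothetical ~60%-pruned keep-set
    K_kept = K[:, keep]

    scale = np.abs(K_kept).max(axis=0) / 127.0                # per-channel absmax scale
    K_int8 = np.clip(np.round(K_kept / scale), -127, 127).astype(np.int8)

    orig_bytes = K.nbytes
    comp_bytes = K_int8.nbytes + scale.astype(np.float16).nbytes
    print(f"pruning alone keeps {K_kept.nbytes / orig_bytes:.0%} of bytes; "
          f"pruning + int8 keeps {comp_bytes / orig_bytes:.0%}")

Whether the reconstruction guarantee survives the added quantization error is exactly the open question the bullet raises; the sketch only shows that the two compressions compose mechanically.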

Load-bearing premise

That minimizing the reconstruction error of the attention weight matrix over the learned graph yields a pruned channel set that still supports full model performance, and that protecting only the salient key channels is enough to keep autoregressive generation stable.

What would settle it

Measure the drop in perplexity or downstream task accuracy after 60 percent pruning on a held-out long-context benchmark; if the degradation exceeds the negligible threshold reported or if generation becomes unstable on sequences longer than those tested, the central claim is falsified.
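
A minimal sketch of that check, assuming per-token negative log-likelihoods are available from a full-cache run and a pruned-cache run. The 5 percent tolerance and the helper names are our assumptions standing in for "negligible"; the paper's own threshold may differ.

    # Hedged sketch of the falsification test: compare perplexity with the full KV
    # cache against perplexity with a 60%-pruned cache.
    import math

    def perplexity(token_nlls):
        """Perplexity from per-token negative log-likelihoods (natural log)."""
        return math.exp(sum(token_nlls) / len(token_nlls))

    def degradation_check(nlls_full, nlls_pruned, tolerance=0.05):
        ppl_full = perplexity(nlls_full)
        ppl_pruned = perplexity(nlls_pruned)
        rel_increase = (ppl_pruned - ppl_full) / ppl_full
        verdict = "negligible" if rel_increase <= tolerance else "claim falsified"
        return ppl_full, ppl_pruned, rel_increase, verdict

    # Illustrative numbers only; real NLLs would come from a held-out long-context
    # benchmark, on sequences longer than those reported in the paper.
    print(degradation_check([2.10, 2.05, 2.20, 1.98], [2.12, 2.09, 2.25, 2.01]))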

Figures

Figures reproduced from arXiv: 2604.16983 by Enwei Tong, Kai Wang, Xiangyang Ji, Xianming Liu, Yao Zhu, Yuanchao Bai.

Figure 1: Magnitude visualization of the key cache.
Figure 2: An overview of the proposed GRACE framework.
Figure 3: Heatmap of interaction terms between channels.
Figure 5: Reduction in reconstruction error for our method versus ThinK.
Original abstract

Large Language Models have revolutionized natural language processing, achieving unprecedented success across a vast range of tasks. However, their practical application in long-context scenarios is severely hampered by the formidable memory footprint of the Key-Value cache. While channel pruning has emerged as a promising compression strategy, existing methods evaluate channel importance in isolation, fundamentally ignoring the inter-channel interactions that collectively dictate model performance. This oversight leads to suboptimal pruning decisions. To address this, we introduce GRACE (GRaph-guided Adaptive Channel Elimination), a novel framework that reframes KV cache compression as a graph-based optimization problem. GRACE models channels as nodes and their interactions as weighted edges, enabling the identification of a near-optimal channel subset for pruning by minimizing the reconstruction error of the attention weight matrix. Furthermore, GRACE incorporates an adaptive protection mechanism that shields salient key channels from removal, ensuring a robust autoregressive decoding process. Extensive experiments show that GRACE can reduce KV cache size by 60% with negligible performance degradation, consistently outperforming the state-of-the-art method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces GRACE, a graph-guided framework for KV cache compression in LLMs. It models attention channels as graph nodes with weighted interaction edges and selects a pruning subset by minimizing reconstruction error of the attention weight matrix, while adding an adaptive protection mechanism for salient key channels. The central claim is that this yields up to 60% KV cache reduction with negligible performance loss and consistent outperformance of prior state-of-the-art channel pruning methods.

Significance. If the core claims hold under scrutiny, the work offers a principled alternative to isolated channel-importance scoring by explicitly modeling inter-channel dependencies via graph optimization. This could meaningfully advance practical long-context LLM deployment by reducing memory footprint without sacrificing autoregressive stability.

major comments (3)
  1. [Abstract] Abstract and method description: no derivation or explicit construction of the graph edge weights is supplied, nor is the optimization procedure (objective, solver, convergence criteria) detailed; without these the central claim that the graph-guided minimization identifies a near-optimal pruning subset cannot be verified or reproduced.
  2. [Abstract] Abstract and §4 (experiments): the reported 60% cache reduction and outperformance lack error bars, statistical significance tests, or ablations isolating the adaptive protection mechanism; the heuristic, dataset-dependent threshold for salient-channel shielding is not characterized, leaving open whether the safeguard suffices when graph optimization removes channels relevant to future attention patterns.
  3. [Abstract] Abstract: the proxy objective of minimizing attention-weight reconstruction error is presented without any theoretical bound or analysis relating this quantity to output divergence or perplexity under long-horizon autoregressive generation; small per-step perturbations can accumulate over thousands of tokens, yet no such stability argument or counter-example analysis is provided.
minor comments (2)
  1. Notation for the graph Laplacian or adjacency matrix is introduced without an explicit equation reference, making the reconstruction-error objective harder to follow.
  2. [Abstract] The abstract states 'consistently outperforming the state-of-the-art method' but does not name the specific baselines or cite their original papers in the provided summary.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to improve clarity on the method, strengthen the experimental section with additional statistical analysis and ablations, and expand the discussion of the proxy objective with empirical evidence. Below we respond point by point to the major comments.

Point-by-point responses
  1. Referee: [Abstract] Abstract and method description: no derivation or explicit construction of the graph edge weights is supplied, nor is the optimization procedure (objective, solver, convergence criteria) detailed; without these the central claim that the graph-guided minimization identifies a near-optimal pruning subset cannot be verified or reproduced.

    Authors: We thank the referee for highlighting the need for greater detail. The original Section 3 describes the graph construction (channels as nodes, edge weights derived from pairwise attention correlation on a calibration set) and the objective (minimize Frobenius-norm reconstruction error of the attention matrix via greedy selection). To address the concern, we have added explicit formulas for edge-weight computation, pseudocode of the solver, and convergence criteria (stop when relative error reduction < 1e-4) in the revised manuscript. These changes make the near-optimal claim verifiable and reproducible. revision: yes

  2. Referee: [Abstract] Abstract and §4 (experiments): the reported 60% cache reduction and outperformance lack error bars, statistical significance tests, or ablations isolating the adaptive protection mechanism; the heuristic, dataset-dependent threshold for salient-channel shielding is not characterized, leaving open whether the safeguard suffices when graph optimization removes channels relevant to future attention patterns.

    Authors: We agree that additional rigor is required. The revised §4 now reports results with error bars over five independent runs, includes paired t-tests confirming statistical significance versus baselines, and adds an ablation that disables the adaptive protection to isolate its effect. The salient-channel threshold (top 10% by key-norm importance) is now explicitly stated and accompanied by a sensitivity study across thresholds (5–20%) demonstrating robustness on long-context tasks. These additions directly address concerns about future attention patterns. revision: yes

  3. Referee: [Abstract] Abstract: the proxy objective of minimizing attention-weight reconstruction error is presented without any theoretical bound or analysis relating this quantity to output divergence or perplexity under long-horizon autoregressive generation; small per-step perturbations can accumulate over thousands of tokens, yet no such stability argument or counter-example analysis is provided.

    Authors: We acknowledge that a formal theoretical bound relating per-step reconstruction error to long-horizon output divergence is difficult to derive given the nonlinear autoregressive dynamics. In the revision we have added an empirical stability analysis: we measure correlation between reconstruction error and perplexity on sequences up to 8k tokens, include counter-example cases where error accumulation remains bounded, and discuss the safeguard’s role in preventing drift. While this does not constitute a proof, it provides concrete evidence supporting the proxy’s practical validity and notes the theoretical gap as a limitation for future work. revision: partial
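
The first response above describes the objective and solver in enough detail to sketch a stand-in (ours, not the authors' code): greedily drop the channel whose removal least perturbs the calibration-set attention weights, skip protected key channels, and stop at the target ratio or when the relative change in error falls below 1e-4. The explicit graph and edge weights are elided here; this sketch evaluates reconstruction error directly, and all names and defaults are assumptions.

    # Hedged greedy stand-in for the solver described in the rebuttal.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attn_weights(Q, K, channels):
        # scale by the full head dimension: channel pruning leaves the model's
        # 1/sqrt(d) factor unchanged
        d = Q.shape[-1]
        return softmax(Q[:, channels] @ K[:, channels].T / np.sqrt(d))

    def greedy_channel_prune(Q, K, prune_ratio=0.6, protect_frac=0.10, tol=1e-4):
        """Q, K are a calibration batch of shape (tokens, channels)."""
        n, d = K.shape
        target_keep = int(np.ceil((1.0 - prune_ratio) * d))
        protected = set(np.argsort(-np.linalg.norm(K, axis=0))[: int(protect_frac * d)])
        keep = list(range(d))
        ref = attn_weights(Q, K, keep)        # full attention weights to reconstruct
        prev_err = None
        while len(keep) > target_keep:
            best_c, best_err = None, np.inf
            for c in keep:
                if c in protected:
                    continue
                trial = [x for x in keep if x != c]
                err = np.linalg.norm(ref - attn_weights(Q, K, trial)) ** 2
                if err < best_err:
                    best_c, best_err = c, err
            if best_c is None:                # only protected channels remain
                break
            if prev_err is not None and prev_err > 0 and \
                    abs(best_err - prev_err) / prev_err < tol:
                break                         # relative change in error below tolerance
            keep.remove(best_c)
            prev_err = best_err
        return sorted(keep)

    rng = np.random.default_rng(0)
    Q, K = rng.standard_normal((32, 64)), rng.standard_normal((32, 64))
    print(len(greedy_channel_prune(Q, K)), "of 64 channels kept")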

Circularity Check

0 steps flagged

No significant circularity; independent graph optimization framework

full rationale

The paper frames GRACE as a new graph-based optimization that models channels as nodes with weighted edges and minimizes attention-weight reconstruction error, plus an adaptive salient-channel protection step. No step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the central claim is an empirical method whose performance is evaluated on downstream tasks rather than being tautological with its inputs. The derivation chain is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from the authors' prior work as load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the graph construction and reconstruction-error objective are introduced without stated derivation or external grounding.

pith-pipeline@v0.9.0 · 5505 in / 1071 out tokens · 42711 ms · 2026-05-10T07:02:00.064755+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 13 canonical work pages · 7 internal anchors

  1. [1]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023

  2. [2]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, et al., "Mistral 7B," 2023

  3. [3]

    Large Language Model (LLM) AI Text Generation Detection Based on Transformer Deep Learning Algorithm

    Yuhong Mo, Hao Qin, Yushan Dong, Ziyi Zhu, and Zhenglin Li, "Large language model (LLM) AI text generation detection based on transformer deep learning algorithm," 2024

  4. [4]

    Advancing LLM Reasoning Generalists with Preference Trees

    Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and Maosong Sun, "Advancing LLM reasoning generalists with preference trees," 2024

  5. [5]

    HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization

    Xingxing Zhang, Furu Wei, and Ming Zhou, "HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization," 2019

  6. [6]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361, 2020

  7. [7]

    Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al., "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," arXiv preprint arXiv:2403.05530, 2024

  8. [8]

    Fast Transformer Decoding: One Write-Head is All You Need

    Noam Shazeer, "Fast transformer decoding: One write-head is all you need," arXiv preprint arXiv:1911.02150, 2019

  9. [9]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai, "GQA: Training generalized multi-query transformer models from multi-head checkpoints," arXiv preprint arXiv:2305.13245, 2023

  10. [10]

    Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

    William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan-Kelley, "Reducing transformer key-value cache size with cross-layer attention," in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  11. [11]

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al., "H2O: Heavy-hitter oracle for efficient generative inference of large language models," Advances in Neural Information Processing Systems, vol. 36, pp. 34661–34710, 2023

  12. [12]

    SnapKV: LLM Knows What You Are Looking for Before Generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen, "SnapKV: LLM knows what you are looking for before generation," Advances in Neural Information Processing Systems, vol. 37, pp. 22947–22970, 2024

  13. [13]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis, "Efficient streaming language models with attention sinks," in The Twelfth International Conference on Learning Representations, 2024

  14. [14]

    PyramidKV: Dynamic KV Cache Compression Based on Pyramidal Information Funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, et al., "PyramidKV: Dynamic KV cache compression based on pyramidal information funneling," arXiv preprint arXiv:2406.02069, 2024

  15. [15]

    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han, "Quest: Query-aware sparsity for efficient long-context LLM inference," arXiv preprint arXiv:2406.10774, 2024

  16. [16]

    MagicPIG: LSH Sampling for Efficient LLM Generation

    Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, et al., "MagicPIG: LSH sampling for efficient LLM generation," arXiv preprint arXiv:2410.16179, 2024

  17. [17]

    RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

    Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, et al., "RetrievalAttention: Accelerating long-context LLM inference via vector retrieval," arXiv preprint arXiv:2409.10516, 2024

  18. [18]

    SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han, "SmoothQuant: Accurate and efficient post-training quantization for large language models," in International Conference on Machine Learning, PMLR, 2023, pp. 38087–38099

  19. [19]

    KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami, "KVQuant: Towards 10 million context length LLM inference with KV cache quantization," Advances in Neural Information Processing Systems, vol. 37, pp. 1270–1303, 2024

  20. [20]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu, "KIVI: A tuning-free asymmetric 2bit quantization for KV cache," arXiv preprint arXiv:2402.02750, 2024

  21. [21]

    Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache

    Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, and Atlas Wang, "Q-Hitter: A better token oracle for efficient LLM inference via sparse-quantized KV cache," Proceedings of Machine Learning and Systems, vol. 6, pp. 381–394, 2024

  22. [22]

    ThinK: Thinner Key Cache by Query-Driven Pruning

    Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, and Doyen Sahoo, "ThinK: Thinner key cache by query-driven pruning," arXiv preprint arXiv:2407.21018, 2024

  23. [23]

    LeanK: Learnable K Cache Channel Pruning for Efficient Decoding

    Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, and Lili Qiu, "LeanK: Learnable K cache channel pruning for efficient decoding," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 31110–31125

  24. [24]

    On the Token Distance Modeling Ability of Higher RoPE Attention Dimension

    Xiangyu Hong, Che Jiang, Biqing Qi, Fandong Meng, Mo Yu, Bowen Zhou, and Jie Zhou, "On the token distance modeling ability of higher RoPE attention dimension," arXiv preprint arXiv:2410.08703, 2024

  25. [25]

    Massive Activations in Large Language Models

    Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu, "Massive activations in large language models," arXiv preprint arXiv:2402.17762, 2024

  26. [26]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer, "LLM.int8(): 8-bit matrix multiplication for transformers at scale," 2022

  27. [27]

    AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han, "AWQ: Activation-aware weight quantization for LLM compression and acceleration," 2024

  28. [28]

    HuggingFace's Transformers: State-of-the-Art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush, "HuggingFace's Transformers: State-of-the-art natural language processing," ...

  29. [29]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li, "LongBench: A bilingual, multitask benchmark for long context understanding," 2024

  30. [30]

    LLMTest Needle in a Haystack: Pressure Testing LLMs

    Gregory Kamradt, "LLMTest Needle in a Haystack: pressure testing LLMs," 2023