IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3
The pith
By clustering semantically related tokens, IceCache retains 99 percent of full KV-cache accuracy on long-context tasks with a budget of just 256 tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IceCache integrates semantic token clustering with PagedAttention by organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure. This enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. On LongBench, a 256-token budget retains 99% of the full-KV-cache accuracy, and the method achieves competitive or superior latency and accuracy while using only 25% of the KV-cache token budget of prior offloading methods.
What carries the argument
Semantic token clustering integrated with PagedAttention through a hierarchical, dynamically updatable data structure that groups related tokens for efficient KV cache selection and transfers.
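To make that mechanism concrete, here is a minimal sketch of what a cluster-to-page registry of this kind could look like. The cosine-similarity assignment rule, the normalized running-mean centroids, the 16-token page size, and all names below are assumptions for illustration; the paper's actual hierarchical structure is not specified here.

```python
import numpy as np

PAGE_SIZE = 16  # tokens per page; an assumed value, not taken from the paper


class ClusterPageRegistry:
    """Toy registry that groups token indices into semantic clusters and
    maps each cluster onto the contiguous pages holding its KV entries."""

    def __init__(self, dim: int, threshold: float = 0.7):
        self.dim = dim
        self.threshold = threshold  # minimum cosine similarity to join a cluster
        self.centroids = []         # normalized running-mean key vector per cluster
        self.members = []           # token indices stored in each cluster
        self.counts = []            # cluster sizes

    def add_token(self, token_idx: int, key: np.ndarray) -> int:
        """Assign a new token's key to the closest cluster, or open a new one;
        this is the 'dynamically updatable' part of the structure."""
        key = key / (np.linalg.norm(key) + 1e-8)
        if self.centroids:
            sims = np.array([float(c @ key) for c in self.centroids])
            best = int(sims.argmax())
            if sims[best] >= self.threshold:
                n = self.counts[best]
                centroid = (self.centroids[best] * n + key) / (n + 1)
                self.centroids[best] = centroid / (np.linalg.norm(centroid) + 1e-8)
                self.members[best].append(token_idx)
                self.counts[best] += 1
                return best
        self.centroids.append(key)
        self.members.append([token_idx])
        self.counts.append(1)
        return len(self.centroids) - 1

    def pages_for_clusters(self, cluster_ids) -> set:
        """Pages that would have to move over the CPU-GPU link if these
        clusters are selected; contiguous storage keeps this set small."""
        pages = set()
        for cid in cluster_ids:
            for slot in range(len(self.members[cid])):
                pages.add((cid, slot // PAGE_SIZE))
        return pages


# toy usage: cluster 100 random keys, then ask what cluster 0 would cost to fetch
rng = np.random.default_rng(0)
reg = ClusterPageRegistry(dim=64, threshold=0.3)
for t in range(100):
    reg.add_token(t, rng.normal(size=64))
print(len(reg.centroids), "clusters;", len(reg.pages_for_clusters([0])), "pages for cluster 0")
```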
If this is right
- Long-generation tasks such as chain-of-thought reasoning become practical on GPUs with limited memory.
- The KV-cache token budget can be cut to 25 percent of what prior offloading methods use while keeping latency and accuracy competitive.
- CPU-GPU bandwidth is used more efficiently because semantically grouped tokens allow better transfer patterns.
- The approach supports scaling inference to longer contexts without proportional memory growth.
Where Pith is reading between the lines
- The clustering could be combined with other compression techniques to achieve even smaller cache footprints.
- The method may help maintain coherence in multi-turn dialogues where token relevance changes gradually.
- Hardware with higher CPU-GPU bandwidth would amplify the latency gains from the improved transfer efficiency.
- Similar hierarchical grouping ideas might apply to managing memory in other transformer components like activations.
Load-bearing premise
Semantic token clustering can reliably identify the tokens most important for future generation steps without introducing selection errors that compound over long autoregressive sequences.
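Operationally, that premise amounts to something like the selection rule below, a minimal sketch assuming relevance is scored by query-to-centroid cosine similarity and whole clusters are kept or dropped; the scoring rule and budget handling are illustrative, not the paper's exact policy.

```python
import numpy as np


def select_tokens(query, centroids, members, budget=256):
    """Keep whole clusters in order of query-centroid similarity until the
    token budget is filled; a mis-ranked cluster here is the kind of
    selection error that could compound over a long generation."""
    q = query / (np.linalg.norm(query) + 1e-8)
    scores = centroids @ q  # cosine similarity per cluster (centroids unit-norm)
    selected, used = [], 0
    for cid in np.argsort(-scores):
        toks = members[cid]
        if used + len(toks) > budget:
            continue  # cluster does not fit in the remaining budget
        selected.extend(toks)
        used += len(toks)
        if used == budget:
            break
    return selected


# toy usage: 8 clusters of 64 tokens each, 256-token budget -> 4 clusters kept
rng = np.random.default_rng(1)
centroids = rng.normal(size=(8, 64))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
members = [list(range(c * 64, (c + 1) * 64)) for c in range(8)]
kept = select_tokens(rng.normal(size=64), centroids, members, budget=256)
print(len(kept), "of 512 tokens kept")
```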
What would settle it
A long chain-of-thought generation task where the clustering fails to retain early tokens needed for the correct final answer, causing accuracy to drop well below 99 percent of the full-cache baseline on LongBench.
Original abstract
Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes IceCache, a KV-cache management method for long-sequence LLMs that combines semantic token clustering with PagedAttention and a hierarchical, dynamically updatable data structure. The central claim is that this enables efficient CPU-GPU token selection and transfers, allowing a 256-token budget (25% of the budget used by prior offloading baselines) to retain 99% of full-KV accuracy on LongBench while matching or exceeding the latency and accuracy of those baselines, especially on chain-of-thought tasks.
Significance. If the empirical results hold under rigorous controls, the work would be significant for practical long-context inference on memory-constrained hardware. The integration of semantic clustering for contiguous memory regions offers a plausible engineering improvement over purely attention-score or recency-based eviction. The public code release supports reproducibility and is a clear strength.
major comments (3)
- [Abstract / §4] Abstract and experimental evaluation: the headline result (99% accuracy retention at 256-token budget) is reported without any mention of run-to-run variance, number of random seeds, or how the token-budget threshold and clustering hyperparameters were selected or tuned. This directly affects verifiability of the central accuracy claim.
- [§3 / §4] §4 (Experiments) and §3 (Method): no ablation isolates the contribution of semantic clustering to error propagation across autoregressive steps, nor is there a comparison against an oracle attention-based selector. Given that the paper itself notes prior offloading methods degrade on CoT tasks, the absence of such controls leaves the weakest assumption (reliable identification of future-relevant tokens) untested and load-bearing for the long-sequence claims.
- [§3.2] §3.2 (hierarchical structure) and PagedAttention integration: the manuscript provides no analysis of how cluster boundaries interact with page-level CPU-GPU transfers or whether mis-clustered tokens can force additional page faults that compound latency. This interaction is central to the claimed bandwidth-efficiency advantage.
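One way to see why the cluster-to-page interaction matters: a back-of-envelope count of the pages touched by the same 256 selected tokens under a cluster-aligned layout versus a scattered one. The page size and sequence length are assumed values for illustration, not numbers from the paper.

```python
import random

PAGE_SIZE = 16       # tokens per page (assumed)
TOTAL_TOKENS = 4096  # offloaded sequence length (assumed)
BUDGET = 256         # tokens selected for the GPU


def pages_touched(slots, page_size=PAGE_SIZE):
    """Distinct pages that must cross the CPU-GPU link to fetch these slots."""
    return len({s // page_size for s in slots})


random.seed(0)
selected = random.sample(range(TOTAL_TOKENS), BUDGET)

# scattered layout: a token's physical slot is simply its sequence position
scattered = selected
# cluster-aligned layout: the selected tokens were packed contiguously beforehand
clustered = list(range(BUDGET))

print("scattered:", pages_touched(scattered), "pages")  # typically well over 150 of 256
print("clustered:", pages_touched(clustered), "pages")  # exactly 16
```

Mis-clustered tokens would push the clustered case back toward the scattered one, which is exactly the latency-compounding risk this comment raises.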
minor comments (2)
- [§3] Figure 2 (or equivalent architecture diagram) would benefit from explicit annotation of the dynamic update rules and cluster-to-page mapping to improve clarity of the hierarchical data structure.
- [§3.1] Notation for the clustering objective and eviction policy could be formalized with a short equation or pseudocode; the current prose description leaves the exact similarity metric and update frequency ambiguous.
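One way the requested formalization could read, with assumed notation (key k_t of the newly generated token, current query q_t, cluster centroid mu_c, cluster size n_c, similarity threshold tau, refresh interval u, token budget B); this is a sketch consistent with the prose, not notation taken from the paper.

```latex
% Assumed formalization; symbols are illustrative, not the paper's notation.
\begin{align*}
  c^{*}(t) &= \arg\max_{c}\ \cos\!\big(k_t, \mu_c\big), \\
  \text{assign}(k_t) &=
    \begin{cases}
      c^{*}(t) & \text{if } \cos\!\big(k_t, \mu_{c^{*}(t)}\big) \ge \tau,\\
      \text{new cluster} & \text{otherwise},
    \end{cases}\\
  \mu_{c} &\leftarrow \frac{n_{c}\,\mu_{c} + k_t}{n_{c} + 1}
    \quad \text{(refreshed every } u \text{ decode steps)},\\
  S_t &= \arg\max_{S:\,\sum_{c \in S} n_c \le B}\ \sum_{c \in S} \cos\!\big(q_t, \mu_c\big).
\end{align*}
```

Clusters outside S_t would be evicted to CPU memory until reselected.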
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to improve the clarity and rigor of the paper.
Point-by-point responses
Referee: [Abstract / §4] Abstract and experimental evaluation: the headline result (99% accuracy retention at 256-token budget) is reported without any mention of run-to-run variance, number of random seeds, or how the token-budget threshold and clustering hyperparameters were selected or tuned. This directly affects verifiability of the central accuracy claim.
Authors: We agree that reporting run-to-run variance and details on hyperparameter selection is essential for reproducibility and verifiability. In the revised manuscript, we will update the experimental section to include results averaged over three independent random seeds, along with standard deviations for the reported accuracy metrics. Additionally, we will describe how the 256-token budget was chosen (25% of the token budget used by the offloading baselines we compare against) and the process for selecting clustering hyperparameters, which involved a grid search on a held-out validation set from LongBench. A sensitivity analysis will also be included to show robustness. revision: yes
Referee: [§3 / §4] §4 (Experiments) and §3 (Method): no ablation isolates the contribution of semantic clustering to error propagation across autoregressive steps, nor is there a comparison against an oracle attention-based selector. Given that the paper itself notes prior offloading methods degrade on CoT tasks, the absence of such controls leaves the weakest assumption (reliable identification of future-relevant tokens) untested and load-bearing for the long-sequence claims.
Authors: We recognize the importance of isolating the effect of semantic clustering and providing stronger controls. In the revision, we will include new ablation experiments that compare IceCache against recency-based and random selection baselines within the same PagedAttention setup. These will quantify the impact on error accumulation over multiple autoregressive steps, with particular focus on chain-of-thought tasks. An exact oracle attention-based selector is not feasible to implement without incurring the full computational cost of the complete KV cache, as it would require access to future attention scores. Instead, we will benchmark against attention-score-based eviction strategies from existing literature to contextualize the benefits of semantic clustering. We will also provide an analysis of token relevance prediction accuracy and its effect on long-sequence performance. revision: partial
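For context on what such an attention-score-based comparator typically looks like, here is a sketch of a generic heavy-hitter eviction policy (accumulate attention mass per token and keep the top scorers plus attention sinks and a recency window). This is a stand-in for the literature baselines the response refers to, not IceCache's method or any specific paper's implementation; the sink and window sizes are assumptions.

```python
import numpy as np


def heavy_hitter_keep_set(attn_history, budget=256, sink=4, recent=32):
    """attn_history: (decode_steps, seq_len) attention weights over the prefix.
    Keeps attention sinks, a recency window, and the highest-scoring rest."""
    seq_len = attn_history.shape[1]
    scores = attn_history.sum(axis=0)                      # attention mass per token
    keep = set(range(min(sink, seq_len)))                  # initial "sink" tokens
    keep |= set(range(max(0, seq_len - recent), seq_len))  # most recent tokens
    for idx in np.argsort(-scores):
        if len(keep) >= budget:
            break
        keep.add(int(idx))
    return sorted(keep)


# toy usage: 8 decode steps attending over a 1024-token prefix, 256-token budget
rng = np.random.default_rng(2)
weights = rng.random((8, 1024))
weights /= weights.sum(axis=1, keepdims=True)
kept = heavy_hitter_keep_set(weights, budget=256)
print(len(kept), "tokens retained")  # 256
```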
Referee: [§3.2] §3.2 (hierarchical structure) and PagedAttention integration: the manuscript provides no analysis of how cluster boundaries interact with page-level CPU-GPU transfers or whether mis-clustered tokens can force additional page faults that compound latency. This interaction is central to the claimed bandwidth-efficiency advantage.
Authors: We agree that a detailed examination of the interplay between semantic clusters and PagedAttention's paging mechanism is necessary to substantiate the efficiency claims. In the revised §3.2 and experimental results, we will incorporate an analysis of page fault occurrences and CPU-GPU transfer volumes for clustered versus unclustered token management. The hierarchical data structure is designed to preserve contiguity within clusters to reduce fragmentation and associated page faults; we will present empirical measurements demonstrating that mis-clustering effects are mitigated by the dynamic update mechanism, resulting in minimal additional latency overhead. revision: yes
Circularity Check
No circularity in claimed derivation or results
full rationale
The paper describes an engineering method (semantic clustering + hierarchical PagedAttention structure) whose performance is measured empirically on external benchmarks such as LongBench. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described approach. The central claim (99% accuracy retention at 256-token budget) is presented as an experimental outcome rather than a quantity forced by construction from the method's own inputs. This is the expected non-finding for an applied systems paper without mathematical reduction steps.
Reference graph
Works this paper leans on
[1]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
[2]
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
[3]
MagicPIG: LSH Sampling for Efficient LLM Generation
Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, et al. MagicPIG: LSH sampling for efficient LLM generation. arXiv preprint arXiv:2410.16179, 2024.
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction
Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, and Yun Liang. ArkVale: Efficient generative LLM inference with recallable key-value eviction. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.
[4]
Memory-Efficient Transformers via Top-k Attention
Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, and Jonathan Berant. Memory-efficient transformers via top-k attention. arXiv preprint arXiv:2106.06899, 2021.
[5]
RULER: What's the Real Context Size of Your Long-Context Language Models?
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
[6]
Fast k-Nearest Neighbour Search via Prioritized DCI
Ke Li and Jitendra Malik. Fast k-nearest neighbour search via prioritized DCI. In International Conference on Machine Learning, pp. 2081-2090. PMLR, 2017.
[7]
SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. SCBench: A KV cache-centric analysis of long-context methods. arXiv preprint arXiv:2412.10319, 2024.
[8]
Landmark Attention: Random-Access Infinite Context Length for Transformers
Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300, 2023.
[9]
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774, 2024.
[10]
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
Yuhao Wu, Ming Shan Hee, Zhiqing Hu, and Roy Ka-Wei Lee. LongGenBench: Benchmarking long-form generation in long context LLMs. arXiv preprint arXiv:2409.02076, 2024.
[11]
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
[12]
PQCache: Product Quantization-Based KV Cache for Long Context LLM Inference
Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. PQCache: Product quantization-based KV cache for long context LLM inference. arXiv preprint arXiv:2407.12820, 2024.
[13]
Result excerpt from the paper (LongGenBench, Llama-3.1-8B-Instruct, 256-token budget): as shown in Table 6, IceCache substantially outperforms PQCache while maintaining accuracy on par with the Full-KV baseline.