Recognition: no theorem link
Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Pith reviewed 2026-05-10 16:39 UTC · model grok-4.3
The pith
KV caches in language models can be compressed sequentially far beyond per-vector entropy limits by exploiting the model's own token predictions and shared prefixes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce sequential KV compression, a two-layer architecture that exploits the fact that tokens in the KV cache are samples from the model's formal language. The first layer performs probabilistic prefix deduplication across sessions using the trie metric d_T(s, s') = -log_2 P_M(s ^ s') from Probabilistic Language Tries. The second layer applies predictive delta coding, storing only the residual of each new KV vector from the model's prediction, which yields the entropy bound H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}). At perplexity 10-20 this bound equals 3.3-4.3 bits per token position on average, producing a claimed theoretical compression ratio of approximately 914,000x over per-vector quantization (TurboQuant) at the Shannon limit.
What carries the argument
Probabilistic Language Tries (PLTs) with the trie metric d_T(s, s') = -log_2 P_M(s ^ s') for identifying shared prefixes, combined with predictive delta coding that encodes only the model's own prediction residual for each KV vector.
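As a concreteness check, the trie metric can be sketched against a toy next-token model. The bigram table below is invented purely for illustration and stands in for querying the actual model distribution P_M; only the formula d_T(s, s') = -log_2 P_M(s ^ s'), with ^ read as longest common prefix, is taken from the paper.

```python
import math

# Toy stand-in for the model's next-token distribution P_M(token | prefix).
# A real implementation would query the language model itself.
BIGRAM = {
    ("<s>", "the"): 0.5, ("<s>", "a"): 0.5,
    ("the", "cat"): 0.6, ("the", "dog"): 0.4,
    ("a", "cat"): 0.3, ("a", "dog"): 0.7,
}

def seq_log2_prob(tokens):
    """log2 P_M(tokens) under the toy bigram model, via the chain rule."""
    lp, prev = 0.0, "<s>"
    for t in tokens:
        lp += math.log2(BIGRAM[(prev, t)])
        prev = t
    return lp

def shared_prefix(s, s2):
    """s ^ s': the longest common prefix of two token sequences."""
    out = []
    for a, b in zip(s, s2):
        if a != b:
            break
        out.append(a)
    return out

def trie_metric(s, s2):
    """d_T(s, s') = -log2 P_M(s ^ s')."""
    return -seq_log2_prob(shared_prefix(s, s2))

# Sequences sharing the prefix ["the"] (probability 0.5) sit at distance 1 bit;
# sequences sharing no prefix (empty prefix, probability 1) sit at distance 0.
print(trie_metric(["the", "cat"], ["the", "dog"]))  # 1.0
```

Note that under this definition a less probable shared prefix yields a larger distance, which is one reason the minor comment below asks for the operator's semantics to be spelled out.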
Load-bearing premise
The model is assumed to be a near-optimal predictor so that the conditional entropy of each KV vector given prior ones is at most the conditional entropy of the corresponding token, and the trie metric is assumed to identify semantically equivalent prefixes without loss.
What would settle it
Run the predictive delta coding on real KV vectors generated by a language model on a large held-out text corpus and measure whether the average bits per token position stay within the claimed 3.3-4.3 range.
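The measurement target is easy to state: the claimed 3.3-4.3 bit range is simply log2 of perplexity 10-20, and the experiment would compare the delta coder's realized bits per token position against the model's average negative log-probability on the held-out text. A minimal sketch of that baseline quantity:

```python
import math

def avg_bits_per_token(token_probs):
    """Average -log2 p over the probabilities the model assigned to each
    held-out token; this equals log2(perplexity) on that text."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# The claimed 3.3-4.3 bit range is just log2 of perplexity 10-20:
low, high = math.log2(10), math.log2(20)
print(round(low, 2), round(high, 2))  # 3.32 4.32
```

The test passes only if the delta-coded KV residuals, measured the same way, land inside that interval.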
read the original abstract
Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that actually matters: compressing the KV cache as a sequence. The tokens stored in a KV cache are not arbitrary floating-point data -- they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language. We introduce sequential KV compression, a two-layer architecture that exploits this structure. The first layer, probabilistic prefix deduplication, identifies semantically equivalent shared prefixes across sessions using the trie metric d_T(s, s') = -log_2 P_M(s ^ s') from Probabilistic Language Tries (PLTs). The second layer, predictive delta coding, stores only the residual of each new KV vector from the model's own prediction of it, achieving a per-token entropy bound of H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}). We prove that at typical language model perplexity -- approximately 10-20 for fluent English text -- this bound is 3.3-4.3 bits on average per token position, compared to TurboQuant's 3 bits per vector component (with typical attention heads having 64-128 components). The theoretical compression ratio over TurboQuant is approximately 914,000x at the Shannon limit. Even at 1000x above the entropy floor -- a deliberately pessimistic worst-case overhead, two orders of magnitude above the 2-5x typical of practical source coders -- the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows. The two layers are orthogonal and compose with existing per-vector quantization methods including TurboQuant.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces sequential KV cache compression using Probabilistic Language Tries for prefix deduplication and predictive delta coding. It claims that this achieves a per-token entropy bound of 3.3-4.3 bits based on H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}), leading to theoretical compression ratios of approximately 914,000x over TurboQuant at the Shannon limit and 914x at 1000x overhead, with the method being orthogonal to per-vector quantization techniques.
Significance. If the entropy bound holds and the compression ratios are correctly derived, this approach could significantly advance efficient inference for long-context transformers by exploiting the sequential and predictive structure of the KV cache. The compositionality with methods like TurboQuant is a positive aspect, potentially allowing hybrid systems. The core idea of using the model's language prediction for delta coding is promising for reducing memory in autoregressive generation.
major comments (3)
- [Abstract] The headline compression ratio of ~914,000x at the Shannon limit contradicts the arithmetic from the stated figures: 3.3-4.3 bits per token position versus 3 bits/component × 64-128 components (192-384 bits/vector) yields a maximum ratio of ~116x. No derivation, scaling with context length, or other factors are provided to justify the 914,000x figure.
- [Abstract] The asserted proof of the inequality H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}) lacks derivation steps in the manuscript. It follows immediately from the premise that the model is a near-optimal predictor, raising the concern that the bound is tautological under the given assumptions rather than independently established.
- [Abstract] The claim that 'compression improving rather than degrading as context length grows' is stated without accompanying analysis, equations, or empirical results demonstrating this property of the two-layer architecture.
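The arithmetic behind the first major comment can be checked directly from the figures quoted in the abstract; nothing below is taken from outside the paper's own numbers.

```python
# Figures quoted from the abstract:
bound_low, bound_high = 3.3, 4.3      # claimed bits per token position
turbo_bits = 3                        # TurboQuant bits per vector component
dims_low, dims_high = 64, 128         # typical attention-head dimension

best = turbo_bits * dims_high / bound_low    # most favorable case
worst = turbo_bits * dims_low / bound_high   # least favorable case
print(round(worst), round(best))  # 45 116 -- far from 914,000
```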
minor comments (1)
- The definition of the trie metric d_T(s, s') = -log_2 P_M(s ^ s') would benefit from an explicit explanation of the shared prefix operator ^ and how it identifies semantically equivalent prefixes without loss.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive major comments. We address each point below and agree that revisions are required to clarify the abstract and strengthen the manuscript. We will incorporate the necessary corrections, derivations, and supporting analysis in the revised version.
read point-by-point responses
-
Referee: [Abstract] The headline compression ratio of ~914,000x at the Shannon limit contradicts the arithmetic from the stated figures: 3.3-4.3 bits per token position versus 3 bits/component × 64-128 components (192-384 bits/vector) yields a maximum ratio of ~116x. No derivation, scaling with context length, or other factors are provided to justify the 914,000x figure.
Authors: We acknowledge the arithmetic inconsistency in the headline compression ratio. The stated 914,000x figure does not follow from the per-token entropy bound of 3.3-4.3 bits compared to TurboQuant's 3 bits per component across 64-128 components (192-384 bits per vector), which yields a ratio on the order of 50-116x. We will revise the abstract and main text to remove the erroneous 914,000x claim, provide an explicit step-by-step derivation of the ratio (including how many vectors comprise the KV state per token), and correct the 914x figure at 1000x overhead to be consistent with the corrected base ratio. No additional scaling factors with context length were intended or present in the original derivation. revision: yes
-
Referee: [Abstract] The asserted proof of the inequality H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}) lacks derivation steps in the manuscript. It follows immediately from the premise that the model is a near-optimal predictor, raising the concern that the bound is tautological under the given assumptions rather than independently established.
Authors: We will add the missing derivation steps to the manuscript. The inequality is established as follows: given the prior KV state (which fully encodes the token history for the deterministic transformer), the next KV vector(s) KV_{i+1} are produced by a deterministic function of the sampled next token token_{i+1}. By the standard information-theoretic property that entropy cannot increase under a deterministic mapping, H(KV_{i+1} | KV_{<=i}) ≤ H(token_{i+1} | token_{<=i}). The model's near-optimality is used only to interpret the right-hand side as approximately 3.3-4.3 bits (from typical perplexity); the inequality itself holds independently of optimality. This is not tautological, as it rigorously links KV-cache entropy to token entropy via the model's architecture. We will include this proof sketch in the revised abstract and body. revision: yes
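The load-bearing step in this rebuttal is the standard fact that a deterministic map cannot increase entropy. A tiny numeric check, with an invented many-to-one map f standing in for the token-to-KV function (in the actual architecture the map is deterministic given the history, so the inequality may hold with equality):

```python
import math
from collections import Counter

def entropy_bits(dist):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# X ~ next token; f is a deterministic map standing in for token -> KV vector.
px = {"a": 0.5, "b": 0.25, "c": 0.25}
f = {"a": "v1", "b": "v2", "c": "v2"}   # two tokens collide under f

pfx = Counter()
for x, p in px.items():
    pfx[f[x]] += p

# H(f(X)) <= H(X): 1.0 <= 1.5 here.
print(entropy_bits(px), entropy_bits(dict(pfx)))  # 1.5 1.0
```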
-
Referee: [Abstract] The claim that 'compression improving rather than degrading as context length grows' is stated without accompanying analysis, equations, or empirical results demonstrating this property of the two-layer architecture.
Authors: We agree that this claim requires supporting material. The improvement with context length is expected from the first layer (probabilistic prefix deduplication via the trie metric d_T), as longer histories increase the probability of shared prefixes and thus higher deduplication rates. We will add a brief analysis (including an equation for expected storage reduction as a function of context length under the PLT distribution) and, where feasible, empirical curves from our experiments demonstrating the trend. If space constraints apply, we will qualify the statement and move the supporting material to an appendix. revision: yes
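The mechanism the authors appeal to can be illustrated with an exact-match prefix trie; the paper's probabilistic version merges semantically equivalent prefixes via d_T, but the storage effect is the same in kind. The sessions below are invented for illustration.

```python
def trie_storage(sessions):
    """Token positions actually stored when sessions share a prefix trie,
    versus storing every session independently."""
    trie, stored, total = {}, 0, 0
    for session in sessions:
        node = trie
        for tok in session:
            total += 1
            if tok not in node:
                node[tok] = {}
                stored += 1
            node = node[tok]
    return stored, total

# Three sessions sharing a two-token system prompt:
sessions = [["sys", "prompt", "q1"],
            ["sys", "prompt", "q2"],
            ["sys", "prompt", "q3"]]
print(trie_storage(sessions))  # (5, 9): the shared prefix is stored once
```

The dedup rate improves as more sessions reuse the same prefix, which is the trend the revised manuscript would need to quantify.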
Circularity Check
No circularity: bound follows from standard inequality and external perplexity value
full rationale
The paper derives the per-token bit bound via the information-theoretic fact that KV vectors are a deterministic function of the token sequence, yielding H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}), then substitutes the known cross-entropy implied by typical LM perplexity (10-20) to obtain the numerical range 3.3-4.3 bits. This is a direct application of H(f(X)) <= H(X) plus an external benchmark; it does not define the left-hand side in terms of the right-hand side or fit parameters to the target quantity. The subsequent compression ratios are obtained by scaling the per-vector TurboQuant cost by the number of KV vectors per token position (layers × heads × 2) and dividing by the bound, with no self-citation, ansatz, or renaming step required for the central claim. The architecture (trie deduplication + delta coding) is presented as a concrete mechanism whose information-theoretic limit is bounded by the above, but the limit itself is not tautological with the mechanism.
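The scaling step described above (per-vector cost x layers x heads x 2, divided by the bound) can be made concrete with hypothetical model dimensions; the shape below is illustrative only and is not stated in the paper, which is precisely why the derivation of the 914,000x headline needs to be explicit.

```python
# Hypothetical model shape -- illustrative only, not stated in the paper:
layers, heads, head_dim = 32, 32, 128
kv_vectors_per_token = layers * heads * 2          # keys and values

turbo_bits_per_token = 3 * head_dim * kv_vectors_per_token
bound_bits_per_token = 3.3                         # lower end of the claimed bound

print(turbo_bits_per_token)                                # 786432
print(round(turbo_bits_per_token / bound_bits_per_token))  # 238313
```

Even with these generous dimensions the ratio lands near 2 x 10^5, not 914,000, so some factor in the paper's accounting remains unstated.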
Axiom & Free-Parameter Ledger
free parameters (2)
- perplexity range = 10-20
- overhead factor = 1000
axioms (2)
- domain assumption The transformer model is by construction a near-optimal predictor of the formal language it was trained on.
- domain assumption Tokens stored in the KV cache are samples from the exact formal language the model was trained on.
invented entities (2)
- Probabilistic Language Tries (PLTs): no independent evidence
- predictive delta coding: no independent evidence
Reference graph
Works this paper leans on
- [1] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024.
- [2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
- [3] Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Joel Veness, Arthur Gretton, et al. Language modeling is compression. arXiv preprint arXiv:2309.10668, 2023.
- [4] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079, 2024.
- [5] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS Symposium on Operating Systems Principles, 2023.
- [6] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. arXiv preprint arXiv:2211.17192, 2023.
- [7] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024.
- [8] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2-bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024.
- [9] Gregory Magarshak. Probabilistic language tries: A unified framework for compression, decision policies, and execution reuse. arXiv preprint arXiv:2604.06228 [cs.LG], 2026. Submitted 29 March 2026.
- [10] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In Proceedings of Machine Learning and Systems, volume 5, 2023.
- [11] Claude E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951.
- [12] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987.
- [13] Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni, Praneeth Kacham, Lars Gottesbüren, and Rajesh Jayaram. TurboQuant: Online vector quantization with near-optimal distortion rate. In International Conference on Learning Representations (ICLR), 2026. arXiv:2504.19874.
- [14] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, volume 36, 2023.
discussion (0)