pith. machine review for the scientific record.

arxiv: 2604.15356 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.AI · cs.IT · cs.NE · math.IT

Recognition: no theorem link

Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:39 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.IT · cs.NE · math.IT
keywords KV cache compression · probabilistic language tries · sequential compression · predictive delta coding · transformer inference · entropy bounds · prefix deduplication · language model predictions

The pith

KV caches in language models can be compressed sequentially far beyond per-vector entropy limits by exploiting the model's own token predictions and shared prefixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that KV caches store sequences drawn from the exact language a model was trained on, and since the model is a near-optimal predictor of that language, the cache itself can be compressed as a sequence rather than as independent vectors. This is achieved with a two-layer method: probabilistic prefix deduplication that finds shared prefixes across sessions using a trie-based distance, followed by predictive delta coding that records only the residual between each new KV vector and the model's prediction of it. At typical perplexities of 10-20, the resulting per-token bound is 3.3-4.3 bits, roughly two orders of magnitude below the 192-384 bits of quantizing a single vector's components separately. If the bound holds, KV cache memory use drops dramatically while preserving all information, and the savings grow with longer contexts instead of shrinking. The approach is orthogonal to existing per-vector quantization and can be stacked with it.
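
The bit figures here follow from a standard identity: an ideal sequential coder spends log_2(perplexity) bits per token. Checking both sides of the comparison takes a few lines of arithmetic, using only numbers stated in the abstract:

    import math

    # Per-token cost of an ideal sequential coder: log2(perplexity) bits.
    for ppl in (10, 20):
        print(f"perplexity {ppl}: {math.log2(ppl):.2f} bits/token")
    # perplexity 10: 3.32 bits/token
    # perplexity 20: 4.32 bits/token

    # Per-vector baseline from the abstract: 3 bits/component,
    # 64-128 components per attention head.
    for dim in (64, 128):
        print(f"{dim} components: {3 * dim} bits/vector")
    # 64 components: 192 bits/vector
    # 128 components: 384 bits/vector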

Core claim

We introduce sequential KV compression, a two-layer architecture that exploits the fact that tokens in the KV cache are samples from the model's formal language. The first layer performs probabilistic prefix deduplication across sessions using the trie metric d_T(s, s') = -log_2 P_M(s ^ s') from Probabilistic Language Tries. The second layer applies predictive delta coding, storing only the residual of each new KV vector from the model's prediction, which yields the entropy bound H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}). At perplexity 10-20 this bound equals 3.3-4.3 bits per token position on average, producing a theoretical compression ratio of approximately 914,000x over per-vector quantization at the Shannon limit.

What carries the argument

Probabilistic Language Tries (PLTs) with the trie metric d_T(s, s') = -log_2 P_M(s ^ s') for identifying shared prefixes, combined with predictive delta coding that encodes only the model's own prediction residual for each KV vector.
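
A minimal sketch of how this metric could be evaluated, assuming s ^ s' denotes the longest shared token prefix (the paper leaves the operator implicit, as the referee's minor comment notes) and that P_M factorizes autoregressively. The uniform next-token model here is a stand-in for a real language model:

    import math

    def shared_prefix(s, t):
        """Longest shared token prefix; the assumed meaning of s ^ s'."""
        n = 0
        for a, b in zip(s, t):
            if a != b:
                break
            n += 1
        return s[:n]

    def trie_distance(s, t, token_prob):
        """d_T(s, s') = -log2 P_M(s ^ s'), with P_M factorized token by token.

        token_prob(prefix, token) -> P_M(token | prefix) stands in for the model.
        """
        p = shared_prefix(s, t)
        log_p = sum(math.log2(token_prob(p[:i], tok)) for i, tok in enumerate(p))
        return -log_p

    # Toy stand-in: uniform next-token distribution over a 10-symbol vocabulary.
    uniform = lambda prefix, token: 0.1

    a = ["the", "cat", "sat", "on"]
    b = ["the", "cat", "slept"]
    print(trie_distance(a, b, uniform))  # 2 shared tokens -> 2 * log2(10) ≈ 6.64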

Load-bearing premise

The model is assumed to be a near-optimal predictor so that the conditional entropy of each KV vector given prior ones is at most the conditional entropy of the corresponding token, and the trie metric is assumed to identify semantically equivalent prefixes without loss.

What would settle it

Run the predictive delta coding on real KV vectors generated by a language model on a large held-out text corpus and measure whether the average bits per token position stay within the claimed 3.3-4.3 range.
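
A hedged sketch of the first half of such an experiment: measuring the right-hand side of the bound, the average bits per token a model assigns to held-out text. The model name and corpus path below are placeholders, and the delta-coding half is not reproducible from the abstract alone:

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # placeholder; any causal LM works
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    text = open("heldout.txt").read()  # hypothetical held-out corpus
    ids = tok(text, return_tensors="pt").input_ids[:, :1024]

    with torch.no_grad():
        loss = model(ids, labels=ids).loss.item()  # mean cross-entropy, nats/token
    print(f"{loss / math.log(2):.2f} bits/token (claimed range: 3.3-4.3)")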

Original abstract

Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that actually matters: compressing the KV cache as a sequence. The tokens stored in a KV cache are not arbitrary floating-point data -- they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language. We introduce sequential KV compression, a two-layer architecture that exploits this structure. The first layer, probabilistic prefix deduplication, identifies semantically equivalent shared prefixes across sessions using the trie metric d_T(s, s') = -log_2 P_M(s ^ s') from Probabilistic Language Tries (PLTs). The second layer, predictive delta coding, stores only the residual of each new KV vector from the model's own prediction of it, achieving a per-token entropy bound of H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}). We prove that at typical language model perplexity -- approximately 10-20 for fluent English text -- this bound is 3.3-4.3 bits on average per token position, compared to TurboQuant's 3 bits per vector component (with typical attention heads having 64-128 components). The theoretical compression ratio over TurboQuant is approximately 914,000x at the Shannon limit. Even at 1000x above the entropy floor -- a deliberately pessimistic worst-case overhead, two orders of magnitude above the 2-5x typical of practical source coders -- the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows. The two layers are orthogonal and compose with existing per-vector quantization methods including TurboQuant.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces sequential KV cache compression using Probabilistic Language Tries for prefix deduplication and predictive delta coding. It claims that this achieves a per-token entropy bound of 3.3-4.3 bits based on H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}), leading to theoretical compression ratios of approximately 914,000x over TurboQuant at the Shannon limit and 914x at 1000x overhead, with the method being orthogonal to per-vector quantization techniques.

Significance. If the entropy bound holds and the compression ratios are correctly derived, this approach could significantly advance efficient inference for long-context transformers by exploiting the sequential and predictive structure of the KV cache. The compositionality with methods like TurboQuant is a positive aspect, potentially allowing hybrid systems. The core idea of using the model's language prediction for delta coding is promising for reducing memory in autoregressive generation.

major comments (3)
  1. [Abstract] The headline compression ratio of ~914,000x at the Shannon limit contradicts the arithmetic from the stated figures: 3.3-4.3 bits per token position versus 3 bits/component × 64-128 components (192-384 bits/vector) yields a maximum ratio of ~116x. No derivation, scaling with context length, or other factors are provided to justify the 914,000x figure.
  2. [Abstract] The asserted proof of the inequality H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}) lacks derivation steps in the manuscript. It follows immediately from the premise that the model is a near-optimal predictor, raising the concern that the bound is tautological under the given assumptions rather than independently established.
  3. [Abstract] The claim that 'compression improving rather than degrading as context length grows' is stated without accompanying analysis, equations, or empirical results demonstrating this property of the two-layer architecture.
minor comments (1)
  1. The definition of the trie metric d_T(s, s') = -log_2 P_M(s ^ s') would benefit from an explicit explanation of the shared prefix operator ^ and how it identifies semantically equivalent prefixes without loss.
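
The arithmetic in major comment 1 can be checked directly:

    # Best case for the bound vs. worst case for the per-vector baseline.
    bound_bits = (3.3, 4.3)          # claimed per-token entropy bound
    vector_bits = (3 * 64, 3 * 128)  # 3 bits/component x 64-128 components

    best = vector_bits[1] / bound_bits[0]   # 384 / 3.3
    worst = vector_bits[0] / bound_bits[1]  # 192 / 4.3
    print(f"per-vector ratio range: {worst:.0f}x to {best:.0f}x")  # 45x to 116x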

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and constructive major comments. We address each point below and agree that revisions are required to clarify the abstract and strengthen the manuscript. We will incorporate the necessary corrections, derivations, and supporting analysis in the revised version.

Point-by-point responses
  1. Referee: [Abstract] The headline compression ratio of ~914,000x at the Shannon limit contradicts the arithmetic from the stated figures: 3.3-4.3 bits per token position versus 3 bits/component × 64-128 components (192-384 bits/vector) yields a maximum ratio of ~116x. No derivation, scaling with context length, or other factors are provided to justify the 914,000x figure.

    Authors: We acknowledge the arithmetic inconsistency in the headline compression ratio. The stated 914,000x figure does not follow from the per-token entropy bound of 3.3-4.3 bits compared to TurboQuant's 3 bits per component across 64-128 components (192-384 bits per vector), which yields a ratio on the order of 45-116x. We will revise the abstract and main text to remove the erroneous 914,000x claim, provide an explicit step-by-step derivation of the ratio (including how many vectors comprise the KV state per token), and correct the 914x figure at 1000x overhead to be consistent with the corrected base ratio. No additional scaling factors with context length were intended or present in the original derivation. revision: yes

  2. Referee: [Abstract] The asserted proof of the inequality H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}) lacks derivation steps in the manuscript. It follows immediately from the premise that the model is a near-optimal predictor, raising the concern that the bound is tautological under the given assumptions rather than independently established.

    Authors: We will add the missing derivation steps to the manuscript. The inequality is established as follows: given the prior KV state (which fully encodes the token history for the deterministic transformer), the next KV vector(s) KV_{i+1} are produced by a deterministic function of the sampled next token token_{i+1}. By the standard information-theoretic property that entropy cannot increase under a deterministic mapping, H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}). The model's near-optimality is used only to interpret the right-hand side as approximately 3.3-4.3 bits (from typical perplexity); the inequality itself holds independently of optimality. This is not tautological, as it rigorously links KV-cache entropy to token entropy via the model's architecture. We will include this proof sketch in the revised abstract and body; a compact version appears after these responses. revision: yes

  3. Referee: [Abstract] The claim that 'compression improving rather than degrading as context length grows' is stated without accompanying analysis, equations, or empirical results demonstrating this property of the two-layer architecture.

    Authors: We agree that this claim requires supporting material. The improvement with context length is expected from the first layer (probabilistic prefix deduplication via the trie metric d_T), as longer histories increase the probability of shared prefixes and thus higher deduplication rates. We will add a brief analysis (including an equation for expected storage reduction as a function of context length under the PLT distribution) and, where feasible, empirical curves from our experiments demonstrating the trend. If space constraints apply, we will qualify the statement and move the supporting material to an appendix. revision: yes
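
The derivation in response 2 compresses to two lines, written here under its stated assumptions (a deterministic transformer whose cache losslessly encodes the token history):

    % Assumes KV_{<=i} and token_{<=i} determine each other, and that
    % KV_{i+1} is a deterministic function of the history plus token_{i+1}.
    \begin{align*}
    H(\mathrm{KV}_{i+1} \mid \mathrm{KV}_{\le i})
      &= H(\mathrm{KV}_{i+1} \mid \mathrm{token}_{\le i})
        && \text{(cache and token history are informationally equivalent)} \\
      &\le H(\mathrm{token}_{i+1} \mid \mathrm{token}_{\le i})
        && \text{(given the history, } \mathrm{KV}_{i+1} \text{ is a function of } \mathrm{token}_{i+1})
    \end{align*}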

Circularity Check

0 steps flagged

No circularity: bound follows from standard inequality and external perplexity value

Full rationale

The paper derives the per-token bit bound via the information-theoretic fact that KV vectors are a deterministic function of the token sequence, yielding H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}), then substitutes the known cross-entropy implied by typical LM perplexity (10-20) to obtain the numerical range 3.3-4.3 bits. This is a direct application of H(f(X)) <= H(X) plus an external benchmark; it does not define the left-hand side in terms of the right-hand side or fit parameters to the target quantity. The subsequent compression ratios are obtained by scaling the per-vector TurboQuant cost by the number of KV vectors per token position (layers × heads × 2) and dividing by the bound, with no self-citation, ansatz, or renaming step required for the central claim. The architecture (trie deduplication + delta coding) is presented as a concrete mechanism whose information-theoretic limit is bounded by the above, but the limit itself is not tautological with the mechanism.
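
The layers × heads × 2 scaling is easy to make concrete. The dimensions below are illustrative, not taken from the paper, which does not state the configuration behind its 914,000x figure:

    # Per-token KV cost under per-vector quantization, following the
    # rationale's scaling: layers x heads x 2 (keys and values) vectors
    # per token position, each at 3 bits/component.
    layers, kv_heads, head_dim = 32, 32, 128  # hypothetical model config
    bits_per_component = 3

    per_token_bits = layers * kv_heads * 2 * head_dim * bits_per_component
    print(per_token_bits)  # 786432 bits per token position

    for bound in (3.3, 4.3):
        print(f"bound {bound} bits: ratio {per_token_bits / bound:,.0f}x")
    # bound 3.3 bits: ratio 238,313x
    # bound 4.3 bits: ratio 182,891x -- still short of 914,000x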

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the domain assumption that the model's predictive power applies directly to KV vectors and on the new entities introduced for the compression layers, with no independent evidence supplied for those entities.

free parameters (2)
  • perplexity range = 10-20
    Used to compute the 3.3-4.3 bits per token bound from typical LM perplexity of 10-20.
  • overhead factor = 1000
    Chosen as a deliberately pessimistic 1000x worst-case multiplier for practical source coders.
axioms (2)
  • domain assumption The transformer model is by construction a near-optimal predictor of the formal language it was trained on.
    Invoked to justify the entropy bound H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}).
  • domain assumption Tokens stored in the KV cache are samples from the exact formal language the model was trained on.
    Basis for treating the KV cache as language data rather than arbitrary floating-point values.
invented entities (2)
  • Probabilistic Language Tries (PLTs) no independent evidence
    purpose: Identify semantically equivalent shared prefixes across sessions using the trie metric d_T(s, s') = -log_2 P_M(s ^ s').
    First layer of the proposed architecture for prefix deduplication.
  • predictive delta coding no independent evidence
    purpose: Store only the residual between each new KV vector and the model's own prediction of it.
    Second layer achieving the per-token entropy bound.

pith-pipeline@v0.9.0 · 5653 in / 1915 out tokens · 76132 ms · 2026-05-10T16:39:51.169565+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

14 extracted references · 8 canonical work pages · 4 internal anchors

  1. [1]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024

  2. [2]

    Elements of Information Theory

    Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006

  3. [3]

    Language Modeling Is Compression

    Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Joel Veness, Arthur Gretton, et al. Language modeling is compression. arXiv preprint arXiv:2309.10668, 2023

  4. [4]

    KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079, 2024

  5. [5]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS Symposium on Operating Systems Principles, 2023

  6. [6]

    Fast Inference from Transformers via Speculative Decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. arXiv preprint arXiv:2211.17192, 2023

  7. [7]

    SnapKV: LLM Knows What You are Looking for Before Generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024

  8. [8]

    KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2-bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024

  9. [9]

    Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse

    Gregory Magarshak. Probabilistic language tries: A unified framework for compression, decision policies, and execution reuse. arXiv preprint arXiv:2604.06228 [cs.LG], 2026. Submitted 29 March 2026

  10. [10]

    Efficiently scaling transformer inference

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In Proceedings of Machine Learning and Systems, volume 5, 2023

  11. [11]

    Prediction and Entropy of Printed English

    Claude E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951

  12. [12]

    Arithmetic Coding for Data Compression

    Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, 1987

  13. [13]

    TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate

    Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni, Praneeth Kacham, Lars Gottesbüren, and Rajesh Jayaram. TurboQuant: Online vector quantization with near-optimal distortion rate. In International Conference on Learning Representations (ICLR), 2026. arXiv:2504.19874

  14. [14]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, volume 36, 2023