pith. machine review for the scientific record.

arxiv: 2605.09877 · v2 · submitted 2026-05-11 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords key-value means · block recurrence · transformer memory compression · long context modeling · linear attention · kv cache · recurrent transformer · chunked training

The pith

A transformer using averaged key-value blocks over fixed segments maintains competitive long-context accuracy with subquadratic prefill and sublinear state growth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Key-Value Means as a block-recurrence mechanism that replaces sequences of key-value pairs with their averages, creating a state that can remain fixed-size or expand as needed. This turns a standard transformer into a chunked recurrent model that trains and prefills in linear time per chunk while still supporting growing memory for longer contexts. The authors show that a version with an expandable KVM cache matches full-attention baselines on long-context benchmarks, all implemented with ordinary matrix operations and no custom kernels. Readers care because the approach offers a continuous trade-off between memory use and speed without discarding the expandability that distinguishes transformers from fixed-state linear RNNs.

Core claim

Key-Value Means compresses the attention cache by replacing blocks of key-value pairs with their vector means, supporting either constant-size state or an expandable growing cache. When inserted into a transformer, the fixed-size version produces an O(N) chunked recurrent network with negligible extra parameters. The growable version, trained end-to-end, reaches competitive scores on long-context tasks while requiring only subquadratic prefill time and sublinear state growth during decoding.

What carries the argument

Key-Value Means (KVM), the replacement of key-value sequences by their block-wise averages to form a compressed recurrent attention state.
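
As a rough illustration of that machinery, here is a minimal sketch of block-wise KV averaging: all but a recent window of the cache is collapsed into per-block means. The block size, the retained window, and the array shapes are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def compress_kv(keys, values, block_size, keep_recent):
    """Replace all but the most recent `keep_recent` KV pairs with block-wise means."""
    n, d = keys.shape
    cut = max(0, n - keep_recent)
    cut -= cut % block_size                      # only compress whole blocks
    mean_k = keys[:cut].reshape(-1, block_size, d).mean(axis=1)
    mean_v = values[:cut].reshape(-1, block_size, d).mean(axis=1)
    return (np.concatenate([mean_k, keys[cut:]]),
            np.concatenate([mean_v, values[cut:]]))

rng = np.random.default_rng(0)
k, v = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
ck, cv = compress_kv(k, v, block_size=16, keep_recent=128)
print(k.shape, "->", ck.shape)   # (1024, 64) -> (184, 64): 896 old tokens become 56 means
```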

If this is right

  • KVM layers can replace standard attention on every layer to reduce KV-cache memory footprint.
  • Prefill complexity can be set anywhere between O(N) and O(N^2) by choosing block size, allowing tunable speed-memory trade-offs (a rough sketch of this trade-off follows the list).
  • Hybrid models that combine KVM layers with linear RNN layers gain sublinear memory growth for extended context lengths.
  • Chunk-wise parallel training and prefill remain available without custom kernels or changes to the optimizer.
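
To make the complexity bullet concrete, the back-of-the-envelope estimate below (Pith's arithmetic, not the paper's formulas) counts retained state entries and attention-score evaluations as a function of block size; the 128-token uncompressed window and the exact counting rule are assumed details.

```python
# block_size = 1 recovers the full quadratic cache; very large blocks approach
# a near-fixed-size recurrence, with intermediate sizes interpolating between them.
def kvm_prefill_estimate(n_tokens, block_size, window=128):
    """Rough count of retained KV entries and attention scores for one prefill."""
    # each token attends to its local window plus all earlier block means
    scores = sum(min(t, window) + max(0, t - window) // block_size
                 for t in range(n_tokens))
    state = min(n_tokens, window) + max(0, n_tokens - window) // block_size
    return state, scores

for b in (1, 16, 256):
    state, scores = kvm_prefill_estimate(8192, b)
    print(f"block_size={b:4d}  final state entries={state:6d}  score evaluations~{scores}")
```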

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same averaging idea could be applied to other attention variants to test whether block compression generalizes beyond standard softmax attention.
  • Because state growth is sublinear, the method may enable practical deployment of longer contexts on memory-limited hardware without full retraining.
  • If block averaging works well, future work could explore learned or adaptive block sizes that preserve more signal in critical regions.

Load-bearing premise

Averaging keys and values over blocks preserves enough information for the model to stay competitive with full attention on long-context tasks.

What would settle it

A clear performance gap versus the full-attention baseline on a long-context retrieval benchmark such as needle-in-a-haystack, widening as block size increases, would show that the averaging discards necessary details.

Figures

Figures reproduced from arXiv: 2605.09877 by Daniel Goldstein, Eugene Cheah.

Figure 1. KVM attention mask across both causal BSWA and growing KVM state.
Figure 2. Examples of fixed, power-law, and saturating KVM state-budget schedules.
Figure 3. TextbookChapters mean loss per 1024-token block.
Figure 4. TextbookChapters GPTAlpha-2/SWA mean loss per 1024-token block.
read the original abstract

We present Key-Value Means ("KVM"), a novel block-recurrence for attention that can accommodate either fixed-size or growing state. Equipping a strong transformer baseline with fixed-size KVM attention layers yields a strong $O(N)$ chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long-context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk-wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk-wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV-cache memory, and allowing a continuous range of choices of prefill time complexity between $O(N)$ and $O(N^2)$. It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, to supplement the LRNN with improved sublinear memory growth context length usage and long context decoding. We release our code at https://github.com/recursal/KVM-paper and trained models at https://huggingface.co/collections/recursal/key-value-means under the Apache 2.0 license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Key-Value Means (KVM), a block-recurrent attention mechanism that replaces per-token key-value pairs with block-wise averages to support either fixed-size or growable compressed memory states in transformers. It claims this yields O(N) chunked RNN behavior with subquadratic prefill, sublinear state growth, competitive long-context performance, and compatibility with standard operations, chunk-parallel training/prefill, and hybrid LRNN use.

Significance. If the performance claims are validated, KVM would provide a practical, parameter-light bridge between full-attention transformers and linear RNNs, enabling tunable memory-compute trade-offs and reduced KV-cache footprint while preserving expandability. The public release of code and trained models is a clear strength for reproducibility.

major comments (3)
  1. [Abstract and §4 (Experiments)] The central claim that a growable KVM cache 'performs competitively on long-context tests' is unsupported by any reported metrics, baselines, ablation tables, or error bars; without these the competitiveness assertion cannot be evaluated.
  2. [§3.2 (KVM definition)] Block-wise averaging of keys and values is presented without any analysis, bound, or empirical measurement of accumulated approximation error across layers or cache merges; this bears directly on whether sublinear state growth preserves the information needed for the claimed long-context accuracy.
  3. [§3.1 and §4] The stated O(N) prefill and sublinear growth are asserted but not accompanied by wall-clock timings, memory-scaling plots, or comparisons against full attention and other compressed-memory baselines, leaving the complexity claims unverified.
minor comments (2)
  1. [§3] Notation for block size and merge operations could be made more explicit in the equations to aid re-implementation.
  2. [Figures] Figure captions should explicitly state which baseline each curve corresponds to.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight areas where additional empirical support and analysis will strengthen the manuscript. We will revise the paper to include the requested metrics, error analysis, and scaling measurements while preserving the core technical contributions.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim that a growable KVM cache 'performs competitively on long-context tests' is unsupported by any reported metrics, baselines, ablation tables, or error bars; without these the competitiveness assertion cannot be evaluated.

    Authors: We agree that the current version does not include sufficient quantitative results to fully support the competitiveness claim for the growable KVM variant. In the revision we will add a dedicated long-context evaluation section with tables reporting perplexity or accuracy on standard benchmarks (e.g., LongBench, PG-19), direct comparisons against full-attention baselines and recent compressed-memory methods, ablation studies on block size and merge frequency, and error bars from at least three random seeds. These additions will allow readers to evaluate the claim directly. revision: yes

  2. Referee: [§3.2 (KVM definition)] Block-wise averaging of keys and values is presented without any analysis, bound, or empirical measurement of accumulated approximation error across layers or cache merges; this bears directly on whether sublinear state growth preserves the information needed for the claimed long-context accuracy.

    Authors: The observation is correct: §3.2 currently presents the averaging operation without accompanying error analysis. We will expand this section with (1) a simple analytic bound on the per-merge approximation error under standard assumptions on key/value distributions, (2) empirical measurements of cosine similarity and downstream task degradation after repeated merges across multiple layers, and (3) a short ablation showing how error accumulates with cache size. These additions will clarify the information-preservation properties of sublinear growth; a minimal sketch of the similarity measurement appears after these responses. revision: yes

  3. Referee: [§3.1 and §4] The stated O(N) prefill and sublinear growth are asserted but not accompanied by wall-clock timings, memory-scaling plots, or comparisons against full attention and other compressed-memory baselines, leaving the complexity claims unverified.

    Authors: We acknowledge the absence of concrete runtime and memory measurements. The revised manuscript will include (a) wall-clock prefill timings on sequences up to 128k tokens for KVM versus full attention, (b) memory-footprint scaling plots demonstrating sublinear growth, and (c) side-by-side comparisons against both full attention and representative linear/compressed baselines. All measurements will be obtained on the same hardware to ensure fair verification of the claimed complexity benefits. revision: yes
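
As a hedged illustration of the measurement proposed in response (2), the sketch below compares plain softmax attention outputs computed against a full KV set with outputs computed against block-averaged KV and reports their cosine similarity; the random data, dimensions, and single-merge setup are assumptions for illustration, not the paper's protocol.

```python
import numpy as np

def softmax_attn(q, k, v):
    """Plain softmax attention, used only to probe the effect of block averaging."""
    scores = q @ k.T / np.sqrt(k.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(1)
n, d, block = 512, 64, 16
q = rng.normal(size=(8, d))                    # a few probe queries
k, v = rng.normal(size=(n, d)), rng.normal(size=(n, d))
mk = k.reshape(-1, block, d).mean(axis=1)      # block-averaged keys
mv = v.reshape(-1, block, d).mean(axis=1)      # block-averaged values

full, approx = softmax_attn(q, k, v), softmax_attn(q, mk, mv)
cos = (full * approx).sum(-1) / (np.linalg.norm(full, axis=-1) * np.linalg.norm(approx, axis=-1))
print("mean cosine similarity after one block-averaging pass:", cos.mean())
```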

Circularity Check

0 steps flagged

The KVM proposal is an independent architectural construction with no load-bearing reductions to self-defined quantities.

full rationale

The paper introduces Key-Value Means as a block-recurrence attention mechanism that replaces per-token KV pairs with block averages, enabling fixed or growable state. No equations are presented that define a quantity in terms of itself or rename a fitted parameter as a prediction. Central claims rest on empirical training results and external code release rather than any self-citation chain or uniqueness theorem imported from prior author work. The derivation is therefore self-contained as a design choice whose validity is tested against long-context benchmarks outside the method's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that mean-based block compression retains sufficient information for attention; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Averaging keys and values over blocks preserves enough information for competitive attention performance.
    This assumption underpins the claim that the compressed recurrent state can replace full attention without major capability loss.

pith-pipeline@v0.9.0 · 5537 in / 1146 out tokens · 42349 ms · 2026-05-14T20:55:55.678846+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors
