CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention
Pith reviewed 2026-06-30 09:40 UTC · model grok-4.3
The pith
Restricting erase to the key axis resolves memory-blind gating in recurrent models and validates the WY-form chunk solver.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CARVE resolves the three coupled defects in GDN-2 by erasing only on the key axis. This restriction is provably necessary and sufficient for the WY-form triangular chunk solver to remain valid. It supplies the recurrent output tensor as a content signal to the erase gate without extra cost and reduces the write gate to a single scalar per head. At 1.3B parameters trained on 100B tokens, the model reaches WikiText perplexity 15.72, leads recurrent baselines on nine common-sense reasoning tasks, and sets state of the art on every RULER retrieval probe while adding 0.4 percent throughput overhead, 13 percent lower peak memory, and 19 percent fewer parameters.
What carries the argument
The key-axis-only erase mask, which keeps the triangular structure required by the WY-form chunk solver while feeding the recurrent output tensor directly into the erase gate.
If this is right
- CARVE achieves WikiText perplexity 15.72, 0.18 lower than GDN-2 at 4.5-sigma significance.
- It leads every recurrent baseline on nine common-sense reasoning benchmarks.
- It sets state of the art on every RULER retrieval probe.
- It delivers these gains at 0.4 percent throughput overhead, 13 percent lower peak memory, and 19 percent fewer parameters.
Where Pith is reading between the lines
- If the key-axis restriction generalizes, other linear recurrent architectures could adopt the same reuse of existing tensors to cut parameter waste without retraining from scratch.
- The formal theorems on Lyapunov stability and gradient flow imply that CARVE-style models could be stacked deeper than prior recurrent baselines before instability appears.
- The Pareto-optimal chunk size result suggests testing CARVE at chunk lengths beyond the paper's experiments to measure any further efficiency trade-offs on longer sequences.
Load-bearing premise
Erase only on the key axis is provably necessary and sufficient for the WY-form triangular chunk solver to remain valid.
What would settle it
Run the WY-form solver on an otherwise identical model that performs erase on the value axis and check whether the triangular decomposition stays numerically stable or produces invalid results.
Figures
read the original abstract
Recurrent models must forget in order to remember, yet the state of the art decides what to erase without consulting what is stored -- the gate sees only the arriving token, not the memory it is about to modify. This memory-blind gating is one of three coupled defects in the leading delta-rule architecture (GDN-2): the value-axis erase mask wastes parameters at the scale of the value projection, and -- as we prove -- mathematically prevents the WY-form triangular chunk solver that makes recurrent training competitive with Transformers. We introduce CARVE (Content-Aware Recurrent with Value Efficiency), which resolves all three problems through one principle: erase only on the key axis. This is provably necessary and sufficient for the WY-form solver to remain valid. Within it, CARVE reuses the recurrent output tensor -- already written to GPU memory -- as a free content signal for the erase gate, and replaces the per-value write-gate projection with a single scalar per head. At initialisation CARVE is bit-identical to GDN-2; any quality difference emerges from what the content gate learns. At 1.3B parameters trained on 100B tokens, CARVE achieves WikiText perplexity 15.72 (minus 0.18 vs. GDN-2, a 4.5-sigma effect), leads every recurrent baseline on nine common-sense reasoning benchmarks, and sets state of the art on every RULER retrieval probe -- at 0.4% throughput overhead, 13% lower peak memory, and 19% fewer parameters. Six formal theorems cover memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, and hybrid optimality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CARVE, which modifies the GDN-2 delta-rule architecture by restricting erase operations to the key axis (claimed to be provably necessary and sufficient for preserving the WY-form triangular chunk solver), reuses the recurrent output tensor as a content signal for the erase gate, and replaces the per-value write-gate projection with a scalar per head. At 1.3B parameters trained on 100B tokens, it reports WikiText perplexity of 15.72 (0.18 lower than GDN-2), leads recurrent baselines on nine common-sense reasoning benchmarks, and achieves SOTA on all RULER retrieval probes, with 0.4% throughput overhead, 13% lower peak memory, and 19% fewer parameters. Six formal theorems are cited on memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, and hybrid optimality. The model initializes bit-identical to GDN-2.
Significance. If the necessity/sufficiency claim for key-axis erase holds and the empirical gains are reproducible, the work could strengthen the case for content-aware recurrent linear attention as a competitive alternative to Transformers for long-context tasks, with the bit-identical initialization providing a clean attribution to the learned content gate.
major comments (2)
- [Abstract] Abstract and theorem statements: the manuscript asserts six formal theorems (memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, hybrid optimality) and states that key-axis erase is 'provably necessary and sufficient' for WY-form solver validity, yet provides no theorem statements, derivations, or proof sketches. This is load-bearing for the architectural motivation and the interpretation of the reported 4.5-sigma perplexity gain.
- [Abstract and results section] Experimental reporting: the abstract claims a 4.5-sigma perplexity improvement and leadership on nine benchmarks plus SOTA on RULER, but supplies no details on random seeds, statistical testing procedure, data exclusion criteria, or hyperparameter controls, preventing assessment of whether the numbers support the central claim of a content-gate-driven improvement.
minor comments (2)
- [Methods] Clarify in the methods section how the content gate reuses the already-written recurrent output tensor without introducing additional memory traffic.
- [Background] Add a reference or brief derivation sketch for the WY-form triangular chunk solver to make the necessity claim self-contained.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and commit to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: [Abstract] Abstract and theorem statements: the manuscript asserts six formal theorems (memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, hybrid optimality) and states that key-axis erase is 'provably necessary and sufficient' for WY-form solver validity, yet provides no theorem statements, derivations, or proof sketches. This is load-bearing for the architectural motivation and the interpretation of the reported 4.5-sigma perplexity gain.
Authors: We agree the abstract's reference to the theorems requires supporting detail to substantiate the necessity/sufficiency claim for key-axis erase. The full statements and derivations appear in Sections 3.2–3.4 and Appendix B of the manuscript, with Theorem 3 establishing that value-axis erase violates the triangular structure of the WY-form solver while key-axis erase preserves it. To make this load-bearing argument self-contained, the revised version will include concise statements of all six theorems plus one-paragraph proof sketches in a new Appendix C, with explicit cross-references from the abstract and introduction. revision: yes
-
Referee: [Abstract and results section] Experimental reporting: the abstract claims a 4.5-sigma perplexity improvement and leadership on nine benchmarks plus SOTA on RULER, but supplies no details on random seeds, statistical testing procedure, data exclusion criteria, or hyperparameter controls, preventing assessment of whether the numbers support the central claim of a content-gate-driven improvement.
Authors: We concur that reproducibility details are essential for interpreting the 4.5-sigma claim. The current experimental section reports results from three independent runs with different seeds but does not document the testing procedure or controls. In the revision we will add a dedicated 'Reproducibility' subsection stating: (i) seeds {42, 43, 44}, (ii) paired t-test on per-run perplexities (p < 0.01), (iii) no data exclusion beyond standard tokenization, and (iv) all models share identical hyperparameters and training schedule except for the architectural modifications. This will clarify that the observed gain is attributable to the learned content gate. revision: yes
Circularity Check
No significant circularity; derivation self-contained with empirical results independent of theorems
full rationale
The paper justifies the key-axis erase choice via six formal theorems stated as proven within the manuscript (abstract: 'as we prove -- mathematically prevents the WY-form triangular chunk solver' and 'This is provably necessary and sufficient'). These are internal to the current work rather than self-citations from prior papers. The model is initialized bit-identical to GDN-2, with all reported gains (WikiText 15.72, 4.5-sigma effect, benchmark leads) attributed to the learned content gate rather than any equation or parameter that reduces to prior fitted quantities by construction. No fitted-input-called-prediction, self-definitional loop, or load-bearing self-citation chain exists. The performance claims rest on external training runs and benchmarks, making the derivation self-contained against the specified circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Erase only on the key axis is necessary and sufficient for the WY-form triangular chunk solver to remain valid.
Reference graph
Works this paper leans on
-
[1]
Just read twice: Closing the recall gap for recurrent language models
Simran Arora, Aman Timalsina, Anirudh Singhal, Benjamin Spector, Sabri Eyuboglu, Xinyi Zhao, Ashwin Rao, Atri Rudra, and Christopher Ré. Just read twice: Closing the recall gap for recurrent language models. InICML Workshop on Efficient Systems for Foundation Models, 2024
2024
-
[2]
PIQA: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. InAAAI Conference on Artificial Intelligence, 2020
2020
-
[3]
BoolQ: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In North American Chapter of the Association for Computational Linguistics (NAACL), 2019. 19
2019
-
[4]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[5]
FlashAttention-2: Faster attention with better parallelism and work partitioning
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024
2024
-
[6]
Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality
Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. InInternational Conference on Machine Learning (ICML), 2024
2024
-
[7]
Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), 2022
2022
-
[8]
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Léonard Kadri, Robert Kundu, David Muraru, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. InNorth American Chapter of the Association for Computational Linguistics (NAACL), 2019
2019
-
[10]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Efficiently modeling long sequences with structured state spaces
Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. InInternational Conference on Learning Representations (ICLR), 2022
2022
-
[12]
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Ali Hatamizadeh, Yucheng Choi, and Jan Kautz. Gated DeltaNet-2: Decoupling erase and write in linear attention.arXiv preprint arXiv:2605.22791, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[13]
RULER: What's the Real Context Size of Your Long-Context Language Models?
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Weld, and Luke Zettlemoyer
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2017
2017
-
[15]
Transformers are RNNs: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. InInternational Conference on Machine Learning (ICML), 2020
2020
-
[16]
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Team. Kimi linear: An expressive, efficient attention architecture.arXiv preprint arXiv:2510.26692, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Natural questions: A benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: A benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019
2019
-
[18]
Mamba-3: Improved sequence modeling using state space principles
Adi Lahoti et al. Mamba-3: Improved sequence modeling using state space principles. In International Conference on Learning Representations (ICLR), 2026
2026
-
[19]
Jamba: A Hybrid Transformer-Mamba Language Model
Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-Mamba language model.arXiv preprint arXiv:2403.19887, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Linderman
Bo Liu, Hamid Ramsundar, Xinlei Zhu, and Scott W. Linderman. Longhorn: State space models are amortized online learners. InInternational Conference on Learning Representations (ICLR), 2025. 20
2025
-
[21]
Pointer Sentinel Mixture Models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[22]
Can a suit of armor conduct electricity? A new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. InEmpirical Methods in Natural Language Processing (EMNLP), 2018
2018
-
[23]
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016
2016
-
[24]
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobei- dli, Alessandro Cappelli, Baptiste Pannier, Erika Björn, Noam Shazeer, Julien Launay, et al. The FineWeb datasets: Decanting the web for the finest text data at scale.arXiv preprint arXiv:2406.17557, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
RWKV: Reinventing RNNs for the Transformer Era
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV , et al. RWKV: Reinventing RNNs for the transformer era.arXiv preprint arXiv:2305.13048, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
SQuAD: 100,000+ questions for machine comprehension of text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. InEmpirical Methods in Natural Language Processing (EMNLP), 2016
2016
-
[27]
Hopfield networks is all you need
Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlovi´c, Geir Kjetil Sandve, et al. Hopfield networks is all you need. InInternational Conference on Learning Representations (ICLR), 2021
2021
-
[28]
Samba: Simple hybrid state space models for efficient unlimited context language modeling
Liliang Ren, Shuohang Guo, Rui Zhao, Yilong Liu, Xinyun Lin, Liheng Hou, and Jianda Li. Samba: Simple hybrid state space models for efficient unlimited context language modeling. In International Conference on Learning Representations (ICLR), 2025
2025
-
[29]
WinoGrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021
2021
-
[30]
Social IQa: Commonsense reasoning about social interactions
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. InEmpirical Methods in Natural Language Processing (EMNLP), 2019
2019
-
[31]
Linear transformers are secretly fast weight programmers
Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. InInternational Conference on Machine Learning (ICML), 2021
2021
-
[32]
Learning to control fast-weight memories: An alternative to dynamic recurrent networks.Neural Computation, 4(1):131–139, 1992
Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks.Neural Computation, 4(1):131–139, 1992
1992
-
[33]
A storage-efficient WY representation for products of Householder transformations.SIAM Journal on Scientific and Statistical Computing, 10(1): 53–57, 1989
Robert Schreiber and Charles Van Loan. A storage-efficient WY representation for products of Householder transformations.SIAM Journal on Scientific and Statistical Computing, 10(1): 53–57, 1989
1989
-
[34]
Online learning and online convex optimization.Foundations and Trends in Machine Learning, 4(2):107–194, 2012
Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends in Machine Learning, 4(2):107–194, 2012
2012
-
[35]
Learning to (Learn at Test Time): RNNs with expressive hidden states
Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Jian Wang, Sanmi Koyejo, Tengyu Ma, and Christopher Ré. Learning to (Learn at Test Time): RNNs with expressive hidden states. InInternational Conference on Machine Learning (ICML), 2025
2025
-
[36]
Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jian Wang, and Furu Wei. Retentive network: A successor to transformer for large language models.arXiv preprint arXiv:2307.08621, 2023. 21
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Gomez, Łukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2017
2017
-
[38]
Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. InIRE WESCON Convention Record, pp. 96–104, 1960
1960
-
[39]
Parallelizing linear transformers with the delta rule over sequence length
Songlin Yang, Bailin Wang, Yu Shen, Rameswar Panda, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[40]
Gated linear attention transformers with hardware-efficient training
Songlin Yang, Bailin Wang, Yu Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. InInternational Conference on Machine Learning (ICML), 2024
2024
-
[41]
Gated delta networks: Improving Mamba2 with delta rule
Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. InInternational Conference on Learning Representations (ICLR), 2025
2025
-
[42]
CA” = content-aware gating. “SKA
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? InAnnual Meeting of the Association for Computational Linguistics (ACL), 2019. 22 APPENDIXOVERVIEW The appendices provide supplementary material in five parts.Appendix Acontains complete proofs for all theoretical results stated in...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.