RWKV: Reinventing RNNs for the Transformer Era
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 10:48 UTC · model grok-4.3
The pith
The RWKV architecture achieves Transformer-level performance at 14 billion parameters while scaling linearly with sequence length at inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RWKV reformulates attention as a linear time-decay operation over receptance-weighted key-value pairs, allowing the same set of weights to be evaluated either in a parallel, Transformer-like form during training or as a recurrent model during inference that maintains O(1) memory and compute per token regardless of sequence length. Models trained this way reach performance parity with Transformers when scaled to 14 billion parameters.
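For orientation, here is a sketch of the time-mixing block the claim refers to, paraphrased from the paper's §3.2; the symbols (token-shift interpolations μ, per-channel decay w, bonus u, receptance r) follow the paper's notation, but the indexing below is our reconstruction rather than a quotation.

```latex
% Time-mixing sketch (paraphrase of Sec. 3.2; \odot is the elementwise product,
% \sigma the sigmoid, and w, u, \mu_* are learned per-channel parameters).
\begin{aligned}
r_t &= W_r\bigl(\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1}\bigr),\\
k_t &= W_k\bigl(\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1}\bigr),\\
v_t &= W_v\bigl(\mu_v \odot x_t + (1-\mu_v) \odot x_{t-1}\bigr),\\
wkv_t &= \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\,v_i \;+\; e^{u+k_t}\,v_t}
             {\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \;+\; e^{u+k_t}},\\
o_t &= W_o\bigl(\sigma(r_t) \odot wkv_t\bigr).
\end{aligned}
```

The receptance gate σ(r_t) controls how much of the decayed key-value aggregate wkv_t reaches the output, which is the sense in which attention becomes a linear time-decay operation.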
What carries the argument
Receptance Weighted Key Value (RWKV) linear attention, which computes each output token as an exponentially decayed weighted sum of all prior key-value pairs using a time-difference decay factor, enabling exact equivalence between the recurrent and parallel formulations.
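A minimal numerical sketch of that equivalence claim (not the authors' code; shapes, variable names, and the NumPy check are our assumptions): the parallel sum and a constant-state recurrence produce the same wkv value at every step.

```python
# Minimal sketch (assumed notation, not the authors' implementation): the WKV
# operator evaluated two ways -- a direct sum over all earlier tokens, and a
# recurrence that carries only a numerator/denominator pair per channel.
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 4                       # sequence length, channel dimension
w = rng.uniform(0.1, 1.0, D)      # per-channel decay (positive)
u = rng.normal(size=D)            # "bonus" applied to the current token
k = rng.normal(size=(T, D))
v = rng.normal(size=(T, D))

def wkv_parallel(t):
    """Direct weighted average over positions i <= t (the parallel, training-time form)."""
    num, den = np.exp(u + k[t]) * v[t], np.exp(u + k[t])
    for i in range(t):
        decay = np.exp(-(t - 1 - i) * w + k[i])
        num, den = num + decay * v[i], den + decay
    return num / den

# Recurrent, inference-time form: O(1) state and work per token.
a = np.zeros(D)                   # running numerator
b = np.zeros(D)                   # running denominator
for t in range(T):
    wkv_recurrent = (a + np.exp(u + k[t]) * v[t]) / (b + np.exp(u + k[t]))
    assert np.allclose(wkv_recurrent, wkv_parallel(t))
    a = np.exp(-w) * a + np.exp(k[t]) * v[t]   # decay state, absorb current token
    b = np.exp(-w) * b + np.exp(k[t])
print("recurrent and parallel WKV agree on all", T, "steps")
```

Real implementations add safeguards against overflow of the exponentials; the sketch omits them for brevity.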
If this is right
- Training remains fully parallelizable while inference cost stays constant per token, removing the need to trade one for the other.
- Memory usage during inference does not grow with sequence length, enabling arbitrarily long contexts at fixed hardware cost (see the back-of-envelope sketch after this list).
- The same trained weights support both batched training and single-stream deployment without architectural changes.
- Scaling behavior observed in Transformers appears to transfer to this linear-attention RNN form at least up to 14 billion parameters.
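To make the fixed-memory point concrete, a back-of-envelope comparison; all model dimensions and the per-layer state size below are assumptions chosen for illustration, not figures from the paper.

```python
# Illustrative only: memory held during autoregressive decoding for a
# Transformer-style KV cache versus a fixed-size recurrent state.
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per=2):
    # keys + values, one entry per layer, head, and past token
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per

def recurrent_state_bytes(n_layers=32, d_model=4096, state_vectors=4, bytes_per=2):
    # a handful of d_model-sized vectors per layer, independent of context length
    return n_layers * state_vectors * d_model * bytes_per

for L in (1_024, 32_768, 1_048_576):
    print(f"{L:>9} tokens: KV cache {kv_cache_bytes(L) / 2**30:8.2f} GiB, "
          f"recurrent state {recurrent_state_bytes() / 2**20:.2f} MiB")
```

The cache grows linearly with context length while the recurrent state stays fixed, which is the asymmetry the bullets above point at.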
Where Pith is reading between the lines
- If the linear formulation generalizes, hardware optimized for recurrent computation could be reused for large language models without accuracy loss.
- The architecture may simplify serving of long-context models on memory-constrained devices by replacing the KV cache, which grows with context length, with a fixed-size recurrent state.
- Future work could test whether the same linear mechanism extends cleanly to non-text modalities while preserving the training-inference duality.
Load-bearing premise
The linear attention mechanism captures the same long-range dependencies as full quadratic attention without needing extra adjustments or task-specific changes.
What would settle it
A side-by-side evaluation of a 14-billion-parameter RWKV model and a Transformer of identical size showing a clear performance gap on standard language-modeling benchmarks would falsify the parity claim.
Original abstract
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RWKV, a hybrid architecture that reformulates RNNs via a linear attention mechanism (Receptance Weighted Key Value) combining time-mixing with exponential decay and channel-mixing. This allows parallelizable training like Transformers while maintaining constant memory and compute during inference like RNNs. The central claim is that models scale to 14B parameters—the largest dense RNN trained—and achieve performance on par with similarly sized Transformers on NLP tasks.
Significance. If the empirical parity holds under matched conditions, the result would be significant: it offers a path to linear scaling in sequence length without sacrificing the modeling capacity of quadratic attention, potentially enabling more efficient large-scale language models and reducing the inference-memory trade-off that currently favors Transformers.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the headline claim that RWKV 'performs on par' with 14B-scale Transformers supplies no quantitative metrics, benchmark tables, or ablation details. Without these, it is impossible to verify whether identical data, optimizer, context lengths, or training steps were used, leaving open the possibility that parity depends on unstated post-hoc adjustments rather than the architecture itself.
- [§3.2] §3.2 (RWKV formulation): the receptance-weighted KV time-mixing with exponential decay imposes a fixed decay bias on long-range interactions. The paper must demonstrate—via controlled long-context benchmarks or comparison to full attention—that this bias does not degrade modeling capacity relative to quadratic attention at 14B scale; otherwise the generality argument is at risk.
minor comments (2)
- [§3] Notation for the linear attention recurrence should be clarified with explicit equations showing how the parallel training form reduces to the RNN inference form without additional approximations (a sketch of such a restatement follows this list).
- [Figures] Figure captions and axis labels in the scaling plots should include exact model sizes, token counts, and baseline Transformer variants for direct comparison.
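One way the requested equations could look (a reconstruction in the paper's notation, not a quotation of its equations): carry the running numerator a_t and denominator b_t of the wkv average as the recurrent state.

```latex
% Recurrent restatement of the WKV operator: carrying (a_t, b_t) reproduces the
% parallel sums exactly, so switching forms introduces no approximation.
\begin{aligned}
a_0 &= 0, \qquad b_0 = 0,\\
wkv_t &= \frac{a_{t-1} + e^{u+k_t}\,v_t}{\,b_{t-1} + e^{u+k_t}\,},\\
a_t &= e^{-w}\,a_{t-1} + e^{k_t}\,v_t, \qquad
b_t = e^{-w}\,b_{t-1} + e^{k_t}.
\end{aligned}
```

Unrolling a_{t-1} and b_{t-1} recovers the parallel sums term by term, which is the exactness the comment asks the authors to state explicitly.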
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment point-by-point below and will revise the manuscript to incorporate additional details and clarifications as outlined.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline claim that RWKV 'performs on par' with 14B-scale Transformers supplies no quantitative metrics, benchmark tables, or ablation details. Without these, it is impossible to verify whether identical data, optimizer, context lengths, or training steps were used, leaving open the possibility that parity depends on unstated post-hoc adjustments rather than the architecture itself.
Authors: We agree that the current presentation of results in the abstract and §4 would benefit from greater explicitness. In the revised manuscript we will expand both sections with benchmark tables reporting perplexity and zero-shot accuracy for RWKV models up to 14B parameters against matched Transformer baselines (e.g., comparable GPT-style models), together with precise statements of training data, optimizer settings, context length, and total steps. These additions will make the parity claim directly verifiable from the text. revision: yes
-
Referee: [§3.2] §3.2 (RWKV formulation): the receptance-weighted KV time-mixing with exponential decay imposes a fixed decay bias on long-range interactions. The paper must demonstrate—via controlled long-context benchmarks or comparison to full attention—that this bias does not degrade modeling capacity relative to quadratic attention at 14B scale; otherwise the generality argument is at risk.
Authors: The per-channel decay rates in RWKV are learned rather than fixed, which in principle allows the model to retain long-range information when beneficial. Our scaling curves up to 14B already show continued gains on tasks that require long dependencies, but we accept that explicit controlled comparisons would strengthen the claim. We will add a dedicated subsection with long-context evaluations (e.g., extended-sequence perplexity and retrieval tasks) and direct head-to-head results against full-attention models at the largest feasible scale to demonstrate that modeling capacity is not materially degraded. revision: yes
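A small illustration of the rebuttal's point about learned decay; the values of w below are hypothetical, not rates fitted by the model. A channel whose decay rate is driven toward zero keeps near-full weight on tokens thousands of steps back, while a large rate restricts the channel to recent context.

```python
# Hypothetical per-channel decay rates, chosen only to show the range of behavior
# a learned w permits; these are not values reported in the paper.
import numpy as np

distances = np.array([1, 100, 10_000])    # gap between key and query positions
for w in (1e-4, 1e-2, 1.0):
    weights = np.exp(-w * distances)      # relative (unnormalized) weight e^{-d*w}
    print(f"w={w:g}: weight at distance 1/100/10000 =", np.round(weights, 4))
```

Whether 14B-scale training actually learns such small rates where they are needed is the empirical question the proposed long-context evaluations would address.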
Circularity Check
RWKV architecture derivation is self-contained with no circular reductions
Full rationale
The paper introduces a new linear-attention formulation (receptance-weighted KV with time-mixing and channel-mixing blocks) and reports empirical scaling results up to 14B parameters. No equation in the provided text defines a quantity in terms of itself, renames a fitted parameter as a prediction, or imports a uniqueness theorem from prior self-citations that would force the reported parity. The central claim is an empirical outcome of training the proposed architecture, not a reduction to its own inputs by construction. External benchmarks and training details are presented as independent evidence rather than tautological restatements.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Receptance Weighted Key Value (RWKV) mechanism
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tagged: unclear)
Relation between the paper passage and the cited Recognition theorem:
"We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 32 Pith papers
-
When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models
Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.
-
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
-
Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
-
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
-
Winner-Take-All Spiking Transformer for Language Modeling
Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
-
RT-Transformer: The Transformer Block as a Spherical State Estimator
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
-
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...
-
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
-
Predicting Where Steering Vectors Succeed
The Linear Accessibility Profile predicts steering vector effectiveness and optimal layers with Spearman correlations of 0.86-0.91 using unembedding projections on intermediate states across multiple models and concepts.
-
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
Attention to Mamba: A Recipe for Cross-Architecture Distillation
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
-
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
-
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.
-
The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.
-
Adaptive Spiking Neurons for Vision and Language Modeling
ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.
-
Belief-State RWKV for Reinforcement Learning under Partial Observability
Belief-state RWKV maintains an uncertainty-aware recurrent state for RL policies in partial observability and shows modest gains over standard recurrent baselines in a pilot with observation noise.
-
Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.