RWKV: Reinventing RNNs for the Transformer Era
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 10:48 UTC · model grok-4.3
The pith
The RWKV architecture achieves Transformer-level performance at 14 billion parameters while scaling linearly with sequence length at inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RWKV reformulates attention as a linear time-decay operation over receptance-weighted key-value pairs, allowing the same set of weights to be evaluated either in a parallel, Transformer-like form during training or as a recurrent model during inference that maintains O(1) memory and compute per token regardless of sequence length. Models trained this way reach performance parity with Transformers when scaled to 14 billion parameters.
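For orientation, here is a sketch of the time-mixing block the claim refers to, paraphrased from the paper's §3.2; the symbols (token-shift interpolations μ, per-channel decay w, bonus u, receptance r) follow the paper's notation, but the indexing below is our reconstruction rather than a quotation.

```latex
% Time-mixing sketch (paraphrase of Sec. 3.2; \odot is the elementwise product,
% \sigma the sigmoid, and w, u, \mu_* are learned per-channel parameters).
\begin{aligned}
r_t &= W_r\bigl(\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1}\bigr),\\
k_t &= W_k\bigl(\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1}\bigr),\\
v_t &= W_v\bigl(\mu_v \odot x_t + (1-\mu_v) \odot x_{t-1}\bigr),\\
wkv_t &= \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\,v_i \;+\; e^{u+k_t}\,v_t}
             {\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \;+\; e^{u+k_t}},\\
o_t &= W_o\bigl(\sigma(r_t) \odot wkv_t\bigr).
\end{aligned}
```

The receptance gate σ(r_t) controls how much of the decayed key-value aggregate wkv_t reaches the output, which is the sense in which attention becomes a linear time-decay operation.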
What carries the argument
Receptance Weighted Key Value (RWKV) linear attention, which computes each output token as an exponentially decayed weighted sum of all prior key-value pairs using a time-difference decay factor, enabling exact equivalence between the recurrent and parallel formulations.
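A minimal numerical sketch of that equivalence claim (not the authors' code; shapes, variable names, and the NumPy check are our assumptions): the parallel sum and a constant-state recurrence produce the same wkv value at every step.

```python
# Minimal sketch (assumed notation, not the authors' implementation): the WKV
# operator evaluated two ways -- a direct sum over all earlier tokens, and a
# recurrence that carries only a numerator/denominator pair per channel.
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 4                       # sequence length, channel dimension
w = rng.uniform(0.1, 1.0, D)      # per-channel decay (positive)
u = rng.normal(size=D)            # "bonus" applied to the current token
k = rng.normal(size=(T, D))
v = rng.normal(size=(T, D))

def wkv_parallel(t):
    """Direct weighted average over positions i <= t (the parallel, training-time form)."""
    num, den = np.exp(u + k[t]) * v[t], np.exp(u + k[t])
    for i in range(t):
        decay = np.exp(-(t - 1 - i) * w + k[i])
        num, den = num + decay * v[i], den + decay
    return num / den

# Recurrent, inference-time form: O(1) state and work per token.
a = np.zeros(D)                   # running numerator
b = np.zeros(D)                   # running denominator
for t in range(T):
    wkv_recurrent = (a + np.exp(u + k[t]) * v[t]) / (b + np.exp(u + k[t]))
    assert np.allclose(wkv_recurrent, wkv_parallel(t))
    a = np.exp(-w) * a + np.exp(k[t]) * v[t]   # decay state, absorb current token
    b = np.exp(-w) * b + np.exp(k[t])
print("recurrent and parallel WKV agree on all", T, "steps")
```

Real implementations add safeguards against overflow of the exponentials; the sketch omits them for brevity.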
If this is right
- Training remains fully parallelizable while inference cost stays constant per token, removing the need to trade one for the other.
- Memory usage during inference does not grow with sequence length, enabling arbitrarily long contexts at fixed hardware cost (see the back-of-envelope sketch after this list).
- The same trained weights support both batched training and single-stream deployment without architectural changes.
- Scaling behavior observed in Transformers appears to transfer to this linear-attention RNN form at least up to 14 billion parameters.
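To make the fixed-memory point concrete, a back-of-envelope comparison; all model dimensions and the per-layer state size below are assumptions chosen for illustration, not figures from the paper.

```python
# Illustrative only: memory held during autoregressive decoding for a
# Transformer-style KV cache versus a fixed-size recurrent state.
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per=2):
    # keys + values, one entry per layer, head, and past token
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per

def recurrent_state_bytes(n_layers=32, d_model=4096, state_vectors=4, bytes_per=2):
    # a handful of d_model-sized vectors per layer, independent of context length
    return n_layers * state_vectors * d_model * bytes_per

for L in (1_024, 32_768, 1_048_576):
    print(f"{L:>9} tokens: KV cache {kv_cache_bytes(L) / 2**30:8.2f} GiB, "
          f"recurrent state {recurrent_state_bytes() / 2**20:.2f} MiB")
```

The cache grows linearly with context length while the recurrent state stays fixed, which is the asymmetry the bullets above point at.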
Where Pith is reading between the lines
- If the linear formulation generalizes, hardware optimized for recurrent computation could be reused for large language models without accuracy loss.
- The architecture may simplify serving of long-context models on memory-constrained devices by replacing the KV cache, which grows with context length, with a fixed-size recurrent state.
- Future work could test whether the same linear mechanism extends cleanly to non-text modalities while preserving the training-inference duality.
Load-bearing premise
The linear attention mechanism captures the same long-range dependencies as full quadratic attention without needing extra adjustments or task-specific changes.
What would settle it
A side-by-side evaluation of a 14-billion-parameter RWKV model and a Transformer of identical size showing a clear performance gap on standard language-modeling benchmarks would falsify the parity claim.
Original abstract
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RWKV, a hybrid architecture that reformulates RNNs via a linear attention mechanism (Receptance Weighted Key Value) combining time-mixing with exponential decay and channel-mixing. This allows parallelizable training like Transformers while maintaining constant memory and compute during inference like RNNs. The central claim is that models scale to 14B parameters—the largest dense RNN trained—and achieve performance on par with similarly sized Transformers on NLP tasks.
Significance. If the empirical parity holds under matched conditions, the result would be significant: it offers a path to linear scaling in sequence length without sacrificing the modeling capacity of quadratic attention, potentially enabling more efficient large-scale language models and reducing the inference-memory trade-off that currently favors Transformers.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the headline claim that RWKV 'performs on par' with 14B-scale Transformers supplies no quantitative metrics, benchmark tables, or ablation details. Without these, it is impossible to verify whether identical data, optimizer, context lengths, or training steps were used, leaving open the possibility that parity depends on unstated post-hoc adjustments rather than the architecture itself.
- [§3.2] §3.2 (RWKV formulation): the receptance-weighted KV time-mixing with exponential decay imposes a fixed decay bias on long-range interactions. The paper must demonstrate—via controlled long-context benchmarks or comparison to full attention—that this bias does not degrade modeling capacity relative to quadratic attention at 14B scale; otherwise the generality argument is at risk.
minor comments (2)
- [§3] Notation for the linear attention recurrence should be clarified with explicit equations showing how the parallel training form reduces to the RNN inference form without additional approximations (a sketch of such a restatement follows this list).
- [Figures] Figure captions and axis labels in the scaling plots should include exact model sizes, token counts, and baseline Transformer variants for direct comparison.
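One way the requested equations could look (a reconstruction in the paper's notation, not a quotation of its equations): carry the running numerator a_t and denominator b_t of the wkv average as the recurrent state.

```latex
% Recurrent restatement of the WKV operator: carrying (a_t, b_t) reproduces the
% parallel sums exactly, so switching forms introduces no approximation.
\begin{aligned}
a_0 &= 0, \qquad b_0 = 0,\\
wkv_t &= \frac{a_{t-1} + e^{u+k_t}\,v_t}{\,b_{t-1} + e^{u+k_t}\,},\\
a_t &= e^{-w}\,a_{t-1} + e^{k_t}\,v_t, \qquad
b_t = e^{-w}\,b_{t-1} + e^{k_t}.
\end{aligned}
```

Unrolling a_{t-1} and b_{t-1} recovers the parallel sums term by term, which is the exactness the comment asks the authors to state explicitly.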
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment point-by-point below and will revise the manuscript to incorporate additional details and clarifications as outlined.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline claim that RWKV 'performs on par' with 14B-scale Transformers supplies no quantitative metrics, benchmark tables, or ablation details. Without these, it is impossible to verify whether identical data, optimizer, context lengths, or training steps were used, leaving open the possibility that parity depends on unstated post-hoc adjustments rather than the architecture itself.
Authors: We agree that the current presentation of results in the abstract and §4 would benefit from greater explicitness. In the revised manuscript we will expand both sections with benchmark tables reporting perplexity and zero-shot accuracy for RWKV models up to 14B parameters against matched Transformer baselines (e.g., comparable GPT-style models), together with precise statements of training data, optimizer settings, context length, and total steps. These additions will make the parity claim directly verifiable from the text. revision: yes
-
Referee: [§3.2] §3.2 (RWKV formulation): the receptance-weighted KV time-mixing with exponential decay imposes a fixed decay bias on long-range interactions. The paper must demonstrate—via controlled long-context benchmarks or comparison to full attention—that this bias does not degrade modeling capacity relative to quadratic attention at 14B scale; otherwise the generality argument is at risk.
Authors: The per-channel decay rates in RWKV are learned rather than fixed, which in principle allows the model to retain long-range information when beneficial. Our scaling curves up to 14B already show continued gains on tasks that require long dependencies, but we accept that explicit controlled comparisons would strengthen the claim. We will add a dedicated subsection with long-context evaluations (e.g., extended-sequence perplexity and retrieval tasks) and direct head-to-head results against full-attention models at the largest feasible scale to demonstrate that modeling capacity is not materially degraded. revision: yes
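A small illustration of the rebuttal's point about learned decay; the values of w below are hypothetical, not rates fitted by the model. A channel whose decay rate is driven toward zero keeps near-full weight on tokens thousands of steps back, while a large rate restricts the channel to recent context.

```python
# Hypothetical per-channel decay rates, chosen only to show the range of behavior
# a learned w permits; these are not values reported in the paper.
import numpy as np

distances = np.array([1, 100, 10_000])    # gap between key and query positions
for w in (1e-4, 1e-2, 1.0):
    weights = np.exp(-w * distances)      # relative (unnormalized) weight e^{-d*w}
    print(f"w={w:g}: weight at distance 1/100/10000 =", np.round(weights, 4))
```

Whether 14B-scale training actually learns such small rates where they are needed is the empirical question the proposed long-context evaluations would address.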
Circularity Check
RWKV architecture derivation is self-contained with no circular reductions
Full rationale
The paper introduces a new linear-attention formulation (receptance-weighted KV with time-mixing and channel-mixing blocks) and reports empirical scaling results up to 14B parameters. No equation in the provided text defines a quantity in terms of itself, renames a fitted parameter as a prediction, or imports a uniqueness theorem from prior self-citations that would force the reported parity. The central claim is an empirical outcome of training the proposed architecture, not a reduction to its own inputs by construction. External benchmarks and training details are presented as independent evidence rather than tautological restatements.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Receptance Weighted Key Value (RWKV) mechanism
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tagged: unclear)
Relation between the paper passage and the cited Recognition theorem:
"We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 32 Pith papers
-
When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models
Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.
-
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
-
Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
-
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
-
Winner-Take-All Spiking Transformer for Language Modeling
Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
-
RT-Transformer: The Transformer Block as a Spherical State Estimator
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
-
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...
-
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
-
Predicting Where Steering Vectors Succeed
The Linear Accessibility Profile predicts steering vector effectiveness and optimal layers with Spearman correlations of 0.86-0.91 using unembedding projections on intermediate states across multiple models and concepts.
-
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
Attention to Mamba: A Recipe for Cross-Architecture Distillation
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
-
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
-
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.
-
The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.
-
Adaptive Spiking Neurons for Vision and Language Modeling
ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.
-
Belief-State RWKV for Reinforcement Learning under Partial Observability
Belief-state RWKV maintains an uncertainty-aware recurrent state for RL policies in partial observability and shows modest gains over standard recurrent baselines in a pilot with observation noise.
-
Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.