Recognition: 2 Lean theorem links
Gated Delta Networks: Improving Mamba2 with Delta Rule
Pith reviewed 2026-05-13 14:45 UTC · model grok-4.3
The pith
Gated DeltaNet merges memory erasure gating with the delta update rule to outperform Mamba2 and DeltaNet on language and long-context tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By observing that gating enables rapid memory erasure while the delta rule facilitates targeted updates, the paper introduces the gated delta rule and develops a parallel training algorithm optimized for modern hardware. The resulting Gated DeltaNet architecture consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. Hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers achieve both improved training efficiency and superior task performance.
What carries the argument
The gated delta rule, which integrates a gating mechanism for adaptive memory control with the delta update rule for precise memory modifications, enabling both rapid erasure and targeted updates.
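As a concrete (if simplified) reading of that rule, here is a minimal sketch of one sequential update step in our own notation, not the paper's exact parameterization: a scalar decay gate alpha handles erasure and a write strength beta sizes the delta-rule correction.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One sequential gated-delta update (illustrative sketch, not the
    paper's exact parameterization).

    S     : (d_v, d_k) associative memory (value-by-key matrix)
    k     : (d_k,) key for the current token, assumed unit-norm
    v     : (d_v,) value for the current token
    alpha : scalar gate in (0, 1]; alpha < 1 erases old memory
    beta  : scalar write strength in (0, 1]; delta-rule step size
    """
    pred = S @ k  # what the decayed memory currently returns for key k
    # Gate (erase) the whole state, then correct the slot addressed by k
    # toward v:  S <- alpha * S + beta * (v - alpha * S k) k^T
    return alpha * S + beta * np.outer(v - alpha * pred, k)
```

With alpha = 1 this reduces to a pure delta-rule update, and with beta = 0 to pure gated decay, which matches the complementarity the review highlights.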
Load-bearing premise
That the gating mechanism for rapid memory erasure and the delta rule for targeted updates combine into a single rule that yields consistent gains without instability or hidden trade-offs.
What would settle it
A direct comparison showing Gated DeltaNet underperforming Mamba2 on long-context understanding or exhibiting training instability on standard language modeling benchmarks would falsify the claim.
read the original abstract
Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the gated delta rule, which combines a gating mechanism for rapid memory erasure with the delta rule for targeted memory updates. It develops a hardware-optimized parallel training algorithm for this rule and introduces the Gated DeltaNet architecture, claiming consistent outperformance over Mamba2 and DeltaNet on language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context tasks. Hybrid models combining Gated DeltaNet layers with sliding-window attention or Mamba2 layers are also presented for further efficiency and performance gains.
Significance. If the parallel algorithm faithfully reproduces the sequential gated delta dynamics and the reported gains prove robust to hyperparameter controls and statistical testing, the work would meaningfully advance linear sequence models by addressing retrieval and long-context limitations. The complementarity insight and hybrid design offer practical value for efficient training on modern hardware.
major comments (2)
- §3 (Gated Delta Rule and Parallel Algorithm): No derivation or equivalence proof is supplied showing that the parallel training algorithm exactly replicates the sequential semantics of combined gating and delta updates. This is load-bearing for the length-extrapolation and long-context claims, because chunking or associative reformulation could reorder erasure/update interactions and produce divergent behavior on long sequences.
- §5 (Experiments): The reported consistent outperformance lacks any mention of run counts, statistical significance tests, ablation studies isolating the gated delta rule, or error analysis. Without these controls it is impossible to determine whether gains survive hyperparameter or data-choice variation.
minor comments (2)
- Abstract: Specific benchmark names and dataset sizes should be listed to allow immediate assessment of the scope of the claimed improvements.
- Notation: The definition of the gated delta update should be presented with an explicit equation number and contrasted with the original delta rule and gating formulations for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the manuscript's rigor and clarity without altering its core contributions.
read point-by-point responses
- Referee: §3 (Gated Delta Rule and Parallel Algorithm): No derivation or equivalence proof is supplied showing that the parallel training algorithm exactly replicates the sequential semantics of combined gating and delta updates. This is load-bearing for the length-extrapolation and long-context claims, because chunking or associative reformulation could reorder erasure/update interactions and produce divergent behavior on long sequences.
Authors: We appreciate the referee's emphasis on this foundational aspect. The parallel algorithm was constructed via an associative scan that preserves the exact sequential order of gating (erasure) and delta (update) operations. However, we acknowledge that an explicit step-by-step derivation and equivalence proof were omitted from the original submission. In the revised manuscript we will add a dedicated subsection in §3 containing the full mathematical derivation, showing that the parallel formulation is mathematically identical to the sequential gated delta rule for any sequence length, including the non-commutative interactions between erasure and update steps. This will directly bolster the length-extrapolation and long-context results (a toy sketch of the required equivalence follows these responses). revision: yes
- Referee: §5 (Experiments): The reported consistent outperformance lacks any mention of run counts, statistical significance tests, ablation studies isolating the gated delta rule, or error analysis. Without these controls it is impossible to determine whether gains survive hyperparameter or data-choice variation.
Authors: We agree that the experimental section would benefit from greater statistical rigor. In the revised version we will (i) report results from at least five independent runs with different random seeds, including mean and standard deviation; (ii) include paired statistical significance tests against Mamba2 and DeltaNet baselines; (iii) add ablation experiments that isolate the gated delta rule by ablating the gating mechanism and the delta update separately; and (iv) provide a concise error analysis on representative retrieval and long-context tasks. These additions will be placed in §5 and the appendix (a minimal example of the paired test also follows these responses). revision: yes
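On the first response, the following toy check illustrates the property any such derivation must establish: because each gated-delta update is linear in the state, per-token transitions can be composed within a chunk and applied at once. This is our own numerical sketch under the notation introduced above, not the paper's chunkwise hardware algorithm.

```python
import numpy as np

def transition(k, alpha, beta):
    """Right-multiplied transition M_t = alpha * (I - beta * k k^T)."""
    d = k.shape[0]
    return alpha * (np.eye(d) - beta * np.outer(k, k))

def sequential(S, ks, vs, alphas, betas):
    # Token-by-token recurrence: S_t = S_{t-1} M_t + beta_t v_t k_t^T
    for k, v, a, b in zip(ks, vs, alphas, betas):
        S = S @ transition(k, a, b) + b * np.outer(v, k)
    return S

def chunked(S, ks, vs, alphas, betas, chunk=4):
    """Process tokens chunk-by-chunk: compose the transitions of a chunk
    and accumulate its writes, then apply both to the carried state."""
    T = len(ks)
    for s in range(0, T, chunk):
        M = np.eye(ks[0].shape[0])
        W = np.zeros_like(S)
        for k, v, a, b in zip(ks[s:s+chunk], vs[s:s+chunk],
                              alphas[s:s+chunk], betas[s:s+chunk]):
            Mt = transition(k, a, b)
            M = M @ Mt                       # compose transitions
            W = W @ Mt + b * np.outer(v, k)  # push earlier writes through
        S = S @ M + W
    return S

rng = np.random.default_rng(0)
d_k, d_v, T = 8, 8, 16
ks = [x / np.linalg.norm(x) for x in rng.normal(size=(T, d_k))]
vs = list(rng.normal(size=(T, d_v)))
alphas = rng.uniform(0.8, 1.0, T)
betas = rng.uniform(0.1, 0.9, T)
S0 = np.zeros((d_v, d_k))
assert np.allclose(sequential(S0.copy(), ks, vs, alphas, betas),
                   chunked(S0.copy(), ks, vs, alphas, betas))
```

The check only demonstrates exact chunking of the linear recurrence on toy sizes; the manuscript's promised proof must additionally cover its specific associative-scan reformulation.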
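On the second response, a minimal sketch of the promised paired significance test, assuming matched seeds across models; the scores below are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed accuracies (placeholders, NOT the paper's numbers)
gated_deltanet = np.array([0.712, 0.718, 0.709, 0.715, 0.720])
mamba2         = np.array([0.701, 0.705, 0.698, 0.703, 0.707])

# Paired t-test across matched seeds, as the response proposes in (ii)
t, p = stats.ttest_rel(gated_deltanet, mamba2)
print(f"mean diff = {(gated_deltanet - mamba2).mean():.4f}, "
      f"t = {t:.2f}, p = {p:.4f}")
```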
Circularity Check
No significant circularity; claims rest on empirical benchmarks
full rationale
The paper proposes combining gating (for rapid memory erasure) and the delta rule (for targeted updates) into a gated delta rule, then presents a parallel training algorithm optimized for hardware. The central claims are that Gated DeltaNet surpasses Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks, with further gains from hybrid architectures. No load-bearing derivation reduces a claimed result to a fitted parameter, self-citation chain, or definitional tautology. The complementarity observation and parallel algorithm are presented as engineering insights supported by experiments rather than by construction from the same data. This is a standard empirical architecture paper whose performance claims are externally falsifiable via the reported benchmarks.
Lean theorems connected to this paper
- Foundation.DimensionForcing (8-tick period) · eight_tick_forces_D3 echoes "We introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware... preserves the benefits of chunkwise parallelism"
- Foundation.LedgerForcing · conservation_from_balance echoes "the gated delta rule... combines both approaches... hardware-efficient chunkwise algorithm"
Forward citations
Cited by 27 Pith papers
- VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
  VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
- Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction
  Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
- SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
  SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-P...
- Mixture of Layers with Hybrid Attention
  Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the rout...
- Transformers with Selective Access to Early Representations
  SATFormer uses a learned context-dependent gate for selective access to early-layer value representations in Transformers, improving loss and accuracy over static residual baselines.
- Transformers with Selective Access to Early Representations
  SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
- Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
  Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
- Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
  Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
  S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.
- A Single-Layer Model Can Do Language Modeling
  A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
- SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
  Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
- Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
  Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
- Cubit: Token Mixer with Kernel Ridge Regression
  Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
- Training Transformers for KV Cache Compressibility
  Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
- Training Transformers for KV Cache Compressibility
  KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
- The Impossibility Triangle of Long-Context Modeling
  No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
- Learning to Forget: Continual Learning with Adaptive Weight Decay
  FADE adapts per-parameter weight decay rates online via approximate meta-gradient descent to improve controlled forgetting over fixed decay in online tracking and streaming classification.
- Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
  HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
- Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
  Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attenti...
- In-Place Test-Time Training
  In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
- Olmo Hybrid: From Theory to Practice and Back
  A 7B hybrid attention-recurrent model outperforms its pure-transformer counterpart on pretraining metrics and scales more efficiently, supported by a proof that hybrids are strictly more expressive than either transfo...
- Beyond Similarity: Temporal Operator Attention for Time Series Analysis
  Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on t...
- Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
  Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
- Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
  Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
- Reasoning Primitives in Hybrid and Non-Hybrid LLMs
  Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.
- FG²-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
  FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
- On The Application of Linear Attention in Multimodal Transformers
  Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.