Recognition: 2 Lean theorem links
Gated Delta Networks: Improving Mamba2 with Delta Rule
Pith reviewed 2026-05-13 14:45 UTC · model grok-4.3
The pith
Gated DeltaNet merges memory erasure gating with the delta update rule to outperform Mamba2 and DeltaNet on language and long-context tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By observing that gating enables rapid memory erasure while the delta rule facilitates targeted updates, the paper introduces the gated delta rule and develops a parallel training algorithm optimized for modern hardware. The resulting Gated DeltaNet architecture consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. Hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers achieve both improved training efficiency and superior task performance.
What carries the argument
The gated delta rule, which integrates a gating mechanism for adaptive memory control with the delta update rule for precise memory modifications, enabling both rapid erasure and targeted updates.
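As a concrete (if simplified) reading of that rule, here is a minimal sketch of one sequential update step in our own notation, not the paper's exact parameterization: a scalar decay gate alpha handles erasure and a write strength beta sizes the delta-rule correction.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One sequential gated-delta update (illustrative sketch, not the
    paper's exact parameterization).

    S     : (d_v, d_k) associative memory (value-by-key matrix)
    k     : (d_k,) key for the current token, assumed unit-norm
    v     : (d_v,) value for the current token
    alpha : scalar gate in (0, 1]; alpha < 1 erases old memory
    beta  : scalar write strength in (0, 1]; delta-rule step size
    """
    pred = S @ k  # what the decayed memory currently returns for key k
    # Gate (erase) the whole state, then correct the slot addressed by k
    # toward v:  S <- alpha * S + beta * (v - alpha * S k) k^T
    return alpha * S + beta * np.outer(v - alpha * pred, k)
```

With alpha = 1 this reduces to a pure delta-rule update, and with beta = 0 to pure gated decay, which matches the complementarity the review highlights.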
Load-bearing premise
That the gating mechanism for rapid memory erasure and the delta rule for targeted updates combine into a single rule that yields consistent gains without instability or hidden trade-offs.
What would settle it
A direct comparison showing Gated DeltaNet underperforming Mamba2 on long-context understanding or exhibiting training instability on standard language modeling benchmarks would falsify the claim.
read the original abstract
Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the gated delta rule, which combines a gating mechanism for rapid memory erasure with the delta rule for targeted memory updates. It develops a hardware-optimized parallel training algorithm for this rule and introduces the Gated DeltaNet architecture, claiming consistent outperformance over Mamba2 and DeltaNet on language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context tasks. Hybrid models combining Gated DeltaNet layers with sliding-window attention or Mamba2 layers are also presented for further efficiency and performance gains.
Significance. If the parallel algorithm faithfully reproduces the sequential gated delta dynamics and the reported gains prove robust to hyperparameter controls and statistical testing, the work would meaningfully advance linear sequence models by addressing retrieval and long-context limitations. The complementarity insight and hybrid design offer practical value for efficient training on modern hardware.
major comments (2)
- §3 (Gated Delta Rule and Parallel Algorithm): No derivation or equivalence proof is supplied showing that the parallel training algorithm exactly replicates the sequential semantics of combined gating and delta updates. This is load-bearing for the length-extrapolation and long-context claims, because chunking or associative reformulation could reorder erasure/update interactions and produce divergent behavior on long sequences.
- §5 (Experiments): The reported consistent outperformance lacks any mention of run counts, statistical significance tests, ablation studies isolating the gated delta rule, or error analysis. Without these controls it is impossible to determine whether gains survive hyperparameter or data-choice variation.
minor comments (2)
- Abstract: Specific benchmark names and dataset sizes should be listed to allow immediate assessment of the scope of the claimed improvements.
- Notation: The definition of the gated delta update should be presented with an explicit equation number and contrasted with the original delta rule and gating formulations for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the manuscript's rigor and clarity without altering its core contributions.
read point-by-point responses
- Referee: §3 (Gated Delta Rule and Parallel Algorithm): No derivation or equivalence proof is supplied showing that the parallel training algorithm exactly replicates the sequential semantics of combined gating and delta updates. This is load-bearing for the length-extrapolation and long-context claims, because chunking or associative reformulation could reorder erasure/update interactions and produce divergent behavior on long sequences.
Authors: We appreciate the referee's emphasis on this foundational aspect. The parallel algorithm was constructed via an associative scan that preserves the exact sequential order of gating (erasure) and delta (update) operations. However, we acknowledge that an explicit step-by-step derivation and equivalence proof were omitted from the original submission. In the revised manuscript we will add a dedicated subsection in §3 containing the full mathematical derivation, showing that the parallel formulation is mathematically identical to the sequential gated delta rule for any sequence length, including the non-commutative interactions between erasure and update steps. This will directly bolster the length-extrapolation and long-context results (a toy sketch of the required equivalence follows these responses). revision: yes
- Referee: §5 (Experiments): The reported consistent outperformance lacks any mention of run counts, statistical significance tests, ablation studies isolating the gated delta rule, or error analysis. Without these controls it is impossible to determine whether gains survive hyperparameter or data-choice variation.
Authors: We agree that the experimental section would benefit from greater statistical rigor. In the revised version we will (i) report results from at least five independent runs with different random seeds, including mean and standard deviation; (ii) include paired statistical significance tests against Mamba2 and DeltaNet baselines; (iii) add ablation experiments that isolate the gated delta rule by ablating the gating mechanism and the delta update separately; and (iv) provide a concise error analysis on representative retrieval and long-context tasks. These additions will be placed in §5 and the appendix (a minimal example of the paired test also follows these responses). revision: yes
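On the first response, the following toy check illustrates the property any such derivation must establish: because each gated-delta update is linear in the state, per-token transitions can be composed within a chunk and applied at once. This is our own numerical sketch under the notation introduced above, not the paper's chunkwise hardware algorithm.

```python
import numpy as np

def transition(k, alpha, beta):
    """Right-multiplied transition M_t = alpha * (I - beta * k k^T)."""
    d = k.shape[0]
    return alpha * (np.eye(d) - beta * np.outer(k, k))

def sequential(S, ks, vs, alphas, betas):
    # Token-by-token recurrence: S_t = S_{t-1} M_t + beta_t v_t k_t^T
    for k, v, a, b in zip(ks, vs, alphas, betas):
        S = S @ transition(k, a, b) + b * np.outer(v, k)
    return S

def chunked(S, ks, vs, alphas, betas, chunk=4):
    """Process tokens chunk-by-chunk: compose the transitions of a chunk
    and accumulate its writes, then apply both to the carried state."""
    T = len(ks)
    for s in range(0, T, chunk):
        M = np.eye(ks[0].shape[0])
        W = np.zeros_like(S)
        for k, v, a, b in zip(ks[s:s+chunk], vs[s:s+chunk],
                              alphas[s:s+chunk], betas[s:s+chunk]):
            Mt = transition(k, a, b)
            M = M @ Mt                       # compose transitions
            W = W @ Mt + b * np.outer(v, k)  # push earlier writes through
        S = S @ M + W
    return S

rng = np.random.default_rng(0)
d_k, d_v, T = 8, 8, 16
ks = [x / np.linalg.norm(x) for x in rng.normal(size=(T, d_k))]
vs = list(rng.normal(size=(T, d_v)))
alphas = rng.uniform(0.8, 1.0, T)
betas = rng.uniform(0.1, 0.9, T)
S0 = np.zeros((d_v, d_k))
assert np.allclose(sequential(S0.copy(), ks, vs, alphas, betas),
                   chunked(S0.copy(), ks, vs, alphas, betas))
```

The check only demonstrates exact chunking of the linear recurrence on toy sizes; the manuscript's promised proof must additionally cover its specific associative-scan reformulation.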
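On the second response, a minimal sketch of the promised paired significance test, assuming matched seeds across models; the scores below are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed accuracies (placeholders, NOT the paper's numbers)
gated_deltanet = np.array([0.712, 0.718, 0.709, 0.715, 0.720])
mamba2         = np.array([0.701, 0.705, 0.698, 0.703, 0.707])

# Paired t-test across matched seeds, as the response proposes in (ii)
t, p = stats.ttest_rel(gated_deltanet, mamba2)
print(f"mean diff = {(gated_deltanet - mamba2).mean():.4f}, "
      f"t = {t:.2f}, p = {p:.4f}")
```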
Circularity Check
No significant circularity; claims rest on empirical benchmarks
full rationale
The paper proposes combining gating (for rapid memory erasure) and the delta rule (for targeted updates) into a gated delta rule, then presents a parallel training algorithm optimized for hardware. The central claims are that Gated DeltaNet surpasses Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks, with further gains from hybrid architectures. No load-bearing derivation reduces a claimed result to a fitted parameter, self-citation chain, or definitional tautology. The complementarity observation and parallel algorithm are presented as engineering insights supported by experiments rather than by construction from the same data. This is a standard empirical architecture paper whose performance claims are externally falsifiable via the reported benchmarks.
Lean theorems connected to this paper
- Foundation.DimensionForcing (8-tick period) · eight_tick_forces_D3 echoes "We introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware... preserves the benefits of chunkwise parallelism"
- Foundation.LedgerForcing · conservation_from_balance echoes "the gated delta rule... combines both approaches... hardware-efficient chunkwise algorithm"
Forward citations
Cited by 27 Pith papers
- VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
  VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
- Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction
  Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
- SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
  SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-P...
- Mixture of Layers with Hybrid Attention
  Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the rout...
- Transformers with Selective Access to Early Representations
  SATFormer uses a learned context-dependent gate for selective access to early-layer value representations in Transformers, improving loss and accuracy over static residual baselines.
- Transformers with Selective Access to Early Representations
  SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
- Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
  Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
- Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
  Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
  S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.
- A Single-Layer Model Can Do Language Modeling
  A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
- SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
  Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
- Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
  Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
- Cubit: Token Mixer with Kernel Ridge Regression
  Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
- Training Transformers for KV Cache Compressibility
  Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
- Training Transformers for KV Cache Compressibility
  KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
- The Impossibility Triangle of Long-Context Modeling
  No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
- Learning to Forget: Continual Learning with Adaptive Weight Decay
  FADE adapts per-parameter weight decay rates online via approximate meta-gradient descent to improve controlled forgetting over fixed decay in online tracking and streaming classification.
- Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
  HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
- Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
  Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attenti...
- In-Place Test-Time Training
  In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
- Olmo Hybrid: From Theory to Practice and Back
  A 7B hybrid attention-recurrent model outperforms its pure-transformer counterpart on pretraining metrics and scales more efficiently, supported by a proof that hybrids are strictly more expressive than either transfo...
- Beyond Similarity: Temporal Operator Attention for Time Series Analysis
  Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on t...
- Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
  Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
- Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
  Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
- Reasoning Primitives in Hybrid and Non-Hybrid LLMs
  Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.
- FG²-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
  FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
- On The Application of Linear Attention in Multimodal Transformers
  Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.