M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
Pith reviewed 2026-05-15 11:32 UTC · model grok-4.3
The pith
Non-linear RNNs with matrix-valued states achieve perfect state tracking at sequence lengths unseen in training and outperform matched Gated DeltaNet hybrids by 0.4-0.5 perplexity points while using 3× smaller recurrent states.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
M²RNN demonstrates that non-linear matrix-to-matrix recurrence supplies the missing expressive power for language modeling: the model tracks entities perfectly on sequences longer than those seen in training, and hybrid M²RNN stacks achieve 0.4-0.5 lower perplexity than matched Gated DeltaNet hybrids on a 7B MoE while storing only one-third the state per recurrent layer.
What carries the argument
A matrix-valued hidden state updated by a non-linear transition function, with the state size expanded on the fly to exploit tensor-core matrix multiplications.
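The mechanism can be sketched in a few lines. The paper's exact transition is not reproduced in this review, so the form below (a learned matrix applied to the state, a rank-one input write, and a pointwise non-linearity) is an illustrative assumption, not the authors' definition:

```python
import numpy as np

def m2rnn_step(S_prev, k_t, v_t, W_h, phi=np.tanh):
    """One hypothetical matrix-to-matrix recurrent step with a d x d state.

    S_prev : (d, d) previous hidden state
    k_t, v_t : (d, 1) per-token key/value projections
    W_h : (d, d) learned transition matrix (assumed form)
    """
    # The matrix-matrix product plus the pointwise phi makes the update
    # non-linear in the state, unlike linear-attention-style recurrences.
    return phi(W_h @ S_prev + k_t @ v_t.T)

rng = np.random.default_rng(0)
d = 8
S = np.zeros((d, d))
W_h = rng.standard_normal((d, d)) / np.sqrt(d)
for _ in range(16):  # unroll over a short sequence
    k = rng.standard_normal((d, 1))
    v = rng.standard_normal((d, 1))
    S = m2rnn_step(S, k, v, W_h)
print(S.shape)  # the state stays (d, d): d**2 scalars per layer
```

Because each step is a d×d matrix product, the work maps directly onto tensor-core matrix multiplications, which is the hardware argument behind state expansion.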
If this is right
- A single M²RNN layer inserted into an existing hybrid yields accuracy gains nearly identical to a full hybrid replacement.
- Hybrids with even a single M²RNN layer surpass state-of-the-art linear-attention hybrids by up to 8 points on LongBench long-context tasks.
- Recurrent state memory can be reduced by a factor of three while preserving or improving language-model quality.
- Non-linear recurrence becomes a practical drop-in component for scaling language models beyond pure attention stacks.
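The substitution experiments these bullets describe amount to choosing which recurrent slots in a hybrid stack receive the new layer type. A toy schedule generator (names, spacing, and defaults are ours, not the paper's) makes that design space concrete:

```python
# Illustrative hybrid layer schedules; layer names are hypothetical labels,
# not identifiers from the paper.
def hybrid_schedule(n_layers, recurrent_every=4, m2rnn_slots=(0,)):
    """Mark each layer as attention, a baseline recurrent layer, or M²RNN.

    Every `recurrent_every`-th layer is recurrent; the recurrent slots
    listed in `m2rnn_slots` use M²RNN, the rest use the baseline.
    """
    kinds, r_idx = [], 0
    for i in range(n_layers):
        if i % recurrent_every == 0:
            kinds.append("m2rnn" if r_idx in m2rnn_slots else "gated_deltanet")
            r_idx += 1
        else:
            kinds.append("attention")
    return kinds

# Single-layer substitution, as in the paper's ablation:
print(hybrid_schedule(8))
# ['m2rnn', 'attention', 'attention', 'attention',
#  'gated_deltanet', 'attention', 'attention', 'attention']
```

Sweeping `m2rnn_slots` from one slot to all of them is exactly the "spectrum of hybrid densities" trade-off the last bullet points at.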
Where Pith is reading between the lines
- The same matrix-state mechanism could be tested on code-generation or multi-hop reasoning benchmarks that require explicit state maintenance.
- Because state expansion is hardware-friendly, models could be trained at longer context lengths without proportional memory growth.
- Replacing only a few layers suggests a spectrum of hybrid densities could be explored to trade throughput against generalization.
Load-bearing premise
The non-linear matrix transitions and state expansion deliver the claimed expressive power and efficiency without introducing training instability or extra compute costs at scale.
What would settle it
Training and evaluating a 7B MoE hybrid that replaces all recurrent layers with M2RNN and measuring whether perplexity on the validation set rises or state-tracking accuracy collapses on lengths twice those used in training.
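The decisive length-generalization test can be made concrete with a synthetic entity-tracking probe. The task below (swap-tracking over a small set of slots, evaluated at twice the training length) is our illustration of the evaluation criterion, not the paper's benchmark:

```python
import random

def sample_swap_sequence(n_entities, length, rng):
    """A sequence of (i, j) swap instructions over n_entities slots."""
    return [tuple(rng.sample(range(n_entities), 2)) for _ in range(length)]

def final_state(n_entities, swaps):
    """Ground-truth arrangement after applying all swaps in order."""
    slots = list(range(n_entities))
    for i, j in swaps:
        slots[i], slots[j] = slots[j], slots[i]
    return slots

def length_generalizes(model_fn, n_entities=5, train_len=32, rng=None):
    """Score model_fn on a sequence 2x the training length -- the
    success criterion the review proposes as decisive."""
    rng = rng or random.Random(0)
    swaps = sample_swap_sequence(n_entities, 2 * train_len, rng)
    return model_fn(n_entities, swaps) == final_state(n_entities, swaps)

# The exact simulator used as a stand-in "model" trivially passes;
# a trained network would be plugged in via the same interface.
print(length_generalizes(final_state))  # True
```

"Perfect state tracking" in this framing means exact-match accuracy of 1.0 on held-out lengths, which is why a collapse at 2× training length would settle the claim negatively.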
Original abstract
Transformers are highly parallel but are limited to computations in the TC$^0$ complexity class, excluding tasks such as entity tracking and code execution that provably require greater expressive power. Motivated by this limitation, we revisit non-linear Recurrent Neural Networks (RNNs) for language modeling and introduce Matrix-to-Matrix RNN (M$^2$RNN): an architecture with matrix-valued hidden states and expressive non-linear state transitions. We demonstrate that the language modeling performance of non-linear RNNs is limited by their state size, and show how the state size expansion mechanism enables efficient use of tensor cores. Empirically, M$^2$RNN achieves perfect state tracking generalization at sequence lengths not seen during training. These benefits also translate to large-scale language modeling. In hybrid settings that interleave recurrent layers with attention, Hybrid M$^2$RNN outperforms equivalent Gated DeltaNet hybrids by $0.4$-$0.5$ perplexity points on a 7B MoE model, while using $3\times$ smaller state sizes for the recurrent layers. Notably, replacing even a single recurrent layer with M$^2$RNN in an existing hybrid architecture yields accuracy gains comparable to Hybrid M$^2$RNN with minimal impact on training throughput. Further, the Hybrid Gated DeltaNet models with a single M$^2$RNN layer also achieve superior long-context generalization, outperforming state-of-the-art hybrid linear attention architectures by up to $8$ points on LongBench. Together, these results establish non-linear RNN layers as a compelling building block for efficient and scalable language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces M²RNN, a non-linear RNN with matrix-valued hidden states and non-linear state transitions for language modeling. It claims that state size limits non-linear RNN performance and that a state size expansion mechanism enables efficient tensor core use. Empirically, M²RNN achieves perfect state tracking generalization to unseen sequence lengths; in hybrid attention-recurrent models, Hybrid M²RNN outperforms equivalent Gated DeltaNet hybrids by 0.4-0.5 perplexity points on a 7B MoE model while using 3× smaller recurrent state sizes, and single-layer substitutions yield comparable gains with minimal throughput impact. Hybrid models with one M²RNN layer also improve long-context results on LongBench by up to 8 points over state-of-the-art linear attention hybrids.
Significance. If the empirical results hold under matched conditions, the work offers a concrete path toward expressive, scalable recurrent layers that address transformer limitations in TC⁰-exceeding tasks such as entity tracking. The matrix-state design and reported efficiency gains could influence hybrid architectures for long-context and stateful modeling.
major comments (2)
- [Abstract] The headline efficiency claim ('3× smaller state sizes for the recurrent layers' while beating Gated DeltaNet hybrids by 0.4-0.5 PPL) is load-bearing for the scalability argument, yet the manuscript does not demonstrate that matrix d×d states are equivalent in effective capacity or FLOPs to the vector states used in the Gated DeltaNet baselines (matrix-matrix vs. vector-matrix transitions).
- [Experimental Results] The reported perfect state-tracking generalization and perplexity gains lack the full experimental protocol, exact baseline configurations, error bars, ablation tables, and training details needed to evaluate robustness; without these the central empirical claims remain provisional.
minor comments (1)
- [Abstract] The TC⁰ complexity reference would benefit from a short parenthetical or citation for readers outside circuit complexity.
Simulated Author's Rebuttal
We thank the referee for the careful review and for recognizing the potential significance of M²RNN for expressive recurrent modeling. We address the two major comments below with clarifications and commitments to strengthen the manuscript. All requested details can be provided in a revision.
Point-by-point responses
-
Referee: [Abstract] The headline efficiency claim ('3× smaller state sizes for the recurrent layers' while beating Gated DeltaNet hybrids by 0.4-0.5 PPL) is load-bearing for the scalability argument, yet the manuscript does not demonstrate that matrix d×d states are equivalent in effective capacity or FLOPs to the vector states used in the Gated DeltaNet baselines (matrix-matrix vs. vector-matrix transitions).
Authors: We agree that explicit equivalence between matrix-valued (d×d) and vector states must be demonstrated for the efficiency claim to be fully convincing. In M²RNN the recurrent state is a d×d matrix (d² scalar elements); Gated DeltaNet baselines use vector states whose dimension m was chosen so that m ≈ 3d², yielding the reported 3× reduction in state size (measured by element count and memory footprint). The transition FLOPs are O(d³) for the matrix-matrix multiplication in M²RNN versus O(m·d) for the vector-matrix operations in the baselines; because d is substantially smaller, the net compute per token remains comparable or lower while tensor-core utilization improves. We will add a dedicated appendix subsection with (i) a side-by-side parameter/FLOP table, (ii) explicit formulas for both architectures, and (iii) a short ablation confirming that performance gains persist when total recurrent FLOPs are strictly matched. This revision will be marked clearly in the abstract and main text. revision: yes
-
Referee: [Experimental Results] The reported perfect state-tracking generalization and perplexity gains lack the full experimental protocol, exact baseline configurations, error bars, ablation tables, and training details needed to evaluate robustness; without these the central empirical claims remain provisional.
Authors: We acknowledge that the current manuscript is missing several reproducibility elements. In the revised version we will expand the experimental sections and add a comprehensive appendix containing: (1) the complete training protocol (optimizer, learning-rate schedule, batch size, sequence length, number of tokens, hardware), (2) exact hyper-parameter tables for every baseline and hybrid configuration, (3) error bars or standard deviations from at least three independent runs for the 7B-scale perplexity results, (4) additional ablation tables varying state dimension, transition non-linearity, and hybrid placement, and (5) the precise state-tracking task setup (vocabulary, training lengths, evaluation lengths, and success criterion). These additions will be referenced from the main text so that the central claims can be evaluated directly. revision: yes
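The state-size and FLOP bookkeeping in the first response is easy to sanity-check. The dimension below is hypothetical, but the relation m ≈ 3d² is the one the authors state for the Gated DeltaNet comparison:

```python
# Sanity check of the rebuttal's state-size and FLOP accounting.
# d is a hypothetical state dimension, chosen only for illustration.

d = 64                 # assumed M²RNN state dimension (d x d matrix state)
state_m2rnn = d * d    # scalar elements stored per recurrent layer
m = 3 * d * d          # baseline vector-state size, per the stated m ≈ 3d²
print(m / state_m2rnn)           # 3.0 -> the claimed 3x state reduction

# Per-token transition cost, using the response's asymptotic formulas:
flops_m2rnn = d ** 3             # O(d^3) matrix-matrix transition
flops_baseline = m * d           # O(m * d) vector-matrix transition
print(flops_baseline / flops_m2rnn)  # 3.0 when m = 3d²: compute is comparable
```

Under this accounting the baseline actually spends more transition FLOPs per token, which is consistent with the response's "comparable or lower" compute claim; the promised appendix table would pin down the constants.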
Circularity Check
No significant circularity; claims rest on empirical comparisons and architectural definitions.
Full rationale
The paper introduces M²RNN as a new architecture with matrix-valued states and non-linear transitions, motivated by Transformer limitations in TC⁰. Key results (perfect state tracking generalization, 0.4-0.5 PPL gains in 7B MoE hybrids with 3× smaller states) are presented as direct empirical outcomes from experiments comparing to Gated DeltaNet and other baselines. No equations or steps reduce by construction to inputs (e.g., no fitted parameters renamed as predictions, no self-definitional state transitions, no load-bearing self-citations to prior author work that would make uniqueness or ansatz tautological). The derivation chain is self-contained via explicit architectural choices and external benchmarks, with no evidence of the result being equivalent to its inputs.