pith. machine review for the scientific record.

arxiv: 2605.06384 · v2 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · cs.FL

Recognition: 2 theorem links · Lean Theorem

MinMax Recurrent Neural Cascades

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.FL
keywords MinMax algebra · recurrent neural networks · regular languages · vanishing gradients · parallel evaluation · sequence modeling · finite automata · gradient stability

The pith

MinMax Recurrent Neural Cascades express every regular language, evaluate in logarithmic parallel time, and maintain constant state gradients over arbitrary distances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MinMax Recurrent Neural Cascades as a new form of recurrence based on the min and max operations rather than weighted sums and nonlinear activations. It sets out to prove that these models reach the full expressivity of regular languages, support parallel evaluation whose runtime grows only logarithmically with input length, keep all internal states and activations inside fixed bounds for inputs of any size, and avoid vanishing state gradients so that the influence of a past state on the current one can remain exactly one no matter how far apart they are. A reader would care because conventional recurrent networks lose long-range information through gradient decay or explosion, limiting their reliability on extended sequences. The work supplies formal proofs for the listed properties plus experimental confirmation that the models solve synthetic tasks perfectly and reach competitive next-token prediction performance at a scale of 127 million parameters.

Core claim

MinMax RNCs are built by cascading layers whose recurrence follows the MinMax algebra; this construction yields models whose formal expressivity includes all regular languages, whose parallel evaluation runs in time logarithmic in sequence length, whose states and activations remain uniformly bounded for every input length, whose loss gradients exist and stay bounded almost everywhere, and whose state gradients never vanish because the derivative of a state with respect to an earlier state can equal one independently of the time gap between them.

What carries the argument

The MinMax recurrence step, which replaces the standard linear-plus-nonlinearity update with direct min and max operations over previous states and current inputs, cascaded across multiple layers.
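To fix intuition, here is a minimal sketch of one such unit (not the authors' code; the clamp-style update h_t = max(min(h_{t-1}, a_t), b_t) and the affine input signals are illustrative assumptions, not the paper's exact parameterization):

```python
# Minimal sketch of a single MinMax recurrent unit (illustrative form only):
# the state is carried forward by min and max alone, so it can latch onto a
# past value with no multiplicative factor that could shrink or blow it up.
import numpy as np

def minmax_unit(x, w_a, w_b, h0=0.0):
    """Sequential evaluation of h_t = max(min(h_{t-1}, a_t), b_t)."""
    a = x @ w_a                      # input-driven upper signal, shape (T,)
    b = x @ w_b                      # input-driven lower signal, shape (T,)
    h, states = h0, []
    for a_t, b_t in zip(a, b):
        h = max(min(h, a_t), b_t)    # the MinMax recurrence step
        states.append(h)
    return np.array(states)

# toy usage: 8 time steps of 3-dimensional input
x = np.random.randn(8, 3)
print(minmax_unit(x, np.random.randn(3), np.random.randn(3)))
```

Roughly, a cascade stacks such layers so that later layers see the external input together with earlier layers' states, with no feedback in the other direction.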

If this is right

  • Any regular language can be recognized exactly by some MinMax RNC of finite size.
  • Sequence evaluation can be performed either sequentially or in parallel with runtime O(log n) given sufficient processors (see the sketch after this list).
  • States and activations stay inside fixed numerical bounds regardless of how long the input grows.
  • The gradient of any state with respect to a state from arbitrarily far in the past can remain exactly one at almost all parameter settings.
  • On synthetic tasks that stress long-range dependencies the models reach perfect solutions while standard recurrent networks do not.
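A hedged sketch of the parallel-evaluation point referenced in the list above (it assumes the illustrative clamp-style update h_t = max(min(h_{t-1}, a_t), b_t) from the earlier sketch, not necessarily the paper's construction or its scan algorithm): each step is a clamp f_{a,b}(h) = max(min(h, a), b), clamps compose into clamps, and composition is associative, which is exactly what a logarithmic-depth prefix scan needs.

```python
# Hedged sketch: clamp steps f_{a,b}(h) = max(min(h, a), b) compose into
# clamps, so a sequence of T steps can be reduced pairwise in O(log T)
# parallel rounds (the associative-scan pattern the runtime claim relies on).
import random

def apply_clamp(step, h):
    a, b = step
    return max(min(h, a), b)

def compose(first, second):
    """Clamp equal to applying `first`, then `second` (uses min/max distributivity)."""
    a1, b1 = first
    a2, b2 = second
    return (min(a1, a2), max(min(b1, a2), b2))

steps = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(16)]
h0 = 0.0

# sequential evaluation, one step at a time
h_seq = h0
for s in steps:
    h_seq = apply_clamp(s, h_seq)

# fold all steps into a single clamp; in parallel this fold is a balanced tree
combined = steps[0]
for s in steps[1:]:
    combined = compose(combined, s)

assert abs(apply_clamp(combined, h0) - h_seq) < 1e-12
```

Because `compose` is associative, the fold over steps can be arranged as a tree of depth O(log T) given enough processors, which is where the logarithmic runtime in the claim would come from.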

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The uniform bounds and non-vanishing gradients could allow reliable training on sequences far longer than those feasible for vanilla RNNs or LSTMs.
  • Hybrid architectures that insert MinMax layers into larger networks might inherit stability while adding other inductive biases.
  • The non-differentiable loci might still require careful initialization or smoothing in very deep cascades before the theoretical advantages appear on noisy real-world data.

Load-bearing premise

The points where min and max are non-differentiable do not block reliable gradient-based training, and the algebraic bounds proved for single layers continue to hold once the layers are stacked and the whole network is optimized on actual data.
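As an illustration of why the premise is plausible away from ties (again using the clamp-style update assumed in the sketches above; this is not the paper's proof): wherever no min or max arguments coincide, each per-step state derivative is an indicator, so the long-range gradient is a product of zeros and ones rather than of contraction factors.

```latex
% Illustrative only; assumes h_t = \max(\min(h_{t-1}, a_t), b_t) and excludes ties.
\[
  \frac{\partial h_t}{\partial h_{t-1}}
    = \mathbf{1}\!\left[h_{t-1} < a_t\right]\,
      \mathbf{1}\!\left[\min(h_{t-1}, a_t) > b_t\right] \in \{0,1\},
  \qquad
  \frac{\partial h_t}{\partial h_s}
    = \prod_{k=s+1}^{t} \frac{\partial h_k}{\partial h_{k-1}} \in \{0,1\}.
\]
```

What remains open, and what the referee presses on below, is how the tie set behaves once layers are stacked and the whole network is optimized.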

What would settle it

Training a MinMax RNC on a regular language such as parity (strings with an even number of ones) and observing either failure to reach perfect accuracy or a state gradient that shrinks toward zero with increasing distance would falsify the expressivity or gradient claims.
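A minimal version of the gradient half of that probe, sketched with autograd on the illustrative clamp recurrence and a tanh recurrence as the decaying baseline (hypothetical constants and names, not taken from the paper):

```python
# Probe sketch: measure the state gradient d h_T / d h_0 as the horizon grows.
# For the min/max update it should stay exactly 1 whenever the initial state
# is the value carried forward; for a tanh recurrence it shrinks geometrically.
import torch

def state_gradient(step_fn, T):
    h0 = torch.tensor(0.5, requires_grad=True)
    h = h0
    for _ in range(T):
        h = step_fn(h)
    (g,) = torch.autograd.grad(h, h0)
    return g.item()

minmax_step = lambda h: torch.maximum(torch.minimum(h, torch.tensor(10.0)),
                                      torch.tensor(-10.0))
tanh_step = lambda h: torch.tanh(0.9 * h)

for T in (10, 100, 1000):
    print(T, state_gradient(minmax_step, T), state_gradient(tanh_step, T))
    # expected pattern: 1.0 for the min/max chain at every T,
    # a value decaying toward 0.0 for the tanh chain
```

A perfect-accuracy check on the same regular language would probe the expressivity half of the claim.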

Figures

Figures reproduced from arXiv: 2605.06384 by Alessandro Ronca.

Figure 1. Diagram of a MinMax Recurrent Layer consisting of two units.
Figure 2. Evaluation accuracy: Latching(4) on the left, Sequences(3) on the right.
Original abstract

We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and most importantly it is not affected by vanishing or exploding gradient. We call MinMax Recurrent Neural Cascades (RNCs) the models obtained by cascading several layers of neurons that employ such recurrence. We show that MinMax RNCs enjoy many favourable theoretical properties. First, their formal expressivity includes all regular languages, arguably the maximal expressivity for a finite-memory system. Second, they can be evaluated in parallel with a runtime that is logarithmic in the input length given enough processors; and they can also be evaluated sequentially. Third, their state and activations are bounded uniformly for all input lengths. Fourth, at almost all points, their loss gradient exists and it is bounded. Fifth, they do not exhibit a vanishing state gradient: the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance between the two states. Finally, we find empirical evidence that the favourable theoretical properties of MinMax RNCs are matched by their practical capabilities: they are able to perfectly solve a number of synthetic tasks, showing superior performance compared to the considered state-of-the-art recurrent neural networks; also, we train a MinMax RNC of 127M parameters on next-token prediction, and the obtained model shows competitive performance for its size, providing evidence of the potential of MinMax RNCs on real-world tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MinMax Recurrent Neural Cascades (RNCs) based on min-max algebra for recurrence. It claims these models express all regular languages, support parallel logarithmic-time evaluation (and sequential), maintain uniform bounds on states and activations for any input length, have loss gradients that exist at almost all points and are bounded, and exhibit non-vanishing state gradients where the partial derivative of a state at time t w.r.t. a past state at s equals 1 independently of |t-s|. Empirical support is given via perfect performance on synthetic tasks (outperforming standard RNNs) and competitive results from a 127M-parameter model on next-token prediction.

Significance. If the central derivations hold, the work provides a theoretically grounded RNN variant with automata-theoretic expressivity, parallel efficiency, and explicit non-vanishing gradient guarantees derived from the min-max semiring. The explicit connection to regular languages and the scaling demonstration to 127M parameters are notable strengths that could influence recurrent architecture design for long-range tasks.

major comments (2)
  1. [gradient analysis section] Abstract and the gradient analysis section: the claim that 'the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance' is derived from algebraic identities in the min-max semiring that hold only at differentiable points. Cascading layers composes these operations and enlarges the non-differentiable set (where arguments are equal); the manuscript states gradients exist 'at almost all points' but does not establish that the constant-1 identity survives in the subgradient sense after composition, which is load-bearing for the no-vanishing-gradient and trainability claims.
  2. [empirical evaluation section] The 127M-parameter next-token experiment: while competitive performance is reported, no ablation or gradient-norm analysis is provided to test whether optimization actually exploits the claimed constant state-gradient property versus other convergence mechanisms. This weakens the link between the theoretical gradient bound and the practical scaling result.
minor comments (2)
  1. The cascade construction and its relation to standard automata should be illustrated with a small explicit example (e.g., a 2-state acceptor) to make the expressivity proof more accessible.
  2. Notation for the min and max operations and the recurrence update should be unified across the theoretical and experimental sections to avoid ambiguity in the boundedness proofs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment in detail below, indicating revisions where appropriate to clarify and strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [gradient analysis section] Abstract and the gradient analysis section: the claim that 'the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance' is derived from algebraic identities in the min-max semiring that hold only at differentiable points. Cascading layers composes these operations and enlarges the non-differentiable set (where arguments are equal); the manuscript states gradients exist 'at almost all points' but does not establish that the constant-1 identity survives in the subgradient sense after composition, which is load-bearing for the no-vanishing-gradient and trainability claims.

    Authors: The referee correctly identifies that the core algebraic identities for the constant gradient of 1 are established at points of differentiability for the basic min-max recurrence. The manuscript already states that gradients exist at almost all points, which follows because the non-differentiable loci (where min or max arguments are equal) form a lower-dimensional subset of the input space. For cascaded layers, this property is preserved under composition: the overall non-differentiable set remains a union of measure-zero manifolds. We agree that an explicit argument for the subgradient (in the sense of Clarke's generalized gradient) is needed to confirm that 1 remains in the subgradient after chaining the operations. We will revise the gradient analysis section to include a short proof sketch showing that the chain rule for subgradients preserves the constant-1 element in the limiting sense from nearby differentiable points, thereby supporting the non-vanishing claim for the full cascaded model. revision: yes

  2. Referee: [empirical evaluation section] The 127M-parameter next-token experiment: while competitive performance is reported, no ablation or gradient-norm analysis is provided to test whether optimization actually exploits the claimed constant state-gradient property versus other convergence mechanisms. This weakens the link between the theoretical gradient bound and the practical scaling result.

    Authors: We acknowledge that the 127M-parameter experiment reports competitive next-token prediction performance without accompanying ablation studies or explicit gradient-norm measurements that would isolate the contribution of the constant state-gradient property. The empirical section is presented as supporting evidence of practical scalability rather than a direct empirical validation of the gradient theory; the theoretical non-vanishing result stands independently. Nevertheless, the successful training of a model at this scale without divergence is consistent with the bounded-gradient guarantees. We will add a paragraph in the empirical evaluation section discussing how the uniform state bounds and non-vanishing gradient property provide a theoretical basis for stable optimization at large parameter counts, and we will include a brief note on the absence of vanishing gradients observed during training. revision: partial

Circularity Check

0 steps flagged

No circularity: properties derived from external min/max algebra and automata theory

Full rationale

The derivation begins from the explicit definition of MinMax recurrence (using min and max as the state-update operators) and proceeds via standard algebraic identities and automata constructions to prove expressivity, parallelizability, uniform bounds, and non-vanishing state gradients. These steps rely on external facts about min/max (which are differentiable almost everywhere) and regular-language theory rather than on any fitted parameter, self-referential definition, or load-bearing self-citation. No claimed prediction or result is shown to equal its own input by construction; the 127M-parameter experiment is presented only as empirical corroboration, not as part of the formal chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the algebraic properties of the min-max semiring and standard results from automata theory; no free parameters or new invented entities are introduced beyond the model definition itself.

axioms (2)
  • standard math The min-max semiring is associative, commutative, and distributive in the manner required for recurrence definitions.
    Invoked to establish the recurrence update rule and its algebraic closure properties (the standard identities are listed after this ledger).
  • standard math Finite automata recognize exactly the regular languages.
    Used to equate the model's expressivity with the class of regular languages.
invented entities (1)
  • MinMax Recurrent Neural Cascade (no independent evidence)
    purpose: A layered architecture whose recurrence is defined by min-max operations.
    Core new model introduced by the paper; no independent empirical evidence outside the paper is provided.
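For reference, the textbook identities behind the first axiom in the ledger (standard facts about min and max on the reals, not results of the paper):

```latex
% min and max on the reals form a distributive lattice; with +/- infinity as
% identity elements this gives the idempotent "MinMax" semiring structure.
\[
  \min(a,\min(b,c)) = \min(\min(a,b),c), \qquad
  \max(a,\max(b,c)) = \max(\max(a,b),c),
\]
\[
  \min(a,b) = \min(b,a), \qquad \max(a,b) = \max(b,a), \qquad
  \min(a,a) = \max(a,a) = a,
\]
\[
  \min(a,\max(b,c)) = \max(\min(a,b),\min(a,c)), \qquad
  \max(a,\min(b,c)) = \min(\max(a,b),\max(a,c)).
\]
```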

pith-pipeline@v0.9.0 · 5550 in / 1485 out tokens · 51972 ms · 2026-05-11T01:43:06.099587+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 1 internal anchor

  1. [1]

    Learning long-term dependencies with gradient descent is difficult

    Yoshua Bengio, Patrice Y. Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks, 5(2):157–166, 1994. doi: 10.1109/72.279181. URL https://doi.org/10.1109/72.279181

  2. [2]

    Jeffrey L. Elman. Finding structure in time.Cognitive Science, 14(2):179–211, 1990. ISSN 0364-0213. doi: https://doi.org/10.1016/0364-0213(90)90002-E. URL https://www. sciencedirect.com/science/article/pii/036402139090002E

  3. [3]

    On the expressivity of recurrent neural cascades

    Nadezda Alexandrovna Knorozova and Alessandro Ronca. On the expressivity of recurrent neural cascades. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium ...

  4. [4]

    Neural Computation 9(8), 1735–1780 (1997)

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural Comput., 9(8): 1735–1780, 1997. doi: 10.1162/NECO.1997.9.8.1735. URL https://doi.org/10.1162/ neco.1997.9.8.1735

  5. [5]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference...

  6. [6]

    Theoretical Limitations of Self-Attention in Neural Sequence Models

    Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171, 01 2020. ISSN 2307-387X. doi: 10.1162/tacl_a_00306. URL https://doi.org/10.1162/tacl_a_00306

  7. [7]

    What Formal Languages Can Transformers Express? A Survey

    Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What for- mal languages can transformers express? a survey.Transactions of the Association for Computational Linguistics, 12:543–561, 2024. doi: 10.1162/tacl_a_00663. URL https: //aclanthology.org/2024.tacl-1.30/

  8. [8]

    xlstm: Extended long short-term memory.Advances in Neural Information Processing Systems, 37:107547– 107603, 2024

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory.Advances in Neural Information Processing Systems, 37:107547– 107603, 2024

  9. [9]

    Titans: Learning to memorize at test time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=8GjSf9Rh7Z

  10. [10]

    Decimamba: Exploring the length extrapolation potential of mamba

    Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, and Raja Giryes. Decimamba: Exploring the length extrapolation potential of mamba. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=iWSl5Zyjjw

  11. [11]

    Improved state mixing in higher-order and block diagonal linear recurrent networks, 2026

    Igor Dubinin, Antonio Orvieto, and Felix Effenberger. Improved state mixing in higher-order and block diagonal linear recurrent networks, 2026. URLhttps://openreview.net/forum? id=fF72TYOKjZ

  12. [12]

    Advancing regular language reasoning in linear recurrent neural networks

    Ting-Han Fan, Ta-Chung Chi, and Alexander Rudnicky. Advancing regular language reasoning in linear recurrent neural networks. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages...

  13. [13]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum? id=tEYskw1VY2

  14. [14]

    Hippo: Recurrent memory with optimal polynomial projections

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors,Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur...

  15. [15]

    Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceeding...

  16. [16]

    Random feature attention

    Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. InInternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=QtTKTdVrFBB

  17. [17]

    RWKV-7 "goose" with expressive dynamic state evolution

    Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng. RWKV-7 "goose" with expressive dynamic state evolution. In Second Conference on Language Modeling,

  18. [18]

    URL https://openreview.net/forum?id=ayB1PACN5j

  19. [19]

    RWKV: Reinventing RNNs for the transformer era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guan...

  20. [20]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 5156–5165. PMLR, 13–18 Jul 202...

  21. [21]

    Resurrecting recurrent neural networks for long sequences

    Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volu...

  22. [22]

    Universality of linear recurrences followed by non-linear projections: Finite-width guarantees and benefits of complex eigenvalues

    Antonio Orvieto, Soham De, Caglar Gulcehre, Razvan Pascanu, and Samuel L Smith. Universality of linear recurrences followed by non-linear projections: Finite-width guarantees and benefits of complex eigenvalues. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceed- ...

  23. [23]

    Hyena hierarchy: Towards larger convolutional language models

    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learnin...

  24. [24]

    HGRN2: Gated linear RNNs with state expansion

    Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. HGRN2: Gated linear RNNs with state expansion. InFirst Conference on Language Modeling,

  25. [25]

    URL https://openreview.net/forum?id=y6SqbJfCSk

  26. [26]

    Linear transformers are secretly fast weight programmers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, Proceedings of Machine Learning Research, pages 9355–9366. PMLR, 2021. URL http: //proceedings...

  27. [27]

    Deltaproduct: Improving state-tracking in linear RNNs via householder products

    Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Improving state-tracking in linear RNNs via householder products. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=SoRiaijTGr

  28. [28]

    Learning to (learn at test time): RNNs with expressive hidden states

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states. InForty-second Inter- national Conference on Machine Learning, 2025. URLhttps://openreview.net/forum? id=wXfuOj9C7L

  29. [29]

    Structured linear CDEs: Maximally expressive and parallel-in-time sequence models

    Benjamin Walker, Lingyi Yang, Nicola Muca Cirone, Cristopher Salvi, and Terry Lyons. Structured linear CDEs: Maximally expressive and parallel-in-time sequence models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=HKDyRDzy1E

  30. [30]

    Gated delta networks: Improving mamba2 with delta rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=r8H7xhYPwz

  31. [31]

    Block-biased mamba for long-range sequence processing

    Annan Yu and N. Benjamin Erichson. Block-biased mamba for long-range sequence processing. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=5WKEH9LhAQ

  32. [32]

    Gated slot attention for efficient linear-time sequence modeling

    Yu Zhang, Songlin Yang, Rui-Jie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, and Guohong Fu. Gated slot attention for efficient linear-time sequence modeling. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=jY4PhQibmg

  33. [33]

    Parallel algorithms

    Guy E. Blelloch and Bruce M. Maggs. Parallel algorithms. In Mikhail J. Atallah and Marina Blanton, editors, Algorithms and Theory of Computation Handbook, chapter 25. Chapman & Hall/CRC, 2010

  34. [34]

    The expressive capacity of state space models: A formal language perspective

    Yash Sarrof, Yana Veitsman, and Michael Hahn. The expressive capacity of state space models: A formal language perspective. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=eV5YIrJPdy

  35. [35]

    Metric automata theory: A unifying theory of RNNs

    Adam Dankowiakowski and Alessandro Ronca. Metric automata theory: A unifying theory of RNNs. InNeurIPS, 2025

  36. [36]

    xLSTM 7b: A recurrent LLM for fast and efficient inference

    Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M Blies, Günter Klambauer, Sebastian Böck, and Sepp Hochreiter. xLSTM 7b: A recurrent LLM for fast and efficient inference. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=LV3DpKD08B

  37. [37]

    Approximation capabilities of multilayer feedforward networks

    Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991. doi: 10.1016/0893-6080(91)90009-T. URL https://doi.org/10.1016/0893-6080(91)90009-T

  38. [38]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

  39. [39]

    Theories of abstract automata

    Michael A. Arbib. Theories of abstract automata. Prentice-Hall series in automatic computation. Prentice-Hall, Englewood Cliffs, N.J, 1969

  40. [40]

    Juris Hartmanis and R. E. Stearns.Algebraic structure theory of sequential machines. Prentice- Hall international series in applied mathematics. Prentice-Hall, Englewood Cliffs, N.J, 1966

  41. [41]

    Scans as primitive parallel operations

    G.E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, 38(11):1526–1538, 1989. doi: 10.1109/12.42122

  42. [42]

    Prefix sums and their applications

    Guy E Blelloch. Prefix sums and their applications. 1990

  43. [43]

    A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE transactions on computers, 100(8):786–793, 2009

    Peter M Kogge and Harold S Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE transactions on computers, 100(8):786–793, 2009

  44. [44]

    On correctness of automatic differentiation for non-differentiable functions.Advances in neural information processing systems, 33:6719–6730, 2020

    Wonyeol Lee, Hangyeol Yu, Xavier Rival, and Hongseok Yang. On correctness of automatic differentiation for non-differentiable functions.Advances in neural information processing systems, 33:6719–6730, 2020

  45. [45]

    Convex analysis

    R Tyrrell Rockafellar. Convex analysis, volume 28. Princeton university press, 1997

  46. [46]

    Autograd mechanics

    PyTorch. Autograd mechanics. https://docs.pytorch.org/docs/stable/notes/ autograd.html, 2026. Accessed: 2026-04-18

  47. [47]

    Learning long-term dependencies with gradient descent is difficult.IEEE transactions on neural networks, 5(2):157–166, 1994

    Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult.IEEE transactions on neural networks, 5(2):157–166, 1994

  48. [48]

    In-context learning and induction heads.Transformer Circuits Thread, 2022

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a...

  49. [49]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Des- maison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-per...

  50. [50]

    URL https://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library

    Curran Associates, Inc., 2019. URL https://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library

  51. [51]

    Andrej Karpathy. nanogpt. https://github.com/karpathy/nanoGPT, 2022. GitHub repository. Accessed: 2026-05-04

  52. [52]

    Openwebtext corpus

    Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/ OpenWebTextCorpus, 2019

  53. [53]

    Algebraic Theory of Automata

    Abraham Ginzburg. Algebraic Theory of Automata. Academic Press, 1968

  54. [54]

    Algebraic theory of automata networks: An introduction

    Pál Dömösi and Chrystopher L Nehaniv. Algebraic theory of automata networks: An introduction. SIAM, 2005

  55. [55]

    Complexity and Real Computation

    Lenore Blum, Felipe Cucker, Michael Shub, and Steve Smale. Complexity and Real Computation. Springer, 1998. doi: 10.1007/978-1-4612-0701-6

  56. [56]

    Floating-point neural networks can represent almost all floating-point functions

    Geonho Hwang, Yeachan Park, Wonyeol Lee, and Sejun Park. Floating-point neural networks can represent almost all floating-point functions. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 26412–26432. PMLR, 13–19 Jul 2025

  57. [57]

    Introduction to algorithms

    Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2022