pith. machine review for the scientific record.

arxiv: 2605.06384 · v2 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · cs.FL

Recognition: 2 theorem links · Lean Theorem

MinMax Recurrent Neural Cascades

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.FL
keywords MinMax algebra · recurrent neural networks · regular languages · vanishing gradients · parallel evaluation · sequence modeling · finite automata · gradient stability

The pith

MinMax Recurrent Neural Cascades express every regular language, evaluate in logarithmic parallel time, and maintain constant state gradients over arbitrary distances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MinMax Recurrent Neural Cascades as a new form of recurrence based on the min and max operations rather than weighted sums and nonlinear activations. It sets out to prove that these models reach the full expressivity of regular languages, support parallel evaluation whose runtime grows only logarithmically with input length, keep all internal states and activations inside fixed bounds for inputs of any size, and avoid vanishing state gradients so that the influence of a past state on the current one can remain exactly one no matter how far apart they are. A reader would care because conventional recurrent networks lose long-range information through gradient decay or explosion, limiting their reliability on extended sequences. The work supplies formal proofs for the listed properties plus experimental confirmation that the models solve synthetic tasks perfectly and reach competitive next-token prediction performance at a scale of 127 million parameters.

Core claim

MinMax RNCs are built by cascading layers whose recurrence follows the MinMax algebra; this construction yields models whose formal expressivity includes all regular languages, whose parallel evaluation runs in time logarithmic in sequence length, whose states and activations remain uniformly bounded for every input length, whose loss gradients exist and stay bounded almost everywhere, and whose state gradients never vanish because the derivative of a state with respect to an earlier state can equal one independently of the time gap between them.

What carries the argument

The MinMax recurrence step, which replaces the standard linear-plus-nonlinearity update with direct min and max operations over previous states and current inputs, cascaded across multiple layers.
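To fix intuition, here is a minimal sketch of one such unit (not the authors' code; the clamp-style update h_t = max(min(h_{t-1}, a_t), b_t) and the affine input signals are illustrative assumptions, not the paper's exact parameterization):

```python
# Minimal sketch of a single MinMax recurrent unit (illustrative form only):
# the state is carried forward by min and max alone, so it can latch onto a
# past value with no multiplicative factor that could shrink or blow it up.
import numpy as np

def minmax_unit(x, w_a, w_b, h0=0.0):
    """Sequential evaluation of h_t = max(min(h_{t-1}, a_t), b_t)."""
    a = x @ w_a                      # input-driven upper signal, shape (T,)
    b = x @ w_b                      # input-driven lower signal, shape (T,)
    h, states = h0, []
    for a_t, b_t in zip(a, b):
        h = max(min(h, a_t), b_t)    # the MinMax recurrence step
        states.append(h)
    return np.array(states)

# toy usage: 8 time steps of 3-dimensional input
x = np.random.randn(8, 3)
print(minmax_unit(x, np.random.randn(3), np.random.randn(3)))
```

Roughly, a cascade stacks such layers so that later layers see the external input together with earlier layers' states, with no feedback in the other direction.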

If this is right

  • Any regular language can be recognized exactly by some MinMax RNC of finite size.
  • Sequence evaluation can be performed either sequentially or in parallel with runtime O(log n) given sufficient processors (see the sketch after this list).
  • States and activations stay inside fixed numerical bounds regardless of how long the input grows.
  • The gradient of any state with respect to a state from arbitrarily far in the past can remain exactly one at almost all parameter settings.
  • On synthetic tasks that stress long-range dependencies the models reach perfect solutions while standard recurrent networks do not.
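A hedged sketch of the parallel-evaluation point referenced in the list above (it assumes the illustrative clamp-style update h_t = max(min(h_{t-1}, a_t), b_t) from the earlier sketch, not necessarily the paper's construction or its scan algorithm): each step is a clamp f_{a,b}(h) = max(min(h, a), b), clamps compose into clamps, and composition is associative, which is exactly what a logarithmic-depth prefix scan needs.

```python
# Hedged sketch: clamp steps f_{a,b}(h) = max(min(h, a), b) compose into
# clamps, so a sequence of T steps can be reduced pairwise in O(log T)
# parallel rounds (the associative-scan pattern the runtime claim relies on).
import random

def apply_clamp(step, h):
    a, b = step
    return max(min(h, a), b)

def compose(first, second):
    """Clamp equal to applying `first`, then `second` (uses min/max distributivity)."""
    a1, b1 = first
    a2, b2 = second
    return (min(a1, a2), max(min(b1, a2), b2))

steps = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(16)]
h0 = 0.0

# sequential evaluation, one step at a time
h_seq = h0
for s in steps:
    h_seq = apply_clamp(s, h_seq)

# fold all steps into a single clamp; in parallel this fold is a balanced tree
combined = steps[0]
for s in steps[1:]:
    combined = compose(combined, s)

assert abs(apply_clamp(combined, h0) - h_seq) < 1e-12
```

Because `compose` is associative, the fold over steps can be arranged as a tree of depth O(log T) given enough processors, which is where the logarithmic runtime in the claim would come from.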

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The uniform bounds and non-vanishing gradients could allow reliable training on sequences far longer than those feasible for vanilla RNNs or LSTMs.
  • Hybrid architectures that insert MinMax layers into larger networks might inherit stability while adding other inductive biases.
  • The non-differentiable loci might still require careful initialization or smoothing in very deep cascades before the theoretical advantages appear on noisy real-world data.

Load-bearing premise

The points where min and max are non-differentiable do not block reliable gradient-based training, and the algebraic bounds proved for single layers continue to hold once the layers are stacked and the whole network is optimized on actual data.
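As an illustration of why the premise is plausible away from ties (again using the clamp-style update assumed in the sketches above; this is not the paper's proof): wherever no min or max arguments coincide, each per-step state derivative is an indicator, so the long-range gradient is a product of zeros and ones rather than of contraction factors.

```latex
% Illustrative only; assumes h_t = \max(\min(h_{t-1}, a_t), b_t) and excludes ties.
\[
  \frac{\partial h_t}{\partial h_{t-1}}
    = \mathbf{1}\!\left[h_{t-1} < a_t\right]\,
      \mathbf{1}\!\left[\min(h_{t-1}, a_t) > b_t\right] \in \{0,1\},
  \qquad
  \frac{\partial h_t}{\partial h_s}
    = \prod_{k=s+1}^{t} \frac{\partial h_k}{\partial h_{k-1}} \in \{0,1\}.
\]
```

What remains open, and what the referee presses on below, is how the tie set behaves once layers are stacked and the whole network is optimized.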

What would settle it

Training a MinMax RNC on a regular language such as parity (strings with an even number of ones) and observing either failure to reach perfect accuracy or a state gradient that shrinks toward zero with increasing distance would falsify the expressivity or gradient claims.
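A minimal version of the gradient half of that probe, sketched with autograd on the illustrative clamp recurrence and a tanh recurrence as the decaying baseline (hypothetical constants and names, not taken from the paper):

```python
# Probe sketch: measure the state gradient d h_T / d h_0 as the horizon grows.
# For the min/max update it should stay exactly 1 whenever the initial state
# is the value carried forward; for a tanh recurrence it shrinks geometrically.
import torch

def state_gradient(step_fn, T):
    h0 = torch.tensor(0.5, requires_grad=True)
    h = h0
    for _ in range(T):
        h = step_fn(h)
    (g,) = torch.autograd.grad(h, h0)
    return g.item()

minmax_step = lambda h: torch.maximum(torch.minimum(h, torch.tensor(10.0)),
                                      torch.tensor(-10.0))
tanh_step = lambda h: torch.tanh(0.9 * h)

for T in (10, 100, 1000):
    print(T, state_gradient(minmax_step, T), state_gradient(tanh_step, T))
    # expected pattern: 1.0 for the min/max chain at every T,
    # a value decaying toward 0.0 for the tanh chain
```

A perfect-accuracy check on the same regular language would probe the expressivity half of the claim.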

Figures

Figures reproduced from arXiv: 2605.06384 by Alessandro Ronca.

Figure 1. Diagram of a MinMax Recurrent Layer consisting of two units.
Figure 2. Evaluation accuracy: Latching(4) on the left, Sequences(3) on the right.
Original abstract

We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and most importantly it is not affected by vanishing or exploding gradient. We call MinMax Recurrent Neural Cascades (RNCs) the models obtained by cascading several layers of neurons that employ such recurrence. We show that MinMax RNCs enjoy many favourable theoretical properties. First, their formal expressivity includes all regular languages, arguably the maximal expressivity for a finite-memory system. Second, they can be evaluated in parallel with a runtime that is logarithmic in the input length given enough processors; and they can also be evaluated sequentially. Third, their state and activations are bounded uniformly for all input lengths. Fourth, at almost all points, their loss gradient exists and it is bounded. Fifth, they do not exhibit a vanishing state gradient: the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance between the two states. Finally, we find empirical evidence that the favourable theoretical properties of MinMax RNCs are matched by their practical capabilities: they are able to perfectly solve a number of synthetic tasks, showing superior performance compared to the considered state-of-the-art recurrent neural networks; also, we train a MinMax RNC of 127M parameters on next-token prediction, and the obtained model shows competitive performance for its size, providing evidence of the potential of MinMax RNCs on real-world tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MinMax Recurrent Neural Cascades (RNCs) based on min-max algebra for recurrence. It claims these models express all regular languages, support parallel logarithmic-time evaluation (and sequential), maintain uniform bounds on states and activations for any input length, have loss gradients that exist at almost all points and are bounded, and exhibit non-vanishing state gradients where the partial derivative of a state at time t w.r.t. a past state at s equals 1 independently of |t-s|. Empirical support is given via perfect performance on synthetic tasks (outperforming standard RNNs) and competitive results from a 127M-parameter model on next-token prediction.

Significance. If the central derivations hold, the work provides a theoretically grounded RNN variant with automata-theoretic expressivity, parallel efficiency, and explicit non-vanishing gradient guarantees derived from the min-max semiring. The explicit connection to regular languages and the scaling demonstration to 127M parameters are notable strengths that could influence recurrent architecture design for long-range tasks.

major comments (2)
  1. [gradient analysis section] Abstract and the gradient analysis section: the claim that 'the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance' is derived from algebraic identities in the min-max semiring that hold only at differentiable points. Cascading layers composes these operations and enlarges the non-differentiable set (where arguments are equal); the manuscript states gradients exist 'at almost all points' but does not establish that the constant-1 identity survives in the subgradient sense after composition, which is load-bearing for the no-vanishing-gradient and trainability claims.
  2. [empirical evaluation section] The 127M-parameter next-token experiment: while competitive performance is reported, no ablation or gradient-norm analysis is provided to test whether optimization actually exploits the claimed constant state-gradient property versus other convergence mechanisms. This weakens the link between the theoretical gradient bound and the practical scaling result.
minor comments (2)
  1. The cascade construction and its relation to standard automata should be illustrated with a small explicit example (e.g., a 2-state acceptor) to make the expressivity proof more accessible.
  2. Notation for the min and max operations and the recurrence update should be unified across the theoretical and experimental sections to avoid ambiguity in the boundedness proofs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment in detail below, indicating revisions where appropriate to clarify and strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [gradient analysis section] Abstract and the gradient analysis section: the claim that 'the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance' is derived from algebraic identities in the min-max semiring that hold only at differentiable points. Cascading layers composes these operations and enlarges the non-differentiable set (where arguments are equal); the manuscript states gradients exist 'at almost all points' but does not establish that the constant-1 identity survives in the subgradient sense after composition, which is load-bearing for the no-vanishing-gradient and trainability claims.

    Authors: The referee correctly identifies that the core algebraic identities for the constant gradient of 1 are established at points of differentiability for the basic min-max recurrence. The manuscript already states that gradients exist at almost all points, which follows because the non-differentiable loci (where min or max arguments are equal) form a lower-dimensional subset of the input space. For cascaded layers, this property is preserved under composition: the overall non-differentiable set remains a union of measure-zero manifolds. We agree that an explicit argument for the subgradient (in the sense of Clarke's generalized gradient) is needed to confirm that 1 remains in the subgradient after chaining the operations. We will revise the gradient analysis section to include a short proof sketch showing that the chain rule for subgradients preserves the constant-1 element in the limiting sense from nearby differentiable points, thereby supporting the non-vanishing claim for the full cascaded model. revision: yes

  2. Referee: [empirical evaluation section] The 127M-parameter next-token experiment: while competitive performance is reported, no ablation or gradient-norm analysis is provided to test whether optimization actually exploits the claimed constant state-gradient property versus other convergence mechanisms. This weakens the link between the theoretical gradient bound and the practical scaling result.

    Authors: We acknowledge that the 127M-parameter experiment reports competitive next-token prediction performance without accompanying ablation studies or explicit gradient-norm measurements that would isolate the contribution of the constant state-gradient property. The empirical section is presented as supporting evidence of practical scalability rather than a direct empirical validation of the gradient theory; the theoretical non-vanishing result stands independently. Nevertheless, the successful training of a model at this scale without divergence is consistent with the bounded-gradient guarantees. We will add a paragraph in the empirical evaluation section discussing how the uniform state bounds and non-vanishing gradient property provide a theoretical basis for stable optimization at large parameter counts, and we will include a brief note on the absence of vanishing gradients observed during training. revision: partial

Circularity Check

0 steps flagged

No circularity: properties derived from external min/max algebra and automata theory

Full rationale

The derivation begins from the explicit definition of MinMax recurrence (using min and max as the state-update operators) and proceeds via standard algebraic identities and automata constructions to prove expressivity, parallelizability, uniform bounds, and non-vanishing state gradients. These steps rely on external facts about min/max (which are differentiable almost everywhere) and regular-language theory rather than on any fitted parameter, self-referential definition, or load-bearing self-citation. No claimed prediction or result is shown to equal its own input by construction; the 127M-parameter experiment is presented only as empirical corroboration, not as part of the formal chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the algebraic properties of the min-max semiring and standard results from automata theory; no free parameters or new invented entities are introduced beyond the model definition itself.

axioms (2)
  • standard math The min-max semiring is associative, commutative, and distributive in the manner required for recurrence definitions.
    Invoked to establish the recurrence update rule and its algebraic closure properties (the standard identities are listed after this ledger).
  • standard math Finite automata recognize exactly the regular languages.
    Used to equate the model's expressivity with the class of regular languages.
invented entities (1)
  • MinMax Recurrent Neural Cascade (no independent evidence)
    purpose: A layered architecture whose recurrence is defined by min-max operations.
    Core new model introduced by the paper; no independent empirical evidence outside the paper is provided.
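For reference, the textbook identities behind the first axiom in the ledger (standard facts about min and max on the reals, not results of the paper):

```latex
% min and max on the reals form a distributive lattice; with +/- infinity as
% identity elements this gives the idempotent "MinMax" semiring structure.
\[
  \min(a,\min(b,c)) = \min(\min(a,b),c), \qquad
  \max(a,\max(b,c)) = \max(\max(a,b),c),
\]
\[
  \min(a,b) = \min(b,a), \qquad \max(a,b) = \max(b,a), \qquad
  \min(a,a) = \max(a,a) = a,
\]
\[
  \min(a,\max(b,c)) = \max(\min(a,b),\min(a,c)), \qquad
  \max(a,\min(b,c)) = \min(\max(a,b),\max(a,c)).
\]
```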

pith-pipeline@v0.9.0 · 5550 in / 1485 out tokens · 51972 ms · 2026-05-11T01:43:06.099587+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 1 internal anchor

  1. [1]

    Learning long-term dependencies with gradient descent is difficult

    Yoshua Bengio, Patrice Y. Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks, 5(2):157–166, 1994. doi: 10.1109/72.279181. URL https://doi.org/10.1109/72.279181

  2. [2]

    Jeffrey L. Elman. Finding structure in time.Cognitive Science, 14(2):179–211, 1990. ISSN 0364-0213. doi: https://doi.org/10.1016/0364-0213(90)90002-E. URL https://www. sciencedirect.com/science/article/pii/036402139090002E

  3. [3]

    On the expressivity of recurrent neural cascades

    Nadezda Alexandrovna Knorozova and Alessandro Ronca. On the expressivity of recurrent neural cascades. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium ...

  4. [4]

    Neural Computation 9(8), 1735–1780 (1997)

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural Comput., 9(8): 1735–1780, 1997. doi: 10.1162/NECO.1997.9.8.1735. URL https://doi.org/10.1162/ neco.1997.9.8.1735

  5. [5]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference...

  6. [6]

    Theoretical Limitations of Self-Attention in Neural Sequence Models

    Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171, 01 2020. ISSN 2307-387X. doi: 10.1162/tacl_a_00306. URL https://doi.org/10.1162/tacl_a_00306

  7. [7]

    What Formal Languages Can Transformers Express? A Survey

    Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What for- mal languages can transformers express? a survey.Transactions of the Association for Computational Linguistics, 12:543–561, 2024. doi: 10.1162/tacl_a_00663. URL https: //aclanthology.org/2024.tacl-1.30/

  8. [8]

    xlstm: Extended long short-term memory.Advances in Neural Information Processing Systems, 37:107547– 107603, 2024

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory.Advances in Neural Information Processing Systems, 37:107547– 107603, 2024

  9. [9]

    Titans: Learning to memorize at test time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=8GjSf9Rh7Z

  10. [10]

    Decimamba: Exploring the length extrapolation potential of mamba

    Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, and Raja Giryes. Decimamba: Exploring the length extrapolation potential of mamba. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=iWSl5Zyjjw

  11. [11]

    Improved state mixing in higher-order and block diagonal linear recurrent networks, 2026

    Igor Dubinin, Antonio Orvieto, and Felix Effenberger. Improved state mixing in higher-order and block diagonal linear recurrent networks, 2026. URLhttps://openreview.net/forum? id=fF72TYOKjZ

  12. [12]

    Advancing regular language reasoning in linear recurrent neural networks

    Ting-Han Fan, Ta-Chung Chi, and Alexander Rudnicky. Advancing regular language reasoning in linear recurrent neural networks. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages...

  13. [13]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum? id=tEYskw1VY2

  14. [14]

    Hippo: Recurrent memory with optimal polynomial projections

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors,Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur...

  15. [15]

    Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceeding...

  16. [16]

    Random feature attention

    Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. InInternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=QtTKTdVrFBB

  17. [17]

    RWKV-7 "goose" with expressive dynamic state evolution

    Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng. RWKV-7 "goose" with expressive dynamic state evolution. In Second Conference on Language Modeling,

  18. [18]

    URL https://openreview.net/forum?id=ayB1PACN5j

  19. [19]

    RWKV: Reinventing RNNs for the transformer era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guan...

  20. [20]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 5156–5165. PMLR, 13–18 Jul 202...

  21. [21]

    Resurrecting recurrent neural networks for long sequences

    Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volu...

  22. [22]

    Universality of linear recurrences followed by non-linear projections: Finite-width guarantees and benefits of complex eigenvalues

    Antonio Orvieto, Soham De, Caglar Gulcehre, Razvan Pascanu, and Samuel L Smith. Universality of linear recurrences followed by non-linear projections: Finite-width guarantees and benefits of complex eigenvalues. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceed- ...

  23. [23]

    Hyena hierarchy: Towards larger convolutional language models

    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learnin...

  24. [24]

    HGRN2: Gated linear RNNs with state expansion

    Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. HGRN2: Gated linear RNNs with state expansion. InFirst Conference on Language Modeling,

  25. [25]

    URL https://openreview.net/forum?id=y6SqbJfCSk

  26. [26]

    Linear transformers are secretly fast weight programmers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, Proceedings of Machine Learning Research, pages 9355–9366. PMLR, 2021. URL http: //proceedings...

  27. [27]

    Deltaproduct: Improving state-tracking in linear RNNs via householder products

    Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Improving state-tracking in linear RNNs via householder products. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=SoRiaijTGr

  28. [28]

    Learning to (learn at test time): RNNs with expressive hidden states

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states. InForty-second Inter- national Conference on Machine Learning, 2025. URLhttps://openreview.net/forum? id=wXfuOj9C7L

  29. [29]

    Structured linear CDEs: Maximally expressive and parallel-in-time sequence models

    Benjamin Walker, Lingyi Yang, Nicola Muca Cirone, Cristopher Salvi, and Terry Lyons. Structured linear CDEs: Maximally expressive and parallel-in-time sequence models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=HKDyRDzy1E

  30. [30]

    Gated delta networks: Improving mamba2 with delta rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=r8H7xhYPwz

  31. [31]

    Block-biased mamba for long-range sequence processing

    Annan Yu and N. Benjamin Erichson. Block-biased mamba for long-range sequence processing. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=5WKEH9LhAQ

  32. [32]

    Gated slot attention for efficient linear-time sequence modeling

    Yu Zhang, Songlin Yang, Rui-Jie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, and Guohong Fu. Gated slot attention for efficient linear-time sequence modeling. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=jY4PhQibmg

  33. [33]

    Parallel algorithms

    Guy E. Blelloch and Bruce M. Maggs. Parallel algorithms. In Mikhail J. Atallah and Marina Blanton, editors, Algorithms and Theory of Computation Handbook, chapter 25. Chapman & Hall/CRC, 2010

  34. [34]

    The expressive capacity of state space models: A formal language perspective

    Yash Sarrof, Yana Veitsman, and Michael Hahn. The expressive capacity of state space models: A formal language perspective. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=eV5YIrJPdy

  35. [35]

    Metric automata theory: A unifying theory of RNNs

    Adam Dankowiakowski and Alessandro Ronca. Metric automata theory: A unifying theory of RNNs. InNeurIPS, 2025

  36. [36]

    xLSTM 7b: A recurrent LLM for fast and efficient inference

    Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M Blies, Günter Klambauer, Sebastian Böck, and Sepp Hochreiter. xLSTM 7b: A recurrent LLM for fast and efficient inference. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=LV3DpKD08B

  37. [37]

    Approximation capabilities of multilayer feedforward networks

    Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991. doi: 10.1016/0893-6080(91)90009-T. URL https://doi.org/10.1016/0893-6080(91)90009-T

  38. [38]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

  39. [39]

    Theories of abstract automata

    Michael A. Arbib. Theories of abstract automata. Prentice-Hall series in automatic computation. Prentice-Hall, Englewood Cliffs, N.J, 1969

  40. [40]

    Juris Hartmanis and R. E. Stearns.Algebraic structure theory of sequential machines. Prentice- Hall international series in applied mathematics. Prentice-Hall, Englewood Cliffs, N.J, 1966

  41. [41]

    Scans as primitive parallel operations

    G.E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, 38(11):1526–1538, 1989. doi: 10.1109/12.42122

  42. [42]

    Prefix sums and their applications

    Guy E Blelloch. Prefix sums and their applications. 1990

  43. [43]

    A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE transactions on computers, 100(8):786–793, 2009

    Peter M Kogge and Harold S Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE transactions on computers, 100(8):786–793, 2009

  44. [44]

    On correctness of automatic differentiation for non-differentiable functions.Advances in neural information processing systems, 33:6719–6730, 2020

    Wonyeol Lee, Hangyeol Yu, Xavier Rival, and Hongseok Yang. On correctness of automatic differentiation for non-differentiable functions.Advances in neural information processing systems, 33:6719–6730, 2020

  45. [45]

    Convex analysis

    R Tyrrell Rockafellar. Convex analysis, volume 28. Princeton university press, 1997

  46. [46]

    Autograd mechanics

    PyTorch. Autograd mechanics. https://docs.pytorch.org/docs/stable/notes/ autograd.html, 2026. Accessed: 2026-04-18

  47. [47]

    Learning long-term dependencies with gradient descent is difficult.IEEE transactions on neural networks, 5(2):157–166, 1994

    Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult.IEEE transactions on neural networks, 5(2):157–166, 1994

  48. [48]

    In-context learning and induction heads.Transformer Circuits Thread, 2022

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a...

  49. [49]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Des- maison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-per...

  50. [50]

    URL https://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library

    Curran Associates, Inc., 2019. URL https://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library

  51. [51]

    Andrej Karpathy. nanogpt. https://github.com/karpathy/nanoGPT, 2022. GitHub repository. Accessed: 2026-05-04

  52. [52]

    Openwebtext corpus

    Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/ OpenWebTextCorpus, 2019

  53. [53]

    Algebraic Theory of Automata

    Abraham Ginzburg. Algebraic Theory of Automata. Academic Press, 1968

  54. [54]

    Algebraic theory of automata networks: An introduction

    Pál Dömösi and Chrystopher L Nehaniv. Algebraic theory of automata networks: An introduction. SIAM, 2005

  55. [55]

    Complexity and Real Computation

    Lenore Blum, Felipe Cucker, Michael Shub, and Steve Smale. Complexity and Real Computation. Springer, 1998. doi: 10.1007/978-1-4612-0701-6

  56. [56]

    Floating-point neural networks can represent almost all floating-point functions

    Geonho Hwang, Yeachan Park, Wonyeol Lee, and Sejun Park. Floating-point neural networks can represent almost all floating-point functions. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 26412–26432. PMLR, 13–19 Jul 2025

  57. [57]

    Introduction to algorithms

    Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2022