Recognition: unknown
An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
Pith reviewed 2026-05-09 22:40 UTC · model grok-4.3
The pith
S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive an exact operator expression for the full forward pass of S4D, yielding an analytical characterization of its complete input-output map. This expression reveals that the nonlinear decoder induces interactions between the information-carrying waves that encode recent inputs, and that these interactions enable classifying real-world sequences.
Load-bearing premise
The diagonal linear time-invariant implementation of S4 can be exactly embedded into a ring network topology in which inputs are encoded as waves of activity, and this embedding preserves the full computation without loss or approximation.
Original abstract
We establish a mathematical correspondence between state space models, a state-of-the-art architecture for capturing long-range dependencies in data, and an exactly solvable nonlinear oscillator network. As a specific example of this general correspondence, we analyze the diagonal linear time-invariant implementation of the Structured State Space Sequence model (S4). The correspondence embeds S4D, a specific implementation of S4, into a ring network topology, in which recent inputs are encoded as waves of activity traveling over the one-dimensional spatial layout of the network. We then derive an exact operator expression for the full forward pass of S4D, yielding an analytical characterization of its complete input-output map. This expression reveals that the nonlinear decoder in the system induces interactions between these information-carrying waves that enable classifying real-world sequences. These results generalize across modern SSM architectures and show that they admit an exact mathematical description with a clear physical interpretation. These insights enable a new level of interpretability for these systems in terms of nonlinear oscillator networks.
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
No circularity: the paper presents an independent mathematical embedding and derives the operator expression from it.
Full rationale
The provided abstract and context describe establishing a correspondence by embedding S4D into a ring network of oscillators and deriving an exact operator for the forward pass. No quoted equations or steps reduce the claimed result to a re-expression of fitted parameters, self-citations, or ansatzes by construction. The embedding is asserted to preserve the computation exactly, and the operator is presented as newly derived from that structure. Per hard rules, absent specific quotes exhibiting reduction (e.g., Eq. X = input by definition), no circularity is identified. This is the expected outcome for a self-contained mathematical correspondence paper.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The diagonal linear time-invariant S4D implementation admits an exact embedding into a ring network of nonlinear oscillators that preserves the full forward pass.
Reference graph
Works this paper leans on
- [1] Context excerpt (citation not extracted): "...and in trained recurrent neural networks [29]. It has previously been recognized that this property can be a useful way to store long-term dependencies directly in a network's activity structure [3, 30], but has not previously been expressed in a direct mathematical form. We can now show that, when driven by input, S4D indeed stores information about th..."
- [2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems, Vol. 30 (2017).
- [3] D. Bahdanau, Neural machine translation by jointly learning to align and translate, arXiv:1409.0473 (2014).
- [4] L. Muller, P. S. Churchland, and T. J. Sejnowski, Transformers and cortical waves: encoders for pulling in context across time, Trends in Neurosciences (2024).
- [5]
- [6] R. Child, Generating long sequences with sparse transformers, arXiv:1904.10509 (2019).
- [7] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, Transformers are RNNs: fast autoregressive transformers with linear attention, in International Conference on Machine Learning (2020).
- [8] A. Gu, K. Goel, and C. Ré, Efficiently modeling long sequences with structured state spaces, arXiv:2111.00396 (2021).
- [9] A. Gu, K. Goel, A. Gupta, and C. Ré, On the parameterization and initialization of diagonal state space models, Advances in Neural Information Processing Systems 35 (2022).
- [10] A. Gu and T. Dao, Mamba: linear-time sequence modeling with selective state spaces, arXiv:2312.00752 (2023).
- [11]
- [12] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De, Resurrecting recurrent neural networks for long sequences, in International Conference on Machine Learning (PMLR, 2023).
- [13] N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, et al., A mathematical framework for transformer circuits, Transformer Circuits Thread 1, 12 (2021).
- [14] S. Wang and B. Xue, State-space models with layer-wise nonlinearity are universal approximators with exponential decaying memory, in Advances in Neural Information Processing Systems, Vol. 36 (2023).
- [15] N. Muca Cirone, A. Orvieto, B. Walker, C. Salvi, and T. Lyons, Theoretical foundations of deep selective state-space models, in Advances in Neural Information Processing Systems, Vol. 37 (2024).
- [16] L. Muller, J. Mináč, and T. T. Nguyen, Algebraic approach to the Kuramoto model, Physical Review E 104, L022201 (2021).
- [17] R. C. Budzinski, A. N. Busch, S. Mestern, E. Martin, L. H. B. Liboni, F. W. Pasini, J. Mináč, T. Coleman, W. Inoue, and L. E. Muller, An exact mathematical description of computation with transient spatiotemporal dynamics in a complex-valued neural network, Communications Physics 7, 239 (2024).
- [18] A. Gupta, A. Gu, and J. Berant, Diagonal state spaces are as effective as structured state spaces, in Advances in Neural Information Processing Systems, Vol. 35 (2022).
- [19] S. H. Strogatz and R. E. Mirollo, Collective synchronisation in lattices of nonlinear oscillators with randomness, Journal of Physics A: Mathematical and General 21, L699 (1988).
- [20] D. M. Abrams and S. H. Strogatz, Chimera states for coupled oscillators, Physical Review Letters 93, 174102 (2004).
- [21] L. H. B. Liboni, R. C. Budzinski, A. N. Busch, S. Löwe, T. A. Keller, M. Welling, and L. E. Muller, Image segmentation with traveling waves in an exactly solvable recurrent neural network, Proceedings of the National Academy of Sciences 122, e2321319121 (2025).
- [22] P. J. Davis, Circulant Matrices (Wiley, 1979).
- [23] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler, Long range arena: a benchmark for efficient transformers, in International Conference on Learning Representations (2021).
- [24] R. C. Budzinski, T. T. Nguyen, J. Doàn, J. Mináč, T. J. Sejnowski, and L. E. Muller, Geometry unites synchrony, chimeras, and waves in nonlinear oscillator networks, Chaos: An Interdisciplinary Journal of Nonlinear Science 32, 031104 (2022).
- [25] R. C. Budzinski, T. T. Nguyen, G. B. Benigno, J. Doàn, J. Mináč, T. J. Sejnowski, and L. E. Muller, Analytical prediction of specific spatiotemporal patterns in nonlinear oscillator networks with distance-dependent time delays, Physical Review Research 5, 013159 (2023).
- [26] L. Muller, F. Chavane, J. Reynolds, and T. J. Sejnowski, Cortical travelling waves: mechanisms and computational principles, Nature Reviews Neuroscience 19, 255 (2018).
- [27] G. B. Benigno, R. C. Budzinski, Z. W. Davis, J. H. Reynolds, and L. Muller, Waves traveling over a map of visual space can ignite short-term predictions of sensory input, Nature Communications 14, 3409 (2023).
- [28] T. A. Keller, L. Muller, T. Sejnowski, and M. Welling, Traveling waves encode the recent past and enhance sequence learning, in ICLR (2024).
- [29] S. Perrard and M. Labousse, Transition to chaos in wave memory dynamics in a harmonic well: deterministic and noise-driven behavior, Chaos: An Interdisciplinary Journal of Nonlinear Science 28 (2018).
- [30] T. A. Keller and M. Welling, Neural wave machines: learning spatiotemporally structured representations with locally coupled oscillatory recurrent neural networks, in International Conference on Machine Learning (2023).
- [31] T. A. Keller, L. Muller, T. J. Sejnowski, and M. Welling, A spatiotemporal perspective on dynamical computation in neural information processing systems, arXiv (2026).
- [32] T. Carleman, Application de la théorie des polynômes orthogonaux à un problème de la théorie des fonctions analytiques, Arkiv för Matematik, Astronomi och Fysik 17, 1 (1932).
- [33] A. Bagnall, H. A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, and E. Keogh, The UEA multivariate time series classification archive, 2018, arXiv:1811.00075 (2018).
- [34] A. Amini, C. Zheng, Q. Sun, and N. Motee, Carleman linearization of nonlinear systems and its finite-section approximations, Discrete and Continuous Dynamical Systems - B 30, 577 (2025).
- [35] A. M. Saxe, J. L. McClelland, and S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, arXiv:1312.6120 (2013).
- [36] D. J. Heeger and W. E. Mackey, Oscillatory recurrent gated neural integrator circuits (ORGaNICs), a unifying theoretical framework for neural dynamics, Proceedings of the National Academy of Sciences 116, 22783 (2019).
- [37]
- [38]
- [39] A. Karuvally, T. J. Sejnowski, and H. T. Siegelmann, Hidden traveling waves bind working memory variables in recurrent neural networks, arXiv:2402.10163 (2024).
- [40] S. Muzellec, A. Alamia, T. Serre, and R. VanRullen, Enhancing deep neural networks through complex-valued representations and Kuramoto synchronization dynamics, arXiv:2502.21077 (2025).
- [41] T. A. Engel and N. A. Steinmetz, New perspectives on dimensionality and variability from large-scale cortical dynamics, Current Opinion in Neurobiology 58, 181 (2019).
- [42] J. D. Hart, L. Larger, T. E. Murphy, and R. Roy, Delayed dynamical systems: networks, chimeras and reservoir computing, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 377, 20180389 (2019).
- [43] Y. Ebato, K. Nakajima, and R. Masuda, Impact of time-history terms on reservoir dynamics and prediction accuracy in echo state networks, Scientific Reports 14, 8871 (2024).
- [44] S. K. Tavakoli and A. Longtin, Boosting reservoir computer performance with multiple delays, Physical Review E 109, 054203 (2024).
- [45] S. Marzen, Time delays improve performance of certain neural networks, Physics 17, 111 (2024).
- [46] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, Progress measures for grokking via mechanistic interpretability, in International Conference on Learning Representations (2023).