An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
Pith reviewed 2026-05-09 22:40 UTC · model grok-4.3
The pith
S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive an exact operator expression for the full forward pass of S4D, yielding an analytical characterization of its complete input-output map. This expression reveals that the nonlinear decoder in the system induces interactions between the information-carrying waves of activity, and that these interactions enable classifying real-world sequences.
Load-bearing premise
The diagonal linear time-invariant implementation of S4 can be exactly embedded into a ring network topology in which inputs are encoded as waves of activity, and this embedding preserves the full computation without loss or approximation.
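The paper's actual embedding is not reproduced on this page. As a minimal sketch of why such an embedding can be exact, the code below uses one standard construction: conjugating a diagonal update by the unitary discrete Fourier transform gives a circulant (ring-structured) update with identical trajectories up to a change of basis. The names `lam`, `B`, `u`, `ring_from_diagonal`, and all numeric values are illustrative assumptions, not the authors' notation or parameters.

```python
import cmath

def dft_matrix(n):
    """Unitary DFT matrix F with F[j][k] = exp(-2*pi*i*j*k/n) / sqrt(n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) / n ** 0.5 for k in range(n)]
            for j in range(n)]

def conj_transpose(M):
    return [[M[j][i].conjugate() for j in range(len(M))] for i in range(len(M[0]))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def ring_from_diagonal(lam):
    """Return (W, F) with W = F^H diag(lam) F, a circulant coupling matrix."""
    n = len(lam)
    F = dft_matrix(n)
    D = [[lam[i] if i == j else 0j for j in range(n)] for i in range(n)]
    return matmul(conj_transpose(F), matmul(D, F)), F

# Hypothetical parameters: 5 decaying complex modes driven by a scalar input.
n = 5
lam = [cmath.rect(0.9, 0.4 * k) for k in range(n)]
B = [1 + 0j] * n
u = [1.0, 0.0, -0.5, 2.0]

W, F = ring_from_diagonal(lam)
ring_input = matvec(conj_transpose(F), B)  # input weights in ring coordinates

x = [0j] * n  # state in the diagonal (mode) basis
z = [0j] * n  # state in the ring (node) basis
for u_k in u:
    x = [l * x_n + b * u_k for l, x_n, b in zip(lam, x, B)]
    z = [w + r * u_k for w, r in zip(matvec(W, z), ring_input)]

# Mapping the ring state back with F should recover the diagonal state.
mode_error = max(abs(a - b) for a, b in zip(matvec(F, z), x))
# A circulant matrix is unchanged when both indices shift by one around the ring.
shift_error = max(abs(W[a][b] - W[(a + 1) % n][(b + 1) % n])
                  for a in range(n) for b in range(n))
```

Both `mode_error` and `shift_error` should sit at floating-point round-off. With the all-ones `B` chosen here, the ring-side input weights concentrate on a single node, which is one way to picture input being injected at one site and spreading around the ring.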
Original abstract
We establish a mathematical correspondence between state space models, a state-of-the-art architecture for capturing long-range dependencies in data, and an exactly solvable nonlinear oscillator network. As a specific example of this general correspondence, we analyze the diagonal linear time-invariant implementation of the Structured State Space Sequence model (S4). The correspondence embeds S4D, a specific implementation of S4, into a ring network topology, in which recent inputs are encoded as waves of activity traveling over the one-dimensional spatial layout of the network. We then derive an exact operator expression for the full forward pass of S4D, yielding an analytical characterization of its complete input-output map. This expression reveals that the nonlinear decoder in the system induces interactions between these information-carrying waves that enable classifying real-world sequences. These results generalize across modern SSM architectures and show that they admit an exact mathematical description with a clear physical interpretation. These insights enable a new level of interpretability for these systems in terms of nonlinear oscillator networks.
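To make "diagonal linear time-invariant" concrete, here is a minimal sketch of the linear core of an S4D-style layer, assuming an already discretized update and a scalar real input. It checks that the step-by-step recurrence and the equivalent causal convolution give the same output. The names `lam`, `B`, `C`, `u` and the numeric values are illustrative assumptions, and the nonlinear decoder discussed in the abstract is deliberately left out.

```python
import cmath

def ssm_recurrent(lam, B, C, u):
    """Run x_k = lam * x_{k-1} + B * u_k elementwise; read out y_k = Re(sum_n C_n x_{k,n})."""
    x = [0j] * len(lam)
    ys = []
    for u_k in u:
        x = [l * x_n + b * u_k for l, x_n, b in zip(lam, x, B)]
        ys.append(sum(c * x_n for c, x_n in zip(C, x)).real)
    return ys

def ssm_kernel(lam, B, C, length):
    """Convolution kernel K_j = Re(sum_n C_n * lam_n**j * B_n) for j = 0..length-1."""
    return [sum(c * (l ** j) * b for c, l, b in zip(C, lam, B)).real
            for j in range(length)]

def ssm_convolutional(lam, B, C, u):
    """Same map written as a causal convolution y_k = sum_j K_j * u_{k-j}."""
    K = ssm_kernel(lam, B, C, len(u))
    return [sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]

# Hypothetical parameters: two decaying, rotating complex modes.
lam = [cmath.rect(0.9, 0.5), cmath.rect(0.8, -1.2)]
B = [1 + 0j, 0.5 + 0j]
C = [0.3 - 0.2j, 1 + 0.1j]
u = [1.0, 0.0, -2.0, 0.5]

y_rec = ssm_recurrent(lam, B, C, u)
y_conv = ssm_convolutional(lam, B, C, u)
max_gap = max(abs(a - b) for a, b in zip(y_rec, y_conv))
```

Because the state matrix is diagonal, each mode evolves independently as a geometric sequence, which is what lets the whole linear map be written down in closed form.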
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
No circularity: derivation presents independent mathematical embedding and operator derivation.
full rationale
The provided abstract and context describe establishing a correspondence by embedding S4D into a ring network of oscillators and deriving an exact operator for the forward pass. No quoted equations or steps reduce the claimed result to a re-expression of fitted parameters, self-citations, or ansatzes by construction. The embedding is asserted to preserve the computation exactly, and the operator is presented as newly derived from that structure. Absent specific quotes that exhibit such a reduction (e.g., Eq. X = input by definition), no circularity is identified. This is the expected outcome for a self-contained mathematical correspondence paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The diagonal linear time-invariant S4D implementation admits an exact embedding into a ring network of nonlinear oscillators that preserves the full forward pass.
Reference graph
Works this paper leans on
- [1] and in trained recurrent neural networks [29]. It has previously been recognized that this property can be a useful way to store long-term dependencies directly in a network’s activity structure [3, 30], but has not previously been expressed in a direct mathematical form. We can now show that, when driven by input, S4D indeed stores information about th...
- [2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems, Vol. 30 (2017)
- [3] D. Bahdanau, Neural machine translation by jointly learning to align and translate, arXiv:1409.0473 (2014)
- [4]
- [5]
- [6] R. Child, Generating long sequences with sparse transformers, arXiv:1904.10509 (2019)
- [7] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, Transformers are RNNs: Fast autoregressive transformers with linear attention, in International Conference on Machine Learning (2020)
- [8] A. Gu, K. Goel, and C. Ré, Efficiently modeling long sequences with structured state spaces, arXiv:2111.00396 (2021)
- [9] A. Gu, K. Goel, A. Gupta, and C. Ré, On the parameterization and initialization of diagonal state space models, Advances in Neural Information Processing Systems 35 (2022)
- [10] A. Gu and T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, arXiv:2312.00752 (2023)
- [11] J. T. H. Smith, A. Warrington, and S. W. Linderman, Simplified state space layers for sequence modeling, arXiv:2208.04933 (2022)
- [12] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De, Resurrecting recurrent neural networks for long sequences, in International Conference on Machine Learning (PMLR, 2023)
- [13]
- [14] S. Wang and B. Xue, State-space models with layer-wise nonlinearity are universal approximators with exponential decaying memory, in Advances in Neural Information Processing Systems, Vol. 36 (2023)
- [15] N. Muca Cirone, A. Orvieto, B. Walker, C. Salvi, and T. Lyons, Theoretical foundations of deep selective state-space models, in Advances in Neural Information Processing Systems, Vol. 37 (2024)
- [16]
- [17] R. C. Budzinski, A. N. Busch, S. Mestern, E. Martin, L. H. B. Liboni, F. W. Pasini, J. Mináč, T. Coleman, W. Inoue, and L. E. Muller, An exact mathematical description of computation with transient spatiotemporal dynamics in a complex-valued neural network, Communications Physics 7, 239 (2024)
- [18]
- [19] S. H. Strogatz and R. E. Mirollo, Collective synchronisation in lattices of nonlinear oscillators with randomness, Journal of Physics A: Mathematical and General 21, L699 (1988)
- [20] D. M. Abrams and S. H. Strogatz, Chimera states for coupled oscillators, Physical Review Letters 93, 174102 (2004)
- [21] L. H. B. Liboni, R. C. Budzinski, A. N. Busch, S. Löwe, T. A. Keller, M. Welling, and L. E. Muller, Image segmentation with traveling waves in an exactly solvable recurrent neural network, Proceedings of the National Academy of Sciences 122, e2321319121 (2025)
- [22] P. J. Davis, Circulant Matrices (Wiley, 1979)
- [23] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler, Long range arena: A benchmark for efficient transformers, in International Conference on Learning Representations (2021)
- [24] R. C. Budzinski, T. T. Nguyen, J. Doàn, J. Mináč, T. J. Sejnowski, and L. E. Muller, Geometry unites synchrony, chimeras, and waves in nonlinear oscillator networks, Chaos: An Interdisciplinary Journal of Nonlinear Science 32, 031104 (2022)
- [25] R. C. Budzinski, T. T. Nguyen, G. B. Benigno, J. Doàn, J. Mináč, T. J. Sejnowski, and L. E. Muller, Analytical prediction of specific spatiotemporal patterns in nonlinear oscillator networks with distance-dependent time delays, Physical Review Research 5, 013159 (2023)
- [26]
- [27] G. B. Benigno, R. C. Budzinski, Z. W. Davis, J. H. Reynolds, and L. Muller, Waves traveling over a map of visual space can ignite short-term predictions of sensory input, Nature Communications 14, 3409 (2023)
- [28] T. A. Keller, L. Muller, T. Sejnowski, and M. Welling, Traveling waves encode the recent past and enhance sequence learning, in ICLR (2024)
- [29] S. Perrard and M. Labousse, Transition to chaos in wave memory dynamics in a harmonic well: Deterministic and noise-driven behavior, Chaos: An Interdisciplinary Journal of Nonlinear Science 28 (2018)
- [30] T. A. Keller and M. Welling, Neural wave machines: learning spatiotemporally structured representations with locally coupled oscillatory recurrent neural networks, in International Conference on Machine Learning (2023)
- [31] T. A. Keller, L. Muller, T. J. Sejnowski, and M. Welling, A spatiotemporal perspective on dynamical computation in neural information processing systems, arXiv (2026)
- [32] T. Carleman, Application de la theorie des polynomes orthogonaux a un probleme de la theorie des fonctions analytiques, Arkiv för Matematik, Astronomi och Fysik 17, 1 (1932)
- [33] A. Bagnall, H. A. Dau, J. Levy, G. Forestier, C. Hou, G. Jehan, and L. Ye, The UEA multivariate time series classification archive, 2018, arXiv:1811.00075 (2018)
- [34]
- [35] A. M. Saxe, J. L. McClelland, and S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, arXiv:1312.6120 (2013)
- [36] D. J. Heeger and W. E. Mackey, Oscillatory recurrent gated neural integrator circuits (ORGaNICs), a unifying theoretical framework for neural dynamics, Proceedings of the National Academy of Sciences 116, 22783 (2019)
- [37]
- [38]
- [39] A. Karuvally, T. J. Sejnowski, and H. T. Siegelmann, Hidden traveling waves bind working memory variables in recurrent neural networks, arXiv:2402.10163 (2024)
- [40] S. Muzellec, A. Alamia, T. Serre, and R. VanRullen, Enhancing deep neural networks through complex-valued representations and Kuramoto synchronization dynamics, arXiv:2502.21077 (2025)
- [41] T. A. Engel and N. A. Steinmetz, New perspectives on dimensionality and variability from large-scale cortical dynamics, Current Opinion in Neurobiology 58, 181 (2019)
- [42] J. D. Hart, L. Larger, T. E. Murphy, and R. Roy, Delayed dynamical systems: networks, chimeras and reservoir computing, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 377, 20180389 (2019)
- [43]
- [44] S. K. Tavakoli and A. Longtin, Boosting reservoir computer performance with multiple delays, Physical Review E 109, 054203 (2024)
- [45] S. Marzen, Time delays improve performance of certain neural networks, Physics 17, 111 (2024)
- [46] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, Progress measures for grokking via mechanistic interpretability, in International Conference on Learning Representations (2023)

I. APPENDIX
A. Closed-form diagonalization of circulant operators
Let $C \in \mathbb{C}^{N \times N}$ be a circulant matrix generated by the vector $c = (c_1, c_2, \ldots, c_N)$, such that each ...
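The appendix fragment on circulant operators is cut off on this page, so the following is only a sketch of the standard textbook result it appears to introduce: a circulant matrix is diagonalized by Fourier modes, and its eigenvalues are a discrete Fourier transform of the generating vector. The indexing convention (first row equal to `c`, each later row a cyclic right shift of the one above), the sign of the exponent, and the example vector are assumptions and may differ from the paper's.

```python
import cmath

def circulant(c):
    """Circulant matrix whose first row is c; row i is c cyclically shifted right by i."""
    n = len(c)
    return [[c[(j - i) % n] for j in range(n)] for i in range(n)]

def circulant_eigenpairs(c):
    """Closed-form eigenpairs: v_k[j] = exp(2*pi*i*j*k/n), lam_k = sum_m c[m] * v_k[m]."""
    n = len(c)
    pairs = []
    for k in range(n):
        v = [cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)]
        lam = sum(c[m] * v[m] for m in range(n))
        pairs.append((lam, v))
    return pairs

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Hypothetical ring coupling: each node listens to the nodes one and four steps ahead.
c = [0.0, 1.0, 0.0, 0.0, 0.5]
M = circulant(c)

# Largest entry of M v - lam v over all closed-form eigenpairs.
max_residual = max(
    abs(mv_i - lam * v_i)
    for lam, v in circulant_eigenpairs(c)
    for mv_i, v_i in zip(matvec(M, v), v)
)
```

The residual should be at round-off level, confirming that no numerical eigensolver is needed: the spectrum of any such ring coupling can be read off directly from its generating vector.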