State-Space NTK Collapse Near Bifurcations
Pith reviewed 2026-05-14 20:59 UTC · model grok-4.3
The pith
Near bifurcations, the state-space NTK reduces to a rank-one operator matching the learning geometry of a classical normal form.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bifurcations both dominate and simplify learning dynamics: near a bifurcation, the sNTK reduces to a rank-one operator corresponding to learning in a classical normal-form system, giving an analytically tractable description of the local learning geometry even for high-dimensional recurrent systems. Concretely, the sNTK decomposes into bifurcation-relevant and residual channels; near common codimension-1 bifurcations the relevant channel is rank-one and highly amplified, so it dominates the full kernel and funnels gradient descent into a few critical dynamical directions whose geometry matches the normal form.
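A minimal numerical sketch of the kind of object involved: an empirical tangent kernel over a state trajectory, K = J Jᵀ, with J the Jacobian of the flattened trajectory with respect to the parameters, built here by finite differences for a toy two-unit RNN whose recurrent matrix sits near spectral radius one. The toy dynamics, the finite-difference Jacobian, and the flattened-trajectory output are illustrative assumptions; the paper's precise sNTK definition is not reproduced here.

```python
import numpy as np

def rnn_states(theta, T=30):
    # Toy 2-unit RNN x_{t+1} = tanh(W x_t); theta holds the 4 entries of W.
    W = np.asarray(theta, dtype=float).reshape(2, 2)
    x = np.array([0.1, -0.05])
    traj = []
    for _ in range(T):
        x = np.tanh(W @ x)
        traj.append(x)
    return np.concatenate(traj)        # flattened state trajectory, length 2T

def empirical_state_kernel(f, theta, eps=1e-5):
    # Empirical tangent kernel over state-space outputs, K = J J^T, with J
    # estimated by central finite differences. A generic stand-in for the
    # paper's sNTK construction, not its actual definition.
    theta = np.asarray(theta, dtype=float)
    cols = []
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        cols.append((f(tp) - f(tm)) / (2 * eps))
    J = np.stack(cols, axis=1)         # (outputs, params)
    return J @ J.T

theta = [0.95, 0.10, 0.10, 0.95]       # spectral radius near 1: close to criticality
K = empirical_state_kernel(rnn_states, theta)
lam = np.linalg.eigvalsh(K)[::-1]      # eigenvalues, descending
print("top eigenvalue share:", lam[0] / lam.sum())
```

With only four parameters the kernel has rank at most four regardless of the 60-dimensional output; the point of the sketch is the object itself, not the rank-collapse phenomenon, which concerns the spectrum of this kernel as a bifurcation is approached.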
What carries the argument
The empirical state-space neural tangent kernel (sNTK), decomposed into a dominant rank-one bifurcation channel and residual channels via local linearization around the transition point.
If this is right
- The bifurcation channel amplifies and dominates the full sNTK, warping the learning landscape toward a few critical dynamical directions.
- Local loss geometry and gradient steps near the transition become predictable from the classical normal form of the bifurcation.
- In student-teacher RNNs the first learned bifurcation produces a sharp drop in sNTK effective rank together with a dominant parameter direction.
- Low-rank natural gradient methods remove the resulting instability with negligible extra cost over SGD.
Where Pith is reading between the lines
- Monitoring the effective rank of the sNTK during training could serve as an online diagnostic for impending dynamical transitions.
- The same reduction may apply to other architectures that exhibit qualitative changes in internal dynamics during learning.
- Curricula or regularizers could be designed to steer networks toward or away from specific bifurcations to control stability.
- The rank-one structure suggests that effective low-dimensional descriptions of learning may exist more broadly near any point where the Jacobian spectrum crosses the imaginary axis.
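The online effective-rank diagnostic suggested above is cheap to monitor. One common proxy (an assumption here; the paper may define effective rank differently) is the participation ratio of the kernel spectrum, (Σλ)² / Σλ². On a synthetic kernel, amplifying a single rank-one channel drives this quantity toward one:

```python
import numpy as np

def effective_rank(K):
    # Participation ratio (sum lam)^2 / sum(lam^2) of the kernel spectrum —
    # one standard effective-rank proxy; the paper's definition may differ.
    lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(0)
v = rng.standard_normal(50)
residual = rng.standard_normal((50, 50))
residual = residual @ residual.T / 50      # generic PSD residual, O(1) spectrum

# As the rank-one channel's amplification grows, effective rank collapses.
for amp in [1.0, 10.0, 100.0]:
    K = amp * np.outer(v, v) + residual
    print(f"amp={amp:>5.0f}  effective rank = {effective_rank(K):.2f}")
```

Tracking this scalar over training steps would flag an impending transition as a sharp drop, without any eigendecomposition beyond the kernel itself.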
Load-bearing premise
The local linearization around the bifurcation point and the separation into bifurcation-relevant versus residual channels remain valid and dominant inside the full nonlinear high-dimensional dynamics.
What would settle it
In a high-dimensional nonlinear recurrent network approaching a pitchfork or Hopf bifurcation, the effective rank of the sNTK fails to collapse and no single parameter direction emerges whose restricted kernel matches the scalar normal-form prediction.
Original abstract
Rich feature learning in tasks that unfold over time often requires the model to pass through bifurcations, constituting qualitative changes in the underlying model dynamics. We develop a local theory of gradient descent near these transitions through the empirical state-space neural tangent kernel (sNTK). Our central finding is that bifurcations both dominate and simplify learning dynamics: near bifurcations, we can reduce sNTK to a rank-one operator corresponding to learning in a classical normal form system, providing an analytically tractable description of the local learning geometry, even for high-dimensional recurrent systems. Concretely, we give a procedure for decomposing sNTK into bifurcation-relevant and residual channels, showing that near commonly codimension-1 bifurcations the relevant channel is a rank-one operator that is highly amplified. This amplification causes the bifurcation channel to dominate the full sNTK. Thus, bifurcations locally warp the learning landscape, funneling gradient descent into a few critical dynamical directions and making the nearby kernel and loss geometry predictable from classical normal forms. We illustrate this in a student-teacher recurrent neural network: the first learned bifurcation coincides with a sharp collapse in sNTK effective rank and the emergence of a dominant parameter direction whose restricted sNTK closely matches the landscape predicted by the scalar pitchfork normal form. Finally, we show that low-rank natural gradient methods resolve the resulting learning instability near bifurcations with very little overhead over SGD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a local theory of gradient descent near bifurcations in recurrent systems via the empirical state-space neural tangent kernel (sNTK). Its central claim is that near codimension-1 bifurcations the sNTK reduces to a rank-one operator corresponding to learning in a classical normal-form system; this operator is highly amplified, dominates the full kernel, and thereby warps the local learning geometry into a few critical dynamical directions that are analytically tractable even for high-dimensional RNNs. The reduction is illustrated numerically in a student-teacher RNN, where the first learned bifurcation produces a sharp collapse in sNTK effective rank and a dominant parameter direction whose restricted kernel matches the scalar pitchfork normal form; low-rank natural gradient descent is shown to mitigate the resulting instability.
Significance. If the reduction and dominance can be placed on rigorous footing, the work supplies a concrete link between bifurcation theory and neural tangent kernels that could explain feature-learning instabilities in recurrent models and guide the design of geometry-aware optimizers. The explicit decomposition procedure and the numerical demonstration of rank collapse are concrete strengths; the connection to normal forms offers falsifiable predictions for the local loss landscape.
major comments (2)
- [Decomposition procedure and §4 (numerical illustration)] The headline claim that the bifurcation channel dominates the sNTK (and therefore controls learning) requires a bound showing that the residual operator norm remains o(1) relative to the amplified rank-one term as the bifurcation parameter approaches criticality, even after all nonlinear interactions are restored. No such spectral or perturbation estimate is derived; the manuscript supplies only the decomposition procedure and numerical evidence restricted to one student-teacher RNN architecture.
- [Local theory development (near Eq. for rank-one operator)] The amplification factor of the bifurcation channel is listed among the free parameters in the supporting analysis. If this factor is fitted rather than derived parameter-free from the normal-form linearization, the reduction is not fully analytical and the claimed tractability is weakened.
minor comments (2)
- [Abstract] The abstract refers to 'commonly codimension-1 bifurcations' without enumerating them; a short explicit list (pitchfork, transcritical, Hopf, etc.) would clarify the scope.
- [Preliminaries] Notation for the state-space NTK versus the ordinary NTK is introduced without a dedicated comparison table; readers would benefit from one.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. We address each major comment below with clarifications on the derivations and the supporting numerical evidence.
Point-by-point responses
Referee: [Decomposition procedure and §4 (numerical illustration)] The headline claim that the bifurcation channel dominates the sNTK (and therefore controls learning) requires a bound showing that the residual operator norm remains o(1) relative to the amplified rank-one term as the bifurcation parameter approaches criticality, even after all nonlinear interactions are restored. No such spectral or perturbation estimate is derived; the manuscript supplies only the decomposition procedure and numerical evidence restricted to one student-teacher RNN architecture.
Authors: The decomposition procedure yields an exact separation of the empirical sNTK into the rank-one bifurcation channel (obtained from the normal-form linearization) and the residual operator. The amplification of the bifurcation channel is controlled by the distance to criticality in the normal form, which grows without bound as the bifurcation parameter approaches the critical value. While a general spectral bound on the residual (accounting for all nonlinear terms) is not derived, the local theory shows that the residual remains bounded while the bifurcation term diverges, implying dominance in a sufficiently small neighborhood of the bifurcation. The numerical experiments in §4 for the student-teacher RNN confirm that the effective rank collapses and the residual contribution is negligible near the learned bifurcation. In revision we will add explicit operator-norm comparisons in §4 and a remark stating the local regime of validity. revision: partial
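The bounded-residual-versus-diverging-channel argument in this response can be sanity-checked with a scalar stand-in: on the pitchfork branch x* = √μ the critical-mode sensitivity dx*/dμ = 1/(2√μ) diverges as μ → 0⁺, while a generic residual sensitivity is taken as O(1). This is an illustrative toy under those assumptions, not the paper's operator-norm comparison:

```python
import numpy as np

# Toy dominance check: amplified bifurcation channel vs. bounded residual.
# critical = dx*/dmu on the pitchfork branch x* = sqrt(mu); residual is an
# assumed O(1) stand-in for the residual channel's sensitivity.
for mu in [1e-1, 1e-2, 1e-3, 1e-4]:
    critical = 1.0 / (2.0 * np.sqrt(mu))   # diverges as mu -> 0+
    residual = 1.0                          # bounded (assumption)
    print(f"mu={mu:.0e}  dominance ratio = {critical / residual:.1f}")
```

The ratio grows without bound as the bifurcation parameter approaches criticality, which is the mechanism the authors invoke for dominance in a sufficiently small neighborhood.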
Referee: [Local theory development (near Eq. for rank-one operator)] The amplification factor of the bifurcation channel is listed among the free parameters in the supporting analysis. If this factor is fitted rather than derived parameter-free from the normal-form linearization, the reduction is not fully analytical and the claimed tractability is weakened.
Authors: The amplification factor is derived directly and parameter-free from the normal-form linearization. For the codimension-1 pitchfork case it equals the reciprocal of the unfolding parameter (or equivalently the real part of the critical eigenvalue of the Jacobian evaluated at the bifurcation point). This quantity is computed from the system dynamics without any fitting. We will revise the text surrounding the rank-one operator equation to display this explicit derivation from the linearization, making the parameter-free character clear. revision: yes
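The parameter-free character claimed here follows from a standard implicit-function-theorem computation on the pitchfork normal form (a sketch under textbook conventions; the paper's normalization may differ):

```latex
% Pitchfork normal form: \dot{x} = F(x,\mu) = \mu x - x^{3}.
% Nontrivial branch (\mu > 0): x^{*} = \pm\sqrt{\mu}, with critical eigenvalue
% \lambda(\mu) = \partial_x F(x^{*},\mu) = \mu - 3\mu = -2\mu.
% Implicit function theorem for the branch sensitivity (taking x^{*}=+\sqrt{\mu}):
\[
  \frac{dx^{*}}{d\mu}
  = -\,\frac{\partial_{\mu} F(x^{*},\mu)}{\partial_{x} F(x^{*},\mu)}
  = -\,\frac{x^{*}}{\lambda(\mu)}
  = \frac{\sqrt{\mu}}{2\mu}
  = \frac{1}{2\sqrt{\mu}} .
\]
% The sensitivity along the critical mode thus scales as the reciprocal of the
% distance to criticality (equivalently, of |\lambda(\mu)| up to a factor of 2),
% with no fitted constants — consistent with the rebuttal's claim.
```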
Circularity Check
No significant circularity; local reduction derived from normal-form linearization
Full rationale
The paper's central derivation proceeds by decomposing the empirical state-space NTK into bifurcation-relevant and residual channels using the local linearization around codimension-1 bifurcations and the structure of classical normal forms. The rank-one character and amplification of the relevant channel follow directly from the normal-form Jacobian and the projection onto the critical mode, without any parameter fitting that is then re-labeled as a prediction. Numerical results on the student-teacher RNN are presented as illustration and validation of the analytic reduction rather than as the source of the rank-one claim itself. No self-citations, ansatzes, or uniqueness theorems imported from prior author work are invoked as load-bearing steps in the provided chain, and the residual-channel suppression is treated as a consequence of the local theory rather than an input. The derivation therefore remains self-contained against the normal-form assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- amplification factor of bifurcation channel
axioms (2)
- domain assumption: Local linearization of the RNN dynamics around the bifurcation point yields a valid empirical NTK
- standard math: Codimension-1 bifurcations admit standard normal forms whose learning geometry is analytically tractable
invented entities (1)
- bifurcation-relevant channel of sNTK (no independent evidence)