State-Space NTK Collapse Near Bifurcations
Pith reviewed 2026-05-14 20:59 UTC · model grok-4.3
The pith
Near bifurcations, the state-space NTK reduces to a rank-one operator matching the learning geometry of a classical normal form.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bifurcations both dominate and simplify learning dynamics: near a bifurcation, the sNTK reduces to a rank-one operator corresponding to learning in a classical normal-form system, giving an analytically tractable description of the local learning geometry even for high-dimensional recurrent systems. Concretely, the sNTK decomposes into bifurcation-relevant and residual channels; near common codimension-1 bifurcations the relevant channel is rank-one and highly amplified, so it dominates the full kernel and funnels gradient descent into a few critical dynamical directions whose geometry matches the normal form.
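A minimal numerical sketch of the kind of object involved: an empirical tangent kernel over a state trajectory, K = J Jᵀ, with J the Jacobian of the flattened trajectory with respect to the parameters, built here by finite differences for a toy two-unit RNN whose recurrent matrix sits near spectral radius one. The toy dynamics, the finite-difference Jacobian, and the flattened-trajectory output are illustrative assumptions; the paper's precise sNTK definition is not reproduced here.

```python
import numpy as np

def rnn_states(theta, T=30):
    # Toy 2-unit RNN x_{t+1} = tanh(W x_t); theta holds the 4 entries of W.
    W = np.asarray(theta, dtype=float).reshape(2, 2)
    x = np.array([0.1, -0.05])
    traj = []
    for _ in range(T):
        x = np.tanh(W @ x)
        traj.append(x)
    return np.concatenate(traj)        # flattened state trajectory, length 2T

def empirical_state_kernel(f, theta, eps=1e-5):
    # Empirical tangent kernel over state-space outputs, K = J J^T, with J
    # estimated by central finite differences. A generic stand-in for the
    # paper's sNTK construction, not its actual definition.
    theta = np.asarray(theta, dtype=float)
    cols = []
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        cols.append((f(tp) - f(tm)) / (2 * eps))
    J = np.stack(cols, axis=1)         # (outputs, params)
    return J @ J.T

theta = [0.95, 0.10, 0.10, 0.95]       # spectral radius near 1: close to criticality
K = empirical_state_kernel(rnn_states, theta)
lam = np.linalg.eigvalsh(K)[::-1]      # eigenvalues, descending
print("top eigenvalue share:", lam[0] / lam.sum())
```

With only four parameters the kernel has rank at most four regardless of the 60-dimensional output; the point of the sketch is the object itself, not the rank-collapse phenomenon, which concerns the spectrum of this kernel as a bifurcation is approached.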
What carries the argument
The empirical state-space neural tangent kernel (sNTK), decomposed into a dominant rank-one bifurcation channel and residual channels via local linearization around the transition point.
If this is right
- The bifurcation channel amplifies and dominates the full sNTK, warping the learning landscape toward a few critical dynamical directions.
- Local loss geometry and gradient steps near the transition become predictable from the classical normal form of the bifurcation.
- In student-teacher RNNs the first learned bifurcation produces a sharp drop in sNTK effective rank together with a dominant parameter direction.
- Low-rank natural gradient methods remove the resulting instability with negligible extra cost over SGD.
Where Pith is reading between the lines
- Monitoring the effective rank of the sNTK during training could serve as an online diagnostic for impending dynamical transitions.
- The same reduction may apply to other architectures that exhibit qualitative changes in internal dynamics during learning.
- Curricula or regularizers could be designed to steer networks toward or away from specific bifurcations to control stability.
- The rank-one structure suggests that effective low-dimensional descriptions of learning may exist more broadly near any point where the Jacobian spectrum crosses the imaginary axis.
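The online effective-rank diagnostic suggested above is cheap to monitor. One common proxy (an assumption here; the paper may define effective rank differently) is the participation ratio of the kernel spectrum, (Σλ)² / Σλ². On a synthetic kernel, amplifying a single rank-one channel drives this quantity toward one:

```python
import numpy as np

def effective_rank(K):
    # Participation ratio (sum lam)^2 / sum(lam^2) of the kernel spectrum —
    # one standard effective-rank proxy; the paper's definition may differ.
    lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(0)
v = rng.standard_normal(50)
residual = rng.standard_normal((50, 50))
residual = residual @ residual.T / 50      # generic PSD residual, O(1) spectrum

# As the rank-one channel's amplification grows, effective rank collapses.
for amp in [1.0, 10.0, 100.0]:
    K = amp * np.outer(v, v) + residual
    print(f"amp={amp:>5.0f}  effective rank = {effective_rank(K):.2f}")
```

Tracking this scalar over training steps would flag an impending transition as a sharp drop, without any eigendecomposition beyond the kernel itself.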
Load-bearing premise
The local linearization around the bifurcation point and the separation into bifurcation-relevant versus residual channels remain valid and dominant inside the full nonlinear high-dimensional dynamics.
What would settle it
In a high-dimensional nonlinear recurrent network approaching a pitchfork or Hopf bifurcation, the effective rank of the sNTK fails to collapse and no single parameter direction emerges whose restricted kernel matches the scalar normal-form prediction.
Original abstract
Rich feature learning in tasks that unfold over time often requires the model to pass through bifurcations, constituting qualitative changes in the underlying model dynamics. We develop a local theory of gradient descent near these transitions through the empirical state-space neural tangent kernel (sNTK). Our central finding is that bifurcations both dominate and simplify learning dynamics: near bifurcations, we can reduce sNTK to a rank-one operator corresponding to learning in a classical normal form system, providing an analytically tractable description of the local learning geometry, even for high-dimensional recurrent systems. Concretely, we give a procedure for decomposing sNTK into bifurcation-relevant and residual channels, showing that near commonly codimension-1 bifurcations the relevant channel is a rank-one operator that is highly amplified. This amplification causes the bifurcation channel to dominate the full sNTK. Thus, bifurcations locally warp the learning landscape, funneling gradient descent into a few critical dynamical directions and making the nearby kernel and loss geometry predictable from classical normal forms. We illustrate this in a student-teacher recurrent neural network: the first learned bifurcation coincides with a sharp collapse in sNTK effective rank and the emergence of a dominant parameter direction whose restricted sNTK closely matches the landscape predicted by the scalar pitchfork normal form. Finally, we show that low-rank natural gradient methods resolve the resulting learning instability near bifurcations with very little overhead over SGD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a local theory of gradient descent near bifurcations in recurrent systems via the empirical state-space neural tangent kernel (sNTK). Its central claim is that near codimension-1 bifurcations the sNTK reduces to a rank-one operator corresponding to learning in a classical normal-form system; this operator is highly amplified, dominates the full kernel, and thereby warps the local learning geometry into a few critical dynamical directions that are analytically tractable even for high-dimensional RNNs. The reduction is illustrated numerically in a student-teacher RNN, where the first learned bifurcation produces a sharp collapse in sNTK effective rank and a dominant parameter direction whose restricted kernel matches the scalar pitchfork normal form; low-rank natural gradient descent is shown to mitigate the resulting instability.
Significance. If the reduction and dominance can be placed on rigorous footing, the work supplies a concrete link between bifurcation theory and neural tangent kernels that could explain feature-learning instabilities in recurrent models and guide the design of geometry-aware optimizers. The explicit decomposition procedure and the numerical demonstration of rank collapse are concrete strengths; the connection to normal forms offers falsifiable predictions for the local loss landscape.
major comments (2)
- [Decomposition procedure and §4 (numerical illustration)] The headline claim that the bifurcation channel dominates the sNTK (and therefore controls learning) requires a bound showing that the residual operator norm remains o(1) relative to the amplified rank-one term as the bifurcation parameter approaches criticality, even after all nonlinear interactions are restored. No such spectral or perturbation estimate is derived; the manuscript supplies only the decomposition procedure and numerical evidence restricted to one student-teacher RNN architecture.
- [Local theory development (near Eq. for rank-one operator)] The amplification factor of the bifurcation channel is listed among the free parameters in the supporting analysis. If this factor is fitted rather than derived parameter-free from the normal-form linearization, the reduction is not fully analytical and the claimed tractability is weakened.
minor comments (2)
- [Abstract] The abstract refers to 'commonly codimension-1 bifurcations' without enumerating them; a short explicit list (pitchfork, transcritical, Hopf, etc.) would clarify the scope.
- [Preliminaries] Notation for the state-space NTK versus the ordinary NTK is introduced without a dedicated comparison table; readers would benefit from one.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. We address each major comment below with clarifications on the derivations and the supporting numerical evidence.
Point-by-point responses
Referee: [Decomposition procedure and §4 (numerical illustration)] The headline claim that the bifurcation channel dominates the sNTK (and therefore controls learning) requires a bound showing that the residual operator norm remains o(1) relative to the amplified rank-one term as the bifurcation parameter approaches criticality, even after all nonlinear interactions are restored. No such spectral or perturbation estimate is derived; the manuscript supplies only the decomposition procedure and numerical evidence restricted to one student-teacher RNN architecture.
Authors: The decomposition procedure yields an exact separation of the empirical sNTK into the rank-one bifurcation channel (obtained from the normal-form linearization) and the residual operator. The amplification of the bifurcation channel is controlled by the distance to criticality in the normal form, which grows without bound as the bifurcation parameter approaches the critical value. While a general spectral bound on the residual (accounting for all nonlinear terms) is not derived, the local theory shows that the residual remains bounded while the bifurcation term diverges, implying dominance in a sufficiently small neighborhood of the bifurcation. The numerical experiments in §4 for the student-teacher RNN confirm that the effective rank collapses and the residual contribution is negligible near the learned bifurcation. In revision we will add explicit operator-norm comparisons in §4 and a remark stating the local regime of validity. revision: partial
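The bounded-residual-versus-diverging-channel argument in this response can be sanity-checked with a scalar stand-in: on the pitchfork branch x* = √μ the critical-mode sensitivity dx*/dμ = 1/(2√μ) diverges as μ → 0⁺, while a generic residual sensitivity is taken as O(1). This is an illustrative toy under those assumptions, not the paper's operator-norm comparison:

```python
import numpy as np

# Toy dominance check: amplified bifurcation channel vs. bounded residual.
# critical = dx*/dmu on the pitchfork branch x* = sqrt(mu); residual is an
# assumed O(1) stand-in for the residual channel's sensitivity.
for mu in [1e-1, 1e-2, 1e-3, 1e-4]:
    critical = 1.0 / (2.0 * np.sqrt(mu))   # diverges as mu -> 0+
    residual = 1.0                          # bounded (assumption)
    print(f"mu={mu:.0e}  dominance ratio = {critical / residual:.1f}")
```

The ratio grows without bound as the bifurcation parameter approaches criticality, which is the mechanism the authors invoke for dominance in a sufficiently small neighborhood.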
Referee: [Local theory development (near Eq. for rank-one operator)] The amplification factor of the bifurcation channel is listed among the free parameters in the supporting analysis. If this factor is fitted rather than derived parameter-free from the normal-form linearization, the reduction is not fully analytical and the claimed tractability is weakened.
Authors: The amplification factor is derived directly and parameter-free from the normal-form linearization. For the codimension-1 pitchfork case it equals the reciprocal of the unfolding parameter (or equivalently the real part of the critical eigenvalue of the Jacobian evaluated at the bifurcation point). This quantity is computed from the system dynamics without any fitting. We will revise the text surrounding the rank-one operator equation to display this explicit derivation from the linearization, making the parameter-free character clear. revision: yes
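The parameter-free character claimed here follows from a standard implicit-function-theorem computation on the pitchfork normal form (a sketch under textbook conventions; the paper's normalization may differ):

```latex
% Pitchfork normal form: \dot{x} = F(x,\mu) = \mu x - x^{3}.
% Nontrivial branch (\mu > 0): x^{*} = \pm\sqrt{\mu}, with critical eigenvalue
% \lambda(\mu) = \partial_x F(x^{*},\mu) = \mu - 3\mu = -2\mu.
% Implicit function theorem for the branch sensitivity (taking x^{*}=+\sqrt{\mu}):
\[
  \frac{dx^{*}}{d\mu}
  = -\,\frac{\partial_{\mu} F(x^{*},\mu)}{\partial_{x} F(x^{*},\mu)}
  = -\,\frac{x^{*}}{\lambda(\mu)}
  = \frac{\sqrt{\mu}}{2\mu}
  = \frac{1}{2\sqrt{\mu}} .
\]
% The sensitivity along the critical mode thus scales as the reciprocal of the
% distance to criticality (equivalently, of |\lambda(\mu)| up to a factor of 2),
% with no fitted constants — consistent with the rebuttal's claim.
```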
Circularity Check
No significant circularity; local reduction derived from normal-form linearization
Full rationale
The paper's central derivation proceeds by decomposing the empirical state-space NTK into bifurcation-relevant and residual channels using the local linearization around codimension-1 bifurcations and the structure of classical normal forms. The rank-one character and amplification of the relevant channel follow directly from the normal-form Jacobian and the projection onto the critical mode, without any parameter fitting that is then re-labeled as a prediction. Numerical results on the student-teacher RNN are presented as illustration and validation of the analytic reduction rather than as the source of the rank-one claim itself. No self-citations, ansatzes, or uniqueness theorems imported from prior author work are invoked as load-bearing steps in the provided chain, and the residual-channel suppression is treated as a consequence of the local theory rather than an input. The derivation therefore remains self-contained against the normal-form assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- amplification factor of bifurcation channel
axioms (2)
- domain assumption: Local linearization of the RNN dynamics around the bifurcation point yields a valid empirical NTK
- standard math: Codimension-1 bifurcations admit standard normal forms whose learning geometry is analytically tractable
invented entities (1)
- bifurcation-relevant channel of sNTK (no independent evidence)