pith · machine review for the scientific record

arXiv: 2605.08746 · v1 · submitted 2026-05-09 · 💻 cs.LG · math.DS · math.OC

Recognition: 2 theorem links · Lean Theorem

The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 02:47 UTC · model grok-4.3

classification 💻 cs.LG · math.DS · math.OC
keywords neural tangent kernel · gradient descent · Kronecker core · self-referential bias · low-rank representations · recurrent neural networks · transformers · implicit constraints

The pith

The global empirical NTK decomposes into a Kronecker-core Gram matrix times a state-dependency operator, imposing a low-rank bottleneck on gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to characterize the structure of the global empirical neural tangent kernel that controls first-order updates during gradient descent in neural networks. By expressing the model state as the solution to one global implicit constraint, the NTK factors exactly into an immediate parameter-to-state operator K and an internal state-to-state operator P. For weight-based models including RNNs and transformers, a universal Kronecker-core theorem shows that K equals the Gram matrix of weight-site variables and is therefore exactly computable. This factorization reveals that the NTK is structurally bottlenecked with limited effective rank, which produces a self-referential bias directing learning toward the dominant modes of combined input and hidden activity. Readers care because the result accounts for the emergence of low-rank features and the selective difficulty of learning certain task elements right from initialization.

Core claim

Formulating the model state as the solution to a single global implicit constraint yields the global empirical NTK as the product of operators K and P. For RNNs, transformers, and other weight-based models, a universal Kronecker-core theorem establishes that K is exactly the Gram matrix of the weight-site variables. The resulting structure shows the NTK is bottlenecked in rank, producing a self-referential bias that directs gradient descent toward the principal modes of joint hidden and input activity. The spectrum of the NTK in recurrent models is accordingly biased and low-rank in space or time, and initialization dynamics further restrict the learnable subspace. The same low-rank constraint applies to the NTK of a self-attention transformer.

What carries the argument

The universal Kronecker-core theorem, which shows that the immediate parameter-to-state operator K equals the exact Gram matrix of weight-site variables.

Load-bearing premise

The model state can be expressed as the solution to a single global implicit constraint that lets the NTK factor exactly into operators K and P.
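
A worked sketch of that premise, in notation of our own choosing (the symbols below are assumptions, not the paper's), shows where the factorization comes from:

```latex
% Sketch only: our notation, which need not match the paper's. Collect every
% state of the unrolled computation into one vector h and write the whole
% forward pass as a single constraint F(h, \theta) = 0; for the recurrent
% example of Figure 2,
%   F_t(h, \theta) = h_t - f\big(h_{t-1},\, W_{\mathrm{rec}} h_{t-1},\, W_{\mathrm{in}} x_t\big) = 0
% for every step t, which the ordinary forward pass satisfies by construction.
% Implicit differentiation then gives the state Jacobian and the state NTK:
\[
  \partial_h F \,\mathrm{d}h + \partial_\theta F \,\mathrm{d}\theta = 0
  \;\Longrightarrow\;
  J = \frac{\partial h}{\partial \theta} = -(\partial_h F)^{-1}\,\partial_\theta F ,
  \qquad
  \mathrm{NTK}_S = J J^{*} .
\]
% The paper's factorization regroups this as a state-dependency operator P
% acting on either side of the weight-site Gram core K (cf. Figures 1 and 3):
\[
  \mathrm{NTK}_S = P\, K\, P^{*},
  \qquad
  K = V V^{*} \otimes I_n ,
  \quad V = \mathrm{cat}(H, X),
\]
% so the effective rank of the NTK is capped by \operatorname{rank}(V)\cdot n,
% whatever P is.
```

On this reading no fixed-point solver is introduced: the unrolled forward pass solves the constraint by construction, which is also the point made in the simulated rebuttal below.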

What would settle it

Direct numerical computation of the global empirical NTK for a small finite-width RNN or transformer and verification that it does not equal the predicted product of the weight-site Gram matrix and the state-dependency operator would falsify the Kronecker-core claim.
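
A hedged sketch of that check, written against the pattern of Figure 3 rather than the paper's own kpflow code: build a tiny vanilla RNN (our toy, not the paper's GRU), form the global state NTK from the autograd Jacobian, and compare it to the Kronecker core V V^T ⊗ I_n assembled from the concatenated hidden and input activity. The model, sizes, and names (unroll, theta0) are ours; the state-dependency operator P is deliberately left out, so the probe asks only whether the two operators share a basis.

```python
# Sketch, not the paper's code: probe whether the global state NTK of a tiny
# vanilla RNN is aligned with the Kronecker core V V^T (x) I_n (cf. Figure 3).
# Subspace disagreement would be evidence against the Kronecker-core claim;
# exact equality is not the test because P is not modeled here.
import torch

torch.manual_seed(0)
n_h, n_in, T = 8, 3, 12                                  # hidden units, input dim, steps
x = torch.randn(T, n_in)                                 # one input sequence
theta0 = torch.cat([torch.randn(n_h, n_h).reshape(-1) / n_h**0.5,
                    torch.randn(n_h, n_in).reshape(-1) / n_in**0.5])

def unroll(theta):
    """h_{t+1} = tanh(W_rec h_t + W_in x_{t+1}); returns the flattened trajectory."""
    Wr = theta[: n_h * n_h].reshape(n_h, n_h)
    Wi = theta[n_h * n_h:].reshape(n_h, n_in)
    h, states = torch.zeros(n_h), []
    for t in range(T):
        h = torch.tanh(Wr @ h + Wi @ x[t])
        states.append(h)
    return torch.stack(states).reshape(-1)               # shape (T * n_h,)

J = torch.autograd.functional.jacobian(unroll, theta0)   # (T*n_h, n_params)
ntk = J @ J.T                                            # global empirical NTK on states

H = unroll(theta0).reshape(T, n_h)
H_prev = torch.cat([torch.zeros(1, n_h), H[:-1]], dim=0) # activity each weight multiplies
V = torch.cat([H_prev, x], dim=1)                        # (T, n_h + n_in)
core = torch.kron(V @ V.T, torch.eye(n_h))               # Kronecker core

cos = (ntk * core).sum() / (ntk.norm() * core.norm())    # operator cosine similarity
print(f"cos(NTK_S, V V^T ⊗ I_n) = {cos.item():.3f}")
```

A cosine near 1 would echo Figure 3's finding for the GRU on Memory-Pro; a value near 0 on the paper's own models and data would be the kind of falsifying evidence described above.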

Figures

Figures reproduced from arXiv: 2605.08746 by Eli Shlizerman, Eric Shea-Brown, James Hazelden, Laura Driscoll.

Figure 1
Figure 1: Backpropagation of errors for a recurrent model, annotated by the corresponding operators P, K, and NTK_S from Proposition 1. The operator P^* maps the global error signal to state adjoint sensitivity, describing how the loss depends on the state h. Then, these adjoints are projected into and out of the parameter space by K, potentially zeroing or misdirecting them. Finally, the corrections, which modify t… view at source ↗
Figure 2
Figure 2: Schematic of Theorem 1 for Example 4.1, with V = cat(H, X), as is the case for the GRU and RNN. If the hidden units and inputs have low-dimensional activity, then this joint state matrix bottlenecks the full NTK. For the discrete recurrent example, consider the discrete-time system h_{t+1} = f(h_t, W_rec h_t, W_in x_{t+1}), with parameters θ = cat(vec(W_rec), vec(W_in)), where cat forms the direct sum concatenation… view at source ↗
Figure 3
Figure 3: Temporal bottlenecking of the global-state NTK by the Kronecker core V V^T. A Schematic of the Memory-Pro task, in which the model must reproduce a two-dimensional stimulus after a delay period. B Cosine similarity between the core, K = V V^* ⊗ I_n, and the NTK, cos(NTK_S, V V^T ⊗ I_n), over GD training of the GRU model on the task in A, showing that the two operators share a similar common basis. Here, we u… view at source ↗
Figure 4
Figure 4: Self-referential bias can stall SGD on the Memory-Pro task. A-B Outputs before and after training for two GRU initializations. A (Network 1): a default Xavier initialization with weight scale 1, whose hidden dynamics collapse to a single fixed point under zero input during the response period, regardless of the input (“End” in left panel of Figure). B (Network 2): as an illustrative case, an initialization… view at source ↗
Figure 5
Figure 5: Recurrent gain and input rank induce distinct spatial and temporal NTK bottlenecks. A Two variants of a student-teacher task. For both, we begin with a vanilla RNN teacher with fixed weights W^*, W^*_in, W^*_out, all drawn from Xavier normal initialization. In the top task, we freeze the student to have identical weights other than W, which is initialized randomly with gain g and trained with GD. The trainin… view at source ↗
Figure 6
Figure 6: Rank bottlenecking by input dimension in a self-attention model. A Self-attention architecture with time-varying inputs X ∈ R^{n_x × n_t × n_in}. As in the main text, the weight-site core of the NTK consists of the concatenated input activity and attention matrix, V = cat(X, A). See Appendix A.2.3 for model and input details. B Because A lies in the same temporal span as X, the temporal rank of the NTK is bottlene… view at source ↗
Figure 7
Figure 7: Examples of partial reductions of an operator acting on a 3-tensor domain, R^{B×T×H}. The operator Φ acts on a domain R^{B×T×H}, representing batch, time, and hidden-unit axes. Each reduction averages over one or more axes. The reduced operators at the bottom are simple B × B, T × T, and H × H matrices, respectively. These capture how the operator varies across batch inputs, hidden units, or timesteps after … view at source ↗
Figure 8
Figure 8: Dynamics of an RNN with added non-trivial fixed points (NTFPs). Random weights are sampled with Xavier initialization, scaled by gain g, with n = 256 and g varied between 1 and 2. Non-trivial fixed points are then added to the model as described in the text. Plotted trajectories correspond to distinct random initial conditions and gain values. The random seed is fixed between trials and the dynamics are pr… view at source ↗
Figure 9
Figure 9: Temporal, spatial, and overall NTK effective rank for a two-dimensional sweep over input dimension and attention dimension. The temporal rank grows primarily with input dimension, while the spatial and overall rank increase strongly with attention width. view at source ↗
Figure 10
Figure 10: Temporal NTK rank versus input dimension for a single-block self-attention toy model with varying head count. Increasing the number of attention heads has only a modest effect on temporal rank, so the same basic temporal bottleneck from the input representation remains. … view at source ↗
read the original abstract

In training a neural network with gradient descent (GD), each iteration induces a linear operator that governs first-order updates to a model's internal state variables. We define this operator as the Global Empirical Neural Tangent Kernel (NTK). In finite-width networks, the NTK is typically intractable to form, leading prior work to focus on restrictive settings such as tracking outputs only or taking infinite-width limits. Here, we study the structure of the NTK for a range of models. Formulating the model state as the solution to a single global implicit constraint, we derive the NTK as a product of two operators: K, accounting for immediate parameter-to-state interactions, and P, describing internal state-to-state dependencies. For a broad class of weight-based models, including RNNs and transformers, we prove a universal Kronecker-core theorem showing that K admits an exact, computable form given by the Gram matrix of weight-site variables. This core structure reveals that the NTK is structurally bottlenecked, constraining its effective rank and giving rise to a self-referential bias whereby GD preferentially learns within dominant modes of joint hidden and input activity. For recurrent models, we examine the spectrum of the NTK and show when it is biased and low-rank in space or time under the proposed decomposition. We further demonstrate that model dynamics at initialization bias the NTK, restricting learning and preventing task components from being learned effectively. Finally, we show that the NTK associated with a self-attention transformer is likewise structurally constrained to be low-rank. Overall, we show that the NTK possesses tractable structure that explains GD bias toward task solutions and the emergence of low-rank representations. To enable use of the NTK as a practical metric, we build kpflow, a library relying on randomized matrix-free numerical linear algebra.
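
The abstract's closing sentence is the one computational claim: the NTK can be probed matrix-free. kpflow's actual API is not documented here, so the following is a generic sketch of the idea under standard PyTorch autograd, with names of our own (ntk_matvec, effective_rank): apply the state NTK as v ↦ J(Jᵀv) via one VJP and one JVP, and estimate its trace and squared Frobenius norm with random probes to obtain a participation-ratio proxy for effective rank.

```python
# Generic matrix-free sketch (not kpflow's API): the empirical NTK of a map
# f(theta) -> states acts as v -> J J^T v, which needs only one VJP and one
# JVP per application. Random probes then estimate tr(NTK) and tr(NTK^2),
# giving the participation ratio tr(NTK)^2 / tr(NTK^2) as an effective rank.
import torch
from torch.autograd.functional import jvp, vjp

def ntk_matvec(f, theta, v):
    """Apply the empirical NTK of f at theta to a state-space vector v."""
    _, jt_v = vjp(f, theta, v)       # J^T v  : state space -> parameter space
    _, jjt_v = jvp(f, theta, jt_v)   # J J^T v: back to state space
    return jjt_v

def effective_rank(f, theta, n_probe=32):
    """Hutchinson-style participation-ratio estimate of the NTK's effective rank."""
    dim = f(theta).numel()
    tr, tr2 = 0.0, 0.0
    for _ in range(n_probe):
        v = torch.randn(dim)
        av = ntk_matvec(f, theta, v)
        tr += torch.dot(v, av) / n_probe     # E[v^T A v]   -> tr(A)
        tr2 += torch.dot(av, av) / n_probe   # E[|A v|^2]   -> tr(A^2)
    return (tr ** 2 / tr2).item()

# e.g. effective_rank(unroll, theta0) with the toy RNN from the earlier sketch.
```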

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript defines the Global Empirical Neural Tangent Kernel (NTK) as the linear operator induced by each gradient-descent iteration on a model's internal state variables. Formulating the model state as the exact solution to a single global implicit constraint, it derives the NTK as the product of operators K (immediate parameter-to-state interactions) and P (internal state-to-state dependencies). For weight-based models including RNNs and transformers, it proves a universal Kronecker-core theorem asserting that K equals the Gram matrix of weight-site variables, implying a structural rank bottleneck, self-referential bias in which GD preferentially learns dominant modes of joint hidden-input activity, spectral properties for recurrent models, initialization-induced restrictions on learnable task components, and low-rank structure for self-attention transformers. A randomized matrix-free library (kpflow) is provided to make the NTK practical.

Significance. If the derivations and the global-constraint formulation hold exactly for the claimed architectures, the work supplies a tractable structural account of finite-width NTK behavior that explains GD biases and the emergence of low-rank representations without relying on infinite-width limits. The Kronecker-core result and the accompanying computational library constitute concrete, usable contributions that could inform analysis of training dynamics across recurrent and attention-based models.

major comments (3)
  1. [Abstract and derivation of NTK = K P] The derivation of NTK = K P and the exact Kronecker-core theorem (K as Gram matrix of weight-site variables) rests on the model state being formulated as the exact solution to one global implicit constraint. For RNNs the recurrence unfolds over explicit time steps and for transformers self-attention and layer norms are computed sequentially; the manuscript must clarify whether this constraint holds exactly or requires additional fixed-point assumptions not stated for the broad class, because any gap directly undermines the claimed exact computable form, rank bottleneck, and self-referential bias.
  2. [Kronecker-core theorem and bias discussion] The self-referential bias claim (GD preferentially learns within dominant modes of joint hidden and input activity) is presented as a direct consequence of the low-rank structure induced by the Kronecker core. The manuscript should supply the explicit spectral decomposition or mode-identification step that converts the Gram-matrix form of K into this preferential-learning statement, because the bias is load-bearing for the paper's explanation of GD behavior.
  3. [Spectrum and transformer sections] The spectral analysis for recurrent models and the low-rank demonstration for transformers are asserted to follow from the K-P decomposition. The manuscript must state the precise assumptions on the weight-site variables and the P operator that guarantee the reported rank bounds and bias in space/time, because these results are used to support the universal applicability of the theorem.
minor comments (2)
  1. [Introduction] Notation for the operators K and P is introduced without an early summary table relating them to standard NTK components; a brief comparison would aid readability.
  2. [Final section] The kpflow library is mentioned as enabling practical use, but no pseudocode or complexity statement for the randomized matrix-free routines appears in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses
  1. Referee: [Abstract and derivation of NTK = K P] The derivation of NTK = K P and the exact Kronecker-core theorem (K as Gram matrix of weight-site variables) rests on the model state being formulated as the exact solution to one global implicit constraint. For RNNs the recurrence unfolds over explicit time steps and for transformers self-attention and layer norms are computed sequentially; the manuscript must clarify whether this constraint holds exactly or requires additional fixed-point assumptions not stated for the broad class, because any gap directly undermines the claimed exact computable form, rank bottleneck, and self-referential bias.

    Authors: The global implicit constraint is defined directly as the equation satisfied by the full model state vector after the complete forward pass, with the computation graph (including unfolded recurrence or sequential layers) serving as the exact map from parameters to states. For RNNs this means the state equations at every time step are satisfied simultaneously by the unfolded trajectory; for transformers the layer-wise equations (including self-attention and norms) are satisfied by the sequential computation. No iterative fixed-point solver or extra assumptions are introduced beyond the standard forward pass, which solves the constraint by construction. We will add a short clarifying subsection in the methods that states this equivalence explicitly for the architectures considered. revision: yes

  2. Referee: [Kronecker-core theorem and bias discussion] The self-referential bias claim (GD preferentially learns within dominant modes of joint hidden and input activity) is presented as a direct consequence of the low-rank structure induced by the Kronecker core. The manuscript should supply the explicit spectral decomposition or mode-identification step that converts the Gram-matrix form of K into this preferential-learning statement, because the bias is load-bearing for the paper's explanation of GD behavior.

    Authors: Under the Kronecker-core theorem, K is the Gram matrix G = V^TV where the columns of V are the weight-site variables (concatenated input and hidden activations at each parameter location). The eigendecomposition G = U Lambda U^T therefore has eigenvectors U that are precisely the principal components of these joint activity vectors. Because the NTK is the composition KP, its action projects parameter updates onto the dominant subspace spanned by these modes, yielding the stated self-referential bias. We will insert the explicit decomposition together with the mode-identification argument immediately after the theorem statement and reference it in the bias discussion. revision: yes

  3. Referee: [Spectrum and transformer sections] The spectral analysis for recurrent models and the low-rank demonstration for transformers are asserted to follow from the K-P decomposition. The manuscript must state the precise assumptions on the weight-site variables and the P operator that guarantee the reported rank bounds and bias in space/time, because these results are used to support the universal applicability of the theorem.

    Authors: The weight-site variables are finite-dimensional vectors in R^{d_in + d_hidden} formed by concatenating the input and hidden activations at each weight. The operator P is the Jacobian of the internal state-to-state map and is taken to be full rank (or invertible) in the generic case; for recurrent models the spectrum is analyzed on the time-unfolded product of per-step Jacobians. For transformers the self-attention weights define the sites and the low-rank bound follows when the attention Gram is rank-deficient. We will add an explicit paragraph listing these assumptions immediately before the spectral and transformer results, together with a brief note on where the bounds continue to hold under relaxed conditions. revision: yes
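
A small numerical illustration of the mode-identification and rank arguments in responses 2 and 3 above (our construction, not an experiment from the paper): when the joint hidden/input activity matrix V is low-rank, the Gram core V Vᵀ annihilates any error component outside the span of V's principal modes, so first-order corrections can only move the state along those dominant directions, and the core's rank caps the NTK's rank whatever P does.

```python
# Toy illustration (assumed construction, not from the paper): a rank-r joint
# activity matrix V gives a Gram core G = V V^T with exactly r nonzero modes;
# any error signal pushed through G loses its component outside those modes.
import torch

torch.manual_seed(0)
T, d, r = 50, 20, 3                          # time steps, joint activity dim, true rank
V = torch.randn(T, r) @ torch.randn(r, d)    # low-rank joint hidden/input activity
G = V @ V.T                                  # Gram core on the temporal axis

evals, evecs = torch.linalg.eigh(G)          # eigenvalues in ascending order
print("nonzero modes of G:", int((evals > 1e-6 * evals.max()).sum()))        # -> r

e = torch.randn(T)                           # an arbitrary temporal error signal
g_e = G @ e                                  # what the core lets through
top = evecs[:, -r:]                          # dominant activity modes
outside = g_e - top @ (top.T @ g_e)
print("component outside the top-r modes:", f"{outside.norm().item():.2e}")  # ~ 0
```

Under the factorization sketched earlier, the same counting gives the rank bound rank(NTK_S) ≤ rank(V)·n, so raising the NTK's effective rank requires raising the dimensionality of the joint activity itself.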

Circularity Check

0 steps flagged

No significant circularity; the decomposition follows from an explicitly stated assumption rather than presupposing its own conclusion.

full rationale

The paper defines the Global Empirical NTK as the linear operator governing first-order GD updates to internal state variables. It then states the modeling choice 'Formulating the model state as the solution to a single global implicit constraint' and derives the decomposition NTK = K P together with the Kronecker-core theorem that K equals the Gram matrix of weight-site variables. This is a direct consequence of the stated formulation rather than a self-referential loop in which the result is presupposed. No parameters are fitted on data and then relabeled as predictions, no uniqueness theorems are imported via self-citation, and no ansatz is smuggled through prior work. The claims of rank bottleneck and self-referential bias are logical corollaries of the derived operator structure for the claimed model class. The derivation chain is therefore self-contained under its own premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claims rest on the new definition of the global empirical NTK and the proof of the Kronecker-core theorem; these introduce operators K and P whose independence from prior fitted quantities cannot be confirmed from the abstract alone.

axioms (1)
  • domain assumption Model state is the solution to a single global implicit constraint
    Invoked to derive the NTK as the product of K and P operators.
invented entities (3)
  • Global Empirical NTK no independent evidence
    purpose: Linear operator that governs first-order updates to the model's internal state variables under gradient descent
    Newly defined to extend beyond output-only or infinite-width NTK analyses.
  • K operator no independent evidence
    purpose: Accounts for immediate parameter-to-state interactions
    Component of the NTK decomposition derived from the implicit constraint.
  • P operator no independent evidence
    purpose: Describes internal state-to-state dependencies
    Component of the NTK decomposition derived from the implicit constraint.

pith-pipeline@v0.9.0 · 5646 in / 1532 out tokens · 82472 ms · 2026-05-12T02:47:25.130590+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches — The paper's claim is directly supported by a theorem in the formal canon.
  • supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses — The paper appears to rely on the theorem as machinery.
  • contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · 2 internal anchors
