pith · machine review for the scientific record

arXiv: 2605.08746 · v1 · submitted 2026-05-09 · 💻 cs.LG · math.DS · math.OC

Recognition: 2 theorem links · Lean Theorem

The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 02:47 UTC · model grok-4.3

classification 💻 cs.LG · math.DS · math.OC
keywords neural tangent kernel · gradient descent · Kronecker core · self-referential bias · low-rank representations · recurrent neural networks · transformers · implicit constraints

The pith

The global empirical NTK decomposes into a Kronecker-core Gram matrix times a state-dependency operator, imposing a low-rank bottleneck on gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to characterize the structure of the global empirical neural tangent kernel that controls first-order updates during gradient descent in neural networks. By expressing the model state as the solution to one global implicit constraint, the NTK factors exactly into an immediate parameter-to-state operator K and an internal state-to-state operator P. For weight-based models including RNNs and transformers, a universal Kronecker-core theorem shows that K equals the Gram matrix of weight-site variables and is therefore exactly computable. This factorization reveals that the NTK is structurally bottlenecked with limited effective rank, which produces a self-referential bias directing learning toward the dominant modes of combined input and hidden activity. Readers care because the result accounts for the emergence of low-rank features and the selective difficulty of learning certain task elements right from initialization.

Core claim

Formulating the model state as the solution to a single global implicit constraint yields the global empirical NTK as the product of operators K and P. For RNNs, transformers, and other weight-based models, a universal Kronecker-core theorem establishes that K is exactly the Gram matrix of the weight-site variables. The resulting structure shows the NTK is bottlenecked in rank, producing a self-referential bias that directs gradient descent toward the principal modes of joint hidden and input activity. The spectrum of the NTK in recurrent models is accordingly biased and low-rank in space or time, and initialization dynamics further restrict the learnable subspace. The same low-rank constraint applies to the NTK of a self-attention transformer.

What carries the argument

The universal Kronecker-core theorem, which shows that the immediate parameter-to-state operator K equals the exact Gram matrix of weight-site variables.

Load-bearing premise

The model state can be expressed as the solution to a single global implicit constraint that lets the NTK factor exactly into operators K and P.
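
A worked sketch of that premise, in notation of our own choosing (the symbols below are assumptions, not the paper's), shows where the factorization comes from:

```latex
% Sketch only: our notation, which need not match the paper's. Collect every
% state of the unrolled computation into one vector h and write the whole
% forward pass as a single constraint F(h, \theta) = 0; for the recurrent
% example of Figure 2,
%   F_t(h, \theta) = h_t - f\big(h_{t-1},\, W_{\mathrm{rec}} h_{t-1},\, W_{\mathrm{in}} x_t\big) = 0
% for every step t, which the ordinary forward pass satisfies by construction.
% Implicit differentiation then gives the state Jacobian and the state NTK:
\[
  \partial_h F \,\mathrm{d}h + \partial_\theta F \,\mathrm{d}\theta = 0
  \;\Longrightarrow\;
  J = \frac{\partial h}{\partial \theta} = -(\partial_h F)^{-1}\,\partial_\theta F ,
  \qquad
  \mathrm{NTK}_S = J J^{*} .
\]
% The paper's factorization regroups this as a state-dependency operator P
% acting on either side of the weight-site Gram core K (cf. Figures 1 and 3):
\[
  \mathrm{NTK}_S = P\, K\, P^{*},
  \qquad
  K = V V^{*} \otimes I_n ,
  \quad V = \mathrm{cat}(H, X),
\]
% so the effective rank of the NTK is capped by \operatorname{rank}(V)\cdot n,
% whatever P is.
```

On this reading no fixed-point solver is introduced: the unrolled forward pass solves the constraint by construction, which is also the point made in the simulated rebuttal below.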

What would settle it

Direct numerical computation of the global empirical NTK for a small finite-width RNN or transformer and verification that it does not equal the predicted product of the weight-site Gram matrix and the state-dependency operator would falsify the Kronecker-core claim.
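
A hedged sketch of that check, written against the pattern of Figure 3 rather than the paper's own kpflow code: build a tiny vanilla RNN (our toy, not the paper's GRU), form the global state NTK from the autograd Jacobian, and compare it to the Kronecker core V V^T ⊗ I_n assembled from the concatenated hidden and input activity. The model, sizes, and names (unroll, theta0) are ours; the state-dependency operator P is deliberately left out, so the probe asks only whether the two operators share a basis.

```python
# Sketch, not the paper's code: probe whether the global state NTK of a tiny
# vanilla RNN is aligned with the Kronecker core V V^T (x) I_n (cf. Figure 3).
# Subspace disagreement would be evidence against the Kronecker-core claim;
# exact equality is not the test because P is not modeled here.
import torch

torch.manual_seed(0)
n_h, n_in, T = 8, 3, 12                                  # hidden units, input dim, steps
x = torch.randn(T, n_in)                                 # one input sequence
theta0 = torch.cat([torch.randn(n_h, n_h).reshape(-1) / n_h**0.5,
                    torch.randn(n_h, n_in).reshape(-1) / n_in**0.5])

def unroll(theta):
    """h_{t+1} = tanh(W_rec h_t + W_in x_{t+1}); returns the flattened trajectory."""
    Wr = theta[: n_h * n_h].reshape(n_h, n_h)
    Wi = theta[n_h * n_h:].reshape(n_h, n_in)
    h, states = torch.zeros(n_h), []
    for t in range(T):
        h = torch.tanh(Wr @ h + Wi @ x[t])
        states.append(h)
    return torch.stack(states).reshape(-1)               # shape (T * n_h,)

J = torch.autograd.functional.jacobian(unroll, theta0)   # (T*n_h, n_params)
ntk = J @ J.T                                            # global empirical NTK on states

H = unroll(theta0).reshape(T, n_h)
H_prev = torch.cat([torch.zeros(1, n_h), H[:-1]], dim=0) # activity each weight multiplies
V = torch.cat([H_prev, x], dim=1)                        # (T, n_h + n_in)
core = torch.kron(V @ V.T, torch.eye(n_h))               # Kronecker core

cos = (ntk * core).sum() / (ntk.norm() * core.norm())    # operator cosine similarity
print(f"cos(NTK_S, V V^T ⊗ I_n) = {cos.item():.3f}")
```

A cosine near 1 would echo Figure 3's finding for the GRU on Memory-Pro; a value near 0 on the paper's own models and data would be the kind of falsifying evidence described above.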

Figures

Figures reproduced from arXiv: 2605.08746 by Eli Shlizerman, Eric Shea-Brown, James Hazelden, Laura Driscoll.

Figure 1
Figure 1: Backpropagation of errors for a recurrent model, annotated by the corresponding operators P, K, and NTK_S from Proposition 1. The operator P^* maps the global error signal to state adjoint sensitivity, describing how the loss depends on the state h. Then, these adjoints are projected into and out of the parameter space by K, potentially zeroing or misdirecting them. Finally, the corrections, which modify t… view at source ↗
Figure 2
Figure 2: Schematic of Theorem 1 for Example 4.1, with V = cat(H, X), as is the case for the GRU and RNN. If the hidden units and inputs have low-dimensional activity, then this joint state matrix bottlenecks the full NTK. For the discrete recurrent example, consider the discrete-time system h_{t+1} = f(h_t, W_rec h_t, W_in x_{t+1}), with parameters θ = cat(vec(W_rec), vec(W_in)), where cat forms the direct sum concatenation… view at source ↗
Figure 3
Figure 3: Temporal bottlenecking of the global-state NTK by the Kronecker core V V^T. A Schematic of the Memory-Pro task, in which the model must reproduce a two-dimensional stimulus after a delay period. B Cosine similarity between the core, K = V V^* ⊗ I_n, and the NTK, cos(NTK_S, V V^T ⊗ I_n), over GD training of the GRU model on the task in A, showing that the two operators share a similar common basis. Here, we u… view at source ↗
Figure 4
Figure 4: Self-referential bias can stall SGD on the Memory-Pro task. A-B Outputs before and after training for two GRU initializations. A (Network 1): a default Xavier initialization with weight scale 1, whose hidden dynamics collapse to a single fixed point under zero input during the response period, regardless of the input (“End” in left panel of Figure). B (Network 2): as an illustrative case, an initialization… view at source ↗
Figure 5
Figure 5: Recurrent gain and input rank induce distinct spatial and temporal NTK bottlenecks. A Two variants of a student-teacher task. For both, we begin with a vanilla RNN teacher with fixed weights W^*, W^*_in, W^*_out, all drawn from Xavier normal initialization. In the top task, we freeze the student to have identical weights other than W, which is initialized randomly with gain g and trained with GD. The trainin… view at source ↗
Figure 6
Figure 6: Rank bottlenecking by input dimension in a self-attention model. A Self-attention architecture with time-varying inputs X ∈ R^{n_x × n_t × n_in}. As in the main text, the weight-site core of the NTK consists of the concatenated input activity and attention matrix, V = cat(X, A). See Appendix A.2.3 for model and input details. B Because A lies in the same temporal span as X, the temporal rank of the NTK is bottlene… view at source ↗
Figure 7
Figure 7: Examples of partial reductions of an operator acting on a 3-tensor domain, R^{B×T×H}. The operator Φ acts on a domain R^{B×T×H}, representing batch, time, and hidden-unit axes. Each reduction averages over one or more axes. The reduced operators at the bottom are simple B × B, T × T, and H × H matrices, respectively. These capture how the operator varies across batch inputs, hidden units, or timesteps after … view at source ↗
Figure 8
Figure 8: Dynamics of an RNN with added non-trivial fixed points (NTFPs). Random weights are sampled with Xavier initialization, scaled by gain g, with n = 256 and g varied between 1 and 2. Non-trivial fixed points are then added to the model as described in the text. Plotted trajectories correspond to distinct random initial conditions and gain values. The random seed is fixed between trials and the dynamics are pr… view at source ↗
Figure 9
Figure 9: Temporal, spatial, and overall NTK effective rank for a two-dimensional sweep over input dimension and attention dimension. The temporal rank grows primarily with input dimension, while the spatial and overall rank increase strongly with attention width. view at source ↗
Figure 10
Figure 10: Temporal NTK rank versus input dimension for a single-block self-attention toy model with varying head count. Increasing the number of attention heads has only a modest effect on temporal rank, so the same basic temporal bottleneck from the input representation remains. … view at source ↗
read the original abstract

In training a neural network with gradient descent (GD), each iteration induces a linear operator that governs first-order updates to a model's internal state variables. We define this operator as the Global Empirical Neural Tangent Kernel (NTK). In finite-width networks, the NTK is typically intractable to form, leading prior work to focus on restrictive settings such as tracking outputs only or taking infinite-width limits. Here, we study the structure of the NTK for a range of models. Formulating the model state as the solution to a single global implicit constraint, we derive the NTK as a product of two operators: K, accounting for immediate parameter-to-state interactions, and P, describing internal state-to-state dependencies. For a broad class of weight-based models, including RNNs and transformers, we prove a universal Kronecker-core theorem showing that K admits an exact, computable form given by the Gram matrix of weight-site variables. This core structure reveals that the NTK is structurally bottlenecked, constraining its effective rank and giving rise to a self-referential bias whereby GD preferentially learns within dominant modes of joint hidden and input activity. For recurrent models, we examine the spectrum of the NTK and show when it is biased and low-rank in space or time under the proposed decomposition. We further demonstrate that model dynamics at initialization bias the NTK, restricting learning and preventing task components from being learned effectively. Finally, we show that the NTK associated with a self-attention transformer is likewise structurally constrained to be low-rank. Overall, we show that the NTK possesses tractable structure that explains GD bias toward task solutions and the emergence of low-rank representations. To enable use of the NTK as a practical metric, we build kpflow, a library relying on randomized matrix-free numerical linear algebra.
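
The abstract's closing sentence is the one computational claim: the NTK can be probed matrix-free. kpflow's actual API is not documented here, so the following is a generic sketch of the idea under standard PyTorch autograd, with names of our own (ntk_matvec, effective_rank): apply the state NTK as v ↦ J(Jᵀv) via one VJP and one JVP, and estimate its trace and squared Frobenius norm with random probes to obtain a participation-ratio proxy for effective rank.

```python
# Generic matrix-free sketch (not kpflow's API): the empirical NTK of a map
# f(theta) -> states acts as v -> J J^T v, which needs only one VJP and one
# JVP per application. Random probes then estimate tr(NTK) and tr(NTK^2),
# giving the participation ratio tr(NTK)^2 / tr(NTK^2) as an effective rank.
import torch
from torch.autograd.functional import jvp, vjp

def ntk_matvec(f, theta, v):
    """Apply the empirical NTK of f at theta to a state-space vector v."""
    _, jt_v = vjp(f, theta, v)       # J^T v  : state space -> parameter space
    _, jjt_v = jvp(f, theta, jt_v)   # J J^T v: back to state space
    return jjt_v

def effective_rank(f, theta, n_probe=32):
    """Hutchinson-style participation-ratio estimate of the NTK's effective rank."""
    dim = f(theta).numel()
    tr, tr2 = 0.0, 0.0
    for _ in range(n_probe):
        v = torch.randn(dim)
        av = ntk_matvec(f, theta, v)
        tr += torch.dot(v, av) / n_probe     # E[v^T A v]   -> tr(A)
        tr2 += torch.dot(av, av) / n_probe   # E[|A v|^2]   -> tr(A^2)
    return (tr ** 2 / tr2).item()

# e.g. effective_rank(unroll, theta0) with the toy RNN from the earlier sketch.
```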

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript defines the Global Empirical Neural Tangent Kernel (NTK) as the linear operator induced by each gradient-descent iteration on a model's internal state variables. Formulating the model state as the exact solution to a single global implicit constraint, it derives the NTK as the product of operators K (immediate parameter-to-state interactions) and P (internal state-to-state dependencies). For weight-based models including RNNs and transformers, it proves a universal Kronecker-core theorem asserting that K equals the Gram matrix of weight-site variables, implying a structural rank bottleneck, self-referential bias in which GD preferentially learns dominant modes of joint hidden-input activity, spectral properties for recurrent models, initialization-induced restrictions on learnable task components, and low-rank structure for self-attention transformers. A randomized matrix-free library (kpflow) is provided to make the NTK practical.

Significance. If the derivations and the global-constraint formulation hold exactly for the claimed architectures, the work supplies a tractable structural account of finite-width NTK behavior that explains GD biases and the emergence of low-rank representations without relying on infinite-width limits. The Kronecker-core result and the accompanying computational library constitute concrete, usable contributions that could inform analysis of training dynamics across recurrent and attention-based models.

major comments (3)
  1. [Abstract and derivation of NTK = K P] The derivation of NTK = K P and the exact Kronecker-core theorem (K as Gram matrix of weight-site variables) rests on the model state being formulated as the exact solution to one global implicit constraint. For RNNs the recurrence unfolds over explicit time steps and for transformers self-attention and layer norms are computed sequentially; the manuscript must clarify whether this constraint holds exactly or requires additional fixed-point assumptions not stated for the broad class, because any gap directly undermines the claimed exact computable form, rank bottleneck, and self-referential bias.
  2. [Kronecker-core theorem and bias discussion] The self-referential bias claim (GD preferentially learns within dominant modes of joint hidden and input activity) is presented as a direct consequence of the low-rank structure induced by the Kronecker core. The manuscript should supply the explicit spectral decomposition or mode-identification step that converts the Gram-matrix form of K into this preferential-learning statement, because the bias is load-bearing for the paper's explanation of GD behavior.
  3. [Spectrum and transformer sections] The spectral analysis for recurrent models and the low-rank demonstration for transformers are asserted to follow from the K-P decomposition. The manuscript must state the precise assumptions on the weight-site variables and the P operator that guarantee the reported rank bounds and bias in space/time, because these results are used to support the universal applicability of the theorem.
minor comments (2)
  1. [Introduction] Notation for the operators K and P is introduced without an early summary table relating them to standard NTK components; a brief comparison would aid readability.
  2. [Final section] The kpflow library is mentioned as enabling practical use, but no pseudocode or complexity statement for the randomized matrix-free routines appears in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses
  1. Referee: [Abstract and derivation of NTK = K P] The derivation of NTK = K P and the exact Kronecker-core theorem (K as Gram matrix of weight-site variables) rests on the model state being formulated as the exact solution to one global implicit constraint. For RNNs the recurrence unfolds over explicit time steps and for transformers self-attention and layer norms are computed sequentially; the manuscript must clarify whether this constraint holds exactly or requires additional fixed-point assumptions not stated for the broad class, because any gap directly undermines the claimed exact computable form, rank bottleneck, and self-referential bias.

    Authors: The global implicit constraint is defined directly as the equation satisfied by the full model state vector after the complete forward pass, with the computation graph (including unfolded recurrence or sequential layers) serving as the exact map from parameters to states. For RNNs this means the state equations at every time step are satisfied simultaneously by the unfolded trajectory; for transformers the layer-wise equations (including self-attention and norms) are satisfied by the sequential computation. No iterative fixed-point solver or extra assumptions are introduced beyond the standard forward pass, which solves the constraint by construction. We will add a short clarifying subsection in the methods that states this equivalence explicitly for the architectures considered. revision: yes

  2. Referee: [Kronecker-core theorem and bias discussion] The self-referential bias claim (GD preferentially learns within dominant modes of joint hidden and input activity) is presented as a direct consequence of the low-rank structure induced by the Kronecker core. The manuscript should supply the explicit spectral decomposition or mode-identification step that converts the Gram-matrix form of K into this preferential-learning statement, because the bias is load-bearing for the paper's explanation of GD behavior.

    Authors: Under the Kronecker-core theorem, K is the Gram matrix G = V^TV where the columns of V are the weight-site variables (concatenated input and hidden activations at each parameter location). The eigendecomposition G = U Lambda U^T therefore has eigenvectors U that are precisely the principal components of these joint activity vectors. Because the NTK is the composition KP, its action projects parameter updates onto the dominant subspace spanned by these modes, yielding the stated self-referential bias. We will insert the explicit decomposition together with the mode-identification argument immediately after the theorem statement and reference it in the bias discussion. revision: yes

  3. Referee: [Spectrum and transformer sections] The spectral analysis for recurrent models and the low-rank demonstration for transformers are asserted to follow from the K-P decomposition. The manuscript must state the precise assumptions on the weight-site variables and the P operator that guarantee the reported rank bounds and bias in space/time, because these results are used to support the universal applicability of the theorem.

    Authors: The weight-site variables are finite-dimensional vectors in R^{d_in + d_hidden} formed by concatenating the input and hidden activations at each weight. The operator P is the Jacobian of the internal state-to-state map and is taken to be full rank (or invertible) in the generic case; for recurrent models the spectrum is analyzed on the time-unfolded product of per-step Jacobians. For transformers the self-attention weights define the sites and the low-rank bound follows when the attention Gram is rank-deficient. We will add an explicit paragraph listing these assumptions immediately before the spectral and transformer results, together with a brief note on where the bounds continue to hold under relaxed conditions. revision: yes
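
A small numerical illustration of the mode-identification and rank arguments in responses 2 and 3 above (our construction, not an experiment from the paper): when the joint hidden/input activity matrix V is low-rank, the Gram core V Vᵀ annihilates any error component outside the span of V's principal modes, so first-order corrections can only move the state along those dominant directions, and the core's rank caps the NTK's rank whatever P does.

```python
# Toy illustration (assumed construction, not from the paper): a rank-r joint
# activity matrix V gives a Gram core G = V V^T with exactly r nonzero modes;
# any error signal pushed through G loses its component outside those modes.
import torch

torch.manual_seed(0)
T, d, r = 50, 20, 3                          # time steps, joint activity dim, true rank
V = torch.randn(T, r) @ torch.randn(r, d)    # low-rank joint hidden/input activity
G = V @ V.T                                  # Gram core on the temporal axis

evals, evecs = torch.linalg.eigh(G)          # eigenvalues in ascending order
print("nonzero modes of G:", int((evals > 1e-6 * evals.max()).sum()))        # -> r

e = torch.randn(T)                           # an arbitrary temporal error signal
g_e = G @ e                                  # what the core lets through
top = evecs[:, -r:]                          # dominant activity modes
outside = g_e - top @ (top.T @ g_e)
print("component outside the top-r modes:", f"{outside.norm().item():.2e}")  # ~ 0
```

Under the factorization sketched earlier, the same counting gives the rank bound rank(NTK_S) ≤ rank(V)·n, so raising the NTK's effective rank requires raising the dimensionality of the joint activity itself.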

Circularity Check

0 steps flagged

No significant circularity; the decomposition follows from an explicitly stated assumption rather than presupposing its own conclusion.

full rationale

The paper defines the Global Empirical NTK as the linear operator governing first-order GD updates to internal state variables. It then states the modeling choice 'Formulating the model state as the solution to a single global implicit constraint' and derives the decomposition NTK = K P together with the Kronecker-core theorem that K equals the Gram matrix of weight-site variables. This is a direct consequence of the stated formulation rather than a self-referential loop in which the result is presupposed. No parameters are fitted on data and then relabeled as predictions, no uniqueness theorems are imported via self-citation, and no ansatz is smuggled through prior work. The claims of rank bottleneck and self-referential bias are logical corollaries of the derived operator structure for the claimed model class. The derivation chain is therefore self-contained under its own premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claims rest on the new definition of the global empirical NTK and the proof of the Kronecker-core theorem; these introduce operators K and P whose independence from prior fitted quantities cannot be confirmed from the abstract alone.

axioms (1)
  • domain assumption Model state is the solution to a single global implicit constraint
    Invoked to derive the NTK as the product of K and P operators.
invented entities (3)
  • Global Empirical NTK no independent evidence
    purpose: Linear operator that governs first-order updates to the model's internal state variables under gradient descent
    Newly defined to extend beyond output-only or infinite-width NTK analyses.
  • K operator no independent evidence
    purpose: Accounts for immediate parameter-to-state interactions
    Component of the NTK decomposition derived from the implicit constraint.
  • P operator no independent evidence
    purpose: Describes internal state-to-state dependencies
    Component of the NTK decomposition derived from the implicit constraint.

pith-pipeline@v0.9.0 · 5646 in / 1532 out tokens · 82472 ms · 2026-05-12T02:47:25.130590+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches — The paper's claim is directly supported by a theorem in the formal canon.
  • supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses — The paper appears to rely on the theorem as machinery.
  • contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · 2 internal anchors
