The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.
Greg Yang
5 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG
Year: 2026
Verdict: UNVERDICTED
citing papers explorer
-
The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning
The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.
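The Kronecker-core claim can be illustrated with a toy numerical check (this is a sketch of the general structure, not the paper's construction; all variable names are illustrative). For a one-hidden-layer ReLU network f(x) = w2 · relu(W1 x), the per-example gradient with respect to W1 is the rank-one outer product d_i x_i^T, so that layer's block of the empirical NTK factors into an elementwise (Hadamard) product of two Gram matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, N = 4, 8, 6              # input dim, hidden width, batch size

W1 = rng.normal(size=(n_hid, n_in)) / np.sqrt(n_in)
w2 = rng.normal(size=n_hid) / np.sqrt(n_hid)
X = rng.normal(size=(N, n_in))        # batch of inputs

# Forward pass for f(x) = w2 . relu(W1 x)
pre = X @ W1.T                        # (N, n_hid) pre-activations
mask = (pre > 0).astype(float)        # relu derivative

# Gradient of f wrt W1 for input i is the outer product d_i x_i^T,
# with d_i = w2 * relu'(W1 x_i): a rank-one "Kronecker core".
D = mask * w2                         # (N, n_hid) backprop vectors d_i

# Layer-1 block of the empirical NTK, computed two ways:
# (a) directly from flattened per-example gradients
G = np.einsum('ih,ij->ihj', D, X).reshape(N, -1)
K_direct = G @ G.T
# (b) via the Kronecker structure: Hadamard product of two Gram matrices
K_kron = (D @ D.T) * (X @ X.T)

assert np.allclose(K_direct, K_kron)
print("max deviation:", np.abs(K_direct - K_kron).max())
```

Because each factor Gram matrix reflects either input similarity or hidden-activation similarity, the product inherits structure from both, which is one concrete sense in which the kernel is biased toward dominant modes of joint input–hidden activity.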
-
How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.
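The t ~ c sqrt(n) crossover can be probed with a toy simulation (assumed setting: h_t = A h_{t-1} with a single fixed matrix A of i.i.d. N(0, 1/n) entries; the paper's exact model may differ):

```python
import numpy as np

def mean_sq_norm(n, t_max, trials=100, seed=0):
    """Average of ||h_t||^2 (with ||h_0|| = 1) over random draws of a
    fixed recurrence matrix A with i.i.d. N(0, 1/n) entries."""
    rng = np.random.default_rng(seed)
    out = np.zeros(t_max + 1)
    for _ in range(trials):
        A = rng.normal(size=(n, n)) / np.sqrt(n)
        h = rng.normal(size=n)
        h /= np.linalg.norm(h)
        for t in range(t_max + 1):
            out[t] += h @ h           # accumulate ||h_t||^2
            h = A @ h
    return out / trials

# Infinite-width theory predicts ||h_t||^2 = 1 at every depth t;
# at finite n, deviations should become O(1) once t is comparable
# to sqrt(n), and shrink at fixed t as n grows.
for n in (64, 256):
    m = mean_sq_norm(n, t_max=32)
    print(f"n={n}: |mean - 1| at t=4: {abs(m[4] - 1):.3f}, "
          f"at t=32: {abs(m[32] - 1):.3f}")
```

Comparing the two widths at the same depth gives a quick empirical handle on how long the infinite-width prediction "lasts" before finite-width effects take over.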
-
State-Space NTK Collapse Near Bifurcations
Bifurcations cause the state-space NTK (sNTK) to reduce to a dominant rank-one channel matching the bifurcation's normal form, collapsing its effective rank and funneling gradient descent into critical dynamical directions.
-
Learning Rate Transfer in Normalized Transformers
νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.
-
Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training
Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.
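The quantity in question can be sketched concretely (a minimal toy, assuming the Gram matrix is formed from the history of per-step parameter-update vectors; the hyperparameters and the linear-regression task are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, steps, lr = 10, 40, 50, 0.05

X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d)            # noiseless linear targets
w = np.zeros(d)

updates = []
for _ in range(steps):
    grad = X.T @ (X @ w - y) / N      # mean-squared-error gradient
    step = -lr * grad                 # gradient-descent parameter update
    w += step
    updates.append(step)

U = np.stack(updates)                 # (steps, d) update history
G = U @ U.T                           # Gram matrix of parameter updates
eigs = np.sort(np.linalg.eigvalsh(G))[::-1]
gap = eigs[0] - eigs[1]
print("top eigenvalues:", eigs[:3], "spectral gap:", gap)
```

Tracking this gap over training windows is the kind of diagnostic the entry describes: a widening or closing gap signals that the updates are concentrating into, or spreading out of, a dominant direction.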