Neural LoFi models deep learning as layer-wise spectral filtering that selects maximal low-degree correlations, yielding a tractable surrogate for hierarchical representation learning beyond the lazy regime.
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt
7 Pith papers cite this work.
years: 2026 (7)
verdicts: UNVERDICTED (7)
roles: background (2)
polarities: background (2)
representative citing papers
Grokking reflects escape from a metastable low-dimensional regime in which transverse curvature accumulates before generalization; subspace motion is necessary, but a curvature boost is not sufficient.
Spherical mean-field Langevin dynamics concentrate near hidden indices in Gaussian multi-index models with a sharp temperature transition at λ ≃ 1 and achieve d/N and Md/N rates in single-index models via Lévy-Milman concentration.
On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.
Multi-task grokking in Transformers produces staggered generalization, low-dimensional manifolds, weight-decay phase structure, holographic solutions, and transverse redundancy.
For multi-index polynomials, the top-r eigenspace of the AGOP matrix from KRR recovers the central subspace at sample complexity n ~ d^{p+δ}, where p is the degree of the informative component.
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.