Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime

Yuqing Wang

arxiv: 2606.23364 · v1 · pith:GJEAEQCEnew · submitted 2026-06-22 · 💻 cs.LG · math.DS· math.OC

Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime

Yuqing Wang This is my paper

Pith reviewed 2026-06-26 08:51 UTC · model grok-4.3

classification 💻 cs.LG math.DSmath.OC

keywords gradient descent convergenceneural network architecturesbeyond NTK regimetransformersPL inequalitystationary pointsblock-level analysisgeneralized smoothness

0 comments

The pith

Gradient descent converges to the neighborhood of a stationary point from almost all initializations in general neural architectures beyond the NTK regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a convergence analysis for gradient descent that applies to a wide range of neural network architectures, including pre-normalized multi-layer transformers, without relying on the neural tangent kernel approximation. It shows that under mild assumptions, standard learning rates drive the iterates near a stationary point for almost every starting point. The argument works by establishing an iterate-dependent PL-type inequality using analyticity and measure-zero sets, together with trajectory-specific smoothness bounds derived from polynomial generalized smoothness and a local relaxed dissipative condition. The framework is stated at the level of network blocks rather than individual layers. This yields concrete scaling rules for learning rates that depend on depth and bottleneck dimensions under Xavier initialization.

Core claim

Under mild assumptions, for almost all initializations, gradient descent with regular learning rates converges to the neighbourhood of a stationary point for a broad family of neural network architectures including pre-normalized multi-layer transformers. The proof proceeds by establishing an iterate-dependent PL-type inequality through analyticity and measure-zero arguments, and by proving Lipschitz smoothness along the GD trajectory through polynomial generalized smoothness and a local relaxed dissipative condition. The framework is formulated at the level of network blocks. The result is further interpreted under Xavier initialization and practical architectural scaling, and structural no

What carries the argument

Iterate-dependent PL-type inequality established via analyticity and measure-zero arguments, paired with trajectory Lipschitz smoothness from polynomial generalized smoothness and a local relaxed dissipative condition, all operating at the network-block level.

If this is right

Learning-rate scale depends on depth and effective bottleneck dimensions rather than largest width under Xavier initialization.
Residual connections and function composition satisfy structural nondegeneracy conditions inside the framework.
Global minimizers admit a generic characterization within the block-level analysis.
The same block-level argument covers architectures beyond the specific transformer example.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The block-level formulation may simplify analysis of other composite architectures such as convolutional or recurrent networks.
The dependence of learning rate on bottleneck dimension rather than width suggests practical scaling rules that remain stable as models grow wider but keep fixed bottlenecks.
If the measure-zero exceptional set can be made explicit, it would allow deterministic initialization schemes that avoid the bad set with high probability.

Load-bearing premise

The mild assumptions allowing an iterate-dependent PL inequality via analyticity plus trajectory smoothness via generalized polynomial bounds and a local dissipative condition.

What would settle it

A positive-measure set of initializations from which gradient descent on a pre-normalized transformer fails to enter any neighborhood of a stationary point under a standard learning rate.

Figures

Figures reproduced from arXiv: 2606.23364 by Yuqing Wang.

read the original abstract

Training dynamics is central to understanding neural networks, yet its theoretical analysis remains difficult even for simple architectures and becomes substantially more challenging for general modern architectures. In this paper, we propose a convergence framework for analyzing gradient descent (GD) dynamics under a broad family of neural network architectures and datasets beyond the neural tangent kernel (NTK) regime. The framework is formulated at the level of network blocks and covers architectures including pre-normalized multi-layer transformers. More precisely, under mild assumptions, we prove that for almost all initializations, GD with regular learning rates converges to the neighbourhood of a stationary point. This is mainly proved by establishing an iterate-dependent PL-type inequality through analyticity and measure-zero arguments, and by proving Lipschitz smoothness along the GD trajectory through polynomial generalized smoothness and a local relaxed dissipative condition. We further interpret the theorem under Xavier initialization and practical architectural scaling, showing that the learning rate scale depends on the depth and effective bottleneck dimensions rather than the largest width. Finally, we derive structural nondegeneracy implications for residual connections and function composition, and provide a generic characterization of global minimizers within our framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a block-level GD convergence result for pre-norm transformers beyond NTK, but the key assumptions stay unverified without the full derivations.

read the letter

The main thing here is a convergence framework for gradient descent that applies at the network-block level to architectures like pre-normalized transformers. It claims that under mild assumptions, GD with ordinary learning rates reaches a neighborhood of a stationary point for almost all initializations, and it derives depth-dependent learning-rate scaling from Xavier initialization and bottleneck dimensions rather than raw width.

The new pieces are the use of analyticity plus measure-zero arguments to build an iterate-dependent PL inequality, plus polynomial generalized smoothness and a local relaxed dissipative condition to control Lipschitz behavior along the trajectory. The block-level setup lets them handle residual connections and function composition, and they give a generic characterization of global minimizers inside the framework. That is a clear step past standard NTK analyses.

The soft spots sit in the assumptions. The abstract calls them mild, but without the actual definitions, lemmas, or architecture-specific checks it is impossible to tell whether the analyticity and dissipative conditions hold for realistic pre-norm transformers or whether they quietly restrict the result. The lack of explicit error bounds or any numerical checks on the trajectory also leaves the practical reach unclear.

This is for theorists who work on optimization guarantees for modern deep networks. A reader who wants to see how to move past NTK for these models could get something useful if the proofs close, but the current write-up does not yet let you judge that.

It deserves a serious referee to examine the details in the later sections. I would send it out for review.

Referee Report

1 major / 0 minor

Summary. The manuscript develops a convergence framework for gradient descent (GD) on a broad family of neural network architectures, including pre-normalized multi-layer transformers, beyond the neural tangent kernel (NTK) regime. Under mild assumptions, it proves that for almost all initializations, GD with regular learning rates converges to the neighborhood of a stationary point. The proof strategy involves establishing an iterate-dependent PL-type inequality using analyticity and measure-zero arguments, and proving Lipschitz smoothness along the GD trajectory using polynomial generalized smoothness and a local relaxed dissipative condition. The paper also interprets the results under Xavier initialization, discusses learning rate scaling depending on depth and effective bottleneck dimensions, derives structural nondegeneracy implications for residual connections and function composition, and provides a generic characterization of global minimizers.

Significance. If the claims are substantiated, this would be a significant advance by extending convergence guarantees beyond the NTK regime to practical architectures such as transformers. The block-level formulation of the framework and the explicit dependence of learning-rate scale on depth and bottleneck dimensions (rather than width) are practical strengths. The analyticity-based measure-zero argument for the iterate-dependent PL inequality, if verified, offers a novel technical tool for non-NTK analysis.

major comments (1)

[Abstract] Abstract (proof strategy paragraph): The abstract outlines the proof ingredients (analyticity and measure-zero arguments for the iterate-dependent PL inequality; polynomial generalized smoothness plus local relaxed dissipativity for trajectory Lipschitz smoothness) but supplies no derivation details, explicit error bounds, or verification that the mild assumptions hold for the claimed architectures. This prevents assessment of whether the central convergence claim is supported.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of extending convergence guarantees beyond the NTK regime to practical architectures such as transformers. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract (proof strategy paragraph): The abstract outlines the proof ingredients (analyticity and measure-zero arguments for the iterate-dependent PL inequality; polynomial generalized smoothness plus local relaxed dissipativity for trajectory Lipschitz smoothness) but supplies no derivation details, explicit error bounds, or verification that the mild assumptions hold for the claimed architectures. This prevents assessment of whether the central convergence claim is supported.

Authors: Abstracts are intended to provide a concise high-level overview of contributions and proof strategy rather than full derivations or bounds, which is standard practice. The detailed derivations are contained in the main manuscript: the iterate-dependent PL-type inequality via analyticity and measure-zero arguments appears in Section 3 (Theorem 3.1 and proof); trajectory Lipschitz smoothness via polynomial generalized smoothness and the local relaxed dissipative condition is in Section 4 (Theorem 4.2); verification of the mild assumptions for the claimed architectures (including pre-normalized transformers) together with Xavier initialization and learning-rate scaling on depth and bottleneck dimensions is in Section 5; and the central convergence result with explicit neighborhood error bounds is stated as Theorem 2.1 (with full proof in the appendix). The full manuscript therefore supplies the requested details and allows assessment of the claims. We do not believe revisions to the abstract are necessary but would be willing to add a single sentence referencing the key theorems if the referee prefers. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claim is a convergence theorem proved via an iterate-dependent PL inequality (using analyticity + measure-zero arguments) and trajectory Lipschitz smoothness (via polynomial generalized smoothness + local relaxed dissipativity). These are standard external mathematical tools with no reduction to the target result by definition, no fitted inputs renamed as predictions, and no load-bearing self-citations. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Ledger populated from abstract only; central claim rests on several domain assumptions labeled mild but not enumerated.

axioms (2)

domain assumption Mild assumptions on network and data that permit analyticity and measure-zero arguments to establish iterate-dependent PL inequality.
Invoked to prove convergence for almost all initializations.
domain assumption Polynomial generalized smoothness together with local relaxed dissipative condition along the GD trajectory.
Used to establish Lipschitz smoothness during training.

pith-pipeline@v0.9.1-grok · 5723 in / 1240 out tokens · 19906 ms · 2026-06-26T08:51:59.354089+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

129 extracted references · 1 canonical work pages

[1]

Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

Truthfulqa: Measuring how models mimic human falsehoods , author=. Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers) , pages=
[2]

2017 , howpublished =

Cam Nugent , title =. 2017 , howpublished =

2017
[3]

Statistics & Probability Letters , volume=

Sparse spatial autoregressions , author=. Statistics & Probability Letters , volume=. 1997 , publisher=

1997
[4]

arXiv preprint arXiv:2410.14602 , year=

How Does Data Diversity Shape the Weight Landscape of Neural Networks? , author=. arXiv preprint arXiv:2410.14602 , year=

arXiv
[5]

Proceedings of the 41st International Conference on Machine Learning , pages=

LESS: selecting influential data for targeted instruction tuning , author=. Proceedings of the 41st International Conference on Machine Learning , pages=
[6]

arXiv preprint arXiv:2411.15349 , year=

Zero-shot coreset selection: Efficient pruning for unlabeled data , author=. arXiv preprint arXiv:2411.15349 , year=

arXiv
[7]

International conference on machine learning , pages=

The loss surface of deep and wide neural networks , author=. International conference on machine learning , pages=. 2017 , organization=

2017
[8]

The Thirty Sixth Annual Conference on Learning Theory , pages=

Over-parameterization exponentially slows down gradient descent for learning a single neuron , author=. The Thirty Sixth Annual Conference on Learning Theory , pages=. 2023 , organization=

2023
[9]

Applied and Computational Harmonic Analysis , volume=

Loss landscapes and optimization in over-parameterized non-linear systems and neural networks , author=. Applied and Computational Harmonic Analysis , volume=. 2022 , publisher=

2022
[10]

arXiv preprint arXiv:2601.00417 , year=

Deep delta learning , author=. arXiv preprint arXiv:2601.00417 , year=

Pith/arXiv arXiv
[11]

International Conference on Learning Representations , volume=

Hyper-connections , author=. International Conference on Learning Representations , volume=
[12]

International Conference on Machine Learning , pages=

Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep relu networks , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021
[13]

International Conference on Learning Representations , year=

Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning , author=. International Conference on Learning Representations , year=
[14]

doi:10.5281/zenodo.12608602 , url =

Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

work page doi:10.5281/zenodo.12608602
[15]

arXiv preprint arXiv:2109.07958 , year=

Truthfulqa: Measuring how models mimic human falsehoods , author=. arXiv preprint arXiv:2109.07958 , year=

Pith/arXiv arXiv
[16]

arXiv preprint arXiv:1803.05457 , year=

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

Pith/arXiv arXiv
[17]

ACM computing surveys (CSUR) , volume=

Feature selection: A data perspective , author=. ACM computing surveys (CSUR) , volume=. 2017 , publisher=

2017
[18]

arXiv preprint arXiv:1205.2653 , year=

L2 regularization for learning kernels , author=. arXiv preprint arXiv:1205.2653 , year=

Pith/arXiv arXiv
[19]

Proceedings of COMPSTAT'2010: 19th International Conference on Computational StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers , pages=

Large-scale machine learning with stochastic gradient descent , author=. Proceedings of COMPSTAT'2010: 19th International Conference on Computational StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers , pages=. 2010 , organization=

2010
[20]

International conference on Machine learning , pages=

Cross-entropy loss functions: Theoretical analysis and applications , author=. International conference on Machine learning , pages=. 2023 , organization=

2023
[21]

arXiv preprint arXiv:1412.6980 , year=

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

Pith/arXiv arXiv
[22]

Computers & Geosciences , volume=

Principal components analysis (PCA) , author=. Computers & Geosciences , volume=. 1993 , publisher=

1993
[23]

arXiv preprint arXiv:2302.13971 , year=

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

Pith/arXiv arXiv
[24]

Transactions on Machine Learning Research , issn=

A Survey on Data Selection for Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

2024
[25]

Acm transactions on knowledge discovery from data (tkdd) , volume=

l-diversity: Privacy beyond k-anonymity , author=. Acm transactions on knowledge discovery from data (tkdd) , volume=. 2007 , publisher=

2007
[26]

International Conference on Learning Representations , year=

Diversity is All You Need: Learning Skills without a Reward Function , author=. International Conference on Learning Representations , year=
[27]

Ieee Access , volume=

Diversity in machine learning , author=. Ieee Access , volume=. 2019 , publisher=

2019
[28]

IEEE INFOCOM 2003

Exploiting data diversity and multiuser diversity in noncooperative mobile infostation networks , author=. IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No. 03CH37428) , volume=. 2003 , organization=

2003
[29]

Advances in neural information processing systems , volume=

On sample-efficient offline reinforcement learning: Data diversity, posterior sampling and beyond , author=. Advances in neural information processing systems , volume=
[30]

European radiology experimental , volume=

Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem , author=. European radiology experimental , volume=. 2020 , publisher=

2020
[31]

Ieee transactions on computers , volume=

Data diversity: An approach to software fault tolerance , author=. Ieee transactions on computers , volume=. 2002 , publisher=

2002
[32]

2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN) , pages=

Security through redundant data diversity , author=. 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN) , pages=. 2008 , organization=

2008
[33]

Proceedings of the 29th international conference on computational linguistics , pages=

Can data diversity enhance learning generalization? , author=. Proceedings of the 29th international conference on computational linguistics , pages=
[34]

arXiv preprint arXiv:2501.19393 , year=

s1: Simple test-time scaling , author=. arXiv preprint arXiv:2501.19393 , year=

Pith/arXiv arXiv
[35]

Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Qingwei Lin and Daxin Jiang , booktitle=. Wizard. 2024 , url=

2024
[36]

Transactions on Machine Learning Research , year=

TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning , author=. Transactions on Machine Learning Research , year=
[37]

Advances in Neural Information Processing Systems , volume=

Convex and non-convex optimization under generalized smoothness , author=. Advances in Neural Information Processing Systems , volume=
[38]

arXiv preprint arXiv:1905.11881 , year=

Why gradient clipping accelerates training: A theoretical justification for adaptivity , author=. arXiv preprint arXiv:1905.11881 , year=

arXiv 1905
[39]

MATHEMATICAL INEQUALITIES AND APPLICATIONS , volume=

A remark on the convergence of the Tikhonov regularization without monotonicity , author=. MATHEMATICAL INEQUALITIES AND APPLICATIONS , volume=. 2004 , publisher=

2004
[40]

Hyperbolic polynomials and multiparameter real-analytic perturbation theory , author=
[41]

2000 , publisher=

Computational geometry: algorithms and applications , author=. 2000 , publisher=

2000
[42]

Journal of approximation theory , volume=

A generalization of the Bramble-Hilbert lemma and applications to multivariate interpolation , author=. Journal of approximation theory , volume=. 1979 , publisher=

1979
[43]

SIAM undergraduate research online , volume=

A simple expression for multivariate Lagrange interpolation , author=. SIAM undergraduate research online , volume=
[44]

1999 , publisher=

Discrete versions of Gronwall's lemma and their application to the numerical analysis of parabolic problems , author=. 1999 , publisher=

1999
[45]

2012 , publisher=

Differential topology , author=. 2012 , publisher=

2012
[46]

arXiv preprint arXiv:1512.07276 , year=

The zero set of a real analytic function , author=. arXiv preprint arXiv:1512.07276 , year=

Pith/arXiv arXiv
[47]

International Conference on Learning Representations , year=

Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect , author=. International Conference on Learning Representations , year=
[48]

Econometrica: Journal of the Econometric Society , pages=

Nice demand functions , author=. Econometrica: Journal of the Econometric Society , pages=. 1973 , publisher=

1973
[49]

arXiv preprint arXiv:2003.02218 , year=

The large learning rate phase of deep learning: the catapult mechanism , author=. arXiv preprint arXiv:2003.02218 , year=

arXiv 2003
[50]

arXiv preprint arXiv:2310.17087 , year=

Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult , author=. arXiv preprint arXiv:2310.17087 , year=

arXiv
[51]

SIAM Journal on Numerical Analysis , volume=

Estimation of linear functionals on Sobolev spaces with application to Fourier transforms and spline interpolation , author=. SIAM Journal on Numerical Analysis , volume=. 1970 , publisher=

1970
[52]

Annals of Mathematics , volume=

On Levi's problem and the imbedding of real-analytic manifolds , author=. Annals of Mathematics , volume=. 1958 , publisher=

1958
[53]

International Conference on Machine Learning , pages=

A convergence theory for deep learning via over-parameterization , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019
[54]

International conference on machine learning , pages=

Gradient descent finds global minima of deep neural networks , author=. International conference on machine learning , pages=. 2019 , organization=

2019
[55]

Machine Learning , volume=

Gradient descent optimizes over-parameterized deep ReLU networks , author=. Machine Learning , volume=. 2020 , publisher=

2020
[56]

Advances in neural information processing systems , volume=

Wide neural networks of any depth evolve as linear models under gradient descent , author=. Advances in neural information processing systems , volume=
[57]

Advances in neural information processing systems , volume=

An improved analysis of training over-parameterized deep neural networks , author=. Advances in neural information processing systems , volume=
[58]

International Conference on Learning Representations , year=

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks , author=. International Conference on Learning Representations , year=
[59]

International Conference on Learning Representations , year=

How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? , author=. International Conference on Learning Representations , year=
[60]

arXiv preprint arXiv:1906.03593 , year=

Quadratic suffices for over-parametrization via matrix chernoff bound , author=. arXiv preprint arXiv:1906.03593 , year=

arXiv 1906
[61]

IEEE Journal on Selected Areas in Information Theory , volume=

Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks , author=. IEEE Journal on Selected Areas in Information Theory , volume=. 2020 , publisher=

2020
[62]

Advances in neural information processing systems , volume=

Neural tangent kernel: Convergence and generalization in neural networks , author=. Advances in neural information processing systems , volume=
[63]

Proceedings of the National Academy of Sciences , volume=

A mean field view of the landscape of two-layers neural networks , author=. Proceedings of the National Academy of Sciences , volume=
[64]

Advances in neural information processing systems , volume=

On the global convergence of gradient descent for over-parameterized models using optimal transport , author=. Advances in neural information processing systems , volume=
[65]

stat , volume=

Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error , author=. stat , volume=
[66]

Advances in Neural Information Processing Systems , volume=

Regularization matters: Generalization and optimization of neural nets vs their induced kernel , author=. Advances in Neural Information Processing Systems , volume=
[67]

Advances in Neural Information Processing Systems , volume=

A generalized neural tangent kernel analysis for two-layer neural networks , author=. Advances in Neural Information Processing Systems , volume=
[68]

SIAM Journal on Applied Mathematics , volume=

Mean field analysis of neural networks: A law of large numbers , author=. SIAM Journal on Applied Mathematics , volume=. 2020 , publisher=

2020
[69]

Conference on learning theory , pages=

Modeling from features: a mean-field framework for over-parameterized deep neural networks , author=. Conference on learning theory , pages=. 2021 , organization=

2021
[70]

arXiv preprint arXiv:2012.09816 , year=

Towards understanding ensemble, knowledge distillation and self-distillation in deep learning , author=. arXiv preprint arXiv:2012.09816 , year=

arXiv 2012
[71]

2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS) , pages=

Feature purification: How adversarial training performs robust deep learning , author=. 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS) , pages=. 2022 , organization=

2021
[72]

arXiv preprint arXiv:2202.06526 , year=

Benign Overfitting in Two-layer Convolutional Neural Networks , author=. arXiv preprint arXiv:2202.06526 , year=

arXiv
[73]

arXiv preprint arXiv:2204.10782 , year=

On feature learning in neural networks with global convergence guarantees , author=. arXiv preprint arXiv:2204.10782 , year=

arXiv
[74]

arXiv preprint arXiv:2503.09565 , year=

Global Convergence and Rich Feature Learning in L -Layer Infinite-Width Neural Networks under P Parametrization , author=. arXiv preprint arXiv:2503.09565 , year=

arXiv
[75]

arXiv preprint arXiv:2011.14522 , year=

Feature learning in infinite-width neural networks , author=. arXiv preprint arXiv:2011.14522 , year=

arXiv 2011
[76]

Advances in Neural Information Processing Systems , volume=

High-dimensional asymptotics of feature learning: How one gradient step improves the representation , author=. Advances in Neural Information Processing Systems , volume=
[77]

Advances in Neural Information Processing Systems , volume=

Self-consistent dynamical field theory of kernel evolution in wide neural networks , author=. Advances in Neural Information Processing Systems , volume=
[78]

International conference on machine learning , pages=

Understanding contrastive representation learning through alignment and uniformity on the hypersphere , author=. International conference on machine learning , pages=. 2020 , organization=

2020
[79]

Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining , pages=

Towards representation alignment and uniformity in collaborative filtering , author=. Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining , pages=
[80]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Feature alignment and uniformity for test time adaptation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Showing first 80 references.

[1] [1]

Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

Truthfulqa: Measuring how models mimic human falsehoods , author=. Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

[2] [2]

2017 , howpublished =

Cam Nugent , title =. 2017 , howpublished =

2017

[3] [3]

Statistics & Probability Letters , volume=

Sparse spatial autoregressions , author=. Statistics & Probability Letters , volume=. 1997 , publisher=

1997

[4] [4]

arXiv preprint arXiv:2410.14602 , year=

How Does Data Diversity Shape the Weight Landscape of Neural Networks? , author=. arXiv preprint arXiv:2410.14602 , year=

arXiv

[5] [5]

Proceedings of the 41st International Conference on Machine Learning , pages=

LESS: selecting influential data for targeted instruction tuning , author=. Proceedings of the 41st International Conference on Machine Learning , pages=

[6] [6]

arXiv preprint arXiv:2411.15349 , year=

Zero-shot coreset selection: Efficient pruning for unlabeled data , author=. arXiv preprint arXiv:2411.15349 , year=

arXiv

[7] [7]

International conference on machine learning , pages=

The loss surface of deep and wide neural networks , author=. International conference on machine learning , pages=. 2017 , organization=

2017

[8] [8]

The Thirty Sixth Annual Conference on Learning Theory , pages=

Over-parameterization exponentially slows down gradient descent for learning a single neuron , author=. The Thirty Sixth Annual Conference on Learning Theory , pages=. 2023 , organization=

2023

[9] [9]

Applied and Computational Harmonic Analysis , volume=

Loss landscapes and optimization in over-parameterized non-linear systems and neural networks , author=. Applied and Computational Harmonic Analysis , volume=. 2022 , publisher=

2022

[10] [10]

arXiv preprint arXiv:2601.00417 , year=

Deep delta learning , author=. arXiv preprint arXiv:2601.00417 , year=

Pith/arXiv arXiv

[11] [11]

International Conference on Learning Representations , volume=

Hyper-connections , author=. International Conference on Learning Representations , volume=

[12] [12]

International Conference on Machine Learning , pages=

Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep relu networks , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021

[13] [13]

International Conference on Learning Representations , year=

Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning , author=. International Conference on Learning Representations , year=

[14] [14]

doi:10.5281/zenodo.12608602 , url =

Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

work page doi:10.5281/zenodo.12608602

[15] [15]

arXiv preprint arXiv:2109.07958 , year=

Truthfulqa: Measuring how models mimic human falsehoods , author=. arXiv preprint arXiv:2109.07958 , year=

Pith/arXiv arXiv

[16] [16]

arXiv preprint arXiv:1803.05457 , year=

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

Pith/arXiv arXiv

[17] [17]

ACM computing surveys (CSUR) , volume=

Feature selection: A data perspective , author=. ACM computing surveys (CSUR) , volume=. 2017 , publisher=

2017

[18] [18]

arXiv preprint arXiv:1205.2653 , year=

L2 regularization for learning kernels , author=. arXiv preprint arXiv:1205.2653 , year=

Pith/arXiv arXiv

[19] [19]

Proceedings of COMPSTAT'2010: 19th International Conference on Computational StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers , pages=

Large-scale machine learning with stochastic gradient descent , author=. Proceedings of COMPSTAT'2010: 19th International Conference on Computational StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers , pages=. 2010 , organization=

2010

[20] [20]

International conference on Machine learning , pages=

Cross-entropy loss functions: Theoretical analysis and applications , author=. International conference on Machine learning , pages=. 2023 , organization=

2023

[21] [21]

arXiv preprint arXiv:1412.6980 , year=

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

Pith/arXiv arXiv

[22] [22]

Computers & Geosciences , volume=

Principal components analysis (PCA) , author=. Computers & Geosciences , volume=. 1993 , publisher=

1993

[23] [23]

arXiv preprint arXiv:2302.13971 , year=

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

Pith/arXiv arXiv

[24] [24]

Transactions on Machine Learning Research , issn=

A Survey on Data Selection for Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

2024

[25] [25]

Acm transactions on knowledge discovery from data (tkdd) , volume=

l-diversity: Privacy beyond k-anonymity , author=. Acm transactions on knowledge discovery from data (tkdd) , volume=. 2007 , publisher=

2007

[26] [26]

International Conference on Learning Representations , year=

Diversity is All You Need: Learning Skills without a Reward Function , author=. International Conference on Learning Representations , year=

[27] [27]

Ieee Access , volume=

Diversity in machine learning , author=. Ieee Access , volume=. 2019 , publisher=

2019

[28] [28]

IEEE INFOCOM 2003

Exploiting data diversity and multiuser diversity in noncooperative mobile infostation networks , author=. IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No. 03CH37428) , volume=. 2003 , organization=

2003

[29] [29]

Advances in neural information processing systems , volume=

On sample-efficient offline reinforcement learning: Data diversity, posterior sampling and beyond , author=. Advances in neural information processing systems , volume=

[30] [30]

European radiology experimental , volume=

Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem , author=. European radiology experimental , volume=. 2020 , publisher=

2020

[31] [31]

Ieee transactions on computers , volume=

Data diversity: An approach to software fault tolerance , author=. Ieee transactions on computers , volume=. 2002 , publisher=

2002

[32] [32]

2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN) , pages=

Security through redundant data diversity , author=. 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN) , pages=. 2008 , organization=

2008

[33] [33]

Proceedings of the 29th international conference on computational linguistics , pages=

Can data diversity enhance learning generalization? , author=. Proceedings of the 29th international conference on computational linguistics , pages=

[34] [34]

arXiv preprint arXiv:2501.19393 , year=

s1: Simple test-time scaling , author=. arXiv preprint arXiv:2501.19393 , year=

Pith/arXiv arXiv

[35] [35]

Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Qingwei Lin and Daxin Jiang , booktitle=. Wizard. 2024 , url=

2024

[36] [36]

Transactions on Machine Learning Research , year=

TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning , author=. Transactions on Machine Learning Research , year=

[37] [37]

Advances in Neural Information Processing Systems , volume=

Convex and non-convex optimization under generalized smoothness , author=. Advances in Neural Information Processing Systems , volume=

[38] [38]

arXiv preprint arXiv:1905.11881 , year=

Why gradient clipping accelerates training: A theoretical justification for adaptivity , author=. arXiv preprint arXiv:1905.11881 , year=

arXiv 1905

[39] [39]

MATHEMATICAL INEQUALITIES AND APPLICATIONS , volume=

A remark on the convergence of the Tikhonov regularization without monotonicity , author=. MATHEMATICAL INEQUALITIES AND APPLICATIONS , volume=. 2004 , publisher=

2004

[40] [40]

Hyperbolic polynomials and multiparameter real-analytic perturbation theory , author=

[41] [41]

2000 , publisher=

Computational geometry: algorithms and applications , author=. 2000 , publisher=

2000

[42] [42]

Journal of approximation theory , volume=

A generalization of the Bramble-Hilbert lemma and applications to multivariate interpolation , author=. Journal of approximation theory , volume=. 1979 , publisher=

1979

[43] [43]

SIAM undergraduate research online , volume=

A simple expression for multivariate Lagrange interpolation , author=. SIAM undergraduate research online , volume=

[44] [44]

1999 , publisher=

Discrete versions of Gronwall's lemma and their application to the numerical analysis of parabolic problems , author=. 1999 , publisher=

1999

[45] [45]

2012 , publisher=

Differential topology , author=. 2012 , publisher=

2012

[46] [46]

arXiv preprint arXiv:1512.07276 , year=

The zero set of a real analytic function , author=. arXiv preprint arXiv:1512.07276 , year=

Pith/arXiv arXiv

[47] [47]

International Conference on Learning Representations , year=

Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect , author=. International Conference on Learning Representations , year=

[48] [48]

Econometrica: Journal of the Econometric Society , pages=

Nice demand functions , author=. Econometrica: Journal of the Econometric Society , pages=. 1973 , publisher=

1973

[49] [49]

arXiv preprint arXiv:2003.02218 , year=

The large learning rate phase of deep learning: the catapult mechanism , author=. arXiv preprint arXiv:2003.02218 , year=

arXiv 2003

[50] [50]

arXiv preprint arXiv:2310.17087 , year=

Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult , author=. arXiv preprint arXiv:2310.17087 , year=

arXiv

[51] [51]

SIAM Journal on Numerical Analysis , volume=

Estimation of linear functionals on Sobolev spaces with application to Fourier transforms and spline interpolation , author=. SIAM Journal on Numerical Analysis , volume=. 1970 , publisher=

1970

[52] [52]

Annals of Mathematics , volume=

On Levi's problem and the imbedding of real-analytic manifolds , author=. Annals of Mathematics , volume=. 1958 , publisher=

1958

[53] [53]

International Conference on Machine Learning , pages=

A convergence theory for deep learning via over-parameterization , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019

[54] [54]

International conference on machine learning , pages=

Gradient descent finds global minima of deep neural networks , author=. International conference on machine learning , pages=. 2019 , organization=

2019

[55] [55]

Machine Learning , volume=

Gradient descent optimizes over-parameterized deep ReLU networks , author=. Machine Learning , volume=. 2020 , publisher=

2020

[56] [56]

Advances in neural information processing systems , volume=

Wide neural networks of any depth evolve as linear models under gradient descent , author=. Advances in neural information processing systems , volume=

[57] [57]

Advances in neural information processing systems , volume=

An improved analysis of training over-parameterized deep neural networks , author=. Advances in neural information processing systems , volume=

[58] [58]

International Conference on Learning Representations , year=

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks , author=. International Conference on Learning Representations , year=

[59] [59]

International Conference on Learning Representations , year=

How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? , author=. International Conference on Learning Representations , year=

[60] [60]

arXiv preprint arXiv:1906.03593 , year=

Quadratic suffices for over-parametrization via matrix chernoff bound , author=. arXiv preprint arXiv:1906.03593 , year=

arXiv 1906

[61] [61]

IEEE Journal on Selected Areas in Information Theory , volume=

Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks , author=. IEEE Journal on Selected Areas in Information Theory , volume=. 2020 , publisher=

2020

[62] [62]

Advances in neural information processing systems , volume=

Neural tangent kernel: Convergence and generalization in neural networks , author=. Advances in neural information processing systems , volume=

[63] [63]

Proceedings of the National Academy of Sciences , volume=

A mean field view of the landscape of two-layers neural networks , author=. Proceedings of the National Academy of Sciences , volume=

[64] [64]

Advances in neural information processing systems , volume=

On the global convergence of gradient descent for over-parameterized models using optimal transport , author=. Advances in neural information processing systems , volume=

[65] [65]

stat , volume=

Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error , author=. stat , volume=

[66] [66]

Advances in Neural Information Processing Systems , volume=

Regularization matters: Generalization and optimization of neural nets vs their induced kernel , author=. Advances in Neural Information Processing Systems , volume=

[67] [67]

Advances in Neural Information Processing Systems , volume=

A generalized neural tangent kernel analysis for two-layer neural networks , author=. Advances in Neural Information Processing Systems , volume=

[68] [68]

SIAM Journal on Applied Mathematics , volume=

Mean field analysis of neural networks: A law of large numbers , author=. SIAM Journal on Applied Mathematics , volume=. 2020 , publisher=

2020

[69] [69]

Conference on learning theory , pages=

Modeling from features: a mean-field framework for over-parameterized deep neural networks , author=. Conference on learning theory , pages=. 2021 , organization=

2021

[70] [70]

arXiv preprint arXiv:2012.09816 , year=

Towards understanding ensemble, knowledge distillation and self-distillation in deep learning , author=. arXiv preprint arXiv:2012.09816 , year=

arXiv 2012

[71] [71]

2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS) , pages=

Feature purification: How adversarial training performs robust deep learning , author=. 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS) , pages=. 2022 , organization=

2021

[72] [72]

arXiv preprint arXiv:2202.06526 , year=

Benign Overfitting in Two-layer Convolutional Neural Networks , author=. arXiv preprint arXiv:2202.06526 , year=

arXiv

[73] [73]

arXiv preprint arXiv:2204.10782 , year=

On feature learning in neural networks with global convergence guarantees , author=. arXiv preprint arXiv:2204.10782 , year=

arXiv

[74] [74]

arXiv preprint arXiv:2503.09565 , year=

Global Convergence and Rich Feature Learning in L -Layer Infinite-Width Neural Networks under P Parametrization , author=. arXiv preprint arXiv:2503.09565 , year=

arXiv

[75] [75]

arXiv preprint arXiv:2011.14522 , year=

Feature learning in infinite-width neural networks , author=. arXiv preprint arXiv:2011.14522 , year=

arXiv 2011

[76] [76]

Advances in Neural Information Processing Systems , volume=

High-dimensional asymptotics of feature learning: How one gradient step improves the representation , author=. Advances in Neural Information Processing Systems , volume=

[77] [77]

Advances in Neural Information Processing Systems , volume=

Self-consistent dynamical field theory of kernel evolution in wide neural networks , author=. Advances in Neural Information Processing Systems , volume=

[78] [78]

International conference on machine learning , pages=

Understanding contrastive representation learning through alignment and uniformity on the hypersphere , author=. International conference on machine learning , pages=. 2020 , organization=

2020

[79] [79]

Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining , pages=

Towards representation alignment and uniformity in collaborative filtering , author=. Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining , pages=

[80] [80]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Feature alignment and uniformity for test time adaptation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=