Recognition: 2 Lean theorem links
On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
Pith reviewed 2026-05-12 03:44 UTC · model grok-4.3
The pith
Continuous-time gradient descent on wide shallow models with bounded nonlinearities converges only to global minimizers in the mean-field limit with full-support initialization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
All non-global minimizers of the training loss are unstable under gradient descent dynamics. When the initial parameter distribution has full support, and in the limit of many hidden neurons or attention heads, continuous-time gradient descent can only converge to global minimizers. The proof constructs an escaping active set for models with bounded nonlinearities and scalar output weights, then extends the construction to vector output weights; the mean-field training dynamic is shown to be well-posed and stable with respect to discretization for sub-Gaussian initializations.
What carries the argument
The escaping active set, a collection of directions in parameter space that allows the continuous-time gradient flow to leave any non-global minimizer.
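Concretely, the flow in question is the Wasserstein gradient flow of the mean-field loss functional F over parameter distributions, as in the Chizat–Bach framework (the same dynamic is quoted verbatim in the Lean-theorem links below); a minimal statement, with F' denoting the first variation:

```latex
% Wasserstein gradient flow of the mean-field loss F over parameter
% distributions \mu_t; F'(\mu) is the first variation of F at \mu.
\partial_t \mu_t = -\,\operatorname{div}\!\big(\mu_t\, v_t\big),
\qquad
v_t(u) = -\nabla F'(\mu_t)(u).
```

The escaping active set supplies, at any non-global minimizer, directions u along which this velocity field moves mass away from the critical point.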
If this is right
- Non-global minimizers become unstable and trajectories can escape them under the flow.
- In the infinite-width limit, the only possible limit points of the dynamics are global minimizers.
- The result covers multi-head attention layers and sigmoid networks with vector-valued outputs.
- The mean-field PDE is well-posed and the continuous-time limit is stable under discretization for sub-Gaussian initial data.
Where Pith is reading between the lines
- The instability mechanism may persist approximately for large but finite widths, suggesting that global convergence remains likely in practical networks initialized with spread-out parameters.
- Similar escaping-set constructions could be attempted for other architectures whose nonlinearities satisfy the boundedness condition.
- The mean-field analysis supplies a testable prediction: the probability of reaching a non-global minimizer should decrease as width grows, for fixed full-support initialization.
Load-bearing premise
The nonlinearities must be bounded and the dynamics must be considered in the continuous-time mean-field limit starting from an initial distribution with full support.
What would settle it
A numerical experiment in which finely discretized (near-continuous-time) gradient descent on a model with bounded activations, started from a Gaussian initialization with a large but finite number of neurons, converges to a non-global minimizer would falsify the central claim.
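The experiment above can be sketched with NumPy. This is an illustrative toy instance, not the paper's setup: the width, step size, and target function are assumptions, and the 1/m output scaling mimics the mean-field regime. The claim predicts the final loss should be a small fraction of the initial loss rather than stuck at a plateau.

```python
import numpy as np

# Wide two-layer network with a bounded activation (tanh), Gaussian
# (full-support) initialization, mean-field 1/m output scaling, and
# plain gradient descent. All sizes and rates are illustrative choices.
rng = np.random.default_rng(0)
d, m, n = 2, 512, 64                    # input dim, width, samples
X = rng.standard_normal((n, d))
y = np.tanh(X @ np.array([1.0, -2.0]))  # a realizable bounded target

W = rng.standard_normal((m, d))         # inner weights ~ N(0, I)
a = rng.standard_normal(m)              # output weights ~ N(0, 1)

def loss(W, a):
    return 0.5 * np.mean((np.tanh(X @ W.T) @ a / m - y) ** 2)

lr = 0.2                                # per-particle learning rate
init_loss = loss(W, a)
for _ in range(5000):
    H = np.tanh(X @ W.T)                # (n, m) hidden activations
    r = H @ a / m - y                   # residuals
    # per-particle (m-rescaled) gradients of the 1/m-scaled loss
    grad_a = H.T @ r / n
    grad_W = ((r[:, None] * (1 - H ** 2)).T @ X) * a[:, None] / n
    a -= lr * grad_a
    W -= lr * grad_W
final_loss = loss(W, a)
print(init_loss, final_loss)            # final loss should be far smaller
```

A finite-width run converging instead to a clearly suboptimal plateau, robustly across seeds and step sizes, would be the falsifying observation.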
read the original abstract
A surprising phenomenon in the training of neural networks is the ability of gradient descent to find global minimizers of the training loss despite its non-convexity. Following earlier works, we investigate this behavior for wide shallow networks. Existing results essentially cover the case of ReLU activations and the case of sigmoid activations with scalar output weights. We study a large class of models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights. Building upon [Chizat and Bach, 2018], we prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics. Thus, when the initial distribution of the parameters has full support (which includes the popular Gaussian case), and in the many hidden neurons or attention heads limit, continuous-time gradient descent can only converge to global minimizers. Establishing the instability of non-global minimizers corresponds to the construction of an "escaping active set" -- we complete the proof of [Chizat and Bach, 2018] to construct this set for models with bounded nonlinearities and scalar output weights. We also extend this construction to new cases for models with vector output weights. Finally, we show the well-posedness and the stability with respect to discretization of the mean field training dynamic for sub-Gaussian initializations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves global convergence of continuous-time gradient descent to global minimizers for wide shallow models with bounded nonlinearities (including multi-head attention) in the mean-field limit. Building on Chizat and Bach (2018), it completes the escaping-active-set construction to show instability of all non-global minimizers when the initial parameter distribution has full support (e.g., Gaussian), and extends the construction to vector-output weights. It also establishes well-posedness of the mean-field PDE and stability under discretization for sub-Gaussian initializations.
Significance. If the constructions hold, this extends global-convergence guarantees beyond ReLU and scalar-sigmoid cases to a broader class of bounded activations and vector-output architectures, providing a rigorous explanation for why gradient descent reaches global minima in the overparameterized regime. The completion of the prior proof and the vector-weight extension are concrete advances; the well-posedness result for sub-Gaussian measures is a useful technical contribution.
major comments (2)
- [§3.3] Construction of the escaping active set for bounded nonlinearities: the argument that the perturbation can be chosen to strictly decrease the loss while preserving the mean-field measure appears to use boundedness to control the remainder term, but it is not immediately clear whether the same perturbation works uniformly for all non-global critical points or requires a case distinction on the support of the measure.
- [§4.2] Extension to vector output weights: the escaping direction in the vector case is constructed explicitly, but the proof that this direction remains admissible under the dynamics for arbitrary output dimension should be cross-checked against the scalar case to confirm that no additional regularity on the output weights is implicitly used.
minor comments (3)
- [§2.2] The definition of the mean-field loss functional could state its dependence on the output dimension explicitly, to make the vector-weight extension easier to follow.
- [Theorem 5.1] The statement of discretization stability would benefit from an explicit constant or rate in terms of the sub-Gaussian parameter.
- The reference list should include the full citation details for Chizat and Bach (2018) to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for their careful reading, positive assessment, and constructive suggestions. We address the two major comments below and will incorporate clarifications in the revised manuscript.
read point-by-point responses
-
Referee: [§3.3] Construction of the escaping active set for bounded nonlinearities: the argument that the perturbation can be chosen to strictly decrease the loss while preserving the mean-field measure appears to use boundedness to control the remainder term, but it is not immediately clear whether the same perturbation works uniformly for all non-global critical points or requires a case distinction on the support of the measure.
Authors: We thank the referee for this observation. Boundedness of the nonlinearity is used to control the remainder when perturbing the loss along the escaping direction. Because the initial measure is assumed to have full support, the same local perturbation construction applies uniformly to every non-global critical point: the full-support property guarantees that the measure can be adjusted in the required directions without needing case distinctions on the support. We will add a short clarifying paragraph after the statement of the main escaping-set result to make this uniformity explicit. revision: yes
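As a hedged sketch of why full support removes case distinctions (notation assumed, in the spirit of the Chizat–Bach first-variation calculus, not the paper's exact statement), the uniformity argument rests on the expansion

```latex
% First-order expansion of the mean-field loss F under a convex
% perturbation of \mu toward another admissible measure \nu; boundedness
% of the nonlinearity is what controls the remainder uniformly in \mu.
F\big((1-\varepsilon)\,\mu + \varepsilon\,\nu\big)
  = F(\mu)
  + \varepsilon \int F'(\mu)\,\mathrm{d}(\nu-\mu)
  + O(\varepsilon^{2}).
```

At a non-global critical point some ν makes the first-order term strictly negative, and full support of the initialization lets the flow place mass along that direction regardless of where μ itself is supported.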
-
Referee: [§4.2] Extension to vector output weights: the escaping direction in the vector case is constructed explicitly, but the proof that this direction remains admissible under the dynamics for arbitrary output dimension should be cross-checked against the scalar case to confirm that no additional regularity on the output weights is implicitly used.
Authors: We appreciate the request for an explicit cross-check. The escaping direction for vector output weights is chosen componentwise, following exactly the same linear-algebraic argument used in the scalar case; no extra regularity on the output weights is invoked beyond the Lipschitz and boundedness assumptions already stated for the scalar setting. The mean-field PDE analysis that establishes admissibility carries over verbatim to any finite output dimension. We will insert a short comparative remark at the beginning of Section 4.2 that recalls the scalar construction and notes the direct extension. revision: yes
Circularity Check
No significant circularity; derivation extends external prior result
full rationale
The paper completes and extends the escaping-active-set construction from Chizat and Bach (2018), a work by different authors, for bounded nonlinearities and vector-output cases including attention. The central instability proof and mean-field limit arguments rely on independent constructions and well-posedness results rather than self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The full-support initial measure condition and discretization stability are established directly without reducing to the target global convergence claim by construction. This is a standard non-circular extension of prior independent work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Nonlinearities are bounded
- domain assumption: Initial parameter distribution has full support
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · "We prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics... construction of an 'escaping active set'"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear · "Wasserstein gradient flow of F... ∂t μt = −div(μt vt) with vt(u) = −∇F'(μt)(u)"