Recognition: 2 Lean theorem links
On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
Pith reviewed 2026-05-12 03:44 UTC · model grok-4.3
The pith
Continuous-time gradient descent on wide shallow models with bounded nonlinearities converges only to global minimizers in the mean-field limit with full-support initialization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
All non-global minimizers of the training loss are unstable under gradient descent dynamics. When the initial parameter distribution has full support, and in the limit of many hidden neurons or attention heads, continuous-time gradient descent can only converge to global minimizers. The proof constructs an escaping active set for models with bounded nonlinearities and scalar output weights, then extends the construction to vector output weights; the mean-field training dynamic is shown to be well-posed and stable with respect to discretization for sub-Gaussian initializations.
What carries the argument
The escaping active set, a collection of directions in parameter space that allows the continuous-time gradient flow to leave any non-global minimizer.
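Concretely, the flow in question is the Wasserstein gradient flow of the mean-field loss functional F over parameter distributions, as in the Chizat–Bach framework (the same dynamic is quoted verbatim in the Lean-theorem links below); a minimal statement, with F' denoting the first variation:

```latex
% Wasserstein gradient flow of the mean-field loss F over parameter
% distributions \mu_t; F'(\mu) is the first variation of F at \mu.
\partial_t \mu_t = -\,\operatorname{div}\!\big(\mu_t\, v_t\big),
\qquad
v_t(u) = -\nabla F'(\mu_t)(u).
```

The escaping active set supplies, at any non-global minimizer, directions u along which this velocity field moves mass away from the critical point.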
If this is right
- Non-global minimizers become unstable and trajectories can escape them under the flow.
- In the infinite-width limit, the only possible limit points of the dynamics are global minimizers.
- The result covers multi-head attention layers and sigmoid networks with vector-valued outputs.
- The mean-field PDE is well-posed and the continuous-time limit is stable under discretization for sub-Gaussian initial data.
Where Pith is reading between the lines
- The instability mechanism may persist approximately for large but finite widths, suggesting that global convergence remains likely in practical networks initialized with spread-out parameters.
- Similar escaping-set constructions could be attempted for other architectures whose nonlinearities satisfy the boundedness condition.
- The mean-field analysis supplies a testable prediction: the probability of reaching a non-global minimizer should decrease as width grows, for fixed full-support initialization.
Load-bearing premise
The nonlinearities must be bounded and the dynamics must be considered in the continuous-time mean-field limit starting from an initial distribution with full support.
What would settle it
A numerical experiment in which finely discretized (near-continuous-time) gradient descent on a model with bounded activations, started from a Gaussian initialization with a large but finite number of neurons, converges to a non-global minimizer would falsify the central claim.
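The experiment above can be sketched with NumPy. This is an illustrative toy instance, not the paper's setup: the width, step size, and target function are assumptions, and the 1/m output scaling mimics the mean-field regime. The claim predicts the final loss should be a small fraction of the initial loss rather than stuck at a plateau.

```python
import numpy as np

# Wide two-layer network with a bounded activation (tanh), Gaussian
# (full-support) initialization, mean-field 1/m output scaling, and
# plain gradient descent. All sizes and rates are illustrative choices.
rng = np.random.default_rng(0)
d, m, n = 2, 512, 64                    # input dim, width, samples
X = rng.standard_normal((n, d))
y = np.tanh(X @ np.array([1.0, -2.0]))  # a realizable bounded target

W = rng.standard_normal((m, d))         # inner weights ~ N(0, I)
a = rng.standard_normal(m)              # output weights ~ N(0, 1)

def loss(W, a):
    return 0.5 * np.mean((np.tanh(X @ W.T) @ a / m - y) ** 2)

lr = 0.2                                # per-particle learning rate
init_loss = loss(W, a)
for _ in range(5000):
    H = np.tanh(X @ W.T)                # (n, m) hidden activations
    r = H @ a / m - y                   # residuals
    # per-particle (m-rescaled) gradients of the 1/m-scaled loss
    grad_a = H.T @ r / n
    grad_W = ((r[:, None] * (1 - H ** 2)).T @ X) * a[:, None] / n
    a -= lr * grad_a
    W -= lr * grad_W
final_loss = loss(W, a)
print(init_loss, final_loss)            # final loss should be far smaller
```

A finite-width run converging instead to a clearly suboptimal plateau, robustly across seeds and step sizes, would be the falsifying observation.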
read the original abstract
A surprising phenomenon in the training of neural networks is the ability of gradient descent to find global minimizers of the training loss despite its non-convexity. Following earlier works, we investigate this behavior for wide shallow networks. Existing results essentially cover the case of ReLU activations and the case of sigmoid activations with scalar output weights. We study a large class of models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights. Building upon [Chizat and Bach, 2018], we prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics. Thus, when the initial distribution of the parameters has full support (which includes the popular Gaussian case), and in the many hidden neurons or attention heads limit, continuous-time gradient descent can only converge to global minimizers. Establishing the instability of non-global minimizers corresponds to the construction of an "escaping active set" -- we complete the proof of [Chizat and Bach, 2018] to construct this set for models with bounded nonlinearities and scalar output weights. We also extend this construction to new cases for models with vector output weights. Finally, we show the well-posedness and the stability with respect to discretization of the mean field training dynamic for sub-Gaussian initializations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves global convergence of continuous-time gradient descent to global minimizers for wide shallow models with bounded nonlinearities (including multi-head attention) in the mean-field limit. Building on Chizat and Bach (2018), it completes the escaping-active-set construction to show instability of all non-global minimizers when the initial parameter distribution has full support (e.g., Gaussian), and extends the construction to vector-output weights. It also establishes well-posedness of the mean-field PDE and stability under discretization for sub-Gaussian initializations.
Significance. If the constructions hold, this extends global-convergence guarantees beyond ReLU and scalar-sigmoid cases to a broader class of bounded activations and vector-output architectures, providing a rigorous explanation for why gradient descent reaches global minima in the overparameterized regime. The completion of the prior proof and the vector-weight extension are concrete advances; the well-posedness result for sub-Gaussian measures is a useful technical contribution.
major comments (2)
- [§3.3] Construction of the escaping active set for bounded nonlinearities: the argument that the perturbation can be chosen to strictly decrease the loss while preserving the mean-field measure appears to use boundedness to control the remainder term, but it is not immediately clear whether the same perturbation works uniformly for all non-global critical points or requires a case distinction on the support of the measure.
- [§4.2] Extension to vector output weights: the escaping direction in the vector case is constructed explicitly, but the proof that this direction remains admissible under the dynamics for arbitrary output dimension should be cross-checked against the scalar case to confirm that no additional regularity on the output weights is implicitly used.
minor comments (3)
- [§2.2] The definition of the mean-field loss functional could state its dependence on the output dimension explicitly, to make the vector-weight extension easier to follow.
- [Theorem 5.1] The statement of discretization stability would benefit from an explicit constant or rate in terms of the sub-Gaussian parameter.
- The reference list should include the full citation details for Chizat and Bach (2018) to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for their careful reading, positive assessment, and constructive suggestions. We address the two major comments below and will incorporate clarifications in the revised manuscript.
read point-by-point responses
-
Referee: [§3.3] Construction of the escaping active set for bounded nonlinearities: the argument that the perturbation can be chosen to strictly decrease the loss while preserving the mean-field measure appears to use boundedness to control the remainder term, but it is not immediately clear whether the same perturbation works uniformly for all non-global critical points or requires a case distinction on the support of the measure.
Authors: We thank the referee for this observation. Boundedness of the nonlinearity is used to control the remainder when perturbing the loss along the escaping direction. Because the initial measure is assumed to have full support, the same local perturbation construction applies uniformly to every non-global critical point: the full-support property guarantees that the measure can be adjusted in the required directions without needing case distinctions on the support. We will add a short clarifying paragraph after the statement of the main escaping-set result to make this uniformity explicit. revision: yes
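As a hedged sketch of why full support removes case distinctions (notation assumed, in the spirit of the Chizat–Bach first-variation calculus, not the paper's exact statement), the uniformity argument rests on the expansion

```latex
% First-order expansion of the mean-field loss F under a convex
% perturbation of \mu toward another admissible measure \nu; boundedness
% of the nonlinearity is what controls the remainder uniformly in \mu.
F\big((1-\varepsilon)\,\mu + \varepsilon\,\nu\big)
  = F(\mu)
  + \varepsilon \int F'(\mu)\,\mathrm{d}(\nu-\mu)
  + O(\varepsilon^{2}).
```

At a non-global critical point some ν makes the first-order term strictly negative, and full support of the initialization lets the flow place mass along that direction regardless of where μ itself is supported.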
-
Referee: [§4.2] Extension to vector output weights: the escaping direction in the vector case is constructed explicitly, but the proof that this direction remains admissible under the dynamics for arbitrary output dimension should be cross-checked against the scalar case to confirm that no additional regularity on the output weights is implicitly used.
Authors: We appreciate the request for an explicit cross-check. The escaping direction for vector output weights is chosen componentwise, following exactly the same linear-algebraic argument used in the scalar case; no extra regularity on the output weights is invoked beyond the Lipschitz and boundedness assumptions already stated for the scalar setting. The mean-field PDE analysis that establishes admissibility carries over verbatim to any finite output dimension. We will insert a short comparative remark at the beginning of Section 4.2 that recalls the scalar construction and notes the direct extension. revision: yes
Circularity Check
No significant circularity; derivation extends external prior result
full rationale
The paper completes and extends the escaping-active-set construction from Chizat and Bach (2018), a work by different authors, for bounded nonlinearities and vector-output cases including attention. The central instability proof and mean-field limit arguments rely on independent constructions and well-posedness results rather than self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The full-support initial measure condition and discretization stability are established directly without reducing to the target global convergence claim by construction. This is a standard non-circular extension of prior independent work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Nonlinearities are bounded
- domain assumption: Initial parameter distribution has full support
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · "We prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics... construction of an 'escaping active set'"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear · "Wasserstein gradient flow of F... ∂t μt = −div(μt vt) with vt(u) = −∇F'(μt)(u)"