pith. machine review for the scientific record.

arxiv: 2605.10775 · v1 · submitted 2026-05-11 · 🧮 math.OC · cs.LG

Recognition: 2 theorem links · Lean Theorem

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

Clarice Poon, Gabriel Peyré, Romain Petit

Pith reviewed 2026-05-12 03:44 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords global convergence · gradient descent · mean-field limit · shallow neural networks · bounded nonlinearities · multi-head attention · escaping active set · optimization

The pith

Continuous-time gradient descent on wide shallow models with bounded nonlinearities converges only to global minimizers in the mean-field limit with full-support initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that non-global minimizers of the training loss are unstable for a class of wide shallow models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights. It does so by completing the construction of an escaping active set that allows parameters to leave these points under the gradient flow, extending earlier results limited to ReLU activations or scalar output weights. When the initial parameter distribution has full support, as in the Gaussian case, this instability implies that continuous-time gradient descent cannot converge to anything but a global minimizer once the number of hidden units or attention heads becomes large. A reader would care because the argument supplies a mechanism explaining why overparameterized networks trained by gradient descent routinely reach good solutions despite non-convexity.
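
For orientation, here is a minimal sketch of the standard mean-field parameterization this family of results is stated in; the notation (phi, mu, F) is illustrative and not necessarily the paper's.

```latex
% Illustrative setup only: m units (hidden neurons or attention heads) with parameters
% \theta_j and a bounded nonlinearity \phi.
\[
  f_m(x) \;=\; \frac{1}{m}\sum_{j=1}^{m}\phi(\theta_j;x)
  \;\xrightarrow[\;m\to\infty\;]{}\;
  f_\mu(x) \;=\; \int \phi(\theta;x)\,\mathrm{d}\mu(\theta),
  \qquad
  F(\mu) \;=\; \mathbb{E}_{(x,y)}\,\ell\big(f_\mu(x),y\big).
\]
% Gradient descent on the \theta_j becomes, in the wide limit, a gradient flow of F over
% probability measures \mu, started from a full-support initialization \mu_0 (e.g. Gaussian).
```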

Core claim

All non-global minimizers are unstable under the gradient descent dynamics. When the initial parameter distribution has full support, and in the limit of many hidden neurons or attention heads, continuous-time gradient descent can only converge to global minimizers. The proof proceeds by constructing an escaping active set for models with bounded nonlinearities and scalar output weights, then extending the construction to vector output weights; the mean-field training dynamic is shown to be well-posed and stable with respect to discretization for sub-Gaussian initializations.
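
In measure form, the continuous-time dynamic referred to here is typically the Wasserstein gradient flow of the loss. A hedged sketch in the same illustrative notation as above; the paper's exact regularity and scaling conventions are not reproduced.

```latex
\[
  \partial_t\mu_t \;=\; \operatorname{div}\!\big(\mu_t\,\nabla_\theta F'(\mu_t)\big),
  \qquad
  F'(\mu)(\theta) \;=\; \mathbb{E}_{(x,y)}\!\left[\partial_1\ell\big(f_\mu(x),y\big)\cdot\phi(\theta;x)\right],
\]
% where F'(\mu) denotes the first variation of the loss. In this language, "instability of a
% non-global minimizer" means that mass placed near it by a full-support \mu_0 is pushed
% away along -\nabla_\theta F'(\mu), so the flow cannot settle there.
```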

What carries the argument

The escaping active set, a collection of directions in parameter space that allows the continuous-time gradient flow to leave any non-global minimizer.
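
A heuristic picture of why such a set should exist, assuming (as is standard in this line of work) that the loss is convex in the model output; this is an editorial reconstruction, not the paper's definition.

```latex
% If \mu^\star is a critical point of F but not a global minimizer, convexity of F in f_\mu
% gives a parameter \theta_0 at which the first variation violates the optimality condition:
\[
  \exists\,\theta_0:\quad
  F'(\mu^\star)(\theta_0) \;<\; \int F'(\mu^\star)(\theta)\,\mathrm{d}\mu^\star(\theta).
\]
% The escaping active set is, informally, a region around such a \theta_0 on which moving mass
% strictly decreases the loss; full-support initialization guarantees the flow has mass there.
```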

If this is right

  • Non-global minimizers become unstable and trajectories can escape them under the flow.
  • In the infinite-width limit, the only possible limit points of the dynamics are global minimizers.
  • The result covers multi-head attention layers and sigmoid networks with vector-valued outputs.
  • The mean-field PDE is well-posed and the continuous-time limit is stable under discretization for sub-Gaussian initial data.
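
For the last point, a rough statement of what discretization stability usually amounts to in this setting; the metric, horizon, and mode of convergence here are illustrative, and the paper's quantitative statement may differ.

```latex
% Finite-width training viewed as an empirical measure of m particles drawn from the initialization:
\[
  \mu^m_0 \;=\; \frac{1}{m}\sum_{j=1}^{m}\delta_{\theta_j(0)},
  \qquad \theta_j(0)\overset{\text{i.i.d.}}{\sim}\mu_0 \ \text{(sub-Gaussian)},
\]
\[
  \sup_{t\in[0,T]} W_2\big(\mu^m_t,\,\mu_t\big) \;\longrightarrow\; 0
  \quad\text{as } m\to\infty \ \text{(for each fixed horizon } T\text{)},
\]
% so the finite-width gradient-flow trajectory tracks the mean-field solution, and the
% infinite-width conclusions constrain sufficiently wide finite networks.
```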

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The instability mechanism may persist approximately for large but finite widths, suggesting that global convergence remains likely in practical networks initialized with spread-out parameters.
  • Similar escaping-set constructions could be attempted for other architectures whose nonlinearities satisfy the boundedness condition.
  • The mean-field analysis supplies a testable prediction: the probability of reaching a non-global minimizer should decrease as width grows, for fixed full-support initialization.

Load-bearing premise

The nonlinearities must be bounded and the dynamics must be considered in the continuous-time mean-field limit starting from an initial distribution with full support.

What would settle it

A numerical experiment in which continuous-time or finely discretized gradient descent, started from a Gaussian distribution and with a large but finite number of neurons, converges to a non-global minimizer for a model with bounded activations would falsify the central claim.
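
A minimal sketch of such an experiment, assuming a two-layer sigmoid network with vector (2-dimensional) outputs, mean-field 1/m scaling, and small-step full-batch gradient descent from a Gaussian initialization, run across several widths. All architectural and optimization choices here are illustrative assumptions, not the paper's protocol; the teacher network makes the problem realizable, so the global minimum of the loss is zero and suboptimal plateaus are easy to detect.

```python
# Sketch only: train width-m two-layer sigmoid students with mean-field 1/m scaling from a
# Gaussian init, and check whether the final loss approaches the global minimum as m grows.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic teacher data: a 3-unit sigmoid network with 2-dimensional outputs, so the
# student class can represent the teacher and the global minimum of the loss is zero.
d, n = 5, 256
X = rng.normal(size=(n, d))
W_t = rng.normal(size=(3, d))
a_t = rng.normal(size=(3, 2))
y = (1.0 / (1.0 + np.exp(-X @ W_t.T))) @ a_t

def train(m, steps=10000, lr=0.5):
    """Full-batch gradient descent on a width-m student with 1/m mean-field scaling."""
    W = rng.normal(size=(m, d))      # inner weights, Gaussian init
    a = rng.normal(size=(m, 2))      # vector output weights
    for _ in range(steps):
        h = 1.0 / (1.0 + np.exp(-X @ W.T))          # (n, m) sigmoid activations
        r = h @ a / m - y                           # (n, 2) residuals
        grad_a = h.T @ r / (n * m)
        grad_W = ((r @ a.T) * h * (1 - h)).T @ X / (n * m)
        a -= lr * m * grad_a                        # step scaled by m so each unit
        W -= lr * m * grad_W                        # moves at an O(1) rate
    pred = (1.0 / (1.0 + np.exp(-X @ W.T))) @ a / m
    return 0.5 * np.mean(np.sum((pred - y) ** 2, axis=1))

for m in (8, 32, 128, 512):
    print(f"width {m:4d}: final loss {train(m):.3e}")
# Under the paper's claim, the final loss should tend toward the global minimum (~0) as the
# width grows; a robust plateau at a clearly suboptimal value for large m would be the
# falsifying outcome described above.
```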

read the original abstract

A surprising phenomenon in the training of neural networks is the ability of gradient descent to find global minimizers of the training loss despite its non-convexity. Following earlier works, we investigate this behavior for wide shallow networks. Existing results essentially cover the case of ReLU activations and the case of sigmoid activations with scalar output weights. We study a large class of models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights. Building upon [Chizat and Bach, 2018], we prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics. Thus, when the initial distribution of the parameters has full support (which includes the popular Gaussian case), and in the many hidden neurons or attention heads limit, continuous-time gradient descent can only converge to global minimizers. Establishing the instability of non-global minimizers corresponds to the construction of an "escaping active set" -- we complete the proof of [Chizat and Bach, 2018] to construct this set for models with bounded nonlinearities and scalar output weights. We also extend this construction to new cases for models with vector output weights. Finally, we show the well-posedness and the stability with respect to discretization of the mean field training dynamic for sub-Gaussian initializations.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proves global convergence of continuous-time gradient descent to global minimizers for wide shallow models with bounded nonlinearities (including multi-head attention) in the mean-field limit. Building on Chizat and Bach (2018), it completes the escaping-active-set construction to show instability of all non-global minimizers when the initial parameter distribution has full support (e.g., Gaussian), and extends the construction to vector-output weights. It also establishes well-posedness of the mean-field PDE and stability under discretization for sub-Gaussian initializations.

Significance. If the constructions hold, this extends global-convergence guarantees beyond ReLU and scalar-sigmoid cases to a broader class of bounded activations and vector-output architectures, providing a rigorous explanation for why gradient descent reaches global minima in the overparameterized regime. The completion of the prior proof and the vector-weight extension are concrete advances; the well-posedness result for sub-Gaussian measures is a useful technical contribution.

major comments (2)
  1. [§3.3] §3.3, construction of escaping active set for bounded nonlinearities: the argument that the perturbation can be chosen to strictly decrease the loss while preserving the mean-field measure appears to use boundedness to control the remainder term, but it is not immediately clear whether the same perturbation works uniformly for all non-global critical points or requires a case distinction on the support of the measure.
  2. [§4.2] §4.2, extension to vector output weights: the choice of the escaping direction in the vector case is constructed explicitly, but the proof that this direction remains admissible under the dynamics for arbitrary output dimension should be cross-checked against the scalar case to confirm no additional regularity on the output weights is implicitly used.
minor comments (3)
  1. [§2.2] §2.2: the definition of the mean-field loss functional could explicitly state the dependence on the output dimension to make the vector-weight extension easier to follow.
  2. [Theorem 5.1] Theorem 5.1: the statement of discretization stability would benefit from an explicit constant or rate in terms of the sub-Gaussian parameter.
  3. The reference list should include the full citation details for Chizat and Bach (2018) to avoid ambiguity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading, positive assessment, and constructive suggestions. We address the two major comments below and will incorporate clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: [§3.3] §3.3, construction of escaping active set for bounded nonlinearities: the argument that the perturbation can be chosen to strictly decrease the loss while preserving the mean-field measure appears to use boundedness to control the remainder term, but it is not immediately clear whether the same perturbation works uniformly for all non-global critical points or requires a case distinction on the support of the measure.

    Authors: We thank the referee for this observation. Boundedness of the nonlinearity is used to control the remainder when perturbing the loss along the escaping direction. Because the initial measure is assumed to have full support, the same local perturbation construction applies uniformly to every non-global critical point: the full-support property guarantees that the measure can be adjusted in the required directions without needing case distinctions on the support. We will add a short clarifying paragraph after the statement of the main escaping-set result to make this uniformity explicit. revision: yes

  2. Referee: [§4.2] §4.2, extension to vector output weights: the choice of the escaping direction in the vector case is constructed explicitly, but the proof that this direction remains admissible under the dynamics for arbitrary output dimension should be cross-checked against the scalar case to confirm no additional regularity on the output weights is implicitly used.

    Authors: We appreciate the request for an explicit cross-check. The escaping direction for vector output weights is chosen componentwise, following exactly the same linear-algebraic argument used in the scalar case; no extra regularity on the output weights is invoked beyond the Lipschitz and boundedness assumptions already stated for the scalar setting. The mean-field PDE analysis that establishes admissibility carries over verbatim to any finite output dimension. We will insert a short comparative remark at the beginning of Section 4.2 that recalls the scalar construction and notes the direct extension. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation extends external prior result

full rationale

The paper completes and extends the escaping-active-set construction from Chizat and Bach (2018), a work by different authors, for bounded nonlinearities and vector-output cases including attention. The central instability proof and mean-field limit arguments rely on independent constructions and well-posedness results rather than self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The full-support initial measure condition and discretization stability are established directly without reducing to the target global convergence claim by construction. This is a standard non-circular extension of prior independent work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard mean-field assumptions for neural networks plus the boundedness of the nonlinearities and full-support initialization; no new entities are postulated.

axioms (2)
  • domain assumption Nonlinearities are bounded
    Required to control the dynamics and construct the escaping active set for the studied models including sigmoid and attention.
  • domain assumption Initial parameter distribution has full support
    Ensures that the continuous-time flow can escape non-global minimizers.

pith-pipeline@v0.9.0 · 5534 in / 1289 out tokens · 63552 ms · 2026-05-12T03:44:00.937858+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 4 internal anchors

  1. [1] Replacing softmax with relu in vision transformers. arXiv preprint arXiv:2309.08586.
  2. [2] GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
  3. [3] Primer: Searching for efficient transformers for language modeling, 2022. URL https://arxiv.org/abs/2109.08668.
  4. [4] Difan Zou, Philip M. Long, and Quanquan Gu. International Conference on Learning Representations (ICLR).
  5. [5] Lénaïc Chizat and Francis Bach. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. Advances in Neural Information Processing Systems (NeurIPS), 2018.
  6. [6] Stephan Wojtowytsch, 2020.
  7. [7] Phan-Minh Nguyen and Huy Tuan Pham. Mathematical Statistics and Learning.
  8. [8] Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On Lazy Training in Differentiable Programming, 2019.
  9. [9] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Advances in Neural Information Processing Systems (NeurIPS).
  10. [10] Greg Yang and Edward J. Hu. International Conference on Machine Learning (ICML).
  11. [11] Peter Bartlett, David Helmbold, and Philip Long. International Conference on Machine Learning (ICML).
  12. [12] Bubacarr Bah et al. Information and Inference: A Journal of the IMA.
  13. [13] Yuanzhi Li and Yang Yuan. Advances in Neural Information Processing Systems (NeurIPS), 2017.
  14. [14] Zixiang Chen, Yuan Cao, et al. International Conference on Learning Representations (ICLR).
  15. [15] Quynh Nguyen. International Conference on Machine Learning (ICML).
  16. [16] Francis Bach. Journal of Machine Learning Research.
  17. [17] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Advances in Neural Information Processing Systems (NeurIPS), 2020.
  18. [18] Weinan E and Stephan Wojtowytsch. Calculus of Variations and Partial Differential Equations.
  19. [19] Eldad Haber and Lars Ruthotto. Inverse Problems.
  20. [20] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. International Conference on Machine Learning (ICML).
  21. [21] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. International Conference on Learning Representations (ICLR).
  22. [22] Kenji Kawaguchi. Advances in Neural Information Processing Systems (NeurIPS).
  23. [23] Moritz Hardt and Tengyu Ma. International Conference on Learning Representations (ICLR).
  24. [24] Thomas Laurent and James von Brecht. International Conference on Machine Learning (ICML).
  25. [25] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. International Conference on Learning Representations (ICLR).
  26. [26] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. International Conference on Machine Learning (ICML).
  27. [27] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. International Conference on Machine Learning (ICML).
  28. [28] Ziwei Ji and Matus Telgarsky. ICLR 2020.
  29. [29] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Machine Learning.
  30. [30] Yuanzhi Li and Yingyu Liang. arXiv preprint arXiv:1706.06263.
  31. [31] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. Advances in Neural Information Processing Systems (NeurIPS).
  32. [32] Yuanzhi Li and Yingyu Liang. Advances in Neural Information Processing Systems (NeurIPS).
  33. [33] Simon S. Du, Xiyu Zhai, Barnabás Póczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks.
  34. [34] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. Journal of Machine Learning Research.
  35. [35] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Advances in Neural Information Processing Systems (NeurIPS).
  36. [36] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.
  37. [37] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Advances in Neural Information Processing Systems (NeurIPS).
  38. [38] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. Advances in Neural Information Processing Systems (NeurIPS).
  39. [39] Song Mei and Andrea Montanari. Conference on Learning Theory (COLT).
  40. [40] Justin Sirignano and Konstantinos Spiliopoulos. SIAM Journal on Applied Mathematics.
  41. [41] Convex formulation of overparameterized deep neural networks. IEEE Transactions on Information Theory, 2022.
  42. [42] Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. Conference on Learning Theory, 2017.
  43. [43] Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648.
  44. [44] Florent Malrieu. Annals of Applied Probability.
  45. [45] Qianxiao Li, Cheng Tai, and Weinan E. International Conference on Machine Learning (ICML).
  46. [46] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. International Conference on Learning Representations (ICLR).
  47. [47] Gradient descent aligns the layers of deep linear networks. arXiv preprint arXiv:1810.02032.
  48. [48] Stochastic particle gradient descent for infinite ensembles. arXiv preprint arXiv:1712.05438.
  49. [49] Transformers are RNNs: Fast autoregressive transformers with linear attention. International Conference on Machine Learning, 2020.
  50. [50] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, et al. Rethinking attention with performers.
  51. [51] Hancheng Peng, Ramin Hasani, Alexander Amini, Daniela Rus, and Thomas Serre. ICLR.
  52. [52] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness.
  53. [53] Yiping Lu, Zhuohan Li, and Di He. ICLR 2020 Workshop ODE/PDE.
  54. [54] Tristan Deleu, Yoshua Bengio, and Joseph Paul Cohen, 2021.
  55. [55] Aditya Varre et al. Learning ..., 2025. arXiv:2508.12837, doi:10.48550/arXiv.2508.12837.
  56. [56] Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, and Jascha Sohl-Dickstein. International Conference on Machine Learning (ICML).
  57. [57] What can transformers learn in-context? A case study of simple function classes. Advances in Neural Information Processing Systems.
  58. [58] Understanding self-attention mechanism via dynamical system perspective. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  59. [59] Roman Vershynin. High-Dimensional Probability.
  60. [60] Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis. The Thirty-eighth Annual Conference on Neural Information Processing Systems.
  61. [61] Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality (extended abstract). Proceedings of the Thirty Seventh Conference on Learning Theory, 2024.
  62. [62] Guillaume Carlier, Arnaud Dupuy, Alfred Galichon, and Yifei Sun. Communications on Pure and Applied Mathematics. doi:10.1002/cpa.22047.
  63. [63] Non-asymptotic Convergence of Training Transformers for Next-token Prediction. The Thirty-eighth Annual Conference on Neural Information Processing Systems.
  64. [64] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures, 2009.
  65. [65] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural Ordinary Differential Equations. Advances in Neural Information Processing Systems.
  66. [66] Kaitong Hu, Zhenjie Ren, et al. Mean-Field ..., 2021. doi:10.1214/20-AIHP1140.
  67. [67] A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks, 2023. doi:10.4171/msl/42.
  68. [68] Stephan Wojtowytsch. On the ... arXiv:2005.13530, doi:10.48550/arXiv.2005.13530.
  69. [69] Raphaël Barboni et al. Understanding the Training of Infinitely Deep and Wide ..., 2025. doi:10.1002/cpa.70004.
  70. [70] Albert Alcalde, Giovanni Fantuzzi, and Enrique Zuazua. Exact ... arXiv:2502.02270, doi:10.48550/arXiv.2502.02270.
  71. [71] Albert Alcalde, Giovanni Fantuzzi, and Enrique Zuazua. Clustering in ... SIAM Journal on Mathematics of Data Science. doi:10.1137/24M167086X.
  72. [72] Filippo Santambrogio. {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bulletin of Mathematical Sciences. doi:10.1007/s13373-017-0101-1.
  73. [73] Peyré. Optimal ..., 2025. arXiv:2505.06589, doi:10.48550/arXiv.2505.06589.
  74. [74] A Mean Field View of the Landscape of Two-Layer Neural Networks, 2018. doi:10.1073/pnas.1806579115.
  75. [75] Grant Rotskoff and Eric Vanden-Eijnden. Parameters as Interacting Particles: Long Time Convergence and Asymptotic Error Scaling of Neural Networks. Advances in Neural Information Processing Systems, 2018.
  76. [76] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for Activation Functions. arXiv:1710.05941, doi:10.48550/arXiv.1710.05941.
  77. [77] Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, 2018. doi:10.1016/j.neunet.2017.12.012.
  78. [78] Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs). arXiv:1606.08415, doi:10.48550/arXiv.1606.08415.
  79. [79] L. Chizat. Implicit ... Proceedings of ..., 2020.
  80. [80] Francesco Maggi. Sets of Finite Perimeter and Geometric Variational Problems. doi:10.1017/CBO9781139108133.

Showing first 80 references.