2 Pith papers cite this work.
Citing papers:
-
On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit, because non-global critical points are unstable under the dynamics.
-
Dynamic Mode Decomposition along Depth in Vision Transformers
Dynamic Mode Decomposition shows that short contiguous spans of Vision Transformer blocks can be approximated by a low-rank linear operator K with high predictive fidelity for horizons of p <= 4 steps, but this approximation fails to outperform an identity baseline when propagated to the final layer.
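The setting of the first citing paper can be sketched concretely. This is a toy illustration, not the paper's construction: full-batch gradient descent on a two-layer (shallow) network with a bounded nonlinearity (tanh) and mean-field 1/m output scaling; the width m, learning rate, step count, and sine-regression target are all hypothetical choices.

```python
# Toy sketch (assumed setup, not the paper's): gradient descent on a wide
# shallow model f(x) = (1/m) * sum_j a_j * tanh(w_j x + b_j).
import numpy as np

rng = np.random.default_rng(0)

n, m = 64, 512                       # samples, hidden width
x = np.linspace(-1.0, 1.0, n)[:, None]
y = np.sin(np.pi * x).ravel()        # toy regression target

w = rng.standard_normal((1, m))      # input weights
b = rng.standard_normal(m)           # biases
a = rng.standard_normal(m)           # output weights

lr = 2.0
losses = []
for _ in range(1000):
    h = np.tanh(x @ w + b)           # (n, m) hidden activations
    f = h @ a / m                    # (n,) mean-field-scaled predictions
    res = f - y
    loss = 0.5 * np.mean(res ** 2)
    losses.append(loss)
    # Backprop by hand for the mean-squared-error loss.
    grad_f = res / n
    grad_a = h.T @ grad_f / m
    grad_h = np.outer(grad_f, a / m) * (1.0 - h ** 2)   # tanh' = 1 - tanh^2
    grad_w = x.T @ grad_h
    grad_b = grad_h.sum(axis=0)
    a -= lr * grad_a
    w -= lr * grad_w
    b -= lr * grad_b
```

With a stable step size the training loss decreases monotonically on this toy problem; the paper's claim concerns the idealized infinite-width (mean-field) limit of such dynamics, which this finite-width sketch only gestures at.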
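The second citing paper's technique, Dynamic Mode Decomposition, can be sketched as fitting a rank-r linear operator K between consecutive depth snapshots via an SVD-projected least-squares fit. This is a generic DMD sketch, not the paper's code: the block activations are simulated with random data, and the dimension d, depth, rank r, and horizon p are hypothetical parameters.

```python
# Generic DMD sketch (assumed setup): fit a low-rank linear operator K
# that maps the features after block t to the features after block t+1.
import numpy as np

rng = np.random.default_rng(0)

d, depth = 32, 8
X = rng.standard_normal((depth + 1, d))  # stand-in per-block features

# Snapshot matrices: columns are states before/after one block.
A = X[:-1].T                             # shape (d, depth)
B = X[1:].T                              # shape (d, depth)

# DMD: least-squares operator projected onto the top-r SVD modes of A,
# i.e. K_tilde = U_r^T B V_r S_r^{-1}, lifted back to the full space.
r = 4
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ur, sr, Vr = U[:, :r], s[:r], Vt[:r].T
K_tilde = Ur.T @ B @ Vr / sr             # (r, r) reduced operator
K = Ur @ K_tilde @ Ur.T                  # rank-r operator on R^d

# Propagate p steps from the first snapshot and compare the error
# against an identity baseline (just keeping the initial features).
p = 4
x0 = X[0]
pred = np.linalg.matrix_power(K, p) @ x0
err_dmd = np.linalg.norm(pred - X[p])
err_id = np.linalg.norm(x0 - X[p])
```

On real Vision Transformer activations the paper's finding would correspond to err_dmd being small for short horizons (p <= 4) but not beating err_id when K is iterated all the way to the final layer; on the random data here no such structure is expected.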