The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

Eitan Gronich; Gal Vardi

arXiv: 2602.16340 · v3 · pith:W4JVQ3RAnew · submitted 2026-02-18 · cs.LG · stat.ML

classification: cs.LG, stat.ML

keywords: norm, bias, descent, homogeneous, margin, models, steepest, adam

We study the implicit bias of momentum-based optimizers on smooth homogeneous models. We show that momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
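
As a rough illustration of the family of updates named in the abstract, the NumPy sketch below takes one momentum steepest descent step under each of the three norms. It is a minimal sketch under stated assumptions, not the paper's exact formulation. The function names `steepest_direction` and `momentum_steepest_step`, the exponential-moving-average form of the momentum buffer, the normalization of the $\ell_2$ step, and the use of an exact SVD for the spectral-norm direction are all choices made here for illustration (practical Muon implementations typically approximate the SVD step with Newton-Schulz iterations).

```python
import numpy as np


def steepest_direction(m, norm):
    """Return a unit-norm direction d maximizing <m, d> under the given norm.

    Hypothetical helper for illustration; `norm` is one of
    "l2", "linf", or "spectral".
    """
    if norm == "l2":
        # Normalized direction (Frobenius norm when m is a matrix).
        n = np.linalg.norm(m)
        return m / n if n > 0 else np.zeros_like(m)
    if norm == "linf":
        # Coordinate-wise sign, as in sign-based methods.
        return np.sign(m)
    if norm == "spectral":
        # Orthogonalize via an exact SVD: m = U S V^T  ->  U V^T.
        u, _, vt = np.linalg.svd(m, full_matrices=False)
        return u @ vt
    raise ValueError(f"unknown norm: {norm}")


def momentum_steepest_step(w, grad, m, lr, beta, norm):
    """One step: update the momentum buffer, then move along the
    steepest descent direction of the buffer (assumed EMA momentum)."""
    m = beta * m + (1.0 - beta) * grad
    w = w - lr * steepest_direction(m, norm)
    return w, m


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((3, 4))
    m = np.zeros_like(w)
    grad = rng.standard_normal((3, 4))  # stand-in for a loss gradient
    for norm in ("l2", "linf", "spectral"):
        w_new, _ = momentum_steepest_step(w, grad, m, lr=0.1, beta=0.9, norm=norm)
        print(norm, np.linalg.norm(w_new - w))
```

With `norm="linf"` the step is a Signum-style sign update, with `norm="l2"` it is a normalized momentum gradient step, and with `norm="spectral"` it is a Muon-style orthogonalized update on a weight matrix. In each case the inner product between the buffer and the returned direction equals the corresponding dual norm of the buffer ($\ell_2$, $\ell_1$, and nuclear norm, respectively).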

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
math.OC 2026-05 unverdicted novelty 6.0

Establishes matching lower and upper oracle complexity bounds for scale-invariant methods with spectral norm under heavy-tailed noise, plus improved rates with higher-order smoothness, and practical tests on neural networks.

Convergence of Spectral Descent for Non-smooth Optimization
cs.LG 2026-05 unverdicted novelty 5.0

Proves linear convergence of Spectral Descent (SD) and Truncated SD for non-smooth convex problems under stated conditions, sublinear rates for regularized versions via Frank-Wolfe, and recovery guarantees for robust ...