A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
Provable complexity improvement of adagrad over sgd: Upper and lower bounds in stochastic non-convex optimization
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.
citing papers explorer
-
A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo
A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
-
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
-
Optimal Projection-Free Adaptive SGD for Matrix Optimization
Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.