3 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED 3
Roles: background 1
Polarities: background 1
Citing papers
- Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles
  SGD, approximations of Newton's method, natural gradient descent, and Adam are proven compatible with evolutionary dynamics when augmented with DLS noise, turning them into valid in silico simulations of asexual Darwinian evolution.
- Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization
  Adam's adaptive preconditioning and first-moment averaging improve high-probability tracking error in noise-dominated nonstationary regimes but can increase it under strong drift, where SGD achieves a smaller floor, with explicit beta-dependent bounds.
- Muon is Scalable for LLM Training
  Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.