Kakade , booktitle=

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson · 2025

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles

cs.NE · 2026-05-06 · unverdicted · novelty 7.0 · 2 refs

SGD, approximations of Newton's method, natural gradient descent, and Adam are proven compatible with evolutionary dynamics when augmented with DLS noise, turning them into valid in silico simulations of asexual Darwinian evolution.

Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

stat.ML · 2026-05-05 · unverdicted · novelty 6.0

Adam's adaptive preconditioning and first-moment averaging improve high-probability tracking error in noise-dominated nonstationary regimes but can increase it under strong drift, where SGD achieves a smaller floor, with explicit beta-dependent bounds.

Muon is Scalable for LLM Training

cs.LG · 2025-02-24 · unverdicted · novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

citing papers explorer

Showing 3 of 3 citing papers.

Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles

cs.NE · 2026-05-06 · unverdicted · none · ref 43 · 2 links

Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

stat.ML · 2026-05-05 · unverdicted · none · ref 16

Muon is Scalable for LLM Training

cs.LG · 2025-02-24 · unverdicted · none · ref 83
