{"total":10,"items":[{"citing_arxiv_id":"2605.16622","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Does Weight Decay Enhance Training Stability?","primary_cat":"cs.LG","submitted_at":"2026-05-15T20:43:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Weight decay slows progressive sharpening at the edge of stability, inducing damped oscillations in CNNs and a phase transition to sub-2/η sharpness in MLPs driven by parameter-sharpness gradient alignment, yielding more stable NTK dynamics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14200","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization","primary_cat":"cs.LG","submitted_at":"2026-05-13T23:32:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09552","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Phases of Muon: When Muon Eclipses SignSGD","primary_cat":"math.OC","submitted_at":"2026-05-10T14:11:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent 
β.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"relationships between diagonal and non-diagonal optimization methods. Future work includes extending the analysis to incorporate momentum, studying power-law random features models, and deriving explicit compute-optimal neural scaling laws. In particular, our work does not capture the effects of feature learning, or non-linear dynamical effects like the edge-of-stability [16, 17] which are important in practice. Incorporating these effects is especially important in scenarios where the different implicit biases of Muon and Adam can lead to differences in generalization. The results in the anisotropic case suggest that there is an important interplay between the batch size, data distribution, and the ability to usefully apply non-diagonal"},{"citing_arxiv_id":"2605.07870","ref_index":26,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer","primary_cat":"cond-mat.dis-nn","submitted_at":"2026-05-08T15:28:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-level DMFT tracks bulk and outlier spectral dynamics in wide networks, predicting width-consistent outlier growth and hyperparameter transfer under muP scaling for deep linear nets while noting bulk restructuring for large-output tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kyunghyun Cho*, and Krzysztof Geras*. The break-even point on optimization trajectories of deep neural networks. In International Conference on Learning Representations, 2020. [25] Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021. 
[26] Jeremy M. Cohen, B. Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David E. Cardoze, Zachary Nado, George E. Dahl, and Justin Gilmer. Adaptive gradient methods at the edge of stability. ArXiv, abs/2207.14484, 2022. [27] Arseniy Andreyev and Pierfrancesco Beneventano. Edge of stochastic stability: Revisiting the"},{"citing_arxiv_id":"2605.06821","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Rod Flow Model for Adam at the Edge of Stability","primary_cat":"cs.LG","submitted_at":"2026-05-07T18:21:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Rod flow models for Adam and related optimizers track discrete iterates at the edge of stability more accurately than standard stable flows across tested ML architectures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14669","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Zeroth-Order Optimization at the Edge of Stability","primary_cat":"cs.LG","submitted_at":"2026-04-16T06:23:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Zeroth-order methods achieve mean-square stability when the step size satisfies a condition involving the entire Hessian spectrum, with full-batch ZO optimizers operating at the edge of stability and large steps regularizing the Hessian trace.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14108","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Momentum Further Constrains Sharpness at the Edge of Stochastic 
Stability","primary_cat":"cs.LG","submitted_at":"2026-04-15T17:28:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"At each iteration, a mini-batch B_t of size b is sampled uniformly at random, and the stochastic gradient is g_t = (1/b) Σ_{i∈B_t} ∇_θ ℓ(θ_t; x_i). (4) We use the heavy-ball (HB) momentum formulation standard in deep learning libraries, rather than an EMA-style update. The two algorithms considered are: • SGD with Polyak Momentum (HB or SGDM): v_{t+1} = β v_t + g_t, θ_{t+1} = θ_t − η v_{t+1}, (5) with momentum β ∈ [0,1), learning rate η > 0, and v_0 = 0. • SGD with Nesterov Acceleration (NAG or SGDN): v_{t+1} = β v_t + g_t(θ_t − β η v_t), θ_{t+1} = θ_t − η v_{t+1}. (6) Let L_B(θ) = (1/|B|) Σ_{i∈B} ℓ(θ; x_i) be the mini-batch loss for a batch B ⊆ D of size b drawn from the mini-batch sampling distribution P_b. Define the mini-batch gradient g_B(θ) = ∇L_B(θ) and mini-batch Hessian H_B(θ) = ∇²L_B(θ). Footnote 2: Empirically, even when η_eff is matched so that curvature statistics stabilize similarly, SGD and SGDM remain separated in parameter/function space; see Appendix F. Figure 3. Stabilization levels of Batch Sharpness and λ_max across varying batch sizes for an MLP trained with SGDM (top) and SGDN (bottom) at η = 0.005 and β = 0.9. The critical batch size, defined heuristically as the threshold at which training dynamics enter the large-batch regime, is marked for each optimizer. Notably, SGDN reaches this regime at a batch size almost an order of magnitude smaller than SGDM. 2.2. 
The Value of Momentum The added value of momentum. Polyak heavy-ball momentum and Nesterov acceleration are ubiquitous in modern deep learning, often as explicit buffers (SGDM/SGDN) or implicitly inside adaptive methods, and are frequently key to fast and stable training in practice (Krizhevsky et al., 2012; Sutskever et al., 2013; Gitman et al., 2019; Fu et al., 2023). A large body of work has proposed complementary explanations for why momentum helps (see Appendix A for further related work): (i) stability enlargement / effective step-size rescaling, where momentum can enlarge the usable learning-rate range, and in some regimes SGDM can be r"},{"citing_arxiv_id":"2505.24275","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GradPower: Powering Gradients for Faster Language Model Pre-Training","primary_cat":"cs.LG","submitted_at":"2025-05-30T06:49:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GradPower applies sign-power to gradients before optimization and achieves lower terminal loss in language model pre-training across architectures, scales, datasets, and schedules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.13196","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Physics-Inspired Optimizer: Velocity Regularized Adam","primary_cat":"cs.LG","submitted_at":"2025-05-19T14:51:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.14048","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language 
Models","primary_cat":"cs.LG","submitted_at":"2023-06-24T20:11:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Cheung, Simon JD Prince, and Yanshuai Cao. Optimizing deeper transformers on small datasets. arXiv preprint arXiv:2012.15355, 2020. [88] Chen Zhu, Renkun Ni, Zheng Xu, Kezhi Kong, W Ronny Huang, and Tom Goldstein. Gradinit: Learning to initialize neural networks for stable and efficient training. Advances in Neural Information Processing Systems, 34:16410-16422, 2021. [89] Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484, 2022. [90] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. Deepnet: Scaling transformers to 1,000 layers."}],"limit":50,"offset":0}