Anytime Training with Schedule-Free Spectral Optimization

Anuj Apte; Junhyung Lyle Kim; Niraj Kumar; Pranav Deshpande; Shouvanik Chakrabarti

arxiv: 2605.23061 · v1 · pith:CH6ZJZ5Unew · submitted 2026-05-21 · 💻 cs.LG · cs.AI· math.OC· stat.ML

Anytime Training with Schedule-Free Spectral Optimization

Anuj Apte , Pranav Deshpande , Niraj Kumar , Shouvanik Chakrabarti , Junhyung Lyle Kim This is my paper

Pith reviewed 2026-05-25 05:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OCstat.ML

keywords schedule-free optimizationspectral optimizerlanguage model traininganytime trainingweight decayAdamWChinchilla horizonsneural network optimizers

0 comments

The pith

Schedule-free spectral optimization matches tuned AdamW on 125M and 772M parameter models using one hyperparameter configuration across 1-8x Chinchilla horizons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard neural network training ties learning-rate schedules to a fixed horizon, creating path dependence and forcing costly re-tuning when data volume changes. Schedule-free methods remove explicit schedules to address this, but prior versions like SF-AdamW still fall short of well-tuned AdamW. The paper introduces SF-NorMuon, a schedule-free spectral optimizer that closes the gap by matching or exceeding tuned AdamW performance on 125M and 772M parameter language models with a single hyperparameter setup. This holds across training lengths from 1x to 8x Chinchilla horizons. The approach lets practitioners obtain high-quality checkpoints at any point without committing to a total horizon in advance.

Core claim

SF-NorMuon is a schedule-free spectral optimizer that, with a single hyperparameter configuration, matches or exceeds tuned AdamW on 125M and 772M parameter language models across 1 to 8 times Chinchilla horizons. The work proves a stationarity guarantee for schedule-free spectral dynamics and shows that weight decay applied at the fast iterate is essential for long-horizon stability. This removes the need to commit to a training horizon upfront while preserving competitive performance.

What carries the argument

SF-NorMuon, a schedule-free spectral optimizer that applies weight decay at the fast iterate within schedule-free spectral dynamics.

Load-bearing premise

Weight decay applied at the fast iterate is essential for maintaining long-horizon stability in schedule-free spectral dynamics.

What would settle it

Training runs of SF-NorMuon on 772M parameter models over 8x Chinchilla horizons that exhibit instability or performance drops when weight decay is removed from the fast iterate.

Figures

Figures reproduced from arXiv: 2605.23061 by Anuj Apte, Junhyung Lyle Kim, Niraj Kumar, Pranav Deshpande, Shouvanik Chakrabarti.

**Figure 1.** Figure 1: Comparison of SF-NorMuon (this work), SF-AdamW, and tuned AdamW baselines for LLaMA-2-style transformers trained on FineWeb-100B. Left: 125M model with AdamW learning rate tuned per horizon with cosine schedule. Right: 772M model with a single optimized AdamW configuration across horizons. Dashed lines highlight that learning rate schedule for a long horizon is sub-optimal for a smaller token budget. The g… view at source ↗

**Figure 2.** Figure 2: Weight decay strategies for schedule-free optimizers. Left: SF-AdamW with no decay (orange) diverges; decay at Y (blue) exhibits the best performance up to ∼30B tokens (c.f., [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Importance of explicit momentum and row-wise normalization for SF-NorMuon. Left: Ablating explicit momentum (µ = 0) significantly degrades performance, with final loss increasing from 3.14 to 3.28. This validates the importance of smoothing the gradient before computing the polar factor, as discussed in Section 2. Right: Ablating row-wise normalization leads to a smaller gap, with SF-NorMuon reaching the s… view at source ↗

**Figure 4.** Figure 4: Quasi steady-state analysis for SF-NorMuon with decay at Z (averaged over layers). Left: The ratio |ρt+1 − ρt|/ρt remains below 1% after warmup, validating the quasi steady-state hypothesis (Hypothesis 3.1). Right: Alignment αt (purple) and RMS(Z) (red) over training. The theoretical prediction λρt = 0.1[−αt + p α 2 t + 2ηλ] from Lemma 3.2 (blue dashed) closely tracks the observed values. This training run… view at source ↗

**Figure 5.** Figure 5: Learning rate sweep comparison between SF-AdamW and SF-NorMuon. Improvement across learning rates. A key practical advantage of SF-NorMuon is robustness to the choice of learning rate [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Learning rate sweep for AdamW with cosine scheduling on the 125M model across four training [PITH_FULL_IMAGE:figures/full_fig_p035_6.png] view at source ↗

**Figure 7.** Figure 7: Optimizer comparison on the 125M model. Scheduled methods (NorMuon, AdamW) are shown with [PITH_FULL_IMAGE:figures/full_fig_p036_7.png] view at source ↗

read the original abstract

Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SF-NorMuon delivers schedule-free training that matches tuned AdamW across horizons on 125M and 772M models with one fixed hyperparameter set.

read the letter

SF-NorMuon is the main takeaway: the paper introduces a schedule-free spectral optimizer that, using a single hyperparameter configuration, matches or exceeds horizon-tuned AdamW on 125M and 772M parameter language models over 1x to 8x Chinchilla horizons. It also supplies a stationarity proof for the underlying dynamics and flags weight decay on the fast iterate as important for long-horizon stability. This directly targets the re-tuning burden that comes with fixed training horizons. The combination of schedule-free methods with spectral optimization into a named method is the concrete new piece, and it improves on the noted shortcomings of SF-AdamW. The empirical results on two model sizes and multiple horizons stand out as the practical strength, showing that high-quality checkpoints can be obtained without committing to a horizon in advance. The theory adds grounding by linking the dynamics to stationarity and isolating one role for weight decay. Soft spots are limited but real. The abstract gives little on ablations, implementation specifics for the spectral component, or error analysis, so it is not yet clear how much this differs from prior momentum or spectral baselines or how sensitive the matching is to other training choices. The stationarity result and the weight-decay condition would need checking in the full derivation to see whether they are tight or mainly supportive. The tested scales are reasonable but not frontier. This paper is for ML practitioners and engineers who train language models under variable data or compute budgets and want to cut down on horizon-specific tuning. Readers working on anytime or continual training setups would find the empirical claim useful. It deserves peer review because the central performance claim is concrete and testable, the theory is present, and the problem it addresses is common in real workflows.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes SF-NorMuon, a schedule-free spectral optimizer, claiming that a single hyperparameter configuration matches or exceeds per-horizon-tuned AdamW on 125M and 772M parameter language models across 1--8× Chinchilla horizons. It further states a stationarity guarantee for schedule-free spectral dynamics and identifies weight decay applied at the fast iterate as essential for long-horizon stability, enabling horizon-independent high-quality checkpoints.

Significance. If the empirical matching holds with the reported single-configuration robustness and the stationarity result is non-vacuous, the work would meaningfully advance practical anytime optimization by reducing horizon-dependent retuning costs and supporting continual learning. The theoretical identification of the weight-decay placement provides a concrete design principle that could guide further schedule-free methods.

major comments (2)

[Abstract] Abstract: the central empirical claim (SF-NorMuon matches or exceeds tuned AdamW with one fixed hyperparameter set across two model sizes and multiple horizons) is stated without reference to any table, figure, ablation, or error analysis, preventing assessment of effect size, variance, or whether the result is load-bearing for the 'closes this gap' conclusion.
[Abstract] Abstract (theoretical analysis section): the stationarity guarantee and the claim that weight decay at the fast iterate is 'essential for long-horizon stability' are asserted without any displayed equations, assumptions, or derivation outline, so it is impossible to verify whether the condition is derived or imposed by construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments on the abstract. We respond to each major comment below and indicate where revisions will be made.

read point-by-point responses

Referee: [Abstract] Abstract: the central empirical claim (SF-NorMuon matches or exceeds tuned AdamW with one fixed hyperparameter set across two model sizes and multiple horizons) is stated without reference to any table, figure, ablation, or error analysis, preventing assessment of effect size, variance, or whether the result is load-bearing for the 'closes this gap' conclusion.

Authors: We agree the abstract would be clearer with explicit pointers to the supporting evidence. The performance comparisons (including means and standard deviations over three random seeds) appear in Section 4, Tables 1–2 and Figures 2–4; hyperparameter robustness and weight-decay ablations are in Section 5. In the revision we will add concise parenthetical citations to these results within the abstract. revision: yes
Referee: [Abstract] Abstract (theoretical analysis section): the stationarity guarantee and the claim that weight decay at the fast iterate is 'essential for long-horizon stability' are asserted without any displayed equations, assumptions, or derivation outline, so it is impossible to verify whether the condition is derived or imposed by construction.

Authors: The stationarity guarantee is stated and proved as Theorem 3.1 (with the key assumption of weight decay applied to the fast iterate). The necessity of this placement is shown by a counter-example in Appendix B when weight decay is instead applied to the slow iterate. The abstract summarizes the result at high level; we will add a reference to Theorem 3.1 in the revised abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's primary result is the empirical observation that a single fixed hyperparameter configuration of SF-NorMuon matches or exceeds per-horizon-tuned AdamW on 125M and 772M models across 1-8× Chinchilla horizons. The mentioned stationarity proof and weight-decay-at-fast-iterate condition are presented as supporting theoretical analysis, not as the load-bearing premise whose equations reduce to the performance numbers by construction. No self-definitional steps, fitted-input-called-prediction patterns, or load-bearing self-citation chains appear in the abstract or described material that would equate any claimed prediction to its own inputs. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted or audited.

pith-pipeline@v0.9.0 · 5734 in / 1047 out tokens · 19444 ms · 2026-05-25T05:36:00.727202+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 28 internal anchors

[1]

Springer New York, 2009.isbn: 9780387848587.doi: 10.1007/978-0-387-84858-7.url: http://dx.doi.org/ 10.1007/978-0-387-84858-7

Trevor Hastie, Robert Tibshirani, and Jerome Friedman.The Elements of Statistical Learning. Springer New York, 2009.isbn: 9780387848587.doi: 10.1007/978-0-387-84858-7.url: http://dx.doi.org/ 10.1007/978-0-387-84858-7

work page doi:10.1007/978-0-387-84858-7.url: 2009
[2]

Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep Learning.http://www.deeplearningbook. org. MIT Press, 2016

work page 2016
[3]

Bishop and Hugh Bishop.Deep Learning: Foundations and Concepts

Christopher M. Bishop and Hugh Bishop.Deep Learning: Foundations and Concepts. Springer In- ternational Publishing, 2024.isbn: 9783031454684.doi: 10.1007/978-3-031-45468-4 .url: http: //dx.doi.org/10.1007/978-3-031-45468-4

work page doi:10.1007/978-3-031-45468-4 2024
[4]

Language models are few-shot learners

Tom Brown et al. “Language models are few-shot learners”. In:Advances in neural information processing systems33 (2020), pp. 1877–1901

work page 2020
[5]

Hugo Touvron et al.LLaMA: Open and Efficient Foundation Language Models. 2023. arXiv:2302.13971 [cs.CL].url:https://arxiv.org/abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Wayne Xin Zhao et al.A Survey of Large Language Models. 2023. eprint:arXiv:2303.18223

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Shervin Minaee et al.Large Language Models: A Survey. 2024. eprint:arXiv:2402.06196

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Anthony Brohan et al.RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. 2023. eprint:arXiv:2307.15818

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Yueen Ma et al.A Survey on Vision-Language-Action Models for Embodied AI. 2024. eprint:arXiv: 2405.14093

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Moo Jin Kim et al.OpenVLA: An Open-Source Vision-Language-Action Model. 2024. eprint:arXiv: 2406.09246

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Continual lifelong learning with neural networks: A review

German I. Parisi et al. “Continual lifelong learning with neural networks: A review”. In:Neural Networks113 (May 2019), pp. 54–71.issn: 0893-6080.doi: 10.1016/j.neunet.2019.01.012 .url: http://dx.doi.org/10.1016/j.neunet.2019.01.012

work page doi:10.1016/j.neunet.2019.01.012 2019
[12]

Liyuan Wang et al.A Comprehensive Survey of Continual Learning: Theory, Method and Application

work page
[13]

arXiv:2302.00487 [cs.LG].url:https://arxiv.org/abs/2302.00487

work page arXiv
[14]

Overcoming catastrophic forgetting in neural networks

James Kirkpatrick et al. “Overcoming catastrophic forgetting in neural networks”. In:Proceedings of the National Academy of Sciences114.13 (Mar. 2017), pp. 3521–3526.issn: 1091-6490.doi:10.1073/pnas. 1611835114.url:http://dx.doi.org/10.1073/pnas.1611835114

work page doi:10.1073/pnas 2017
[15]

doi: 10.1109/tpami.2021.3057446

Matthias Delange et al. “A continual learning survey: Defying forgetting in classification tasks”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence(2021).issn: 1939-3539.doi: 10. 1109/tpami.2021.3057446.url:http://dx.doi.org/10.1109/TPAMI.2021.3057446

work page doi:10.1109/tpami.2021.3057446 2021
[16]

Step-size optimization for continual learning

Thomas Degris et al. “Step-size optimization for continual learning”. In:arXiv preprint arXiv:2401.17401 (2024)

work page arXiv 2024
[17]

Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging

Alexandru Meterez et al. “Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging”. In:arXiv preprint arXiv:2602.03702(2026)

work page arXiv 2026
[18]

The road less scheduled

Aaron Defazio et al. “The road less scheduled”. In:Advances in Neural Information Processing Systems 37 (2024), pp. 9974–10007

work page 2024
[19]

How far away are truly hyperparameter-free learning algorithms?

Priya Kasimbeg et al. “How far away are truly hyperparameter-free learning algorithms?” In:arXiv preprint arXiv:2505.24005(2025)

work page arXiv 2025
[20]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. “SGDR: Stochastic Gradient Descent with Warm Restarts”. In: International Conference on Learning Representations. 2017.url:https://openreview.net/forum? id=Skq89Scxx

work page 2017
[21]

Jordan Hoffmann et al.Training Compute-Optimal Large Language Models. 2022. arXiv:2203.15556 [cs.CL].url:https://arxiv.org/abs/2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

Shengding Hu et al.MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. 2024. arXiv:2404.06395 [cs.CL].url:https://arxiv.org/abs/2404.06395. 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

A method for solving the convex programming problem with convergence rate O (1/k2)

Yurii Nesterov. “A method for solving the convex programming problem with convergence rate O (1/k2)”. In:Dokl akad nauk Sssr. Vol. 269. 1983, p. 543

work page 1983
[24]

An optimal method for stochastic composite optimization

Guanghui Lan. “An optimal method for stochastic composite optimization”. In:Mathematical Program- ming133.1 (2012), pp. 365–397

work page 2012
[25]

Alexander Hägele et al.Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

work page
[26]

arXiv:2405.18392 [cs.LG].url:https://arxiv.org/abs/2405.18392

work page arXiv
[27]

Yunshui Li et al.Model Merging in Pre-training of Large Language Models. 2025. arXiv:2505.12082 [cs.CL].url:https://arxiv.org/abs/2505.12082

work page arXiv 2025
[28]

Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov et al. “Averaging weights leads to wider optima and better generalization”. In:arXiv preprint arXiv:1803.05407(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[29]

Accelerating neural network training: An analysis of the AlgoPerf competition

Priya Kasimbeg et al. “Accelerating neural network training: An analysis of the AlgoPerf competition”. In: The Thirteenth International Conference on Learning Representations. 2025.url:https://openreview. net/forum?id=CtM5xjRSfm

work page 2025
[30]

Scaling laws and compute-optimal training beyond fixed training durations

Alexander Hägele et al. “Scaling laws and compute-optimal training beyond fixed training durations”. In:Advances in Neural Information Processing Systems37 (2024), pp. 76232–76264

work page 2024
[31]

Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

Minhak Song et al. “Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training”. In:High-dimensional Learning Dynamics 2025. 2025.url: https : / / openreview.net/forum?id=b5HYeRzG9M

work page 2025
[32]

2026.url:https://openreview.net/forum?id=Jw7khYzYzl

Andrei Semenov, Matteo Pagliardini, and Martin Jaggi.Benchmarking Optimizers for Large Language Model Pretraining. 2026.url:https://openreview.net/forum?id=Jw7khYzYzl

work page 2026
[33]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In:International Conference on Learning Representations (ICLR). 2015.url:https://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2015
[34]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. In:International Confer- ence on Learning Representations. 2019.url:https://openreview.net/forum?id=Bkg6RiCqY7

work page 2019
[35]

github.io/posts/muon/

Keller Jordan et al.Muon: An optimizer for hidden layers in neural networks.https://kellerjordan. github.io/posts/muon/. Accessed: 2026-01-25. 2024

work page 2026
[36]

GitHub repository, master branch

Keller Jordan et al.Muon (GitHub repository): An optimizer for hidden layers in neural networks. GitHub repository, master branch. 2024.url:https://github.com/KellerJordan/Muon (visited on 01/25/2026)

work page 2024
[37]

Chongjie Si, Debing Zhang, and Wei Shen.AdaMuon: Adaptive Muon Optimizer. 2025. arXiv:2507. 11005 [cs.LG].url:https://arxiv.org/abs/2507.11005

work page arXiv 2025
[38]

Zichong Li et al.NorMuon: Making Muon more efficient and scalable. 2025. arXiv:2510.05491 [cs.LG]. url:https://arxiv.org/abs/2510.05491

work page arXiv 2025
[39]

Dion: Distributed Orthonormalized Updates

Kwangjun Ahn et al. “Dion: Distributed Orthonormalized Updates”. In:arXiv preprint: 2504.05295 (2025)

work page arXiv 2025
[40]

Kwangjun Ahn, Noah Amsel, and John Langford.Dion2: A Simple Method to Shrink Matrix in Muon

work page
[41]

arXiv:2512.16928 [cs.LG].url:https://arxiv.org/abs/2512.16928

work page arXiv
[42]

Kang An et al.ASGO: Adaptive Structured Gradient Optimization. 2025. arXiv:2503.20762 [cs.LG]. url:https://arxiv.org/abs/2503.20762

work page arXiv 2025
[43]

Liliang Ren et al.Rethinking Language Model Scaling under Transferable Hypersphere Optimization

work page
[44]

arXiv:2603.28743 [cs.LG].url:https://arxiv.org/abs/2603.28743

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Noah Amsel et al.The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm. 2026. arXiv:2505.16932 [cs.LG].url:https://arxiv.org/abs/2505.16932

work page internal anchor Pith review Pith/arXiv arXiv 2026
[46]

Ziyue Liu et al.Muon2: Boosting Muon via Adaptive Second-Moment Preconditioning. 2026. arXiv: 2604.09967 [cs.LG].url:https://arxiv.org/abs/2604.09967

work page internal anchor Pith review Pith/arXiv arXiv 2026
[47]

Ahmed Khaled et al.MuonBP: Faster Muon via Block-Periodic Orthogonalization. 2025. arXiv:2510. 16981 [cs.LG].url:https://arxiv.org/abs/2510.16981. 13

work page arXiv 2025
[48]

Alexey Kravatskiy et al.The Ky Fan Norms and Beyond: Dual Norms and Combinations for Matrix Optimization. 2025. arXiv:2512.09678 [math.OC].url:https://arxiv.org/abs/2512.09678

work page arXiv 2025
[49]

Jingyuan Liu et al.Muon is Scalable for LLM Training. 2025. arXiv:2502 . 16982 [cs.LG].url: https://arxiv.org/abs/2502.16982

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Kimi Team.Kimi K2.5: Visual Agentic Intelligence. 2026. arXiv:2602.02276 [cs.CL].url: https: //arxiv.org/abs/2602.02276

work page internal anchor Pith review Pith/arXiv arXiv 2026
[51]

GLM-5-Team.GLM-5: from Vibe Coding to Agentic Engineering. 2026. arXiv:2602.15763 [cs.LG]. url:https://arxiv.org/abs/2602.15763

work page internal anchor Pith review Pith/arXiv arXiv 2026
[52]

Varun Singh et al.Arcee Trinity Large Technical Report. 2026. arXiv:2602.17004 [cs.LG] .url: https://arxiv.org/abs/2602.17004

work page arXiv 2026
[53]

Levent Sagun et al.Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. 2017. eprint:arXiv:1706.04454

work page internal anchor Pith review Pith/arXiv arXiv 2017
[54]

An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. “An Investigation into Neural Net Optimization via Hessian Eigenvalue Density”. In:Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 2232–2241.url: https : / / proceed...

work page 2019
[55]

When do spectral gradient updates help in deep learning?arXiv preprint arXiv:2512.04299,

Damek Davis and Dmitriy Drusvyatskiy.When do spectral gradient updates help in deep learning?2026. arXiv:2512.04299 [cs.LG].url:https://arxiv.org/abs/2512.04299

work page arXiv 2026
[56]

The surprising agreement between convex optimization theory and learning-rate scheduling for large model training

Fabian Schaipp et al. “The surprising agreement between convex optimization theory and learning-rate scheduling for large model training”. In:arXiv preprint arXiv:2501.18965(2025)

work page arXiv 2025
[57]

A simple weight decay can improve generalization

Anders Krogh and John A. Hertz. “A simple weight decay can improve generalization”. In:Proceedings of the 5th International Conference on Neural Information Processing Systems. NIPS’91. Denver, Colorado: Morgan Kaufmann Publishers Inc., 1991, pp. 950–957.isbn: 1558602224

work page 1991
[58]

arXiv: 2310.04415 [cs.LG].url:https://arxiv.org/abs/2310.04415

Francesco D’Angelo et al.Why Do We Need Weight Decay in Modern Deep Learning?2024. arXiv: 2310.04415 [cs.LG].url:https://arxiv.org/abs/2310.04415

work page arXiv 2024
[59]

Shikai Qiu et al.Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales. 2026. arXiv:2512.05620 [cs.LG].url:https://arxiv.org/abs/2512.05620

work page arXiv 2026
[60]

Da Chang, Yongxiang Liu, and Ganzhao Yuan.On the Convergence of Muon and Beyond. 2026. arXiv: 2509.15816 [cs.LG].url:https://arxiv.org/abs/2509.15816

work page internal anchor Pith review Pith/arXiv arXiv 2026
[61]

Wei Shen et al.On the Convergence Analysis of Muon. 2026. arXiv:2505.23737 [stat.ML] .url: https://arxiv.org/abs/2505.23737

work page internal anchor Pith review Pith/arXiv arXiv 2026
[62]

Naoki Sato, Hiroki Naganuma, and Hideaki Iiduka.Convergence Bound and Critical Batch Size of Muon Optimizer. 2025. arXiv:2507.01598 [cs.LG].url:https://arxiv.org/abs/2507.01598

work page arXiv 2025
[63]

Jiaxiang Li and Mingyi Hong.A Note on the Convergence of Muon. 2025. arXiv:2502.02900 [math.OC]. url:https://arxiv.org/abs/2502.02900

work page arXiv 2025
[64]

Jeremy Bernstein and Laker Newhouse.Old Optimizer, New Norm: An Anthology. 2024. arXiv:2409. 20325 [cs.LG].url:https://arxiv.org/abs/2409.20325

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

Thomas Pethick et al.Training Deep Learning Models with Norm-Constrained LMOs. 2025. arXiv: 2502.07529 [cs.LG].url:https://arxiv.org/abs/2502.07529

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

Symbolic discovery of optimization algorithms

Xiangning Chen et al. “Symbolic discovery of optimization algorithms”. In:Advances in neural informa- tion processing systems36 (2023), pp. 49205–49233

work page 2023
[67]

Vineet Gupta, Tomer Koren, and Yoram Singer.Shampoo: Preconditioned Stochastic Tensor Optimiza- tion. 2018. arXiv:1802.09568 [cs.LG].url:https://arxiv.org/abs/1802.09568

work page internal anchor Pith review Pith/arXiv arXiv 2018
[68]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal et al. “Accurate, large minibatch sgd: Training imagenet in 1 hour”. In:arXiv preprint arXiv:1706.02677(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[69]

Why warmup the learning rate? underlying mechanisms and improvements

Dayal Singh Kalra and Maissam Barkeshli. “Why warmup the learning rate? underlying mechanisms and improvements”. In:Advances in Neural Information Processing Systems37 (2024), pp. 111760–111801. 14

work page 2024
[70]

Learning-rate-free learning by d-adaptation

Aaron Defazio and Konstantin Mishchenko. “Learning-rate-free learning by d-adaptation”. In:Interna- tional Conference on Machine Learning. PMLR. 2023, pp. 7449–7479

work page 2023
[71]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer et al. “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer”. In:International Conference on Learning Representations. 2017

work page 2017
[72]

GShard: Scaling giant models with conditional computation and automatic sharding

Dmitry Lepikhin et al. “GShard: Scaling giant models with conditional computation and automatic sharding”. In:International Conference on Learning Representations. 2021

work page 2021
[73]

Switch Transformers: scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. “Switch Transformers: scaling to trillion parameter models with simple and efficient sparsity”. In:Journal of Machine Learning Research23.120 (2022), pp. 1–39

work page 2022
[74]

Twan van Laarhoven.L2 Regularization versus Batch and Weight Normalization. 2017. arXiv:1706. 05350 [cs.LG].url:https://arxiv.org/abs/1706.05350

work page internal anchor Pith review Pith/arXiv arXiv 2017
[75]

Aaron Defazio.Why Gradients Rapidly Increase Near the End of Training. 2025. arXiv:2506.02285 [cs.LG].url:https://arxiv.org/abs/2506.02285

work page arXiv 2025
[76]

Hugo Touvron et al.Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. arXiv:2307.09288 [cs.CL].url:https://arxiv.org/abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023
[77]

Jianlin Su et al.RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023. arXiv: 2104.09864 [cs.CL].url:https://arxiv.org/abs/2104.09864

work page internal anchor Pith review Pith/arXiv arXiv 2023
[78]

Biao Zhang and Rico Sennrich.Root Mean Square Layer Normalization. 2019. arXiv:1910.07467 [cs.LG].url:https://arxiv.org/abs/1910.07467

work page internal anchor Pith review Pith/arXiv arXiv 2019
[79]

Zhengyan Zhang et al.ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs. 2024. arXiv:2402.03804 [cs.LG].url:https://arxiv.org/abs/2402.03804

work page arXiv 2024
[80]

Andrej Karpathy.NanoGPT.https://github.com/karpathy/nanoGPT. 2022

work page 2022

Showing first 80 references.

[1] [1]

Springer New York, 2009.isbn: 9780387848587.doi: 10.1007/978-0-387-84858-7.url: http://dx.doi.org/ 10.1007/978-0-387-84858-7

Trevor Hastie, Robert Tibshirani, and Jerome Friedman.The Elements of Statistical Learning. Springer New York, 2009.isbn: 9780387848587.doi: 10.1007/978-0-387-84858-7.url: http://dx.doi.org/ 10.1007/978-0-387-84858-7

work page doi:10.1007/978-0-387-84858-7.url: 2009

[2] [2]

Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep Learning.http://www.deeplearningbook. org. MIT Press, 2016

work page 2016

[3] [3]

Bishop and Hugh Bishop.Deep Learning: Foundations and Concepts

Christopher M. Bishop and Hugh Bishop.Deep Learning: Foundations and Concepts. Springer In- ternational Publishing, 2024.isbn: 9783031454684.doi: 10.1007/978-3-031-45468-4 .url: http: //dx.doi.org/10.1007/978-3-031-45468-4

work page doi:10.1007/978-3-031-45468-4 2024

[4] [4]

Language models are few-shot learners

Tom Brown et al. “Language models are few-shot learners”. In:Advances in neural information processing systems33 (2020), pp. 1877–1901

work page 2020

[5] [5]

Hugo Touvron et al.LLaMA: Open and Efficient Foundation Language Models. 2023. arXiv:2302.13971 [cs.CL].url:https://arxiv.org/abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Wayne Xin Zhao et al.A Survey of Large Language Models. 2023. eprint:arXiv:2303.18223

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

Shervin Minaee et al.Large Language Models: A Survey. 2024. eprint:arXiv:2402.06196

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Anthony Brohan et al.RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. 2023. eprint:arXiv:2307.15818

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Yueen Ma et al.A Survey on Vision-Language-Action Models for Embodied AI. 2024. eprint:arXiv: 2405.14093

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Moo Jin Kim et al.OpenVLA: An Open-Source Vision-Language-Action Model. 2024. eprint:arXiv: 2406.09246

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Continual lifelong learning with neural networks: A review

German I. Parisi et al. “Continual lifelong learning with neural networks: A review”. In:Neural Networks113 (May 2019), pp. 54–71.issn: 0893-6080.doi: 10.1016/j.neunet.2019.01.012 .url: http://dx.doi.org/10.1016/j.neunet.2019.01.012

work page doi:10.1016/j.neunet.2019.01.012 2019

[12] [12]

Liyuan Wang et al.A Comprehensive Survey of Continual Learning: Theory, Method and Application

work page

[13] [13]

arXiv:2302.00487 [cs.LG].url:https://arxiv.org/abs/2302.00487

work page arXiv

[14] [14]

Overcoming catastrophic forgetting in neural networks

James Kirkpatrick et al. “Overcoming catastrophic forgetting in neural networks”. In:Proceedings of the National Academy of Sciences114.13 (Mar. 2017), pp. 3521–3526.issn: 1091-6490.doi:10.1073/pnas. 1611835114.url:http://dx.doi.org/10.1073/pnas.1611835114

work page doi:10.1073/pnas 2017

[15] [15]

doi: 10.1109/tpami.2021.3057446

Matthias Delange et al. “A continual learning survey: Defying forgetting in classification tasks”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence(2021).issn: 1939-3539.doi: 10. 1109/tpami.2021.3057446.url:http://dx.doi.org/10.1109/TPAMI.2021.3057446

work page doi:10.1109/tpami.2021.3057446 2021

[16] [16]

Step-size optimization for continual learning

Thomas Degris et al. “Step-size optimization for continual learning”. In:arXiv preprint arXiv:2401.17401 (2024)

work page arXiv 2024

[17] [17]

Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging

Alexandru Meterez et al. “Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging”. In:arXiv preprint arXiv:2602.03702(2026)

work page arXiv 2026

[18] [18]

The road less scheduled

Aaron Defazio et al. “The road less scheduled”. In:Advances in Neural Information Processing Systems 37 (2024), pp. 9974–10007

work page 2024

[19] [19]

How far away are truly hyperparameter-free learning algorithms?

Priya Kasimbeg et al. “How far away are truly hyperparameter-free learning algorithms?” In:arXiv preprint arXiv:2505.24005(2025)

work page arXiv 2025

[20] [20]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. “SGDR: Stochastic Gradient Descent with Warm Restarts”. In: International Conference on Learning Representations. 2017.url:https://openreview.net/forum? id=Skq89Scxx

work page 2017

[21] [21]

Jordan Hoffmann et al.Training Compute-Optimal Large Language Models. 2022. arXiv:2203.15556 [cs.CL].url:https://arxiv.org/abs/2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022

[22] [22]

Shengding Hu et al.MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. 2024. arXiv:2404.06395 [cs.CL].url:https://arxiv.org/abs/2404.06395. 12

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

A method for solving the convex programming problem with convergence rate O (1/k2)

Yurii Nesterov. “A method for solving the convex programming problem with convergence rate O (1/k2)”. In:Dokl akad nauk Sssr. Vol. 269. 1983, p. 543

work page 1983

[24] [24]

An optimal method for stochastic composite optimization

Guanghui Lan. “An optimal method for stochastic composite optimization”. In:Mathematical Program- ming133.1 (2012), pp. 365–397

work page 2012

[25] [25]

Alexander Hägele et al.Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

work page

[26] [26]

arXiv:2405.18392 [cs.LG].url:https://arxiv.org/abs/2405.18392

work page arXiv

[27] [27]

Yunshui Li et al.Model Merging in Pre-training of Large Language Models. 2025. arXiv:2505.12082 [cs.CL].url:https://arxiv.org/abs/2505.12082

work page arXiv 2025

[28] [28]

Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov et al. “Averaging weights leads to wider optima and better generalization”. In:arXiv preprint arXiv:1803.05407(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[29] [29]

Accelerating neural network training: An analysis of the AlgoPerf competition

Priya Kasimbeg et al. “Accelerating neural network training: An analysis of the AlgoPerf competition”. In: The Thirteenth International Conference on Learning Representations. 2025.url:https://openreview. net/forum?id=CtM5xjRSfm

work page 2025

[30] [30]

Scaling laws and compute-optimal training beyond fixed training durations

Alexander Hägele et al. “Scaling laws and compute-optimal training beyond fixed training durations”. In:Advances in Neural Information Processing Systems37 (2024), pp. 76232–76264

work page 2024

[31] [31]

Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

Minhak Song et al. “Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training”. In:High-dimensional Learning Dynamics 2025. 2025.url: https : / / openreview.net/forum?id=b5HYeRzG9M

work page 2025

[32] [32]

2026.url:https://openreview.net/forum?id=Jw7khYzYzl

Andrei Semenov, Matteo Pagliardini, and Martin Jaggi.Benchmarking Optimizers for Large Language Model Pretraining. 2026.url:https://openreview.net/forum?id=Jw7khYzYzl

work page 2026

[33] [33]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In:International Conference on Learning Representations (ICLR). 2015.url:https://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2015

[34] [34]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. In:International Confer- ence on Learning Representations. 2019.url:https://openreview.net/forum?id=Bkg6RiCqY7

work page 2019

[35] [35]

github.io/posts/muon/

Keller Jordan et al.Muon: An optimizer for hidden layers in neural networks.https://kellerjordan. github.io/posts/muon/. Accessed: 2026-01-25. 2024

work page 2026

[36] [36]

GitHub repository, master branch

Keller Jordan et al.Muon (GitHub repository): An optimizer for hidden layers in neural networks. GitHub repository, master branch. 2024.url:https://github.com/KellerJordan/Muon (visited on 01/25/2026)

work page 2024

[37] [37]

Chongjie Si, Debing Zhang, and Wei Shen.AdaMuon: Adaptive Muon Optimizer. 2025. arXiv:2507. 11005 [cs.LG].url:https://arxiv.org/abs/2507.11005

work page arXiv 2025

[38] [38]

Zichong Li et al.NorMuon: Making Muon more efficient and scalable. 2025. arXiv:2510.05491 [cs.LG]. url:https://arxiv.org/abs/2510.05491

work page arXiv 2025

[39] [39]

Dion: Distributed Orthonormalized Updates

Kwangjun Ahn et al. “Dion: Distributed Orthonormalized Updates”. In:arXiv preprint: 2504.05295 (2025)

work page arXiv 2025

[40] [40]

Kwangjun Ahn, Noah Amsel, and John Langford.Dion2: A Simple Method to Shrink Matrix in Muon

work page

[41] [41]

arXiv:2512.16928 [cs.LG].url:https://arxiv.org/abs/2512.16928

work page arXiv

[42] [42]

Kang An et al.ASGO: Adaptive Structured Gradient Optimization. 2025. arXiv:2503.20762 [cs.LG]. url:https://arxiv.org/abs/2503.20762

work page arXiv 2025

[43] [43]

Liliang Ren et al.Rethinking Language Model Scaling under Transferable Hypersphere Optimization

work page

[44] [44]

arXiv:2603.28743 [cs.LG].url:https://arxiv.org/abs/2603.28743

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

Noah Amsel et al.The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm. 2026. arXiv:2505.16932 [cs.LG].url:https://arxiv.org/abs/2505.16932

work page internal anchor Pith review Pith/arXiv arXiv 2026

[46] [46]

Ziyue Liu et al.Muon2: Boosting Muon via Adaptive Second-Moment Preconditioning. 2026. arXiv: 2604.09967 [cs.LG].url:https://arxiv.org/abs/2604.09967

work page internal anchor Pith review Pith/arXiv arXiv 2026

[47] [47]

Ahmed Khaled et al.MuonBP: Faster Muon via Block-Periodic Orthogonalization. 2025. arXiv:2510. 16981 [cs.LG].url:https://arxiv.org/abs/2510.16981. 13

work page arXiv 2025

[48] [48]

Alexey Kravatskiy et al.The Ky Fan Norms and Beyond: Dual Norms and Combinations for Matrix Optimization. 2025. arXiv:2512.09678 [math.OC].url:https://arxiv.org/abs/2512.09678

work page arXiv 2025

[49] [49]

Jingyuan Liu et al.Muon is Scalable for LLM Training. 2025. arXiv:2502 . 16982 [cs.LG].url: https://arxiv.org/abs/2502.16982

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Kimi Team.Kimi K2.5: Visual Agentic Intelligence. 2026. arXiv:2602.02276 [cs.CL].url: https: //arxiv.org/abs/2602.02276

work page internal anchor Pith review Pith/arXiv arXiv 2026

[51] [51]

GLM-5-Team.GLM-5: from Vibe Coding to Agentic Engineering. 2026. arXiv:2602.15763 [cs.LG]. url:https://arxiv.org/abs/2602.15763

work page internal anchor Pith review Pith/arXiv arXiv 2026

[52] [52]

Varun Singh et al.Arcee Trinity Large Technical Report. 2026. arXiv:2602.17004 [cs.LG] .url: https://arxiv.org/abs/2602.17004

work page arXiv 2026

[53] [53]

Levent Sagun et al.Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. 2017. eprint:arXiv:1706.04454

work page internal anchor Pith review Pith/arXiv arXiv 2017

[54] [54]

An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. “An Investigation into Neural Net Optimization via Hessian Eigenvalue Density”. In:Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 2232–2241.url: https : / / proceed...

work page 2019

[55] [55]

When do spectral gradient updates help in deep learning?arXiv preprint arXiv:2512.04299,

Damek Davis and Dmitriy Drusvyatskiy.When do spectral gradient updates help in deep learning?2026. arXiv:2512.04299 [cs.LG].url:https://arxiv.org/abs/2512.04299

work page arXiv 2026

[56] [56]

The surprising agreement between convex optimization theory and learning-rate scheduling for large model training

Fabian Schaipp et al. “The surprising agreement between convex optimization theory and learning-rate scheduling for large model training”. In:arXiv preprint arXiv:2501.18965(2025)

work page arXiv 2025

[57] [57]

A simple weight decay can improve generalization

Anders Krogh and John A. Hertz. “A simple weight decay can improve generalization”. In:Proceedings of the 5th International Conference on Neural Information Processing Systems. NIPS’91. Denver, Colorado: Morgan Kaufmann Publishers Inc., 1991, pp. 950–957.isbn: 1558602224

work page 1991

[58] [58]

arXiv: 2310.04415 [cs.LG].url:https://arxiv.org/abs/2310.04415

Francesco D’Angelo et al.Why Do We Need Weight Decay in Modern Deep Learning?2024. arXiv: 2310.04415 [cs.LG].url:https://arxiv.org/abs/2310.04415

work page arXiv 2024

[59] [59]

Shikai Qiu et al.Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales. 2026. arXiv:2512.05620 [cs.LG].url:https://arxiv.org/abs/2512.05620

work page arXiv 2026

[60] [60]

Da Chang, Yongxiang Liu, and Ganzhao Yuan.On the Convergence of Muon and Beyond. 2026. arXiv: 2509.15816 [cs.LG].url:https://arxiv.org/abs/2509.15816

work page internal anchor Pith review Pith/arXiv arXiv 2026

[61] [61]

Wei Shen et al.On the Convergence Analysis of Muon. 2026. arXiv:2505.23737 [stat.ML] .url: https://arxiv.org/abs/2505.23737

work page internal anchor Pith review Pith/arXiv arXiv 2026

[62] [62]

Naoki Sato, Hiroki Naganuma, and Hideaki Iiduka.Convergence Bound and Critical Batch Size of Muon Optimizer. 2025. arXiv:2507.01598 [cs.LG].url:https://arxiv.org/abs/2507.01598

work page arXiv 2025

[63] [63]

Jiaxiang Li and Mingyi Hong.A Note on the Convergence of Muon. 2025. arXiv:2502.02900 [math.OC]. url:https://arxiv.org/abs/2502.02900

work page arXiv 2025

[64] [64]

Jeremy Bernstein and Laker Newhouse.Old Optimizer, New Norm: An Anthology. 2024. arXiv:2409. 20325 [cs.LG].url:https://arxiv.org/abs/2409.20325

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [65]

Thomas Pethick et al.Training Deep Learning Models with Norm-Constrained LMOs. 2025. arXiv: 2502.07529 [cs.LG].url:https://arxiv.org/abs/2502.07529

work page internal anchor Pith review Pith/arXiv arXiv 2025

[66] [66]

Symbolic discovery of optimization algorithms

Xiangning Chen et al. “Symbolic discovery of optimization algorithms”. In:Advances in neural informa- tion processing systems36 (2023), pp. 49205–49233

work page 2023

[67] [67]

Vineet Gupta, Tomer Koren, and Yoram Singer.Shampoo: Preconditioned Stochastic Tensor Optimiza- tion. 2018. arXiv:1802.09568 [cs.LG].url:https://arxiv.org/abs/1802.09568

work page internal anchor Pith review Pith/arXiv arXiv 2018

[68] [68]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal et al. “Accurate, large minibatch sgd: Training imagenet in 1 hour”. In:arXiv preprint arXiv:1706.02677(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[69] [69]

Why warmup the learning rate? underlying mechanisms and improvements

Dayal Singh Kalra and Maissam Barkeshli. “Why warmup the learning rate? underlying mechanisms and improvements”. In:Advances in Neural Information Processing Systems37 (2024), pp. 111760–111801. 14

work page 2024

[70] [70]

Learning-rate-free learning by d-adaptation

Aaron Defazio and Konstantin Mishchenko. “Learning-rate-free learning by d-adaptation”. In:Interna- tional Conference on Machine Learning. PMLR. 2023, pp. 7449–7479

work page 2023

[71] [71]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer et al. “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer”. In:International Conference on Learning Representations. 2017

work page 2017

[72] [72]

GShard: Scaling giant models with conditional computation and automatic sharding

Dmitry Lepikhin et al. “GShard: Scaling giant models with conditional computation and automatic sharding”. In:International Conference on Learning Representations. 2021

work page 2021

[73] [73]

Switch Transformers: scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. “Switch Transformers: scaling to trillion parameter models with simple and efficient sparsity”. In:Journal of Machine Learning Research23.120 (2022), pp. 1–39

work page 2022

[74] [74]

Twan van Laarhoven.L2 Regularization versus Batch and Weight Normalization. 2017. arXiv:1706. 05350 [cs.LG].url:https://arxiv.org/abs/1706.05350

work page internal anchor Pith review Pith/arXiv arXiv 2017

[75] [75]

Aaron Defazio.Why Gradients Rapidly Increase Near the End of Training. 2025. arXiv:2506.02285 [cs.LG].url:https://arxiv.org/abs/2506.02285

work page arXiv 2025

[76] [76]

Hugo Touvron et al.Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. arXiv:2307.09288 [cs.CL].url:https://arxiv.org/abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023

[77] [77]

Jianlin Su et al.RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023. arXiv: 2104.09864 [cs.CL].url:https://arxiv.org/abs/2104.09864

work page internal anchor Pith review Pith/arXiv arXiv 2023

[78] [78]

Biao Zhang and Rico Sennrich.Root Mean Square Layer Normalization. 2019. arXiv:1910.07467 [cs.LG].url:https://arxiv.org/abs/1910.07467

work page internal anchor Pith review Pith/arXiv arXiv 2019

[79] [79]

Zhengyan Zhang et al.ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs. 2024. arXiv:2402.03804 [cs.LG].url:https://arxiv.org/abs/2402.03804

work page arXiv 2024

[80] [80]

Andrej Karpathy.NanoGPT.https://github.com/karpathy/nanoGPT. 2022

work page 2022