pith. sign in

arxiv: 2605.23061 · v1 · pith:CH6ZJZ5Unew · submitted 2026-05-21 · 💻 cs.LG · cs.AI· math.OC· stat.ML

Anytime Training with Schedule-Free Spectral Optimization

Pith reviewed 2026-05-25 05:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OCstat.ML
keywords schedule-free optimizationspectral optimizerlanguage model traininganytime trainingweight decayAdamWChinchilla horizonsneural network optimizers
0
0 comments X

The pith

Schedule-free spectral optimization matches tuned AdamW on 125M and 772M parameter models using one hyperparameter configuration across 1-8x Chinchilla horizons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard neural network training ties learning-rate schedules to a fixed horizon, creating path dependence and forcing costly re-tuning when data volume changes. Schedule-free methods remove explicit schedules to address this, but prior versions like SF-AdamW still fall short of well-tuned AdamW. The paper introduces SF-NorMuon, a schedule-free spectral optimizer that closes the gap by matching or exceeding tuned AdamW performance on 125M and 772M parameter language models with a single hyperparameter setup. This holds across training lengths from 1x to 8x Chinchilla horizons. The approach lets practitioners obtain high-quality checkpoints at any point without committing to a total horizon in advance.

Core claim

SF-NorMuon is a schedule-free spectral optimizer that, with a single hyperparameter configuration, matches or exceeds tuned AdamW on 125M and 772M parameter language models across 1 to 8 times Chinchilla horizons. The work proves a stationarity guarantee for schedule-free spectral dynamics and shows that weight decay applied at the fast iterate is essential for long-horizon stability. This removes the need to commit to a training horizon upfront while preserving competitive performance.

What carries the argument

SF-NorMuon, a schedule-free spectral optimizer that applies weight decay at the fast iterate within schedule-free spectral dynamics.

Load-bearing premise

Weight decay applied at the fast iterate is essential for maintaining long-horizon stability in schedule-free spectral dynamics.

What would settle it

Training runs of SF-NorMuon on 772M parameter models over 8x Chinchilla horizons that exhibit instability or performance drops when weight decay is removed from the fast iterate.

Figures

Figures reproduced from arXiv: 2605.23061 by Anuj Apte, Junhyung Lyle Kim, Niraj Kumar, Pranav Deshpande, Shouvanik Chakrabarti.

Figure 1
Figure 1. Figure 1: Comparison of SF-NorMuon (this work), SF-AdamW, and tuned AdamW baselines for LLaMA-2-style transformers trained on FineWeb-100B. Left: 125M model with AdamW learning rate tuned per horizon with cosine schedule. Right: 772M model with a single optimized AdamW configuration across horizons. Dashed lines highlight that learning rate schedule for a long horizon is sub-optimal for a smaller token budget. The g… view at source ↗
Figure 2
Figure 2. Figure 2: Weight decay strategies for schedule-free optimizers. Left: SF-AdamW with no decay (orange) diverges; decay at Y (blue) exhibits the best performance up to ∼30B tokens (c.f., [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Importance of explicit momentum and row-wise normalization for SF-NorMuon. Left: Ablating explicit momentum (µ = 0) significantly degrades performance, with final loss increasing from 3.14 to 3.28. This validates the importance of smoothing the gradient before computing the polar factor, as discussed in Section 2. Right: Ablating row-wise normalization leads to a smaller gap, with SF-NorMuon reaching the s… view at source ↗
Figure 4
Figure 4. Figure 4: Quasi steady-state analysis for SF-NorMuon with decay at Z (averaged over layers). Left: The ratio |ρt+1 − ρt|/ρt remains below 1% after warmup, validating the quasi steady-state hypothesis (Hypothesis 3.1). Right: Alignment αt (purple) and RMS(Z) (red) over training. The theoretical prediction λρt = 0.1[−αt + p α 2 t + 2ηλ] from Lemma 3.2 (blue dashed) closely tracks the observed values. This training run… view at source ↗
Figure 5
Figure 5. Figure 5: Learning rate sweep comparison be￾tween SF-AdamW and SF-NorMuon. Improvement across learning rates. A key practical advantage of SF-NorMuon is robustness to the choice of learning rate [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Learning rate sweep for AdamW with cosine scheduling on the 125M model across four training [PITH_FULL_IMAGE:figures/full_fig_p035_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Optimizer comparison on the 125M model. Scheduled methods (NorMuon, AdamW) are shown with [PITH_FULL_IMAGE:figures/full_fig_p036_7.png] view at source ↗
read the original abstract

Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes SF-NorMuon, a schedule-free spectral optimizer, claiming that a single hyperparameter configuration matches or exceeds per-horizon-tuned AdamW on 125M and 772M parameter language models across 1--8× Chinchilla horizons. It further states a stationarity guarantee for schedule-free spectral dynamics and identifies weight decay applied at the fast iterate as essential for long-horizon stability, enabling horizon-independent high-quality checkpoints.

Significance. If the empirical matching holds with the reported single-configuration robustness and the stationarity result is non-vacuous, the work would meaningfully advance practical anytime optimization by reducing horizon-dependent retuning costs and supporting continual learning. The theoretical identification of the weight-decay placement provides a concrete design principle that could guide further schedule-free methods.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim (SF-NorMuon matches or exceeds tuned AdamW with one fixed hyperparameter set across two model sizes and multiple horizons) is stated without reference to any table, figure, ablation, or error analysis, preventing assessment of effect size, variance, or whether the result is load-bearing for the 'closes this gap' conclusion.
  2. [Abstract] Abstract (theoretical analysis section): the stationarity guarantee and the claim that weight decay at the fast iterate is 'essential for long-horizon stability' are asserted without any displayed equations, assumptions, or derivation outline, so it is impossible to verify whether the condition is derived or imposed by construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments on the abstract. We respond to each major comment below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim (SF-NorMuon matches or exceeds tuned AdamW with one fixed hyperparameter set across two model sizes and multiple horizons) is stated without reference to any table, figure, ablation, or error analysis, preventing assessment of effect size, variance, or whether the result is load-bearing for the 'closes this gap' conclusion.

    Authors: We agree the abstract would be clearer with explicit pointers to the supporting evidence. The performance comparisons (including means and standard deviations over three random seeds) appear in Section 4, Tables 1–2 and Figures 2–4; hyperparameter robustness and weight-decay ablations are in Section 5. In the revision we will add concise parenthetical citations to these results within the abstract. revision: yes

  2. Referee: [Abstract] Abstract (theoretical analysis section): the stationarity guarantee and the claim that weight decay at the fast iterate is 'essential for long-horizon stability' are asserted without any displayed equations, assumptions, or derivation outline, so it is impossible to verify whether the condition is derived or imposed by construction.

    Authors: The stationarity guarantee is stated and proved as Theorem 3.1 (with the key assumption of weight decay applied to the fast iterate). The necessity of this placement is shown by a counter-example in Appendix B when weight decay is instead applied to the slow iterate. The abstract summarizes the result at high level; we will add a reference to Theorem 3.1 in the revised abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's primary result is the empirical observation that a single fixed hyperparameter configuration of SF-NorMuon matches or exceeds per-horizon-tuned AdamW on 125M and 772M models across 1-8× Chinchilla horizons. The mentioned stationarity proof and weight-decay-at-fast-iterate condition are presented as supporting theoretical analysis, not as the load-bearing premise whose equations reduce to the performance numbers by construction. No self-definitional steps, fitted-input-called-prediction patterns, or load-bearing self-citation chains appear in the abstract or described material that would equate any claimed prediction to its own inputs. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted or audited.

pith-pipeline@v0.9.0 · 5734 in / 1047 out tokens · 19444 ms · 2026-05-25T05:36:00.727202+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 28 internal anchors

  1. [1]

    Springer New York, 2009.isbn: 9780387848587.doi: 10.1007/978-0-387-84858-7.url: http://dx.doi.org/ 10.1007/978-0-387-84858-7

    Trevor Hastie, Robert Tibshirani, and Jerome Friedman.The Elements of Statistical Learning. Springer New York, 2009.isbn: 9780387848587.doi: 10.1007/978-0-387-84858-7.url: http://dx.doi.org/ 10.1007/978-0-387-84858-7

  2. [2]

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep Learning.http://www.deeplearningbook. org. MIT Press, 2016

  3. [3]

    Bishop and Hugh Bishop.Deep Learning: Foundations and Concepts

    Christopher M. Bishop and Hugh Bishop.Deep Learning: Foundations and Concepts. Springer In- ternational Publishing, 2024.isbn: 9783031454684.doi: 10.1007/978-3-031-45468-4 .url: http: //dx.doi.org/10.1007/978-3-031-45468-4

  4. [4]

    Language models are few-shot learners

    Tom Brown et al. “Language models are few-shot learners”. In:Advances in neural information processing systems33 (2020), pp. 1877–1901

  5. [5]

    Hugo Touvron et al.LLaMA: Open and Efficient Foundation Language Models. 2023. arXiv:2302.13971 [cs.CL].url:https://arxiv.org/abs/2302.13971

  6. [6]

    Wayne Xin Zhao et al.A Survey of Large Language Models. 2023. eprint:arXiv:2303.18223

  7. [7]

    Shervin Minaee et al.Large Language Models: A Survey. 2024. eprint:arXiv:2402.06196

  8. [8]

    Anthony Brohan et al.RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. 2023. eprint:arXiv:2307.15818

  9. [9]

    Yueen Ma et al.A Survey on Vision-Language-Action Models for Embodied AI. 2024. eprint:arXiv: 2405.14093

  10. [10]

    Moo Jin Kim et al.OpenVLA: An Open-Source Vision-Language-Action Model. 2024. eprint:arXiv: 2406.09246

  11. [11]

    Continual lifelong learning with neural networks: A review

    German I. Parisi et al. “Continual lifelong learning with neural networks: A review”. In:Neural Networks113 (May 2019), pp. 54–71.issn: 0893-6080.doi: 10.1016/j.neunet.2019.01.012 .url: http://dx.doi.org/10.1016/j.neunet.2019.01.012

  12. [12]

    Liyuan Wang et al.A Comprehensive Survey of Continual Learning: Theory, Method and Application

  13. [13]

    arXiv:2302.00487 [cs.LG].url:https://arxiv.org/abs/2302.00487

  14. [14]

    Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick et al. “Overcoming catastrophic forgetting in neural networks”. In:Proceedings of the National Academy of Sciences114.13 (Mar. 2017), pp. 3521–3526.issn: 1091-6490.doi:10.1073/pnas. 1611835114.url:http://dx.doi.org/10.1073/pnas.1611835114

  15. [15]

    doi: 10.1109/tpami.2021.3057446

    Matthias Delange et al. “A continual learning survey: Defying forgetting in classification tasks”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence(2021).issn: 1939-3539.doi: 10. 1109/tpami.2021.3057446.url:http://dx.doi.org/10.1109/TPAMI.2021.3057446

  16. [16]

    Step-size optimization for continual learning

    Thomas Degris et al. “Step-size optimization for continual learning”. In:arXiv preprint arXiv:2401.17401 (2024)

  17. [17]

    Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging

    Alexandru Meterez et al. “Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging”. In:arXiv preprint arXiv:2602.03702(2026)

  18. [18]

    The road less scheduled

    Aaron Defazio et al. “The road less scheduled”. In:Advances in Neural Information Processing Systems 37 (2024), pp. 9974–10007

  19. [19]

    How far away are truly hyperparameter-free learning algorithms?

    Priya Kasimbeg et al. “How far away are truly hyperparameter-free learning algorithms?” In:arXiv preprint arXiv:2505.24005(2025)

  20. [20]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. “SGDR: Stochastic Gradient Descent with Warm Restarts”. In: International Conference on Learning Representations. 2017.url:https://openreview.net/forum? id=Skq89Scxx

  21. [21]

    Jordan Hoffmann et al.Training Compute-Optimal Large Language Models. 2022. arXiv:2203.15556 [cs.CL].url:https://arxiv.org/abs/2203.15556

  22. [22]

    Shengding Hu et al.MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. 2024. arXiv:2404.06395 [cs.CL].url:https://arxiv.org/abs/2404.06395. 12

  23. [23]

    A method for solving the convex programming problem with convergence rate O (1/k2)

    Yurii Nesterov. “A method for solving the convex programming problem with convergence rate O (1/k2)”. In:Dokl akad nauk Sssr. Vol. 269. 1983, p. 543

  24. [24]

    An optimal method for stochastic composite optimization

    Guanghui Lan. “An optimal method for stochastic composite optimization”. In:Mathematical Program- ming133.1 (2012), pp. 365–397

  25. [25]

    Alexander Hägele et al.Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

  26. [26]

    arXiv:2405.18392 [cs.LG].url:https://arxiv.org/abs/2405.18392

  27. [27]

    Yunshui Li et al.Model Merging in Pre-training of Large Language Models. 2025. arXiv:2505.12082 [cs.CL].url:https://arxiv.org/abs/2505.12082

  28. [28]

    Averaging Weights Leads to Wider Optima and Better Generalization

    Pavel Izmailov et al. “Averaging weights leads to wider optima and better generalization”. In:arXiv preprint arXiv:1803.05407(2018)

  29. [29]

    Accelerating neural network training: An analysis of the AlgoPerf competition

    Priya Kasimbeg et al. “Accelerating neural network training: An analysis of the AlgoPerf competition”. In: The Thirteenth International Conference on Learning Representations. 2025.url:https://openreview. net/forum?id=CtM5xjRSfm

  30. [30]

    Scaling laws and compute-optimal training beyond fixed training durations

    Alexander Hägele et al. “Scaling laws and compute-optimal training beyond fixed training durations”. In:Advances in Neural Information Processing Systems37 (2024), pp. 76232–76264

  31. [31]

    Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

    Minhak Song et al. “Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training”. In:High-dimensional Learning Dynamics 2025. 2025.url: https : / / openreview.net/forum?id=b5HYeRzG9M

  32. [32]

    2026.url:https://openreview.net/forum?id=Jw7khYzYzl

    Andrei Semenov, Matteo Pagliardini, and Martin Jaggi.Benchmarking Optimizers for Large Language Model Pretraining. 2026.url:https://openreview.net/forum?id=Jw7khYzYzl

  33. [33]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In:International Conference on Learning Representations (ICLR). 2015.url:https://arxiv.org/abs/1412.6980

  34. [34]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. In:International Confer- ence on Learning Representations. 2019.url:https://openreview.net/forum?id=Bkg6RiCqY7

  35. [35]

    github.io/posts/muon/

    Keller Jordan et al.Muon: An optimizer for hidden layers in neural networks.https://kellerjordan. github.io/posts/muon/. Accessed: 2026-01-25. 2024

  36. [36]

    GitHub repository, master branch

    Keller Jordan et al.Muon (GitHub repository): An optimizer for hidden layers in neural networks. GitHub repository, master branch. 2024.url:https://github.com/KellerJordan/Muon (visited on 01/25/2026)

  37. [37]

    Chongjie Si, Debing Zhang, and Wei Shen.AdaMuon: Adaptive Muon Optimizer. 2025. arXiv:2507. 11005 [cs.LG].url:https://arxiv.org/abs/2507.11005

  38. [38]

    Zichong Li et al.NorMuon: Making Muon more efficient and scalable. 2025. arXiv:2510.05491 [cs.LG]. url:https://arxiv.org/abs/2510.05491

  39. [39]

    Dion: Distributed Orthonormalized Updates

    Kwangjun Ahn et al. “Dion: Distributed Orthonormalized Updates”. In:arXiv preprint: 2504.05295 (2025)

  40. [40]

    Kwangjun Ahn, Noah Amsel, and John Langford.Dion2: A Simple Method to Shrink Matrix in Muon

  41. [41]

    arXiv:2512.16928 [cs.LG].url:https://arxiv.org/abs/2512.16928

  42. [42]

    Kang An et al.ASGO: Adaptive Structured Gradient Optimization. 2025. arXiv:2503.20762 [cs.LG]. url:https://arxiv.org/abs/2503.20762

  43. [43]

    Liliang Ren et al.Rethinking Language Model Scaling under Transferable Hypersphere Optimization

  44. [44]

    arXiv:2603.28743 [cs.LG].url:https://arxiv.org/abs/2603.28743

  45. [45]

    Noah Amsel et al.The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm. 2026. arXiv:2505.16932 [cs.LG].url:https://arxiv.org/abs/2505.16932

  46. [46]

    Ziyue Liu et al.Muon2: Boosting Muon via Adaptive Second-Moment Preconditioning. 2026. arXiv: 2604.09967 [cs.LG].url:https://arxiv.org/abs/2604.09967

  47. [47]

    Ahmed Khaled et al.MuonBP: Faster Muon via Block-Periodic Orthogonalization. 2025. arXiv:2510. 16981 [cs.LG].url:https://arxiv.org/abs/2510.16981. 13

  48. [48]

    Alexey Kravatskiy et al.The Ky Fan Norms and Beyond: Dual Norms and Combinations for Matrix Optimization. 2025. arXiv:2512.09678 [math.OC].url:https://arxiv.org/abs/2512.09678

  49. [49]

    Jingyuan Liu et al.Muon is Scalable for LLM Training. 2025. arXiv:2502 . 16982 [cs.LG].url: https://arxiv.org/abs/2502.16982

  50. [50]

    Kimi Team.Kimi K2.5: Visual Agentic Intelligence. 2026. arXiv:2602.02276 [cs.CL].url: https: //arxiv.org/abs/2602.02276

  51. [51]

    GLM-5-Team.GLM-5: from Vibe Coding to Agentic Engineering. 2026. arXiv:2602.15763 [cs.LG]. url:https://arxiv.org/abs/2602.15763

  52. [52]

    Varun Singh et al.Arcee Trinity Large Technical Report. 2026. arXiv:2602.17004 [cs.LG] .url: https://arxiv.org/abs/2602.17004

  53. [53]

    Levent Sagun et al.Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. 2017. eprint:arXiv:1706.04454

  54. [54]

    An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

    Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. “An Investigation into Neural Net Optimization via Hessian Eigenvalue Density”. In:Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 2232–2241.url: https : / / proceed...

  55. [55]

    When do spectral gradient updates help in deep learning?arXiv preprint arXiv:2512.04299,

    Damek Davis and Dmitriy Drusvyatskiy.When do spectral gradient updates help in deep learning?2026. arXiv:2512.04299 [cs.LG].url:https://arxiv.org/abs/2512.04299

  56. [56]

    The surprising agreement between convex optimization theory and learning-rate scheduling for large model training

    Fabian Schaipp et al. “The surprising agreement between convex optimization theory and learning-rate scheduling for large model training”. In:arXiv preprint arXiv:2501.18965(2025)

  57. [57]

    A simple weight decay can improve generalization

    Anders Krogh and John A. Hertz. “A simple weight decay can improve generalization”. In:Proceedings of the 5th International Conference on Neural Information Processing Systems. NIPS’91. Denver, Colorado: Morgan Kaufmann Publishers Inc., 1991, pp. 950–957.isbn: 1558602224

  58. [58]

    arXiv: 2310.04415 [cs.LG].url:https://arxiv.org/abs/2310.04415

    Francesco D’Angelo et al.Why Do We Need Weight Decay in Modern Deep Learning?2024. arXiv: 2310.04415 [cs.LG].url:https://arxiv.org/abs/2310.04415

  59. [59]

    Shikai Qiu et al.Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales. 2026. arXiv:2512.05620 [cs.LG].url:https://arxiv.org/abs/2512.05620

  60. [60]

    Da Chang, Yongxiang Liu, and Ganzhao Yuan.On the Convergence of Muon and Beyond. 2026. arXiv: 2509.15816 [cs.LG].url:https://arxiv.org/abs/2509.15816

  61. [61]

    Wei Shen et al.On the Convergence Analysis of Muon. 2026. arXiv:2505.23737 [stat.ML] .url: https://arxiv.org/abs/2505.23737

  62. [62]

    Naoki Sato, Hiroki Naganuma, and Hideaki Iiduka.Convergence Bound and Critical Batch Size of Muon Optimizer. 2025. arXiv:2507.01598 [cs.LG].url:https://arxiv.org/abs/2507.01598

  63. [63]

    Jiaxiang Li and Mingyi Hong.A Note on the Convergence of Muon. 2025. arXiv:2502.02900 [math.OC]. url:https://arxiv.org/abs/2502.02900

  64. [64]

    Jeremy Bernstein and Laker Newhouse.Old Optimizer, New Norm: An Anthology. 2024. arXiv:2409. 20325 [cs.LG].url:https://arxiv.org/abs/2409.20325

  65. [65]

    Thomas Pethick et al.Training Deep Learning Models with Norm-Constrained LMOs. 2025. arXiv: 2502.07529 [cs.LG].url:https://arxiv.org/abs/2502.07529

  66. [66]

    Symbolic discovery of optimization algorithms

    Xiangning Chen et al. “Symbolic discovery of optimization algorithms”. In:Advances in neural informa- tion processing systems36 (2023), pp. 49205–49233

  67. [67]

    Vineet Gupta, Tomer Koren, and Yoram Singer.Shampoo: Preconditioned Stochastic Tensor Optimiza- tion. 2018. arXiv:1802.09568 [cs.LG].url:https://arxiv.org/abs/1802.09568

  68. [68]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal et al. “Accurate, large minibatch sgd: Training imagenet in 1 hour”. In:arXiv preprint arXiv:1706.02677(2017)

  69. [69]

    Why warmup the learning rate? underlying mechanisms and improvements

    Dayal Singh Kalra and Maissam Barkeshli. “Why warmup the learning rate? underlying mechanisms and improvements”. In:Advances in Neural Information Processing Systems37 (2024), pp. 111760–111801. 14

  70. [70]

    Learning-rate-free learning by d-adaptation

    Aaron Defazio and Konstantin Mishchenko. “Learning-rate-free learning by d-adaptation”. In:Interna- tional Conference on Machine Learning. PMLR. 2023, pp. 7449–7479

  71. [71]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer et al. “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer”. In:International Conference on Learning Representations. 2017

  72. [72]

    GShard: Scaling giant models with conditional computation and automatic sharding

    Dmitry Lepikhin et al. “GShard: Scaling giant models with conditional computation and automatic sharding”. In:International Conference on Learning Representations. 2021

  73. [73]

    Switch Transformers: scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. “Switch Transformers: scaling to trillion parameter models with simple and efficient sparsity”. In:Journal of Machine Learning Research23.120 (2022), pp. 1–39

  74. [74]

    Twan van Laarhoven.L2 Regularization versus Batch and Weight Normalization. 2017. arXiv:1706. 05350 [cs.LG].url:https://arxiv.org/abs/1706.05350

  75. [75]

    Aaron Defazio.Why Gradients Rapidly Increase Near the End of Training. 2025. arXiv:2506.02285 [cs.LG].url:https://arxiv.org/abs/2506.02285

  76. [76]

    Hugo Touvron et al.Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. arXiv:2307.09288 [cs.CL].url:https://arxiv.org/abs/2307.09288

  77. [77]

    Jianlin Su et al.RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023. arXiv: 2104.09864 [cs.CL].url:https://arxiv.org/abs/2104.09864

  78. [78]

    Biao Zhang and Rico Sennrich.Root Mean Square Layer Normalization. 2019. arXiv:1910.07467 [cs.LG].url:https://arxiv.org/abs/1910.07467

  79. [79]

    Zhengyan Zhang et al.ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs. 2024. arXiv:2402.03804 [cs.LG].url:https://arxiv.org/abs/2402.03804

  80. [80]

    Andrej Karpathy.NanoGPT.https://github.com/karpathy/nanoGPT. 2022

Showing first 80 references.