pith. machine review for the scientific record.

arxiv: 2605.09176 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

Aditya Ranganath

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: large language models · optimizers · AdamW · memory-efficient optimizers · matrix-based optimizers · benchmarking · scale-aware evaluation · adaptive optimization

The pith

Optimizer research for large language models is shifting from single-algorithm speedup claims to rigorous scale-aware multi-dimensional comparisons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews how optimizer design for LLMs has evolved beyond the standard Adam algorithm to address memory limits, exploit curvature and matrix structure, and incorporate sign-based updates. It sorts the literature into classical first-order methods, adaptive variants, memory-efficient forms, second-order approaches, sign-based and discovered methods, low-rank projections, and matrix-based optimizers such as Muon. The central argument is that progress now requires benchmarks that simultaneously measure convergence, stability, memory overhead, token efficiency, wall-clock time, and implementation complexity rather than isolated accuracy gains. A sympathetic reader would care because training at extreme scale consumes enormous resources, so methods that improve any of these dimensions matter for feasibility and cost. The paper also spells out practical considerations for making such comparisons fair across model sizes and hardware.
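
To make the memory-overhead axis concrete, here is a rough back-of-the-envelope sketch in Python. The model size and byte counts are illustrative assumptions, not figures from the paper, and the tally ignores parameters, gradients, activations, sharding, and framework overhead.

```python
# Illustrative optimizer-state arithmetic for a hypothetical dense model.
# Byte counts are textbook approximations, not measurements from the survey.

def state_bytes_adamw(n_params: int, bytes_per_value: int = 4) -> int:
    """AdamW keeps two extra fp32 buffers per parameter (first and second moments)."""
    return 2 * bytes_per_value * n_params

def state_bytes_sgd_momentum(n_params: int, bytes_per_value: int = 4) -> int:
    """SGD with momentum keeps a single velocity buffer."""
    return 1 * bytes_per_value * n_params

def state_bytes_adam_8bit(n_params: int) -> int:
    """Block-wise 8-bit quantized moments: roughly 1 byte per moment value."""
    return 2 * 1 * n_params

if __name__ == "__main__":
    n = 7_000_000_000  # assume a 7B-parameter model for illustration
    for name, fn in [("AdamW (fp32 states)", state_bytes_adamw),
                     ("SGD + momentum (fp32)", state_bytes_sgd_momentum),
                     ("8-bit Adam (quantized states)", state_bytes_adam_8bit)]:
        print(f"{name:>30}: {fn(n) / 2**30:6.1f} GiB of optimizer state")
```

On these assumptions, AdamW's extra state alone is roughly 52 GiB for a 7B model, which is why memory-efficient and factored variants form a category of their own.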

Core claim

The paper claims that optimizer research for LLMs is entering a new phase in which single-algorithm speedup claims give way to systematic, scale-aware evaluations that jointly assess convergence, stability, memory footprint, and implementation complexity, supported by a taxonomy that groups work into classical first-order, adaptive, memory-efficient, second-order and curvature-aware, sign-based, low-rank and projection-based, and matrix-based categories.

What carries the argument

A taxonomy organizing optimizers into seven categories plus a benchmarking methodology that jointly evaluates hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream performance.
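
One way to picture the joint evaluation described above is a per-run record that carries every axis together rather than a single headline number. The sketch below is a hypothetical schema for such a benchmark harness; the field names are the editor's illustration, not the paper's methodology.

```python
from dataclasses import dataclass

@dataclass
class OptimizerRunRecord:
    """One benchmarked training run, reporting all comparison axes at once."""
    optimizer: str              # e.g. "AdamW", "Lion", "Muon"
    model_params: int           # scale at which the result was measured
    final_eval_loss: float      # convergence / downstream-performance proxy
    diverged: bool              # stability
    tokens_to_target_loss: int  # token efficiency
    wall_clock_hours: float     # wall-clock efficiency
    optimizer_state_gib: float  # memory overhead
    hparam_trials: int          # tuning budget, for hyperparameter fairness
    custom_code_lines: int      # crude proxy for implementation complexity
```

A table of such records across model sizes, rather than a single speedup claim, is the kind of artifact the survey argues the field should converge on.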

Load-bearing premise

The body of published optimizer papers is representative enough to support clean non-overlapping categories and to justify the proposed shift in benchmarking practices.

What would settle it

Either a widely adopted optimizer that cannot be placed in any of the seven categories, or a large-scale study showing that single-algorithm speedup claims still drive most practical LLM training decisions, would undermine the argument that the field has entered a new multi-dimensional benchmarking phase.

read the original abstract

Training large language models requires optimization algorithms that are not only statistically effective, but also computationally and memory efficient at extreme scale. Although Adam remains the dominant optimizer for large-scale language-model pretraining and fine-tuning, recent work has revisited nearly every component of the optimization stack: adaptive moment estimation, decoupled weight decay, memory footprint, curvature approximation, sign-based updates, large-batch stability, low-rank gradient structure, and matrix-wise orthogonalized updates. This survey reviews optimizer design for large language models through a systems-and-optimization lens. We organize the literature into classical first-order optimizers, adaptive optimizers, memory-efficient variants, second-order and curvature-aware methods, sign-based and discovered optimizers, low-rank and projection-based methods, and matrix-based optimizers such as Muon. We also discuss benchmarking methodology, including hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream evaluation. We argue that optimizer research for LLMs is entering a new phase: moving from single-algorithm speedup claims toward rigorous, scale-aware comparisons that jointly evaluate convergence, stability, memory, and implementation complexity.
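
For readers who want the abstract's terms unpacked, the following minimal sketch shows the standard textbook forms of two of the update rules it names: AdamW's decoupled weight decay and a sign-based update. Hyperparameter values are placeholders, and this is a single-tensor illustration rather than anyone's production implementation.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.1):
    """One AdamW step: adaptive moments plus *decoupled* weight decay."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay acts on the weights directly instead of being folded
    # into the gradient (Loshchilov & Hutter).
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

def signsgd_step(w, g, lr=1e-3):
    """Sign-based update: only the sign of each gradient entry is used."""
    return w - lr * np.sign(g)
```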

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a survey reviewing optimizer algorithms for large language models through a systems-and-optimization lens. It organizes prior work into classical first-order optimizers, adaptive optimizers, memory-efficient variants, second-order and curvature-aware methods, sign-based and discovered optimizers, low-rank and projection-based methods, and matrix-based optimizers (e.g., Muon). The paper also covers benchmarking methodology across axes such as hyperparameter fairness, scale dependence, wall-clock and token efficiency, memory overhead, and downstream evaluation, and argues that the field is shifting from isolated single-algorithm speedup claims to rigorous, scale-aware, multi-dimensional comparisons of convergence, stability, memory, and implementation complexity.

Significance. If the literature coverage proves representative, the survey provides a useful organizing framework for the rapidly expanding body of LLM optimizer research and flags the shortcomings of single-metric claims. The descriptive nature of the work (no new derivations, fitted parameters, or self-referential claims) avoids circularity and positions it as a potential reference for guiding future scale-aware benchmarking.

major comments (1)
  1. The central claim that optimizer research is entering a 'new phase' of scale-aware multi-dimensional evaluation depends on the surveyed literature being sufficiently representative; the manuscript should therefore state explicit search methodology, inclusion criteria, or date range for the reviewed papers (in the section that introduces the seven categories) so readers can assess potential gaps or selection bias.
minor comments (2)
  1. The abstract lists seven categories but does not indicate their relative sizes or highlight one or two representative papers per category; adding a short summary table in the introduction would improve readability.
  2. In the benchmarking discussion, the distinction between 'wall-clock efficiency' and 'token efficiency' is introduced but not illustrated with a concrete example from the cited literature; a brief case study would clarify the practical difference.
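
To make the referee's wall-clock versus token-efficiency distinction concrete, here is a toy calculation with invented numbers (they are not drawn from any cited paper): an optimizer can need fewer tokens to reach a target loss yet still lose on wall-clock time if each token is processed more slowly.

```python
# Hypothetical throughputs and token budgets, for illustration only.
runs = {
    "A (heavier per-step preconditioning)": {"tokens_to_target": 80e9,  "tokens_per_sec": 1.5e5},
    "B (lighter update rule)":              {"tokens_to_target": 100e9, "tokens_per_sec": 2.2e5},
}
for name, r in runs.items():
    hours = r["tokens_to_target"] / r["tokens_per_sec"] / 3600
    print(f"{name}: {r['tokens_to_target'] / 1e9:.0f}B tokens to target, ~{hours:.0f} h wall-clock")
```

Here A is more token-efficient (80B vs 100B tokens) but B finishes sooner (about 126 h vs 148 h), which is exactly why the survey asks for both axes to be reported.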

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. The suggestion to document the literature search process is well-taken and will improve the transparency of the survey.

read point-by-point responses
  1. Referee: The central claim that optimizer research is entering a 'new phase' of scale-aware multi-dimensional evaluation depends on the surveyed literature being sufficiently representative; the manuscript should therefore state explicit search methodology, inclusion criteria, or date range for the reviewed papers (in the section that introduces the seven categories) so readers can assess potential gaps or selection bias.

    Authors: We agree that the claim of a shift toward scale-aware, multi-dimensional evaluation is stronger when readers can evaluate the scope of the surveyed literature. In the revised manuscript we will insert a short subsection immediately preceding the introduction of the seven categories. This subsection will state: (i) the primary search venues and databases used (arXiv, NeurIPS, ICML, ICLR, and major optimizer-related workshops), (ii) the time window (papers appearing between January 2018 and December 2024), (iii) inclusion criteria (works that propose or rigorously evaluate optimizers specifically for large-scale language-model training, with emphasis on memory, stability, or wall-clock considerations), and (iv) exclusion criteria (purely theoretical analyses without empirical LLM-scale results, or works focused exclusively on vision or reinforcement-learning domains). We believe this addition will allow readers to assess coverage and potential selection effects without altering the descriptive nature of the survey. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely descriptive survey with no derivations or self-referential reductions

full rationale

The manuscript is a literature survey that organizes existing optimizer research into categories (classical first-order, adaptive, memory-efficient, second-order, sign-based, low-rank, matrix-based) and discusses benchmarking axes without presenting any original derivations, equations, fitted parameters, or predictive claims that could reduce to inputs by construction. No load-bearing steps rely on self-citation chains, ansatzes smuggled via prior work, or renaming of results; the central argument is a perspective on field evolution supported by external citations to prior literature. The paper is self-contained as a review and exhibits none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no new technical claims, parameters, or entities introduced.

pith-pipeline@v0.9.0 · 5492 in / 1067 out tokens · 35647 ms · 2026-05-12T03:57:32.933712+00:00 · methodology

discussion (0)

