pith. machine review for the scientific record.

arxiv: 2605.09176 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

Aditya Ranganath

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: large language models · optimizers · AdamW · memory-efficient optimizers · matrix-based optimizers · benchmarking · scale-aware evaluation · adaptive optimization

The pith

Optimizer research for large language models is shifting from single-algorithm speedup claims to rigorous scale-aware multi-dimensional comparisons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews how optimizer design for LLMs has evolved beyond the standard Adam algorithm to address memory limits, exploit curvature and matrix structure, and incorporate sign-based updates. It sorts the literature into classical first-order methods, adaptive variants, memory-efficient forms, second-order approaches, sign-based and discovered methods, low-rank projections, and matrix-based optimizers such as Muon. The central argument is that progress now requires benchmarks that simultaneously measure convergence, stability, memory overhead, token efficiency, wall-clock time, and implementation complexity rather than isolated accuracy gains. A sympathetic reader would care because training at extreme scale consumes enormous resources, so methods that improve any of these dimensions matter for feasibility and cost. The paper also spells out practical considerations for making such comparisons fair across model sizes and hardware.
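
To make the memory-overhead axis concrete, here is a rough back-of-the-envelope sketch in Python. The model size and byte counts are illustrative assumptions, not figures from the paper, and the tally ignores parameters, gradients, activations, sharding, and framework overhead.

```python
# Illustrative optimizer-state arithmetic for a hypothetical dense model.
# Byte counts are textbook approximations, not measurements from the survey.

def state_bytes_adamw(n_params: int, bytes_per_value: int = 4) -> int:
    """AdamW keeps two extra fp32 buffers per parameter (first and second moments)."""
    return 2 * bytes_per_value * n_params

def state_bytes_sgd_momentum(n_params: int, bytes_per_value: int = 4) -> int:
    """SGD with momentum keeps a single velocity buffer."""
    return 1 * bytes_per_value * n_params

def state_bytes_adam_8bit(n_params: int) -> int:
    """Block-wise 8-bit quantized moments: roughly 1 byte per moment value."""
    return 2 * 1 * n_params

if __name__ == "__main__":
    n = 7_000_000_000  # assume a 7B-parameter model for illustration
    for name, fn in [("AdamW (fp32 states)", state_bytes_adamw),
                     ("SGD + momentum (fp32)", state_bytes_sgd_momentum),
                     ("8-bit Adam (quantized states)", state_bytes_adam_8bit)]:
        print(f"{name:>30}: {fn(n) / 2**30:6.1f} GiB of optimizer state")
```

On these assumptions, AdamW's extra state alone is roughly 52 GiB for a 7B model, which is why memory-efficient and factored variants form a category of their own.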

Core claim

The paper claims that optimizer research for LLMs is entering a new phase in which single-algorithm speedup claims give way to systematic, scale-aware evaluations that jointly assess convergence, stability, memory footprint, and implementation complexity, supported by a taxonomy that groups work into classical first-order, adaptive, memory-efficient, second-order and curvature-aware, sign-based, low-rank and projection-based, and matrix-based categories.

What carries the argument

A taxonomy organizing optimizers into seven categories plus a benchmarking methodology that jointly evaluates hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream performance.
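
One way to picture the joint evaluation described above is a per-run record that carries every axis together rather than a single headline number. The sketch below is a hypothetical schema for such a benchmark harness; the field names are the editor's illustration, not the paper's methodology.

```python
from dataclasses import dataclass

@dataclass
class OptimizerRunRecord:
    """One benchmarked training run, reporting all comparison axes at once."""
    optimizer: str              # e.g. "AdamW", "Lion", "Muon"
    model_params: int           # scale at which the result was measured
    final_eval_loss: float      # convergence / downstream-performance proxy
    diverged: bool              # stability
    tokens_to_target_loss: int  # token efficiency
    wall_clock_hours: float     # wall-clock efficiency
    optimizer_state_gib: float  # memory overhead
    hparam_trials: int          # tuning budget, for hyperparameter fairness
    custom_code_lines: int      # crude proxy for implementation complexity
```

A table of such records across model sizes, rather than a single speedup claim, is the kind of artifact the survey argues the field should converge on.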

Load-bearing premise

The body of published optimizer papers is representative enough to support clean non-overlapping categories and to justify the proposed shift in benchmarking practices.

What would settle it

Either a widely adopted optimizer that cannot be placed in any of the seven categories, or a large-scale study showing that single-algorithm speedup claims still drive most practical LLM training decisions, would undermine the argument that the field has entered a new multi-dimensional benchmarking phase.

read the original abstract

Training large language models requires optimization algorithms that are not only statistically effective, but also computationally and memory efficient at extreme scale. Although Adam remains the dominant optimizer for large-scale language-model pretraining and fine-tuning, recent work has revisited nearly every component of the optimization stack: adaptive moment estimation, decoupled weight decay, memory footprint, curvature approximation, sign-based updates, large-batch stability, low-rank gradient structure, and matrix-wise orthogonalized updates. This survey reviews optimizer design for large language models through a systems-and-optimization lens. We organize the literature into classical first-order optimizers, adaptive optimizers, memory-efficient variants, second-order and curvature-aware methods, sign-based and discovered optimizers, low-rank and projection-based methods, and matrix-based optimizers such as Muon. We also discuss benchmarking methodology, including hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream evaluation. We argue that optimizer research for LLMs is entering a new phase: moving from single-algorithm speedup claims toward rigorous, scale-aware comparisons that jointly evaluate convergence, stability, memory, and implementation complexity.
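
For readers who want the abstract's terms unpacked, the following minimal sketch shows the standard textbook forms of two of the update rules it names: AdamW's decoupled weight decay and a sign-based update. Hyperparameter values are placeholders, and this is a single-tensor illustration rather than anyone's production implementation.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.1):
    """One AdamW step: adaptive moments plus *decoupled* weight decay."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay acts on the weights directly instead of being folded
    # into the gradient (Loshchilov & Hutter).
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

def signsgd_step(w, g, lr=1e-3):
    """Sign-based update: only the sign of each gradient entry is used."""
    return w - lr * np.sign(g)
```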

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a survey reviewing optimizer algorithms for large language models through a systems-and-optimization lens. It organizes prior work into classical first-order optimizers, adaptive optimizers, memory-efficient variants, second-order and curvature-aware methods, sign-based and discovered optimizers, low-rank and projection-based methods, and matrix-based optimizers (e.g., Muon). The paper also covers benchmarking methodology across axes such as hyperparameter fairness, scale dependence, wall-clock and token efficiency, memory overhead, and downstream evaluation, and argues that the field is shifting from isolated single-algorithm speedup claims to rigorous, scale-aware, multi-dimensional comparisons of convergence, stability, memory, and implementation complexity.

Significance. If the literature coverage proves representative, the survey provides a useful organizing framework for the rapidly expanding body of LLM optimizer research and flags the shortcomings of single-metric claims. The descriptive nature of the work (no new derivations, fitted parameters, or self-referential claims) avoids circularity and positions it as a potential reference for guiding future scale-aware benchmarking.

major comments (1)
  1. The central claim that optimizer research is entering a 'new phase' of scale-aware multi-dimensional evaluation depends on the surveyed literature being sufficiently representative; the manuscript should therefore state explicit search methodology, inclusion criteria, or date range for the reviewed papers (in the section that introduces the seven categories) so readers can assess potential gaps or selection bias.
minor comments (2)
  1. The abstract lists seven categories but does not indicate their relative sizes or highlight one or two representative papers per category; adding a short summary table in the introduction would improve readability.
  2. In the benchmarking discussion, the distinction between 'wall-clock efficiency' and 'token efficiency' is introduced but not illustrated with a concrete example from the cited literature; a brief case study would clarify the practical difference.
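
To make the referee's wall-clock versus token-efficiency distinction concrete, here is a toy calculation with invented numbers (they are not drawn from any cited paper): an optimizer can need fewer tokens to reach a target loss yet still lose on wall-clock time if each token is processed more slowly.

```python
# Hypothetical throughputs and token budgets, for illustration only.
runs = {
    "A (heavier per-step preconditioning)": {"tokens_to_target": 80e9,  "tokens_per_sec": 1.5e5},
    "B (lighter update rule)":              {"tokens_to_target": 100e9, "tokens_per_sec": 2.2e5},
}
for name, r in runs.items():
    hours = r["tokens_to_target"] / r["tokens_per_sec"] / 3600
    print(f"{name}: {r['tokens_to_target'] / 1e9:.0f}B tokens to target, ~{hours:.0f} h wall-clock")
```

Here A is more token-efficient (80B vs 100B tokens) but B finishes sooner (about 126 h vs 148 h), which is exactly why the survey asks for both axes to be reported.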

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. The suggestion to document the literature search process is well-taken and will improve the transparency of the survey.

read point-by-point responses
  1. Referee: The central claim that optimizer research is entering a 'new phase' of scale-aware multi-dimensional evaluation depends on the surveyed literature being sufficiently representative; the manuscript should therefore state explicit search methodology, inclusion criteria, or date range for the reviewed papers (in the section that introduces the seven categories) so readers can assess potential gaps or selection bias.

    Authors: We agree that the claim of a shift toward scale-aware, multi-dimensional evaluation is stronger when readers can evaluate the scope of the surveyed literature. In the revised manuscript we will insert a short subsection immediately preceding the introduction of the seven categories. This subsection will state: (i) the primary search venues and databases used (arXiv, NeurIPS, ICML, ICLR, and major optimizer-related workshops), (ii) the time window (papers appearing between January 2018 and December 2024), (iii) inclusion criteria (works that propose or rigorously evaluate optimizers specifically for large-scale language-model training, with emphasis on memory, stability, or wall-clock considerations), and (iv) exclusion criteria (purely theoretical analyses without empirical LLM-scale results, or works focused exclusively on vision or reinforcement-learning domains). We believe this addition will allow readers to assess coverage and potential selection effects without altering the descriptive nature of the survey. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely descriptive survey with no derivations or self-referential reductions

full rationale

The manuscript is a literature survey that organizes existing optimizer research into categories (classical first-order, adaptive, memory-efficient, second-order, sign-based, low-rank, matrix-based) and discusses benchmarking axes without presenting any original derivations, equations, fitted parameters, or predictive claims that could reduce to inputs by construction. No load-bearing steps rely on self-citation chains, ansatzes smuggled via prior work, or renaming of results; the central argument is a perspective on field evolution supported by external citations to prior literature. The paper is self-contained as a review and exhibits none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no new technical claims, parameters, or entities introduced.

pith-pipeline@v0.9.0 · 5492 in / 1067 out tokens · 35647 ms · 2026-05-12T03:57:32.933712+00:00 · methodology

discussion (0)

