Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
Pith reviewed 2026-05-12 03:57 UTC · model grok-4.3
The pith
Optimizer research for large language models is shifting from single-algorithm speedup claims to rigorous, scale-aware, multi-dimensional comparisons.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that optimizer research for LLMs is entering a new phase in which single-algorithm speedup claims give way to systematic, scale-aware evaluations that jointly assess convergence, stability, memory footprint, and implementation complexity, supported by a taxonomy that groups work into classical first-order, adaptive, memory-efficient, second-order and curvature-aware, sign-based, low-rank and projection-based, and matrix-based categories.
What carries the argument
A taxonomy organizing optimizers into seven categories, together with a benchmarking methodology that jointly evaluates hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream performance.
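To make the memory-overhead axis concrete, here is a back-of-envelope sketch that is not taken from the paper: it assumes fp32 moments for AdamW, one fp32 momentum buffer for SGD with momentum, roughly one byte per block-wise quantized moment for 8-bit AdamW, and treats Adafactor's factored second moment as negligible. The function name and category labels are illustrative.

```python
# Back-of-envelope optimizer-state accounting (illustrative assumptions, not from the paper):
# fp32 moments for AdamW, one fp32 buffer for SGD+momentum, ~1 byte per quantized
# moment for block-wise 8-bit AdamW, and Adafactor's factored second moment rounded to 0.

def optimizer_state_bytes(num_params: int, optimizer: str) -> int:
    """Approximate extra memory held by the optimizer, beyond weights and gradients."""
    bytes_per_param = {
        "sgd": 0,            # no persistent state
        "sgd_momentum": 4,   # one fp32 momentum buffer
        "adamw": 8,          # fp32 first and second moments
        "adamw_8bit": 2,     # two block-wise quantized moments
        "adafactor": 0,      # factored second moment: sublinear, rounded to 0 here
    }
    return num_params * bytes_per_param[optimizer]

for opt in ("sgd_momentum", "adamw", "adamw_8bit", "adafactor"):
    gib = optimizer_state_bytes(7_000_000_000, opt) / 2**30   # a 7B-parameter model
    print(f"{opt:>12}: {gib:6.1f} GiB of optimizer state")
```

At 7B parameters the gap between AdamW and the memory-efficient variants is already tens of GiB per model replica, which is why memory overhead sits alongside convergence among the benchmarking axes above.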
Load-bearing premise
The body of published optimizer papers is representative enough to support clean non-overlapping categories and to justify the proposed shift in benchmarking practices.
What would settle it
A widely adopted optimizer that cannot be placed in any of the seven categories, or a large-scale study showing that single-algorithm speedup claims still drive most practical LLM training decisions, would undermine the argument that the field has entered a new multi-dimensional benchmarking phase.
Original abstract
Training large language models requires optimization algorithms that are not only statistically effective, but also computationally and memory efficient at extreme scale. Although Adam remains the dominant optimizer for large-scale language-model pretraining and fine-tuning, recent work has revisited nearly every component of the optimization stack: adaptive moment estimation, decoupled weight decay, memory footprint, curvature approximation, sign-based updates, large-batch stability, low-rank gradient structure, and matrix-wise orthogonalized updates. This survey reviews optimizer design for large language models through a systems-and-optimization lens. We organize the literature into classical first-order optimizers, adaptive optimizers, memory-efficient variants, second-order and curvature-aware methods, sign-based and discovered optimizers, low-rank and projection-based methods, and matrix-based optimizers such as Muon. We also discuss benchmarking methodology, including hyperparameter fairness, scale dependence, wall-clock efficiency, token efficiency, memory overhead, and downstream evaluation. We argue that optimizer research for LLMs is entering a new phase: moving from single-algorithm speedup claims toward rigorous, scale-aware comparisons that jointly evaluate convergence, stability, memory, and implementation complexity.
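To ground the taxonomy the abstract sketches, here are minimal single-tensor update rules for three of the families it names: adaptive moments with decoupled weight decay (AdamW), sign-based updates (in the spirit of signSGD), and matrix-wise orthogonalized updates (in the spirit of Muon). This is an illustrative sketch, not the survey's code: the hyperparameters are placeholders, the cubic Newton–Schulz iteration below is a simplified stand-in for Muon's tuned polynomial, and the orthogonalized step applies only to 2-D weight matrices.

```python
# Illustrative single-tensor update sketches for three surveyed optimizer families.
# All hyperparameters are placeholders; real implementations add schedules and
# distributed/sharded state handling.
import torch

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    """Adaptive moment estimation with decoupled weight decay (AdamW-style)."""
    m.mul_(b1).add_(g, alpha=1 - b1)                 # first moment
    v.mul_(b2).addcmul_(g, g, value=1 - b2)          # second moment
    m_hat = m / (1 - b1 ** t)                        # bias correction
    v_hat = v / (1 - b2 ** t)
    p.mul_(1 - lr * wd)                              # decoupled weight decay
    p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)

def sign_step(p, g, lr=1e-4):
    """Sign-based update: only the sign of the gradient drives the step."""
    p.add_(g.sign(), alpha=-lr)

def orthogonalized_step(p, g, lr=0.02, ns_iters=5):
    """Matrix-based update: push a 2-D gradient toward a semi-orthogonal matrix
    with a crude cubic Newton-Schulz iteration (a stand-in for Muon's polynomial)."""
    x = g / (g.norm() + 1e-7)                        # scale singular values into (0, 1]
    for _ in range(ns_iters):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    p.add_(x, alpha=-lr)

# Toy usage on a single weight matrix, one call each just to show the shapes.
p = torch.randn(64, 64)
g = torch.randn(64, 64)
m, v = torch.zeros_like(p), torch.zeros_like(p)
adamw_step(p, g, m, v, t=1)
sign_step(p, g)
orthogonalized_step(p, g)
```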
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey reviewing optimizer algorithms for large language models through a systems-and-optimization lens. It organizes prior work into classical first-order optimizers, adaptive optimizers, memory-efficient variants, second-order and curvature-aware methods, sign-based and discovered optimizers, low-rank and projection-based methods, and matrix-based optimizers (e.g., Muon). The paper also covers benchmarking methodology across axes such as hyperparameter fairness, scale dependence, wall-clock and token efficiency, memory overhead, and downstream evaluation, and argues that the field is shifting from isolated single-algorithm speedup claims to rigorous, scale-aware, multi-dimensional comparisons of convergence, stability, memory, and implementation complexity.
Significance. If the literature coverage proves representative, the survey provides a useful organizing framework for the rapidly expanding body of LLM optimizer research and flags the shortcomings of single-metric claims. The descriptive nature of the work (no new derivations, fitted parameters, or self-referential claims) avoids circularity and positions it as a potential reference for guiding future scale-aware benchmarking.
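One way to make the hyperparameter-fairness axis operational is an equal-budget tuning protocol: every optimizer receives the same number of tuning trials drawn from its own search space, and only each optimizer's best trial enters the comparison. The sketch below is hypothetical; the optimizer names, search ranges, budget, and the train_and_eval callable are placeholders, not a protocol taken from the manuscript.

```python
# Hypothetical equal-budget random-search protocol for fair optimizer comparison.
# Each optimizer gets the same trial budget over its own (illustrative) search space.
import math
import random

SEARCH_SPACES = {
    "adamw": {"lr": (1e-4, 3e-3), "weight_decay": (0.0, 0.3)},
    "lion":  {"lr": (3e-5, 1e-3), "weight_decay": (0.0, 1.0)},
    "muon":  {"lr": (5e-3, 5e-2), "weight_decay": (0.0, 0.3)},
}

def sample_config(space, rng):
    """Log-uniform learning rate, uniform weight decay."""
    lo, hi = space["lr"]
    lr = 10 ** rng.uniform(math.log10(lo), math.log10(hi))
    return {"lr": lr, "weight_decay": rng.uniform(*space["weight_decay"])}

def tune(train_and_eval, budget=16, seed=0):
    """Give every optimizer the same number of trials; report each one's best loss."""
    rng = random.Random(seed)
    best = {}
    for name, space in SEARCH_SPACES.items():
        losses = [train_and_eval(name, sample_config(space, rng)) for _ in range(budget)]
        best[name] = min(losses)
    return best

if __name__ == "__main__":
    # Stand-in for a real training run: returns a made-up validation loss.
    def fake_train_and_eval(name, cfg):
        return abs(math.log10(cfg["lr"]) + 3.0) + 0.1 * cfg["weight_decay"]
    print(tune(fake_train_and_eval, budget=8))
```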
Major comments (1)
- The central claim that optimizer research is entering a 'new phase' of scale-aware multi-dimensional evaluation depends on the surveyed literature being sufficiently representative; the manuscript should therefore state explicit search methodology, inclusion criteria, or date range for the reviewed papers (in the section that introduces the seven categories) so readers can assess potential gaps or selection bias.
Minor comments (2)
- The abstract lists seven categories but does not indicate their relative sizes or highlight one or two representative papers per category; adding a short summary table in the introduction would improve readability.
- In the benchmarking discussion, the distinction between 'wall-clock efficiency' and 'token efficiency' is introduced but not illustrated with a concrete example from the cited literature; a brief case study would clarify the practical difference.
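To illustrate the distinction the second comment asks for, here is a small hypothetical example; the loss curves and throughput numbers are invented rather than drawn from the cited literature. Token efficiency asks how many tokens an optimizer needs to reach a target loss, while wall-clock efficiency also charges for the time each step costs, so an optimizer with heavier per-step math can win on tokens and still lose on hours.

```python
# Invented numbers illustrating token efficiency vs. wall-clock efficiency.
# Optimizer A makes more progress per token but runs slower; B is the reverse.

def tokens_to_target(loss_curve, target):
    """Token efficiency: tokens consumed before validation loss first reaches target."""
    for tokens, loss in loss_curve:
        if loss <= target:
            return tokens
    return None

def hours_to_target(loss_curve, target, tokens_per_second):
    """Wall-clock efficiency: time to the same target at the measured throughput."""
    tokens = tokens_to_target(loss_curve, target)
    return None if tokens is None else tokens / tokens_per_second / 3600

curve_a = [(1e9, 3.2), (2e9, 2.9), (3e9, 2.7)]   # fewer tokens needed, 0.4M tokens/s
curve_b = [(1e9, 3.3), (2e9, 3.0), (4e9, 2.7)]   # more tokens needed, 0.8M tokens/s

print("A:", tokens_to_target(curve_a, 2.7), "tokens,", hours_to_target(curve_a, 2.7, 4e5), "hours")
print("B:", tokens_to_target(curve_b, 2.7), "tokens,", hours_to_target(curve_b, 2.7, 8e5), "hours")
# A is more token-efficient (3e9 < 4e9 tokens), yet B reaches the target sooner in wall-clock time.
```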
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback. The suggestion to document the literature search process is well-taken and will improve the transparency of the survey.
Point-by-point responses
Referee: The central claim that optimizer research is entering a 'new phase' of scale-aware multi-dimensional evaluation depends on the surveyed literature being sufficiently representative; the manuscript should therefore state explicit search methodology, inclusion criteria, or date range for the reviewed papers (in the section that introduces the seven categories) so readers can assess potential gaps or selection bias.
Authors: We agree that the claim of a shift toward scale-aware, multi-dimensional evaluation is stronger when readers can evaluate the scope of the surveyed literature. In the revised manuscript we will insert a short subsection immediately preceding the introduction of the seven categories. This subsection will state: (i) the primary search venues and databases used (arXiv, NeurIPS, ICML, ICLR, and major optimizer-related workshops), (ii) the time window (papers appearing between January 2018 and December 2024), (iii) inclusion criteria (works that propose or rigorously evaluate optimizers specifically for large-scale language-model training, with emphasis on memory, stability, or wall-clock considerations), and (iv) exclusion criteria (purely theoretical analyses without empirical LLM-scale results, or works focused exclusively on vision or reinforcement-learning domains). We believe this addition will allow readers to assess coverage and potential selection effects without altering the descriptive nature of the survey.
Revision: yes
Circularity Check
No significant circularity; purely descriptive survey with no derivations or self-referential reductions
Full rationale
The manuscript is a literature survey that organizes existing optimizer research into categories (classical first-order, adaptive, memory-efficient, second-order, sign-based, low-rank, matrix-based) and discusses benchmarking axes without presenting any original derivations, equations, fitted parameters, or predictive claims that could reduce to inputs by construction. No load-bearing steps rely on self-citation chains, ansatzes smuggled via prior work, or renaming of results; the central argument is a perspective on field evolution supported by external citations to prior literature. The paper is self-contained as a review and exhibits none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018, 2021. https://arxiv.org/abs/2002.09018
- [2] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 560–569. PMLR, 2018. https://proceedings.mlr.press/v80/bernstein18a.html
- [3] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018. https://doi.org/10.1137/16M1080173
- [4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020. https://arxiv.org/abs/2005.14165
- [5] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms. In Advances in Neural Information Processing Systems, 2023. https://arxiv.org/abs/2302.06675
- [6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023. https://jmlr.org/papers/v24/22-1144.html
- [7] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. In International Conference on Learning Representations, 2022. https://openreview.net/forum?id=shpkpVXzo3h
- [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019. https://aclanthology.org/N19-1423/
- [10] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. https://arxiv.org/abs/2407.21783
- [11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011. https://jmlr.org/papers/v12/duchi11a.html
- [12] Donald Goldfarb, Yi Ren, and Achraf Bahamou. Practical quasi-Newton methods for training deep neural networks. arXiv preprint arXiv:2006.08877, 2020. https://arxiv.org/abs/2006.08877
- [13] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1842–1850. PMLR, 2018. https://proceedings.mlr.press/v80/gupta18a.html
- [14] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023. https://arxiv.org/abs/2310.06825
- [15] Keller Jordan. Muon: An optimizer for hidden layers in neural networks. Blog post and software repository, 2024. https://kellerjordan.github.io/posts/muon/. Introduces MomentUm Orthogonalized by Newton–Schulz.
- [16] Gyu Yeol Kim and Min-hwan Oh. Convergence of Muon with Newton–Schulz. arXiv preprint arXiv:2601.19156, 2026. https://arxiv.org/abs/2601.19156
- [17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. https://arxiv.org/abs/1412.6980
- [18] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528, 1989. https://doi.org/10.1007/BF01589116
- [19] Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. In International Conference on Learning Representations, 2024. https://openreview.net/forum?id=3xHDeA8Noi
- [20] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is scalable for LLM training. arXiv preprint, 2025.
- [21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. https://openreview.net/forum?id=Bkg6RiCqY7
- [22] Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu. Full parameter fine-tuning for large language models with limited resources. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. https://aclanthology.org/2024.acl-long.445/
- [23] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018. https://openreview.net/forum?id=r1gs9JgRZ
- [24] Aryan Mokhtari and Alejandro Ribeiro. Global convergence of online limited memory BFGS. Journal of Machine Learning Research, 16:3151–3181, 2015. https://jmlr.org/papers/v16/mokhtari15a.html
- [25] Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, Binh Tang, Diana Liskovich, Puxin Xu, Yuchen Zhang, Melanie Kambadur, Stephen Roller, and Susan Zhang. A theory on Adam instability in large-scale machine learning. arXiv preprint arXiv:2304.09871, 2023. https://arxiv.org/abs/2304.09871
- [26] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
- [27] Yue Niu, Zalan Fabian, Sunwoo Lee, Mahdi Soltanolkotabi, and Salman Avestimehr. mL-BFGS: A momentum-based L-BFGS for distributed large-scale neural network optimization. arXiv preprint arXiv:2307.13744, 2023. https://arxiv.org/abs/2307.13744
- [28] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 2nd edition, 2006. https://doi.org/10.1007/978-0-387-40065-5
- [30] PyTorch Contributors. PyTorch DistributedDataParallel design note. https://docs.pytorch.org/docs/stable/notes/ddp.html, 2024. Accessed 2026-05-09.
- [31] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.
- [33] Aditya Ranganath, Mukesh Singhal, and Roummel Marcia. Symmetric rank-one quasi-Newton methods for deep learning using cubic regularization. arXiv preprint arXiv:2502.12298, 2025. https://arxiv.org/abs/2502.12298
- [34] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. https://openreview.net/forum?id=ryQu7f-RZ
- [35] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016. https://arxiv.org/abs/1609.04747
- [36] Nicol N. Schraudolph, Jin Yu, and Simon Günter. A stochastic quasi-Newton method for online convex optimization. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007. https://proceedings.mlr.press/v2/schraudolph07a.html
- [37] Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining. arXiv preprint arXiv:2509.01440, 2025. https://arxiv.org/abs/2509.01440
- [38] Alexander Sergeev and Mike Del Balso. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018. https://arxiv.org/abs/1802.05799
- [39] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR, 2018. https://proceedings.mlr.press/v80/shazeer18a.html
- [40] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. https://arxiv.org/abs/1909.08053
- [41] Shiliang Sun, Zehui Cao, Han Zhu, and Jing Zhao. A survey of optimization methods from a machine learning perspective. IEEE Transactions on Cybernetics, 50(8):3668–3681, 2020. https://arxiv.org/abs/1906.06821
- [42] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2023. https://doi.org/10.1145/3530811
- [43] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- [44] Akiyoshi Tomihari and Issei Sato. Understanding why Adam outperforms SGD: Gradient heterogeneity in transformers. arXiv preprint arXiv:2502.00213, 2025. https://arxiv.org/abs/2502.00213
- [45] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. https://arxiv.org/abs/2302.13971
- [47] Hugo Touvron et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. https://arxiv.org/abs/2307.09288
- [48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. https://arxiv.org/abs/1706.03762
- [49] Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhuoran Qu, Shen Yan, Yi Zhu, Quanlu Zhang, et al. A survey on efficient training of transformers. arXiv preprint arXiv:2302.01107, 2023. https://arxiv.org/abs/2302.01107
- [50] Yifei Wang et al. Stable Adam optimization for 16-bit neural networks training. arXiv preprint arXiv:2307.16189, 2023. https://arxiv.org/abs/2307.16189. (Author metadata should be verified before final submission.)
- [51] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020. https://openreview.net/forum?id=Syx4wnEtvH
- [52] Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Why transformers need Adam: A Hessian perspective. In Advances in Neural Information Processing Systems, 2024. https://proceedings.neurips.cc/paper_files/paper/2024/hash/ee0e45ff4de76cbfdf07015a7839f339-Abstract-Conference.html
- [54] Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use fewer learning rates to gain more. In International Conference on Learning Representations, 2025. https://openreview.net/forum?id=iBExhaU3Lc
- [55] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2024. https://proceedings.mlr.press/v235/zhao24s.html
- [56] Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham Kakade. Deconstructing what makes a good optimizer for autoregressive language models. In International Conference on Learning Representations, 2025. https://openreview.net/forum?id=0J9gL2DVO4
- [57] Ruizhe Zhao, Brian Vogel, and Tanvir Ahmed. Adaptive loss scaling for mixed precision training. arXiv preprint arXiv:1910.12385, 2019. https://arxiv.org/abs/1910.12385
- [58] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, and Shen Li. PyTorch FSDP: Experiences on scaling fully sharded data parallel. Proceedings of the VLDB Endowment, 16(12):3848–3860, 2023.
- [59] Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, and Jinwon Lee. APOLLO: SGD-like memory, AdamW-level performance. arXiv preprint arXiv:2412.05270, 2024. https://arxiv.org/abs/2412.05270