pith. machine review for the scientific record.

arxiv: 2403.03507 · v2 · submitted 2024-03-06 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 23:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords memory-efficient training · low-rank gradient projection · LLM pre-training · optimizer states · consumer GPU · full-parameter training

The pith

GaLore periodically projects full gradients onto low-rank subspaces, cutting optimizer-state memory by up to 65.5% while still training every parameter of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training method that projects gradients to a low-rank form at set intervals so that optimizer states consume far less memory. Unlike methods that freeze most weights and train only low-rank adapters, this approach still updates every model parameter yet stores only the compact projected information. Experiments on LLaMA models up to 7 billion parameters show that final model quality on pre-training and fine-tuning tasks stays comparable to standard full-rank training. The practical result is that a complete 7B pre-training run fits on a single 24GB consumer GPU.

Core claim

Gradient Low-Rank Projection reduces optimizer-state memory by up to 65.5% by decomposing gradients into low-rank factors whose bases are recomputed periodically, while the actual weight updates remain full rank and the resulting models match the quality of conventional training on both C4 pre-training and GLUE fine-tuning.

What carries the argument

Periodic low-rank projection of gradients, which decomposes each gradient matrix into a pair of low-rank factors stored by the optimizer while the weight update itself stays full rank.
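This mechanism can be sketched in a few lines. The following is illustrative only, not the authors' implementation: a numpy sketch of an Adam-style step where the moments live on the projected gradient, with hypothetical defaults for the rank r and refresh interval T, and where (for simplicity) moments are kept across basis refreshes.

```python
import numpy as np

def galore_adam_step(W, grad, state, step, r=4, T=10,
                     lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style update with GaLore-style gradient projection (sketch).

    Every T steps the projection basis P is refreshed from the SVD of the
    current gradient; Adam moments are kept only on the r x n projected
    gradient instead of the full m x n matrix.
    """
    if step % T == 0 or "P" not in state:
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :r]  # m x r basis spanning the top gradient directions
        state.setdefault("m", np.zeros((r, grad.shape[1])))
        state.setdefault("v", np.zeros((r, grad.shape[1])))
    P = state["P"]
    R = P.T @ grad  # r x n projected gradient: the only thing the optimizer stores
    state["m"] = b1 * state["m"] + (1 - b1) * R
    state["v"] = b2 * state["v"] + (1 - b2) * R ** 2
    m_hat = state["m"] / (1 - b1 ** (step + 1))
    v_hat = state["v"] / (1 - b2 ** (step + 1))
    # project the compact update back to full size; successive bases differ,
    # so the accumulated weight update is not confined to one low-rank subspace
    W -= lr * (P @ (m_hat / (np.sqrt(v_hat) + eps)))
    return W
```

The point of the sketch is the asymmetry it makes visible: the optimizer state is r x n, but the weight W receives a full m x n update at every step.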

If this is right

  • Pre-training a 7B model becomes feasible on a single 24GB GPU without model parallelism, checkpointing, or offloading.
  • An 8-bit version further reduces optimizer memory by up to 82.5% and total training memory by 63.3%.
  • Performance stays comparable to full-rank training across both pre-training and fine-tuning regimes.
  • No full-rank warm-start is required, unlike some low-rank adaptation approaches.
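The memory accounting behind these numbers can be made concrete with back-of-envelope arithmetic. The layer size and rank below are hypothetical, and the paper's 65.5% is a measured figure, not this formula:

```python
def optimizer_state_savings(m, n, r):
    """Fraction of Adam moment memory saved by keeping moments on the
    r x n projected gradient (plus the m x r basis) instead of m x n.
    Illustrative accounting only."""
    full = 2 * m * n             # Adam's first and second moments
    galore = m * r + 2 * r * n   # projection basis + projected moments
    return 1 - galore / full
```

For a square 4096 x 4096 layer with r = 128, the sketch gives roughly 95% savings on that layer's moments; the paper's end-to-end figures are lower because weights, activations, and unprojected layers still pay full price.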

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection idea could be applied to other first-order optimizers beyond Adam.
  • Lower memory use may allow larger batch sizes or longer context lengths on the same hardware.
  • Periodic recomputation opens the possibility of making the projection interval itself adaptive during training.

Load-bearing premise

That recomputing the low-rank bases at intervals keeps the projected gradients close enough to the original ones for the optimizer to reach models of comparable quality.

What would settle it

Run identical 7B pre-training on C4 with both GaLore and a full-memory baseline and compare final validation perplexity or downstream GLUE scores; a large gap would falsify the claim.

Original abstract

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GaLore, a training method that projects gradients onto a low-rank subspace via periodic SVD-based basis updates, enabling full-parameter optimization of LLMs with substantially reduced optimizer memory (up to 65.5% savings). It reports performance parity with standard Adam on LLaMA 1B/7B pre-training using up to 19.7B C4 tokens and on RoBERTa fine-tuning for GLUE, including the first claimed demonstration of 7B pre-training on a single 24GB consumer GPU without parallelism, checkpointing, or offloading. An 8-bit variant further reduces memory.

Significance. If the empirical results hold under rigorous controls, the work has clear significance for lowering hardware barriers to LLM pre-training. Demonstrating viable 7B-scale training on consumer GPUs directly addresses a practical bottleneck and could accelerate research in resource-limited settings. The approach's retention of full-rank parameter updates distinguishes it from adaptation methods like LoRA.

major comments (3)
  1. [§3] §3 (Method), Algorithm 1: The central mechanism relies on updating low-rank bases every T steps via SVD of the gradient matrix. No analysis or bounds are given on how quickly the gradient subspace evolves during 19.7B-token pre-training; if drift exceeds the update interval, the projected direction introduces accumulating bias relative to full gradients, undermining the claim of comparable optimization dynamics.
  2. [§4.1] §4.1 (Pre-training experiments): The reported performance parity for LLaMA 7B lacks ablations on the free parameters r (projection rank) and T (update frequency). Without these, it is impossible to determine whether the chosen values are robust or were tuned post-hoc to match baseline quality.
  3. [Table 2] Table 2 (Memory and performance): The 65.5% optimizer memory reduction and GLUE results are presented without explicit confirmation that all baselines (including 8-bit Adam) used identical learning-rate schedules, batch sizes, and warm-up protocols; any mismatch would invalidate the cross-method comparison.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'for the first time' for 7B single-GPU training should be qualified with a short citation or discussion of prior single-GPU attempts.
  2. [§4.3] §4.3 (8-bit variant): Clarify the interaction between 8-bit quantization and the low-rank projection step, including any additional error introduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

Point-by-point responses
  1. Referee: [§3] §3 (Method), Algorithm 1: The central mechanism relies on updating low-rank bases every T steps via SVD of the gradient matrix. No analysis or bounds are given on how quickly the gradient subspace evolves during 19.7B-token pre-training; if drift exceeds the update interval, the projected direction introduces accumulating bias relative to full gradients, undermining the claim of comparable optimization dynamics.

    Authors: We agree that providing analysis on the evolution of the gradient subspace would be valuable. While we do not provide theoretical bounds in the current manuscript, our experiments show that GaLore achieves performance comparable to full-rank training, indicating that the periodic updates with the selected T effectively capture the relevant subspace without significant bias accumulation. In the revised version, we will include an empirical analysis of the subspace drift, such as measuring the angle between successive low-rank bases over the course of training, to better justify the update frequency. revision: partial

  2. Referee: [§4.1] §4.1 (Pre-training experiments): The reported performance parity for LLaMA 7B lacks ablations on the free parameters r (projection rank) and T (update frequency). Without these, it is impossible to determine whether the chosen values are robust or were tuned post-hoc to match baseline quality.

    Authors: We appreciate this suggestion. The values of r and T were chosen based on preliminary experiments to balance memory savings and performance, but we acknowledge the need for more comprehensive ablations. In the revision, we will add ablation studies varying r and T for the LLaMA 7B pre-training, demonstrating the robustness of the results within reasonable ranges of these hyperparameters. revision: yes

  3. Referee: [Table 2] Table 2 (Memory and performance): The 65.5% optimizer memory reduction and GLUE results are presented without explicit confirmation that all baselines (including 8-bit Adam) used identical learning-rate schedules, batch sizes, and warm-up protocols; any mismatch would invalidate the cross-method comparison.

    Authors: We confirm that all methods, including the 8-bit Adam baseline, were trained using identical hyperparameters: the same learning rate schedule, batch size, and warm-up protocol as detailed in Section 4. To make this explicit, we will add a clarifying statement in the caption of Table 2 and in the experimental setup section of the revised manuscript. revision: yes
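The drift measurement proposed in the first response could be computed with principal angles. A sketch (not from the paper): the singular values of P_old^T P_new are the cosines of the principal angles between the two subspaces, so the smallest singular value gives the largest angle.

```python
import numpy as np

def subspace_drift(P_old, P_new):
    """Largest principal angle (radians) between two orthonormal bases.

    Sketch of the drift measurement the rebuttal proposes: near 0 means
    the gradient subspace barely moved over one refresh interval; near
    pi/2 means the new basis is almost orthogonal to the old one.
    """
    s = np.linalg.svd(P_old.T @ P_new, compute_uv=False)
    return float(np.arccos(np.clip(s.min(), -1.0, 1.0)))
```

Logging this quantity at every basis refresh would show directly whether the chosen T keeps successive subspaces aligned.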

Circularity Check

0 steps flagged

No significant circularity in GaLore derivation

Full rationale

The paper proposes GaLore as an algorithmic modification to the optimizer: gradients are projected onto a low-rank subspace obtained via periodic SVD of the gradient matrix, with bases updated every T steps. Memory savings (up to 65.5% in optimizer states) are direct measurements of reduced state sizes under BF16/8-bit quantization, not quantities derived from fitted constants or self-referential equations. Performance equivalence to full Adam is shown via empirical pre-training on LLaMA 1B/7B with C4 (19.7B tokens) and fine-tuning on GLUE; no derivation step reduces to a self-citation chain, ansatz smuggled via prior work, or renaming of known results. The central claim rests on measured quantities and experimental validation rather than any load-bearing self-definition or fitted-input prediction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on one central domain assumption and one tunable hyper-parameter; no new physical entities are introduced.

free parameters (1)
  • projection rank r
    Low-rank dimension chosen per layer or experiment; controls memory-performance tradeoff and must be set by the user.
axioms (1)
  • domain assumption Gradients admit a low-rank approximation that preserves sufficient directional information for effective Adam-style updates when the basis is refreshed periodically.
    This assumption is required for the memory reduction to not degrade final model quality.
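Whether a given gradient matrix actually satisfies this assumption can be checked empirically. A sketch (r here is whatever rank the user has picked): the fraction of the gradient's squared Frobenius norm captured by its top r singular values.

```python
import numpy as np

def lowrank_energy(grad, r):
    """Fraction of a gradient matrix's squared Frobenius norm captured
    by its top-r singular values; values near 1.0 support the
    low-rank-gradient assumption for this layer."""
    s = np.linalg.svd(grad, compute_uv=False)
    return float((s[:r] ** 2).sum() / (s ** 2).sum())
```

A value well below 1.0 for the chosen r would flag a layer where projection discards substantial gradient information.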

pith-pipeline@v0.9.0 · 5584 in / 1319 out tokens · 61164 ms · 2026-05-16T23:47:37.157229+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

    cs.LG 2026-05 unverdicted novelty 7.0

    Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

  2. Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

  3. Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

    math.OC 2026-05 unverdicted novelty 7.0

    Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

  4. BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.

  5. BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.

  6. Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

    cs.LG 2026-05 unverdicted novelty 6.0

    Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-...

  7. AdamO: A Collapse-Suppressed Optimizer for Offline RL

    cs.LG 2026-05 unverdicted novelty 6.0

    AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

  8. Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

    cs.LG 2026-04 unverdicted novelty 6.0

    Muon² adds adaptive second-moment preconditioning to Muon, improving spectrum conditioning for faster orthogonalization, outperforming Muon on GPT and LLaMA pre-training from 60M to 1.3B parameters while cutting Newto...

  9. STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training

    cs.LG 2026-04 unverdicted novelty 6.0

    STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.

  10. Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters

    cs.LG 2026-04 unverdicted novelty 6.0

    PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.

  11. Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction

    cs.LG 2026-04 conditional novelty 6.0

    SCT pre-trains LLMs by keeping weights as compact SVD factors with Stiefel QR retraction, delivering up to 199x memory reduction per layer and allowing 70B-parameter training on a Steam Deck.

  12. BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

    cs.LG 2025-12 unverdicted novelty 6.0

    BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.

  13. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  14. Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 5.0

    LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.

  15. Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 5.0

    LoPT delivers competitive LLM post-training results by training only the top half on the task objective and using feature reconstruction to update the bottom half.

  16. ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

    cs.LG 2026-05 unverdicted novelty 5.0

    ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.

  17. Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems

    cs.AI 2026-04 unverdicted novelty 4.0

    A Lingua Franca reactor-based method is proposed to address nondeterminism in agentic AI for human-in-the-loop cyber-physical systems such as driving coaches.

  18. MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

    cs.LG 2026-02 unverdicted novelty 4.0

    Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.

  19. AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control

    cs.LG 2025-12 unverdicted novelty 4.0

    AdaFRUGAL automates FRUGAL's static hyperparameters with linear decay on subspace ratio and loss-aware update frequency, delivering competitive accuracy with lower memory and faster training on C4, VietVault, and GLUE.

  20. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
