pith. machine review for the scientific record.

arxiv: 2403.03507 · v2 · submitted 2024-03-06 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 23:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords memory-efficient training · low-rank gradient projection · LLM pre-training · optimizer states · consumer GPU · full-parameter training

The pith

GaLore periodically projects full gradients onto low-rank subspaces, cutting optimizer-state memory by up to 65.5% while still training every parameter of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training method that projects gradients to a low-rank form at set intervals so that optimizer states consume far less memory. Unlike methods that freeze most weights and train only low-rank adapters, this approach still updates every model parameter yet stores only the compact projected information. Experiments on LLaMA models up to 7 billion parameters show that final model quality on pre-training and fine-tuning tasks stays comparable to standard full-rank training. The practical result is that a complete 7B pre-training run fits on a single 24GB consumer GPU.

Core claim

Gradient Low-Rank Projection reduces optimizer-state memory by up to 65.5% by decomposing gradients into low-rank factors whose bases are recomputed periodically, while the actual weight updates remain full rank and the resulting models match the quality of conventional training on both C4 pre-training and GLUE fine-tuning.

What carries the argument

Periodic low-rank projection of gradients, which decomposes each gradient matrix into a pair of low-rank factors stored by the optimizer while the weight update itself stays full rank.
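This mechanism can be sketched in a few lines. The following is illustrative only, not the authors' implementation: a numpy sketch of an Adam-style step where the moments live on the projected gradient, with hypothetical defaults for the rank r and refresh interval T, and where (for simplicity) moments are kept across basis refreshes.

```python
import numpy as np

def galore_adam_step(W, grad, state, step, r=4, T=10,
                     lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style update with GaLore-style gradient projection (sketch).

    Every T steps the projection basis P is refreshed from the SVD of the
    current gradient; Adam moments are kept only on the r x n projected
    gradient instead of the full m x n matrix.
    """
    if step % T == 0 or "P" not in state:
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :r]  # m x r basis spanning the top gradient directions
        state.setdefault("m", np.zeros((r, grad.shape[1])))
        state.setdefault("v", np.zeros((r, grad.shape[1])))
    P = state["P"]
    R = P.T @ grad  # r x n projected gradient: the only thing the optimizer stores
    state["m"] = b1 * state["m"] + (1 - b1) * R
    state["v"] = b2 * state["v"] + (1 - b2) * R ** 2
    m_hat = state["m"] / (1 - b1 ** (step + 1))
    v_hat = state["v"] / (1 - b2 ** (step + 1))
    # project the compact update back to full size; successive bases differ,
    # so the accumulated weight update is not confined to one low-rank subspace
    W -= lr * (P @ (m_hat / (np.sqrt(v_hat) + eps)))
    return W
```

The point of the sketch is the asymmetry it makes visible: the optimizer state is r x n, but the weight W receives a full m x n update at every step.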

If this is right

  • Pre-training a 7B model becomes feasible on a single 24GB GPU without model parallelism, checkpointing, or offloading.
  • An 8-bit version further reduces optimizer memory by up to 82.5% and total training memory by 63.3%.
  • Performance stays comparable to full-rank training across both pre-training and fine-tuning regimes.
  • No full-rank warm-start is required, unlike some low-rank adaptation approaches.
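The memory accounting behind these numbers can be made concrete with back-of-envelope arithmetic. The layer size and rank below are hypothetical, and the paper's 65.5% is a measured figure, not this formula:

```python
def optimizer_state_savings(m, n, r):
    """Fraction of Adam moment memory saved by keeping moments on the
    r x n projected gradient (plus the m x r basis) instead of m x n.
    Illustrative accounting only."""
    full = 2 * m * n             # Adam's first and second moments
    galore = m * r + 2 * r * n   # projection basis + projected moments
    return 1 - galore / full
```

For a square 4096 x 4096 layer with r = 128, the sketch gives roughly 95% savings on that layer's moments; the paper's end-to-end figures are lower because weights, activations, and unprojected layers still pay full price.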

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection idea could be applied to other first-order optimizers beyond Adam.
  • Lower memory use may allow larger batch sizes or longer context lengths on the same hardware.
  • Periodic recomputation opens the possibility of making the projection interval itself adaptive during training.

Load-bearing premise

That recomputing the low-rank bases at intervals keeps the projected gradients close enough to the original ones for the optimizer to reach models of comparable quality.

What would settle it

Run identical 7B pre-training on C4 with both GaLore and a full-memory baseline and compare final validation perplexity or downstream GLUE scores; a large gap would falsify the claim.

Original abstract

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GaLore, a training method that projects gradients onto a low-rank subspace via periodic SVD-based basis updates, enabling full-parameter optimization of LLMs with substantially reduced optimizer memory (up to 65.5% savings). It reports performance parity with standard Adam on LLaMA 1B/7B pre-training using up to 19.7B C4 tokens and on RoBERTa fine-tuning for GLUE, including the first claimed demonstration of 7B pre-training on a single 24GB consumer GPU without parallelism, checkpointing, or offloading. An 8-bit variant further reduces memory.

Significance. If the empirical results hold under rigorous controls, the work has clear significance for lowering hardware barriers to LLM pre-training. Demonstrating viable 7B-scale training on consumer GPUs directly addresses a practical bottleneck and could accelerate research in resource-limited settings. The approach's retention of full-rank parameter updates distinguishes it from adaptation methods like LoRA.

major comments (3)
  1. [§3] §3 (Method), Algorithm 1: The central mechanism relies on updating low-rank bases every T steps via SVD of the gradient matrix. No analysis or bounds are given on how quickly the gradient subspace evolves during 19.7B-token pre-training; if drift exceeds the update interval, the projected direction introduces accumulating bias relative to full gradients, undermining the claim of comparable optimization dynamics.
  2. [§4.1] §4.1 (Pre-training experiments): The reported performance parity for LLaMA 7B lacks ablations on the free parameters r (projection rank) and T (update frequency). Without these, it is impossible to determine whether the chosen values are robust or were tuned post-hoc to match baseline quality.
  3. [Table 2] Table 2 (Memory and performance): The 65.5% optimizer memory reduction and GLUE results are presented without explicit confirmation that all baselines (including 8-bit Adam) used identical learning-rate schedules, batch sizes, and warm-up protocols; any mismatch would invalidate the cross-method comparison.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'for the first time' for 7B single-GPU training should be qualified with a short citation or discussion of prior single-GPU attempts.
  2. [§4.3] §4.3 (8-bit variant): Clarify the interaction between 8-bit quantization and the low-rank projection step, including any additional error introduced.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

Point-by-point responses
  1. Referee: [§3] §3 (Method), Algorithm 1: The central mechanism relies on updating low-rank bases every T steps via SVD of the gradient matrix. No analysis or bounds are given on how quickly the gradient subspace evolves during 19.7B-token pre-training; if drift exceeds the update interval, the projected direction introduces accumulating bias relative to full gradients, undermining the claim of comparable optimization dynamics.

    Authors: We agree that providing analysis on the evolution of the gradient subspace would be valuable. While we do not provide theoretical bounds in the current manuscript, our experiments show that GaLore achieves performance comparable to full-rank training, indicating that the periodic updates with the selected T effectively capture the relevant subspace without significant bias accumulation. In the revised version, we will include an empirical analysis of the subspace drift, such as measuring the angle between successive low-rank bases over the course of training, to better justify the update frequency. revision: partial

  2. Referee: [§4.1] §4.1 (Pre-training experiments): The reported performance parity for LLaMA 7B lacks ablations on the free parameters r (projection rank) and T (update frequency). Without these, it is impossible to determine whether the chosen values are robust or were tuned post-hoc to match baseline quality.

    Authors: We appreciate this suggestion. The values of r and T were chosen based on preliminary experiments to balance memory savings and performance, but we acknowledge the need for more comprehensive ablations. In the revision, we will add ablation studies varying r and T for the LLaMA 7B pre-training, demonstrating the robustness of the results within reasonable ranges of these hyperparameters. revision: yes

  3. Referee: [Table 2] Table 2 (Memory and performance): The 65.5% optimizer memory reduction and GLUE results are presented without explicit confirmation that all baselines (including 8-bit Adam) used identical learning-rate schedules, batch sizes, and warm-up protocols; any mismatch would invalidate the cross-method comparison.

    Authors: We confirm that all methods, including the 8-bit Adam baseline, were trained using identical hyperparameters: the same learning rate schedule, batch size, and warm-up protocol as detailed in Section 4. To make this explicit, we will add a clarifying statement in the caption of Table 2 and in the experimental setup section of the revised manuscript. revision: yes
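The drift measurement proposed in the first response could be computed with principal angles. A sketch (not from the paper): the singular values of P_old^T P_new are the cosines of the principal angles between the two subspaces, so the smallest singular value gives the largest angle.

```python
import numpy as np

def subspace_drift(P_old, P_new):
    """Largest principal angle (radians) between two orthonormal bases.

    Sketch of the drift measurement the rebuttal proposes: near 0 means
    the gradient subspace barely moved over one refresh interval; near
    pi/2 means the new basis is almost orthogonal to the old one.
    """
    s = np.linalg.svd(P_old.T @ P_new, compute_uv=False)
    return float(np.arccos(np.clip(s.min(), -1.0, 1.0)))
```

Logging this quantity at every basis refresh would show directly whether the chosen T keeps successive subspaces aligned.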

Circularity Check

0 steps flagged

No significant circularity in GaLore derivation

Full rationale

The paper proposes GaLore as an algorithmic modification to the optimizer: gradients are projected onto a low-rank subspace obtained via periodic SVD of the gradient matrix, with bases updated every T steps. Memory savings (up to 65.5% in optimizer states) are direct measurements of reduced state sizes under BF16/8-bit quantization, not quantities derived from fitted constants or self-referential equations. Performance equivalence to full Adam is shown via empirical pre-training on LLaMA 1B/7B with C4 (19.7B tokens) and fine-tuning on GLUE; no derivation step reduces to a self-citation chain, ansatz smuggled via prior work, or renaming of known results. The central claim rests on measured quantities and experimental validation rather than any load-bearing self-definition or fitted-input prediction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on one central domain assumption and one tunable hyper-parameter; no new physical entities are introduced.

free parameters (1)
  • projection rank r
    Low-rank dimension chosen per layer or experiment; controls memory-performance tradeoff and must be set by the user.
axioms (1)
  • domain assumption Gradients admit a low-rank approximation that preserves sufficient directional information for effective Adam-style updates when the basis is refreshed periodically.
    This assumption is required for the memory reduction to not degrade final model quality.
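Whether a given gradient matrix actually satisfies this assumption can be checked empirically. A sketch (r here is whatever rank the user has picked): the fraction of the gradient's squared Frobenius norm captured by its top r singular values.

```python
import numpy as np

def lowrank_energy(grad, r):
    """Fraction of a gradient matrix's squared Frobenius norm captured
    by its top-r singular values; values near 1.0 support the
    low-rank-gradient assumption for this layer."""
    s = np.linalg.svd(grad, compute_uv=False)
    return float((s[:r] ** 2).sum() / (s ** 2).sum())
```

A value well below 1.0 for the chosen r would flag a layer where projection discards substantial gradient information.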

pith-pipeline@v0.9.0 · 5584 in / 1319 out tokens · 61164 ms · 2026-05-16T23:47:37.157229+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

    cs.LG 2026-05 unverdicted novelty 7.0

    Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

  2. Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

  3. Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

    math.OC 2026-05 unverdicted novelty 7.0

    Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

  4. BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.

  5. BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.

  6. Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

    cs.LG 2026-05 unverdicted novelty 6.0

    Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-...

  7. AdamO: A Collapse-Suppressed Optimizer for Offline RL

    cs.LG 2026-05 unverdicted novelty 6.0

    AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

  8. Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

    cs.LG 2026-04 unverdicted novelty 6.0

    Muon² adds adaptive second-moment preconditioning to Muon, improving spectrum conditioning for faster orthogonalization, outperforming Muon on GPT and LLaMA pre-training from 60M to 1.3B parameters while cutting Newto...

  9. STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training

    cs.LG 2026-04 unverdicted novelty 6.0

    STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.

  10. Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters

    cs.LG 2026-04 unverdicted novelty 6.0

    PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.

  11. Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction

    cs.LG 2026-04 conditional novelty 6.0

    SCT pre-trains LLMs by keeping weights as compact SVD factors with Stiefel QR retraction, delivering up to 199x memory reduction per layer and allowing 70B-parameter training on a Steam Deck.

  12. BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

    cs.LG 2025-12 unverdicted novelty 6.0

    BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.

  13. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  14. Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 5.0

    LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.

  15. Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 5.0

    LoPT delivers competitive LLM post-training results by training only the top half on the task objective and using feature reconstruction to update the bottom half.

  16. ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

    cs.LG 2026-05 unverdicted novelty 5.0

    ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.

  17. Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems

    cs.AI 2026-04 unverdicted novelty 4.0

    A Lingua Franca reactor-based method is proposed to address nondeterminism in agentic AI for human-in-the-loop cyber-physical systems such as driving coaches.

  18. MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

    cs.LG 2026-02 unverdicted novelty 4.0

    Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.

  19. AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control

    cs.LG 2025-12 unverdicted novelty 4.0

    AdaFRUGAL automates FRUGAL's static hyperparameters with linear decay on subspace ratio and loss-aware update frequency, delivering competitive accuracy with lower memory and faster training on C4, VietVault, and GLUE.

  20. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
