
arxiv: 2605.13652 · v2 · pith:75CWPNKQnew · submitted 2026-05-13 · 💻 cs.LG · cs.AI · cs.CL

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Pith reviewed 2026-05-20 20:19 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords low-rank pre-training · loss landscape · spectral analysis · language models · activation similarity · geometric basins · full-rank training · downstream performance

The pith

Low-rank pre-training methods converge to geometrically distinct loss basins compared to full-rank training, even when validation perplexity matches closely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether low-rank pre-training approaches for language models reach the same solutions as full-rank training. It evaluates five low-rank methods against full-rank training across three model scales using sixteen metrics that examine loss landscapes, spectral properties of weights and updates, and activation patterns. The results show that these methods land in different regions of the loss surface and develop distinct internal representations. This matters because relying only on perplexity can mask important differences in how the models generalize and represent information.

Core claim

Low-rank pre-training methods, including GaLore, Fira, CoLA, SLTrain, and ReLoRA, are not equivalent to full-rank training or to each other. Along random directions, full-rank training settles into a sharper basin than the low-rank methods; along the top-1 PCA direction the ordering reverses. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Validation perplexity does not always translate to downstream performance, and adding geometric and spectral metrics improves the prediction of that performance.

What carries the argument

The argument rests on 1-D loss landscape analysis along random and top-K PCA directions, together with the spectral structure of weights and updates and activation similarity to full-rank training. These metrics expose geometric and spectral distinctions between the solutions found by different methods.
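
A minimal sketch of how a 1-D loss slice along random directions can be computed, assuming NumPy and a toy quadratic loss in place of a trained language model. The function names, the toy loss, the step range, and the sharpness definition (largest loss rise within the sampled range, averaged over directions) are illustrative assumptions, not the paper's implementation; only the use of 100 random directions follows the Figure 6 caption.

```python
import numpy as np

def loss_slice(loss_fn, theta, direction, alphas):
    """Centered profile L(theta + a * d) - L(theta) for each step size a."""
    base = loss_fn(theta)
    return np.array([loss_fn(theta + a * direction) - base for a in alphas])

def random_unit_direction(dim, rng):
    """Sample a Gaussian direction and scale it to unit length."""
    d = rng.standard_normal(dim)
    return d / np.linalg.norm(d)

def average_sharpness(loss_fn, theta, alphas, n_dirs=100, seed=0):
    """Assumed definition: mean over directions of the largest loss rise."""
    rng = np.random.default_rng(seed)
    rises = []
    for _ in range(n_dirs):
        d = random_unit_direction(theta.size, rng)
        rises.append(loss_slice(loss_fn, theta, d, alphas).max())
    return float(np.mean(rises))

# Toy stand-in for a trained model: a quadratic bowl with known curvatures.
curvatures = np.array([10.0, 1.0, 0.1])

def toy_loss(theta):
    return 0.5 * float(np.sum(curvatures * theta ** 2))

theta_star = np.zeros(3)                 # the minimum of the toy loss
alphas = np.linspace(-0.002, 0.002, 21)  # arbitrary illustrative step range

# Slice along the steepest axis, then the direction-averaged sharpness.
profile = loss_slice(toy_loss, theta_star, np.array([1.0, 0.0, 0.0]), alphas)
sharpness = average_sharpness(toy_loss, theta_star, alphas)
```

A slice along a PCA direction, however the paper derives it, could be passed to `loss_slice` in the same way as the fixed axis used for `profile`.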

If this is right

  • Perplexity alone cannot be used to claim that low-rank methods produce models comparable to full-rank training.
  • Each low-rank technique reaches a unique basin geometry, so method choice affects the final solution properties.
  • Activation divergence in later layers suggests internal representations differ even when surface metrics look similar.
  • Geometric and spectral metrics can supplement perplexity to better forecast downstream performance.
  • Low-rank methods are not interchangeable with each other for the same reason they differ from full-rank training.
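
The activation-divergence point above can be made concrete with the three similarity measures named in the figure captions (L2 distance, cosine similarity, linear CKA). The following is a hedged sketch assuming NumPy and synthetic activations; the function names and the toy data are illustrative, and the linear CKA formula follows the standard definition from Kornblith et al. (2019) rather than any code from the paper.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) activation matrices."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    self_x = np.linalg.norm(x.T @ x, "fro")
    self_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (self_x * self_y))

def mean_cosine(x, y):
    """Average per-sample cosine similarity between matching rows."""
    num = np.sum(x * y, axis=1)
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    return float(np.mean(num / den))

def mean_l2(x, y):
    """Average per-sample Euclidean distance between matching rows."""
    return float(np.mean(np.linalg.norm(x - y, axis=1)))

# Synthetic stand-ins for one layer's activations (hypothetical data).
rng = np.random.default_rng(0)
full_rank_acts = rng.standard_normal((64, 16))
rotation, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random orthogonal map
rotated_acts = full_rank_acts @ rotation
unrelated_acts = rng.standard_normal((64, 16))

cka_rotated = linear_cka(full_rank_acts, rotated_acts)
cos_rotated = mean_cosine(full_rank_acts, rotated_acts)
cka_unrelated = linear_cka(full_rank_acts, unrelated_acts)
```

Linear CKA is invariant to orthogonal transformations of the feature space, so the rotated copy still scores close to 1 while its per-sample cosine similarity drops. This is one reason the three measures can disagree for the same pair of models.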

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • At larger model scales the geometric distinctions might become more or less pronounced, affecting which method is preferable for specific tasks.
  • The distinct basins could influence model robustness or transfer to new domains in ways not captured by the current metrics.
  • Practitioners might select a low-rank method based on desired spectral characteristics rather than assuming all low-rank approaches behave alike.
  • Future experiments could test whether these geometric differences persist or change when models are fine-tuned on downstream data.

Load-bearing premise

That observed differences in loss landscape geometry, spectral properties, and activations reliably indicate non-equivalent generalization and internal representations across methods.

What would settle it

A demonstration that low-rank and full-rank models achieve identical downstream task performance and identical layer-wise activation distributions at multiple scales despite the reported geometric and spectral differences.

Figures

Figures reproduced from arXiv: 2605.13652 by Anna Rumshisky, Namrata Shivagunde, Sherin Muckatira, Vijeta Deshpande.

Figure 1: 1-D loss landscape along (a) a random direction and (b) the top-1 PCA direction. GaLore, CoLA and ReLoRA converge to a sharper basin than full-rank. GaLore has a relatively small σ1 at every scale (∼3 throughout training) yet still produces moderate-to-high sharpness (∼0.005−0.007): the loss rises substantially, indicating a very steep loss landscape along its leading direction. A similar pattern is seen in CoLA and ReL…
Figure 2: 1-D interpolation: (a) CCBH, (b) IMBH. … and ReLoRA exhibit relatively low mutual barriers. SLTrain, by contrast, shows substantially higher barriers against all other low-rank methods, placing it in a distinctly separate valley. At 130M and 350M, the full-rank versus low-rank barriers decrease, and the low-rank versus low-rank barriers increase. Fira and ReLoRA retain the smallest mutual barriers, GaLore occupies an intermed…
Figure 3: Rank and spectral metrics at 350M. … the deviation, and Fira benefits from this more than any other method. Per-layer dynamics (Row 3): for both Fira and CoLA, L2 distance grows in the later layers as training progresses, with the final layer reducing drift relative to full-rank. CoLA is directionally off at every layer (cos ≈ 0), while Fira preserves angular alignment. Linear CKA deviates most in the middle…
Figure 4: Activation deviation from the full-rank baseline.
Figure 5: Zero-shot downstream performance.

Predictor                        LOSO Pearson   LOMO Pearson   R² (in-sample)
val loss only                    0.873          0.864          0.841
geometry only (8 feats)          0.498          0.431          0.558
val loss + geometry (9 feats)    0.913          0.895          0.907
Figure 6: 1-D loss landscape at 60M parameters. Top: centered loss profile L(α) − L(0) averaged over 100 random directions, at five training checkpoints (1k, 3k, 5k, 8k, 10k). Bottom-left: average sharpness with respect to training step. Bottom-right: average direction variance. ReLoRA is omitted from the plot because it makes the other methods harder to see; the plot including ReLoRA is given in Figure 7.
Figure 7: 1-D loss landscape at 60M parameters for all methods. Top: centered loss profile L(α) − L(0) averaged over 100 random directions, at five training checkpoints (1k, 3k, 5k, 8k, 10k). Bottom-left: average expected sharpness with respect to training step. Bottom-right: average direction variance.
Figure 8: 1-D loss landscape at 130M parameters. Same layout as …
Figure 9: 1-D loss landscape at 130M parameters for …
Figure 10: 1-D loss landscape at 350M parameters. Same layout as Figure 6. Checkpoint panels: steps 6000, 12000, 24000, 48000, and 60000.
Figure 11: 1-D loss landscape at 350M parameters for …
Figure 12: 1-D loss landscape along top-k PCA directions at 60M parameters for k ∈ {1, 5, 10, 20}. Top four rows: centered loss profile at five training checkpoints. Bottom: expected sharpness (left column of summary panels) and across-component direction variance (right column) as a function of training step, per k.
Figure 13: 1-D loss landscape along top-k PCA directions at 130M parameters for k ∈ {1, 5, 10, 20}. Layout identical to …
Figure 14: 1-D loss landscape along top-k PCA directions at 350M parameters for k ∈ {1, 5, 10, 20}. Layout identical to …
Figure 15: Rank and spectral metrics for 60M. Panels plot, against training step: effective rank of W, stable rank of W, spectral gap of W, and # > 0.1 of W …
Figure 16: Rank and spectral metrics for 130M.
Figure 17: Activation L2 distance, layer-wise.
Figure 18: Activation linear CKA similarity.
Figure 19: Activation cosine similarity, layer-wise.
Original abstract

Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.
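
The "1-D interpolation between checkpoints" dimension mentioned in the abstract is commonly measured as a loss barrier along the straight line between two solutions. The sketch below assumes NumPy, a toy double-well loss, and one common barrier definition (largest excess of the path loss over the linear interpolation of the endpoint losses, in the spirit of linear mode connectivity). The function names and the toy loss are hypothetical, and the paper's exact barrier metrics may be defined differently.

```python
import numpy as np

def interpolation_barrier(loss_fn, theta_a, theta_b, n_points=21):
    """Largest rise of the loss above the straight line joining the endpoint losses."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    baseline = (1 - alphas) * path[0] + alphas * path[-1]
    return float(np.max(path - baseline)), alphas, path

def double_well(theta):
    """Toy loss with two minima at (-1, 0) and (+1, 0), separated by a ridge."""
    return float((theta[0] ** 2 - 1.0) ** 2 + theta[1] ** 2)

left_min = np.array([-1.0, 0.0])
right_min = np.array([1.0, 0.0])
nearby = np.array([1.2, 0.0])  # a point in the same valley as right_min

# Two solutions in different valleys: the path crosses the ridge at the origin.
barrier_across, alphas, path = interpolation_barrier(double_well, left_min, right_min)

# Two points in the same valley: the path never rises above the baseline.
barrier_within, _, _ = interpolation_barrier(double_well, right_min, nearby)
```

A barrier near zero suggests the two points sit in one connected low-loss region under this definition; a large barrier is the kind of evidence used to place methods in separate valleys.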

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper empirically compares five low-rank pre-training methods (GaLore, Fira, CoLA, SLTrain, ReLoRA) against full-rank training on language models at 60M, 130M, and 350M scales. It uses 16 metrics spanning 1-D loss-landscape slices (random and top-K PCA directions), checkpoint interpolation, spectral properties of weights/updates, and activation similarities to argue that low-rank methods reach geometrically distinct basins from full-rank and from each other, even at matched validation perplexity, and that perplexity alone is a poor predictor of downstream performance.

Significance. If the geometric and spectral distinctions hold under replication, the work provides a useful demonstration that perplexity matching does not imply solution equivalence in low-rank LLM training. The multi-scale design and breadth of metrics (loss curvature, spectral structure, activation divergence) offer a concrete template for more informative method comparisons beyond scalar performance numbers.

major comments (2)
  1. [Experimental results and evaluation protocol] The central non-equivalence claim rests on observed differences in 1-D loss-landscape curvature (random vs. top-1 PCA directions) and layer-wise activation divergence. However, the experiments appear to follow the single-seed protocol critiqued in the abstract for prior work; no variance across independent runs, standard errors, or statistical tests are reported. This is load-bearing because stochastic effects from initialization, data ordering, or optimizer state could produce the reported basin differences without any systematic effect of the rank constraint.
  2. [Downstream evaluation and metric utility] The statement that 'adding geometric and spectral metrics improves the prediction' of downstream performance is presented without quantitative details on the regression or classification setup, the specific downstream tasks, or the improvement magnitude at each scale. This weakens the practical takeaway that the 16-metric suite is superior to perplexity alone.
minor comments (2)
  1. [Metric definitions] Notation for the 16 metrics is introduced in the abstract but would benefit from an explicit table or appendix listing each metric, its mathematical definition, and the exact directions or layers used.
  2. [Figures] Figure captions for the loss-landscape slices should state the number of points sampled along each direction and the step size to allow direct reproduction.
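
As an illustration of the kind of explicit definition the first minor comment asks for, here is a hedged sketch of two rank metrics whose names appear in the figure axes ("effective rank of W", "stable rank of W"). It assumes NumPy and uses widely used textbook definitions (entropy-based effective rank, and squared Frobenius norm over squared spectral norm for stable rank). These are assumptions and may not match the definitions the paper actually uses.

```python
import numpy as np

def singular_values(w):
    """Singular values of a weight matrix, in descending order."""
    return np.linalg.svd(w, compute_uv=False)

def effective_rank(w, eps=1e-12):
    """exp(Shannon entropy) of the normalized singular value distribution."""
    s = singular_values(w)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries before taking the log
    return float(np.exp(-np.sum(p * np.log(p))))

def stable_rank(w):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = singular_values(w)
    return float(np.sum(s ** 2) / s[0] ** 2)

# Two reference matrices with known answers (illustrative only).
identity = np.eye(8)                                     # all directions equal
rank_one = np.outer(np.arange(1.0, 9.0), np.ones(8))     # a single direction

er_identity = effective_rank(identity)
er_rank_one = effective_rank(rank_one)
sr_identity = stable_rank(identity)
sr_rank_one = stable_rank(rank_one)
```

Both metrics equal the matrix dimension for the identity and collapse to 1 for a rank-one matrix, which makes these two cases a convenient sanity check before applying the functions to real weights or updates.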

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Experimental results and evaluation protocol] The central non-equivalence claim rests on observed differences in 1-D loss-landscape curvature (random vs. top-1 PCA directions) and layer-wise activation divergence. However, the experiments appear to follow the single-seed protocol critiqued in the abstract for prior work; no variance across independent runs, standard errors, or statistical tests are reported. This is load-bearing because stochastic effects from initialization, data ordering, or optimizer state could produce the reported basin differences without any systematic effect of the rank constraint.

    Authors: We agree that reliance on single-seed runs is a limitation, especially since the manuscript itself critiques this practice in prior work. Training at these scales is computationally intensive, which constrained our initial experimental design. In the revised manuscript we will add results from at least three independent random seeds for the 60M and 130M models, reporting means and standard deviations for the primary geometric and spectral metrics. For the 350M scale we will explicitly acknowledge the single-seed constraint and discuss its implications for the strength of the claims. These additions will help demonstrate that the observed basin differences are not attributable solely to stochastic variation. revision: yes

  2. Referee: [Downstream evaluation and metric utility] The statement that 'adding geometric and spectral metrics improves the prediction' of downstream performance is presented without quantitative details on the regression or classification setup, the specific downstream tasks, or the improvement magnitude at each scale. This weakens the practical takeaway that the 16-metric suite is superior to perplexity alone.

    Authors: We thank the referee for highlighting this gap. The current manuscript states the predictive improvement at a high level without the supporting experimental details. In the revision we will expand the relevant section to specify the downstream tasks, the regression setup (including the model type and feature sets), the quantitative gains (e.g., changes in R² or prediction error) when geometric and spectral metrics are added versus perplexity alone, and results disaggregated by model scale. This will make the claim concrete and reproducible. revision: yes
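
One way the predictor comparison discussed in this exchange could be set up is a held-out correlation between predicted and observed downstream scores, with and without geometry features. The sketch below assumes NumPy, ordinary least squares, and entirely synthetic data. The grouping scheme, feature counts, and variable names are hypothetical, and the source does not expand the LOSO and LOMO acronyms used in the Figure 5 table, so the "leave one group out" reading here is an assumption.

```python
import numpy as np

def fit_predict(x_train, y_train, x_test):
    """Ordinary least squares with an intercept; returns test predictions."""
    a_train = np.column_stack([np.ones(len(x_train)), x_train])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    a_test = np.column_stack([np.ones(len(x_test)), x_test])
    return a_test @ coef

def leave_one_group_out_pearson(x, y, groups):
    """Pearson r between held-out predictions and targets, one group held out at a time."""
    preds = np.empty_like(y, dtype=float)
    for g in np.unique(groups):
        held = groups == g
        preds[held] = fit_predict(x[~held], y[~held], x[held])
    return float(np.corrcoef(preds, y)[0, 1])

# Synthetic stand-in data: 18 hypothetical runs in 3 groups of 6.
rng = np.random.default_rng(0)
n_runs = 18
groups = np.repeat(np.arange(3), 6)
val_loss = rng.normal(3.5, 0.3, n_runs)
geometry = rng.standard_normal((n_runs, 2))  # two made-up geometry features

# By construction the score depends on validation loss and one geometry feature.
score = -2.0 * val_loss + 0.5 * geometry[:, 0] + rng.normal(0.0, 0.05, n_runs)

r_loss_only = leave_one_group_out_pearson(val_loss[:, None], score, groups)
r_loss_plus_geometry = leave_one_group_out_pearson(
    np.column_stack([val_loss, geometry]), score, groups
)
```

Because the synthetic score is built to depend on a geometry feature, the combined predictor wins here by design; the sketch only shows the evaluation mechanics, not evidence for the paper's claim.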

Circularity Check

0 steps flagged

Empirical comparison of low-rank methods uses independent metrics with no definitional or fitted circularity

full rationale

The paper reports direct empirical measurements on trained models: 1-D loss landscape slices along random and PCA directions, spectral analysis of weights/updates, activation similarities, and downstream performance. These are applied as separate evaluation procedures to full-rank and low-rank runs at matched perplexity. No step defines a quantity in terms of another that is then re-used as a prediction, no fitted parameters are renamed as forecasts, and no uniqueness theorems or ansatzes are imported via self-citation. The non-equivalence conclusion follows from the observed differences across the 16 metrics rather than from any reduction to the input training configurations by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is an empirical comparative study and introduces no new free parameters, mathematical axioms, or postulated entities beyond standard assumptions of gradient-based optimization and the chosen evaluation metrics.

axioms (1)
  • domain assumption: Gradient-based optimization on the chosen loss landscape produces representative solutions for the model scales tested.
    Invoked when interpreting differences in loss landscape geometry and activation patterns as meaningful distinctions between training regimes.




A More details on metrics

We provide more details on the metrics in this section.

A.1 Loss landscape related metrics

The direction variance equation is given below:

DV = (1/(2N)) Σ_{j=1}^{N} [σ² …