arxiv: 2604.02659 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AI· cs.NA· math.NA· stat.ML

Recognition: 2 theorem links

· Lean Theorem

Low-Rank Compression of Pretrained Models via Randomized Subspace Iteration

Farhad Pourkamali-Anaraki

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.NAmath.NAstat.ML

keywords low-rank compressionrandomized subspace iterationpretrained modelsmodel compressionrandomized SVDsoftmax perturbationstransformer architecturesconvolutional networks

0 comments

The pith

Randomized subspace iteration provides superior low-rank compression for pretrained models compared to randomized SVD through better spectral separation and bounded softmax errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard randomized SVD struggles with slowly decaying singular values common in large models. It proposes randomized subspace iteration, which adds power iterations to better separate the dominant singular components. This leads to lower approximation error that directly controls how much the softmax class probabilities deviate from the original model. Experiments on convolutional and transformer networks confirm that RSI maintains higher predictive accuracy even when compressing aggressively.

Core claim

By linking low-rank approximation error to perturbations in softmax probabilities, the work establishes that randomized subspace iteration, through multiple power iterations, achieves near-optimal approximation quality and outperforms randomized SVD in predictive accuracy for compressed pretrained models.

What carries the argument

Randomized subspace iteration (RSI) that incorporates multiple power iterations to improve spectral separation in low-rank matrix approximations of model weights.

If this is right

RSI offers a tunable parameter via the number of power iterations to balance computation and approximation quality.
The bound on softmax probability deviations provides a theoretical guarantee connecting matrix error to output changes.
Compression via RSI works across both convolutional networks and transformer architectures.
Under aggressive compression, RSI preserves more of the original model's predictive performance than RSVD.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending RSI to other layers or non-linear activations beyond softmax could broaden its applicability.
Testing RSI on even larger models like those with billions of parameters would validate scalability.
The approach might combine with other compression techniques such as quantization for further gains.

Load-bearing premise

The derived theoretical bound on softmax deviations from spectral error actually produces measurable gains in real predictive accuracy for the models and compression ratios tested.

What would settle it

A controlled experiment where increasing the number of power iterations in RSI reduces spectral error but fails to improve or even worsens predictive accuracy on a held-out test set.

Figures

Figures reproduced from arXiv: 2604.02659 by Farhad Pourkamali-Anaraki.

**Figure 1.1.** Figure 1.1: Singular value spectrum and normalized spectral error for a layer of VGG with size 4096 [PITH_FULL_IMAGE:figures/full_fig_p002_1_1.png] view at source ↗

**Figure 4.** Figure 4: (b) reports the runtime comparison. Despite the relatively small matrix size, RSI remains [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 4.1.** Figure 4.1: Normalized error and runtime for low-rank approximation of a single layer in VGG19, comparing SVD [PITH_FULL_IMAGE:figures/full_fig_p009_4_1.png] view at source ↗

**Figure 4.2.** Figure 4.2: Normalized error and runtime for low-rank approximation of a single layer in ViT. [PITH_FULL_IMAGE:figures/full_fig_p009_4_2.png] view at source ↗

read the original abstract

The massive scale of pretrained models has made efficient compression essential for practical deployment. Low-rank decomposition based on the singular value decomposition (SVD) provides a principled approach for model reduction, but its exact computation is expensive for large weight matrices. Randomized alternatives such as randomized SVD (RSVD) improve efficiency, yet they can suffer from poor approximation quality when the singular value spectrum decays slowly, a regime commonly observed in modern pretrained models. In this work, we address this limitation from both theoretical and empirical perspectives. First, we establish a connection between low-rank approximation error and predictive performance by analyzing softmax perturbations, showing that deviations in class probabilities are controlled by the spectral error of the compressed weights. Second, we demonstrate that RSVD is inadequate, and we propose randomized subspace iteration (RSI) as a more effective alternative. By incorporating multiple power iterations, RSI improves spectral separation and provides a controllable mechanism for enhancing approximation quality. We evaluate our approach on both convolutional networks and transformer-based architectures. Our results show that RSI achieves near-optimal approximation quality while outperforming RSVD in predictive accuracy under aggressive compression, enabling efficient model compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RSI with a few power iterations gives better low-rank approximations than plain RSVD for pretrained weights and ties the error to softmax probability shifts, but the bound may not fully explain the accuracy gains.

read the letter

The main takeaway is that this paper shows randomized subspace iteration with multiple power steps produces stronger low-rank approximations than standard randomized SVD for the weight matrices in large pretrained models, especially when singular values decay slowly. They add an analysis that bounds how spectral error affects deviations in softmax class probabilities, which they use to argue for better predictive accuracy under compression. Evaluations on convolutional nets and transformers back the claim that RSI gets closer to optimal quality and higher accuracy than RSVD at aggressive ranks. The practical contribution is real: the extra iterations are a low-cost way to improve spectral separation without needing the full SVD. The softmax perturbation link is a reasonable attempt to connect approximation quality directly to output behavior rather than stopping at Frobenius or spectral norms. That said, the bound's usefulness depends on how tight it stays once errors propagate through stacked layers and whether pretrained matrices actually satisfy the spectral gap assumptions needed for the power iterations to deliver the predicted separation. If the analysis only holds for isolated layers or loose constants, the observed accuracy edge could be coming mostly from the better randomization rather than the claimed mechanism. The abstract does not include error bars, exact compression ratios, or layer-wise breakdowns, so the strength of the empirical case is still unclear without the full details. This is worth a serious referee for anyone working on efficient inference or randomized compression methods. The algorithmic idea is simple and targeted, and the error-to-probability step is a step in the right direction even if it needs more scrutiny. I would send it out for review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper claims that randomized subspace iteration (RSI) improves low-rank compression of pretrained models over randomized SVD (RSVD) by using multiple power iterations to achieve better spectral separation. It establishes a theoretical link showing that low-rank spectral approximation errors control deviations in softmax class probabilities, and empirically demonstrates that RSI yields near-optimal approximations and superior predictive accuracy under aggressive compression on convolutional networks and transformers.

Significance. If the softmax perturbation bound is shown to be predictive rather than merely existent and the empirical gains are reproducible, the work supplies a practical, controllable compression technique grounded in randomized linear algebra that directly ties approximation quality to downstream model performance. The explicit bridge from spectral error to probability deviations is a notable strength.

major comments (2)

[Theoretical analysis] Theoretical analysis (softmax perturbation section): the derived bound linking spectral approximation error to softmax probability deviations is presented for an isolated linear layer; the manuscript does not demonstrate tightness numerically or address error accumulation across stacked compressed layers and nonlinearities, which is load-bearing for attributing RSI's reported accuracy gains to the claimed mechanism rather than other factors.
[Experimental results] Experimental results section: outperformance of RSI over RSVD in predictive accuracy is reported at aggressive compression levels, yet the text lacks error bars across runs, explicit values for the number of power iterations, oversampling parameters, and compression ratios used; without these the reproducibility and robustness of the central empirical claim cannot be assessed.

minor comments (1)

[Abstract] Abstract: the phrase 'near-optimal approximation quality' is used without a quantitative definition relative to exact SVD; a brief clarification would improve precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, outlining the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Theoretical analysis] Theoretical analysis (softmax perturbation section): the derived bound linking spectral approximation error to softmax probability deviations is presented for an isolated linear layer; the manuscript does not demonstrate tightness numerically or address error accumulation across stacked compressed layers and nonlinearities, which is load-bearing for attributing RSI's reported accuracy gains to the claimed mechanism rather than other factors.

Authors: The bound is derived for an isolated linear layer to isolate and clearly establish the direct mechanistic link between spectral approximation error and deviations in softmax class probabilities, providing a clean theoretical foundation without the confounding effects of the full network architecture. We agree that numerical demonstration of tightness and analysis of error accumulation across layers would further support attributing the observed accuracy gains specifically to this mechanism. In the revised manuscript we will add numerical experiments validating the bound's tightness on representative layers extracted from the evaluated CNNs and transformers, along with a discussion of error propagation that includes additional ablation results showing that per-layer RSI improvements consistently translate to end-to-end accuracy gains. A full multi-layer theoretical accumulation analysis lies beyond the scope of the present work, but the single-layer bound supplies the essential insight that motivates the RSI approach. revision: partial
Referee: [Experimental results] Experimental results section: outperformance of RSI over RSVD in predictive accuracy is reported at aggressive compression levels, yet the text lacks error bars across runs, explicit values for the number of power iterations, oversampling parameters, and compression ratios used; without these the reproducibility and robustness of the central empirical claim cannot be assessed.

Authors: We thank the referee for highlighting this omission, which is essential for reproducibility. The revised manuscript will report error bars computed over five independent runs using different random seeds for both RSVD and RSI. We will explicitly state the hyperparameters employed throughout the experiments: two power iterations, an oversampling parameter of ten, and compression ratios corresponding to target ranks of 10 percent, 20 percent, and 30 percent of the smaller matrix dimension for the aggressive regimes. These specific values were selected via preliminary tuning to balance approximation quality against runtime on the evaluated models. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained via independent analysis and standard randomized linear algebra

full rationale

The paper's central chain proceeds from a direct perturbation analysis of softmax probabilities controlled by spectral error (derived from matrix norms and probability deviations, independent of any fitted target), to the standard construction of RSI via multiple power iterations on the randomized range finder (a known technique for spectral gap improvement, not redefined here). No step reduces a claimed prediction to a fitted input by construction, invokes a self-citation as the sole justification for a uniqueness claim, or renames an empirical pattern as a new result. Empirical comparisons to RSVD on held-out model accuracy provide external falsifiability rather than tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, new entities, or ad-hoc axioms are stated.

axioms (1)

standard math Standard properties of singular value decomposition and randomized matrix approximation algorithms
The work relies on established results from randomized linear algebra for low-rank approximation.

pith-pipeline@v0.9.0 · 5504 in / 1211 out tokens · 52775 ms · 2026-05-13T20:45:11.700705+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 2 internal anchors

[1]

Banyongrakkul, M

P. Banyongrakkul, M. Zahedi, P. Thongtanunam, C. Treude, and H. Gao , From release to adoption: Challenges in reusing pre-trained ai models for downstream developers , in IEEE International Conference on Software Maintenance and Evolution (ICSME), 2025, pp. 1--13

work page 2025
[2]

On the Opportunities and Risks of Foundation Models

R. Bommasani et al. , On the opportunities and risks of foundation models , arXiv preprint arXiv:2108.07258, (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Cheng, M

H. Cheng, M. Zhang, and J. Shi , A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations , IEEE Transactions on Pattern Analysis and Machine Intelligence, 46 (2024), pp. 10558--10578

work page 2024
[4]

Cho and B

J. Cho and B. Hariharan , On the efficacy of knowledge distillation , in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4794--4802

work page 2019
[5]

Derezi \'n ski and M

M. Derezi \'n ski and M. Mahoney , Recent and upcoming developments in randomized numerical linear algebra for machine learning , in ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 6470--6479

work page 2024
[6]

Dosovitskiy et al

A. Dosovitskiy et al. , An image is worth 16x16 words: Transformers for image recognition at scale , in International Conference on Learning Representations, 2021

work page 2021
[7]

Emami, H

Y. Emami, H. Zhou, S. Nabavirazani, and L. Almeida , LLM -enabled in-context learning for data collection scheduling in UAV -assisted sensor networks , IEEE Internet of Things Journal, (2025)

work page 2025
[8]

R. Feng, K. Zheng, Y. Huang, D. Zhao, M. Jordan, and Z. Zha , Rank diminishing in deep neural networks , Advances in Neural Information Processing Systems, 35 (2022), pp. 33054--33065

work page 2022
[9]

Gu , Subspace iteration randomization and singular value problems , SIAM Journal on Scientific Computing, 37 (2015), pp

M. Gu , Subspace iteration randomization and singular value problems , SIAM Journal on Scientific Computing, 37 (2015), pp. A1139--A1173

work page 2015
[10]

Gupta and P

M. Gupta and P. Agrawal , Compression of deep learning models for text: A survey , ACM Transactions on Knowledge Discovery from Data, 16 (2022), pp. 1--55

work page 2022
[11]

Halko, P

N. Halko, P. Martinsson, and J. Tropp , Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions , SIAM Review, 53 (2011), pp. 217--288

work page 2011
[12]

E. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen , Lo RA : Low-rank adaptation of large language models , in International Conference on Learning Representations, 2022

work page 2022
[13]

Accessed: 2026-03-29

Hugging Face , Models , https://huggingface.co/models. Accessed: 2026-03-29

work page 2026
[14]

Idelbayev and M

Y. Idelbayev and M. Carreira-Perpin \'a n , Low-rank compression of neural nets: Learning the rank of each layer , in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8049--8059

work page 2020
[15]

Liang et al

Y. Liang et al. , A comprehensive survey on large language model compression for artificial intelligence applications in edge systems , IEEE Internet of Things Journal, (2026)

work page 2026
[16]

J. Lin, L. Zhu, W. Chen, W. Wang, C. Gan, and S. Han , On-device training under 256kb memory , Advances in Neural Information Processing Systems, 35 (2022), pp. 22941--22954

work page 2022
[17]

D. Liu, Y. Zhu, Z. Liu, Y. Liu, C. Han, J. Tian, R. Li, and W. Yi , A survey of model compression techniques: Past, present, and future , Frontiers in Robotics and AI, 12 (2025), p. 1518965

work page 2025
[18]

H. Liu, M. Galindo, H. Xie, L. Wong, H. Shuai, Y. Li, and W. Cheng , Lightweight deep learning for resource-constrained environments: A survey , ACM Computing Surveys, 56 (2024), pp. 1--42

work page 2024
[19]

Martinsson and J

P. Martinsson and J. Tropp , Randomized numerical linear algebra: Foundations and algorithms , Acta Numerica, 29 (2020), pp. 403--572

work page 2020
[20]

F. Meng, Z. Wang, and M. Zhang , PiSSA : Principal singular values and singular vectors adaptation of large language models , Advances in Neural Information Processing Systems, (2024), pp. 121038--121072

work page 2024
[21]

Parisi, A

S. Parisi, A. Rajeswaran, S. Purushwalkam, and A. Gupta , The unsurprising effectiveness of pre-trained vision models for control , in International Conference on Machine Learning, 2022, pp. 17359--17371

work page 2022
[22]

Pourkamali-Anaraki, S

F. Pourkamali-Anaraki, S. Becker, and M. Wakin , Randomized clustered nystrom for large-scale kernel machines , in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018

work page 2018
[23]

https://pytorch.org/hub/, 2019

PyTorch Contributors , Pytorch H ub: A repository of pretrained models . https://pytorch.org/hub/, 2019. Accessed: 2026-03-30

work page 2019
[24]

Rudin , Principles of Mathematical Analysis , McGraw-Hill, 1976

W. Rudin , Principles of Mathematical Analysis , McGraw-Hill, 1976

work page 1976
[25]

R. Saha, N. Sagan, V. Srivastava, A. Goldsmith, and M. Pilanci , Compressing large language models using low rank and low precision decomposition , Advances in Neural Information Processing Systems, (2024), pp. 88981--89018

work page 2024
[26]

Saibaba , Randomized subspace iteration: Analysis of canonical angles and unitarily invariant norms , SIAM Journal on Matrix Analysis and Applications, 40 (2019), pp

A. Saibaba , Randomized subspace iteration: Analysis of canonical angles and unitarily invariant norms , SIAM Journal on Matrix Analysis and Applications, 40 (2019), pp. 23--48

work page 2019
[27]

Saibaba and A

A. Saibaba and A. Mi e dlar , Randomized low-rank approximations beyond gaussian random matrices , SIAM Journal on Mathematics of Data Science, 7 (2025), pp. 136--162

work page 2025
[28]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman , Very deep convolutional networks for large-scale image recognition , arXiv preprint arXiv:1409.1556, (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[29]

Thamm, M

M. Thamm, M. Staats, and B. Rosenow , Random matrix analysis of deep neural network weight matrices , Physical Review E, 106 (2022), p. 054124

work page 2022
[30]

Tropp and R

J. Tropp and R. Webber , Randomized algorithms for low-rank matrix approximation: Design, analysis, and applications , arXiv preprint arXiv:2306.12418, (2023)

work page arXiv 2023
[31]

X. Tu, Z. He, Y. Huang, Z. Zhang, M. Yang, and J. Zhao , An overview of large AI models and their applications , Visual Intelligence, 2 (2024), p. 34

work page 2024
[32]

Udell and A

M. Udell and A. Townsend , Why are big data matrices approximately low rank? , SIAM Journal on Mathematics of Data Science, 1 (2019), pp. 144--160

work page 2019
[33]

Varshney, N

L. Varshney, N. Keskar, and R. Socher , Pretrained AI models: Performativity, mobility, and change , arXiv preprint arXiv:1909.03290, (2019)

work page arXiv 1909
[34]

Y. Wang, X. Deng, Y. Xie, W. Peng, S. Chen, X. Li, M. Tang, and M. Fang , End-to-end knowledge distillation for unsupervised domain adaptation with large vision-language models , in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, 2026, pp. 26624--26633

work page 2026
[35]

X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang , A survey on model compression for large language models , Transactions of the Association for Computational Linguistics, 12 (2024), pp. 1556--1577

work page 2024
[36]

write newline

" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTION or pop #1 'skip if FUNCTION new.block.checka empty 'skip 'new.block if FUNCTION field.or.null duplicate empty pop "" 'skip ...

work page