pith. machine review for the scientific record.

arxiv: 2605.02829 · v1 · submitted 2026-05-04 · 💻 cs.AI

Recognition: unknown

Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

Jingze Ge, Min Wu, Wang Zhe Mark, Wanqi Dong, Xue Geng, Xulei Yang, Yun Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:06 UTC · model grok-4.3

classification 💻 cs.AI
keywords parameter-efficient fine-tuning · model compression · low-rank adaptation · subspace union · vision transformers · large language models · joint optimization · calibration set

The pith

JACTUS unifies compression and adaptation via a single task-aware union of subspaces instead of doing them sequentially.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that first compressing a pretrained model and then adapting it risks preserving directions that do not align with the downstream task, wasting part of the global parameter budget. JACTUS instead estimates input and gradient covariances on a small calibration set, forms an orthogonal union of those directions with the original pretrained weight subspace, projects a low-rank approximation inside the union, and allocates ranks by marginal gain per parameter before training only the resulting compact core. This produces a deployable low-rank model that keeps the full parameter budget fixed while coupling compression choices directly to adaptation needs. The authors report that at 80 percent retained parameters the method exceeds both 100-percent PEFT baselines and prior compress-then-finetune pipelines on eight vision datasets with ViT-Base and on commonsense QA with Llama2-7B.

Core claim

JACTUS estimates input and pre-activation gradient covariances from a calibration set, forms their orthogonal union with the pretrained weight subspace, performs projected low-rank approximation inside this union, allocates rank globally by marginal gain per parameter, and trains only a compact core matrix, thereby coupling the preserved directions for compression with those required for adaptation and yielding a low-rank model that does not retain full frozen weights.
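As a concrete reading of the deployable structure this claim describes, here is a minimal PyTorch sketch (editorial, not the authors' code) of a layer that freezes joint bases and trains only a small core; the class name, toy shapes, and random placeholder bases are assumptions for illustration.

```python
import torch
import torch.nn as nn

class UnionLowRankLinear(nn.Module):
    """Hypothetical sketch of a JACTUS-style layer: frozen joint bases Q_L, Q_R and a
    small trainable core C, so W is approximated as Q_L @ C @ Q_R.T and the original
    full weight never needs to be stored at deployment."""

    def __init__(self, Q_L: torch.Tensor, Q_R: torch.Tensor, core_init: torch.Tensor):
        super().__init__()
        # Frozen orthonormal bases of the task-aware union (d_out x k_L, d_in x k_R).
        self.register_buffer("Q_L", Q_L)
        self.register_buffer("Q_R", Q_R)
        # Compact core (k_L x k_R) is the only trainable tensor.
        self.core = nn.Parameter(core_init.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x @ (Q_L @ C @ Q_R.T).T, computed factor by factor to stay low-rank.
        return (x @ self.Q_R) @ self.core.T @ self.Q_L.T

# Usage sketch: project a pretrained weight W onto placeholder bases, train only the core.
d_out, d_in, k_L, k_R = 64, 32, 12, 10
W = torch.randn(d_out, d_in)
Q_L, _ = torch.linalg.qr(torch.randn(d_out, k_L))  # placeholder bases; JACTUS would
Q_R, _ = torch.linalg.qr(torch.randn(d_in, k_R))   # derive them from the subspace union
layer = UnionLowRankLinear(Q_L, Q_R, core_init=Q_L.T @ W @ Q_R)
print(layer(torch.randn(4, d_in)).shape)  # -> torch.Size([4, 64])
```

Initializing the core as Q_L.T @ W @ Q_R would correspond to the projected low-rank approximation described above, so adaptation starts from the compressed solution rather than from scratch.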

What carries the argument

The task-aware union of subspaces, formed by orthogonally combining the pretrained weight subspace with subspaces spanned by input and pre-activation gradient covariances estimated on a calibration set, which then serves as the ambient space for projected low-rank approximation and global rank allocation.
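A minimal NumPy sketch of one way such a union could be assembled, assuming a layer of the form y = W x so that the gradient covariance contributes output-side directions and the input covariance contributes input-side directions; the ranks, helper names, and random calibration stand-ins are illustrative, and the paper's own rank bookkeeping (Figure 2) may pair the sides differently.

```python
import numpy as np

def orthonormal_basis(mat: np.ndarray, rank: int) -> np.ndarray:
    """Top-`rank` left singular vectors of a matrix (here used on symmetric covariances)."""
    U, _, _ = np.linalg.svd(mat, full_matrices=False)
    return U[:, :rank]

def union_bases(W, C_x, C_g, k_w=8, k_i=4, k_o=4):
    """Sketch of a task-aware union for y = W x: weight directions are merged with
    gradient-covariance directions on the output side and input-covariance directions
    on the input side, then re-orthonormalized with a QR factorization."""
    U_w, _, Vt_w = np.linalg.svd(W, full_matrices=False)
    U_g = orthonormal_basis(C_g, k_o)          # output-side task directions (d_out x k_o)
    V_x = orthonormal_basis(C_x, k_i)          # input-side task directions  (d_in x k_i)
    Q_L, _ = np.linalg.qr(np.hstack([U_w[:, :k_w], U_g]))     # left basis, rank <= k_w + k_o
    Q_R, _ = np.linalg.qr(np.hstack([Vt_w.T[:, :k_w], V_x]))  # right basis, rank <= k_w + k_i
    return Q_L, Q_R

# Toy example with random stand-ins for W and the calibration covariances.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
C_x = np.cov(rng.standard_normal((200, 32)), rowvar=False)  # input covariance (d_in x d_in)
C_g = np.cov(rng.standard_normal((200, 64)), rowvar=False)  # gradient covariance (d_out x d_out)
Q_L, Q_R = union_bases(W, C_x, C_g)
print(Q_L.shape, Q_R.shape)  # (64, 12) (32, 12)
```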

If this is right

  • At an 80 percent retained-parameter budget, JACTUS reaches 89.2 percent average accuracy on ViT-Base across eight vision datasets, exceeding 100 percent PEFT baselines such as DoRA at 87.9 percent.
  • On Llama2-7B commonsense QA the same budget yields 80.9 percent average accuracy, surpassing DoRA at 79.7 percent and prior compress-then-finetune pipelines.
  • The final model is fully deployable as a low-rank structure without storing the original full frozen weights.
  • Global rank allocation by marginal gain per parameter distributes the fixed budget more efficiently than uniform or heuristic choices (a minimal allocation sketch follows this list).
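A minimal sketch of the greedy marginal-gain-per-parameter allocation referenced in the last bullet, assuming the gain of one extra rank in a layer is its next squared singular value and the cost is a fixed per-rank parameter count; both assumptions are editorial, not the paper's stated rule.

```python
import heapq
import numpy as np

def allocate_ranks(singular_values, param_cost_per_rank, budget):
    """Greedy allocation: repeatedly grant one more rank to the layer whose next retained
    singular value buys the most captured energy per parameter, until the budget is spent.

    singular_values: list of 1-D arrays (descending), one per layer
    param_cost_per_rank: parameters consumed by one extra rank in each layer
    budget: total parameter budget
    """
    ranks = [0] * len(singular_values)
    heap = []  # max-heap via negated gain: (-gain, layer_index)
    for i, (s, cost) in enumerate(zip(singular_values, param_cost_per_rank)):
        if len(s) > 0:
            heapq.heappush(heap, (-(s[0] ** 2) / cost, i))
    spent = 0
    while heap:
        _, i = heapq.heappop(heap)
        cost = param_cost_per_rank[i]
        if spent + cost > budget:
            continue  # this layer's next rank does not fit; try cheaper candidates
        spent += cost
        ranks[i] += 1
        nxt = ranks[i]
        if nxt < len(singular_values[i]):
            heapq.heappush(heap, (-(singular_values[i][nxt] ** 2) / cost, i))
    return ranks, spent

# Toy usage: three layers with different spectra and per-rank costs.
specs = [np.array([5.0, 3.0, 1.0, 0.5]), np.array([4.0, 0.2]), np.array([2.0, 1.9, 1.8])]
ranks, used = allocate_ranks(specs, param_cost_per_rank=[96, 160, 64], budget=500)
print(ranks, used)  # -> [2, 1, 2] 480
```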

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The joint-union construction may allow model-serving systems to skip a separate compression stage and directly output a tuned low-rank checkpoint.
  • If the calibration set is drawn from a distribution close to the target task, the same machinery could support rapid few-shot adaptation without retraining large subspaces.
  • Extending the covariance estimation to additional layers or modalities could test whether the union principle scales beyond vision and language transformers.
  • Comparing the singular vectors retained by the union against those kept by sequential compression would quantify how much task-relevant direction is otherwise discarded (see the subspace-overlap sketch after this list).
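For the last extension, one standard way to quantify the overlap between two retained subspaces is through principal angles; the sketch below uses random stand-ins for the union-retained and sequentially-retained bases, so the names and sizes are illustrative only.

```python
import numpy as np

def subspace_overlap(Q_a: np.ndarray, Q_b: np.ndarray) -> np.ndarray:
    """Cosines of the principal angles between two orthonormal bases:
    values near 1 mean the subspaces share a direction, near 0 mean they do not."""
    return np.linalg.svd(Q_a.T @ Q_b, compute_uv=False)

# Toy comparison of "task-aware" vs "weight-only" retained directions (random stand-ins).
rng = np.random.default_rng(1)
Q_union, _ = np.linalg.qr(rng.standard_normal((64, 12)))
Q_seq, _ = np.linalg.qr(rng.standard_normal((64, 12)))
print(subspace_overlap(Q_union, Q_seq).round(2))
```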

Load-bearing premise

Covariances computed from a small calibration set, once orthogonally unioned with the pretrained subspace, will contain enough directions to support both faithful low-rank compression and effective downstream adaptation.
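A minimal PyTorch sketch of how the two covariances in this premise could be estimated from a calibration loader; the hook-based bookkeeping, the helper name calibration_covariances, and the toy model and sample counts are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def calibration_covariances(model: nn.Module, layer: nn.Linear, calib_loader, loss_fn):
    """Sketch: accumulate the input covariance C_x and the pre-activation gradient
    covariance C_g of `layer` over a small calibration set."""
    C_x = torch.zeros(layer.in_features, layer.in_features)
    C_g = torch.zeros(layer.out_features, layer.out_features)
    count = 0
    cache = {}

    def forward_hook(module, inputs, output):
        cache["x"] = inputs[0].detach()
        # Catch the gradient w.r.t. this layer's pre-activation output on the backward pass.
        output.register_hook(lambda grad: cache.update(g=grad.detach()))

    handle = layer.register_forward_hook(forward_hook)
    for inputs, targets in calib_loader:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        x = cache["x"].reshape(-1, layer.in_features)
        g = cache["g"].reshape(-1, layer.out_features)
        C_x += x.T @ x
        C_g += g.T @ g
        count += x.shape[0]
    handle.remove()
    return C_x / count, C_g / count

# Usage sketch with a toy model and a 256-sample calibration set (sizes are assumptions).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
data = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
loader = DataLoader(data, batch_size=32)
C_x, C_g = calibration_covariances(model, model[2], loader, nn.CrossEntropyLoss())
print(C_x.shape, C_g.shape)  # torch.Size([32, 32]) torch.Size([4, 4])
```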

What would settle it

On a new held-out task, JACTUS at 80 percent retained parameters produces lower average accuracy than a strong sequential compress-then-adapt baseline that uses the same total parameter budget.

Figures

Figures reproduced from arXiv: 2605.02829 by Jingze Ge, Min Wu, Wang Zhe Mark, Wanqi Dong, Xue Geng, Xulei Yang, Yun Liu.

Figure 1: Comparison of three paradigms: PEFT, compression then fine-tuning, and our joint adaptation and compression. … view at source ↗
Figure 2: Overview of the proposed subspace-based adaptation. (a) Activation and gradient covariances Cx and Cg are estimated on the calibration set. (b) Uw, Vw, Ug, Vx are obtained by applying SVD to W, Cx, and Cg, then truncated by energy thresholds α, β, γ to ranks kw, ki, ko; the merged subspaces form joint bases QL, QR with ranks kL, kR satisfying kw + ki = kL and kw + ko = kR. (c) A global rank allocator assi… view at source ↗
Figure 3: Layer-wise ranks selected by the greedy global allocator for ViT-Large [10] under 40%, 60%, and 80% retained-parameter budgets. Ranks of the attention (Query/Key/Value/Projection) and MLP (Up/Down) projections are plotted across layers. The calibration dataset is extracted from the CIFAR-100 [22] training split. … view at source ↗
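Figure 2's caption describes truncating each spectrum by energy thresholds α, β, γ to ranks kw, ki, ko. One common realization of such a threshold is the smallest rank that captures a target fraction of squared singular-value energy; the function below is an editorial sketch of that idea, not the paper's stated rule.

```python
import numpy as np

def rank_by_energy(singular_values: np.ndarray, threshold: float) -> int:
    """Smallest rank whose cumulative squared singular-value energy reaches `threshold`
    (a sketch of the energy-threshold truncation described in Figure 2's caption)."""
    energy = np.cumsum(singular_values ** 2)
    energy /= energy[-1]
    return int(np.searchsorted(energy, threshold) + 1)

s = np.array([10.0, 5.0, 2.0, 0.5, 0.1])
print(rank_by_energy(s, 0.95))  # keep enough directions to explain 95% of the energy -> 2
```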
read the original abstract

Adapting large pretrained models to diverse tasks is now routine, yet the two dominant strategies of parameter-efficient fine-tuning (PEFT) and low-rank compression are typically composed in sequence. This decoupled practice first compresses and then fine-tunes adapters, potentially misaligning the compressed subspace with downstream objectives and squandering a global parameter budget. To overcome this limitation, we introduce JACTUS (Joint Adaptation and Compression with a Task-aware Union of Subspaces), a single framework that unifies compression and adaptation. From a small calibration set, JACTUS estimates input and pre-activation gradient covariances, forms their orthogonal union with the pretrained weight subspace, performs a projected low-rank approximation inside this union, allocates rank globally by marginal gain per parameter, and trains only a compact core matrix. This explicitly mitigates the potential misalignment between the compressed subspace and downstream objectives by coupling the directions preserved for compression with those required for adaptation, yielding a deployable low-rank model that avoids retaining full frozen weights while enabling fast and robust tuning. On vision, JACTUS attains an average 89.2% accuracy on ViT-Base across eight datasets at 80% retained parameters, surpassing strong 100% PEFT baselines (e.g., DoRA 87.9%). On language, JACTUS achieves an 80.9% average on Llama2-7B commonsense QA at the same 80% retained-parameter budget, outperforming 100% PEFT (e.g., DoRA 79.7%) and exceeding prior compress-then-finetune pipelines under the same ratained-parameter budget. We will release code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces JACTUS, a unified framework for joint low-rank compression and task adaptation of pretrained models. It estimates input and pre-activation gradient covariances from a small calibration set, forms their orthogonal union with the pretrained weight subspace, performs a projected low-rank approximation inside this union, allocates ranks globally by marginal gain per parameter, and trains only a compact core matrix. Empirical results claim average accuracies of 89.2% on ViT-Base across eight vision datasets and 80.9% on Llama2-7B commonsense QA at 80% retained parameters, outperforming 100% PEFT baselines such as DoRA and prior compress-then-finetune methods.

Significance. If the central construction holds, the work is significant because it directly couples compression directions with downstream adaptation objectives, avoiding the misalignment risk in sequential pipelines while producing a deployable low-rank model without retaining full frozen weights. The reported gains at fixed parameter budgets on both vision and language models, together with the promised code release, would support practical efficiency improvements for large-model adaptation.

major comments (3)
  1. §3 (method description): the construction of the orthogonal union of the pretrained column space with the estimated covariance subspaces, followed by the projected low-rank factorization, is described at a high level without explicit matrix equations or pseudocode for the projection operator; this leaves unclear whether the subsequent low-rank model can recover task-critical directions that would be lost under a standard compress-then-adapt baseline.
  2. Experimental section (results on ViT-Base and Llama2-7B): the headline numbers (89.2% and 80.9% at 80% retention) rely on covariance estimates from a calibration set of unspecified size and distribution; no ablation or sensitivity analysis is provided on calibration-set size or diversity, which would directly test the weakest assumption that these estimates reliably span adaptation directions when unioned with the pretrained subspace.
  3. §4 (rank allocation): the global rank allocation by marginal gain per parameter is presented without a derivation or proof that the marginal-gain criterion remains optimal once the subspace is restricted to the task-aware union; this step is load-bearing for the claim that the joint procedure outperforms separate compression followed by adaptation under the same budget.
minor comments (2)
  1. Abstract: typo 'ratained-parameter' should be 'retained-parameter'.
  2. Notation: the distinction between input covariance and pre-activation gradient covariance is introduced without consistent symbols across the text and equations, hindering readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity, add missing details, and strengthen the presentation.

read point-by-point responses
  1. Referee: §3 (method description): the construction of the orthogonal union of the pretrained column space with the estimated covariance subspaces, followed by the projected low-rank factorization, is described at a high level without explicit matrix equations or pseudocode for the projection operator; this leaves unclear whether the subsequent low-rank model can recover task-critical directions that would be lost under a standard compress-then-adapt baseline.

    Authors: We agree that the description in §3 would benefit from greater mathematical precision. In the revision we will insert explicit matrix equations for the orthogonal union (pretrained weight subspace unioned with the column spaces of the estimated input and pre-activation gradient covariances) and for the subsequent projection-based low-rank factorization inside that union. Pseudocode for the full procedure will also be added. These additions will make explicit that task-critical directions identified by the gradient covariances are retained in the union and therefore available for recovery in the final low-rank model. (One plausible form of these equations is sketched after the point-by-point responses.) revision: yes

  2. Referee: Experimental section (results on ViT-Base and Llama2-7B): the headline numbers (89.2% and 80.9% at 80% retention) rely on covariance estimates from a calibration set of unspecified size and distribution; no ablation or sensitivity analysis is provided on calibration-set size or diversity, which would directly test the weakest assumption that these estimates reliably span adaptation directions when unioned with the pretrained subspace.

    Authors: The referee is correct that calibration-set size and distribution are not stated. We will specify the exact sizes and sampling procedures used (random subsets drawn from the respective training sets, typically a few hundred to a thousand examples). We will also add a sensitivity ablation that varies calibration-set size (128, 512, 2048 samples) and reports downstream accuracy on representative vision and language tasks, thereby directly addressing the robustness of the covariance estimates when unioned with the pretrained subspace. revision: yes

  3. Referee: §4 (rank allocation): the global rank allocation by marginal gain per parameter is presented without a derivation or proof that the marginal-gain criterion remains optimal once the subspace is restricted to the task-aware union; this step is load-bearing for the claim that the joint procedure outperforms separate compression followed by adaptation under the same budget.

    Authors: We will add a concise derivation in the revised §4 showing that the marginal-gain-per-parameter rule is the standard greedy step for maximizing captured variance (trace of the projected covariance) subject to a total-parameter budget; because the union subspace is fixed before allocation, the same greedy argument applies inside the restricted subspace. A full optimality proof for the restricted case is not provided in the current manuscript and would require further theoretical development; we will note this limitation while retaining the empirical comparisons that support the practical advantage of the joint approach. (The greedy criterion is stated compactly after the point-by-point responses.) revision: partial
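Two editorial sketches follow. First, one plausible form of the matrix equations promised in response 1, under the assumption that the union bases Q_L and Q_R are orthonormal (the symbols mirror Figure 2; the exact construction in the revision may differ):

```latex
% Editorial sketch, not the authors' revision: union bases from concatenated subspaces,
% then a projected low-rank factorization with a compact core inside the fixed union.
\begin{align}
  Q_L &= \operatorname{orth}\!\big([\, U_w \;\; U_g \,]\big), \qquad
  Q_R  = \operatorname{orth}\!\big([\, V_w \;\; V_x \,]\big), \\
  C^{\star} &= \arg\min_{C}\, \big\lVert W - Q_L\, C\, Q_R^{\top} \big\rVert_F^{2}
             = Q_L^{\top} W\, Q_R, \qquad
  \widehat{W} = Q_L\, C^{\star} Q_R^{\top},
\end{align}
% after which only the core C is updated during fine-tuning inside the fixed union (Q_L, Q_R).
```

Second, a compact statement of the greedy marginal-gain criterion described in response 3, with sigma_{l,j} the projected singular values of layer l inside the union and p_l the parameter cost of one additional rank; this states the heuristic, not an optimality proof:

```latex
% Editorial sketch of the budgeted greedy objective and its per-step selection rule.
\begin{equation}
  \max_{\{k_\ell\}} \;\sum_{\ell} \sum_{j=1}^{k_\ell} \sigma_{\ell,j}^{2}
  \quad \text{s.t.} \quad \sum_{\ell} k_\ell\, p_\ell \le B,
  \qquad
  \ell^{\star} = \arg\max_{\ell}\; \frac{\sigma_{\ell,\, k_\ell + 1}^{2}}{p_\ell}.
\end{equation}
```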

standing simulated objections not resolved
  • A rigorous optimality proof for the marginal-gain rank allocation inside the task-aware union subspace.

Circularity Check

0 steps flagged

No significant circularity in the JACTUS derivation chain

full rationale

The paper presents an algorithmic pipeline: covariance estimation on a calibration set, orthogonal union with the pretrained weight subspace, projected low-rank approximation inside the union, marginal-gain global rank allocation, and training of a compact core matrix. These steps are standard linear-algebra operations on empirically derived quantities and do not reduce the central claim (joint compression-adaptation superiority) to a definition, a fitted input renamed as prediction, or a self-citation chain. Reported accuracies (89.2% ViT-Base, 80.9% Llama-2-7B at 80% retained parameters) are empirical outcomes on downstream tasks, not tautological consequences of the inputs. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the method; the construction remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method builds on standard linear algebra operations and covariance estimation from prior literature, with the primary addition being the joint coupling of compression and adaptation directions; no explicit free parameters or new postulated entities are detailed in the abstract.

axioms (1)
  • domain assumption: The orthogonal union of the pretrained weight subspace with input and pre-activation gradient covariance subspaces preserves directions relevant for both compression fidelity and downstream task performance.
    Invoked when forming the union prior to projected low-rank approximation inside it.

pith-pipeline@v0.9.0 · 5617 in / 1490 out tokens · 55603 ms · 2026-05-08T18:06:45.262971+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 20 canonical work pages · 5 internal anchors

  1. Amari, S.i.: Natural gradient works efficiently in learning. Neural Computation 10(2), 251–276 (1998). https://doi.org/10.1162/089976698300017746
  2. Ben-Israel, A., Greville, T.N.E.: Generalized Inverses: Theory and Applications. CMS Books in Mathematics, Springer, New York, 2 edn. (2003). https://doi.org/10.1007/978-0-387-21524-5
  3. Bisk, Y., Zellers, R., Le Bras, R., Gao, J., Choi, Y.: PIQA: Reasoning about physical commonsense in natural language. In: Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (2020)
  4. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
  5. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: CVPR. pp. 3606–3613 (2014)
  6. Clark, C., Lee, K., Chang, M.W., Kwiatkowski, T., Collins, M., Toutanova, K.: BoolQ: Exploring the surprising difficulty of natural yes/no questions. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2019)
  7. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018)
  8. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., Schulman, J.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
  9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. pp. 248–255 (2009)
  10. Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  11. Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1(3), 211–218 (1936)
  12. Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022)
  13. Golub, G.H., Van Loan, C.F.: Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 4 edn. (2013). https://doi.org/10.1137/1.9781421407944
  14. Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12(7), 2217–2226 (2019)
  15. Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the MATH dataset. NeurIPS Datasets and Benchmarks Track (2021)
  16. Hsu, Y.C., Hua, T., Chang, S., Lou, Q., Shen, Y., Jin, H.: Language model compression with weighted low-rank factorization (2022). https://arxiv.org/abs/2207.00112
  17. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)
  18. Hwang, I., Park, H., Lee, Y., Yang, J., Maeng, S.: PC-LoRA: Low-rank adaptation for progressive model compression with knowledge distillation. arXiv preprint arXiv:2406.09117 (2024). https://doi.org/10.48550/arXiv.2406.09117
  19. Khuller, S., Moss, A., Naor, J.S.: The budgeted maximum coverage problem. Information Processing Letters 70(1), 39–45 (1999). https://doi.org/10.1016/S0020-0190(99)00031-9
  20. Kopiczko, D.J., Blankevoort, T., Asano, Y.M.: VeRA: Vector-based random matrix adaptation. In: ICLR (2024)
  21. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: ICCV Workshops. pp. 554–561 (2013)
  22. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Master's thesis, University of Toronto (2009)
  23. Ledoit, O., Wolf, M.: A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88(2), 365–411 (2004). https://doi.org/10.1016/j.jmva.2003.11.009
  24. Lingam, V.C., Neerkaje, A., Vavre, A., Shetty, A., Gudur, G.K., Ghosh, J., Choi, E., Dimakis, A., Bojchevski, A., Sanghavi, S.: SVFT: Parameter-efficient fine-tuning with singular vectors. In: NeurIPS. pp. 41425–41446 (2024)
  25. Liu, S.Y., Wang, C.Y., Yin, H., Molchanov, P., Wang, Y.C.F., Cheng, K.T., Chen, M.H.: DoRA: Weight-decomposed low-rank adaptation. In: ICML. pp. 32100–32121 (2024)
  26. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
  27. Liu, Z., Kundu, S., Li, A., Wan, J., Jiang, L., Beerel, P.A.: AFLoRA: Adaptive freezing of low rank adaptation in parameter efficient fine-tuning of large models (2024). https://arxiv.org/abs/2403.13269
  28. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2017)
  29. Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
  30. Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-factored approximate curvature. In: Proceedings of the 32nd International Conference on Machine Learning (ICML). pp. 2408–2417 (2015). https://arxiv.org/abs/1503.05671
  31. Meng, F., Wang, Z., Zhang, M.: PiSSA: Principal singular values and singular vectors adaptation of large language models. In: NeurIPS. pp. 121038–121072 (2024)
  32. Mihaylov, T., Clark, P., Khot, T., Sabharwal, A.: Can a suit of armor conduct electricity? A new dataset for open book question answering. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018)
  33. Nagel, M., Amjad, R.A., van Baalen, M., Louizos, C., Blankevoort, T.: Up or down? Adaptive rounding for post-training quantization. In: Proceedings of the 37th International Conference on Machine Learning. PMLR, vol. 119, pp. 7197–7206 (2020). https://proceedings.mlr.press/v119/nagel20a.html
  34. Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: CVPR. pp. 3498–3505 (2012)
  35. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: NeurIPS. pp. 8026–8037 (2019)
  36. Qinsi, W., Ke, J., Tomizuka, M., Keutzer, K., Xu, C.: Dobi-SVD: Differentiable SVD for LLM compression and some new perspectives. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=kws76i5XB8
  37. Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020). https://api.semanticscholar.org/CorpusID:221191193
  38. Sakaguchi, K., Le Bras, R., Bhagavatula, C., Choi, Y.: WinoGrande: An adversarial Winograd schema challenge at scale. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020)
  39. Sap, M., Rashkin, H., Chen, D., Le Bras, R., Choi, Y.: SocialIQA: Commonsense reasoning about social interactions. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (2019)
  40. Talmor, A., Herzig, J., Lourie, N., Berant, J.: CommonsenseQA: A question answering challenge targeting commonsense knowledge. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (2019)
  41. Tian, C., Shi, Z., Guo, Z., Li, L., Xu, C.: HydraLoRA: An asymmetric LoRA architecture for efficient fine-tuning. In: NeurIPS. pp. 9565–9584 (2024)
  42. Tikhonov, A.N.: Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 5, 1035–1038 (1963)
  43. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and efficient foundation language models. CoRR abs/2302.13971 (2023). https://doi.org/10.48550/arXiv.2302.13971
  44. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerk…
  45. Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., Chowdhury, M., Zhang, M.: Efficient large language models: A survey (2024). https://arxiv.org/abs/2312.03863
  46. Wang, X., Chen, T., Ge, Q., Xia, H., Bao, R., Zheng, R., Zhang, Q., Gui, T., Huang, X.: Orthogonal subspace learning for language model continual learning (2023). https://arxiv.org/abs/2310.14152
  47. Wang, X., Zheng, Y., Wan, Z., Zhang, M.: SVD-LLM: Truncation-aware singular value decomposition for large language model compression. In: International Conference on Learning Representations (ICLR) (2025). https://openreview.net/forum?id=LNYIUouhdt
  48. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: Transformers: State-of-the-art natural language processing. In: EMNLP. pp. 38–45 (2020)
  49. Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J.T., Li, Z., Weller, A., Liu, W.: MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284 (2023)
  50. Yuan, Z., Shang, Y., Song, Y., Wu, Q., Yan, Y., Sun, G.: ASVD: Activation-aware singular value decomposition for compressing large language models (2023)
  51. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: HellaSwag: Can a machine really finish your sentence? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019)
  52. Zhang, M., Chen, H., Shen, C., Yang, Z., Ou, L., Yu, X., Zhuang, B.: LoRAPrune: Structured pruning meets low-rank parameter-efficient fine-tuning. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 3013–3026. Association for Computational Linguistics, Bangkok, Thailand (Aug 2024)
  53. Zhang, Q., Chen, M., Bukharin, A., He, P., Cheng, Y., Chen, W., Zhao, T.: Adaptive budget allocation for parameter-efficient fine-tuning. In: ICLR (2023)
  54. Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., Zhao, T.: AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning (2023). https://arxiv.org/abs/2303.10512
  55. Zhao, W., Huang, Y., Han, X., Liu, Z., Zhang, Z., Li, K., Chen, C., Yang, T., Sun, M.: CA-LoRA: Adapting existing LoRA for compressed LLMs to enable efficient multi-tasking on personal devices. arXiv preprint arXiv:2307.07705 (2024). https://doi.org/10.48550/arXiv.2307.07705
  56. Zhou, Z., Ning, X., Hong, K., Fu, T., Xu, J., Li, S., Lou, Y., Wang, L., Yuan, Z., Li, X., Yan, S., Dai, G., Zhang, X.P., Dong, Y., Wang, Y.: A survey on efficient inference for large language models. CoRR abs/2404.14294 (2024). https://doi.org/10.48550/arXiv.2404.14294