Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
Pith reviewed 2026-05-08 18:06 UTC · model grok-4.3
The pith
JACTUS unifies compression and adaptation via a single task-aware union of subspaces instead of doing them sequentially.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JACTUS estimates input and pre-activation gradient covariances from a calibration set, forms their orthogonal union with the pretrained weight subspace, performs a projected low-rank approximation inside this union, allocates rank globally by marginal gain per parameter, and trains only a compact core matrix. This couples the directions preserved for compression with those required for adaptation and yields a low-rank model that does not retain the full frozen weights.
What carries the argument
The task-aware union of subspaces. It is formed by orthogonally combining the pretrained weight subspace with subspaces spanned by the input and pre-activation gradient covariances estimated on a calibration set, and it then serves as the ambient space for projected low-rank approximation and global rank allocation.
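The union construction above can be sketched with plain linear algebra. The sketch below is illustrative only: the layer shapes, the rank k, the QR-based orthogonalization, and the empirical covariance estimators are assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_calib, k = 64, 48, 256, 16
W = rng.normal(size=(d_out, d_in))        # pretrained weight of one linear layer
X = rng.normal(size=(n_calib, d_in))      # calibration inputs
G = rng.normal(size=(n_calib, d_out))     # pre-activation gradients on the same set

def top_eigvecs(S, k):
    """Leading k eigenvectors of a symmetric PSD matrix."""
    vals, vecs = np.linalg.eigh(S)        # eigenvalues come back ascending
    return vecs[:, np.argsort(vals)[::-1][:k]]

def orthogonal_union(*bases):
    """Orthonormal basis for the span of the concatenated bases (reduced QR)."""
    Q, R = np.linalg.qr(np.concatenate(bases, axis=1))
    return Q[:, np.abs(np.diag(R)) > 1e-10]   # drop linearly dependent columns

# 1. Task covariances estimated from the calibration set
Sigma_x = X.T @ X / n_calib               # input covariance      (d_in  x d_in)
Sigma_g = G.T @ G / n_calib               # gradient covariance   (d_out x d_out)

# 2. Orthogonal union of task subspaces with the pretrained weight subspace
Uw, _, Vwt = np.linalg.svd(W, full_matrices=False)
U_union = orthogonal_union(Uw[:, :k], top_eigvecs(Sigma_g, k))     # output side
V_union = orthogonal_union(Vwt.T[:, :k], top_eigvecs(Sigma_x, k))  # input side

# 3. Projected low-rank approximation: only the core C would be trained
C = U_union.T @ W @ V_union               # compact trainable core
W_lowrank = U_union @ C @ V_union.T       # deployable low-rank weight
```

Because the union contains both the pretrained directions and the covariance directions, the projection cannot discard a direction that either criterion deems important, which is the coupling the core claim describes.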
If this is right
- At an 80 percent retained-parameter budget, JACTUS reaches 89.2 percent average accuracy on ViT-Base across eight vision datasets, exceeding full-budget (100 percent retained-parameter) PEFT baselines such as DoRA at 87.9 percent.
- On Llama2-7B commonsense QA, the same budget yields 80.9 percent average accuracy, surpassing DoRA at 79.7 percent and prior compress-then-finetune pipelines.
- The final model is fully deployable as a low-rank structure without storing the original full frozen weights.
- Global rank allocation by marginal gain per parameter distributes the fixed budget more efficiently than uniform or heuristic choices.
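The last bullet's allocation rule can be sketched as a greedy loop. This is a hedged reconstruction: the per-layer gain curves (e.g. cumulative captured variance) and per-rank parameter costs are assumed inputs, not quantities the paper specifies here.

```python
import heapq

def allocate_ranks(gains, costs, budget):
    """Greedy global rank allocation by marginal gain per parameter.

    gains[l][r] -- objective value of layer l when given rank r
    costs[l]    -- parameters consumed by one additional rank unit in layer l
    budget      -- total parameter budget shared across all layers
    """
    ranks = [0] * len(gains)
    heap = []  # entries: (-marginal gain per parameter, layer index)
    for l, g in enumerate(gains):
        if len(g) > 1:
            heapq.heappush(heap, (-(g[1] - g[0]) / costs[l], l))
    spent = 0
    while heap:
        _, l = heapq.heappop(heap)
        if spent + costs[l] > budget:
            continue                      # this layer no longer fits; try others
        ranks[l] += 1
        spent += costs[l]
        r = ranks[l]
        if r + 1 < len(gains[l]):         # queue the layer's next marginal step
            heapq.heappush(heap, (-(gains[l][r + 1] - gains[l][r]) / costs[l], l))
    return ranks

# Two layers: layer 0 has a big first gain but costs 2 params/rank; layer 1 costs 1.
ranks = allocate_ranks([[0, 9, 12, 13], [0, 4, 7, 9]], costs=[2, 1], budget=4)
```

When each gain curve is concave, this greedy rule is the standard heuristic for budgeted selection; ties and non-concave curves are where uniform or heuristic allocations can still win, which is the caveat the referee raises below.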
Where Pith is reading between the lines
- The joint-union construction may allow model-serving systems to skip a separate compression stage and directly output a tuned low-rank checkpoint.
- If the calibration set is drawn from a distribution close to the target task, the same machinery could support rapid few-shot adaptation without retraining large subspaces.
- Extending the covariance estimation to additional layers or modalities could test whether the union principle scales beyond vision and language transformers.
- Comparing the singular vectors retained by the union against those kept by sequential compression would quantify how much task-relevant direction is otherwise discarded.
Load-bearing premise
Covariances computed from a small calibration set, once orthogonally unioned with the pretrained subspace, will contain enough directions to support both faithful low-rank compression and effective downstream adaptation.
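This premise can be probed numerically: estimate a subspace from a small calibration sample and measure how much of a known task subspace it covers. Everything below is a synthetic illustration; the dimensions, noise level, and sample sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 32, 8
B = np.linalg.qr(rng.normal(size=(d, k)))[0]   # ground-truth task directions

def top_eigvecs(S, k):
    """Leading k eigenvectors of a symmetric PSD matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs[:, np.argsort(vals)[::-1][:k]]

def coverage(Q, V):
    """Fraction of the energy of directions V captured by span(Q), Q orthonormal."""
    return np.linalg.norm(Q @ (Q.T @ V)) / np.linalg.norm(V)

def calib_basis(n):
    """Top-k covariance eigenvectors estimated from n calibration samples."""
    Z = rng.normal(size=(n, k)) @ B.T + 0.1 * rng.normal(size=(n, d))
    return top_eigvecs(Z.T @ Z / n, k)

cov_small = coverage(calib_basis(16), B)    # tiny calibration set
cov_large = coverage(calib_basis(1024), B)  # generous calibration set
```

If coverage stays high at small n, the premise is plausible for that layer; if it degrades, the union inherits the miss. The same diagnostic applies unchanged with real input or gradient covariances in place of the synthetic ones.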
What would settle it
A disconfirming test: on a new held-out task, JACTUS at 80 percent retained parameters produces lower average accuracy than a strong sequential compress-then-adapt baseline that uses the same total parameter budget.
Original abstract
Adapting large pretrained models to diverse tasks is now routine, yet the two dominant strategies of parameter-efficient fine-tuning (PEFT) and low-rank compression are typically composed in sequence. This decoupled practice first compresses and then fine-tunes adapters, potentially misaligning the compressed subspace with downstream objectives and squandering a global parameter budget. To overcome this limitation, we introduce JACTUS (Joint Adaptation and Compression with a Task-aware Union of Subspaces), a single framework that unifies compression and adaptation. From a small calibration set, JACTUS estimates input and pre-activation gradient covariances, forms their orthogonal union with the pretrained weight subspace, performs a projected low-rank approximation inside this union, allocates rank globally by marginal gain per parameter, and trains only a compact core matrix. This explicitly mitigates the potential misalignment between the compressed subspace and downstream objectives by coupling the directions preserved for compression with those required for adaptation, yielding a deployable low-rank model that avoids retaining full frozen weights while enabling fast and robust tuning. On vision, JACTUS attains an average 89.2% accuracy on ViT-Base across eight datasets at 80% retained parameters, surpassing strong 100% PEFT baselines (e.g., DoRA 87.9%). On language, JACTUS achieves an 80.9% average on Llama2-7B commonsense QA at the same 80% retained-parameter budget, outperforming 100% PEFT (e.g., DoRA 79.7%) and exceeding prior compress-then-finetune pipelines under the same ratained-parameter budget. We will release code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JACTUS, a unified framework for joint low-rank compression and task adaptation of pretrained models. It estimates input and pre-activation gradient covariances from a small calibration set, forms their orthogonal union with the pretrained weight subspace, performs a projected low-rank approximation inside this union, allocates ranks globally by marginal gain per parameter, and trains only a compact core matrix. Empirical results claim average accuracies of 89.2% on ViT-Base across eight vision datasets and 80.9% on Llama2-7B commonsense QA at 80% retained parameters, outperforming 100% PEFT baselines such as DoRA and prior compress-then-finetune methods.
Significance. If the central construction holds, the work is significant because it directly couples compression directions with downstream adaptation objectives, avoiding the misalignment risk in sequential pipelines while producing a deployable low-rank model without retaining full frozen weights. The reported gains at fixed parameter budgets on both vision and language models, together with the promised code release, would support practical efficiency improvements for large-model adaptation.
Major comments (3)
- §3 (method description): the construction of the orthogonal union of the pretrained column space with the estimated covariance subspaces, followed by the projected low-rank factorization, is described at a high level without explicit matrix equations or pseudocode for the projection operator; this leaves unclear whether the subsequent low-rank model can recover task-critical directions that would be lost under a standard compress-then-adapt baseline.
- Experimental section (results on ViT-Base and Llama2-7B): the headline numbers (89.2% and 80.9% at 80% retention) rely on covariance estimates from an unspecified calibration-set size and distribution; no ablation or sensitivity analysis is provided on calibration-set size or diversity, which directly tests the weakest assumption that these estimates reliably span adaptation directions when unioned with the pretrained subspace.
- §4 (rank allocation): the global rank allocation by marginal gain per parameter is presented without a derivation or proof that the marginal-gain criterion remains optimal once the subspace is restricted to the task-aware union; this step is load-bearing for the claim that the joint procedure outperforms separate compression followed by adaptation under the same budget.
Minor comments (2)
- Abstract: typo 'ratained-parameter' should be 'retained-parameter'.
- Notation: the distinction between input covariance and pre-activation gradient covariance is introduced without consistent symbols across the text and any equations, hindering readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity, add missing details, and strengthen the presentation.
Point-by-point responses
Referee: §3 (method description): the construction of the orthogonal union of the pretrained column space with the estimated covariance subspaces, followed by the projected low-rank factorization, is described at a high level without explicit matrix equations or pseudocode for the projection operator; this leaves unclear whether the subsequent low-rank model can recover task-critical directions that would be lost under a standard compress-then-adapt baseline.
Authors: We agree that the description in §3 would benefit from greater mathematical precision. In the revision we will insert explicit matrix equations for the orthogonal union (pretrained weight subspace unioned with the column spaces of the estimated input and pre-activation gradient covariances) and for the subsequent projection-based low-rank factorization inside that union. Pseudocode for the full procedure will also be added. These additions will make explicit that task-critical directions identified by the gradient covariances are retained in the union and therefore available for recovery in the final low-rank model. revision: yes
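One plausible form for the promised equations (a sketch only; the symbols U_W^(k), V_W^(k), E_x^(k), E_g^(k) and the orth(·) operator are assumed notation, not the paper's):

```latex
% Covariances from the calibration set \{(x_i, g_i)\}_{i=1}^{n},
% with g_i = \partial \mathcal{L} / \partial (W x_i):
\Sigma_x = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^{\top},
\qquad
\Sigma_g = \frac{1}{n} \sum_{i=1}^{n} g_i g_i^{\top}
% Orthogonal union of the pretrained singular subspaces with the leading
% covariance eigenvector blocks:
U = \operatorname{orth}\bigl[\, U_W^{(k)} \;\; E_g^{(k)} \,\bigr],
\qquad
V = \operatorname{orth}\bigl[\, V_W^{(k)} \;\; E_x^{(k)} \,\bigr]
% Projected low-rank approximation with a compact trainable core C:
C = U^{\top} W V,
\qquad
\widehat{W} = U C V^{\top}
```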
Referee: Experimental section (results on ViT-Base and Llama2-7B): the headline numbers (89.2% and 80.9% at 80% retention) rely on covariance estimates from an unspecified calibration-set size and distribution; no ablation or sensitivity analysis is provided on calibration-set size or diversity, which directly tests the weakest assumption that these estimates reliably span adaptation directions when unioned with the pretrained subspace.
Authors: The referee is correct that calibration-set size and distribution are not stated. We will specify the exact sizes and sampling procedures used (random subsets drawn from the respective training sets, typically a few hundred to a thousand examples). We will also add a sensitivity ablation that varies calibration-set size (128, 512, 2048 samples) and reports downstream accuracy on representative vision and language tasks, thereby directly addressing the robustness of the covariance estimates when unioned with the pretrained subspace. revision: yes
Referee: §4 (rank allocation): the global rank allocation by marginal gain per parameter is presented without a derivation or proof that the marginal-gain criterion remains optimal once the subspace is restricted to the task-aware union; this step is load-bearing for the claim that the joint procedure outperforms separate compression followed by adaptation under the same budget.
Authors: We will add a concise derivation in the revised §4 showing that the marginal-gain-per-parameter rule is the standard greedy step for maximizing captured variance (trace of the projected covariance) subject to a total-parameter budget; because the union subspace is fixed before allocation, the same greedy argument applies inside the restricted subspace. A full optimality proof for the restricted case is not provided in the current manuscript and would require further theoretical development; we will note this limitation while retaining the empirical comparisons that support the practical advantage of the joint approach. revision: partial
- Outstanding: a rigorous optimality proof for the marginal-gain rank allocation inside the task-aware union subspace.
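The greedy criterion in the response above can be written compactly (assumed notation: F_ℓ(r) is the captured objective of layer ℓ at rank r, e.g. the trace of the projected covariance; c_ℓ the parameter cost of one rank unit; B the total budget):

```latex
\ell^{\star} = \arg\max_{\ell}
  \frac{F_{\ell}(r_{\ell} + 1) - F_{\ell}(r_{\ell})}{c_{\ell}},
\qquad
\text{subject to} \quad \sum_{\ell} c_{\ell}\, r_{\ell} \le B,
\qquad
F_{\ell}(r) = \sum_{j \le r} \sigma_{\ell, j}^{2}
% One rank unit is granted to \ell^{\star} per step; when each F_\ell is concave
% in r, this is the standard greedy heuristic for a budgeted selection problem.
```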
Circularity Check
No significant circularity in the JACTUS derivation chain
Full rationale
The paper presents an algorithmic pipeline: covariance estimation on a calibration set, orthogonal union with the pretrained weight subspace, projected low-rank approximation inside the union, marginal-gain global rank allocation, and training of a compact core matrix. These steps are standard linear-algebra operations on empirically derived quantities and do not reduce the central claim (joint compression-adaptation superiority) to a definition, a fitted input renamed as prediction, or a self-citation chain. Reported accuracies (89.2% ViT-Base, 80.9% Llama2-7B at 80% retained parameters) are empirical outcomes on downstream tasks, not tautological consequences of the inputs. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the method; the construction remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the orthogonal union of the pretrained weight subspace with the input and pre-activation gradient covariance subspaces preserves directions relevant to both compression fidelity and downstream task performance.