Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 02:19 UTC · model grok-4.3
The pith
A linear router trained on dense outputs can pick prompt-specific SVD ranks to improve accuracy and speed in compressed LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PARSE decouples rank selection from any fixed calibration set by supervising a linear router against dense-model outputs on a broad corpus. This router assigns different SVD ranks to different inputs at inference time. Because rank-selection patterns are shared across semantically similar prompts and stay consistent during decoding, the chosen rank subsets can be served from a pattern cache. Expert memory aggregation and kernel fusion then keep the added overhead low. Integrated with four representative SVD pipelines, the method improves average task accuracy by up to 10 percent at a 0.6 compression ratio on LLaMA-7B while achieving up to 2.5× prefill and 2.4× decode speedup over native SVD.
What carries the argument
A linear router that maps prompt embeddings to rank choices, trained offline by supervising against dense-model outputs rather than calibration data, which selects input-specific SVD rank subsets at inference.
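To make the mechanism concrete, here is a minimal sketch, assuming a PyTorch-style setup, of what such a router could look like: a single linear layer scores the candidate singular components from a prompt embedding, and a top-k mask keeps the selected ones. The class name `RankRouter`, the dimensions, and the fixed-k policy are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RankRouter(nn.Module):
    """Illustrative linear router: prompt embedding -> scores over r_max candidate rank components."""
    def __init__(self, embed_dim: int, r_max: int):
        super().__init__()
        self.scorer = nn.Linear(embed_dim, r_max)  # one score per candidate singular component

    def forward(self, prompt_embedding: torch.Tensor, k: int) -> torch.Tensor:
        scores = self.scorer(prompt_embedding)            # (batch, r_max)
        top_idx = scores.topk(k, dim=-1).indices          # indices of the k selected components
        mask = torch.zeros_like(scores).scatter_(-1, top_idx, 1.0)
        return mask                                       # binary rank-selection pattern for each prompt

# Usage sketch: keep 128 of 512 candidate components for two prompt embeddings.
router = RankRouter(embed_dim=4096, r_max=512)
pattern = router(torch.randn(2, 4096), k=128)             # shape (2, 512), 128 ones per row
```

The mask would then gate which columns of the SVD factors participate in each prompt's forward pass; how PARSE applies the selection inside the compressed layers is not specified by this sketch.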
If this is right
- Existing SVD-based compression pipelines can be upgraded by adding the router without altering their core low-rank decomposition step.
- Each prompt receives only the singular components it needs, reducing average compute and memory traffic during both prefill and decode.
- Rank patterns that repeat across similar prompts allow caching, so router cost becomes negligible for long generations.
- The accuracy and speedup benefits hold when the router is combined with any of the four tested SVD methods.
- Rank selection no longer depends on the particular calibration dataset used to build the compressed model.
Where Pith is reading between the lines
- The same prompt-conditioned selection principle could be tested on other low-rank or sparse compression schemes that currently use static truncation.
- If the router generalizes across domains, it could reduce the need for repeated calibration when deploying compressed models to new tasks.
- The observed stability of rank choices across decoding steps suggests that prompt-level decisions can be made once per sequence instead of per token.
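If that per-sequence stability holds, the pattern cache reduces to a nearest-neighbor lookup over prompt embeddings. A minimal sketch under that assumption follows; the `PatternCache` name, the cosine-similarity keying, and the 0.9 threshold are illustrative choices, not details taken from the paper.

```python
import numpy as np

class PatternCache:
    """Illustrative cache: reuse a stored rank pattern when a new prompt embedding is close enough."""
    def __init__(self, sim_threshold: float = 0.9):
        self.sim_threshold = sim_threshold
        self.keys = []       # unit-normalized prompt embeddings
        self.patterns = []   # rank-selection masks chosen for those prompts

    def lookup_or_insert(self, embedding: np.ndarray, compute_pattern):
        e = embedding / np.linalg.norm(embedding)
        if self.keys:
            sims = np.stack(self.keys) @ e                # cosine similarity to cached prompts
            best = int(np.argmax(sims))
            if sims[best] >= self.sim_threshold:
                return self.patterns[best]                # hit: reuse the cached pattern, skip the router
        pattern = compute_pattern(embedding)              # miss: run the router once for this prompt
        self.keys.append(e)
        self.patterns.append(pattern)
        return pattern
```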
Load-bearing premise
A linear router trained offline on dense-model outputs from a large-scale corpus will reliably generalize to select suitable ranks for new, unseen prompts without being overly sensitive to the choice of training data.
What would settle it
Measure task accuracy on a held-out prompt set drawn from a different distribution than the router's training corpus; if the gains over static SVD vanish or reverse, the central claim does not hold.
Original abstract
Large language models (LLMs) have rapidly grown in scale, creating substantial memory and computational costs that hinder efficient deployment. Singular value decomposition (SVD) has emerged as an effective post-training compression technique, but existing SVD-based methods rely on static rank truncation, applying a fixed prefix of singular components to all inputs regardless of their diversity. We identify two limitations of this static design: the optimal rank varies across individual prompts, and the selected rank is sensitive to the choice of calibration set, leading to suboptimal performance across diverse inputs. To address these challenges, we propose $\textbf{PARSE}$, a post-training framework for $\textbf{P}$rompt-$\textbf{A}$ware $\textbf{R}$ank $\textbf{S}$election as $\textbf{E}$xperts in SVD-compressed LLMs. PARSE trains a linear router offline to perform prompt-aware rank selection, decoupling it from calibration information by supervising the router against dense-model outputs on a large-scale corpus. We further observe that rank-selection patterns are shared across semantically similar prompts and remain stable across decoding steps, allowing appropriate rank subsets to be served directly from a pattern cache at inference. Complemented by expert memory aggregation and kernel fusion for system-level efficiency, PARSE is orthogonal to existing SVD-based pipelines and consistently improves both model quality and inference efficiency. Integrated with four representative SVD-based methods, PARSE improves average task accuracy by up to 10% at a compression ratio of 0.6 on LLaMA-7B, and achieves up to 2.5 $\times$ prefill and 2.4 $\times$ decode speedup over native SVD execution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PARSE, a post-training framework for prompt-aware dynamic rank selection in SVD-compressed LLMs. It identifies that static rank truncation is suboptimal because optimal ranks vary across prompts and are sensitive to calibration sets. PARSE trains a linear router offline, supervised on dense-model outputs from a large-scale corpus, to select per-layer rank subsets for each prompt. It exploits observed stability of rank patterns across semantically similar prompts and decoding steps via a pattern cache, plus expert memory aggregation and kernel fusion. When integrated with four SVD baselines, it reports up to 10% average task accuracy gain at 0.6 compression on LLaMA-7B and up to 2.5× prefill / 2.4× decode speedups over native SVD.
Significance. If the linear router generalizes reliably, the work offers a lightweight, calibration-decoupled way to improve existing SVD pipelines without retraining the base model or introducing non-linear overhead. The orthogonality claim and the use of offline dense supervision are positive features; reproducible speedups from caching and fusion would be practically useful for deployment.
Major comments (3)
- §3 (router training): the headline accuracy claim (up to 10% at ratio 0.6) rests on the linear router generalizing from a fixed large-scale corpus to unseen prompts. No quantitative evidence is provided that rank-selection patterns are linearly separable (e.g., no separability metrics, no comparison to non-linear routers, no sensitivity analysis to corpus choice). If the mapping is corpus-dependent or requires non-linear decision boundaries, the reported gains become an artifact of the particular training distribution rather than a general property.
- §4 (experiments): the integration results with four SVD methods lack error bars, multiple random seeds, or statistical significance tests. Without these, it is impossible to determine whether the 10% average accuracy lift is robust or driven by post-hoc choices of prompts, tasks, or calibration data.
- §4.3 (generalization): the claim that rank patterns are shared across semantically similar prompts and stable across decoding steps is stated but not supported by any clustering, similarity, or stability metrics. A quantitative validation (e.g., intra-cluster variance of selected ranks or cross-prompt transfer accuracy) is needed to justify the pattern-cache design.
Minor comments (2)
- §3.2: notation for the router input features and the exact supervision loss (cross-entropy on dense logits?) should be defined explicitly with equations (one hedged candidate form is sketched after this list).
- Abstract and §4: the precise compression ratio definition (parameter count, FLOPs, or memory) should be stated, and the effective rank distribution chosen by the router should be reported.
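On the first minor comment, one plausible candidate form of the offline supervision objective, reusing the router notation $f_\theta$ and $R(x) = \mathrm{TopK}(f_\theta(x))$ quoted in the theorem-link section below, is a divergence between the dense model's output distribution and that of the compressed model under the selected ranks. The paper does not state the exact loss, so this is only a sketch of what the requested equation could look like:

$$\mathcal{L}(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D}} \Big[\, \mathrm{KL}\big(\, p_{\mathrm{dense}}(\cdot \mid x) \,\big\|\, p_{\mathrm{SVD},\, R_\theta(x)}(\cdot \mid x) \,\big) \Big], \qquad R_\theta(x) = \mathrm{TopK}\big(f_\theta(x)\big).$$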
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the paper without altering its core claims.
Point-by-point responses
Referee: [§3] the headline accuracy claim (up to 10% at ratio 0.6) rests on the linear router generalizing from a fixed large-scale corpus to unseen prompts. No quantitative evidence is provided that rank-selection patterns are linearly separable (e.g., no separability metrics, no comparison to non-linear routers, no sensitivity analysis to corpus choice).
Authors: We agree that explicit evidence of linear separability would strengthen the justification for our router design. In the revised manuscript we will add a dedicated analysis subsection that reports (i) a linear separability metric (ratio of between-class to within-class scatter on rank-label embeddings), (ii) a direct comparison of the linear router against a small two-layer MLP on the same supervision data, and (iii) sensitivity results obtained by training on random 50% and 25% subsets of the corpus and evaluating on held-out prompts. These additions will demonstrate that the observed gains are not artifacts of the particular training distribution. revision: yes
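A minimal sketch of the scatter-ratio separability metric described in (i), assuming prompt embeddings grouped by the rank pattern they are assigned; this Fisher-style trace ratio is one standard choice and not necessarily the authors' exact definition.

```python
import numpy as np

def scatter_ratio(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Between-class to within-class scatter ratio (higher suggests easier linear separation)."""
    overall_mean = embeddings.mean(axis=0)
    s_between, s_within = 0.0, 0.0
    for c in np.unique(labels):
        group = embeddings[labels == c]
        mean_c = group.mean(axis=0)
        s_between += len(group) * float(np.sum((mean_c - overall_mean) ** 2))
        s_within += float(np.sum((group - mean_c) ** 2))
    return s_between / max(s_within, 1e-12)

# Usage sketch: labels would be IDs of the rank patterns selected for each prompt.
# ratio = scatter_ratio(prompt_embeddings, rank_pattern_ids)
```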
Referee: [§4] the integration results with four SVD methods lack error bars, multiple random seeds, or statistical significance tests.
Authors: We acknowledge that variability measures are necessary to establish robustness. We will re-execute the main accuracy and speedup experiments across three independent random seeds, report mean and standard deviation for all task accuracies, and include paired t-tests (or Wilcoxon signed-rank tests where normality assumptions fail) for the reported improvements over the four SVD baselines. These statistics will be added to Tables 2–4 and the corresponding text in §4. revision: yes
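The paired tests mentioned here can be run directly with SciPy; a short sketch follows, using placeholder per-task accuracy pairs (not results from the paper) purely to show the call pattern.

```python
import numpy as np
from scipy import stats

# Placeholder per-task accuracies for matched tasks/seeds (illustrative values, not reported results).
static_svd = np.array([0.412, 0.455, 0.501, 0.388, 0.467, 0.439])
parse_acc  = np.array([0.448, 0.471, 0.522, 0.407, 0.489, 0.452])

t_stat, p_t = stats.ttest_rel(parse_acc, static_svd)   # paired t-test (assumes roughly normal differences)
w_stat, p_w = stats.wilcoxon(parse_acc, static_svd)    # Wilcoxon signed-rank test (no normality assumption)
print(f"paired t-test p={p_t:.4f}, Wilcoxon p={p_w:.4f}")
```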
Referee: [§4.3] the claim that rank patterns are shared across semantically similar prompts and stable across decoding steps is stated but not supported by any clustering, similarity, or stability metrics.
Authors: We recognize that quantitative validation is required to support the pattern-cache design. In the revision we will insert (i) intra-cluster variance of selected ranks when prompts are clustered by sentence-BERT embeddings, (ii) cross-prompt transfer accuracy when a router trained on one cluster is evaluated on another, and (iii) per-layer rank-change frequency and average stability score across decoding steps on long sequences. These metrics will be presented in a new paragraph in §4.3 together with the existing qualitative observations. revision: yes
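Two of the proposed stability metrics admit straightforward implementations; the sketch below assumes per-step binary rank masks and per-prompt effective ranks have already been recorded, and the data layout is an assumption for illustration.

```python
import numpy as np

def rank_change_frequency(step_patterns: np.ndarray) -> float:
    """Fraction of decoding steps at which the selected rank pattern changes.

    step_patterns: (num_steps, r_max) binary masks recorded at each decoding step (num_steps >= 2).
    """
    changes = np.any(step_patterns[1:] != step_patterns[:-1], axis=-1)
    return float(changes.mean())

def intra_cluster_rank_variance(effective_ranks: np.ndarray, cluster_ids: np.ndarray) -> float:
    """Mean variance of the chosen effective rank within each prompt cluster.

    effective_ranks: (num_prompts,) number of components selected per prompt.
    cluster_ids: (num_prompts,) cluster assignment, e.g., from sentence-BERT embeddings.
    """
    return float(np.mean([effective_ranks[cluster_ids == c].var()
                          for c in np.unique(cluster_ids)]))
```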
Circularity Check
Minor self-citation present but not load-bearing; router supervision remains externally grounded
Full rationale
The paper's central mechanism trains a linear router offline by supervising it directly against dense-model outputs on an external large-scale corpus, decoupling rank selection from any calibration set used in SVD truncation. This provides independent grounding outside the compressed model itself. No equations or steps reduce by construction to fitted inputs renamed as predictions, no uniqueness theorems are imported from the same authors, and no ansatz is smuggled via self-citation. The abstract explicitly states the supervision source and orthogonality to existing SVD pipelines, making the derivation self-contained against external benchmarks. A score of 2 accounts for the possibility of routine self-citations in the full text that do not carry the core claim.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We reformulate each SVD-compressed weight matrix as a mixture of rank experts... linear router f_θ : R^n → R^rmax... R(x) = TopK(f_θ(x))... supervised against dense-model outputs on a large-scale corpus."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "rank-selection patterns are shared across semantically similar prompts and remain stable across decoding steps"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.