arxiv: 2605.15626 · v1 · pith:2CURW7NInew · submitted 2026-05-15 · 💻 cs.LG

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

Pith reviewed 2026-05-20 21:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM compression · SVD · low-rank approximation · post-training compression · adaptive rank allocation · KL divergence · model quantization

The pith

IO-SVD compresses LLMs by whitening both input activations and output prediction sensitivity to limit accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models remain expensive to store and run, so post-training methods that shrink their weight matrices without retraining are valuable. IO-SVD creates a whitening space that incorporates both the statistics of typical inputs and a measure of how weight changes affect the model's final token predictions. The output measure comes from expanding the KL divergence to second order over the most probable tokens. A separate step then assigns different compression levels to different singular components by estimating their individual contribution to loss under a fixed total budget. If the approach works, models can be made substantially smaller while still producing nearly the same answers on downstream tasks and running faster at inference time.

Core claim

IO-SVD forms a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, it constructs an output-side metric that captures predictive sensitivity, while input whitening captures activation statistics. It further introduces an efficient heterogeneous rank-allocation strategy that scores whitened singular components using first-order calibration loss estimates and prunes the least sensitive components under a global budget. The same sensitivity estimates also guide loss-aware remapping when combining the low-rank factors with 8-bit quantization.
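
The rank-allocation step described above can be illustrated with a minimal sketch. It assumes the per-component sensitivity scores are already available and pools every component across layers, dropping the least sensitive ones until a parameter budget is met. The function name `allocate_ranks`, the per-component cost model, and the plain greedy order are assumptions made for illustration; the paper's actual procedure may differ.

```python
import numpy as np

def allocate_ranks(scores, costs, budget):
    """Greedy global pruning sketch (hypothetical helper, not the paper's code).

    scores: list of 1-D arrays, one per layer; scores[l][i] is the estimated
            loss increase from removing component i of layer l (assumed given).
    costs:  parameters stored per retained component in each layer
            (m + n for an m x n matrix kept as two thin factors).
    budget: total number of parameters allowed across all layers.
    Returns one boolean keep-mask per layer.
    """
    keep = [np.ones(len(s), dtype=bool) for s in scores]
    total = sum(c * len(s) for s, c in zip(scores, costs))
    # Pool every (score, layer, index) triple; visit the least sensitive first.
    pooled = sorted(
        (float(s[i]), l, i) for l, s in enumerate(scores) for i in range(len(s))
    )
    for _, l, i in pooled:
        if total <= budget:
            break
        keep[l][i] = False
        total -= costs[l]
    return keep

# Two toy layers with 3 and 2 components.
scores = [np.array([5.0, 0.1, 3.0]), np.array([0.2, 4.0])]
keep = allocate_ranks(scores, costs=[10, 20], budget=40)
ranks = [int(k.sum()) for k in keep]  # per-layer ranks after pruning
```

Because the budget is global, layers end up with different ranks. Dividing each score by its layer's cost before sorting would be one possible variation when layers differ greatly in size.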

What carries the argument

The KL-aware double-sided whitening space that combines input activation statistics with an output metric derived from second-order KL expansion over top token probabilities.
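
A minimal sketch of what a double-sided whitened truncation could look like follows. It assumes the input side is whitened with a Cholesky factor of the calibration second-moment matrix and the output side with a Cholesky factor of a symmetric positive semidefinite metric `H`. The helper names, the Cholesky choice, and the `eps` damping are assumptions of this sketch, not details confirmed by the source.

```python
import numpy as np

def whitening_factors(X, H, eps=1e-6):
    """Cholesky factors of the input second moment and the output metric.

    X: (n, N) calibration inputs; H: (m, m) symmetric PSD output-side metric.
    eps adds damping so both factorizations exist (an assumption of this sketch).
    """
    n, m = X.shape[0], H.shape[0]
    L_in = np.linalg.cholesky(X @ X.T / X.shape[1] + eps * np.eye(n))
    C = np.linalg.cholesky(H + eps * np.eye(m))
    return L_in, C

def io_whitened_truncate(W, L_in, C, k):
    """Rank-k factors of W (m x n) minimizing ||C^T (W - W_hat) L_in||_F."""
    B = C.T @ W @ L_in                                # whitened weight
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    left = np.linalg.solve(C.T, U[:, :k] * s[:k])     # C^{-T} U_k S_k, shape (m, k)
    right = np.linalg.solve(L_in.T, Vt[:k].T).T       # V_k^T L_in^{-1}, shape (k, n)
    return left, right

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
X = rng.standard_normal((6, 200))
A = rng.standard_normal((8, 8))
L_in, C = whitening_factors(X, A @ A.T)
left, right = io_whitened_truncate(W, L_in, C, k=3)
W_hat = left @ right                                  # rank-3 replacement for W
```

By the Eckart-Young theorem applied to the whitened matrix, no other rank-k matrix has a smaller error in the weighted norm, so truncating `W` directly can only match or exceed it.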

If this is right

  • Models retain higher task performance at the same compression ratio compared with input-only whitening.
  • Inference speed increases because the resulting low-rank matrices require fewer operations during forward passes.
  • Hybrid low-rank plus quantization achieves better quality by using the predicted loss change to decide which low-rank factor rows to quantize to 8 bits.
  • The same construction applies to both pure language models and vision-language models with only minor changes to the calibration data.
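
The loss-aware remapping mentioned in the list above, and outlined in the Figure 2 caption, might be sketched as follows. The sketch assumes a gradient `G` of the calibration loss with respect to a factor `F` is already available, uses symmetric per-row int8 quantization, and reads the budget (Crem in the caption) as a byte budget. All three choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def quantize_int8_rows(F):
    """Symmetric per-row int8 round trip (quantize, then dequantize)."""
    scale = np.abs(F).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    return np.clip(np.round(F / scale), -127, 127) * scale

def remap_rows(F, G, byte_budget):
    """Choose rows of factor F to store in int8 so storage fits byte_budget.

    G is dLoss/dF on calibration data (assumed given). Each row is scored by
    the first-order loss change |<G_row, Q(F_row) - F_row>|.
    Returns a boolean mask: True = int8, False = fp16.
    Per-row scale storage is ignored for simplicity.
    """
    delta = quantize_int8_rows(F) - F
    scores = np.abs((G * delta).sum(axis=1))
    rows, cols = F.shape
    total = 2 * rows * cols                           # start with every row in fp16
    use_int8 = np.zeros(rows, dtype=bool)
    for r in np.argsort(scores):                      # least sensitive rows first
        if total <= byte_budget:
            break
        use_int8[r] = True
        total -= cols                                 # int8 saves 1 byte per entry
    return use_int8

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 8))
G = rng.standard_normal((4, 8))
mask = remap_rows(F, G, byte_budget=48)               # 64 bytes in fp16, target 48
```

Rows whose quantization error aligns with the loss gradient stay in fp16, while the remaining rows absorb the precision cut.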

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The whitening construction could be tested with other divergence measures or with full-sequence losses instead of top-K tokens.
  • The global budget for rank allocation might be replaced by per-layer hardware constraints to meet specific latency targets.
  • Sensitivity scores computed once could be reused to guide further compression steps such as pruning or distillation.

Load-bearing premise

The second-order expansion of the KL loss over the top-K token probabilities accurately captures predictive sensitivity for the output-side metric.
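
To make this premise concrete, the sketch below checks the standard second-order expansion of the KL divergence between two softmax distributions whose logits differ by a small shift `delta`: the quadratic term is one half of `delta^T (diag(p) - p p^T) delta`. Restricting to the K largest logits and renormalizing `p` over them is an assumption of this sketch; the source does not spell out how the top-K restriction is implemented.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def topk_kl_metric(logits, k):
    """Curvature diag(p) - p p^T over the K largest logits (p renormalized)."""
    idx = np.argsort(logits)[-k:]
    p = np.exp(log_softmax(logits[idx]))
    return idx, np.diag(p) - np.outer(p, p)

def exact_kl(z, delta):
    """KL(softmax(z) || softmax(z + delta))."""
    lp, lq = log_softmax(z), log_softmax(z + delta)
    return float(np.sum(np.exp(lp) * (lp - lq)))

rng = np.random.default_rng(0)
logits = rng.standard_normal(100)
idx, H_k = topk_kl_metric(logits, k=10)
delta = 1e-3 * rng.standard_normal(10)                # small logit perturbation
quadratic = 0.5 * delta @ H_k @ delta
exact = exact_kl(logits[idx], delta)
rel_gap = abs(exact - quadratic) / quadratic          # shrinks as delta shrinks
```

The gap between the exact and quadratic values grows with the size of `delta`, which is why large truncation errors are the regime where the premise deserves scrutiny. The curvature matrix is also singular along the all-ones direction, so any Cholesky-style whitening built on it would need damping or a projection.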

What would settle it

Apply the chosen ranks and factors to a held-out calibration set, then measure the actual change in KL divergence or downstream accuracy and check whether it matches the first-order and second-order estimates used to select the components.
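
A toy version of this check is sketched below. It uses a squared-error loss as a stand-in for the paper's KL objective, which is a loud simplification, and compares the first-order predicted loss change from deleting each singular component against the change actually measured. The function names and the toy data are hypothetical.

```python
import numpy as np

def loss(B, X, Y):
    """Toy squared-error loss; a stand-in, not the paper's KL objective."""
    return 0.5 * np.sum((B @ X - Y) ** 2)

def component_estimates(B, X, Y):
    """Predicted (first-order) and measured loss change per deleted component."""
    G = (B @ X - Y) @ X.T                             # dLoss/dB for the toy loss
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    base = loss(B, X, Y)
    predicted, actual = [], []
    for i in range(len(s)):
        D = -s[i] * np.outer(U[:, i], Vt[i])          # perturbation deleting component i
        predicted.append(np.sum(G * D))               # first-order term <G, D>
        actual.append(loss(B + D, X, Y) - base)       # measured change
    return np.array(predicted), np.array(actual)

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 5))
X = rng.standard_normal((5, 50))
Y = rng.standard_normal((4, 50))
predicted, actual = component_estimates(B, X, Y)
gap = actual - predicted                              # what the first-order score misses
```

For this quadratic toy loss the gap is exactly the second-order term, so it is never negative and grows with the squared singular value. On a real model, the same comparison on held-out data would show how far the estimates drift at aggressive compression ratios.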

Figures

Figures reproduced from arXiv: 2605.15626 by Ali Abbasi, Chayne Thrash, Hamed Pirsiavash, Haoran Qin, Soheil Kolouri.

Figure 1: Overview of IO-SVD. (a) Comparison of whitening strategies: standard SVD reconstructs the weight directly, one-sided whitening incorporates only input activation statistics, and double-sided whitening incorporates both input statistics and output-side sensitivity before SVD. (b) Heterogeneous rank allocation. For each whitened matrix B, singular components are sorted by singular-value magnitude, and the sm…
Figure 2: Loss-aware remapping: (a) SVD-truncate each weight to rank k; (b) score factor rows by first-order calibration-loss change under int8 quantization; (c) greedily keep low-score rows in int8 until meeting Crem; (d) assign the remaining rows to fp16.
3.2 Adaptive rank allocation. The SVD solution above assumes fixed per-layer ranks. Under a global compression budget, we instead allocate ranks by estimating the…
Figure 3: Top-K ablation for output-side KL curvature. Normalized perplexity on wiki2, C4, PTB
Figure 4: Throughput vs. peak memory on LLaMA-2-7B (batch 64, seq 1024+1024).
Peak GPU memory. The dense baseline consumes 77.6 GB, dominated by a 64.0 GB KV cache and a 12.6 GB weight tensor. IO-SVD without cache optimization shrinks the weight footprint to 5.4 GB but leaves the KV cache untouched at 64.0 GB, giving a 70.4 GB peak. Adding V-cache compression reduces the cache to 38.8 GB and yields a 50.3 GB peak,…
Original abstract

Large language models deliver strong performance across language and reasoning tasks, but their storage and compute costs remain major barriers to deployment in resource-constrained and latency-sensitive settings. SVD-based post-training compression offers a hardware-agnostic way to reduce model size and improve inference efficiency through low-rank factorization. However, existing methods often rely on input-only whitening spaces, homogeneous rank allocation, or loss-agnostic allocation heuristics, limiting their ability to preserve model quality under aggressive compression. We propose Input-Output Whitened SVD (IO-SVD), a post-training compression method that forms a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, IO-SVD constructs an output-side metric that captures predictive sensitivity, while input whitening captures activation statistics. We further introduce an efficient heterogeneous rank-allocation strategy that scores whitened singular components using first-order calibration loss estimates and prunes the least sensitive components under a global budget. Inspired by prior work that combines SVD truncation with quantization, we improve hybrid SVD-quantization compression through loss-aware remapping, which selects low-rank factor rows for 8-bit quantization based on the predicted loss change incurred by quantizing them. Extensive experiments across diverse LLM and VLM families and inference-time analysis show that IO-SVD compresses LLMs with minimal performance degradation while delivering practical inference speedups. Code is available at https://github.com/mint-vu/IO-SVD.git

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces IO-SVD, a post-training SVD-based compression technique for LLMs and VLMs. It constructs a double-sided whitening space by combining input-side activation statistics with an output-side metric derived from a second-order Taylor expansion of the KL divergence over top-K token probabilities. This metric guides heterogeneous rank allocation via first-order calibration loss estimates under a global budget, and the method is extended to loss-aware remapping for hybrid SVD-quantization. Experiments across model families report minimal degradation alongside inference speedups, with code released.

Significance. If the core approximations and empirical results hold, IO-SVD would offer a hardware-agnostic, loss-aware approach to adaptive-rank compression that improves upon input-only or homogeneous baselines. The public code and cross-family evaluation strengthen reproducibility and practical utility for deployment.

major comments (2)
  1. [IO-SVD construction] IO-SVD construction section: the second-order expansion of KL divergence over top-K probabilities is used to define the output-side whitening metric and sensitivity scores for rank allocation. No explicit bound or empirical check is provided showing that higher-order terms remain negligible under the target compression ratios, where large singular-value truncation can produce non-local output shifts. This approximation is load-bearing for the central claim of minimal degradation.
  2. [rank allocation] Heterogeneous rank-allocation paragraph: first-order calibration loss estimates are computed in the whitened space to prune components. If these estimates reuse the same calibration data or whitening transform that defines the metric, the procedure risks circularity; an explicit statement of data separation or a reduction showing the estimates are independent of the fitted parameters is needed to support the allocation strategy.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'minimal performance degradation' is repeated without quantitative qualifiers; adding a brief range of reported perplexity or accuracy drops would improve precision.
  2. Notation: the input and output whitening matrices are introduced without an explicit equation linking them to the final low-rank factors; a single displayed equation would clarify the double-sided construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the IO-SVD method, particularly regarding the validity of the second-order KL approximation and the independence of the rank-allocation estimates. We address each major comment below and will revise the manuscript to incorporate additional validation and clarifications.

Point-by-point responses
  1. Referee: [IO-SVD construction] IO-SVD construction section: the second-order expansion of KL divergence over top-K probabilities is used to define the output-side whitening metric and sensitivity scores for rank allocation. No explicit bound or empirical check is provided showing that higher-order terms remain negligible under the target compression ratios, where large singular-value truncation can produce non-local output shifts. This approximation is load-bearing for the central claim of minimal degradation.

    Authors: We acknowledge the importance of validating the second-order Taylor expansion of the KL divergence. In the revised manuscript, we will add an empirical section that compares the approximated output-side metric against the exact KL divergence computed on a held-out calibration set for compression ratios ranging from 2x to 4x. We will also include a brief analysis referencing approximation bounds from loss landscape literature to discuss when higher-order terms remain small, thereby supporting the claim of minimal degradation under the evaluated settings. revision: yes

  2. Referee: [rank allocation] Heterogeneous rank-allocation paragraph: first-order calibration loss estimates are computed in the whitened space to prune components. If these estimates reuse the same calibration data or whitening transform that defines the metric, the procedure risks circularity; an explicit statement of data separation or a reduction showing the estimates are independent of the fitted parameters is needed to support the allocation strategy.

    Authors: We agree that explicit separation is necessary to avoid any appearance of circularity. The input whitening transform is computed solely from activation statistics on a first calibration subset, while the first-order loss estimates for heterogeneous rank allocation are performed on a disjoint second calibration subset that does not influence the whitening matrix. In the revision, we will add an explicit statement of this data separation protocol along with a short empirical check confirming that the sensitivity scores remain stable when the whitening transform is held fixed from the first subset. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper constructs the output-side metric via an explicit second-order Taylor expansion of KL divergence over top-K probabilities (abstract and IO-SVD construction), which is an independent approximation rather than a self-definition or fitted input renamed as prediction. Heterogeneous rank allocation scores components using first-order calibration loss estimates, on data that the simulated rebuttal describes as a disjoint calibration subset; this is a standard post-training technique that does not reduce by construction to the whitening space or target performance metrics. No load-bearing self-citations, uniqueness theorems imported from authors, or ansatzes smuggled via prior work are present in the core steps. The hybrid SVD-quantization remapping similarly relies on a predicted loss change, which the abstract and the Figure 2 caption describe as a first-order calibration-loss estimate, without circular re-use of fitted values. The derivation remains self-contained against external benchmarks and does not equate outputs to inputs by definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the assumption that a second-order KL expansion provides a reliable sensitivity metric and that first-order loss estimates suffice for pruning decisions under a global budget.

free parameters (1)
  • global rank budget
    Controls total compression ratio; chosen to meet target size while minimizing predicted loss.
axioms (1)
  • domain assumption: Second-order Taylor expansion of KL divergence over top-K token probabilities approximates output sensitivity
    Invoked to construct the output-side whitening metric without full loss computation.

pith-pipeline@v0.9.0 · 5812 in / 1168 out tokens · 40592 ms · 2026-05-20T21:10:54.364427+00:00 · methodology
