arXiv: 2606.01412 · v1 · pith:L6UUG3GVnew · submitted 2026-05-31 · 💻 cs.LG · cs.IT · math.IT

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

Pith reviewed 2026-06-28 17:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT
keywords post-training quantization · low-rank adaptation · GPTQ · information-theoretic bounds · layer-wise reconstruction · low-precision weights · Hessian augmentation

The pith

GPTQ-intrinsic LoRA folds low-rank corrections into the quantization pass to replace full-norm error dependence with a rank-r residual norm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve aggressive low-bit post-training quantization by pairing quantized weights with a low-rank correction of the form Q + LR while respecting finite-alphabet and bounded-rank constraints. It first derives information-theoretic lower bounds on the layer-wise reconstruction objective ||XW - X(Q + LR)||_F^2. The proposed algorithm augments the calibration Hessian so that the low-rank term is optimized jointly with the quantized weights inside a single GPTQ-style pass. When the low-rank factor is set to the top right singular vectors of the calibration matrix, the resulting error bounds scale with the residual norm ||X - X_r||_F^2 rather than the full ||X||_F^2, and these bounds are shown to match the lower bounds in dominant scaling under natural structural assumptions on the data or weights. The same framework also supplies a refinement step that alternates quantization and low-rank compensation while guaranteeing non-increasing error.
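
The quantities in this paragraph can be made concrete with a small NumPy sketch. It evaluates the layer-wise objective and compares ||X||_F^2 with the rank-r residual ||X - X_r||_F^2 on synthetic data. The matrix sizes, the synthetic X, the step-0.5 rounding grid, and the post-hoc least-squares choice of R are all assumptions made for illustration; this is not the paper's algorithm.

```python
import numpy as np

def recon_error(X, W, Q, L, R):
    """Layer-wise objective ||XW - X(Q + LR)||_F^2."""
    return float(np.linalg.norm(X @ W - X @ (Q + L @ R), "fro") ** 2)

def top_right_singular_vectors(X, r):
    """V_r: the r right singular vectors of X with the largest singular values."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:r].T  # shape (N, r)

def rank_r_residual(X, r):
    """||X - X_r||_F^2: the sum of squared singular values beyond the r-th."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s[r:] ** 2))

rng = np.random.default_rng(0)
m, N, N_out, r = 256, 32, 16, 4  # hypothetical sizes

# Synthetic calibration matrix with a few dominant directions (assumption).
X = 5.0 * (rng.standard_normal((m, r)) @ rng.standard_normal((r, N)))
X = X + rng.standard_normal((m, N))
W = rng.standard_normal((N, N_out))

Q = np.round(W * 2) / 2  # toy rounding to a step-0.5 grid (assumption)
L = top_right_singular_vectors(X, r)
# Post-hoc least-squares R for this fixed Q and L (illustration only).
R, *_ = np.linalg.lstsq(X @ L, X @ (W - Q), rcond=None)

err_q = recon_error(X, W, Q, L, np.zeros((r, N_out)))
err_qlr = recon_error(X, W, Q, L, R)
full_norm = float(np.linalg.norm(X, "fro") ** 2)
resid_norm = rank_r_residual(X, r)
print(f"||X||_F^2 = {full_norm:.1f}, ||X - X_r||_F^2 = {resid_norm:.1f}")
print(f"error with Q only = {err_q:.3f}, with Q + LR = {err_qlr:.3f}")
```

Because R = 0 is always a feasible choice, the least-squares R can never do worse than quantization alone on this objective.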

Core claim

By choosing L equal to the top right singular vectors V_r of the calibration matrix X and incorporating the low-rank compensation directly through Hessian augmentation, the layer-wise reconstruction error admits provable bounds in which the usual GPTQ term proportional to ||X||_F^2 is replaced by a term proportional to the rank-r residual ||X - X_r||_F^2 (up to regularization). Under the paper's natural structural assumptions these bounds match the information-theoretic lower bounds in their leading scaling, up to constants and mild factors. The algorithm remains training-free and is accompanied by the Bid-Up refinement procedure that alternates optimal low-rank compensation with fixed-grid
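
A simplified post-hoc calculation shows why a rank-r residual can appear at all. It is an illustration under stated assumptions, not the paper's proof: fix any quantized Q, write E = W - Q, take the SVD X = U Σ V^T with L = V_r, let X_r = X V_r V_r^T denote the best rank-r approximation of X, and choose R = V_r^T E.

```latex
\begin{aligned}
XW - X(Q + V_r R) &= XE - X V_r V_r^\top E = (X - X_r)\,E, \\
\|XW - X(Q + V_r R)\|_F^2 &= \|(X - X_r)\,E\|_F^2
  \le \|X - X_r\|_2^2\,\|E\|_F^2
  \le \|X - X_r\|_F^2\,\|E\|_F^2 .
\end{aligned}
```

In this toy setting the error is governed by X - X_r rather than X. The paper's bounds concern the joint GPTQ-style pass and may carry different constants and regularization terms.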

What carries the argument

GPTQ-intrinsic LoRA, the procedure that augments the calibration Hessian to embed the low-rank correction L = V_r inside the quantization pass itself.
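
One way to read this, following the Figure 2 caption further down the page, is as ordinary sequential rounding applied to augmented matrices. The sketch below assumes the augmentation is X_aug = [X, XL] and W_aug = [W; 0], so that X_aug W_aug = XW. It also uses a damped least-squares form of error diffusion rather than the Cholesky-based updates found in GPTQ implementations. Both choices, together with all function names, the grid, and the damping value, are assumptions for illustration and are not taken from the paper's code.

```python
import numpy as np

def quantize_to_grid(x, grid):
    """Round each entry of x to the nearest element of a finite grid."""
    x = np.asarray(x, dtype=float)
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)
    return grid[idx]

def sequential_round(X, W, grid, n_steps, damp=1e-6):
    """Quantize the first n_steps rows of W, one row at a time.

    After rounding row t, the remaining rows are shifted by a damped
    least-squares fit so that X @ W changes as little as possible.
    """
    W = W.astype(float).copy()
    n = W.shape[0]
    for t in range(n_steps):
        q = quantize_to_grid(W[t], grid)
        err = W[t] - q                      # rounding error of row t
        W[t] = q
        if t + 1 < n:
            X_rest = X[:, t + 1:]
            G = X_rest.T @ X_rest + damp * np.eye(n - t - 1)
            # delta minimizes ||X_rest @ delta - outer(X[:, t], err)||_F^2
            # plus damp * ||delta||_F^2
            delta = np.linalg.solve(G, X_rest.T @ np.outer(X[:, t], err))
            W[t + 1:] += delta
    return W

def intrinsic_lora_sketch(X, W, L, grid, damp=1e-6):
    """Run the rounding pass on augmented matrices; return (Q, R)."""
    N, r = L.shape
    X_aug = np.hstack([X, X @ L])                      # m x (N + r)
    W_aug = np.vstack([W, np.zeros((r, W.shape[1]))])  # R starts at zero
    out = sequential_round(X_aug, W_aug, grid, n_steps=N, damp=damp)
    return out[:N], out[N:]

rng = np.random.default_rng(0)
m, N, N_out, r = 128, 12, 6, 3          # hypothetical sizes
X = rng.standard_normal((m, N))
W = rng.standard_normal((N, N_out))
grid = np.linspace(-2.0, 2.0, 9)        # hypothetical alphabet, step 0.5
_, _, Vt = np.linalg.svd(X, full_matrices=False)
L = Vt[:r].T                            # L = V_r
Q, R = intrinsic_lora_sketch(X, W, L, grid)
err = np.linalg.norm(X @ W - X @ (Q + L @ R), "fro") ** 2
print(f"reconstruction error: {err:.4f}")
```

Only the first N rows are rounded, so the last r rows stay in full precision and collect the diffused error, which is the role the caption assigns to R.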

If this is right

  • The combined quantization-plus-low-rank representation can be realized in one training-free pass rather than sequential quantization followed by separate low-rank compensation.
  • Layer-wise error becomes controlled by the rank-r residual of the calibration matrix instead of its full Frobenius norm.
  • Alternating Bid-Up refinement with low-rank compensation produces a sequence of approximations with non-increasing reconstruction error.
  • Empirical gains appear on both language models (Qwen3) and vision transformers (DeiT) relative to plain GPTQ and to GPTQ followed by post-hoc low-rank correction.
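
The monotonicity property in the third bullet can be illustrated with a generic alternating loop. The sketch below is not Bid-Up, whose details this page does not give. As a stand-in it re-rounds W - LR to a fixed grid and accepts the result only if the error drops, then recomputes the least-squares R. The grid, sizes, and function names are hypothetical.

```python
import numpy as np

def quantize_to_grid(x, grid):
    """Round each entry of x to the nearest element of a finite grid."""
    x = np.asarray(x, dtype=float)
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)
    return grid[idx]

def recon_error(X, W, Q, L, R):
    """Layer-wise objective ||XW - X(Q + LR)||_F^2."""
    return float(np.linalg.norm(X @ (W - Q - L @ R), "fro") ** 2)

def best_R(X, W, Q, L):
    """Least-squares optimal R for fixed Q and L."""
    R, *_ = np.linalg.lstsq(X @ L, X @ (W - Q), rcond=None)
    return R

def alternate(X, W, L, grid, n_rounds=5):
    """Alternate a guarded re-rounding step with optimal low-rank compensation."""
    Q = quantize_to_grid(W, grid)
    R = best_R(X, W, Q, L)
    errs = [recon_error(X, W, Q, L, R)]
    for _ in range(n_rounds):
        Q_try = quantize_to_grid(W - L @ R, grid)
        if recon_error(X, W, Q_try, L, R) < errs[-1]:  # accept only improvements
            Q = Q_try
        R = best_R(X, W, Q, L)                         # cannot increase the error
        errs.append(recon_error(X, W, Q, L, R))
    return Q, R, errs

rng = np.random.default_rng(1)
m, N, N_out, r = 128, 16, 8, 4          # hypothetical sizes
X = rng.standard_normal((m, N))
W = rng.standard_normal((N, N_out))
grid = np.linspace(-3.0, 3.0, 13)       # hypothetical alphabet, step 0.5
_, _, Vt = np.linalg.svd(X, full_matrices=False)
L = Vt[:r].T
Q, R, errs = alternate(X, W, L, grid)
print([round(e, 4) for e in errs])
```

Each round either keeps Q or replaces it with a strictly better candidate, and the least-squares step is optimal for the current Q, so the recorded errors cannot go up.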

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The Hessian-augmentation technique could be ported to other quantization routines that rely on second-order statistics, potentially extending the residual-norm scaling benefit beyond the GPTQ family.
  • If the structural assumptions hold across typical pretrained weights, then low-bit representations with modest-rank corrections are information-theoretically near-optimal for many network layers.
  • The explicit dependence on the top singular vectors of X suggests that pre-computing a low-rank basis from calibration data may be sufficient for near-optimal performance without further adaptation.
  • Testing the tightness of the bounds on out-of-distribution calibration sets would reveal whether the residual-norm advantage persists when the structural assumptions are only approximately satisfied.

Load-bearing premise

Calibration data or weight matrices possess natural structural properties that let the derived error bounds scale with the residual norm rather than the full norm and thereby approach the information-theoretic limits.

What would settle it

On standard calibration sets, compute the observed layer-wise reconstruction error after applying GPTQ-intrinsic LoRA and check whether it exceeds the predicted residual-norm bound by more than the allowed constant and mild factors.
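
A per-layer diagnostic along these lines might look as follows. The page does not state the bound's constants or regularization terms, so the sketch only reports the observed error relative to ||X - X_r||_F^2 and ||X||_F^2 and makes no pass/fail decision. The function name and the synthetic inputs are hypothetical.

```python
import numpy as np

def error_to_norm_ratios(X, W, Q, L, R):
    """Observed layer error relative to the residual norm and the full norm of X."""
    r = L.shape[1]
    err = float(np.linalg.norm(X @ (W - Q - L @ R), "fro") ** 2)
    s = np.linalg.svd(X, compute_uv=False)
    resid = float(np.sum(s[r:] ** 2))   # ||X - X_r||_F^2
    full = float(np.sum(s ** 2))        # ||X||_F^2
    return {
        "error": err,
        "error/residual": err / resid if resid > 0 else float("inf"),
        "error/full": err / full,
    }

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))      # synthetic calibration matrix
W = rng.standard_normal((10, 5))
Q = np.round(W)                         # toy integer grid (assumption)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
L = Vt[:3].T
R = L.T @ (W - Q)                       # simple post-hoc compensation
ratios = error_to_norm_ratios(X, W, Q, L, R)
print(ratios)
```

Tracking `error/residual` across layers and calibration sets would show whether it stays bounded as the paper's scaling suggests, once the actual constants are taken from the theorems.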

Figures

Figures reproduced from arXiv: 2606.01412 by Rayan Saab, Shihao Zhang.

Figure 1: Augmenting the data and weight matrices. Here $(q_t - w_t^{(t-1)})X_t$ is the error introduced by rounding the $t$-th entry. The update from $w^{(t-1)}_{\geq t+1}$ to $w^{(t)}_{\geq t+1}$ adjusts the remaining weights to minimize the resulting distortion in $\ell_2$, a process referred to as error diffusion in [87]. We denote the full state of the algorithm after step $t$ by $w^{(t)} = (q_{\leq t}, w^{(t)}_{\geq t+1}) \in \mathcal{A}^t \times \mathbb{R}^{N-t}$, with initialization $w^{(0)}$ …
Figure 2: GPTQ-intrinsic LoRA. Our method entails applying GPTQ to $W$ using the augmented Hessian $H = X^\top X$, together with the usual dampening term. However, we run GPTQ for only $N$ iterations, so that only the first $N$ rows of $W$ are quantized. The result is a quantized matrix $Q \in \mathcal{A}^{N \times N'}$, together with a full-precision matrix $R \in \mathbb{R}^{r \times N'}$. Since $R$ is initialized at zero and updated throughout the $N$ GPTQ steps, it absorbs …
Original abstract

Post-training quantization is widely used for compressing large neural networks, but aggressive low-bit quantization can significantly degrade model quality. A common remedy is to augment the quantized weights with a low-rank correction, leading to approximations of the form $W\approx Q+LR$. In this paper, we study this low-precision plus low-rank representation through the layer-wise reconstruction objective $\|XW-X(Q+LR)\|_F^2$, where $X$ is a calibration matrix. We establish, to our knowledge, the first information-theoretic lower bounds for this problem under finite-alphabet and bounded low-rank compensation constraints. We then propose GPTQ-intrinsic LoRA, a training-free algorithm that incorporates the low-rank correction directly into a GPTQ-style quantization pass by appropriately augmenting the calibration Hessian. For the choice $L=V_r$, where $V_r$ contains the top right singular vectors of $X$, we prove layer-wise reconstruction error bounds in which the usual GPTQ dependence on $\|X\|_F^2$ is replaced by the rank-$r$ residual $\|X-X_r\|_F^2$, up to regularization terms. Under natural structural assumptions, these bounds match the information-theoretic lower bounds in their dominant scaling, up to constants and mild factors. We also introduce Bid-Up, a fixed-grid quantization refinement step that can be alternated with optimal low-rank compensation with guaranteed non-increasing layer-wise reconstruction error. Experiments on Qwen3 language models and DeiT vision transformers show that GPTQ-intrinsic LoRA improves over GPTQ and GPTQ followed by low-rank compensation, with additional gains from refinement loops.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to establish the first information-theoretic lower bounds for the layer-wise reconstruction objective in low-precision quantization augmented with bounded low-rank compensation. It proposes the GPTQ-intrinsic LoRA algorithm, which augments the GPTQ Hessian to incorporate the low-rank term directly in a training-free manner. For the specific choice L = V_r (top right singular vectors of the calibration matrix X), it proves reconstruction error bounds in which the standard GPTQ ||X||_F^2 dependence is replaced by the rank-r residual ||X - X_r||_F^2 (up to regularization). Under unspecified natural structural assumptions, these upper bounds are claimed to match the lower bounds in dominant scaling. The paper also introduces the Bid-Up fixed-grid refinement procedure with non-increasing error guarantees and reports empirical gains over GPTQ and post-hoc low-rank compensation on Qwen3 and DeiT models.

Significance. If the stated bounds and their matching hold, the work would be a notable contribution to post-training quantization by supplying the first matching lower/upper bounds for the combined low-precision plus low-rank setting and a practical algorithm that achieves the improved residual-norm scaling without additional training. The explicit integration of the low-rank correction into the quantization pass via Hessian augmentation and the guaranteed monotonicity of the Bid-Up refinement are concrete strengths that could influence follow-on algorithm design.

major comments (1)
  1. [Abstract] The claim that the derived upper bounds match the information-theoretic lower bounds 'in their dominant scaling, up to constants and mild factors' under 'natural structural assumptions' is the load-bearing statement for the 'near-optimal' characterization, yet the abstract (and therefore the central theoretical contribution) provides no explicit definition or statement of those assumptions (e.g., incoherence, RIP, or spectral conditions on X or W). Without them the matching cannot be verified and the experimental calibration matrices cannot be checked against the proof hypotheses.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the need for greater clarity in the abstract. We address the single major comment below and will incorporate the suggested revision.

Point-by-point responses
  1. Referee: [Abstract] The claim that the derived upper bounds match the information-theoretic lower bounds 'in their dominant scaling, up to constants and mild factors' under 'natural structural assumptions' is the load-bearing statement for the 'near-optimal' characterization, yet the abstract (and therefore the central theoretical contribution) provides no explicit definition or statement of those assumptions (e.g., incoherence, RIP, or spectral conditions on X or W). Without them the matching cannot be verified and the experimental calibration matrices cannot be checked against the proof hypotheses.

    Authors: We agree that the abstract should explicitly reference the structural assumptions to make the matching claim verifiable. In the body of the manuscript (Section 3.2 and Theorem 3), the assumptions are stated as: (i) the calibration matrix X satisfies a bounded incoherence condition with parameter μ, and (ii) the weight matrix admits a spectral decay such that the residual after rank-r projection is controlled by the singular values. These are the 'natural structural assumptions' referenced. We will revise the abstract to include a concise parenthetical statement of these conditions (e.g., 'under bounded incoherence of X and spectral decay of W'). This change will allow readers to check the hypotheses against both the proofs and the experimental calibration matrices without altering any theorems or empirical results.

    Revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; no circular reductions identified

Full rationale

The paper derives information-theoretic lower bounds independently for the layer-wise reconstruction objective under finite-alphabet and bounded low-rank constraints. It then proposes the GPTQ-intrinsic LoRA algorithm that augments the Hessian and proves upper bounds for the specific choice L = V_r (top right singular vectors of X), replacing the usual ||X||_F^2 term with the residual ||X - X_r||_F^2. The claim that these bounds match in dominant scaling occurs only under separately stated 'natural structural assumptions' on the data or weights; these assumptions are not shown to be defined in terms of the target result or to reduce the upper bounds to a fitted quantity by construction. No self-citations, ansatzes smuggled via prior work, or renaming of known results appear as load-bearing steps in the abstract or described chain. The overall derivation therefore retains independent mathematical content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate concrete free parameters or axioms beyond the high-level reliance on a calibration matrix X and the structural assumptions needed for bound matching.

free parameters (1)
  • rank r
    Rank of the low-rank correction chosen from the singular vectors of the calibration matrix.
axioms (1)
  • domain assumption: Natural structural assumptions on calibration data or weights
    Invoked to ensure the derived bounds match information-theoretic lower bounds in dominant scaling.

pith-pipeline@v0.9.1-grok · 5832 in / 1360 out tokens · 33111 ms · 2026-06-28T17:26:14.275026+00:00 · methodology

Reference graph

Works this paper leans on

90 extracted references · 40 canonical work pages · 10 internal anchors

  [1] S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems, 37:100213–100240, 2024
  [2] A. Bailleul. Explicit kronecker–weyl theorems and applications to prime number races. Research in Number Theory, 8(3):43, 2022
  [3] D. Bertsimas, R. Cory-Wright, and N. A. Johnson. Sparse plus low rank matrix decomposition: A discrete optimization approach. Journal of Machine Learning Research, 24(267):1–51, 2023
  [4] E. J. Candès and Y. Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011
  [5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011
  [6] J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36:4396–4429, 2023
  [7] M. Chen, M. Wu, H. Jin, Z. Yuan, J. Liu, C. Zhang, Y. Li, J. Huang, J. Ma, Z. Xue, et al. Int vs fp: A comprehensive study of fine-grained low-bit quantization formats. arXiv preprint arXiv:2510.25602, 2025
  [8] P. Chen, H.-F. Yu, I. Dhillon, and C.-J. Hsieh. Drone: Data-aware low-rank compression for large nlp models. Advances in neural information processing systems, 34:29321–29334, 2021
  [9] Y. Chen, E. N. Epperly, J. A. Tropp, and R. J. Webber. Randomly pivoted cholesky: Practical approximation of a kernel matrix with few entry evaluations. arXiv preprint arXiv:2207.06503, 2022
  [10] Y. Cheng, D. Wang, P. Zhou, and T. Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017
  [11] R. Child. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019
  [12] Y. Cho, D. Jeon, S. Kim, M. Jeon, and A. No. Preserve-then-quantize: Balancing rank budgets for quantization error reconstruction in llms. arXiv preprint arXiv:2602.02001, 2026
  [13] J. Cook, J. Guo, G. Xiao, Y. Lin, and S. Han. Four over six: More accurate nvfp4 quantization with adaptive block scaling. arXiv preprint arXiv:2512.02010, 2025
  [14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009
  [15] L. Deng, G. Li, S. Han, L. Shi, and Y. Xie. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532, 2020
  [16] Y. Deng, A. Zhang, N. Wang, S. Gurses, Z. Yang, and P. Yin. Cloq: Enhancing fine-tuning of quantized llms via calibrated lora initialization. arXiv preprint arXiv:2501.18475, 2025
  [17] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078, 2023
  [18] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024
  [19] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936
  [20] V. Egiazarian, R. L. Castro, D. Kuznedelev, A. Panferov, E. Kurtic, S. Pandit, A. Marques, M. Kurtz, S. Ashkboos, T. Hoefler, et al. Bridging the gap between promise and performance for microscaling fp4 quantization. arXiv preprint arXiv:2509.23202, 2025
  [21] E. Frantar and D. Alistarh. Optimal brain compression: A framework for accurate post-training quantization and pruning. Advances in Neural Information Processing Systems, 35:4475–4488, 2022
  [22] E. Frantar and D. Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International conference on machine learning, pages 10323–10337. PMLR, 2023
  [23] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022
  [24] E. Frantar, U. Evci, W. Park, N. Houlsby, and D. Alistarh. Compression scaling laws: Unifying sparsity and quantization. arXiv preprint arXiv:2502.16440, 2025
  [25] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer. A survey of quantization methods for efficient neural network inference. In Low-power computer vision, pages 291–326. Chapman and Hall/CRC, 2022
  [26] H. Guo, P. Greengard, E. P. Xing, and Y. Kim. Lq-lora: Low-rank plus quantized matrix decomposition for efficient language model finetuning. arXiv preprint arXiv:2311.12023, 2023
  [27] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011
  [28] B. Hassibi and D. Stork. Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems, 5, 1992
  [29] B. Hassibi, D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pages 293–299. IEEE, 1993
  [30] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241):1–124, 2021
  [31] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021
  [32] I. Hubara, Y. Nahshan, Y. Hanani, R. Banner, and D. Soudry. Accurate post training quantization with small calibration sets. In International conference on machine learning, pages 4466–4475. PMLR, 2021
  [33] M. Huh, H. Mobahi, R. Zhang, B. Cheung, P. Agrawal, and P. Isola. The low-rank simplicity bias in deep networks. arXiv preprint arXiv:2103.10427, 2021
  [34] Y. Ikebe, T. Inagaki, and S. Miyamoto. The monotonicity theorem, cauchy’s interlace theorem, and the courant-fischer theorem. The American Mathematical Monthly, 94(4):352–354, 1987
  [35] IST-DASLab. gptq. https://github.com/ist-daslab/gptq, 2022
  [36] S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023
  [37] A. Kuzmin, M. Nagel, M. Van Baalen, A. Behboodi, and T. Blankevoort. Pruning vs quantization: Which is better? Advances in neural information processing systems, 36:62414–62427, 2023
  [38] Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989
  [39] M. Li, Y. Lin, Z. Zhang, T. Cai, X. Li, J. Guo, E. Xie, C. Meng, J.-Y. Zhu, and S. Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024
  [40] Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426, 2021
  [41] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, and T. Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models. arXiv preprint arXiv:2310.08659, 2023
  [42] Y. Li, D. Lee, R. Yin, and P. Panda. Optimal brain decomposition for accurate llm low-rank approximation. arXiv preprint arXiv:2604.00821, 2026
  [43] T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing, 461:370–403, 2021
  [44] B. Liao, C. Herold, S. Khadivi, and C. Monz. Apiq: Finetuning of 2-bit quantized large language model. arXiv preprint arXiv:2402.05147, 2024
  [45] S.-Y. Liu, M. Khadkevich, N. C. Fung, C. Sakr, C.-H. H. Yang, C.-Y. Wang, S. Muralidharan, H. Yin, K.-T. Cheng, J. Kautz, et al. Eora: Training-free compensation for compressed llm with eigenspace low-rank approximation. arXiv preprint arXiv:2410.21271, 2024
  [46] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024
  [47] Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort. Spinquant–llm quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024
  [48] E. Lybrand and R. Saab. A greedy algorithm for quantizing neural networks. Journal of Machine Learning Research, 22(156):1–38, 2021
  [49] A. Maalouf, I. Jubran, and D. Feldman. Fast and accurate least-mean-squares solvers. Advances in Neural Information Processing Systems, 32, 2019
  [50] A. Maalouf, I. Jubran, and D. Feldman. Fast and accurate least-mean-squares solvers for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9977–9994, 2022
  [51] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016
  [52] J. O. Neill. An overview of neural network compression. arXiv preprint arXiv:2006.03669, 2020
  [53] A. Panferov, A. Volkova, I.-V. Modoranu, V. Egiazarian, M. Safaryan, and D. Alistarh. Unified scaling laws for compressed representations. Advances in Neural Information Processing Systems, 38:151086–151112, 2026
  [54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019
  [55] R. Saha, N. Sagan, V. Srivastava, A. Goldsmith, and M. Pilanci. Compressing large language models using low rank and low precision decomposition. Advances in Neural Information Processing Systems, 37:88981–89018, 2024
  [56] I. G. Shevtsova. Sharpening of the upper bound of the absolute constant in the berry–esseen inequality. Theory of Probability & Its Applications, 51(3):549–553, 2007
  [57] Y. Takane and M. A. Hunter. Constrained principal component analysis: a comprehensive theory. Applicable Algebra in Engineering, Communication and Computing, 12:391–419, 2001
  [58] Y. Takane and H. Hwang. Regularized linear and kernel redundancy analysis. Computational Statistics & Data Analysis, 52(1):394–405, 2007
  [59] Y. Takane and S. Jung. Regularized partial and/or constrained redundancy analysis. Psychometrika, 73:671–690, 2008
  [60] J. Tanner and S. Vary. Compressed sensing of low-rank plus sparse matrices. Applied and Computational Harmonic Analysis, 64:254–293, 2023
  [61] J. M. Ten Berge. Least squares optimization in multivariate analysis. DSWO Press, Leiden University Leiden, 1993
  [62] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021
  [63] H. Touvron, M. Cord, and H. Jégou. Deit iii: Revenge of the vit. In European conference on computer vision, pages 516–533. Springer, 2022
  [64] J. A. Tropp and R. J. Webber. Randomized algorithms for low-rank matrix approximation: Design, analysis, and applications. arXiv preprint arXiv:2306.12418, 2023
  [65] R. Vershynin. High-dimensional probability. University of California, Irvine, 10:11, 2020
  [66] Y. Wang, H. Wang, and S. Q. Zhang. Qsvd: Efficient low-rank approximation for unified query-key-value weight compression in low-precision vision-language models. Advances in Neural Information Processing Systems, 38:1789–1820, 2026
  [67] L. Wei, Z. Ma, C. Yang, and Q. Yao. Advances in the neural network quantization: A comprehensive review. Applied Sciences, 14(17):7445, 2024
  [68] R. Wightman, H. Touvron, and M. Cordts. Resnet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476, 2021
  [69] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45, 2020
  [70] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International conference on machine learning, pages 38087–38099. PMLR, 2023
  [71] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, X. Zhang, and Q. Tian. Qa-lora: Quantization-aware low-rank adaptation of large language models. arXiv preprint arXiv:2309.14717, 2023
  [72] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
  [73] G. Yang, C. He, J. Guo, J. Wu, Y. Ding, A. Liu, H. Qin, P. Ji, and X. Liu. Llmcbench: Benchmarking large language model compression for efficient deployment. Advances in Neural Information Processing Systems, 37:87532–87544, 2024
  [74] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He. Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation. arXiv preprint arXiv:2303.08302, 2023
  [75] H. Yu and J. Wu. Compressing transformers: features are low-rank, but weights are not! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11007–11015, 2023
  [76] X. Yu, T. Liu, X. Wang, and D. Tao. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7370–7379, 2017
  [77] A. Zhang, N. Wang, Y. Deng, X. Li, Z. Yang, and P. Yin. Magr: Weight magnitude reduction for enhancing post-training quantization. arXiv preprint arXiv:2406.00800, 2024
  [78] C. Zhang, J. Cheng, G. A. Constantinides, and Y. Zhao. Lqer: Low-rank quantization error reconstruction for llms. arXiv preprint arXiv:2402.02446, 2024
  [79] C. Zhang, J. T. Wong, C. Xiao, G. A. Constantinides, and Y. Zhao. Qera: an analytical framework for quantization error reconstruction. arXiv preprint arXiv:2410.06040, 2024
  [80] H. Zhang, S. Zhang, I. Colbert, and R. Saab. Provable post-training quantization: Theoretical analysis of optq and qronos. arXiv preprint arXiv:2508.04853, 2025

Showing first 80 references.