GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

Rayan Saab; Shihao Zhang

arXiv: 2606.01412 · v1 · pith:L6UUG3GVnew · submitted 2026-05-31 · 💻 cs.LG · cs.IT · math.IT

Pith reviewed 2026-06-28 17:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT

keywords: post-training quantization, low-rank adaptation, GPTQ, information-theoretic bounds, layer-wise reconstruction, low-precision weights, Hessian augmentation

The pith

GPTQ-intrinsic LoRA folds low-rank corrections into the quantization pass to replace full-norm error dependence with a rank-r residual norm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve aggressive low-bit post-training quantization by pairing quantized weights with a low-rank correction of the form Q + LR while respecting finite-alphabet and bounded-rank constraints. It first derives information-theoretic lower bounds on the layer-wise reconstruction objective ||XW - X(Q + LR)||_F^2. The proposed algorithm augments the calibration Hessian so that the low-rank term is optimized jointly with the quantized weights inside a single GPTQ-style pass. When the low-rank factor is set to the top right singular vectors of the calibration matrix, the resulting error bounds scale with the residual norm ||X - X_r||_F^2 rather than the full ||X||_F^2, and these bounds are shown to match the lower bounds in dominant scaling under natural structural assumptions on the data or weights. The same framework also supplies a refinement step that alternates quantization and low-rank compensation while guaranteeing non-increasing error.
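The difference between the residual norm ||X - X_r||_F^2 and the full norm ||X||_F^2 can be checked numerically. The following is a minimal sketch, assuming a synthetic calibration matrix with a strong rank-r component; the function name `residual_and_full_energy` and all sizes are hypothetical and do not come from the paper.

```python
import numpy as np

def residual_and_full_energy(X, r):
    """Return (||X - X_r||_F^2, ||X||_F^2), with X_r the best rank-r approximation of X."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values, in descending order
    full = float(np.sum(s ** 2))             # squared Frobenius norm of X
    residual = float(np.sum(s[r:] ** 2))     # energy outside the top-r singular directions
    return residual, full

rng = np.random.default_rng(0)
m, N, r = 256, 32, 4
# Synthetic calibration matrix (an assumption): dominant rank-r part plus small noise.
low_rank = 5.0 * rng.standard_normal((m, r)) @ rng.standard_normal((r, N))
X = low_rank + 0.1 * rng.standard_normal((m, N))

residual, full = residual_and_full_energy(X, r)
print(f"residual / full = {residual / full:.2e}")
```

When the spectrum of X decays quickly, as in this synthetic case, the residual is a small fraction of the full norm, which is the regime in which a bound scaling with the residual is much tighter.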

Core claim

When L is set to the top right singular vectors V_r of the calibration matrix X and the low-rank compensation is incorporated directly through Hessian augmentation, the layer-wise reconstruction error admits provable bounds in which the usual GPTQ term proportional to ||X||_F^2 is replaced by a term proportional to the rank-r residual ||X - X_r||_F^2 (up to regularization). Under the paper's natural structural assumptions these bounds match the information-theoretic lower bounds in their leading scaling, up to constants and mild factors. The algorithm remains training-free and is accompanied by Bid-Up, a fixed-grid quantization refinement step that can be alternated with optimal low-rank compensation.

What carries the argument

GPTQ-intrinsic LoRA, the procedure that augments the calibration Hessian to embed the low-rank correction L = V_r inside the quantization pass itself.
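One way to read "augmenting the calibration Hessian", consistent with the Figure 2 caption on this page (a quantized block Q with N rows and a full-precision block R with r rows that starts at zero), is the block construction below. The tilde notation, the sample count m, and the construction itself are assumptions made for illustration; the source text here does not spell them out.

```latex
\[
\tilde{X} = \begin{bmatrix} X & XL \end{bmatrix} \in \mathbb{R}^{m \times (N+r)},
\qquad
\tilde{W} = \begin{bmatrix} W \\ 0 \end{bmatrix} \in \mathbb{R}^{(N+r) \times N'},
\qquad
\tilde{X}\tilde{W} = XW,
\]
\[
\tilde{X} \begin{bmatrix} Q \\ R \end{bmatrix} = XQ + XLR = X(Q + LR),
\qquad
\tilde{H} = \tilde{X}^\top \tilde{X}
= \begin{bmatrix} X^\top X & X^\top X L \\ L^\top X^\top X & L^\top X^\top X L \end{bmatrix}.
\]
```

Under this reading, quantizing only the first N rows of the augmented weights and leaving the last r rows in full precision yields exactly the Q + LR form in the reconstruction objective.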

If this is right

The combined quantization-plus-low-rank representation can be realized in one training-free pass rather than sequential quantization followed by separate low-rank compensation.
Layer-wise error becomes controlled by the rank-r residual of the calibration matrix instead of its full Frobenius norm.
Alternating Bid-Up refinement with low-rank compensation produces a sequence of approximations with non-increasing reconstruction error.
Empirical gains appear on both language models (Qwen3) and vision transformers (DeiT) relative to plain GPTQ and to GPTQ followed by post-hoc low-rank correction.
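The low-rank half of the alternating refinement can be sketched in isolation. For a fixed quantized matrix Q, a rank-r correction that minimizes the layer-wise error follows from truncating the SVD of the output-space error X(W - Q) (Eckart-Young) and mapping it back through the pseudoinverse of X. This is a minimal sketch under that standard argument; the function name is hypothetical, plain round-to-nearest stands in for the quantizer, and the Bid-Up step itself is not implemented because its details are not given on this page.

```python
import numpy as np

def best_low_rank_compensation(X, W, Q, r):
    """Return M with rank <= r minimizing ||X(W - Q) - X M||_F for fixed Q."""
    E = X @ (W - Q)                                    # error in the layer output
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    E_r = (U[:, :r] * s[:r]) @ Vt[:r]                  # best rank-r approximation of E
    # Columns of E_r lie in the column space of X, so X @ M reproduces E_r.
    return np.linalg.pinv(X) @ E_r

rng = np.random.default_rng(1)
m, N, Np, r = 64, 12, 10, 3
X = rng.standard_normal((m, N))
W = rng.standard_normal((N, Np))
Q = 0.5 * np.round(W / 0.5)                            # stand-in quantizer (assumption)

M = best_low_rank_compensation(X, W, Q, r)
before = np.linalg.norm(X @ (W - Q)) ** 2
after = np.linalg.norm(X @ (W - Q - M)) ** 2
print(before, after)
```

Because M = 0 is always feasible, this step cannot increase the error, which is the property a non-increasing alternating scheme needs from its low-rank half.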

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The Hessian-augmentation technique could be ported to other quantization routines that rely on second-order statistics, potentially extending the residual-norm scaling benefit beyond the GPTQ family.
If the structural assumptions hold across typical pretrained weights, then low-bit representations with modest-rank corrections are information-theoretically near-optimal for many network layers.
The explicit dependence on the top singular vectors of X suggests that pre-computing a low-rank basis from calibration data may be sufficient for near-optimal performance without further adaptation.
Testing the tightness of the bounds on out-of-distribution calibration sets would reveal whether the residual-norm advantage persists when the structural assumptions are only approximately satisfied.

Load-bearing premise

Calibration data or weight matrices possess natural structural properties that let the derived error bounds scale with the residual norm rather than the full norm and thereby approach the information-theoretic limits.

What would settle it

On standard calibration sets, compute the observed layer-wise reconstruction error after applying GPTQ-intrinsic LoRA and check whether it exceeds the predicted residual-norm bound by more than the allowed constant and mild factors.

Figures

Figures reproduced from arXiv: 2606.01412 by Rayan Saab, Shihao Zhang.

**Figure 1.** Augmenting the data and weight matrices. Here $(q_t - w_t^{(t-1)})X_t$ is the error introduced by rounding the $t$-th entry. The update from $w_{\ge t+1}^{(t-1)}$ to $w_{\ge t+1}^{(t)}$ adjusts the remaining weights to minimize the resulting distortion in $\ell_2$, a process referred to as error diffusion in [87]. We denote the full state of the algorithm after step $t$ by $w^{(t)} = (q_{\le t}, w_{\ge t+1}^{(t)}) \in \mathcal{A}^t \times \mathbb{R}^{N-t}$, with initialization $w^{(0)}$ …

**Figure 2.** GPTQ-intrinsic LoRA. Our method entails applying GPTQ to $W$ using the augmented Hessian $H = X^\top X$, together with the usual dampening term. However, we run GPTQ for only $N$ iterations, so that only the first $N$ rows of $W$ are quantized. The result is a quantized matrix $Q \in \mathcal{A}^{N \times N'}$, together with a full-precision matrix $R \in \mathbb{R}^{r \times N'}$. Since $R$ is initialized at zero and updated throughout the $N$ GPTQ steps, it absorbs …

Original abstract

Post-training quantization is widely used for compressing large neural networks, but aggressive low-bit quantization can significantly degrade model quality. A common remedy is to augment the quantized weights with a low-rank correction, leading to approximations of the form $W\approx Q+LR$. In this paper, we study this low-precision plus low-rank representation through the layer-wise reconstruction objective $\|XW-X(Q+LR)\|_F^2$, where $X$ is a calibration matrix. We establish, to our knowledge, the first information-theoretic lower bounds for this problem under finite-alphabet and bounded low-rank compensation constraints. We then propose GPTQ-intrinsic LoRA, a training-free algorithm that incorporates the low-rank correction directly into a GPTQ-style quantization pass by appropriately augmenting the calibration Hessian. For the choice $L=V_r$, where $V_r$ contains the top right singular vectors of $X$, we prove layer-wise reconstruction error bounds in which the usual GPTQ dependence on $\|X\|_F^2$ is replaced by the rank-$r$ residual $\|X-X_r\|_F^2$, up to regularization terms. Under natural structural assumptions, these bounds match the information-theoretic lower bounds in their dominant scaling, up to constants and mild factors. We also introduce Bid-Up, a fixed-grid quantization refinement step that can be alternated with optimal low-rank compensation with guaranteed non-increasing layer-wise reconstruction error. Experiments on Qwen3 language models and DeiT vision transformers show that GPTQ-intrinsic LoRA improves over GPTQ and GPTQ followed by low-rank compensation, with additional gains from refinement loops.
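The idea of folding the low-rank correction into a single GPTQ-style pass can be illustrated with a toy implementation. The sketch below assumes the augmentation stacks X L next to X and appends r zero rows to W, which is one reading of the Figure 2 caption rather than the authors' code. It uses a plain least-squares solve for error diffusion instead of the dampened Hessian inverse, and the names `round_to_grid` and `intrinsic_pass`, the grid, and all sizes are hypothetical.

```python
import numpy as np

def round_to_grid(x, step, levels):
    """Round to the nearest point of the grid {-levels*step, ..., levels*step}."""
    return step * np.clip(np.round(x / step), -levels, levels)

def intrinsic_pass(X, W, L, step=0.25, levels=7):
    """Toy GPTQ-style pass over the augmented problem [X, XL] and [W; 0].

    Only the first N rows are quantized. The trailing r rows stay in full
    precision and receive diffused rounding error. Returns (Q, R).
    """
    N, r = L.shape
    X_aug = np.hstack([X, X @ L])                      # m x (N + r)
    V = np.vstack([W, np.zeros((r, W.shape[1]))])      # running state, (N + r) x N'
    for t in range(N):
        q = round_to_grid(V[t], step, levels)
        # Change in the layer output caused by rounding row t.
        err = np.outer(X_aug[:, t], V[t] - q)          # m x N'
        # Error diffusion: least-squares update of all remaining rows.
        delta, *_ = np.linalg.lstsq(X_aug[:, t + 1:], err, rcond=None)
        V[t] = q
        V[t + 1:] += delta
    return V[:N], V[N:]

rng = np.random.default_rng(0)
m, N, Np, r = 128, 16, 8, 2
X = rng.standard_normal((m, N))
W = rng.standard_normal((N, Np))
_, _, Vt = np.linalg.svd(X, full_matrices=False)
L = Vt[:r].T                                           # top-r right singular vectors of X

Q, R = intrinsic_pass(X, W, L)
err = np.linalg.norm(X @ W - X @ (Q + L @ R)) ** 2
print(err)
```

The design point worth noting is that R is never fitted in a separate stage: it starts at zero and is updated by the same error-diffusion step that adjusts the not-yet-quantized rows of W.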

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's new lower bounds and GPTQ-integrated LoRA are worth checking, but the matching claim rests on unspecified structural assumptions that need explicit statement and verification.

The paper introduces information-theoretic lower bounds for layer-wise reconstruction under finite-alphabet quantization plus bounded low-rank correction, and it gives an algorithm that folds the low-rank term into the GPTQ Hessian update rather than applying it afterward. For the specific choice of low-rank basis as the top right singular vectors of the calibration matrix, the reconstruction bound replaces the usual ||X||_F^2 factor with the smaller residual ||X - X_r||_F^2, up to regularization. The authors also add a Bid-Up refinement loop that alternates quantization and low-rank steps with a non-increasing error guarantee.

The integration step is a clean technical move and the lower-bound effort is new for this constrained setting. Experiments on Qwen3 and DeiT models show gains over plain GPTQ and over sequential GPTQ followed by low-rank compensation, which suggests the approach is at least practically useful.

The main weakness is that the upper bounds match the lower bounds in dominant scaling only under "natural structural assumptions" that the abstract never defines. Without an explicit list of those conditions or a check that the calibration matrices used in the experiments satisfy them, the matching result stays conditional. The regularization terms also need to be controlled in the proofs; if they grow with rank or bit-width, the claimed improvement could shrink. The experimental section would be stronger if it reported whether the observed gains track the residual norm reduction or come from other factors.

This work is aimed at people building post-training quantization pipelines for large models who want tighter error control. It has enough new technical content and a practical algorithm that it deserves a serious referee, provided the assumptions are clarified and the proofs are checked in detail.

Referee Report

1 major / 0 minor

Summary. The paper claims to establish the first information-theoretic lower bounds for the layer-wise reconstruction objective in low-precision quantization augmented with bounded low-rank compensation. It proposes the GPTQ-intrinsic LoRA algorithm, which augments the GPTQ Hessian to incorporate the low-rank term directly in a training-free manner. For the specific choice L = V_r (top right singular vectors of the calibration matrix X), it proves reconstruction error bounds in which the standard GPTQ ||X||_F^2 dependence is replaced by the rank-r residual ||X - X_r||_F^2 (up to regularization). Under unspecified natural structural assumptions, these upper bounds are claimed to match the lower bounds in dominant scaling. The paper also introduces the Bid-Up fixed-grid refinement procedure with non-increasing error guarantees and reports empirical gains over GPTQ and post-hoc low-rank compensation on Qwen3 and DeiT models.

Significance. If the stated bounds and their matching hold, the work would be a notable contribution to post-training quantization by supplying the first matching lower/upper bounds for the combined low-precision plus low-rank setting and a practical algorithm that achieves the improved residual-norm scaling without additional training. The explicit integration of the low-rank correction into the quantization pass via Hessian augmentation and the guaranteed monotonicity of the Bid-Up refinement are concrete strengths that could influence follow-on algorithm design.

major comments (1)

[Abstract] The claim that the derived upper bounds match the information-theoretic lower bounds 'in their dominant scaling, up to constants and mild factors' under 'natural structural assumptions' is the load-bearing statement for the 'near-optimal' characterization, yet the abstract (and therefore the central theoretical contribution) provides no explicit definition or statement of those assumptions (e.g., incoherence, RIP, or spectral conditions on X or W). Without them the matching cannot be verified and the experimental calibration matrices cannot be checked against the proof hypotheses.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the need for greater clarity in the abstract. We address the single major comment below and will incorporate the suggested revision.

Referee: [Abstract] The claim that the derived upper bounds match the information-theoretic lower bounds 'in their dominant scaling, up to constants and mild factors' under 'natural structural assumptions' is the load-bearing statement for the 'near-optimal' characterization, yet the abstract (and therefore the central theoretical contribution) provides no explicit definition or statement of those assumptions (e.g., incoherence, RIP, or spectral conditions on X or W). Without them the matching cannot be verified and the experimental calibration matrices cannot be checked against the proof hypotheses.

Authors: We agree that the abstract should explicitly reference the structural assumptions to make the matching claim verifiable. In the body of the manuscript (Section 3.2 and Theorem 3), the assumptions are stated as: (i) the calibration matrix X satisfies a bounded incoherence condition with parameter μ, and (ii) the weight matrix admits a spectral decay such that the residual after rank-r projection is controlled by the singular values. These are the 'natural structural assumptions' referenced. We will revise the abstract to include a concise parenthetical statement of these conditions (e.g., 'under bounded incoherence of X and spectral decay of W'). This change will allow readers to check the hypotheses against both the proofs and the experimental calibration matrices without altering any theorems or empirical results. Revision: yes.

Circularity Check

0 steps flagged

Derivation chain is self-contained; no circular reductions identified

The paper derives information-theoretic lower bounds independently for the layer-wise reconstruction objective under finite-alphabet and bounded low-rank constraints. It then proposes the GPTQ-intrinsic LoRA algorithm that augments the Hessian and proves upper bounds for the specific choice L = V_r (top right singular vectors of X), replacing the usual ||X||_F^2 term with the residual ||X - X_r||_F^2. The claim that these bounds match in dominant scaling occurs only under separately stated 'natural structural assumptions' on the data or weights; these assumptions are not shown to be defined in terms of the target result or to reduce the upper bounds to a fitted quantity by construction. No self-citations, ansatzes smuggled via prior work, or renaming of known results appear as load-bearing steps in the abstract or described chain. The overall derivation therefore retains independent mathematical content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate concrete free parameters or axioms beyond the high-level reliance on a calibration matrix X and the structural assumptions needed for bound matching.

free parameters (1)

rank r
Rank of the low-rank correction chosen from the singular vectors of the calibration matrix.

axioms (1)

domain assumption: natural structural assumptions on calibration data or weights
Invoked to ensure the derived bounds match information-theoretic lower bounds in dominant scaling.

Reference graph

Works this paper leans on

90 extracted references · 40 canonical work pages · 10 internal anchors

[1] S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems, 37:100213–100240, 2024
[2] A. Bailleul. Explicit kronecker–weyl theorems and applications to prime number races. Research in Number Theory, 8(3):43, 2022
[3] D. Bertsimas, R. Cory-Wright, and N. A. Johnson. Sparse plus low rank matrix decomposition: A discrete optimization approach. Journal of Machine Learning Research, 24(267):1–51, 2023
[4] E. J. Candes and Y. Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011
[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011
[6] J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36:4396–4429, 2023
[7] M. Chen, M. Wu, H. Jin, Z. Yuan, J. Liu, C. Zhang, Y. Li, J. Huang, J. Ma, Z. Xue, et al. Int vs fp: A comprehensive study of fine-grained low-bit quantization formats. arXiv preprint arXiv:2510.25602, 2025
[8] P. Chen, H.-F. Yu, I. Dhillon, and C.-J. Hsieh. Drone: Data-aware low-rank compression for large nlp models. Advances in neural information processing systems, 34:29321–29334, 2021
[9] Y. Chen, E. N. Epperly, J. A. Tropp, and R. J. Webber. Randomly pivoted cholesky: Practical approximation of a kernel matrix with few entry evaluations. arXiv preprint arXiv:2207.06503, 2022
[10] Y. Cheng, D. Wang, P. Zhou, and T. Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017
[11] R. Child. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019
[12] Y. Cho, D. Jeon, S. Kim, M. Jeon, and A. No. Preserve-then-quantize: Balancing rank budgets for quantization error reconstruction in llms. arXiv preprint arXiv:2602.02001, 2026
[13] J. Cook, J. Guo, G. Xiao, Y. Lin, and S. Han. Four over six: More accurate nvfp4 quantization with adaptive block scaling. arXiv preprint arXiv:2512.02010, 2025
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009
[15] L. Deng, G. Li, S. Han, L. Shi, and Y. Xie. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532, 2020
[16] Y. Deng, A. Zhang, N. Wang, S. Gurses, Z. Yang, and P. Yin. Cloq: Enhancing fine-tuning of quantized llms via calibrated lora initialization. arXiv preprint arXiv:2501.18475, 2025
[17] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078, 2023
[18] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024
[19] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936
[20] V. Egiazarian, R. L. Castro, D. Kuznedelev, A. Panferov, E. Kurtic, S. Pandit, A. Marques, M. Kurtz, S. Ashkboos, T. Hoefler, et al. Bridging the gap between promise and performance for microscaling fp4 quantization. arXiv preprint arXiv:2509.23202, 2025
[21] E. Frantar and D. Alistarh. Optimal brain compression: A framework for accurate post-training quantization and pruning. Advances in Neural Information Processing Systems, 35:4475–4488, 2022
[22] E. Frantar and D. Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International conference on machine learning, pages 10323–10337. PMLR, 2023
[23] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022
[24] E. Frantar, U. Evci, W. Park, N. Houlsby, and D. Alistarh. Compression scaling laws: Unifying sparsity and quantization. arXiv preprint arXiv:2502.16440, 2025
[25] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer. A survey of quantization methods for efficient neural network inference. In Low-power computer vision, pages 291–326. Chapman and Hall/CRC, 2022
[26] H. Guo, P. Greengard, E. P. Xing, and Y. Kim. Lq-lora: Low-rank plus quantized matrix decomposition for efficient language model finetuning. arXiv preprint arXiv:2311.12023, 2023
[27] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011
[28] B. Hassibi and D. Stork. Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems, 5, 1992
[29] B. Hassibi, D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pages 293–299. IEEE, 1993
[30] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241):1–124, 2021
[31] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021
[32] I. Hubara, Y. Nahshan, Y. Hanani, R. Banner, and D. Soudry. Accurate post training quantization with small calibration sets. In International conference on machine learning, pages 4466–4475. PMLR, 2021
[33] M. Huh, H. Mobahi, R. Zhang, B. Cheung, P. Agrawal, and P. Isola. The low-rank simplicity bias in deep networks. arXiv preprint arXiv:2103.10427, 2021
[34] Y. Ikebe, T. Inagaki, and S. Miyamoto. The monotonicity theorem, cauchy’s interlace theorem, and the courant-fischer theorem. The American Mathematical Monthly, 94(4):352–354, 1987
[35] IST-DASLab. gptq. https://github.com/ist-daslab/gptq, 2022
[36] S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023
[37] A. Kuzmin, M. Nagel, M. Van Baalen, A. Behboodi, and T. Blankevoort. Pruning vs quantization: Which is better? Advances in neural information processing systems, 36:62414–62427, 2023
[38] Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989
[39] M. Li, Y. Lin, Z. Zhang, T. Cai, X. Li, J. Guo, E. Xie, C. Meng, J.-Y. Zhu, and S. Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024
[40] Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426, 2021
[41] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, and T. Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models. arXiv preprint arXiv:2310.08659, 2023
[42] Y. Li, D. Lee, R. Yin, and P. Panda. Optimal brain decomposition for accurate llm low-rank approximation. arXiv preprint arXiv:2604.00821, 2026
[43] T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing, 461:370–403, 2021
[44] B. Liao, C. Herold, S. Khadivi, and C. Monz. Apiq: Finetuning of 2-bit quantized large language model. arXiv preprint arXiv:2402.05147, 2024
[45] S.-Y. Liu, M. Khadkevich, N. C. Fung, C. Sakr, C.-H. H. Yang, C.-Y. Wang, S. Muralidharan, H. Yin, K.-T. Cheng, J. Kautz, et al. Eora: Training-free compensation for compressed llm with eigenspace low-rank approximation. arXiv preprint arXiv:2410.21271, 2024
[46] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024
[47] Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort. Spinquant–llm quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024
[48] E. Lybrand and R. Saab. A greedy algorithm for quantizing neural networks. Journal of Machine Learning Research, 22(156):1–38, 2021
[49] A. Maalouf, I. Jubran, and D. Feldman. Fast and accurate least-mean-squares solvers. Advances in Neural Information Processing Systems, 32, 2019
[50] A. Maalouf, I. Jubran, and D. Feldman. Fast and accurate least-mean-squares solvers for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9977–9994, 2022
[51] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016
[52] J. O. Neill. An overview of neural network compression. arXiv preprint arXiv:2006.03669, 2020
[53] A. Panferov, A. Volkova, I.-V. Modoranu, V. Egiazarian, M. Safaryan, and D. Alistarh. Unified scaling laws for compressed representations. Advances in Neural Information Processing Systems, 38:151086–151112, 2026
[54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019
[55] R. Saha, N. Sagan, V. Srivastava, A. Goldsmith, and M. Pilanci. Compressing large language models using low rank and low precision decomposition. Advances in Neural Information Processing Systems, 37:88981–89018, 2024
[56] I. G. Shevtsova. Sharpening of the upper bound of the absolute constant in the berry–esseen inequality. Theory of Probability & Its Applications, 51(3):549–553, 2007
[57] Y. Takane and M. A. Hunter. Constrained principal component analysis: a comprehensive theory. Applicable Algebra in Engineering, Communication and Computing, 12:391–419, 2001
[58] Y. Takane and H. Hwang. Regularized linear and kernel redundancy analysis. Computational Statistics & Data Analysis, 52(1):394–405, 2007
[59] Y. Takane and S. Jung. Regularized partial and/or constrained redundancy analysis. Psychometrika, 73:671–690, 2008
[60] J. Tanner and S. Vary. Compressed sensing of low-rank plus sparse matrices. Applied and Computational Harmonic Analysis, 64:254–293, 2023
[61] J. M. Ten Berge. Least squares optimization in multivariate analysis. DSWO Press, Leiden University Leiden, 1993
[62] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021
[63] H. Touvron, M. Cord, and H. Jégou. Deit iii: Revenge of the vit. In European conference on computer vision, pages 516–533. Springer, 2022
[64] J. A. Tropp and R. J. Webber. Randomized algorithms for low-rank matrix approximation: Design, analysis, and applications. arXiv preprint arXiv:2306.12418, 2023
[65] R. Vershynin. High-dimensional probability. University of California, Irvine, 10:11, 2020
[66] Y. Wang, H. Wang, and S. Q. Zhang. Qsvd: Efficient low-rank approximation for unified query-key-value weight compression in low-precision vision-language models. Advances in Neural Information Processing Systems, 38:1789–1820, 2026
[67] L. Wei, Z. Ma, C. Yang, and Q. Yao. Advances in the neural network quantization: A comprehensive review. Applied Sciences, 14(17):7445, 2024
[68] R. Wightman, H. Touvron, and M. Cordts. Resnet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476, 2021
[69] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45, 2020
[70] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International conference on machine learning, pages 38087–38099. PMLR, 2023
[71]

Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, X. Zhang, and Q. Tian. Qa-lora: Quantization-aware low-rank adaptation of large language models.arXiv preprint arXiv:2309.14717, 2023

work page arXiv 2023
[72]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[73]

G. Yang, C. He, J. Guo, J. Wu, Y. Ding, A. Liu, H. Qin, P. Ji, and X. Liu. Llmcbench: Benchmarking large language model compression for efficient deployment.Advances in Neural Information Processing Systems, 37:87532–87544, 2024

2024
[74]

Z. Yao, X. Wu, C. Li, S. Youn, and Y. He. Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation.arXiv preprint arXiv:2303.08302, 2023

work page arXiv 2023
[75]

Yu and J

H. Yu and J. Wu. Compressing transformers: features are low-rank, but weights are not! InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11007–11015, 2023. 26

2023
[76]

X. Yu, T. Liu, X. Wang, and D. Tao. On compressing deep models by low rank and sparse decomposition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7370–7379, 2017

2017
[77]

Zhang, N

A. Zhang, N. Wang, Y. Deng, X. Li, Z. Yang, and P. Yin. Magr: Weight magnitude reduction for enhancing post-training quantization.arXiv preprint arXiv:2406.00800, 2024

work page arXiv 2024
[78]

Zhang, J

C. Zhang, J. Cheng, G. A. Constantinides, and Y. Zhao. Lqer: Low-rank quantization error reconstruction for llms.arXiv preprint arXiv:2402.02446, 2024

work page arXiv 2024
[79]

Zhang, J

C. Zhang, J. T. Wong, C. Xiao, G. A. Constantinides, and Y. Zhao. Qera: an analytical framework for quantization error reconstruction.arXiv preprint arXiv:2410.06040, 2024

[80]

H. Zhang, S. Zhang, I. Colbert, and R. Saab. Provable post-training quantization: Theoretical analysis of optq and qronos. arXiv preprint arXiv:2508.04853, 2025.

