pith. machine review for the scientific record.

arxiv: 2604.26378 · v1 · submitted 2026-04-29 · 💻 cs.LG

Recognition: unknown

CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs

Zhe Ding, Su Pan, Duowei Pan

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 13:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords post-training quantization · mixed-precision · large language models · subspace projection · weighted PCA · weight-activation joint modeling · output error minimization

The pith

Jointly modeling weight and activation noise yields better high-precision subspaces for mixed-precision LLM quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that prior mixed-precision methods for quantizing large language models select high-precision subspaces using only activation statistics, which overlooks how weight quantization noise also perturbs the output of linear layers. CoQuant instead models the expected output error as driven by additive noise from both sources and derives a closed-form solution that selects the subspace minimizing this error. The solution takes the form of a weighted principal component analysis whose weighting balances the covariances of activations and weights. If correct, this produces subspaces that reduce overall quantization error more effectively, leading to lower perplexity and higher task accuracy when applied to models such as Llama-3.2 and Qwen2.5.

Core claim

CoQuant expresses the output perturbation in a linear layer as the sum of terms arising from weight quantization noise and activation quantization noise. It then minimizes the expected squared error with respect to the choice of high-precision subspace, yielding a closed-form weighted PCA whose weighting matrix incorporates both the activation covariance and the weight covariance.
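
In symbols (notation ours, not necessarily the paper's), a minimal version of that error model for a linear layer y = Wx, with weight noise ΔW and activation noise Δx assumed zero-mean and mutually independent and the second-order cross term dropped, reads:

    \Delta y \;\approx\; \Delta W\,x + W\,\Delta x,
    \qquad
    \mathbb{E}\,\lVert \Delta y \rVert^{2}
    \;\approx\;
    \operatorname{tr}\!\big(\mathbb{E}[\Delta W^{\top}\Delta W]\,\Sigma_{x}\big)
    + \operatorname{tr}\!\big(W\,\Sigma_{\Delta x}\,W^{\top}\big),
    \qquad
    \Sigma_{x} = \mathbb{E}[x x^{\top}].

Minimizing the unprotected part of this error over an orthonormal basis P for the high-precision subspace becomes a trace maximization over a matrix assembled from both second-moment terms, and the standard eigenvalue characterization of trace maxima is what gives the closed-form, weighted-PCA solution described above.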

What carries the argument

The closed-form weighted PCA derived from the modeled expected output error, which balances activation and weight covariances to select the optimal high-precision subspace.
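
A minimal numerical sketch of that kind of selection, assuming a simple additive mixing of the two second moments (the function names, the alpha/beta weights, and the exact mixing rule are our illustration, not the paper's formula):

    import numpy as np

    def joint_subspace(W, X, k, alpha=1.0, beta=1.0):
        """W: (d_out, d_in) weights; X: (n, d_in) calibration activations."""
        sigma_x = X.T @ X / X.shape[0]        # activation second moment, (d_in, d_in)
        sigma_w = W.T @ W / W.shape[0]        # weight second moment,     (d_in, d_in)
        M = alpha * sigma_x + beta * sigma_w  # joint weighting matrix
        _, eigvecs = np.linalg.eigh(M)        # eigenvalues in ascending order
        return eigvecs[:, -k:]                # top-k directions span the high-precision subspace

    def activation_only_subspace(X, k):
        """Baseline criterion: ordinary PCA on activations alone."""
        sigma_x = X.T @ X / X.shape[0]
        _, eigvecs = np.linalg.eigh(sigma_x)
        return eigvecs[:, -k:]

Relative to the activation-only baseline, a direction with modest activation energy but large weight energy can still rank near the top, which is the behavioral difference the reported gains are attributed to.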

If this is right

  • Mixed-precision quantization with CoQuant subspaces produces lower WikiText perplexity than activation-only baselines on Llama-3.2 and Qwen2.5.
  • The same subspaces improve accuracy on zero-shot common-sense reasoning tasks relative to prior methods.
  • The joint covariance weighting provides a principled criterion for deciding which output dimensions to retain in high precision under ultra-low bit constraints.
  • The approach respects the linear structure of matrix multiplications by incorporating noise from both operands rather than one (a rough sketch of how such a subspace could be applied is given just after this list).
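
As a rough sketch of how a selected subspace P might be applied at inference, one can keep the projected component of the computation in high precision and quantize only the residual path; the per-tensor symmetric fake quantizer and the 3-bit default below are illustrative assumptions, not the paper's kernel or bit allocation:

    import numpy as np

    def fake_quant(t, bits):
        # Symmetric, per-tensor uniform fake quantization (illustrative only).
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(t).max() / qmax + 1e-12
        return np.round(t / scale).clip(-qmax, qmax) * scale

    def mixed_precision_linear(W, x, P, low_bits=3):
        # P: (d_in, k) orthonormal basis of the high-precision subspace.
        x_hi = P @ (P.T @ x)   # component kept in high precision
        x_lo = x - x_hi        # residual component, sent down the low-bit path
        return W @ x_hi + fake_quant(W, low_bits) @ fake_quant(x_lo, low_bits)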

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same error-modeling approach could be applied to linear layers in non-transformer architectures where activation and weight statistics are similarly accessible.
  • If higher-order interactions inside transformer blocks prove significant, the current linear approximation may need augmentation with additional terms.
  • The closed-form solution enables fast per-layer computation without requiring iterative optimization loops.

Load-bearing premise

Quantization noise from weights and activations can be treated as additive perturbations whose covariances fully determine output error through a linear model.

What would settle it

Direct measurement of actual output error in a quantized transformer block versus the error predicted by the linear covariance model; substantial divergence would indicate the subspace selection is not optimal.
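
One way to run that check at the single-layer level, as a sketch rather than the paper's protocol: compare the measured output MSE of a fake-quantized linear layer against the additive second-moment prediction, using the classical uniform rounding-noise variance of step²/12; the function names and the per-tensor quantizer are our assumptions:

    import numpy as np

    def fake_quant(t, bits):
        # Per-tensor symmetric fake quantization, as in the earlier sketch.
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(t).max() / qmax + 1e-12
        return np.round(t / scale).clip(-qmax, qmax) * scale

    def predicted_mse(W, X, w_bits, a_bits):
        # Additive model with iid zero-mean rounding noise on weights and activations:
        # E||dW x||^2 ~ var(dW) * d_out * tr(Sigma_x),  E||W dx||^2 ~ var(dx) * ||W||_F^2.
        def noise_var(t, bits):
            step = np.abs(t).max() / (2 ** (bits - 1) - 1)
            return step ** 2 / 12.0
        sigma_x = X.T @ X / X.shape[0]
        return (noise_var(W, w_bits) * W.shape[0] * np.trace(sigma_x)
                + noise_var(X, a_bits) * np.linalg.norm(W) ** 2)

    def measured_mse(W, X, w_bits, a_bits):
        Y = X @ W.T
        Y_q = fake_quant(X, a_bits) @ fake_quant(W, w_bits).T
        return float(np.mean(np.sum((Y - Y_q) ** 2, axis=1)))

This only covers one linear layer; propagating the same comparison through a full transformer block, as the note above asks, is the stronger version of the test.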

Figures

Figures reproduced from arXiv: 2604.26378 by Duowei Pan, Su Pan, Zhe Ding.

Figure 1: Subspace selection under different statistical …
Figure 2: Isolated layer-wise quantization error analysis on the …
Figure 3: Overview of CoQuant. Given the input acti…
Original abstract

Post-training quantization (PTQ) has become an important technique for reducing the inference cost of Large Language Models (LLMs). While recent mixed-precision methods improve ultra-low bit quantization by preserving critical subspaces in high precision, they typically construct these subspaces relying solely on activation statistics. This ignores the fundamental nature of linear operations, where the output perturbation is jointly driven by both activation and weight quantization noise. In this paper, we propose CoQuant, a joint weight-activation subspace projection method. By theoretically modeling the expected output error, CoQuant formulates a closed-form weighted PCA solution that balances activation and weight covariances to select the optimal high-precision subspace. Extensive experiments on Llama-3.2 and Qwen2.5 models show that CoQuant consistently outperforms strong PTQ baselines in both WikiText perplexity and zero-shot common-sense reasoning accuracy. These results demonstrate that joint weight-activation subspace modeling provides a principled and effective direction for low-bit LLM quantization. The source code is available at https://github.com/Zachary5895/CoQuant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CoQuant, a mixed-precision post-training quantization method for LLMs that selects high-precision subspaces via a joint weight-activation projection. It claims to derive this via theoretical modeling of expected output error under additive quantization noise, yielding a closed-form weighted PCA solution that balances weight and activation covariances; experiments on Llama-3.2 and Qwen2.5 models report consistent gains in WikiText perplexity and zero-shot accuracy over PTQ baselines, with code released.

Significance. If the derivation and linear error model are valid, the work supplies a principled, covariance-driven alternative to activation-only subspace methods, potentially improving ultra-low-bit LLM efficiency. The public code release supports reproducibility and is a clear strength.

major comments (2)
  1. §3 (theoretical derivation of expected output error): the closed-form weighted PCA is obtained by modeling quantization noise as zero-mean additive perturbations whose second-order statistics (weight and activation covariances) suffice to rank subspaces for minimal output error. This linear propagation assumption is load-bearing for the optimality claim, yet the manuscript provides no explicit error bounds or analysis of how the model extends through non-linear blocks (SwiGLU, RMSNorm, softmax).
  2. Experimental section (results on Llama-3.2/Qwen2.5): while outperformance is reported, there is no ablation isolating the contribution of the joint covariance term versus activation-only PCA, nor any sensitivity analysis to the subspace dimension or bit allocation; without these, it is unclear whether gains are attributable to the claimed theoretical formulation.
minor comments (2)
  1. Notation for the weighted PCA objective could be clarified with an explicit equation number when first introduced, to aid readers tracing the closed-form solution.
  2. The abstract claims consistent outperformance but does not specify the exact bit-width configurations (e.g., average bits per weight and activation) used in the main tables.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas to strengthen the manuscript. We provide point-by-point responses below and commit to revisions where appropriate.

Point-by-point responses
  1. Referee: §3 (theoretical derivation of expected output error): the closed-form weighted PCA is obtained by modeling quantization noise as zero-mean additive perturbations whose second-order statistics (weight and activation covariances) suffice to rank subspaces for minimal output error. This linear propagation assumption is load-bearing for the optimality claim, yet the manuscript provides no explicit error bounds or analysis of how the model extends through non-linear blocks (SwiGLU, RMSNorm, softmax).

    Authors: The derivation models the expected output error using a linear approximation for the propagation of quantization noise, which allows deriving the closed-form weighted PCA from the quadratic error expression involving weight and activation covariances. This is a standard approach in quantization literature for tractability, though we recognize it is an approximation in non-linear networks. The optimality is with respect to this modeled error. To address the concern, we will revise §3 to include a discussion of the linear assumption's validity, potential error bounds under small noise, and how the projection interacts with subsequent non-linear operations like SwiGLU and RMSNorm. We will also include a brief empirical study showing the approximation's accuracy on sample layers. revision: yes

  2. Referee: Experimental section (results on Llama-3.2/Qwen2.5): while outperformance is reported, there is no ablation isolating the contribution of the joint covariance term versus activation-only PCA, nor any sensitivity analysis to the subspace dimension or bit allocation; without these, it is unclear whether gains are attributable to the claimed theoretical formulation.

    Authors: We agree that isolating the joint term's contribution is important to validate the theoretical claim. We will add an ablation study comparing CoQuant against a variant using only activation covariances (activation-only PCA) on the same models and tasks. Additionally, we will include sensitivity analyses varying the subspace dimension (e.g., 10-50% of channels) and different bit allocation strategies, reporting perplexity and accuracy metrics. These additions will clarify that the performance gains arise from the joint modeling as per the derivation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the closed-form derivation from the error model is independent of the benchmarks used to evaluate it.

Full rationale

The paper derives its weighted PCA subspace selection directly from a first-principles model of expected output error under additive weight/activation perturbations, yielding a closed-form solution that balances the two covariance matrices. No step reduces the result to a fitted parameter defined by the target, a self-citation chain, or an ansatz smuggled from prior work. The linear error propagation assumption is an explicit modeling choice (not a hidden tautology), and the final subspace ranking follows mathematically from that model without circular redefinition. This is the most common honest non-finding for a derivation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a domain assumption about additive quantization noise in linear layers; no free parameters or invented entities are introduced beyond standard PTQ setup.

axioms (1)
  • domain assumption: The output perturbation is jointly driven by both activation and weight quantization noise in linear operations.
    This premise is invoked to motivate moving from activation-only to joint covariance modeling.



    and em- pirical practice confirm that the error variance is overwhelmingly dominated by the exponential de- cay of the quantization bit-width. Thus, we provide a stable proxy by defining the relative error coeffi- cients for both activations and weights in the k-th subspace strictly based on their assigned bit-width Nk: α2 k =β 2 k = 1 (2Nk−1 −1) 2 .(20) ...