Zero-Shot Quantization via Weight-Space Arithmetic
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-13 20:03 UTC · model grok-4.3
The pith
A direction in weight space extracted from one model transfers robustness to post-training quantization to other models zero-shot.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. They extract this direction, called the quantization vector, from a donor task by simple weight-space arithmetic and demonstrate that adding it to a receiver model improves post-PTQ Top-1 accuracy by up to 60 points in a 3-bit setting, without any receiver-side quantization-aware training. The approach is zero-shot with respect to the receiver, needing no training data from its task. The result holds across four ViT scales and 22 classification tasks, even when donor and receiver tasks differ markedly. The paper also proves that the extracted vectors are well-defined and invariant under reparameterization symmetries of the network.
What carries the argument
The quantization vector: a direction in weight space obtained by simple arithmetic (subtraction or averaging) between a quantized and a full-precision version of a donor model. It encodes PTQ robustness and can be added to the weights of any receiver.
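As a hedged illustration of that arithmetic, a minimal numpy sketch follows. The checkpoint names, the uniform quantizer, and the scaling coefficient lam are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def quantize(w, bits=3):
    # Illustrative stand-in for PTQ: uniform symmetric quantization.
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

# Hypothetical donor checkpoints: full-precision and quantization-robust (e.g. QAT) weights.
theta_donor_fp = np.random.randn(4, 4)
theta_donor_qat = theta_donor_fp + 0.05 * np.random.randn(4, 4)

# Quantization vector: a simple weight-space difference on the donor.
rho_donor = theta_donor_qat - theta_donor_fp

# Zero-shot patch: add the donor direction to an unrelated receiver, then quantize.
theta_receiver = np.random.randn(4, 4)
lam = 1.0  # assumed scaling; the paper's choice of scale is not reproduced here
theta_patched = quantize(theta_receiver + lam * rho_donor, bits=3)
```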
If this is right
- Low-bit deployment of new models can be performed without collecting receiver task data or running quantization-aware training.
- A single donor vector can improve multiple receiver models across different image-classification tasks.
- Quantization robustness can be isolated as an additive component in weight space rather than being entangled with task-specific features.
- The method scales across vision-transformer sizes without requiring per-model retraining.
- The extracted direction is provably invariant under common reparameterization symmetries of the network.
Where Pith is reading between the lines
- The same arithmetic approach might isolate other model properties such as robustness to adversarial examples or to distribution shift.
- Combinations of multiple extracted vectors could produce models with several independent robustness properties at once.
- The geometric account supplied in the paper suggests that similar directions may exist for other compression or efficiency axes beyond quantization.
- If the vector proves stable under further fine-tuning, it could serve as a reusable module in model libraries.
Load-bearing premise
The vector that encodes quantization robustness extracted from a donor remains effective and meaningful when added to receiver models whose tasks and training distributions differ substantially from the donor.
What would settle it
Adding the extracted quantization vector to a receiver model and observing no improvement or a drop in Top-1 accuracy after 3-bit post-training quantization on a task markedly different from the donor task.
Original abstract
We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve post-PTQ Top-1 accuracy by up to 60 points in a 3-bit setting, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. Across four ViT scales and 22 image classification tasks, donor quantization vectors often yield substantial gains even when donor and receiver tasks differ markedly. We further prove rigorously that quantization vectors are well-defined and do not suffer from reparameterization symmetries, and provide a local geometric account of their effect. Together, these results suggest that quantization robustness can be partially isolated, reused, and transferred through simple weight-space algebra.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that robustness to post-training quantization (PTQ) is a transferable direction in weight space, isolated as a 'quantization vector' via simple arithmetic on a donor model. This vector can be added to a receiver ViT to improve post-PTQ Top-1 accuracy by up to 60 points in 3-bit settings across four ViT scales and 22 image classification tasks, without any receiver-side data or QAT. The work further asserts a rigorous proof that such vectors are well-defined and invariant to reparameterization symmetries, plus a local geometric account of their effect, positioning the approach as a zero-shot alternative to QAT.
Significance. If the central claims hold, the result would be significant for efficient deployment of large vision models: it offers a training-free, low-cost way to boost PTQ robustness with large reported gains even across dissimilar tasks, potentially reducing reliance on expensive quantization-aware training while providing a geometric interpretation of quantization sensitivity in weight space.
major comments (3)
- [Abstract / Proof section] Abstract and proof section: The claim of a rigorous proof that quantization vectors are well-defined and invariant under reparameterization symmetries must explicitly address ViT-specific symmetries (e.g., per-layer scaling and attention-head permutations), as these are load-bearing for the transferability assertion; without this detail the invariance guarantee remains unverified for the reported cross-task setting.
- [Experiments] Experiments across 22 tasks: The up-to-60-point gains in 3-bit PTQ require an explicit metric or analysis of task dissimilarity (e.g., how 'markedly different' donor-receiver pairs were selected) to confirm that the extracted vector isolates a general PTQ-robustness direction rather than mixing in task-specific components; absent this, the zero-shot claim across dissimilar tasks is not fully supported.
- [Results / Experiments] Transfer results: The reported improvements should include controls for post-hoc selection effects and error bars or statistical significance across the 22 tasks and four ViT scales, given that the abstract highlights substantial gains without detailing variance or failure cases on markedly different task pairs.
minor comments (2)
- [Methods] Notation for the quantization vector extraction arithmetic should be clarified with an explicit equation or pseudocode early in the methods to avoid ambiguity in how donor weights are combined.
- [Figures] Figure captions for any weight-space visualizations or accuracy plots should include axis labels, scale details, and the exact bit-width settings used in the 3-bit experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our invariance proof and strengthen the empirical support for cross-task transfer. We address each major point below and will incorporate the suggested additions in the revised manuscript.
Point-by-point responses
-
Referee: [Abstract / Proof section] Abstract and proof section: The claim of a rigorous proof that quantization vectors are well-defined and invariant under reparameterization symmetries must explicitly address ViT-specific symmetries (e.g., per-layer scaling and attention-head permutations), as these are load-bearing for the transferability assertion; without this detail the invariance guarantee remains unverified for the reported cross-task setting.
Authors: We agree that an explicit mapping to ViT symmetries improves clarity. Our proof (Section 4.2 and Appendix B) shows that the quantization vector is invariant under any reparameterization of the form W' = A W B + C for invertible A, B. Per-layer scaling corresponds to diagonal A/B and attention-head permutations to permutation matrices; both are special cases covered by the general argument. In the revision we will insert a dedicated paragraph in the proof section that explicitly instantiates these ViT operations and confirms the vector is unchanged, thereby verifying the guarantee for the cross-task experiments. revision: yes
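As a minimal instantiation of this argument, assuming the quantization vector is the difference ρ = θ_QAT − θ and both endpoints are reparameterized consistently (the general statement is the paper's, in Section 4.2 and Appendix B):

\[
\theta' = A\,\theta\,B + C,\qquad
\theta'_{\mathrm{QAT}} = A\,\theta_{\mathrm{QAT}}\,B + C
\;\Longrightarrow\;
\rho' = \theta'_{\mathrm{QAT}} - \theta' = A\,\rho\,B .
\]

The offset C cancels and the vector transforms covariantly with the same A, B. Per-layer scaling is the diagonal case A = diag(a_1, ..., a_n), B = diag(b_1, ..., b_m); an attention-head permutation is A = P_π, B = P_π^⊤, for which ρ' = P_π ρ P_π^⊤ is the same vector up to a relabelling of heads.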
-
Referee: [Experiments] Experiments across 22 tasks: The up-to-60-point gains in 3-bit PTQ require an explicit metric or analysis of task dissimilarity (e.g., how 'markedly different' donor-receiver pairs were selected) to confirm that the extracted vector isolates a general PTQ-robustness direction rather than mixing in task-specific components; absent this, the zero-shot claim across dissimilar tasks is not fully supported.
Authors: We selected the 22 tasks from standard benchmarks (ImageNet-1k subsets, CIFAR, fine-grained datasets) to span both similar and markedly different domains. To address the concern quantitatively, the revision will add a task-dissimilarity metric based on the cosine similarity of class-prototype embeddings between donor and receiver. We will report average gains stratified by similarity bins and show that substantial improvements persist for low-similarity pairs, thereby supporting that the vector captures a general PTQ-robustness direction. revision: yes
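One plausible implementation of the proposed metric is sketched below; the nearest-prototype matching across the two label spaces is an assumption, since the rebuttal does not specify how prototypes are paired.

```python
import numpy as np

def class_prototypes(features, labels):
    # Mean embedding per class (class-prototype embeddings).
    classes = np.unique(labels)
    return np.stack([features[labels == c].mean(axis=0) for c in classes])

def task_similarity(protos_donor, protos_receiver):
    # Mean cosine similarity between each donor prototype and its
    # closest receiver prototype; higher means more similar tasks.
    pd = protos_donor / np.linalg.norm(protos_donor, axis=1, keepdims=True)
    pr = protos_receiver / np.linalg.norm(protos_receiver, axis=1, keepdims=True)
    return float((pd @ pr.T).max(axis=1).mean())
```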
-
Referee: [Results / Experiments] Transfer results: The reported improvements should include controls for post-hoc selection effects and error bars or statistical significance across the 22 tasks and four ViT scales, given that the abstract highlights substantial gains without detailing variance or failure cases on markedly different task pairs.
Authors: We will add error bars computed over five random seeds for the PTQ process and report paired t-test p-values across the 22 tasks and four ViT scales. A new table will list per-task improvements, explicitly marking any failure cases. As a control for post-hoc selection, we will include results for adding a random vector of identical norm; the comparison will show that only the learned quantization vector produces the reported gains. revision: yes
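A minimal sketch of the norm-matched random-vector control, assuming a Gaussian direction (the authors' exact construction may differ):

```python
import numpy as np

def random_control(rho_donor, seed=0):
    # Random direction with the same Euclidean norm as the donor
    # quantization vector, used as a post-hoc-selection control.
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(rho_donor.shape)
    return noise * (np.linalg.norm(rho_donor) / np.linalg.norm(noise))
```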
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper defines the quantization vector via explicit weight-space arithmetic (donor full-precision weights minus quantized weights, or equivalent linear combination) and transfers it by addition to receivers. This operation is a direct algebraic construction independent of the reported accuracy gains, with no fitted parameters renamed as predictions and no self-referential definitions. The claimed rigorous proof of invariance under reparameterization symmetries is presented as an internal mathematical argument rather than a self-citation or imported ansatz. Across the 22-task evaluation the central result remains externally falsifiable via measured Top-1 deltas, satisfying the criteria for a self-contained derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Quantization robustness corresponds to a well-defined linear direction in weight space that is invariant under reparameterization symmetries.
invented entities (1)
- quantization vector: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Cited passage: quantization vector ρ_D := θ_{D,QAT} − θ_D ... optimal scale λ⋆ = (ρ_D⊤ H_R ρ_R) / (ρ_D⊤ H_R ρ_D) ... cos²_{H_R}(ρ_D, ρ_R)
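A hedged reconstruction of these formulas, assuming a local quadratic model of the receiver loss with Hessian H_R and receiver quantization vector ρ_R (the paper's own derivation may differ in detail): choosing the scale λ that best matches λρ_D to ρ_R in the H_R metric gives

\[
\lambda^{\star}
= \arg\min_{\lambda}\,(\lambda\rho_D - \rho_R)^{\top} H_R\,(\lambda\rho_D - \rho_R)
= \frac{\rho_D^{\top} H_R\,\rho_R}{\rho_D^{\top} H_R\,\rho_D},
\qquad
\cos^{2}_{H_R}(\rho_D,\rho_R)
= \frac{(\rho_D^{\top} H_R\,\rho_R)^{2}}{(\rho_D^{\top} H_R\,\rho_D)\,(\rho_R^{\top} H_R\,\rho_R)},
\]

where the curvature-weighted alignment cos²_{H_R} measures how much of the receiver's ideal correction the donor direction can supply.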
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- [2] Albert Catalan-Tatjer, Niccolò Ajroldi, and Jonas Geiping. Training dynamics impact post-training quantization robustness. arXiv preprint arXiv:2510.06213.
- [3] Jing Liu, Jianfei Cai, and Bohan Zhuang. Sharpness-aware quantization for deep neural networks. arXiv preprint arXiv:2111.12273.
- [4] Kai Liu, Qian Zheng, Kaiwen Tao, Zhiteng Li, Haotong Qin, Wenbo Li, Yong Guo, Xianglong Liu, Linghe Kong, Guihai Chen, Yulun Zhang, and Xiaokang Yang. Low-bit model quantization for deep neural networks: A survey. arXiv preprint arXiv:2505.05530.
- [5] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295.
- [6] Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, et al. TorchAO: PyTorch-native training-to-serving model optimization. arXiv preprint arXiv:2507.16099.
- [7] Soroush Tabesh, Mher Safaryan, Andrei Panferov, Alexandra Volkova, and Dan Alistarh. CAGE: Curvature-aware gradient estimation for accurate quantization-aware training. arXiv preprint arXiv:2510.18784.
- [8] Yujia Tong, Jingling Yuan, Tian Zhang, Jianquan Liu, and Chuang Hu. DFQ-ViT: Data-free quantization for vision transformers without fine-tuning. arXiv preprint arXiv:2507.14481.
- [9] Zheng Wang, Juncheng B. Li, Shuhui Qu, Florian Metze, and Emma Strubell. SQuAT: Sharpness- and quantization-aware training for BERT. arXiv preprint arXiv:2210.07171.
- [10] Prateek Yadav, Derek Tam, Leshem Choshen, Colin A. Raffel, and Mohit Bansal. TIES-Merging: Resolving interference when merging models. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023).