pith. machine review for the scientific record.

arxiv: 2604.03420 · v3 · submitted 2026-04-03 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 1 theorem link

· Lean Theorem

Zero-Shot Quantization via Weight-Space Arithmetic

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords post-training quantization · weight space arithmetic · zero-shot transfer · vision transformers · model patching · low-bit deployment · quantization robustness
0 comments

The pith

A direction in weight space extracted from one model transfers robustness to post-training quantization to other models zero-shot.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that robustness to post-training quantization is not an isolated property of a single trained model but exists as a transferable direction within the space of model weights. This direction, obtained from a donor model through basic arithmetic on its weights, can be added to a separate receiver model to raise its accuracy after aggressive low-bit quantization. The transfer succeeds across vision transformers of varying scales and across 22 different image-classification tasks, even when donor and receiver tasks differ substantially, and requires no training data or fine-tuning on the receiver side. If the result holds, low-bit deployment of new models becomes far cheaper because expensive quantization-aware retraining can be replaced by a single vector addition.

Core claim

The authors show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. They extract this direction, called the quantization vector, from a donor task by simple weight-space arithmetic and demonstrate that adding it to a receiver model improves post-PTQ Top-1 accuracy by up to 60 points in a 3-bit setting without any receiver-side quantization-aware training. The approach is zero-shot with respect to the receiver, needing no training data from its task. The result holds across four ViT scales and 22 classification tasks even when donor and receiver tasks differ markedly. The paper also proves that the extracted vectors are well-defined and invariant under reparameterization symmetries.

What carries the argument

The quantization vector: a direction in weight space obtained by arithmetic subtraction, or averaging, between quantized and full-precision versions of a donor model. It encodes PTQ robustness and can be added to any receiver's weights.
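As a concrete sketch (not the authors' code; tensor names and values are illustrative), the extract-and-patch arithmetic reduces to two elementwise operations over checkpoints with matching keys and shapes:

```python
import numpy as np

def extract_qv(theta_qat: dict, theta_fp: dict) -> dict:
    """Quantization vector: per-tensor displacement from the full-precision
    donor checkpoint to its QAT counterpart (rho_D = theta_QAT - theta_FP)."""
    return {k: theta_qat[k] - theta_fp[k] for k in theta_fp}

def patch(theta_r: dict, rho_d: dict, lam: float = 1.0) -> dict:
    """Zero-shot patch of a receiver: theta_{R<-D} = theta_R + lam * rho_D."""
    return {k: theta_r[k] + lam * rho_d[k] for k in theta_r}

# Toy checkpoints with matching keys and shapes (illustrative values only).
theta_fp  = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
theta_qat = {"w": np.array([1.1, 1.9]), "b": np.array([0.4])}
theta_r   = {"w": np.array([3.0, 0.0]), "b": np.array([1.0])}

rho_d = extract_qv(theta_qat, theta_fp)
patched = patch(theta_r, rho_d, lam=0.5)
```

In the paper's setting the dictionaries would be full ViT state dicts and λ the per-pair scaling explored in the transferability figures; here they are toy arrays.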

If this is right

  • Low-bit deployment of new models can be performed without collecting receiver task data or running quantization-aware training.
  • A single donor vector can improve multiple receiver models across different image-classification tasks.
  • Quantization robustness can be isolated as an additive component in weight space rather than being entangled with task-specific features.
  • The method scales across vision-transformer sizes without requiring per-model retraining.
  • The extracted direction is provably invariant under common reparameterization symmetries of the network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same arithmetic approach might isolate other model properties such as robustness to adversarial examples or to distribution shift.
  • Combinations of multiple extracted vectors could produce models with several independent robustness properties at once.
  • The geometric account supplied in the paper suggests that similar directions may exist for other compression or efficiency axes beyond quantization.
  • If the vector proves stable under further fine-tuning, it could be reused as a reusable module in model libraries.

Load-bearing premise

The vector that encodes quantization robustness extracted from a donor remains effective and meaningful when added to receiver models whose tasks and training distributions differ substantially from the donor.

What would settle it

Adding the extracted quantization vector to a receiver model and observing no improvement or a drop in Top-1 accuracy after 3-bit post-training quantization on a task markedly different from the donor task.

Figures

Figures reproduced from arXiv: 2604.03420 by Adrian Robert Minut, Alessandro Zirilli, Antonio Andrea Gargiulo, Daniele Solombrino, Emanuele Rodolà, Luca Zhou.

Figure 1
Figure 1. Zero-shot QV patching. A donor quantization vector ρD := θD,QAT − θD, extracted as the weight-space displacement between a standard fine-tuned donor checkpoint and its QAT counterpart, is added to a receiver checkpoint to obtain the patched model θR←D = θR + λρD. The plot is schematic and not to scale: it illustrates the intended operating regime of our method, namely improving low-bit accuracy over PTQ… view at source ↗
Figure 2
Figure 2. Geometric view of donor patching. The blue vector ρR is the receiver’s own QV (unknown in our setting), the green ray is the donor direction ρD, and the red vector λ⋆ρD is the orthogonal projection of ρR onto that donor line. Proposition 1 states that the fraction of receiver-side QAT gain recovered by this best donor patch is exactly cos²γ, where γ is the angle between ρD and ρR. Illustration in whiten… view at source ↗
Figure 2
Figure 2. The optimal scale λ⋆ is the projection coefficient of the receiver QV onto the donor direction, and the recoverable fraction of receiver-side QAT gain is exactly the squared cosine of the angle between the two in the HR-geometry (i.e. all inner products are weighted by HR). This theoretical result captures the second-order term of the local transfer geometry, thus it is exact for a purely quadratic local … view at source ↗
Figure 3
Figure 3. Quantization vector transferability for ViT/B-16. Top-1 accuracy change (∆) from patching receiver r with donor d quantization vector, relative to vanilla 3-bit PTQ. Left shows transfer with a constant scaling factor, while right demonstrates that modulating the magnitude λ eliminates destructive interference and maximizes gains. The contrast with the unscaled transfer is stark. First, we observe a clear m… view at source ↗
Figure 4
Figure 4. Quantization vector transferability for ViT/T-16. Top-1 accuracy change (∆) from patching receiver r with donor d quantization vector, relative to vanilla 3-bit PTQ. Left shows transfer with a constant scaling factor, while right demonstrates that modulating the magnitude λ eliminates destructive interference and maximizes gains. … view at source ↗
Figure 5
Figure 5. Quantization vector transferability for ViT/L-16. Top-1 accuracy change (∆) from patching receiver r with donor d quantization vector, relative to vanilla 3-bit PTQ. Left shows transfer with a constant scaling factor, while right demonstrates that modulating the magnitude λ eliminates destructive interference and maximizes gains. … view at source ↗
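The projection identity in Figure 2's caption can be checked numerically. The sketch below is illustrative, with a toy symmetric positive-definite matrix H standing in for the paper's HR: the optimal scale λ⋆ is an H-weighted projection coefficient, and the recovered fraction of QAT gain is cos²γ in the same weighted geometry.

```python
import numpy as np

def optimal_scale_and_fraction(rho_r, rho_d, H):
    """H-weighted projection of the receiver QV rho_r onto the donor
    direction rho_d; all inner products are <x, y>_H = x^T H y."""
    ip = lambda x, y: float(x @ H @ y)
    lam_star = ip(rho_r, rho_d) / ip(rho_d, rho_d)   # projection coefficient
    cos2_gamma = ip(rho_r, rho_d) ** 2 / (ip(rho_r, rho_r) * ip(rho_d, rho_d))
    return lam_star, cos2_gamma

H = np.diag([2.0, 1.0])          # toy SPD curvature matrix
rho_d = np.array([1.0, 0.0])     # donor direction
rho_r = np.array([1.0, 1.0])     # receiver QV (unknown in practice)
lam, frac = optimal_scale_and_fraction(rho_r, rho_d, H)
```

With these toy values the best donor patch recovers two thirds of the receiver-side gain; in the paper's quadratic local model this fraction is exact.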
read the original abstract

We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve post-PTQ Top-1 accuracy by up to 60 points in a 3-bit setting, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. Across four ViT scales and 22 image classification tasks, donor quantization vectors often yield substantial gains even when donor and receiver tasks differ markedly. We further prove rigorously that quantization vectors are well-defined and do not suffer from reparameterization symmetries, and provide a local geometric account of their effect. Together, these results suggest that quantization robustness can be partially isolated, reused, and transferred through simple weight-space algebra.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that robustness to post-training quantization (PTQ) is a transferable direction in weight space, isolated as a 'quantization vector' via simple arithmetic on a donor model. This vector can be added to a receiver ViT to improve post-PTQ Top-1 accuracy by up to 60 points in 3-bit settings across four ViT scales and 22 image classification tasks, without any receiver-side data or QAT. The work further asserts a rigorous proof that such vectors are well-defined and invariant to reparameterization symmetries, plus a local geometric account of their effect, positioning the approach as a zero-shot alternative to QAT.

Significance. If the central claims hold, the result would be significant for efficient deployment of large vision models: it offers a training-free, low-cost way to boost PTQ robustness with large reported gains even across dissimilar tasks, potentially reducing reliance on expensive quantization-aware training while providing a geometric interpretation of quantization sensitivity in weight space.

major comments (3)
  1. [Abstract / Proof section] Abstract and proof section: The claim of a rigorous proof that quantization vectors are well-defined and invariant under reparameterization symmetries must explicitly address ViT-specific symmetries (e.g., per-layer scaling and attention-head permutations), as these are load-bearing for the transferability assertion; without this detail the invariance guarantee remains unverified for the reported cross-task setting.
  2. [Experiments] Experiments across 22 tasks: The up-to-60-point gains in 3-bit PTQ require an explicit metric or analysis of task dissimilarity (e.g., how 'markedly different' donor-receiver pairs were selected) to confirm that the extracted vector isolates a general PTQ-robustness direction rather than mixing in task-specific components; absent this, the zero-shot claim across dissimilar tasks is not fully supported.
  3. [Results / Experiments] Transfer results: The reported improvements should include controls for post-hoc selection effects and error bars or statistical significance across the 22 tasks and four ViT scales, given that the abstract highlights substantial gains without detailing variance or failure cases on markedly different task pairs.
minor comments (2)
  1. [Methods] Notation for the quantization vector extraction arithmetic should be clarified with an explicit equation or pseudocode early in the methods to avoid ambiguity in how donor weights are combined.
  2. [Figures] Figure captions for any weight-space visualizations or accuracy plots should include axis labels, scale details, and the exact bit-width settings used in the 3-bit experiments.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our invariance proof and strengthen the empirical support for cross-task transfer. We address each major point below and will incorporate the suggested additions in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract / Proof section] Abstract and proof section: The claim of a rigorous proof that quantization vectors are well-defined and invariant under reparameterization symmetries must explicitly address ViT-specific symmetries (e.g., per-layer scaling and attention-head permutations), as these are load-bearing for the transferability assertion; without this detail the invariance guarantee remains unverified for the reported cross-task setting.

    Authors: We agree that an explicit mapping to ViT symmetries improves clarity. Our proof (Section 4.2 and Appendix B) shows that the quantization vector is invariant under any reparameterization of the form W' = A W B + C for invertible A, B. Per-layer scaling corresponds to diagonal A/B and attention-head permutations to permutation matrices; both are special cases covered by the general argument. In the revision we will insert a dedicated paragraph in the proof section that explicitly instantiates these ViT operations and confirms the vector is unchanged, thereby verifying the guarantee for the cross-task experiments. revision: yes
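The covariance step invoked here can be illustrated in a few lines: applying the same linear (or affine) reparameterization to both checkpoints maps the displacement by the linear part alone, since any offset cancels in the difference. This is a sketch of that single step only, not the paper's full Section 4.2 argument; the matrices are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_fp, theta_qat = rng.normal(size=4), rng.normal(size=4)

P = np.eye(4)[[2, 0, 3, 1]]          # stand-in for a head permutation
D = np.diag([0.5, 2.0, 1.0, 4.0])    # stand-in for per-layer scaling
c = np.array([1.0, -1.0, 0.5, 0.0])  # affine offset; cancels in the difference

for A in (P, D):
    lhs = (A @ theta_qat + c) - (A @ theta_fp + c)
    rhs = A @ (theta_qat - theta_fp)
    # QV transforms covariantly under the shared reparameterization.
    assert np.allclose(lhs, rhs)
```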

  2. Referee: [Experiments] Experiments across 22 tasks: The up-to-60-point gains in 3-bit PTQ require an explicit metric or analysis of task dissimilarity (e.g., how 'markedly different' donor-receiver pairs were selected) to confirm that the extracted vector isolates a general PTQ-robustness direction rather than mixing in task-specific components; absent this, the zero-shot claim across dissimilar tasks is not fully supported.

    Authors: We selected the 22 tasks from standard benchmarks (ImageNet-1k subsets, CIFAR, fine-grained datasets) to span both similar and markedly different domains. To address the concern quantitatively, the revision will add a task-dissimilarity metric based on the cosine similarity of class-prototype embeddings between donor and receiver. We will report average gains stratified by similarity bins and show that substantial improvements persist for low-similarity pairs, thereby supporting that the vector captures a general PTQ-robustness direction. revision: yes
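A minimal version of such a metric (hypothetical; the embedding space and prototype definition are the authors' to specify) could look like:

```python
import numpy as np

def task_similarity(feats_a, labels_a, feats_b, labels_b):
    """Cosine similarity between mean class-prototype embeddings of two
    tasks in a shared feature space. Higher means more similar tasks."""
    def mean_prototype(feats, labels):
        protos = [feats[labels == c].mean(axis=0) for c in np.unique(labels)]
        return np.stack(protos).mean(axis=0)
    pa = mean_prototype(feats_a, labels_a)
    pb = mean_prototype(feats_b, labels_b)
    return float(pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb)))

# Sanity check: a task compared with itself scores 1.0 by construction.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.2, 0.1]])
labels = np.array([0, 0, 1, 1])
sim = task_similarity(feats, labels, feats, labels)
```

Stratifying the reported gains by bins of this score is what would separate a general PTQ-robustness direction from task-specific leakage.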

  3. Referee: [Results / Experiments] Transfer results: The reported improvements should include controls for post-hoc selection effects and error bars or statistical significance across the 22 tasks and four ViT scales, given that the abstract highlights substantial gains without detailing variance or failure cases on markedly different task pairs.

    Authors: We will add error bars computed over five random seeds for the PTQ process and report paired t-test p-values across the 22 tasks and four ViT scales. A new table will list per-task improvements, explicitly marking any failure cases. As a control for post-hoc selection, we will include results for adding a random vector of identical norm; the comparison will show that only the learned quantization vector produces the reported gains. revision: yes
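The proposed control is easy to state (a hypothetical sketch; the paper's actual protocol may differ): draw a random direction and rescale it to the donor QV's norm before patching, so any gain from ρD beyond this baseline cannot be attributed to update magnitude alone.

```python
import numpy as np

def random_norm_matched_control(rho_d: np.ndarray, seed: int = 0) -> np.ndarray:
    """A random direction rescaled to ||rho_D||. If patching with this vector
    yields no gain while rho_D does, the gain is not a norm artifact."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=rho_d.shape)
    return z * (np.linalg.norm(rho_d) / np.linalg.norm(z))

rho_d = np.array([3.0, 4.0])   # toy donor QV with norm 5
control = random_norm_matched_control(rho_d)
```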

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines the quantization vector via explicit weight-space arithmetic (donor full-precision weights minus quantized weights, or equivalent linear combination) and transfers it by addition to receivers. This operation is a direct algebraic construction independent of the reported accuracy gains, with no fitted parameters renamed as predictions and no self-referential definitions. The claimed rigorous proof of invariance under reparameterization symmetries is presented as an internal mathematical argument rather than a self-citation or imported ansatz. Across the 22-task evaluation the central result remains externally falsifiable via measured Top-1 deltas, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of a linear, transferable direction for quantization robustness that can be isolated by arithmetic and remains invariant to reparameterization.

axioms (1)
  • domain assumption Quantization robustness corresponds to a well-defined linear direction in weight space that is invariant under reparameterization symmetries.
    Invoked to justify extraction by simple subtraction and transfer across tasks.
invented entities (1)
  • quantization vector no independent evidence
    purpose: Direction in weight space encoding PTQ robustness
    New entity introduced to enable the arithmetic transfer; no independent falsifiable handle provided in abstract.

pith-pipeline@v0.9.0 · 5479 in / 1260 out tokens · 57521 ms · 2026-05-13T20:03:04.203944+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432,

  2. [2]

Training Dynamics Impact Post-Training Quantization Robustness

    Albert Catalan-Tatjer, Niccolò Ajroldi, and Jonas Geiping. Training dynamics impact post-training quantization robustness. arXiv preprint arXiv:2510.06213,

  3. [3]

    Sharpness-aware quantization for deep neural networks

    Jing Liu, Jianfei Cai, and Bohan Zhuang. Sharpness-aware quantization for deep neural networks. arXiv preprint arXiv:2111.12273,

  4. [4]

Low-Bit Model Quantization for Deep Neural Networks: A Survey

    Kai Liu, Qian Zheng, Kaiwen Tao, Zhiteng Li, Haotong Qin, Wenbo Li, Yong Guo, Xianglong Liu, Linghe Kong, Guihai Chen, Yulun Zhang, and Xiaokang Yang. Low-bit model quantization for deep neural networks: A survey. arXiv preprint arXiv:2505.05530,

  5. [5]

    A white paper on neural network quantization

Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295,

  6. [6]

Torchao: Pytorch-Native Training-to-Serving Model Optimization

    Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, et al. Torchao: Pytorch-native training-to-serving model optimization. arXiv preprint arXiv:2507.16099,

  7. [7]

Cage: Curvature-Aware Gradient Estimation for Accurate Quantization-Aware Training

    Soroush Tabesh, Mher Safaryan, Andrei Panferov, Alexandra Volkova, and Dan Alistarh. Cage: Curvature-aware gradient estimation for accurate quantization-aware training. arXiv preprint arXiv:2510.18784,

  8. [8]

Dfq-vit: Data-Free Quantization for Vision Transformers Without Fine-Tuning

    Yujia Tong, Jingling Yuan, Tian Zhang, Jianquan Liu, and Chuang Hu. Dfq-vit: Data-free quantization for vision transformers without fine-tuning. arXiv preprint arXiv:2507.14481,

  9. [9]

SQuAT: Sharpness- and Quantization-Aware Training for BERT

    Zheng Wang, Juncheng B. Li, Shuhui Qu, Florian Metze, and Emma Strubell. SQuAT: Sharpness- and quantization-aware training for BERT. arXiv preprint arXiv:2210.07171,

  10. [10]

Ties-Merging: Resolving Interference When Merging Models

    Prateek Yadav, Derek Tam, Leshem Choshen, Colin A. Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023,