Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy
Pith reviewed 2026-05-21 22:20 UTC · model grok-4.3
The pith
Quantization of vision-language models can improve accuracy, calibration, OOD detection, and noise robustness at the same time, by dampening high-rank spectral components and shifting reliance to low-rank features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Quantization dampens high-rank spectral components in VLMs, compelling the model to rely more heavily on robust low-rank features; this spectral filtering drives simultaneous gains in accuracy, calibration, OOD detection, and noise robustness, though not in handling covariate shift or spurious correlations.
What carries the argument
Spectral filtering effect of quantization, which suppresses high-rank components and redirects the model toward stable low-rank features.
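As a rough illustration of how such a spectral comparison could be measured, the sketch below quantizes a weight matrix and compares singular values index by index. The per-tensor symmetric uniform quantizer, the random stand-in matrix, and the names `quantize_symmetric` and `spectrum_ratio` are assumptions made for illustration, not the paper's setup. Whether the tail of the spectrum is actually dampened depends on the real weights and the quantization scheme, so no expected output is claimed.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Per-tensor symmetric uniform quantization, returned in dequantized float form.

    Assumption: this simple quantizer stands in for whatever scheme the paper uses.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(w).max()
    if max_abs == 0:
        return w.copy()
    scale = max_abs / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def spectrum_ratio(w: np.ndarray, w_q: np.ndarray) -> np.ndarray:
    """Singular values of w_q divided by those of w, index by index (largest first)."""
    s = np.linalg.svd(w, compute_uv=False)
    s_q = np.linalg.svd(w_q, compute_uv=False)
    return s_q / np.maximum(s, 1e-12)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)) * 0.02  # hypothetical stand-in for one layer's weights
w_q = quantize_symmetric(w, bits=4)
ratio = spectrum_ratio(w, w_q)

# A ratio below 1 at high indices would indicate dampened tail components.
print("mean ratio, top 4 components:   ", ratio[:4].mean())
print("mean ratio, bottom 4 components:", ratio[-4:].mean())
```

Applied to real checkpoints, the same ratio could be computed per layer and per bit-width to see where, if anywhere, the tail shrinks.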
If this is right
- Quantized VLMs can be deployed directly for tasks requiring both speed and better calibration without separate post-processing.
- OOD detection performance rises as a byproduct of quantization, reducing the need for dedicated detection modules in some settings.
- Noise robustness improves, supporting use in real-world environments with sensor or input perturbations.
- No automatic gains occur for covariate shift or spurious correlations, so separate techniques remain necessary for those failure modes.
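To make the calibration claim concrete, here is a minimal sketch of expected calibration error (ECE), a common calibration metric. The page does not say which metric the paper uses, so ECE, the equal-width binning, and the function name `expected_calibration_error` are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins.

    confidences: top-1 predicted probabilities in (0, 1].
    correct:     1 where the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Hypothetical predictions: always 90% confident but right only half the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```

A lower value after quantization than before, on the same evaluation set, would be the kind of calibration gain the bullets above describe.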
Where Pith is reading between the lines
- Similar spectral effects might appear when applying quantization to other large multimodal models beyond VLMs.
- The approach could be tested as a default preprocessing step for any efficiency-driven deployment of foundation models.
- Spectral analysis before and after quantization might serve as a diagnostic tool to predict reliability improvements on new datasets.
Load-bearing premise
The reliability gains are caused by the spectral filtering mechanism rather than other side effects of quantization or the specific models and datasets tested.
What would settle it
If models with high-rank components artificially suppressed by non-quantization methods fail to show matching gains in accuracy, calibration, and OOD detection, the causal link would be disproved.
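One way such a non-quantization intervention could look is sketched below: scale down the tail singular values of a full-precision weight matrix directly, leaving precision untouched. The function name `dampen_tail`, the `keep` and `factor` parameters, and the random stand-in matrix are hypothetical; the paper's own control experiment, if any, may differ.

```python
import numpy as np

def dampen_tail(w: np.ndarray, keep: int, factor: float = 0.0) -> np.ndarray:
    """Scale all singular values after the first `keep` by `factor`.

    factor=0.0 is a hard rank-`keep` projection; 0 < factor < 1 is a soft damping.
    """
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    s_new = s.copy()
    s_new[keep:] *= factor
    return (u * s_new) @ vt  # rebuild the matrix from the modified spectrum

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 12))     # hypothetical stand-in for a weight matrix
w_low = dampen_tail(w, keep=4)        # hard projection onto the top 4 components
print("rank before:", np.linalg.matrix_rank(w))
print("rank after: ", np.linalg.matrix_rank(w_low))
```

Evaluating a model with weights modified this way on the same reliability metrics would separate the effect of rank damping from other quantization side effects.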
Original abstract
Vision-Language Models (VLMs) such as CLIP have revolutionized zero-shot classification and safety-critical tasks, including Out-of-Distribution (OOD) detection. However, their high computational cost hinders efficient real-world deployment. While quantization is a standard solution for efficiency, its broader impact on reliability metrics beyond simple Top-1 accuracy remains critically under-explored. In this study, we conduct a large-scale evaluation of VLM quantization across a comprehensive experimental suite of over 700k evaluation runs with varying configurations. We find that, contrary to the assumption that quantization's noise degrades performance, it can simultaneously improve accuracy, calibration, OOD detection, and robustness to noise, though not to covariate shift or spurious correlations. We leverage these counterintuitive findings to characterize the mechanics of quantization beyond simple regularization: we show that quantization dampens high-rank spectral components, compelling the model to rely more heavily on robust, low-rank features. Ultimately, this spectral filtering effect drives the observed improvements in generalization and noise tolerance, establishing a pathway to deploy faster, more reliable VLMs by utilizing quantization beyond its conventional role.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports results from a large-scale empirical study involving over 700k evaluation runs on quantizing Vision-Language Models such as CLIP. It claims that quantization can simultaneously improve accuracy, calibration, OOD detection, and robustness to additive noise (while not improving robustness to covariate shift or spurious correlations) and attributes these gains to a spectral filtering mechanism in which quantization dampens high-rank components, causing the model to rely more on robust low-rank features.
Significance. If the central empirical observations are robust, the work would be significant for efficient and reliable VLM deployment: it challenges the standard view of quantization as a pure efficiency-accuracy tradeoff and suggests a pathway to obtain reliability benefits at reduced precision. The experimental volume is a notable strength. The proposed spectral explanation is intriguing but currently rests on post-hoc interpretation rather than an isolated causal test, limiting the strength of the mechanistic contribution.
major comments (2)
- [Spectral Analysis section] The claim that quantization improves reliability metrics by dampening high-rank spectral components (forcing reliance on low-rank features) is load-bearing for the counterintuitive positive effects. The evidence consists of post-hoc SVD comparisons between quantized and full-precision weights; without an intervention that applies equivalent rank damping independently of precision reduction (e.g., explicit low-rank projection or controlled noise), the causal attribution remains vulnerable to confounding by other quantization side-effects such as clipping or dynamic-range reduction.
- [Experimental Results section] With >700k runs across many model/dataset/quantization configurations and multiple reliability metrics, the manuscript reports consistent improvements yet provides no information on statistical controls, multiple-testing correction, or whether the spectral hypotheses were pre-specified versus post-hoc. This directly affects the reliability of the claimed gains and should be addressed to support the central claims.
minor comments (2)
- [Abstract] The phrase 'over 700k evaluation runs' should be accompanied by a brief breakdown of the number of models, bit-widths, and datasets to allow immediate assessment of coverage.
- [Figures] Figure captions and legends: spectral plots would benefit from explicit indication of whether error bars represent standard deviation across random seeds or across datasets.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to strengthen the mechanistic claims and statistical reporting. We address each major comment below and outline the revisions we will incorporate.
Point-by-point responses
- Referee: [Spectral Analysis section] The claim that quantization improves reliability metrics by dampening high-rank spectral components (forcing reliance on low-rank features) is load-bearing for the counterintuitive positive effects. The evidence consists of post-hoc SVD comparisons between quantized and full-precision weights; without an intervention that applies equivalent rank damping independently of precision reduction (e.g., explicit low-rank projection or controlled noise), the causal attribution remains vulnerable to confounding by other quantization side-effects such as clipping or dynamic-range reduction.
  Authors: We agree that an explicit causal intervention would provide stronger support for attributing the reliability gains specifically to rank damping. Our current analysis shows that quantization systematically attenuates the singular values associated with high-rank components, and that the observed reliability improvements scale with the degree of this attenuation across models and bit-widths. To isolate this mechanism from other quantization effects, we will add experiments in the revision that apply controlled low-rank projections directly to full-precision weights and compare the resulting reliability metrics against those obtained via quantization. This addition will clarify the contribution of spectral filtering. revision: yes
- Referee: [Experimental Results section] With >700k runs across many model/dataset/quantization configurations and multiple reliability metrics, the manuscript reports consistent improvements yet provides no information on statistical controls, multiple-testing correction, or whether the spectral hypotheses were pre-specified versus post-hoc. This directly affects the reliability of the claimed gains and should be addressed to support the central claims.
  Authors: We appreciate this point on statistical transparency. The manuscript prioritizes reporting the direction and consistency of effects across the full experimental grid rather than per-comparison significance tests. In the revised version we will add a statistical considerations subsection that (i) quantifies the fraction of configurations exhibiting each improvement, (ii) applies appropriate multiple-testing corrections to aggregated comparisons, and (iii) explicitly notes that the spectral analysis was exploratory yet directly motivated by the empirical patterns. These additions will address concerns about reliability without changing the reported trends. revision: yes
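The statistical additions promised above could take roughly the following shape: a sign test on the fraction of configurations that improve for each metric, followed by a Holm correction across metrics. The counts are made up for illustration, the choice of sign test and Holm procedure is an assumption (the rebuttal does not name specific methods), and the sketch treats configurations as independent, which they may not be when models or datasets are shared.

```python
from math import comb

def sign_test_p(n_improved: int, n_total: int) -> float:
    """Two-sided exact sign test against the null that improvement is a coin flip."""
    k = max(n_improved, n_total - n_improved)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Hypothetical counts: (configurations improved, configurations tested) per metric.
counts = {"accuracy": (70, 100), "calibration": (64, 100), "ood_detection": (55, 100)}
raw = [sign_test_p(k, n) for k, n in counts.values()]
for name, p_raw, p_adj in zip(counts, raw, holm_adjust(raw)):
    print(f"{name}: raw p = {p_raw:.4g}, Holm-adjusted p = {p_adj:.4g}")
```

Reporting the improved fraction next to the adjusted p-value keeps the effect direction visible while controlling the family-wise error rate across metrics.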
Circularity Check
No circularity: empirical observations and post-hoc spectral interpretation
The paper reports results from a large-scale empirical study (>700k runs) on quantization effects across accuracy, calibration, OOD, and robustness metrics. The spectral filtering claim is presented as an interpretation of observed weight spectra (dampening of high-rank components) rather than a mathematical derivation or fitted parameter renamed as prediction. No self-citations, uniqueness theorems, or ansatzes are invoked to close the argument; the central claims rest on direct experimental comparisons that remain falsifiable against external data. This matches the default case of a self-contained empirical paper with no load-bearing reduction to its own inputs.
Forward citations
Cited by 1 Pith paper
- How Robustly do LLMs Understand Execution Semantics?
  Frontier LLMs like GPT-5.2 show large accuracy drops on perturbed program-output prediction tasks while open-source reasoning models remain more stable, exposing limits in code semantics understanding.