Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy

Alexandra Gomez-Villa; Aymen Bouguerra; Chokri Mraidha; Daniel Montoya; Fabio Arnez

arXiv: 2509.21173 · v6 · pith:BC6XOGXXnew · submitted 2025-09-25 · 💻 cs.CV · cs.AI · cs.LG

Pith reviewed 2026-05-21 22:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords: quantization, vision-language models, VLMs, OOD detection, model calibration, robustness, spectral analysis, low-rank features

The pith

Quantizing vision-language models can improve accuracy, calibration, OOD detection, and noise robustness at the same time; the paper attributes this to quantization dampening high-rank spectral components and shifting reliance to low-rank features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how quantization affects vision-language models like CLIP across more than 700,000 runs, focusing on reliability metrics beyond basic accuracy. It finds that reducing precision can simultaneously raise accuracy, improve calibration, enhance out-of-distribution detection, and increase robustness to noise, though it does not help with covariate shifts or spurious correlations. The authors trace these gains to a spectral filtering process in which quantization suppresses high-rank components, forcing the model to depend on more stable low-rank features instead. A reader would care because this turns a standard efficiency tool into a way to make deployed VLMs more trustworthy for safety-critical applications without extra training or data.

Core claim

Quantization dampens high-rank spectral components in VLMs, compelling the model to rely more heavily on robust low-rank features; this spectral filtering drives simultaneous gains in accuracy, calibration, OOD detection, and noise robustness, though not in handling covariate shift or spurious correlations.

What carries the argument

Spectral filtering effect of quantization, which suppresses high-rank components and redirects the model toward stable low-rank features.
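
One way to look at this effect on a single weight matrix is to quantize it and compare singular-value spectra before and after. The sketch below is illustrative only: `fake_quantize` (symmetric, per-tensor, uniform) and `tail_energy_fraction` are hypothetical helpers, the matrix is synthetic, and the paper's actual quantizers and spectral measurements may differ. On a toy matrix the tail energy can move in either direction, so this is a diagnostic, not a reproduction of the reported result.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 127 for 8 bits
    max_abs = float(np.max(np.abs(w)))
    if max_abs == 0.0:
        return w.copy()
    scale = max_abs / qmax                   # step size of the integer grid
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def tail_energy_fraction(w: np.ndarray, k: int) -> float:
    """Fraction of squared singular-value energy outside the top-k components."""
    s = np.linalg.svd(w, compute_uv=False)   # singular values, descending
    energy = s ** 2
    return float(energy[k:].sum() / energy.sum())

rng = np.random.default_rng(0)
# Hypothetical stand-in for a layer weight: rank-8 signal plus a small dense residual.
w_fp = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64)) + 0.05 * rng.normal(size=(64, 64))

for bits in (8, 6, 4):
    w_q = fake_quantize(w_fp, bits)
    print(bits, tail_energy_fraction(w_fp, 8), tail_energy_fraction(w_q, 8))
```

Comparing the two printed fractions per bit-width shows how much spectral energy sits beyond the top 8 components before and after quantization for this synthetic matrix.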

If this is right

Quantized VLMs can be deployed directly for tasks requiring both speed and better calibration without separate post-processing.
OOD detection performance rises as a byproduct of quantization, reducing the need for dedicated detection modules in some settings.
Noise robustness improves, supporting use in real-world environments with sensor or input perturbations.
No automatic gains occur for covariate shift or spurious correlations, so separate techniques remain necessary for those failure modes.
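
To make "better calibration" concrete, the sketch below computes the expected calibration error (ECE), a common summary of the gap between confidence and accuracy. This is an assumption about the metric: the page does not state exactly how the paper scores calibration (the figure captions mention reliability diagrams and NLL), and the function name and the 10-bin default are illustrative choices.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b covers (edges[b], edges[b + 1]]; a confidence of 0 lands in bin 0.
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(hit[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap         # weight by the share of samples in the bin
    return float(ece)
```

Computing this on the same evaluation set for a full-precision and a quantized model gives one number per model; a lower value means confidence tracks accuracy more closely.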

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar spectral effects might appear when applying quantization to other large multimodal models beyond VLMs.
The approach could be tested as a default preprocessing step for any efficiency-driven deployment of foundation models.
Spectral analysis before and after quantization might serve as a diagnostic tool to predict reliability improvements on new datasets.

Load-bearing premise

The reliability gains are caused by the spectral filtering mechanism rather than other side effects of quantization or the specific models and datasets tested.

What would settle it

If models with high-rank components artificially suppressed by non-quantization methods fail to show matching gains in accuracy, calibration, and OOD detection, the causal link would be disproved.
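
A non-quantization intervention of this kind could be sketched as direct damping of the trailing singular values of a full-precision weight matrix. The helper below is hypothetical (`damp_high_rank`, `k`, and `alpha` are not from the paper); with `alpha=0` it reduces to a truncated-SVD low-rank projection.

```python
import numpy as np

def damp_high_rank(w: np.ndarray, k: int, alpha: float) -> np.ndarray:
    """Scale every singular value beyond the top-k by alpha (0 <= alpha <= 1)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    s_damped = s.copy()
    s_damped[k:] *= alpha                    # alpha=0 removes the tail, alpha=1 keeps w
    return (u * s_damped) @ vt               # rebuild the matrix from the damped spectrum

rng = np.random.default_rng(1)
w = rng.normal(size=(12, 10))                # synthetic stand-in for a weight matrix
print(np.linalg.svd(damp_high_rank(w, 3, 0.5), compute_uv=False))
```

Evaluating a model whose weights were damped this way, at full precision, would separate the effect of spectral suppression from other quantization side effects such as clipping.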

Figures

Figures reproduced from arXiv: 2509.21173 by Alexandra Gomez-Villa, Aymen Bouguerra, Chokri Mraidha, Daniel Montoya, Fabio Arnez.

**Figure 2.** Average In-distribution accuracy change for WIT…

**Figure 3.** Robustness to Decreasing Quantization Precision.

**Figure 4.** Accuracy evolution of quantized model accuracy…

**Figure 5.** Impact of QAT Methods on CLIP Model Calibra…

**Figure 6.** Direct Impact of QAT on Calibration. These plots…

**Figure 7.** Final State after QAT & Logit Scale ReAdaptation. After adapting the logit scale to the new quantized model, calibration is further improved. Please refer to our appendix for the dataset-specific bin-wise shift.

**Figure 8.** Divergent Impact of QAT on Robustness to Co…

**Figure 9.** Impact of quantization on OOD Detection (AUROC). Average AUROC across quantization methods (higher is better).

**Figure 10.** Frequency-domain impact of quantization. Top:…

**Figure 11.** Spectral analysis of a ViT-B/32 model with 6-bit quantization. From left to right: the FP32 baseline spectrum, the PTQ spectrum showing severe high-frequency attenuation, and the partially restored QAT spectrum. Bottom row shows corresponding RSE maps.

**Figure 13.** Systematic benchmark suite categorized by Co…

**Figure 14.** Evolution of ViT-B/32 Quantized model accu…

**Figure 15.** A conceptual illustration of how QAT forces the…

**Figure 16.** Dataset-specific reliability diagrams showing the direct impact of Quantization-Aware Training (Phase 1). This plot…

**Figure 17.** Dataset-specific reliability diagrams showing the final calibration state after the full two-phase process (QAT + Logit…

**Figure 18.** The trade-off between zero-shot accuracy and Negative Log-Likelihood (NLL). The ideal outcome (green, top-left)…

**Figure 19.** Full-size teaser figure: The dichotomous impact of quantization on zero-shot Performance. WIT models (blue) con…

**Figure 20.** Average FPR@95 across all Far-OOD datasets. Lower values are better. This plot complements the AUROC results…

**Figure 21.** Confidence bins shift after QAT and Logit Adaptation, we can clearly see the trend where the overconfident LAION…

**Figure 22.** The change in OOD detection (AUROC) performance on various covariate shift datasets after applying Light QAT…
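
The OOD figures above report AUROC and FPR@95. As a rough guide to how these two numbers are typically computed from per-sample scores, here is a minimal sketch. It assumes that a higher score means "more in-distribution" (for example a maximum softmax probability) and that in-distribution is the positive class; the paper's actual scoring function is not stated on this page, and both function names are hypothetical.

```python
import numpy as np

def auroc(id_scores, ood_scores) -> float:
    """Probability that a random ID sample outscores a random OOD sample (ties count 0.5)."""
    id_s = np.asarray(id_scores, dtype=float)[:, None]
    ood_s = np.asarray(ood_scores, dtype=float)[None, :]
    # Pairwise comparison uses O(n_id * n_ood) memory; fine for small evaluation sets.
    return float((id_s > ood_s).mean() + 0.5 * (id_s == ood_s).mean())

def fpr_at_tpr(id_scores, ood_scores, tpr: float = 0.95) -> float:
    """Share of OOD samples accepted at the threshold that keeps at least `tpr` of ID samples."""
    id_s = np.sort(np.asarray(id_scores, dtype=float))
    ood_s = np.asarray(ood_scores, dtype=float)
    n_keep = int(np.ceil(tpr * id_s.size))   # number of ID samples that must pass
    threshold = id_s[id_s.size - n_keep]     # the n_keep-th largest ID score
    return float((ood_s >= threshold).mean())
```

Higher AUROC and lower FPR@95 both indicate better separation between in-distribution and OOD inputs, matching the "higher is better" and "lower values are better" notes in the captions.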

Original abstract

Vision-Language Models (VLMs) such as CLIP have revolutionized zero-shot classification and safety-critical tasks, including Out-of-Distribution (OOD) detection. However, their high computational cost hinders efficient real-world deployment. While quantization is a standard solution for efficiency, its broader impact on reliability metrics beyond simple Top-1 accuracy remains critically under-explored. In this study, we conduct a large-scale evaluation of VLM quantization across a comprehensive experimental suite of over 700k evaluation runs with varying configurations. We find that, contrary to the assumption that quantization's noise degrades performance, it can simultaneously improve accuracy, calibration, OOD detection, and robustness to noise, though not to covariate shift or spurious correlations. We leverage these counterintuitive findings to characterize the mechanics of quantization beyond simple regularization: we show that quantization dampens high-rank spectral components, compelling the model to rely more heavily on robust, low-rank features. Ultimately, this spectral filtering effect drives the observed improvements in generalization and noise tolerance, establishing a pathway to deploy faster, more reliable VLMs by utilizing quantization beyond its conventional role.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports results from a large-scale empirical study involving over 700k evaluation runs on quantizing Vision-Language Models such as CLIP. It claims that quantization can simultaneously improve accuracy, calibration, OOD detection, and robustness to additive noise (while not improving robustness to covariate shift or spurious correlations) and attributes these gains to a spectral filtering mechanism in which quantization dampens high-rank components, causing the model to rely more on robust low-rank features.

Significance. If the central empirical observations are robust, the work would be significant for efficient and reliable VLM deployment: it challenges the standard view of quantization as a pure efficiency-accuracy tradeoff and suggests a pathway to obtain reliability benefits at reduced precision. The experimental volume is a notable strength. The proposed spectral explanation is intriguing but currently rests on post-hoc interpretation rather than an isolated causal test, limiting the strength of the mechanistic contribution.

major comments (2)

[Spectral Analysis section] The claim that quantization improves reliability metrics by dampening high-rank spectral components (forcing reliance on low-rank features) is load-bearing for the counterintuitive positive effects. The evidence consists of post-hoc SVD comparisons between quantized and full-precision weights; without an intervention that applies equivalent rank damping independently of precision reduction (e.g., explicit low-rank projection or controlled noise), the causal attribution remains vulnerable to confounding by other quantization side-effects such as clipping or dynamic-range reduction.
[Experimental Results section] With >700k runs across many model/dataset/quantization configurations and multiple reliability metrics, the manuscript reports consistent improvements yet provides no information on statistical controls, multiple-testing correction, or whether the spectral hypotheses were pre-specified versus post-hoc. This directly affects the reliability of the claimed gains and should be addressed to support the central claims.

minor comments (2)

[Abstract] The phrase 'over 700k evaluation runs' should be accompanied by a brief breakdown of the number of models, bit-widths, and datasets to allow immediate assessment of coverage.
[Figures] Figure captions and legends: spectral plots would benefit from explicit indication of whether error bars represent standard deviation across random seeds or across datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to strengthen the mechanistic claims and statistical reporting. We address each major comment below and outline the revisions we will incorporate.

Point-by-point responses

Referee: [Spectral Analysis section] The claim that quantization improves reliability metrics by dampening high-rank spectral components (forcing reliance on low-rank features) is load-bearing for the counterintuitive positive effects. The evidence consists of post-hoc SVD comparisons between quantized and full-precision weights; without an intervention that applies equivalent rank damping independently of precision reduction (e.g., explicit low-rank projection or controlled noise), the causal attribution remains vulnerable to confounding by other quantization side-effects such as clipping or dynamic-range reduction.

Authors: We agree that an explicit causal intervention would provide stronger support for attributing the reliability gains specifically to rank damping. Our current analysis shows that quantization systematically attenuates high singular values while the observed reliability improvements scale with the degree of this attenuation across models and bit-widths. To isolate this mechanism from other quantization effects, we will add experiments in the revision that apply controlled low-rank projections directly to full-precision weights and compare the resulting reliability metrics against those obtained via quantization. This addition will clarify the contribution of spectral filtering.

Revision: yes

Referee: [Experimental Results section] With >700k runs across many model/dataset/quantization configurations and multiple reliability metrics, the manuscript reports consistent improvements yet provides no information on statistical controls, multiple-testing correction, or whether the spectral hypotheses were pre-specified versus post-hoc. This directly affects the reliability of the claimed gains and should be addressed to support the central claims.

Authors: We appreciate this point on statistical transparency. The manuscript prioritizes reporting the direction and consistency of effects across the full experimental grid rather than per-comparison significance tests. In the revised version we will add a statistical considerations subsection that (i) quantifies the fraction of configurations exhibiting each improvement, (ii) applies appropriate multiple-testing corrections to aggregated comparisons, and (iii) explicitly notes that the spectral analysis was exploratory yet directly motivated by the empirical patterns. These additions will address concerns about reliability without changing the reported trends.

Revision: yes
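
Items (i) and (ii) of the promised subsection can be illustrated with a short sketch. The rebuttal does not name a specific correction, so the Holm-Bonferroni step-down procedure used here is an assumption, and both helper names and the example numbers are hypothetical.

```python
def fraction_improved(deltas):
    """Share of configurations whose metric change is strictly positive."""
    return sum(d > 0 for d in deltas) / len(deltas)

def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where it is rejected at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p-value
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail too
    return reject

# Made-up per-configuration accuracy changes and p-values, for illustration only.
print(fraction_improved([0.1, -0.2, 0.3, 0.0]))
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

Holm's procedure controls the family-wise error rate without assuming independence between tests, which is why it is a common default when many metric comparisons are aggregated.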

Circularity Check

0 steps flagged

No circularity: empirical observations and post-hoc spectral interpretation

full rationale

The paper reports results from a large-scale empirical study (>700k runs) on quantization effects across accuracy, calibration, OOD, and robustness metrics. The spectral filtering claim is presented as an interpretation of observed weight spectra (dampening of high-rank components) rather than a mathematical derivation or fitted parameter renamed as prediction. No self-citations, uniqueness theorems, or ansatzes are invoked to close the argument; the central claims rest on direct experimental comparisons that remain falsifiable against external data. This matches the default case of a self-contained empirical paper with no load-bearing reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical study whose central claims rest on experimental observations and post-experiment spectral analysis rather than on explicit axioms or invented theoretical entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How Robustly do LLMs Understand Execution Semantics?
cs.SE · 2026-02 · unverdicted · novelty 6.0

Frontier LLMs like GPT-5.2 show large accuracy drops on perturbed program-output prediction tasks while open-source reasoning models remain more stable, exposing limits in code semantics understanding.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
[3]

Arnez Yagualca, F. A. 2023. Deep neural network uncertainty runtime monitoring for robust and safe AI-based automated navigation . Theses, Universit \'e Paris-Saclay

work page 2023
[4]

Bishop, C. M. 2006. Pattern recognition and machine learning. Springer

work page 2006
[5]

D.; and Nagel, M

Bondarenko, Y.; Chiaro, R. D.; and Nagel, M. 2024. Low-Rank Quantization-Aware Training for LLMs . arXiv:2406.06385

work page arXiv 2024
[6]

Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; and Vedaldi, A. 2014. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 3606--3613

work page 2014
[7]

Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems (NIPS), volume 28

work page 2015
[8]

Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1. In Advances in neural information processing systems (NIPS), volume 29

work page 2016
[9]

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248--255. Ieee

work page 2009
[10]

Desai, S.; and Durrett, G. 2020. Calibration of Pre-trained Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3979--3991

work page 2020
[11]

K.; McKinstry, J

Esser, S. K.; McKinstry, J. L.; Bablani, D.; Mallya, A.; Appuswamy, R.; and Rath, D. 2020. Learned step size quantization. In International Conference on Learning Representations (ICLR)

work page 2020
[12]

European Parliament and Council of the European Union . 2024. Artificial Intelligence Act . https://artificialintelligenceact.eu/fr/article/15/. Regulation (EU) 2024/1689. Specifically referencing Article 15 on 'Accuracy, robustness and cybersecurity'. Accessed: 2025-08-01

work page 2024
[13]

Fawcett, T. 2006. An introduction to ROC analysis. Pattern recognition letters, 27(8): 861--874

work page 2006
[14]

Frankle, J.; and Carbin, M. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR)

work page 2019
[15]

Gong, R.; Liu, X.; Jiang, S.; Li, T.; Fua, P.; and Yan, S. 2019. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV), 4852--4861

work page 2019
[16]

Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On calibration of modern neural networks. In International conference on machine learning (ICML), 1321--1330. PMLR

work page 2017
[17]

Hendrycks, D.; Basart, S.; Mu, N.; Kadavath, S.; Wang, F.; Dorundo, E.; Desai, R.; Zhu, T.; Parajuli, S.; Hvilshoj, M.; et al. 2021 a . The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 8340--8349

work page 2021
[18]

Hendrycks, D.; Carlini, N.; Schulman, J.; and Steinhardt, J. 2021 b . Unsolved problems in ml safety. arXiv preprint arXiv:2109.13916

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Hendrycks, D.; and Dietterich, T. 2019. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR)

work page 2019
[20]

Hendrycks, D.; and Gimpel, K. 2017. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR)

work page 2017
[21]

Hochlehnert, A.; Bhatnagar, H.; Udandarao, V.; Albanie, S.; Prabhu, A.; and Bethge, M. 2025. A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility. arXiv preprint arXiv:2504.07086

work page arXiv 2025
[22]

Hochreiter, S.; and Schmidhuber, J. 1997. Flat minima. Neural computation, 9(1): 1--42

work page 1997
[23]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2021. LoRA : Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685. Published at the International Conference on Learning Representations (ICLR) 2022

work page internal anchor Pith review Pith/arXiv arXiv 2021
[24]

Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; and Kalenichenko, D. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2704--2713

work page 2018
[25]

O.; Choi, D.; Bhattacharjee, B.; Lien, A.-T.; and Pfister, T

Kar, P.; Ar k, S. O.; Choi, D.; Bhattacharjee, B.; Lien, A.-T.; and Pfister, T. 2023. LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning. In The Eleventh International Conference on Learning Representations (ICLR)

work page 2023
[26]

A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al

Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13): 3521--3526

work page 2017
[27]

Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto

work page 2009
[28]

M.; Song, H.; and Flach, P

Kull, M.; Perello-Nieto, M.; K \"a ng, M.; Filho, T. M.; Song, H.; and Flach, P. 2019. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems (NeurIPS), volume 32

work page 2019
[29]

Lee, K.; Lee, K.; Lee, H.; and Shin, J. 2018. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in neural information processing systems (NIPS), volume 31

work page 2018
[30]

Li, Y.; Xu, S.; Zhang, B.; Cao, X.; Gao, P.; and Guo, G. 2022 a . Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer. In Advances in Neural Information Processing Systems (NeurIPS)

work page 2022
[31]

Li, Z.; Cui, C.; Liu, X.; Zhang, Y.; Chang, S.; Cheng, H.; Cheng, Y.; and Chen, J. 2022 b . CLIP-Q : Turning full-precision CLIP into a 4-bit model. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 24031--24043

work page 2022
[32]

Li, Z.; Wang, F.; Zhang, Z.; and Li, F. 2023. NegCLIP : A Negative-Prompt-based Method for OOD Detection in Vision-Language Models. arXiv preprint arXiv:2310.03114

work page arXiv 2023
[33]

Liu, W.; Wang, X.; Owens, J.; and Li, Y. 2020. Energy-based out-of-distribution detection. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, 21464--21475

work page 2020
[34]

Mayilvahanan, P.; Wiedemer, T.; Rusak, E.; Bethge, M.; and Brendel, W. 2023. Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity? arXiv preprint arXiv:2310.09562

work page arXiv 2023
[35]

Ming, Y.; and Li, Y. 2022. Delving into the Open-Set World: A Framework for Unsupervised Out-of-Distribution Detection. In European Conference on Computer Vision (ECCV)

work page 2022
[36]

H.; Liu, Z.; Yamasaki, T.; and Aizawa, K

Miyai, A.; Yang, J.; Zhang, J.; Ming, Y.; Lin, Y.; Yu, Q.; Irie, G.; Joty, S.; Li, Y.; Li, H. H.; Liu, Z.; Yamasaki, T.; and Aizawa, K. 2025. Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey. In Transactions on Machine Learning Research (TMLR)

work page 2025
[37]

Nakkiran, P.; Kaplun, G.; Bansal, Y.; Yang, T.; Barak, B.; and Sutskever, I. 2021. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12): 124003

work page 2021
[38]

Noda, S.; Miyai, A.; Yu, Q.; Irie, G.; and Aizawa, K. 2025. A Benchmark and Evaluation for Real-World Out-of-Distribution Detection using Vision-Language Models. arXiv preprint arXiv:2501.18463v1

work page arXiv 2025
[39]

Polino, A.; Pascanu, R.; and Alistarh, D. 2018. Quantization-aware knowledge distillation. In International Conference on Learning Representations (ICLR) Workshop

work page 2018
[40]

W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning (ICML), 8748--8763. PMLR

work page 2021
[41]

Recht, B.; Roelofs, R.; Schmidt, L.; and Shankar, V. 2019. Do ImageNet Classifiers Generalize to ImageNet? In International Conference on Machine Learning (ICML), 5389--5400. PMLR

work page 2019
[42]

Saqib, J.; Hieu, L.; and Mathieu, S. 2025. QT-DoG : Quantization-aware Training for Domain Generalization. In International Conference on Learning Representations (ICLR)

work page 2025
[43]

Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. 2022. LAION-5B : An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

work page 2022
[44]

Shao, W.; Zhao, L.; He, Z.; Jiao, Z.; Chen, P.; and Ng, K.-T. 2023. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. In The Eleventh International Conference on Learning Representations (ICLR)

work page 2023
[45]

Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556--2565

work page 2018
[46]

Tallec, C.; Blier, L.; and Ollivier, Y. 2023. Revisiting the Regularization Effect of Quantization. arXiv preprint arXiv:2310.03113

work page arXiv 2023
[47]

Teney, D.; Abbasi, E.; and van den Hengel, A. 2022. On the Pitfalls of Spurious Correlations for OOD Generalization. In International Conference on Learning Representations (ICLR)

work page 2022
[48]

C.; and Bialek, W

Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, 368--377

work page 2000
[49]

Tu, W.; Deng, W.; and Gedeon, T. 2023. A closer look at the robustness of contrastive language-image pre-training (clip). Advances in Neural Information Processing Systems, 36: 13678--13691

work page 2023
[50]

Van Horn, G.; Mac Aodha, O.; Song, Y.; Cui, Y.; Sun, C.; Shepard, A.; Adam, H.; Perona, P.; and Belongie, S. 2018. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 8769--8778

work page 2018
[51]

Wang, H.; Ge, S.; Lipton, Z.; and Xing, E. P. 2019. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems (NeurIPS), volume 32

work page 2019
[52]

Wang, Q.; Lin, Y.; Chen, Y.; Schmidt, L.; Han, B.; and Zhang, T. 2024. A Sober Look at the Robustness of CLIPs to Spurious Features. In Advances in Neural Information Processing Systems (NeurIPS)

work page 2024
[53]

Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; and Han, S. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning (ICML), 38087--38101. PMLR

work page 2023
[54]

A.; Oliva, A.; and Torralba, A

Xiao, J.; Hays, J.; Ehinger, K. A.; Oliva, A.; and Torralba, A. 2010. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, 3485--3492. IEEE

work page 2010
[55]

Yang, J.; Wang, P.; Zou, D.; Zhou, Z.; Ding, K.; Peng, W.; Wang, H.; Chen, G.; Li, B.; Sun, Y.; Du, X.; Zhou, K.; Zhang, W.; Hendrycks, D.; Li, Y.; and Liu, Z. 2022. OpenOOD : Benchmarking Generalized Out-of-Distribution Detection. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 30150--30164

work page 2022
[56]

Yang, J.; Zhou, K.; and Liu, Z. 2022. Full-Spectrum Out-of-Distribution Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16293--16302

work page 2022
[57]

Yin, D.; Gontareva, A.; Gontarev, I.; Kornblith, S.; Gu, S.; and Le, Q. V. 2019. A Fourier perspective on the generalization of deep neural networks. In International Conference on Machine Learning (ICML), 7133--7142. PMLR

work page 2019
[58]

Zhang, J.; Yang, J.; Wang, P.; Wang, H.; Lin, Y.; Zhang, H.; Sun, Y.; Du, X.; Li, Y.; Liu, Z.; Chen, Y.; and Li, H. 2024. OpenOOD v1.5 : Enhanced Benchmark for Out-of-Distribution Detection. Journal of Data-centric Machine Learning Research

work page 2024
[59]

Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Places: A 10 million image database for scene recognition. In IEEE transactions on pattern analysis and machine intelligence, volume 40, 1452--1464. IEEE

work page 2017

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3]

Arnez Yagualca, F. A. 2023. Deep neural network uncertainty runtime monitoring for robust and safe AI-based automated navigation. Theses, Université Paris-Saclay

[4]

Bishop, C. M. 2006. Pattern recognition and machine learning. Springer

[5]

Bondarenko, Y.; Chiaro, R. D.; and Nagel, M. 2024. Low-Rank Quantization-Aware Training for LLMs. arXiv:2406.06385

[6]

Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; and Vedaldi, A. 2014. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 3606--3613

[7]

Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems (NIPS), volume 28

[8]

Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. In Advances in neural information processing systems (NIPS), volume 29

[9]

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248--255. IEEE

[10]

Desai, S.; and Durrett, G. 2020. Calibration of Pre-trained Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3979--3991

[11]

Esser, S. K.; McKinstry, J. L.; Bablani, D.; Mallya, A.; Appuswamy, R.; and Rath, D. 2020. Learned step size quantization. In International Conference on Learning Representations (ICLR)

[12]

European Parliament and Council of the European Union. 2024. Artificial Intelligence Act. https://artificialintelligenceact.eu/fr/article/15/. Regulation (EU) 2024/1689. Specifically referencing Article 15 on 'Accuracy, robustness and cybersecurity'. Accessed: 2025-08-01

[13]

Fawcett, T. 2006. An introduction to ROC analysis. Pattern recognition letters, 27(8): 861--874

[14]

Frankle, J.; and Carbin, M. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR)

[15]

Gong, R.; Liu, X.; Jiang, S.; Li, T.; Fua, P.; and Yan, S. 2019. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV), 4852--4861

[16]

Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On calibration of modern neural networks. In International conference on machine learning (ICML), 1321--1330. PMLR

[17]

Hendrycks, D.; Basart, S.; Mu, N.; Kadavath, S.; Wang, F.; Dorundo, E.; Desai, R.; Zhu, T.; Parajuli, S.; Hvilshoj, M.; et al. 2021a. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 8340--8349

[18]

Hendrycks, D.; Carlini, N.; Schulman, J.; and Steinhardt, J. 2021b. Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916

[19]

Hendrycks, D.; and Dietterich, T. 2019. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR)

[20]

Hendrycks, D.; and Gimpel, K. 2017. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR)

[21]

Hochlehnert, A.; Bhatnagar, H.; Udandarao, V.; Albanie, S.; Prabhu, A.; and Bethge, M. 2025. A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility. arXiv preprint arXiv:2504.07086

[22]

Hochreiter, S.; and Schmidhuber, J. 1997. Flat minima. Neural computation, 9(1): 1--42

[23]

Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685. Published at the International Conference on Learning Representations (ICLR) 2022

[24]

Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; and Kalenichenko, D. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2704--2713

[25]

Kar, P.; Arık, S. O.; Choi, D.; Bhattacharjee, B.; Lien, A.-T.; and Pfister, T. 2023. LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning. In The Eleventh International Conference on Learning Representations (ICLR)

[26]

Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13): 3521--3526

[27]

Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto

[28]

Kull, M.; Perello-Nieto, M.; Käng, M.; Filho, T. M.; Song, H.; and Flach, P. 2019. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems (NeurIPS), volume 32

[29]

Lee, K.; Lee, K.; Lee, H.; and Shin, J. 2018. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in neural information processing systems (NIPS), volume 31

[30]

Li, Y.; Xu, S.; Zhang, B.; Cao, X.; Gao, P.; and Guo, G. 2022a. Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer. In Advances in Neural Information Processing Systems (NeurIPS)

[31]

Li, Z.; Cui, C.; Liu, X.; Zhang, Y.; Chang, S.; Cheng, H.; Cheng, Y.; and Chen, J. 2022b. CLIP-Q: Turning full-precision CLIP into a 4-bit model. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 24031--24043

[32]

Li, Z.; Wang, F.; Zhang, Z.; and Li, F. 2023. NegCLIP: A Negative-Prompt-based Method for OOD Detection in Vision-Language Models. arXiv preprint arXiv:2310.03114

[33]

Liu, W.; Wang, X.; Owens, J.; and Li, Y. 2020. Energy-based out-of-distribution detection. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, 21464--21475

[34]

Mayilvahanan, P.; Wiedemer, T.; Rusak, E.; Bethge, M.; and Brendel, W. 2023. Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity? arXiv preprint arXiv:2310.09562

[35]

Ming, Y.; and Li, Y. 2022. Delving into the Open-Set World: A Framework for Unsupervised Out-of-Distribution Detection. In European Conference on Computer Vision (ECCV)

[36]

Miyai, A.; Yang, J.; Zhang, J.; Ming, Y.; Lin, Y.; Yu, Q.; Irie, G.; Joty, S.; Li, Y.; Li, H. H.; Liu, Z.; Yamasaki, T.; and Aizawa, K. 2025. Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey. Transactions on Machine Learning Research (TMLR)

[37]

Nakkiran, P.; Kaplun, G.; Bansal, Y.; Yang, T.; Barak, B.; and Sutskever, I. 2021. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12): 124003

[38]

Noda, S.; Miyai, A.; Yu, Q.; Irie, G.; and Aizawa, K. 2025. A Benchmark and Evaluation for Real-World Out-of-Distribution Detection using Vision-Language Models. arXiv preprint arXiv:2501.18463v1

[39]

Polino, A.; Pascanu, R.; and Alistarh, D. 2018. Quantization-aware knowledge distillation. In International Conference on Learning Representations (ICLR) Workshop

[40]

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning (ICML), 8748--8763. PMLR

[41]

Recht, B.; Roelofs, R.; Schmidt, L.; and Shankar, V. 2019. Do ImageNet Classifiers Generalize to ImageNet? In International Conference on Machine Learning (ICML), 5389--5400. PMLR

[42]

Saqib, J.; Hieu, L.; and Mathieu, S. 2025. QT-DoG: Quantization-aware Training for Domain Generalization. In International Conference on Learning Representations (ICLR)

[43]

Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

[44]

Shao, W.; Zhao, L.; He, Z.; Jiao, Z.; Chen, P.; and Ng, K.-T. 2023. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. In The Eleventh International Conference on Learning Representations (ICLR)

[45]

Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556--2565

[46]

Tallec, C.; Blier, L.; and Ollivier, Y. 2023. Revisiting the Regularization Effect of Quantization. arXiv preprint arXiv:2310.03113

[47]

Teney, D.; Abbasi, E.; and van den Hengel, A. 2022. On the Pitfalls of Spurious Correlations for OOD Generalization. In International Conference on Learning Representations (ICLR)

[48]

Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, 368--377

[49]

Tu, W.; Deng, W.; and Gedeon, T. 2023. A closer look at the robustness of contrastive language-image pre-training (CLIP). Advances in Neural Information Processing Systems, 36: 13678--13691

[50]

Van Horn, G.; Mac Aodha, O.; Song, Y.; Cui, Y.; Sun, C.; Shepard, A.; Adam, H.; Perona, P.; and Belongie, S. 2018. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 8769--8778

[51]

Wang, H.; Ge, S.; Lipton, Z.; and Xing, E. P. 2019. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems (NeurIPS), volume 32

[52]

Wang, Q.; Lin, Y.; Chen, Y.; Schmidt, L.; Han, B.; and Zhang, T. 2024. A Sober Look at the Robustness of CLIPs to Spurious Features. In Advances in Neural Information Processing Systems (NeurIPS)

[53]

Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; and Han, S. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning (ICML), 38087--38101. PMLR

[54]

Xiao, J.; Hays, J.; Ehinger, K. A.; Oliva, A.; and Torralba, A. 2010. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, 3485--3492. IEEE

[55]

Yang, J.; Wang, P.; Zou, D.; Zhou, Z.; Ding, K.; Peng, W.; Wang, H.; Chen, G.; Li, B.; Sun, Y.; Du, X.; Zhou, K.; Zhang, W.; Hendrycks, D.; Li, Y.; and Liu, Z. 2022. OpenOOD: Benchmarking Generalized Out-of-Distribution Detection. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 30150--30164
