pith. machine review for the scientific record.

arxiv: 2604.15027 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

Quality-Aware Calibration for AI-Generated Image Detection in the Wild

Davide Cozzolino, Fabrizio Guillaro, Luisa Verdoliva, Vincenzo De Rosa

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 10:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated image detection · near-duplicates · quality-aware fusion · image forensics · deepfake detection · viral content · degradation simulation · detection calibration

The pith

Aggregating detector scores across near-duplicates, weighted by estimated image quality, improves the balanced accuracy of AI-generated image detection by about 8 percent on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that AI-generated images in the wild circulate as multiple near-duplicates that have been compressed, resized, or cropped, so the same detector gives inconsistent results on different versions of one image. It proposes retrieving those duplicates for any query image, running a detector on each, and combining the scores with weights derived from estimated quality, so that cleaner versions contribute more. This matters because single-image checks miss the collective evidence available online and can be misled by heavily processed copies. Experiments across several detectors show consistent gains, with balanced accuracy roughly 8 percent higher on average than plain score averaging. Two new datasets support evaluation at scale: one simulating degradation trees in the lab, and one drawn from real viral web content.

Core claim

The central claim is that quality-aware fusion of detector outputs across retrieved near-duplicates produces more reliable decisions than any single instance or unweighted average. Given a query image, the method finds its online near-duplicates, feeds each to an off-the-shelf detector, and aggregates the scores using per-image quality estimates as weights. This accounts for the reduced trustworthiness of degraded versions while still using all available information. The approach is tested on a 136k-image lab dataset of stochastic degradation trees and a 10k-image real-world collection of viral near-duplicates, showing average balanced-accuracy gains of around 8 percent over plain averaging.
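The fusion step described here can be sketched as a convex combination of per-duplicate detector scores with quality-derived weights. This is a minimal illustration of the idea, not the paper's exact formulation; the function name, the [0, 1] quality scale, and the linear weighting are assumptions.

```python
import numpy as np

def quad_fuse(scores, qualities, eps=1e-8):
    """Quality-weighted fusion of detector scores across near-duplicates.

    scores    -- detector outputs per retrieved duplicate
                 (higher = more likely AI-generated)
    qualities -- estimated per-instance quality in [0, 1]
    """
    s = np.asarray(scores, dtype=float)
    w = np.asarray(qualities, dtype=float)
    w = w / (w.sum() + eps)   # normalize weights to sum to 1
    return float(np.dot(w, s))

# Cleaner duplicates dominate the fused score, while plain averaging
# lets heavily degraded copies dilute the evidence.
fused = quad_fuse([0.9, 0.4, 0.2], qualities=[0.95, 0.3, 0.1])
plain = float(np.mean([0.9, 0.4, 0.2]))
```

Here the high-quality copy's score of 0.9 pulls the fused score to about 0.74, versus 0.5 for the plain average, flipping a borderline decision toward "generated."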

What carries the argument

QuAD, the framework that retrieves near-duplicates of a query image, runs a detector on each, and fuses the scores using estimated quality as a weighting factor.

Load-bearing premise

Near-duplicates can be reliably retrieved at web scale and image quality can be estimated accurately enough to serve as a trustworthy weighting factor for the detector scores.

What would settle it

A controlled experiment in which duplicate retrieval is restricted to low-quality versions only, or quality estimates are replaced with random weights, would show whether the reported accuracy gains disappear.
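The random-weights arm of that control can be prototyped on a toy model before touching the real datasets. The sketch below assumes a hypothetical degradation model (detector scores regress from 0.9 toward an uninformative 0.5 as quality drops, plus noise); under that assumption, quality weighting beats random weighting, and the gap is exactly the quantity the proposed experiment would measure on AncesTree and ReWIND.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(scores, weights):
    """Normalize the weights and return the weighted mean score."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w / w.sum(), scores))

# Hypothetical toy degradation model (an assumption, not the paper's):
# a pristine fake scores 0.9, and degraded copies are pulled toward an
# uninformative 0.5 in proportion to their quality loss.
q_fused, r_fused = [], []
for _ in range(2000):
    q = rng.uniform(0.05, 1.0, size=5)                    # per-duplicate quality
    scores = 0.5 + q * 0.4 + rng.normal(0.0, 0.05, size=5)
    q_fused.append(fuse(scores, q))                       # quality weights
    r_fused.append(fuse(scores, rng.uniform(0.0, 1.0, size=5)))  # random weights

gain = float(np.mean(q_fused) - np.mean(r_fused))
```

If the same comparison on the real data showed `gain` collapsing to zero, the quality-weighting claim would fail; if it persisted, the calibration story would be substantially strengthened.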

Figures

Figures reproduced from arXiv: 2604.15027 by Davide Cozzolino, Fabrizio Guillaro, Luisa Verdoliva, Vincenzo De Rosa.

Figure 1. We study AI-generated image detection in real-world online settings. Given a query image, we first retrieve near-duplicate […]

Figure 2. The oldest or largest image (day 2) is not necessarily the […]

Figure 4. Distribution of JPEG quality factors (left), and crop size […]

Figure 5. AncesTree: we build a tree of progressive degradations used to generate near-duplicate image instances. Starting from a clean […]

Figure 6. Score distributions of several forensic detectors […]

Figure 8. Performance in terms of average Balanced Accuracy […]
read the original abstract

Significant progress has been made in detecting synthetic images, however most existing approaches operate on a single image instance and overlook a key characteristic of real-world dissemination: as viral images circulate on the web, multiple near-duplicate versions appear and lose quality due to repeated operations like recompression, resizing and cropping. As a consequence, the same image may yield inconsistent forensic predictions based on which version has been analyzed. In this work, to address this issue we propose QuAD (Quality-Aware calibration with near-Duplicates), a novel framework that makes decisions based on all available near-duplicates of the same image. Given a query, we retrieve its online near-duplicates and feed them to a detector: the resulting scores are then aggregated based on the estimated quality of the corresponding instance. By doing so, we take advantage of all pieces of information while accounting for the reduced reliability of images impaired by multiple processing steps. To support large-scale evaluation, we introduce two datasets: AncesTree, an in-lab dataset of 136k images organized in stochastic degradation trees that simulate online reposting dynamics, and ReWIND, a real-world dataset of nearly 10k near-duplicate images collected from viral web content. Experiments on several state-of-the-art detectors show that our quality-aware fusion improves their performance consistently, with an average gain of around 8% in terms of balanced accuracy compared to plain average. Our results highlight the importance of jointly processing all the images available online to achieve reliable detection of AI-generated content in real-world applications. Code and data are publicly available at https://grip-unina.github.io/QuAD/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes QuAD, a framework that retrieves near-duplicates of a query image online, estimates their quality, and fuses scores from AI-generated image detectors using quality-based weighting rather than uniform averaging. It introduces AncesTree (136k images in controlled stochastic degradation trees) and ReWIND (~10k real-world viral near-duplicates) to evaluate robustness under realistic reposting degradations. Experiments on multiple SOTA detectors report a consistent ~8% gain in balanced accuracy over plain averaging, with public code and data released.

Significance. If the quality estimates prove to be a reliable proxy for per-instance detector trustworthiness under degradation, the work would meaningfully advance practical deployment of forensic detectors by exploiting web-scale duplicates. The controlled AncesTree dataset and real-world ReWIND collection are useful contributions for the community, and the public release of code/data supports reproducibility. The empirical gains, however, rest on unverified assumptions about quality-accuracy correlation.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the reported average 8% balanced-accuracy gain over plain averaging lacks error bars, statistical significance tests, or ablations that isolate the contribution of quality weighting from retrieval success rate or ensemble size; this makes it impossible to attribute the lift specifically to the proposed calibration.
  2. [Method] Method section (quality estimation and fusion): no correlation analysis or ablation is presented that links the estimated quality scores to actual per-instance detection error rates on AncesTree's controlled degradation trees; without this, the central assumption that quality serves as a valid proxy for detector reliability remains unverified.
  3. [§4] §4 (ReWIND dataset construction): the manuscript provides no quantitative controls or failure-mode analysis for near-duplicate retrieval at scale (e.g., false-positive retrievals or missed duplicates), which directly affects whether the observed gains generalize beyond the collected set.
minor comments (2)
  1. [Figures] Figure captions and axis labels in the results plots could more explicitly state the exact detectors and quality estimator used for each curve.
  2. [Method] The notation for the quality-weighted aggregation formula should be introduced with a clear equation number and variable definitions in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the positive assessment of QuAD, the AncesTree and ReWIND datasets, and the public release of code and data. We address each major comment below and will incorporate revisions to strengthen the empirical validation and presentation of results.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the reported average 8% balanced-accuracy gain over plain averaging lacks error bars, statistical significance tests, or ablations that isolate the contribution of quality weighting from retrieval success rate or ensemble size; this makes it impossible to attribute the lift specifically to the proposed calibration.

    Authors: We agree that the results section would benefit from greater statistical rigor and targeted ablations. In the revised manuscript we will report error bars (standard deviation across repeated runs or cross-validation folds) for all balanced-accuracy figures, include paired statistical significance tests (e.g., t-tests) comparing QuAD against plain averaging, and add ablations that systematically vary ensemble size and retrieval success rate while holding quality weighting fixed. These additions will allow readers to isolate the contribution of the quality-aware fusion more clearly. revision: yes

  2. Referee: [Method] Method section (quality estimation and fusion): no correlation analysis or ablation is presented that links the estimated quality scores to actual per-instance detection error rates on AncesTree's controlled degradation trees; without this, the central assumption that quality serves as a valid proxy for detector reliability remains unverified.

    Authors: We acknowledge that a direct quantitative link between the estimated quality scores and per-instance detector accuracy on the controlled AncesTree trees would provide stronger support for the core modeling assumption. Although the consistent gains observed across degradation levels already suggest the utility of quality weighting, we will add an explicit correlation analysis in the revised version: Pearson and Spearman coefficients between quality estimates and detection accuracy (or error) computed across the stochastic degradation trees, together with scatter plots stratified by degradation depth. This analysis will be placed in the Method or Experiments section. revision: yes

  3. Referee: [§4] §4 (ReWIND dataset construction): the manuscript provides no quantitative controls or failure-mode analysis for near-duplicate retrieval at scale (e.g., false-positive retrievals or missed duplicates), which directly affects whether the observed gains generalize beyond the collected set.

    Authors: We agree that quantitative characterization of the retrieval pipeline is necessary to assess potential biases in ReWIND. In the revised §4 we will report results from manual verification on a sampled subset of retrieved near-duplicates (precision estimate), discuss observed failure modes such as false-positive retrievals and missed duplicates, and analyze how these factors could influence the reported performance gains. This will help readers evaluate the generalizability of the real-world experiments. revision: yes
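The correlation analysis promised in response 2 can be sketched with synthetic stand-in data. Everything below is an assumption for illustration: the real analysis would use AncesTree instances, actual no-reference quality estimates, and real detector decisions, and the linear quality-to-correctness model is hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)

# Hypothetical stand-ins: per-instance quality estimates and detector
# correctness on a degradation tree, where the probability of a correct
# detection rises linearly with quality (assumed model, not measured).
quality = rng.uniform(0.0, 1.0, size=500)
correct = (rng.random(500) < 0.55 + 0.40 * quality).astype(float)

r_p, p_p = pearsonr(quality, correct)   # Pearson (point-biserial here)
r_s, p_s = spearmanr(quality, correct)  # Spearman rank correlation
```

A clearly positive, significant correlation would support quality as a proxy for detector reliability; a near-zero one would undercut the central modeling assumption of the fusion scheme.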

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out datasets

full rationale

The paper proposes the QuAD framework for quality-aware aggregation of detector scores from retrieved near-duplicates and validates it through direct experiments on two independently constructed datasets (AncesTree with controlled degradation trees and ReWIND with real-world viral content). The reported average 8% balanced-accuracy improvement is an observed empirical quantity on held-out test sets rather than a derived prediction, fitted parameter, or quantity obtained by reducing any equation to its own inputs. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the method description or evaluation chain. The derivation is therefore self-contained as a practical engineering approach whose effectiveness is assessed externally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on two unproven domain assumptions: that quality can be estimated from observable degradation cues and that the retrieved near-duplicates are sufficiently representative of the image's dissemination history. No free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption Image quality after repeated online operations can be estimated reliably enough to weight detector scores
    The method explicitly uses estimated quality as the weighting factor; if this estimation is noisy or biased the fusion gain disappears.
  • domain assumption Near-duplicates of a given image can be retrieved at scale from the open web
    The framework presupposes successful retrieval; retrieval failures would leave the system with only the original query image.

pith-pipeline@v0.9.0 · 5601 in / 1414 out tokens · 46039 ms · 2026-05-10T10:58:44.422302+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1] Quentin Bammey. Synthbuster: Towards detection of diffusion model generated images. IEEE Open Journal of Signal Processing, 2023.

  2. [2] Lorenzo Baraldi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, and Rita Cucchiara. Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities. In ECCV, 2024.

  3. [3] Clark Barrett, Brad Boyd, Elie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, et al. Identifying and Mitigating the Security Risks of Generative AI. Now Foundations and Trends, 2024.

  4. [4] Kalina Bontcheva, Symeon Papadopoulos, Filareti Tsalakanidou, Riccardo Gallotti, Lidia Dutkiewicz, Noémie Krack, Denis Teyssou, Francesco Severio Nucci, Jochen Spangenberg, Ivan Srba, et al. Generative AI and disinformation: recent advances, challenges, and opportunities.

  5. [5] Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. DRCT: Diffusion Reconstruction Contrastive Training towards Universal Detection of Diffusion Generated Images. In ICML, 2024.

  6. [6] Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, and Vikash Sehwag. CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI. In CVPR, pages 13455–13465, 2025.

  7. [7] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. In ICASSP, pages 1–5, 2023.

  8. [8] Duc-Tien Dang-Nguyen, Cecilia Pasquini, Valentina Conotter, and Giulia Boato. RAISE: a raw images dataset for digital image forensics. In ACM Multimedia Systems Conference, pages 219–224. Association for Computing Machinery.

  9. [9] Stefano Dell'Anna, Andrea Montibeller, and Giulia Boato. TrueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks. In International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2025.

  10. [10] Nicholas Dufour, Arkanath Pathak, Pouya Samangouei, Nikki Hariri, Shashi Deshetti, Andrew Dudfield, Christopher Guess, Pablo Hernández Escayola, Bobby Tran, Mevan Babakar, et al. AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild. arXiv preprint arXiv:2405.11697, 2024.

  11. [11] S. Alireza Golestaneh, Saba Dadsetan, and Kris M. Kitani. No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency. In WACV, pages 1220–1230, 2022.

  12. [12] Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A Bias-Free Training Paradigm for More General AI-generated Image Detection. In CVPR, pages 18685–18694, 2025.

  13. [13] OuCheng Huang, Manxi Lin, Jiexiang Tan, Xiaoxiong Du, Yang Qiu, Junjun Zheng, Xiangheng Kong, Yuning Jiang, and Bo Zheng. MIRAGE: Towards AI-Generated Image Detection in the Wild. AAAI, 40(7):5076–5084, 2026.

  14. [14] Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, and Guangliang Cheng. SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model. In CVPR, 2025.

  15. [15] Jiajun Jiang, Wen-Chao Yang, Chung-Hao Chen, and Timothy Young. A New Deepfake Detection Method with No-Reference Image Quality Assessment to Resist Image Degradation. Eng, 6(10), 2025.

  16. [16] Dimitrios Karageogiou, Quentin Bammey, Valentin Porcellini, Bertrand Goupil, Denis Teyssou, and Symeon Papadopoulos. Evolution of Detection Performance throughout the Online Lifespan of Synthetic Images. In Eur. Conf. Comput. Vis. Worksh., 2024.

  17. [17] Hyunjoon Kim, Jaehee Lee, Leo Hyun Park, and Taekyoung Kwon. On the correlation between deepfake detection performance and image quality metrics. In 3rd ACM Workshop on the Security Implications of Deepfakes and Cheapfakes (WDC), 2026.

  18. [18] Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, and Yao Zhu. Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios. In ICCV, 2025.

  19. [19] Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, and Shu Hu. Detecting multimedia generated by large AI models: A survey. arXiv preprint arXiv:2204.06125, 2024.

  20. [20] Gonzalo J. Aniano Porcile, Jack Gindi, Shivansh Mundra, James R. Verbus, and Hany Farid. Finding AI-Generated Faces in the Wild. In CVPR Workshops, 2023.

  21. [21] Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Fabian Sinz, Olivier Bousquet, and Bernhard Schölkopf. Evaluating Predictive Uncertainty Challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 1–27. Springer Berlin Heidelberg, 2006.

  22. [22] Jonas Ricker, Dennis Assenmacher, Thorsten Holz, Asja Fischer, and Erwin Quiring. AI-Generated Faces in the Real World: A Large-Scale Case Study of Twitter Profile Images. In International Symposium on Research in Attacks, Intrusions and Defenses, 2024.

  23. [23] Nyeong-Ho Shin, Seon-Ho Lee, and Chang-Su Kim. Blind Image Quality Assessment Based on Geometric Order Learning. In CVPR, pages 12799–12808, 2024.

  24. [24] Wentang Song, Zhiyuan Yan, Yuzhen Lin, Taiping Yao, Changsheng Chen, Shen Chen, Yandan Zhao, Shouhong Ding, and Bin Li. Not All Fakes are Equal: A Quality-Centric Framework for Deepfake Detection. arXiv preprint arXiv:2411.05335v3, 2024.

  25. [25] Diangarti Tariang, Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. Synthetic Image Verification in the Era of Generative AI: What Works and What Isn't There Yet. IEEE Security & Privacy, 22:37–49, 2024.

  26. [26] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-generated images are surprisingly easy to spot... for now. In CVPR, pages 8695–8704.

  27. [27] Haiwei Wu, Jiantao Zhou, Jinyu Tian, and Jun Liu. Robust image forgery detection over online social network shared images. In CVPR, 2022.

  28. [28] Kangmin Xu, Liang Liao, Jing Xiao, Chaofeng Chen, Haoning Wu, Qiong Yan, and Weisi Lin. Boosting Image Quality Assessment through Efficient Transformer Adaptation with Local Feature Enhancement. In CVPR, pages 2662–2672.

  29. [29] Yongqi Yang, Zhihao Qian, Ye Zhu, Olga Russakovsky, and Yu Wu. D^3: Scaling Up Deepfake Detection by Learning from Discrepancy. In CVPR, pages 23850–23859, 2025.

  30. [30] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu, Lingjie Liu, Adam Kortylewski, Christian Theobalt, and Eric Xing. Multimodal Image Synthesis and Editing: The Generative AI Era. IEEE TPAMI, 45(12):15098–15119, 2021.