arxiv: 2511.16136 · v2 · submitted 2025-11-20 · 💻 cs.CV

How Noise Benefits AI-generated Image Detection

Ziqiang Li, Jiazhen Yan, Fan Wang, Kai Zeng, Zhangjie Fu

Pith reviewed 2026-05-17 20:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords AI-generated image detection · positive-incentive noise · CLIP feature space · shortcut suppression · out-of-distribution generalization · forensic cues · cross-attention fusion · variational training

The pith

Constructing positive-incentive noise in feature space helps CLIP suppress shortcuts and detect AI-generated images more reliably across many generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix the fact that detectors for AI-made images often fail when faced with pictures from generators they did not see during training. It links this failure to the networks latching onto easy but unreliable patterns instead of lasting visual traces left by generation. The proposed solution builds a special noise signal in the feature space using cross-attention between image and category information, then adds that noise while retraining the visual encoder so that shortcut directions weaken and stable forensic signals strengthen. If the approach works, detectors could handle a much wider range of current and future generators without frequent retraining or model-specific fixes. Readers would care because the method turns a known training weakness into a controllable way to improve robustness.

Core claim

We introduce Positive-Incentive Noise for CLIP (PiN-CLIP) that jointly optimizes a noise generator and a detection network under a variational positive-incentive principle. Noise is formed in feature space by cross-attention fusion of visual and categorical semantic features. When this noise is injected during fine-tuning of the visual encoder, shortcut-sensitive directions are suppressed while stable forensic cues are amplified, producing more robust and generalized artifact representations.

What carries the argument

Positive-incentive noise built via cross-attention fusion of visual and categorical semantic features and injected into the feature space to fine-tune the visual encoder.
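
The fusion step can be illustrated with a minimal sketch in plain Python. It assumes a single visual feature vector acts as the attention query and a small set of category embeddings (for example "real" and "fake" text features) act as keys and values. The function names, the identity projections, the toy vectors, and the `scale` parameter are hypothetical; the paper's generator is learned jointly with the detector and is not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_noise(visual, categories, scale=0.1):
    """Fuse one visual feature (query) with category features (keys/values).

    Assumption: learned projections are replaced by the identity to keep
    the sketch short. Returns a noise vector with the same length as `visual`.
    """
    dim = len(visual)
    # Scaled dot-product score between the query and each category key.
    scores = [
        sum(v * c for v, c in zip(visual, cat)) / math.sqrt(dim)
        for cat in categories
    ]
    weights = softmax(scores)
    # Attention output: weighted sum of the category values.
    fused = [
        sum(w * cat[i] for w, cat in zip(weights, categories))
        for i in range(dim)
    ]
    return [scale * f for f in fused]


def inject(visual, noise):
    """Additive injection of the noise into the feature vector."""
    return [v + n for v, n in zip(visual, noise)]


visual_feature = [0.5, -1.0, 0.25, 2.0]
category_features = [
    [1.0, 0.0, 0.0, 1.0],  # hypothetical "real" embedding
    [0.0, 1.0, 1.0, 0.0],  # hypothetical "fake" embedding
]
noise = cross_attention_noise(visual_feature, category_features)
noisy_feature = inject(visual_feature, noise)
```

In a real training loop the noisy feature, rather than the clean one, would be passed to the classification head while the visual encoder is fine-tuned.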

If this is right

The method reaches new state-of-the-art accuracy on an open-world dataset of images from 42 distinct generative models.
It delivers an average accuracy gain of 5.4 points compared with previous approaches.
The detector extracts more robust forensic features that generalize beyond the training distribution.
Reliance on spurious shortcuts learned during training is reduced without generator-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same noise-injection idea might help other vision models that suffer from shortcut learning in tasks like object recognition or medical imaging.
Extending the approach to newer generators released after the 42-model dataset could test whether the gained robustness scales with rapid model progress.
Replacing the CLIP backbone with other vision-language encoders would show whether the benefit depends on the particular architecture or on the noise principle itself.

Load-bearing premise

The specific noise will suppress only the unwanted shortcut directions without creating new failure modes or needing adjustments for each generator.

What would settle it

A drop or no gain in accuracy when the trained model is tested on synthetic images produced by generative models held out from the noise-construction and fine-tuning stages.
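
One way to organise such a check is a leave-generators-out evaluation. The sketch below assumes predictions are available as `(generator, true_label, predicted_label)` records; the generator names and the toy records are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def per_generator_accuracy(records):
    """Map each generator name to its accuracy.

    `records` is an iterable of (generator, true_label, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for generator, truth, pred in records:
        total[generator] += 1
        correct[generator] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def held_out_mean(accuracies, seen_generators):
    """Average accuracy over generators never used during training."""
    held_out = [acc for g, acc in accuracies.items() if g not in seen_generators]
    if not held_out:
        raise ValueError("no held-out generators in the evaluation set")
    return sum(held_out) / len(held_out)


# Toy records with hypothetical generator names; 1 = fake, 0 = real.
records = [
    ("gen_seen", 1, 1), ("gen_seen", 0, 0),
    ("gen_new_a", 1, 1), ("gen_new_a", 1, 0),
    ("gen_new_b", 1, 1), ("gen_new_b", 0, 0),
]
accuracies = per_generator_accuracy(records)
held_out_accuracy = held_out_mean(accuracies, seen_generators={"gen_seen"})
```

Computing `held_out_accuracy` for both a baseline detector and the noise-trained detector gives the comparison described above: no gain, or a drop, on the held-out generators would count against the claim.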

Figures

Figures reproduced from arXiv: 2511.16136 by Fan Wang, Jiazhen Yan, Kai Zeng, Zhangjie Fu, Ziqiang Li.

**Figure 1.** The Cross-entropy Loss during Training. When training the CLIP-LoRA network, the loss rapidly drops to 0.1 within only about 100 iterations, indicating early overfitting. Introducing tiny feature-space perturbations (CLIP-Random) mitigates this effect to some extent, maintaining a higher loss of around 0.18 over the same iterations. In contrast, our proposed PiN-CLIP slows the early descent and stabilize…

**Figure 2.** Architecture of PiN-CLIP for Generalizable AI-Generated Image Detection. It includes a conditional noise generator and a detection network, enabling task-adaptive, forgery-conditional noise injection guided by the optimization objective. This mechanism suppresses shortcut-sensitive directions while amplifying stable, task-relevant forensic evidence under beneficial stochastic transformations. Based on this…

**Figure 3.** T-SNE Visualization of Features Extracted Using CLIP, CLIP-random and Ours. Our method achieves strong real/fake discrimination, while adding random noise to the visual features also improves discriminative ability.

Original abstract

The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PiN-CLIP uses cross-attention fused noise under a variational principle to reduce shortcuts in CLIP-based AI image detectors, with reported gains on a 42-model dataset but thin evidence that the noise works the way claimed.

The main point is that this paper tries to fix poor generalization in AI-generated image detection by injecting a specially constructed noise into CLIP features to suppress training shortcuts. They train a noise generator jointly with the detector under what they call a variational positive-incentive principle. The noise is made by cross-attention between visual and categorical features, then added during fine-tuning to push the encoder away from spurious directions and toward more stable forensic signals. On their open-world test set with images from 42 generators they report a 5.4 point accuracy improvement over prior methods, which is the headline result.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes PiN-CLIP, which jointly optimizes a noise generator and detection network under a variational positive-incentive principle. Positive-incentive noise is constructed in CLIP feature space via cross-attention fusion of visual and categorical semantic features; this noise is injected to suppress shortcut-sensitive directions while amplifying stable forensic cues. Comparative experiments on an open-world dataset of synthetic images from 42 generative models report new state-of-the-art performance with a 5.4 average accuracy gain over prior methods.

Significance. If the reported gains are reproducible and the mechanism is isolated, the work would advance out-of-distribution generalization in AI-generated image detection, an area of high practical importance for forensic and misinformation applications. The variational noise-injection framework offers a controllable alternative to ad-hoc augmentation and could influence shortcut-mitigation techniques in other vision tasks.

major comments (2)

[Abstract] The central SOTA claim rests on a 5.4 average accuracy improvement over existing approaches, yet the abstract (and by extension the experimental section) provides no error bars, exact baseline implementations, data exclusion rules, or ablation details on the noise generator and cross-attention weights; this leaves the performance claim unverifiable from the reported summary.
[Method / Experiments] The positive-incentive noise is asserted to specifically suppress shortcut-sensitive directions in the CLIP encoder without introducing new failure modes or requiring per-generator tuning, but no directional analyses, saliency maps, or controlled ablations against non-semantic noise are described to rule out gains from generic perturbation or joint optimization alone.

minor comments (1)

[Abstract] The phrase 'variational positive-incentive principle' is used without a concise mathematical statement or pointer to its definition in the main text.
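
For orientation, positive-incentive noise is usually stated in information-theoretic terms. The block below sketches that general formulation together with a standard variational lower bound on mutual information. It is not the paper's exact objective, which this summary does not reproduce. The symbols are assumptions made here: $\mathcal{T}$ denotes the task variable, $\mathcal{E}$ the noise, and $q$ any variational approximation to the true posterior.

```latex
% Noise is positive-incentive for a task when conditioning on it lowers task entropy.
\[
  I(\mathcal{T}; \mathcal{E})
  \;=\; H(\mathcal{T}) - H(\mathcal{T} \mid \mathcal{E}) \;>\; 0 .
\]

% Standard variational lower bound, valid for any distribution q(t | eps).
\[
  I(\mathcal{T}; \mathcal{E})
  \;\ge\; H(\mathcal{T})
  + \mathbb{E}_{p(t,\varepsilon)}\!\left[ \log q(t \mid \varepsilon) \right] .
\]
```

Because $H(\mathcal{T})$ does not depend on the noise, maximizing the expectation term over a noise generator and a predictor $q$ is one plausible reading of "jointly trains a noise generator and a detection network"; whether the paper uses exactly this bound is not stated in the material above.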

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of verifiability and mechanistic validation that we address below. We have prepared revisions to strengthen the manuscript accordingly.

Referee: [Abstract] The central SOTA claim rests on a 5.4 average accuracy improvement over existing approaches, yet the abstract (and by extension the experimental section) provides no error bars, exact baseline implementations, data exclusion rules, or ablation details on the noise generator and cross-attention weights; this leaves the performance claim unverifiable from the reported summary.

Authors: We agree that greater transparency in the abstract would improve verifiability. In the revised version we will expand the abstract to note that all reported accuracies are means over five independent runs with standard deviations provided in the experimental tables, that baselines follow the original authors' public implementations with our re-implementations using the same training protocol, and that the open-world dataset excludes only images with obvious artifacts as described in Section 4.1. We will also add a concise reference to the ablation results on the noise generator and cross-attention fusion weights. These additions preserve the 5.4-point gain while making the claim directly verifiable from the summary. revision: yes
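
Reporting a mean with a standard deviation over repeated runs, as promised in this response, is straightforward to compute. The sketch below uses toy accuracies from five hypothetical seeds; the numbers are placeholders and not results from the paper.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over independent runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Toy accuracies from five hypothetical seeds, not results from the paper.
runs = [90.0, 91.0, 92.0, 93.0, 94.0]
mean_acc, std_acc = summarize_runs(runs)
report = f"{mean_acc:.1f} ± {std_acc:.1f}"
```

`statistics.stdev` is the sample (n - 1) estimator, which is the usual choice when only a handful of runs are available.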

Referee: [Method / Experiments] The positive-incentive noise is asserted to specifically suppress shortcut-sensitive directions in the CLIP encoder without introducing new failure modes or requiring per-generator tuning, but no directional analyses, saliency maps, or controlled ablations against non-semantic noise are described to rule out gains from generic perturbation or joint optimization alone.

Authors: We recognize that explicit isolation of the mechanism would strengthen the contribution. While the current experiments demonstrate consistent gains across 42 generators without per-model tuning, we will incorporate the requested analyses in the revision: directional comparisons of the top principal components and gradient directions in CLIP feature space before versus after noise injection; saliency visualizations that highlight reduced activation on known shortcut regions; and a controlled ablation replacing semantic positive-incentive noise with isotropic Gaussian noise of matched magnitude under otherwise identical joint optimization. These additions will directly address whether the observed improvements arise from targeted suppression rather than generic regularization. revision: yes
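
The matched-magnitude control mentioned in this response could be built roughly as follows. The sketch assumes "matched magnitude" means equal L2 norm per feature vector, which is one reasonable reading and is not confirmed by the source; `semantic_noise` is a hypothetical stand-in for the learned noise.

```python
import math
import random


def l2_norm(vector):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def matched_gaussian_noise(reference_noise, rng):
    """Draw isotropic Gaussian noise rescaled to the reference noise's L2 norm."""
    sample = [rng.gauss(0.0, 1.0) for _ in reference_noise]
    sample_norm = l2_norm(sample)
    if sample_norm == 0.0:
        return [0.0] * len(sample)  # degenerate draw; practically never happens
    target = l2_norm(reference_noise)
    return [x * target / sample_norm for x in sample]


rng = random.Random(0)  # fixed seed so the control is reproducible
semantic_noise = [0.03, -0.01, 0.02, 0.00]  # hypothetical learned noise
control_noise = matched_gaussian_noise(semantic_noise, rng)
```

Swapping `control_noise` for the learned noise while keeping every other training setting fixed isolates the direction of the perturbation from its size, which is the distinction the referee asks about.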

Circularity Check

0 steps flagged

No circularity: empirical SOTA claim rests on external open-world test set

The paper defines PiN-CLIP as a joint training procedure under a variational positive-incentive principle that constructs and injects cross-attention-fused noise to fine-tune the CLIP encoder. The load-bearing result is an accuracy improvement of 5.4 on a held-out open-world dataset spanning 42 distinct generators. This performance metric is measured on external data and does not reduce to any fitted parameter or self-defined quantity by construction. No equations equate the claimed suppression of shortcut directions to the training objective itself, and no self-citation chain is invoked to justify uniqueness or the core ansatz. The derivation therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that variational optimization can produce beneficial noise and on the empirical observation that feature perturbations mitigate shortcuts; several training parameters are fitted jointly.

free parameters (2)

noise generator parameters
Learned jointly with the detection network under the variational objective; specific values not reported in abstract.
cross-attention fusion weights
Control how visual and categorical features combine to form the positive-incentive noise.

axioms (1)

domain assumption: The variational positive-incentive principle can be applied to generate helpful rather than adversarial noise in feature space
Invoked to justify joint training of noise generator and detector.

invented entities (1)

Positive-incentive noise (no independent evidence)
purpose: Suppress shortcut-sensitive directions while amplifying stable forensic cues in the visual encoder
Constructed via cross-attention fusion and injected during fine-tuning; no independent falsifiable handle provided outside the training procedure.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

Paper passage: "we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle... suppressing shortcut-sensitive directions while amplifying stable forensic cues"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 12 internal anchors

[1]

Better fine- tuning by reducing representational collapse.arXiv preprint arXiv:2008.03156, 2020

Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Na- man Goyal, Luke Zettlemoyer, and Sonal Gupta. Better fine- tuning by reducing representational collapse.arXiv preprint arXiv:2008.03156, 2020. 3

work page arXiv 2008
[2]

Contrasting deepfakes diffusion via contrastive learning and global-local similarities

Lorenzo Baraldi, Federico Cocchi, Marcella Cornia, Alessan- dro Nicolosi, and Rita Cucchiara. Contrasting deepfakes diffusion via contrastive learning and global-local similarities. InEuropean Conference on Computer Vision, pages 199–216. Springer, 2025. 1

work page 2025
[3]

The mechanism of stochastic resonance.Journal of Physics A: mathematical and general, 14(11):L453, 1981

Roberto Benzi, Alfonso Sutera, and Angelo Vulpiani. The mechanism of stochastic resonance.Journal of Physics A: mathematical and general, 14(11):L453, 1981. 3

work page 1981
[4]

Large Scale GAN Training for High Fidelity Natural Image Synthesis

Andrew Brock. Large scale gan training for high fidelity natural image synthesis.arXiv preprint arXiv:1809.11096,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Cavia, E

Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. Real-time deepfake detection in the real-world.arXiv preprint arXiv:2406.09398, 2024. 6, 7

work page arXiv 2024
[6]

Fakeinversion: Learning to detect images from un- seen text-to-image models by inverting stable diffusion

George Cazenavette, Avneesh Sud, Thomas Leung, and Ben Usman. Fakeinversion: Learning to detect images from un- seen text-to-image models by inverting stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 10759–10769, 2024. 2

work page 2024
[7]

Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images

Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. InForty- first International Conference on Machine Learning, 2024. 2, 3

work page 2024
[8]

Simswap: An efficient framework for high fidelity face swapping

Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. Simswap: An efficient framework for high fidelity face swapping. InProceedings of the 28th ACM international conference on multimedia, pages 2003–2011, 2020. 6

work page 2003
[9]

Dual data alignment makes ai- generated image detector easier generalizable.arXiv preprint arXiv:2505.14359, 2025

Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, et al. Dual data alignment makes ai- generated image detector easier generalizable.arXiv preprint arXiv:2505.14359, 2025. 2

work page arXiv 2025
[10]

Manipulated face detector: Joint spatial and frequency domain attention network.arXiv preprint arXiv:2005.02958, 1(2):4, 2020

Zehao Chen and Hua Yang. Manipulated face detector: Joint spatial and frequency domain attention network.arXiv preprint arXiv:2005.02958, 1(2):4, 2020. 1, 2

work page arXiv 2005
[11]

Stargan: Unified generative adversarial networks for multi-domain image-to-image trans- lation

Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image trans- lation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797, 2018. 6

work page 2018
[12]

Fire: Robust detection of diffusion- generated images via frequency-guided reconstruction error

Beilin Chu, Xuan Xu, Xin Wang, Yufei Zhang, Weike You, and Linna Zhou. Fire: Robust detection of diffusion- generated images via frequency-guided reconstruction error. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 12830–12839, 2025. 2

work page 2025
[13]

Diffusion models in vision: A survey

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE transactions on pattern analysis and machine intelli- gence, 45(9):10850–10869, 2023. 1

work page 2023
[14]

Google DeepMind. Imagen3. https://deepmind. google/technologies/imagen-3. 2024. 6

work page 2024
[15]

Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021. 6

work page 2021
[16]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim En- tezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024. 6

work page 2024
[17]

Vector quan- tized diffusion model for text-to-image synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quan- tized diffusion model for text-to-image synthesis. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10696–10706, 2022. 6

work page 2022
[18]

A bias-free training paradigm for more general ai-generated image detection

Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18685–18694, 2025. 1, 3

work page 2025
[19]

The gan is dead; long live the gan! a modern gan baseline.Advances in Neural Information Processing Systems, 37:44177–44215, 2024

Nick Huang, Aaron Gokaslan, V olodymyr Kuleshov, and James Tompkin. The gan is dead; long live the gan! a modern gan baseline.Advances in Neural Information Processing Systems, 37:44177–44215, 2024. 6

work page 2024
[20]

Enhance vision-language alignment with noise

Sida Huang, Hongyuan Zhang, and Xuelong Li. Enhance vision-language alignment with noise. InProceedings of the AAAI Conference on Artificial Intelligence, pages 17449– 17457, 2025. 2, 3

work page 2025
[21]

Smart: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. Smart: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. InProceedings of the 58th annual meeting of the Association for Computational Linguistics, pages 2177–2190, 2020. 3

work page 2020
[22]

Progressive growing of GANs for improved quality, stabil- ity, and variation

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stabil- ity, and variation. InInternational Conference on Learning Representations, 2018. 6 9

work page 2018
[23]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019. 1, 6

work page 2019
[24]

Analyzing and improv- ing the image quality of stylegan

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improv- ing the image quality of stylegan. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 8110–8119, 2020. 6

work page 2020
[25]

Alias-free generative adversarial networks.Advances in neural informa- tion processing systems, 34:852–863, 2021

Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks.Advances in neural informa- tion processing systems, 34:852–863, 2021. 6

work page 2021
[26]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic opti- mization.arXiv preprint arXiv:1412.6980, 2014. 6

work page internal anchor Pith review Pith/arXiv arXiv 2014
[27]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 4

work page internal anchor Pith review Pith/arXiv arXiv 2013
[28]

Leveraging rep- resentations from intermediate encoder-blocks for synthetic image detection

Christos Koutlis and Symeon Papadopoulos. Leveraging rep- resentations from intermediate encoder-blocks for synthetic image detection. InEuropean Conference on Computer Vi- sion, pages 394–411. Springer, 2024. 1, 2

work page 2024
[29]

Flux.1-dev

Black Forest Labs. Flux.1-dev. https://huggingface. co/black-forest-labs/FLUX.1-dev. 2024. 6

work page 2024
[30]

Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR,

work page
[31]

Improving synthetic image detection towards generalization: An image transformation perspective

Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 2405– 2414, 2025. 2, 3, 6, 7

work page 2025
[32]

Positive-incentive noise.IEEE Transactions on Neural Networks and Learning Systems, 35(6):8708–8714,

Xuelong Li. Positive-incentive noise.IEEE Transactions on Neural Networks and Learning Systems, 35(6):8708–8714,

work page
[33]

Fakeclr: Exploring contrastive learning for solving latent discontinuity in data-efficient gans

Ziqiang Li, Chaoyue Wang, Heliang Zheng, Jing Zhang, and Bin Li. Fakeclr: Exploring contrastive learning for solving latent discontinuity in data-efficient gans. InEuropean Con- ference on Computer Vision, pages 598–615. Springer, 2022. 1

work page 2022
[34]

A systematic survey of regularization and normalization in gans.ACM Computing Surveys, 55(11):1–37, 2023

Ziqiang Li, Muhammad Usman, Rentuo Tao, Pengfei Xia, Chaoyue Wang, Huanhuan Chen, and Bin Li. A systematic survey of regularization and normalization in gans.ACM Computing Surveys, 55(11):1–37, 2023. 1

work page 2023
[35]

Photomaker: Customizing realistic human photos via stacked id embedding

Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming- Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8640–8650, 2024. 6

work page 2024
[36]

Is artificial intelligence gen- erated image detection a solved problem?arXiv preprint arXiv:2505.12335, 2025

Ziqiang Li, Jiazhen Yan, Ziwen He, Kai Zeng, Weiwei Jiang, Lizhi Xiong, and Zhangjie Fu. Is artificial intelligence gen- erated image detection a solved problem?arXiv preprint arXiv:2505.12335, 2025. 6, 7, 8

work page arXiv 2025
[37]

Transfer learning of real image features with soft contrastive loss for fake image detection

Ziyou Liang, Weifeng Liu, Run Wang, Mengjie Wu, Boheng Li, Yuyang Zhang, Lina Wang, and Xinyi Yang. Transfer learning of real image features with soft contrastive loss for fake image detection. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26281–26289, 2025. 1, 2

work page 2025
[38]

Forgery-aware adaptive trans- former for generalizable synthetic image detection

Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive trans- former for generalizable synthetic image detection. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10770–10780, 2024. 2

work page 2024
[39]

Fine-grained face swapping via regional gan inversion

Zhian Liu, Maomao Li, Yong Zhang, Cairong Wang, Qi Zhang, Jue Wang, and Yongwei Nie. Fine-grained face swapping via regional gan inversion. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 8578–8587, 2023. 6

work page 2023
[40]

General- izing face forgery detection with high-frequency features

Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. General- izing face forgery detection with high-frequency features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16326, 2021. 1, 2

work page 2021
[41]

Faceswap

Marek. Faceswap. https : / / github . com / MarekKowalski/FaceSwap. 2020. 6

work page 2020
[42]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 6

work page internal anchor Pith review Pith/arXiv arXiv 2021
[43]

Towards uni- versal fake image detectors that generalize across generative models

Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards uni- versal fake image detectors that generalize across generative models. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 24480–24489,

work page
[44]

Semantic image synthesis with spatially-adaptive nor- malization

Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive nor- malization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2337–2346,

work page
[45]

Pytorch: An imperative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019. 6

work page 2019
[46]

Multi-layer random perturbation training for improving model generalization efficiently

Lis Kanashiro Pereira, Yuki Taya, and Ichiro Kobayashi. Multi-layer random perturbation training for improving model generalization efficiently. InProceedings of the Fourth Black- boxNLP Workshop on Analyzing and Interpreting Neural Net- works for NLP, pages 303–310, 2021. 3

work page 2021
[47]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Thinking in frequency: Face forgery detection by mining frequency-aware clues

Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. InEuropean conference on computer vision, pages 86–103. Springer, 2020. 1, 2

work page 2020
[49]

Learning transferable visual models from natural language supervi- 10 sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- 10 sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2

work page 2021
[50]

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2): 3, 2022. 6
[51]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 6
[52]

Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019. 6
[53]

Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pages 1–10, 2022. 6
[54]

James P Sethna, Karin A Dahmen, and Christopher R Myers. Crackling noise. Nature, 410(6825):242–250, 2001. 3
[55]

Kaede Shiohara, Xingchao Yang, and Takafumi Taketomi. BlendFace: Re-designing identity encoders for face-swapping. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7634–7644, 2023. 6
[56]

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, and Yunchao Wei. Learning on gradients: Generalized artifacts representation for GAN-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12105–12114, 2023. 2
[57]

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5052–5060, 2024. 1, 2, 6
[58]

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28130–28139, 2024. 1, 2, 6, 7, 8
[59]

Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2P-CLIP: Injecting category common prompt in CLIP to enhance generalization in deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7184–7192, 2025. 1, 2
[60]

Renshuai Tao, Chuangchuang Tan, Huan Liu, Jiakai Wang, Haotong Qin, Yakun Chang, Wei Wang, Rongrong Ni, and Yao Zhao. Sagnet: Decoupling semantic-agnostic artifacts from limited training data for robust generalization in deepfake detection. IEEE Transactions on Information Forensics and Security, 2025. 1
[61]

Midjourney Team. Midjourney v6.1. https://www.midjourney.com/home, 2024. 6
[62]

OpenAI Team. DALL-E 3 AI image generator. https://dalle3.ai/, 2024. 6
[63]

Haofan Wang, Ashley Kleynhans, and Abe Estrada. Inswap. https://github.com/haofanwang/inswapper

[64]

Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. InstantID: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024. 6
[65]

Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020. 2, 3, 6, 7, 8
[66]

Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. DIRE for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22445–22455, 2023. 2
[67]

Jevin West and Carl Bergstrom. Which face is real? https://www.whichfaceisreal.com/, 2019. 6
[68]

Yi Wu, Ziqiang Li, Heliang Zheng, Chaoyue Wang, and Bin Li. Infinite-ID: Identity-preserved personalization via ID-semantics decoupling paradigm. In European Conference on Computer Vision, pages 279–296. Springer, 2024. 1, 6
[69]

Wukong. https://xihe.mindspore.cn/modelzoo/wukong, 2022. 6
[70]

Ziang Xie, Sida I Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y Ng. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573, 2017. 3
[71]

Jiazhen Yan, Ziqiang Li, Ziwen He, and Zhangjie Fu. Generalizable deepfake detection via effective local-global feature extraction. arXiv preprint arXiv:2501.15253, 2025. 1, 2, 6, 7
[72]

Jiazhen Yan, Fan Wang, Weiwei Jiang, Ziqiang Li, and Zhangjie Fu. NS-Net: Decoupling CLIP semantic information through null-space for generalizable AI-generated image detection. arXiv preprint arXiv:2508.01248, 2025. 1, 2
[73]

Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for AI-generated image detection. arXiv preprint arXiv:2406.19435, 2024. 6, 7, 8
[74]

Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable AI-generated image detection. arXiv preprint arXiv:2411.15633, 2024. 2, 6, 7, 8
[75]

Yongqi Yang, Zhihao Qian, Ye Zhu, Olga Russakovsky, and Yu Wu. D^3: Scaling up deepfake detection by learning from discrepancy. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23850–23859, 2025. 1, 3
[76]

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721,
[77]

Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024. 6
[78]

Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, and Baining Guo. StyleSwin: Transformer-based GAN for high-resolution image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11304–11314, 2022. 6
[79]

Hongyuan Zhang, Yanchen Xu, Sida Huang, and Xuelong Li. Data augmentation of contrastive learning is estimating positive-incentive noise. arXiv preprint arXiv:2408.09929,
[80]

Haifeng Zhang, Qinghui He, Xiuli Bi, Weisheng Li, Bo Liu, and Bin Xiao. Towards universal AI-generated image detection by variational information bottleneck network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23828–23837, 2025. 1, 2, 6, 7, 8
