pith. machine review for the scientific record.

arxiv: 2511.16136 · v2 · submitted 2025-11-20 · 💻 cs.CV

How Noise Benefits AI-generated Image Detection

Pith reviewed 2026-05-17 20:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated image detection · positive-incentive noise · CLIP feature space · shortcut suppression · out-of-distribution generalization · forensic cues · cross-attention fusion · variational training

The pith

Constructing positive-incentive noise in feature space helps CLIP suppress shortcuts and detect AI-generated images more reliably across many generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix the fact that detectors for AI-made images often fail when faced with pictures from generators they did not see during training. It links this failure to the networks latching onto easy but unreliable patterns instead of lasting visual traces left by generation. The proposed solution builds a special noise signal in the feature space using cross-attention between image and category information, then adds that noise while retraining the visual encoder so that shortcut directions weaken and stable forensic signals strengthen. If the approach works, detectors could handle a much wider range of current and future generators without frequent retraining or model-specific fixes. Readers would care because the method turns a known training weakness into a controllable way to improve robustness.

Core claim

We introduce Positive-Incentive Noise for CLIP (PiN-CLIP) that jointly optimizes a noise generator and a detection network under a variational positive-incentive principle. Noise is formed in feature space by cross-attention fusion of visual and categorical semantic features. When this noise is injected during fine-tuning of the visual encoder, shortcut-sensitive directions are suppressed while stable forensic cues are amplified, producing more robust and generalized artifact representations.
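
The fusion step in the core claim can be pictured with a minimal sketch. This is not the authors' implementation: the single attention head without learned projections, the two categorical embeddings, the fixed `scale`, the Gaussian sampling step, and every function name are assumptions chosen for illustration. The visual feature acts as the query, the categorical embeddings act as keys and values, and the attended vector becomes the mean of the noise that is added back to the feature.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_noise(visual, class_embeds, scale=0.1, rng=None):
    """Hypothetical noise generator (illustrative assumption, not the paper's code).

    visual       -- one visual feature vector (the attention query)
    class_embeds -- categorical embeddings, e.g. for "real" and "fake" (keys/values)
    scale        -- fixed noise magnitude; a trained generator would learn this
    """
    rng = rng or random.Random(0)
    dim = len(visual)
    # Scaled dot-product attention weights over the categorical embeddings.
    weights = softmax([dot(visual, c) / math.sqrt(dim) for c in class_embeds])
    # Attention output: weighted sum of the value vectors.
    fused = [sum(w * c[i] for w, c in zip(weights, class_embeds)) for i in range(dim)]
    # Sample around the fused vector: mean plus standard Gaussian, then scale.
    noise = [scale * (m + rng.gauss(0.0, 1.0)) for m in fused]
    return noise, weights


def inject(visual, noise):
    """Add the noise to the feature vector (feature-space injection)."""
    return [v + n for v, n in zip(visual, noise)]


# Toy inputs, invented for the example.
feature = [0.5, -1.0, 0.25, 2.0]
categories = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
noise, weights = cross_attention_noise(feature, categories, rng=random.Random(0))
perturbed = inject(feature, noise)
```

In a trained system the query, key, and value projections and the noise scale would be learned jointly with the detector; the sketch fixes them so that only the data flow is visible.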

What carries the argument

Positive-incentive noise built via cross-attention fusion of visual and categorical semantic features and injected into the feature space to fine-tune the visual encoder.

If this is right

  • The method reaches new state-of-the-art accuracy on an open-world dataset of images from 42 distinct generative models.
  • It delivers an average accuracy gain of 5.4 points compared with previous approaches.
  • The detector extracts more robust forensic features that generalize beyond the training distribution.
  • Reliance on spurious shortcuts learned during training is reduced without generator-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same noise-injection idea might help other vision models that suffer from shortcut learning in tasks like object recognition or medical imaging.
  • Extending the approach to newer generators released after the 42-model dataset could test whether the gained robustness scales with rapid model progress.
  • Replacing the CLIP backbone with other vision-language encoders would show whether the benefit depends on the particular architecture or on the noise principle itself.

Load-bearing premise

The specific noise will suppress only the unwanted shortcut directions without creating new failure modes or needing adjustments for each generator.

What would settle it

A drop or no gain in accuracy when the trained model is tested on synthetic images produced by generative models held out from the noise-construction and fine-tuning stages.
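
The test above amounts to a leave-generators-out comparison, sketched here under stated assumptions. The generator names, the accuracy values, and all function names are hypothetical placeholders, not data or code from the paper. The sketch splits generators into seen and held-out groups and reports the gap between a candidate detector and a baseline on the held-out group only.

```python
def split_generators(generators, held_out):
    """Partition generator names into those seen in training and those held out."""
    held = set(held_out)
    unknown = held - set(generators)
    if unknown:
        raise ValueError(f"unknown generators: {sorted(unknown)}")
    seen = [g for g in generators if g not in held]
    unseen = [g for g in generators if g in held]
    return seen, unseen


def mean_accuracy(per_generator, names):
    """Unweighted mean of per-generator accuracies (fractions in [0, 1])."""
    if not names:
        raise ValueError("need at least one generator")
    return sum(per_generator[n] for n in names) / len(names)


def held_out_gap(baseline, candidate, unseen):
    """Candidate minus baseline mean accuracy on held-out generators, in points."""
    return 100.0 * (mean_accuracy(candidate, unseen) - mean_accuracy(baseline, unseen))


# Placeholder names and accuracies, invented for the example.
generators = ["gen_a", "gen_b", "gen_c"]
seen, unseen = split_generators(generators, held_out=["gen_b", "gen_c"])
baseline = {"gen_a": 0.9, "gen_b": 0.6, "gen_c": 0.7}
candidate = {"gen_a": 0.9, "gen_b": 0.7, "gen_c": 0.8}
gap = held_out_gap(baseline, candidate, unseen)
```

A gap at or below zero on the held-out group would be the outcome that counts against the claim.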

Figures

Figures reproduced from arXiv: 2511.16136 by Fan Wang, Jiazhen Yan, Kai Zeng, Zhangjie Fu, Ziqiang Li.

Figure 1: The Cross-entropy Loss during Training. When training the CLIP-LoRA network, the loss rapidly drops to 0.1 within only about 100 iterations, indicating early overfitting. Introducing tiny feature-space perturbations (CLIP-Random) mitigates this effect to some extent, maintaining a higher loss of around 0.18 over the same iterations. In contrast, our proposed PiN-CLIP slows the early descent and stabilize…
Figure 2: Architecture of PiN-CLIP for Generalizable AI-Generated Image Detection. It includes a conditional noise generator and a detection network, enabling task-adaptive, forgery-conditional noise injection guided by the optimization objective. This mechanism suppresses shortcut-sensitive directions while amplifying stable, task-relevant forensic evidence under beneficial stochastic transformations. Based on this…
Figure 3: T-SNE Visualization of Features Extracted Using CLIP, CLIP-random and Ours. Our method achieves strong real/fake discrimination, while adding random noise to the visual features also improves discriminative ability. …able performance degradation, further validating that our designed perturbation suppresses shortcut-sensitive directions while amplifying stable forensic cues under beneficial stochastic trans…

Original abstract

The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes PiN-CLIP, which jointly optimizes a noise generator and detection network under a variational positive-incentive principle. Positive-incentive noise is constructed in CLIP feature space via cross-attention fusion of visual and categorical semantic features; this noise is injected to suppress shortcut-sensitive directions while amplifying stable forensic cues. Comparative experiments on an open-world dataset of synthetic images from 42 generative models report new state-of-the-art performance with a 5.4 average accuracy gain over prior methods.

Significance. If the reported gains are reproducible and the mechanism is isolated, the work would advance out-of-distribution generalization in AI-generated image detection, an area of high practical importance for forensic and misinformation applications. The variational noise-injection framework offers a controllable alternative to ad-hoc augmentation and could influence shortcut-mitigation techniques in other vision tasks.

major comments (2)
  1. [Abstract] Abstract: the central SOTA claim rests on a 5.4 average accuracy improvement over existing approaches, yet the abstract (and by extension the experimental section) provides no error bars, exact baseline implementations, data exclusion rules, or ablation details on the noise generator and cross-attention weights; this leaves the performance claim unverifiable from the reported summary.
  2. [Method] Method / Experiments: the positive-incentive noise is asserted to specifically suppress shortcut-sensitive directions in the CLIP encoder without introducing new failure modes or requiring per-generator tuning, but no directional analyses, saliency maps, or controlled ablations against non-semantic noise are described to rule out gains from generic perturbation or joint optimization alone.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'variational positive-incentive principle' is used without a concise mathematical statement or pointer to its definition in the main text.
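
One conventional way to supply the error bars requested in major comment 1 is to repeat training with different seeds and report the mean with the sample standard deviation. The sketch below assumes that protocol; the accuracy values are placeholders invented for illustration, not results from the paper, and the function names are hypothetical.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated-run accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def format_result(accuracies, digits=1):
    """Render 'mean +/- std (n=runs)' for a results table."""
    mean, std = summarize_runs(accuracies)
    return f"{mean:.{digits}f} +/- {std:.{digits}f} (n={len(accuracies)})"


# Placeholder accuracies in percent, one per seed (invented for the example).
placeholder_runs = [90.0, 91.0, 92.0, 91.0, 91.0]
print(format_result(placeholder_runs))  # prints "91.0 +/- 0.7 (n=5)"
```

`statistics.stdev` uses the n-1 denominator, which is the usual choice when a handful of runs stands in for the distribution over seeds.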

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of verifiability and mechanistic validation that we address below. We have prepared revisions to strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central SOTA claim rests on a 5.4 average accuracy improvement over existing approaches, yet the abstract (and by extension the experimental section) provides no error bars, exact baseline implementations, data exclusion rules, or ablation details on the noise generator and cross-attention weights; this leaves the performance claim unverifiable from the reported summary.

    Authors: We agree that greater transparency in the abstract would improve verifiability. In the revised version we will expand the abstract to note that all reported accuracies are means over five independent runs with standard deviations provided in the experimental tables, that baselines follow the original authors' public implementations with our re-implementations using the same training protocol, and that the open-world dataset excludes only images with obvious artifacts as described in Section 4.1. We will also add a concise reference to the ablation results on the noise generator and cross-attention fusion weights. These additions preserve the 5.4-point gain while making the claim directly verifiable from the summary. revision: yes

  2. Referee: [Method] Method / Experiments: the positive-incentive noise is asserted to specifically suppress shortcut-sensitive directions in the CLIP encoder without introducing new failure modes or requiring per-generator tuning, but no directional analyses, saliency maps, or controlled ablations against non-semantic noise are described to rule out gains from generic perturbation or joint optimization alone.

    Authors: We recognize that explicit isolation of the mechanism would strengthen the contribution. While the current experiments demonstrate consistent gains across 42 generators without per-model tuning, we will incorporate the requested analyses in the revision: directional comparisons of the top principal components and gradient directions in CLIP feature space before versus after noise injection; saliency visualizations that highlight reduced activation on known shortcut regions; and a controlled ablation replacing semantic positive-incentive noise with isotropic Gaussian noise of matched magnitude under otherwise identical joint optimization. These additions will directly address whether the observed improvements arise from targeted suppression rather than generic regularization. revision: yes
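
The control condition named in response 2, isotropic Gaussian noise of matched magnitude, can be sketched as follows. The assumptions here are that "magnitude" means the L2 norm of each noise vector and that matching is done per sample; the vector values and function names are hypothetical.

```python
import math
import random


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def matched_isotropic_noise(reference, rng):
    """Sample isotropic Gaussian noise, then rescale it to the L2 norm of `reference`."""
    draw = [rng.gauss(0.0, 1.0) for _ in reference]
    draw_norm = l2_norm(draw)
    if draw_norm == 0.0:
        raise ValueError("degenerate draw with zero norm")
    factor = l2_norm(reference) / draw_norm
    return [x * factor for x in draw]


def cosine(a, b):
    """Cosine similarity; shows how far the control direction is from the reference."""
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))


# Placeholder "learned" noise vector, invented for the example.
learned_noise = [0.3, -0.1, 0.2, 0.05]
control_noise = matched_isotropic_noise(learned_noise, random.Random(0))
alignment = cosine(learned_noise, control_noise)
```

Because the control shares the norm but not the direction of the learned noise, any accuracy difference between the two conditions can be attributed to direction rather than to perturbation strength.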

Circularity Check

0 steps flagged

No circularity: empirical SOTA claim rests on external open-world test set

full rationale

The paper defines PiN-CLIP as a joint training procedure under a variational positive-incentive principle that constructs and injects cross-attention-fused noise to fine-tune the CLIP encoder. The load-bearing result is an accuracy improvement of 5.4 on a held-out open-world dataset spanning 42 distinct generators. This performance metric is measured on external data and does not reduce to any fitted parameter or self-defined quantity by construction. No equations equate the claimed suppression of shortcut directions to the training objective itself, and no self-citation chain is invoked to justify uniqueness or the core ansatz. The derivation therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that variational optimization can produce beneficial noise and on the empirical observation that feature perturbations mitigate shortcuts; several training parameters are fitted jointly.

free parameters (2)
  • noise generator parameters
    Learned jointly with the detection network under the variational objective; specific values not reported in abstract.
  • cross-attention fusion weights
    Control how visual and categorical features combine to form the positive-incentive noise.
axioms (1)
  • domain assumption: Variational positive-incentive principle can be applied to generate helpful rather than adversarial noise in feature space
    Invoked to justify joint training of noise generator and detector.
invented entities (1)
  • Positive-incentive noise (no independent evidence)
    purpose: Suppress shortcut-sensitive directions while amplifying stable forensic cues in the visual encoder
    Constructed via cross-attention fusion and injected during fine-tuning; no independent falsifiable handle provided outside the training procedure.

pith-pipeline@v0.9.0 · 5488 in / 1375 out tokens · 53185 ms · 2026-05-17T20:50:08.695484+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle... suppressing shortcut-sensitive directions while amplifying stable forensic cues

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 12 internal anchors

  1. [1]

    Better fine- tuning by reducing representational collapse.arXiv preprint arXiv:2008.03156, 2020

    Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Na- man Goyal, Luke Zettlemoyer, and Sonal Gupta. Better fine- tuning by reducing representational collapse.arXiv preprint arXiv:2008.03156, 2020. 3

  2. [2]

    Contrasting deepfakes diffusion via contrastive learning and global-local similarities

    Lorenzo Baraldi, Federico Cocchi, Marcella Cornia, Alessan- dro Nicolosi, and Rita Cucchiara. Contrasting deepfakes diffusion via contrastive learning and global-local similarities. InEuropean Conference on Computer Vision, pages 199–216. Springer, 2025. 1

  3. [3]

    The mechanism of stochastic resonance.Journal of Physics A: mathematical and general, 14(11):L453, 1981

    Roberto Benzi, Alfonso Sutera, and Angelo Vulpiani. The mechanism of stochastic resonance.Journal of Physics A: mathematical and general, 14(11):L453, 1981. 3

  4. [4]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock. Large scale gan training for high fidelity natural image synthesis.arXiv preprint arXiv:1809.11096,

  5. [5]

    Cavia, E

    Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. Real-time deepfake detection in the real-world.arXiv preprint arXiv:2406.09398, 2024. 6, 7

  6. [6]

    Fakeinversion: Learning to detect images from un- seen text-to-image models by inverting stable diffusion

    George Cazenavette, Avneesh Sud, Thomas Leung, and Ben Usman. Fakeinversion: Learning to detect images from un- seen text-to-image models by inverting stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 10759–10769, 2024. 2

  7. [7]

    Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images

    Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. InForty- first International Conference on Machine Learning, 2024. 2, 3

  8. [8]

    Simswap: An efficient framework for high fidelity face swapping

    Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. Simswap: An efficient framework for high fidelity face swapping. InProceedings of the 28th ACM international conference on multimedia, pages 2003–2011, 2020. 6

  9. [9]

    Dual data alignment makes ai- generated image detector easier generalizable.arXiv preprint arXiv:2505.14359, 2025

    Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, et al. Dual data alignment makes ai- generated image detector easier generalizable.arXiv preprint arXiv:2505.14359, 2025. 2

  10. [10]

    Manipulated face detector: Joint spatial and frequency domain attention network.arXiv preprint arXiv:2005.02958, 1(2):4, 2020

    Zehao Chen and Hua Yang. Manipulated face detector: Joint spatial and frequency domain attention network.arXiv preprint arXiv:2005.02958, 1(2):4, 2020. 1, 2

  11. [11]

    Stargan: Unified generative adversarial networks for multi-domain image-to-image trans- lation

    Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image trans- lation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797, 2018. 6

  12. [12]

    Fire: Robust detection of diffusion- generated images via frequency-guided reconstruction error

    Beilin Chu, Xuan Xu, Xin Wang, Yufei Zhang, Weike You, and Linna Zhou. Fire: Robust detection of diffusion- generated images via frequency-guided reconstruction error. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 12830–12839, 2025. 2

  13. [13]

    Diffusion models in vision: A survey

    Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE transactions on pattern analysis and machine intelli- gence, 45(9):10850–10869, 2023. 1

  14. [14]

    Google DeepMind. Imagen3. https://deepmind. google/technologies/imagen-3. 2024. 6

  15. [15]

    Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021. 6

  16. [16]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim En- tezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024. 6

  17. [17]

    Vector quan- tized diffusion model for text-to-image synthesis

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quan- tized diffusion model for text-to-image synthesis. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10696–10706, 2022. 6

  18. [18]

    A bias-free training paradigm for more general ai-generated image detection

    Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18685–18694, 2025. 1, 3

  19. [19]

    The gan is dead; long live the gan! a modern gan baseline.Advances in Neural Information Processing Systems, 37:44177–44215, 2024

    Nick Huang, Aaron Gokaslan, V olodymyr Kuleshov, and James Tompkin. The gan is dead; long live the gan! a modern gan baseline.Advances in Neural Information Processing Systems, 37:44177–44215, 2024. 6

  20. [20]

    Enhance vision-language alignment with noise

    Sida Huang, Hongyuan Zhang, and Xuelong Li. Enhance vision-language alignment with noise. InProceedings of the AAAI Conference on Artificial Intelligence, pages 17449– 17457, 2025. 2, 3

  21. [21]

    Smart: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization

    Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. Smart: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. InProceedings of the 58th annual meeting of the Association for Computational Linguistics, pages 2177–2190, 2020. 3

  22. [22]

    Progressive growing of GANs for improved quality, stabil- ity, and variation

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stabil- ity, and variation. InInternational Conference on Learning Representations, 2018. 6 9

  23. [23]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019. 1, 6

  24. [24]

    Analyzing and improv- ing the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improv- ing the image quality of stylegan. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 8110–8119, 2020. 6

  25. [25]

    Alias-free generative adversarial networks.Advances in neural informa- tion processing systems, 34:852–863, 2021

    Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks.Advances in neural informa- tion processing systems, 34:852–863, 2021. 6

  26. [26]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic opti- mization.arXiv preprint arXiv:1412.6980, 2014. 6

  27. [27]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 4

  28. [28]

    Leveraging rep- resentations from intermediate encoder-blocks for synthetic image detection

    Christos Koutlis and Symeon Papadopoulos. Leveraging rep- resentations from intermediate encoder-blocks for synthetic image detection. InEuropean Conference on Computer Vi- sion, pages 394–411. Springer, 2024. 1, 2

  29. [29]

    Flux.1-dev

    Black Forest Labs. Flux.1-dev. https://huggingface. co/black-forest-labs/FLUX.1-dev. 2024. 6

  30. [30]

    Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR,

  31. [31]

    Improving synthetic image detection towards generalization: An image transformation perspective

    Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 2405– 2414, 2025. 2, 3, 6, 7

  32. [32]

    Positive-incentive noise.IEEE Transactions on Neural Networks and Learning Systems, 35(6):8708–8714,

    Xuelong Li. Positive-incentive noise.IEEE Transactions on Neural Networks and Learning Systems, 35(6):8708–8714,

  33. [33]

    Fakeclr: Exploring contrastive learning for solving latent discontinuity in data-efficient gans

    Ziqiang Li, Chaoyue Wang, Heliang Zheng, Jing Zhang, and Bin Li. Fakeclr: Exploring contrastive learning for solving latent discontinuity in data-efficient gans. InEuropean Con- ference on Computer Vision, pages 598–615. Springer, 2022. 1

  34. [34]

    A systematic survey of regularization and normalization in gans.ACM Computing Surveys, 55(11):1–37, 2023

    Ziqiang Li, Muhammad Usman, Rentuo Tao, Pengfei Xia, Chaoyue Wang, Huanhuan Chen, and Bin Li. A systematic survey of regularization and normalization in gans.ACM Computing Surveys, 55(11):1–37, 2023. 1

  35. [35]

    Photomaker: Customizing realistic human photos via stacked id embedding

    Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming- Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8640–8650, 2024. 6

  36. [36]

    Is artificial intelligence gen- erated image detection a solved problem?arXiv preprint arXiv:2505.12335, 2025

    Ziqiang Li, Jiazhen Yan, Ziwen He, Kai Zeng, Weiwei Jiang, Lizhi Xiong, and Zhangjie Fu. Is artificial intelligence gen- erated image detection a solved problem?arXiv preprint arXiv:2505.12335, 2025. 6, 7, 8

  37. [37]

    Transfer learning of real image features with soft contrastive loss for fake image detection

    Ziyou Liang, Weifeng Liu, Run Wang, Mengjie Wu, Boheng Li, Yuyang Zhang, Lina Wang, and Xinyi Yang. Transfer learning of real image features with soft contrastive loss for fake image detection. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26281–26289, 2025. 1, 2

  38. [38]

    Forgery-aware adaptive trans- former for generalizable synthetic image detection

    Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive trans- former for generalizable synthetic image detection. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10770–10780, 2024. 2

  39. [39]

    Fine-grained face swapping via regional gan inversion

    Zhian Liu, Maomao Li, Yong Zhang, Cairong Wang, Qi Zhang, Jue Wang, and Yongwei Nie. Fine-grained face swapping via regional gan inversion. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 8578–8587, 2023. 6

  40. [40]

    General- izing face forgery detection with high-frequency features

    Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. General- izing face forgery detection with high-frequency features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16326, 2021. 1, 2

  41. [41]

    Faceswap

    Marek. Faceswap. https : / / github . com / MarekKowalski/FaceSwap. 2020. 6

  42. [42]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 6

  43. [43]

    Towards uni- versal fake image detectors that generalize across generative models

    Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards uni- versal fake image detectors that generalize across generative models. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 24480–24489,

  44. [44]

    Semantic image synthesis with spatially-adaptive nor- malization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive nor- malization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2337–2346,

  45. [45]

    Pytorch: An imperative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019. 6

  46. [46]

    Multi-layer random perturbation training for improving model generalization efficiently

    Lis Kanashiro Pereira, Yuki Taya, and Ichiro Kobayashi. Multi-layer random perturbation training for improving model generalization efficiently. InProceedings of the Fourth Black- boxNLP Workshop on Analyzing and Interpreting Neural Net- works for NLP, pages 303–310, 2021. 3

  47. [47]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 6

  48. [48]

    Thinking in frequency: Face forgery detection by mining frequency-aware clues

    Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. InEuropean conference on computer vision, pages 86–103. Springer, 2020. 1, 2

  49. [49]

    Learning transferable visual models from natural language supervi- 10 sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- 10 sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2

  50. [50]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image genera- tion with clip latents.arXiv preprint arXiv:2204.06125, 1(2): 3, 2022. 6

  51. [51]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 6

  52. [52]

    Faceforen- sics++: Learning to detect manipulated facial images

    Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Chris- tian Riess, Justus Thies, and Matthias Nießner. Faceforen- sics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019. 6

  53. [53]

    Stylegan- xl: Scaling stylegan to large diverse datasets

    Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan- xl: Scaling stylegan to large diverse datasets. InACM SIG- GRAPH 2022 conference proceedings, pages 1–10, 2022. 6

  54. [54]

    Crackling noise

    James P Sethna, Karin A Dahmen, and Christopher R Myers. Crackling noise. Nature, 410(6825):242–250, 2001. 3

  55. [55]

    Blendface: Re-designing identity encoders for face-swapping

    Kaede Shiohara, Xingchao Yang, and Takafumi Taketomi. Blendface: Re-designing identity encoders for face-swapping. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7634–7644, 2023. 6

  56. [56]

    Learning on gradients: Generalized artifacts representation for gan-generated images detection

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, and Yunchao Wei. Learning on gradients: Generalized artifacts representation for gan-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12105–12114, 2023. 2

  57. [57]

    Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5052–5060, 2024. 1, 2, 6

  58. [58]

    Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28130–28139, 2024. 1, 2, 6, 7, 8

  59. [59]

    C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection

    Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7184–7192, 2025. 1, 2

  60. [60]

    Renshuai Tao, Chuangchuang Tan, Huan Liu, Jiakai Wang, Haotong Qin, Yakun Chang, Wei Wang, Rongrong Ni, and Yao Zhao. Sagnet: Decoupling semantic-agnostic artifacts from limited training data for robust generalization in deepfake detection. IEEE Transactions on Information Forensics and Security, 2025. 1

  61. [61]

    Midjourney v6.1

    Midjourney Team. Midjourney v6.1. https://www.midjourney.com/home, 2024. 6

  62. [62]

    Dall-e 3 ai image generator

    OpenAI Team. Dall-e 3 ai image generator. https://dalle3.ai/, 2024. 6

  63. [63]

    Haofan Wang, Ashley Kleynhans, and Abe Estrada. Inswap. https://github.com/haofanwang/inswapper

  64. [64]

    InstantID: Zero-shot Identity-Preserving Generation in Seconds

    Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024. 6

  65. [65]

    Cnn-generated images are surprisingly easy to spot

    Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020. 2, 3, 6, 7, 8

  66. [66]

    Dire for diffusion-generated image detection

    Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22445–22455, 2023. 2

  67. [67]

    Which face is real?

    Jevin West and Carl Bergstrom. Which face is real? https://www.whichfaceisreal.com/. 2019. 6

  68. [68]

    Infinite-id: Identity-preserved personalization via id-semantics decoupling paradigm

    Yi Wu, Ziqiang Li, Heliang Zheng, Chaoyue Wang, and Bin Li. Infinite-id: Identity-preserved personalization via id-semantics decoupling paradigm. In European Conference on Computer Vision, pages 279–296. Springer, 2024. 1, 6

  69. [69]

    Wukong

    Wukong. https://xihe.mindspore.cn/modelzoo/wukong. 2022. 6

  70. [70]

    Data Noising as Smoothing in Neural Network Language Models

    Ziang Xie, Sida I Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y Ng. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573, 2017. 3

  71. [71]

    Generalizable deepfake detection via effective local-global feature extraction

    Jiazhen Yan, Ziqiang Li, Ziwen He, and Zhangjie Fu. Generalizable deepfake detection via effective local-global feature extraction. arXiv preprint arXiv:2501.15253, 2025. 1, 2, 6, 7

  72. [72]

    NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection

    Jiazhen Yan, Fan Wang, Weiwei Jiang, Ziqiang Li, and Zhangjie Fu. Ns-net: Decoupling clip semantic information through null-space for generalizable ai-generated image detection. arXiv preprint arXiv:2508.01248, 2025. 1, 2

  73. [73]

    A sanity check for ai-generated image detection

    Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai-generated image detection. arXiv preprint arXiv:2406.19435, 2024. 6, 7, 8

  74. [74]

    Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

    Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. arXiv preprint arXiv:2411.15633, 2024. 2, 6, 7, 8

  75. [75]

    D³: Scaling up deepfake detection by learning from discrepancy

    Yongqi Yang, Zhihao Qian, Ye Zhu, Olga Russakovsky, and Yu Wu. D³: Scaling up deepfake detection by learning from discrepancy. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23850–23859, 2025. 1, 3

  76. [76]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721,

  77. [77]

    Low-rank few-shot adaptation of vision-language models

    Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024. 6

  78. [78]

    Styleswin: Transformer-based gan for high-resolution image generation

    Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, and Baining Guo. Styleswin: Transformer-based gan for high-resolution image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11304–11314, 2022. 6

  79. [79]

    Data Augmentation of Contrastive Learning is Estimating Positive-incentive Noise

    Hongyuan Zhang, Yanchen Xu, Sida Huang, and Xuelong Li. Data augmentation of contrastive learning is estimating positive-incentive noise. arXiv preprint arXiv:2408.09929,

  80. [80]

    Towards universal ai-generated image detection by variational information bottleneck network

    Haifeng Zhang, Qinghui He, Xiuli Bi, Weisheng Li, Bo Liu, and Bin Xiao. Towards universal ai-generated image detection by variational information bottleneck network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23828–23837, 2025. 1, 2, 6, 7, 8

Showing first 80 references.