pith. machine review for the scientific record.

arxiv: 2604.16879 · v2 · submitted 2026-04-18 · 💻 cs.CV

Recognition: unknown

Adaptive Forensic Feature Refinement via Intrinsic Importance Perception

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords synthetic image detection · visual foundation models · adaptive layer selection · forensic feature refinement · parameter subspace constraint · cross-distribution generalization · forgery cue perception

The pith

I2P selects the VFM layer most useful for forgery cues and restricts fine-tuning to low-sensitivity parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to resolve the conflict in synthetic image detection between making a visual foundation model more task-specific and keeping the broad generalization that came from its original pretraining. Existing approaches either use only the final layer or apply full fine-tuning, both of which limit performance on images from unknown generators. I2P instead first locates the single layer whose representations best separate real from synthetic content and then allows updates only inside the part of the parameter space that matters least to the pretrained structure. Readers should care because new generative models appear faster than detectors can be retrained, so any method that reuses existing visual priors more efficiently matters for practical deployment.

Core claim

VFM adaptation for synthetic image detection is best treated as a joint problem of identifying the critical representational layer that carries the most transferable forgery information and then confining task-driven parameter changes to a low-sensitivity subspace; I2P implements this by adaptive layer selection followed by subspace-constrained updates, thereby raising task specificity while keeping as much of the original pretrained structure intact as possible.

What carries the argument

Intrinsic importance perception: the mechanism that first ranks layers by how discriminative their representations are for the detection task and then locates the low-sensitivity parameter subspace in which updates can occur with minimal disturbance to pretrained knowledge.
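A minimal sketch of the first half of this mechanism, layer ranking via a learned gate over per-layer CLS tokens; `GatedLayerScorer` and its single linear gate are editorial assumptions consistent with the gated-attention description later on this page, not the paper's released code.

```python
import torch
import torch.nn as nn

class GatedLayerScorer(nn.Module):
    """Learns a relative contribution weight for each frozen-VFM layer's
    CLS token, so the most forgery-discriminative layer can be read off."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)  # one scalar gate per layer token

    def forward(self, cls_tokens: torch.Tensor):
        # cls_tokens: (batch, num_layers, dim), one CLS embedding per layer
        weights = torch.softmax(self.gate(cls_tokens).squeeze(-1), dim=-1)
        fused = (weights.unsqueeze(-1) * cls_tokens).sum(dim=1)  # (batch, dim)
        return fused, weights

# After training a detection head on `fused`, the critical layer would be
# read off as the layer with the highest mean gate weight:
#   critical = weights.mean(dim=0).argmax()
```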

If this is right

  • Detectors achieve higher accuracy on synthetic images from previously unseen sources.
  • The pretrained cross-modal structure remains available for generalization instead of being overwritten.
  • Multi-layer information is used more precisely than simple concatenation or final-layer-only strategies.
  • Task adaptation becomes possible with smaller risk of degrading open-set behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same selection-plus-subspace pattern could be tested on video or audio forensic tasks that also rely on foundation-model backbones.
  • If low-sensitivity subspaces prove stable across different foundation models, the method might reduce the need for full retraining when switching backbones.
  • A practical next check is whether the identified critical layer stays the same when the set of training generators is expanded.

Load-bearing premise

An optimal hierarchy of layer representations for transferable forgery cues exists and can be located adaptively without the subspace constraint itself harming open-set performance on unseen generators.

What would settle it

Measure detection accuracy on images produced by a generator model never encountered during training or layer selection; if I2P does not outperform both final-layer baselines and full fine-tuning on this held-out set, the joint-optimization claim fails.
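A sketch of that protocol, with placeholder names (`detector`, `unseen_loader`, the baseline models); it assumes a binary logit output where 1 means synthetic.

```python
import torch

@torch.no_grad()
def held_out_accuracy(detector, loader) -> float:
    """Accuracy on images from a generator unseen during both training
    and layer selection. Labels: 1 = synthetic, 0 = real."""
    correct = total = 0
    for images, labels in loader:
        preds = (detector(images).squeeze(-1).sigmoid() > 0.5).long()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Train every method on the same known generators, then compare on the
# held-out source. The joint-optimization claim fails unless
#   held_out_accuracy(i2p, unseen_loader) exceeds both
#   held_out_accuracy(final_layer_baseline, unseen_loader) and
#   held_out_accuracy(full_finetune, unseen_loader).
```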

Figures

Figures reproduced from arXiv: 2604.16879 by Bingde Hu, Jiazhen Yang, Jie Lei, Junjun Zheng, Kejia Chen, Xiangheng Kong, Yang Gao, Zunlei Feng.

Figure 1. t-SNE visualizations of layer-wise representations extracted on GenImage. view at source ↗
Figure 2. Layer-wise spectral structure analysis of a frozen CLIP. view at source ↗
Figure 4. Layer-wise evolution of CLS token attention maps. view at source ↗
Figure 5. Random initialization often necessitates extensive exploration and may converge to sharp, perturbation-sensitive local minima. Pretrained VFMs reside within a structured flat basin. Aggressive fine-tuning risks escaping this basin and inducing catastrophic forgetting, whereas controlled knowledge injection confines updates to low-sensitivity directions to preserve generalization capabilities. view at source ↗
Figure 6. I2P overall framework. Stage 1: Critical Layer Identification extracts layer-wise CLS token representations from a frozen VFM and learns their relative contributions via gated attention, identifying the most discriminative intermediate layer for SID and pruning deeper layers beyond it. Stage 2: Controlled Knowledge Injection selects a low-importance parameter subset using importance scores and updates … view at source ↗
Figure 8. VFM backbones comparison. Performance of the … view at source ↗
Figure 7. Impact of update rate η and the number of CLS tokens used under Setting 1 and Setting 2. view at source ↗
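Figure 6's second stage, as captioned, amounts to a masked update rule. A minimal sketch, assuming the importance score is the standard first-order estimate |θ · ∂L/∂θ| (this page does not reproduce the paper's exact metric) and that `update_rate` plays the role of η from Figure 7.

```python
import torch

def low_sensitivity_mask(model, loss, update_rate: float = 0.05) -> dict:
    """Boolean masks selecting the `update_rate` fraction of each tensor's
    parameters with the LOWEST first-order importance |theta * dL/dtheta|."""
    loss.backward()
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        importance = (p.detach() * p.grad).abs().flatten()
        k = max(1, int(update_rate * importance.numel()))
        threshold = importance.kthvalue(k).values  # k-th smallest score
        masks[name] = (importance <= threshold).view_as(p)
    return masks

# During fine-tuning, gradients outside the mask are zeroed before each step:
#   for name, p in model.named_parameters():
#       if name in masks:
#           p.grad.mul_(masks[name])
```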
Original abstract

With the rapid development of generative models and multimodal content editing technologies, the key challenge faced by synthetic image detection (SID) lies in cross-distribution generalization to unknown generation sources. In recent years, visual foundation models (VFM), which acquire rich visual priors through large-scale image-text alignment pretraining, have become a promising technical route for improving the generalization ability of SID. However, existing VFM-based methods remain relatively coarse-grained in their adaptation strategies. They typically either directly use the final-layer representations of VFM or simply fuse multi-layer features, lacking explicit modeling of the optimal representational hierarchy for transferable forgery cues. Meanwhile, although directly fine-tuning VFM can enhance task adaptation, it may also damage the cross-modal pretrained structure that supports open-set generalization. To address this task-specific tension, we reformulate VFM adaptation for SID as a joint optimization problem: it is necessary both to identify the critical representational layer that is more suitable for carrying forgery-discriminative information and to constrain the disturbance caused by task knowledge injection to the pretrained structure. Based on this, we propose I2P, an SID framework centered on intrinsic importance perception. I2P first adaptively identifies the critical layer representations that are most discriminative for SID, and then constrains task-driven parameter updates within a low-sensitivity parameter subspace, thereby improving task specificity while preserving the transferable structure of pretrained representations as much as possible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes I2P, an SID framework for adapting visual foundation models (VFMs) to synthetic image detection. It reformulates the adaptation task as a joint optimization problem: adaptively identifying critical layer representations most discriminative for forgery cues via intrinsic importance perception, while constraining task-driven parameter updates to a low-sensitivity subspace to preserve cross-modal pretrained structure and improve generalization to unknown generators.

Significance. If empirically validated, the approach could offer a more targeted alternative to existing coarse VFM adaptation strategies (final-layer use or simple multi-layer fusion) and direct fine-tuning, by explicitly addressing the tension between task specificity and retention of transferable features. This framing of SID adaptation as joint layer selection plus subspace constraint is conceptually coherent and relevant to open-set forensic detection challenges.

major comments (2)
  1. [Abstract] The headline claim that I2P 'adaptively identifies the critical layer representations that are most discriminative for SID' rests on the unverified existence of an optimal representational hierarchy for transferable forgery cues. No derivation, bound, or preliminary analysis is supplied showing that the intrinsic importance perception metric recovers such a hierarchy rather than fitting training-domain artifacts; this is load-bearing for the cross-distribution generalization argument.
  2. [Abstract] The second pillar—that constraining updates 'within a low-sensitivity parameter subspace' preserves the transferable pretrained structure 'as much as possible'—is asserted without any sensitivity analysis, ablation, or proof that the masking avoids damaging the very cross-modal features enabling open-set transfer. This directly undermines the claimed advantage over direct fine-tuning.
minor comments (1)
  1. [Abstract] The phrase 'intrinsic importance perception' is introduced without a concise mathematical definition or pointer to the precise importance metric used; this notation should be clarified early (one illustrative form is sketched below).
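For concreteness, the kind of definition the comment asks for could take a standard first-order form; the expression below is an editorial illustration, not the paper's metric, and τ is a hypothetical threshold.

```latex
% Illustrative first-order importance score and low-sensitivity subspace
% (editorial assumption; the abstract does not define the actual metric):
\[
  I(\theta_i) = \left|\, \theta_i \,\frac{\partial \mathcal{L}}{\partial \theta_i} \,\right|,
  \qquad
  \mathcal{S}_{\mathrm{low}} = \{\, i : I(\theta_i) \le \tau \,\}
\]
```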

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for acknowledging the conceptual relevance of framing SID adaptation as joint layer selection and subspace constraint. We address each major comment below and will revise the manuscript accordingly to strengthen the supporting analyses.

Point-by-point responses
  1. Referee: [Abstract] The headline claim that I2P 'adaptively identifies the critical layer representations that are most discriminative for SID' rests on the unverified existence of an optimal representational hierarchy for transferable forgery cues. No derivation, bound, or preliminary analysis is supplied showing that the intrinsic importance perception metric recovers such a hierarchy rather than fitting training-domain artifacts; this is load-bearing for the cross-distribution generalization argument.

    Authors: We agree that the abstract claim would benefit from explicit supporting analysis. While the manuscript presents empirical evidence of improved cross-generator performance, it does not include a dedicated preliminary study of the importance metric. In the revised version we will add a new subsection (or appendix) that reports layer-wise importance scores on held-out generators, quantifies correlation with forgery discriminability, and provides ablations comparing the selected layers against random, final-layer, and multi-layer baselines. This will help verify that the metric captures transferable cues rather than training-domain artifacts. revision: yes

  2. Referee: [Abstract] The second pillar—that constraining updates 'within a low-sensitivity parameter subspace' preserves the transferable pretrained structure 'as much as possible'—is asserted without any sensitivity analysis, ablation, or proof that the masking avoids damaging the very cross-modal features enabling open-set transfer. This directly undermines the claimed advantage over direct fine-tuning.

    Authors: We concur that the abstract would be strengthened by direct evidence on this point. The current manuscript contains comparative results against direct fine-tuning, but lacks a targeted sensitivity analysis of the masked subspace. We will add an ablation in the revision that measures the impact of subspace-constrained updates on pretrained feature stability (e.g., via cosine similarity or transfer performance on auxiliary tasks) relative to full fine-tuning. This will provide concrete support for the preservation claim. revision: yes
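A minimal sketch of the promised stability probe, assuming both models expose CLS-level features on the same images; `frozen_vfm` and `tuned_vfm` are placeholder names.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def feature_stability(frozen_vfm, tuned_vfm, loader) -> float:
    """Mean cosine similarity between pretrained and adapted CLS features
    over held-out images; values near 1.0 mean the structure survived."""
    sims = []
    for images, _ in loader:
        f0 = frozen_vfm(images)  # (batch, dim) pretrained features
        f1 = tuned_vfm(images)   # same images through the adapted model
        sims.append(F.cosine_similarity(f0, f1, dim=-1).mean())
    return torch.stack(sims).mean().item()

# If the preservation claim holds, the subspace-constrained model should
# score close to 1.0 here while full fine-tuning scores visibly lower.
```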

Circularity Check

0 steps flagged

No significant circularity; proposal is a heuristic reformulation without self-referential reduction

Full rationale

The paper's abstract and description contain no equations, derivations, or first-principles results that reduce to inputs by construction. It reformulates SID adaptation as identifying critical layers via intrinsic importance perception and constraining updates to a low-sensitivity subspace, but presents this as a proposed framework (I2P) rather than a derived prediction. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are evident. The central claim rests on the existence of an identifiable optimal hierarchy, which is an assumption rather than a tautological output. This matches the default expectation of non-circularity for methodological proposals without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the framework description implies unstated assumptions about layer importance and parameter sensitivity that are not quantified here.

pith-pipeline@v0.9.0 · 5561 in / 1020 out tokens · 43274 ms · 2026-05-10T07:54:40.836839+00:00 · methodology

