pith. sign in

arxiv: 2605.30062 · v1 · pith:4WINOE5Onew · submitted 2026-05-28 · 💻 cs.CV

FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection

Pith reviewed 2026-06-29 07:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords synthetic image detectionchain of thoughtphysical lawslarge multimodal modelsforgery detectioncritical thinkingrobustness
0
0 comments X

The pith

FakeVLM-R1 uses critical chain-of-thought reasoning anchored in physical laws to detect synthetic images with causal understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to move beyond imitation learning in synthetic image detection by giving large multimodal models the ability to reason causally about physical laws. It introduces a training framework that combines supervised fine-tuning with group relative policy optimization and a critical thinking chain-of-thought process. This process requires the model to generate both a forgery hypothesis and a physical commonsense counter-proof during inference. The FakeClue++ dataset supplies the physical law annotations that serve as the anchor. If successful, this would produce detectors that are more interpretable, less biased against real images, and more robust to perturbations.

Core claim

FakeVLM-R1 internalizes physical laws by executing bidirectional dialectical reasoning in its critical thinking CoT, where it proposes forgery hypotheses while invoking physical commonsense to build authenticity counter-proofs, trained via GRPO on the FakeClue++ dataset that provides unified authenticity anchors from physical laws of authentic images.

What carries the argument

The Critical Thinking Chain-of-Thought mechanism, which enforces bidirectional dialectical reasoning by requiring both forgery hypothesis and physical authenticity counter-proof.

If this is right

  • It achieves state-of-the-art performance across multiple benchmarks for synthetic image detection.
  • The detection becomes high-precision and logically interpretable.
  • It resolves the over-rejection bias of existing methods against real images.
  • It demonstrates improved generalization and robustness against perturbations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the physical anchors work as intended, similar CoT structures could be applied to other tasks where causal physical reasoning is needed, such as video forgery detection.
  • The approach might help mitigate explanatory hallucinations in multimodal models more broadly by enforcing counter-proof reasoning.
  • Further tests on datasets without explicit physical annotations could reveal how much the training transfers the internalized laws.

Load-bearing premise

The annotations guided by physical laws in the dataset provide a genuine causal anchor that enables the model to internalize physical commonsense rather than just improving statistical pattern matching.

What would settle it

An experiment comparing the model's performance on physical reasoning tasks with and without the physical laws annotations in training, measuring if the causal counter-proof capability disappears when annotations are removed.

Figures

Figures reproduced from arXiv: 2605.30062 by Conghui He, Junyan Ye, Kaiqing Lin, Leqi Zhu, Weijia Li, Zhiyuan Yan.

Figure 1
Figure 1. Figure 1: Comparison of synthetic detection paradigms. (a) Tra￾ditional methods overfit synthetic artifacts, risking explanatory hallucinations; (b) Our FakeVLM-R1 additionally learns real￾world physical laws, achieving forensic reasoning via critical thinking; (c) A real-image case demonstrating our method’s robustness against false positives. the image [15]–[18]. Early synthetic detection methods primar￾ily formul… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the FakeClue++ dataset. Diverging from traditional datasets focused on artifact annotation, FakeClue++ additionally incorporates the internalization of real-world physical laws. By incorporating a vast amount of authentic images that inherently conform to natural rules, it enables highly efficient and low-cost annotation via large models. 2) Real Image: To transcend mere artifact recognition an… view at source ↗
Figure 3
Figure 3. Figure 3: The two-stage training framework of FakeVLM-R1. To cold-start the explanatory generation capability, we fine-tune both the Vision Encoder and the LLM Decoder. During RL Stage, we optimize the LLM decoder using multiple rewards to stimulate bidirectional dialectical reasoning and precise detection. G. Subsequently, we introduce a set of multi-dimensional re￾ward functions to comprehensively evaluate this gr… view at source ↗
Figure 4
Figure 4. Figure 4: Fine-grained comparison in FakeClue++. Performance comparison between FakeVLM-R1 and FakeVLM across di￾verse image categories and ground-truth labels. evidence for genuine animal fur or flower close-ups, which ultimately culminates in complete misjudgments. Conversely, FakeVLM-R1 accurately captures and comprehends real-world physical attributes, such as optical shallow depth of field, coherent single ligh… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of reasoning processes on fake Samples among FakeVLM-R1, FakeVLM and Gemini-2.5. TABLE IV: Detection accuracy metrics evaluated on MM￾FakeBench. All results are expressed as percentages (%). Bold and underlined values indicate the best and second-best results, respectively. Method Real Fake Overall Acc F1 Acc F1 Acc F1 Effort [70] 94.7 92.6 83.7 86.9 90.6 89.8 NPR [41] 97.7 88.5 57.4 72.6 83.8 8… view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of reasoning processes on real Samples among FakeVLM-R1, FakeVLM and Gemini-2.5. adjustment, and blurring). As shown in Table VI, the reported values represent the percentage change in performance, with ’Overall’ denoting the mean of the absolute changes. Ex￾perimental results demonstrate that FakeVLM-R1 achieves a substantial leap in its resilience against perturbations. In terms of overall sta… view at source ↗
read the original abstract

The development of generative artificial intelligence technologies has propelled the visual realism of synthetic images to an unprecedented level. Although current interpretable detection methods based on Large Multimodal Models (LMMs) have made certain progress, they still rely on imitation learning derived from massive volumes of forged data. Consequently, they lack genuine causal reasoning capabilities and are prone to explanatory hallucinations. To overcome this bottleneck, we propose FakeVLM-R1, aiming to endow the model with human-like critical thinking capabilities when performing synthetic detection tasks. Building upon Supervised Fine-Tuning (SFT), this framework integrates Group Relative Policy Optimization (GRPO) with a Critical Thinking Chain-of-Thought (CoT) mechanism. During the inference phase, the model executes a "bidirectional dialectical reasoning" process: while proposing a forgery hypothesis, it must simultaneously invoke physical commonsense to construct an authenticity counter-proof. Furthermore, we constructed the FakeClue++ dataset with high-quality samples, which extensively introduces annotations guided by the physical laws of authentic images, providing a unified authenticity anchor for the model. Experiments confirm that FakeVLM-R1 achieves SOTA performance the evaluated models across multiple benchmarks. It not only achieves high-precision, logically interpretable detection but also resolves the over-rejection bias of existing methods against real images, demonstrating generalization and robustness against perturbations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes FakeVLM-R1, which augments supervised fine-tuning with Group Relative Policy Optimization and a Critical Thinking Chain-of-Thought mechanism that performs bidirectional dialectical reasoning (forgery hypothesis plus physical-commonsense counter-proof). It introduces the FakeClue++ dataset containing physical-law annotations on authentic images to serve as a unified authenticity anchor. The central claims are that this framework yields SOTA detection performance across benchmarks, produces logically interpretable outputs, resolves over-rejection bias against real images, and demonstrates generalization and robustness to perturbations.

Significance. If the reported performance gains and bias-resolution claims are substantiated by rigorous experiments and ablations, the approach could advance interpretable synthetic-image detection by shifting from pure imitation learning toward models that internalize physical constraints, a direction with potential value for forensic and content-authenticity applications.

major comments (3)
  1. [Abstract] Abstract: the claim that FakeVLM-R1 'achieves SOTA performance the evaluated models across multiple benchmarks' is unsupported by any reported baselines, metrics, error bars, dataset sizes, or statistical tests, so the central empirical claim cannot be evaluated.
  2. [Abstract] Abstract (paragraph on dataset construction): no ablation that removes the physical-laws annotations, no analysis of whether generated counter-proofs actually reference those annotations, and no counterfactual replacing physical guidance with generic captions are described, leaving the causal mechanism (internalization versus improved pattern matching) untested.
  3. [Abstract] Abstract: the assertion that the method 'resolves the over-rejection bias of existing methods against real images' is presented without quantitative evidence (e.g., false-positive rates on real-image test sets or comparison tables), making the bias-resolution claim load-bearing yet unevaluable.
minor comments (1)
  1. [Abstract] Abstract: the sentence 'achieves SOTA performance the evaluated models' is missing a preposition such as 'among'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these focused comments on the abstract. Each point identifies a legitimate gap in how claims are presented, and we will revise the manuscript to address them directly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that FakeVLM-R1 'achieves SOTA performance the evaluated models across multiple benchmarks' is unsupported by any reported baselines, metrics, error bars, dataset sizes, or statistical tests, so the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract as written does not contain the supporting quantitative details. The experiments section of the manuscript reports the relevant comparisons, but the abstract itself must be self-contained. We will revise the abstract to include the key metrics, baseline names, dataset sizes, error bars, and statistical test results that substantiate the SOTA claim. revision: yes

  2. Referee: [Abstract] Abstract (paragraph on dataset construction): no ablation that removes the physical-laws annotations, no analysis of whether generated counter-proofs actually reference those annotations, and no counterfactual replacing physical guidance with generic captions are described, leaving the causal mechanism (internalization versus improved pattern matching) untested.

    Authors: The current manuscript does not contain the requested ablations or counterfactual analyses. We will add them in the revision: (1) an ablation that removes the physical-law annotations, (2) explicit tracing of whether generated counter-proofs cite the annotations, and (3) a control condition that replaces physical guidance with generic captions. These additions will directly test the claimed causal mechanism. revision: yes

  3. Referee: [Abstract] Abstract: the assertion that the method 'resolves the over-rejection bias of existing methods against real images' is presented without quantitative evidence (e.g., false-positive rates on real-image test sets or comparison tables), making the bias-resolution claim load-bearing yet unevaluable.

    Authors: We agree that the abstract currently offers no quantitative support for the bias-resolution claim. We will revise the abstract to report false-positive rates on real-image test sets for FakeVLM-R1 versus the compared methods, along with the relevant comparison table or numbers. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML framework with no derivation chain or self-referential equations

full rationale

The provided abstract and description contain no equations, formal derivations, or load-bearing mathematical steps. The paper describes an SFT+GRPO training process on a new dataset (FakeClue++) and reports experimental SOTA results; these are empirical claims whose validity depends on ablations and benchmarks rather than any quantity reducing to its own inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked in a way that collapses the central claim. This is the common case of a non-circular empirical paper; the reader's noted assumption about causal internalization is a question of experimental evidence, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no mathematical derivations, fitted constants, or explicit axioms; ledger left empty.

pith-pipeline@v0.9.1-grok · 5785 in / 1155 out tokens · 27250 ms · 2026-06-29T07:57:37.909469+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

78 extracted references · 34 canonical work pages · 12 internal anchors

  1. [1]

    Generative adversarial nets,

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014

  2. [2]

    Denoising Diffusion Probabilistic Models

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” 2020. [Online]. Available: https://arxiv.org/abs/2006.11239

  3. [3]

    Improving image generation with better captions,

    J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y . Guoet al., “Improving image generation with better captions,”Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, vol. 2, no. 3, p. 8, 2023

  4. [4]

    Z-image: An efficient image generation foundation model with single-stream diffusion transformer,

    I. Team, H. Cai, S. Cao, R. Du, P. Gao, S. Hoi, Z. Hou, S. Huang, D. Jiang, X. Jin, L. Li, Z. Li, Z.-Y . Li, D. Liu, D. Liu, J. Shi, Q. Wu, F. Yu, C. Zhang, S. Zhang, and S. Zhou, “Z-image: An efficient image generation foundation model with single-stream diffusion transformer,”

  5. [5]
  6. [6]

    Diffusion models beat gans on image synthesis,

    P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,”Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021

  7. [7]

    High- resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695

  8. [8]

    Diffusion models in vision: A survey,

    F.-A. Croitoru, V . Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,”IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 9, pp. 10 850–10 869, 2023

  9. [9]

    Realgen: Photorealistic text-to-image generation via detector-guided rewards,

    J. Ye, L. Zhu, Y . Guo, D. Jiang, Z. Huang, Y . Zhang, Z. Yan, H. Fu, C. He, and W. Li, “Realgen: Photorealistic text-to-image generation via detector-guided rewards,”arXiv preprint arXiv:2512.00473, 2025

  10. [10]

    Mmfakebench: A mixed-source multimodal misinformation detection benchmark for lvlms,

    X. Liu, Z. Li, P. Li, H. Huang, S. Xia, X. Cui, L. Huang, W. Deng, and Z. He, “Mmfakebench: A mixed-source multimodal misinformation detection benchmark for lvlms,”arXiv preprint arXiv:2406.08772, 2024

  11. [11]

    Deepfakes: Deceptions, mitigations, and opportunities,

    M. Mustak, J. Salminen, M. Mäntymäki, A. Rahman, and Y . K. Dwivedi, “Deepfakes: Deceptions, mitigations, and opportunities,”Journal of Business Research, vol. 154, p. 113368, 2023

  12. [12]

    Leveraging representations from intermediate encoder-blocks for synthetic image detection,

    C. Koutlis and S. Papadopoulos, “Leveraging representations from intermediate encoder-blocks for synthetic image detection,” inEuropean Conference on computer vision. Springer, 2024, pp. 394–411

  13. [13]

    Seeing before reasoning: A unified framework for generalizable and explainable fake image detection,

    K. Lin, Z. Yan, R. Chen, J. Ye, K.-Y . Zhang, Y . Zhou, P. Jin, B. Li, T. Yao, and S. Ding, “Seeing before reasoning: A unified framework for generalizable and explainable fake image detection,”arXiv preprint arXiv:2509.25502, 2025

  14. [14]

    Ivy-Fake: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

    C. Jiang, W. Dong, Z. Zhang, C. Si, F. Yu, W. Peng, X. Yuan, Y . Bi, M. Zhao, Z. Zhouet al., “Ivy-fake: A unified explainable framework and benchmark for image and video aigc detection,”arXiv preprint arXiv:2506.00979, 2025

  15. [15]

    OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild

    Y . Guo, J. Ye, C. Zhang, H. Kang, H. Fu, C. He, and W. Li, “Omniaid: Decoupling semantic and artifacts for universal ai-generated image detection in the wild,”arXiv preprint arXiv:2511.08423, 2025

  16. [16]

    Forgerynet: A versatile benchmark for comprehensive forgery analysis,

    Y . He, B. Gan, S. Chen, Y . Zhou, G. Yin, L. Song, L. Sheng, J. Shao, and Z. Liu, “Forgerynet: A versatile benchmark for comprehensive forgery analysis,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4360–4369

  17. [17]

    Genimage: A million-scale benchmark for detecting ai- generated image,

    M. Zhu, H. Chen, Q. Yan, X. Huang, G. Lin, W. Li, Z. Tu, H. Hu, J. Hu, and Y . Wang, “Genimage: A million-scale benchmark for detecting ai- generated image,”Advances in neural information processing systems, vol. 36, pp. 77 771–77 782, 2023

  18. [18]

    Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images,

    B. Chen, J. Zeng, J. Yang, and R. Yang, “Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images,” inForty-first International Conference on Machine Learning, 2024

  19. [19]

    Wildfake: A large-scale and hierarchical dataset for ai-generated im- ages detection,

    Y . Hong, J. Feng, H. Chen, J. Lan, H. Zhu, W. Wang, and J. Zhang, “Wildfake: A large-scale and hierarchical dataset for ai-generated im- ages detection,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 4, 2025, pp. 3500–3508

  20. [20]

    Frepgan: robust deepfake detection using frequency-level perturbations,

    Y . Jeong, D. Kim, Y . Ro, and J. Choi, “Frepgan: robust deepfake detection using frequency-level perturbations,” inProceedings of the AAAI conference on artificial intelligence, vol. 36, no. 1, 2022, pp. 1060–1068

  21. [21]

    A single simple patch is all you need for ai-generated image detection,

    J. Chen, J. Yao, and L. Niu, “A single simple patch is all you need for ai-generated image detection,”arXiv preprint arXiv:2402.01123, 2024

  22. [22]

    Fakescope: Large multimodal expert model for transparent ai-generated image forensics,

    Y . Li, Y . Tian, Y . Huang, W. Lu, S. Wang, W. Lin, and A. Rocha, “Fakescope: Large multimodal expert model for transparent ai-generated image forensics,” 2025. [Online]. Available: https: //arxiv.org/abs/2503.24267

  23. [23]

    Legion: Learning to ground and explain for synthetic image detection,

    H. Kanget al., “Legion: Learning to ground and explain for synthetic image detection,” inProceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), 2025

  24. [24]

    Can chatgpt detect deepfakes? a study of using multimodal large language models for media forensics,

    S. Jia, R. Lyu, K. Zhao, Y . Chen, Z. Yan, Y . Ju, C. Hu, X. Li, B. Wu, and S. Lyu, “Can chatgpt detect deepfakes? a study of using multimodal large language models for media forensics,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4324–4333

  25. [25]

    Loki: A comprehensive synthetic data de- tection benchmark using large multimodal models,

    J. Ye, B. Zhou, Z. Huang, J. Zhang, T. Bai, H. Kang, J. He, H. Lin, Z. Wang, T. Wuet al., “Loki: A comprehensive synthetic data de- tection benchmark using large multimodal models,”arXiv preprint arXiv:2410.09732, 2024

  26. [26]

    Fakebench: Probing explainable fake image detection via large mul- timodal models,

    Y . Li, X. Liu, X. Wang, B. S. Lee, S. Wang, A. Rocha, and W. Lin, “Fakebench: Probing explainable fake image detection via large mul- timodal models,”IEEE Transactions on Information Forensics and Security, 2025

  27. [27]

    X2-dfd: A framework for explainable and extendable deepfake detection,

    Y . Chen, Z. Yan, G. Cheng, K. Zhao, S. Lyu, and B. Wu, “X2-dfd: A framework for explainable and extendable deepfake detection,”arXiv preprint arXiv:2410.06126, 2024

  28. [28]

    Aigi-holmes: Towards explainable and generalizable ai-generated image detection via multimodal large language models,

    Z. Zhou, Y . Luo, Y . Wu, K. Sun, J. Ji, K. Yan, S. Ding, X. Sun, Y . Wu, and R. Ji, “Aigi-holmes: Towards explainable and generalizable ai-generated image detection via multimodal large language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 18 746–18 758

  29. [29]

    Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation,

    S. Wen, J. Ye, P. Feng, H. Kang, Z. Wen, Y . Chen, J. Wu, W. Wu, C. He, and W. Li, “Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation,”arXiv preprint arXiv:2503.14905, 2025

  30. [30]

    Critique fine-tuning: Learning to cri- tique is more effective than learning to imitate, 2025,

    Y . Wang, X. Yue, and W. Chen, “Critique fine-tuning: Learning to cri- tique is more effective than learning to imitate, 2025,”URL https://arxiv. org/abs/2501.17703

  31. [31]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022. 14

  32. [32]

    Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation,

    J. Ye, D. Jiang, Z. Wang, L. Zhu, Z. Hu, Z. Huang, J. He, Z. Yan, J. Yu, H. Liet al., “Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation,”arXiv preprint arXiv:2508.09987, 2025

  33. [33]

    Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation,

    Z. Yan, J. Ye, W. Li, Z. Huang, S. Yuan, X. He, K. Lin, J. He, C. He, and L. Yuan, “Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation,”arXiv preprint arXiv:2504.02782, 2025

  34. [34]

    Mind-brush: Integrating agentic cognitive search and reasoning into image generation,

    J. He, J. Ye, Z. Huang, D. Jiang, C. Zhang, L. Zhu, R. Zhang, X. Zhang, and W. Li, “Mind-brush: Integrating agentic cognitive search and reasoning into image generation,”arXiv preprint arXiv:2602.01756, 2026

  35. [35]

    Cnn- generated images are surprisingly easy to spot... for now,

    S.-Y . Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, “Cnn- generated images are surprisingly easy to spot... for now,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8695–8704

  36. [36]

    Towards universal fake image detec- tors that generalize across generative models,

    U. Ojha, Y . Li, and Y . J. Lee, “Towards universal fake image detec- tors that generalize across generative models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 24 480–24 489

  37. [37]

    Detecting generated images by real images only,

    X. Bi, B. Liu, F. Yang, B. Xiao, W. Li, G. Huang, and P. C. Cosman, “Detecting generated images by real images only,”arXiv preprint arXiv:2311.00962, 2023

  38. [38]

    Generative adversarial networks,

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020

  39. [39]

    Rich and poor texture con- trast: A simple yet effective approach for ai-generated image detection,

    N. Zhong, Y . Xu, Z. Qian, and X. Zhang, “Rich and poor texture con- trast: A simple yet effective approach for ai-generated image detection,” arXiv preprint arXiv:2311.12397, vol. 3, no. 6, p. 1, 2023

  40. [40]

    Fakecatcher: Detection of synthetic portrait videos using biological signals,

    U. A. Ciftci, I. Demir, and L. Yin, “Fakecatcher: Detection of synthetic portrait videos using biological signals,”IEEE transactions on pattern analysis and machine intelligence, 2020

  41. [41]

    Wavelet-packets for deepfake image analysis and detection,

    M. Wolter, F. Blanke, R. Heese, and J. Garcke, “Wavelet-packets for deepfake image analysis and detection,”Machine Learning, vol. 111, no. 11, pp. 4295–4327, 2022

  42. [42]

    Rethinking the up- sampling operations in cnn-based generative network for generalizable deepfake detection,

    C. Tan, Y . Zhao, S. Wei, G. Gu, P. Liu, and Y . Wei, “Rethinking the up- sampling operations in cnn-based generative network for generalizable deepfake detection,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 28 130–28 139

  43. [43]

    Visual veracity: Advancing ai-generated image detection with convolutional neural networks,

    A. S. Gupta, K. P. Shreneter, and S. Sehgal, “Visual veracity: Advancing ai-generated image detection with convolutional neural networks,”2024 11th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), 2024

  44. [44]

    Compar- ative analyais of cnn architectures for deep fake detection,

    P. Wani, S. Chavan, S. Paithankar, D. Ghusse, and S. Barve, “Compar- ative analyais of cnn architectures for deep fake detection,”2025 3rd International Conference on Intelligent Systems, Advanced Computing and Communication (ISACC), 2025

  45. [45]

    Detec- tion of ai-generated synthetic images with a lightweight cnn,

    A. L. La ¯devi´c, T. Kramberger, R. Kramberger, and D. Vlahek, “Detec- tion of ai-generated synthetic images with a lightweight cnn,”Applied Informatics, 2024

  46. [46]

    Advancing ai-generated image detection: Enhanced accuracy through cnn and vision transformer models with explainable ai insights,

    M. Z. Hossain, F. U. Zaman, and M. R. Islam, “Advancing ai-generated image detection: Enhanced accuracy through cnn and vision transformer models with explainable ai insights,”2023 26th International Conference on Computer and Information Technology (ICCIT), 2023

  47. [47]

    Antifakeprompt: Prompt- tuned vision-language models are fake image detectors,

    Y .-M. Chang, C. Yeh, W.-C. Chiu, and N. Yu, “Antifakeprompt: Prompt- tuned vision-language models are fake image detectors,”arXiv preprint arXiv:2310.17419, 2023

  48. [48]

    Fad-net: Fake images detection and generalization based on frequency domain transformation,

    X. Liu, J. Liu, P. Guo, D. Tuo, S. Tian, and Y . Jiang, “Fad-net: Fake images detection and generalization based on frequency domain transformation,” in2022 15th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP- BMEI). IEEE, 2022, pp. 1–7

  49. [49]

    Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning,

    C. Tan, Y . Zhao, S. Wei, G. Gu, P. Liu, and Y . Wei, “Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 5052–5060

  50. [50]

    Dynamic graph learning with content-guided spatial-frequency relation reasoning for deepfake detection,

    Y . Wang, K. Yu, C. Chen, X. Hu, and S. Peng, “Dynamic graph learning with content-guided spatial-frequency relation reasoning for deepfake detection,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 7278–7287

  51. [51]

    OpenAI, “Gpt-4o,” https://openai.com/index/ introducing-4o-image-generation, 2025

  52. [52]

    Gemini 2.5 pro,

    G. DeepMind, “Gemini 2.5 pro,” Accessed: 2025-11-11, 2025. [Online]. Available: https://cloud.google.com/vertex-ai/generative-ai/docs/models/ gemini/2-5-pro

  53. [53]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, pp. 34 892– 34 916, 2023

  54. [54]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2502.13923

  55. [55]

    Q-bench: A benchmark for general-purpose foundation models on low-level vision.arXiv preprint arXiv:2309.14181, 2024

    H. Wu, Z. Zhang, E. Zhang, C. Chen, L. Liao, A. Wang, C. Li, W. Sun, Q. Yan, G. Zhaiet al., “Q-bench: A benchmark for general-purpose foundation models on low-level vision,”arXiv preprint arXiv:2309.14181, 2023

  56. [56]

    Why are visually-grounded language models bad at image classification?

    Y . Zhang, A. Unell, X. Wang, D. Ghosh, Y . Su, L. Schmidt, and S. Yeung-Levy, “Why are visually-grounded language models bad at image classification?”Advances in Neural Information Processing Systems, vol. 37, pp. 51 727–51 753, 2024

  57. [57]

    Fakeshield: Explainable image forgery detection and localization via multi-modal large language models,

    Z. Xu, X. Zhang, R. Li, Z. Tang, Q. Huang, and J. Zhang, “Fakeshield: Explainable image forgery detection and localization via multi-modal large language models,”arXiv preprint arXiv:2410.02761, 2024

  58. [58]

    Common sense reasoning for deepfake detection,

    Y . Zhang, B. Colman, X. Guo, A. Shahriyari, and G. Bharaj, “Common sense reasoning for deepfake detection,” inEuropean conference on computer vision. Springer, 2024, pp. 399–415

  59. [59]

    arXiv preprint arXiv:2408.10072 (2024)

    Z. Huang, B. Xia, Z. Lin, Z. Mou, W. Yang, and J. Jia, “Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant,”arXiv preprint arXiv:2408.10072, 2024

  60. [60]

    BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

    H. Wen, T. Li, Z. Huang, Y . He, and G. Cheng, “Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm,”arXiv preprint arXiv:2507.14632, 2025

  61. [61]

    Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

    Y . Li, W. Zheng, Y . Zhang, R. Sun, Y . Zheng, L. Chen, J. Zhou, and J. Lu, “Skyra: Ai-generated video detection via grounded artifact reasoning,”arXiv preprint arXiv:2512.15693, 2025

  62. [62]

    ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization

    J. Liu, F. Zhang, J. Zhu, E. Sun, Q. Zhang, and Z.-J. Zha, “Forgerygpt: Multimodal large language model for explainable image forgery detec- tion and localization,”arXiv preprint arXiv:2410.10238, 2024

  63. [63]

    arXiv preprint arXiv:2505.18660 (2025)

    Z. Huang, T. Li, X. Li, H. Wen, Y . He, J. Zhang, H. Fei, X. Yang, X. Huang, B. Peng, and G. Cheng, “So-fake: Benchmarking and explaining social media image forgery detection,” 2025. [Online]. Available: https://arxiv.org/abs/2505.18660

  64. [64]

    Sida: Social media image deepfake detection, localization and explanation with large multimodal model,

    Z. Huang, J. Hu, X. Li, Y . He, X. Zhao, B. Peng, B. Wu, X. Huang, and G. Cheng, “Sida: Social media image deepfake detection, localization and explanation with large multimodal model,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 28 831– 28 841

  65. [65]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,

    A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V . Ferrari, “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,”International Journal of Computer Vision, vol. 128, no. 7, p. 1956–1981, Mar. 2020. [O...

  66. [66]

    Efficient memory management for large language model serving with pagedattention,

    W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” inProceedings of the 29th symposium on operating systems principles, 2023, pp. 611–626

  67. [67]

    Sglang: Efficient execution of structured language model programs,

    L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalezet al., “Sglang: Efficient execution of structured language model programs,”Advances in neural information processing systems, vol. 37, pp. 62 557–62 583, 2024

  68. [68]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  69. [69]

    Sophiavl-r1: Reinforcing mllms reasoning with thinking reward,

    K. Fan, K. Feng, H. Lyu, D. Zhou, and X. Yue, “Sophiavl-r1: Reinforcing mllms reasoning with thinking reward,”arXiv preprint arXiv:2505.17018, 2025

  70. [70]

    On the detection of synthetic images generated by diffusion models,

    R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, and L. Verdoliva, “On the detection of synthetic images generated by diffusion models,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  71. [71]

    Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

    Z. Yan, J. Wang, Z. Wang, P. Jin, K.-Y . Zhang, S. Chen, T. Yao, S. Ding, B. Wu, and L. Yuan, “Effort: Efficient orthogonal modeling for general- izable ai-generated image detection,”arXiv preprint arXiv:2411.15633, 2024

  72. [72]

    A sanity check for ai-generated image detection,

    S. Yan, O. Li, J. Cai, Y . Hao, X. Jiang, Y . Hu, and W. Xie, “A sanity check for ai-generated image detection,”arXiv preprint arXiv:2406.19435, 2024

  73. [73]

    Gpt-5 system card,

    OpenAI, “Gpt-5 system card,” https://cdn.openai.com/ gpt-5-system-card.pdf, Aug. 2025

  74. [74]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y . Ma, C. Wu, B. Wang, Z. Xie, Y . Wu, K. Hu, J. Wang, Y . Sun, Y . Li, 15 Y . Piao, K. Guan, A. Liu, X. Xie, Y . You, K. Dong, X. Yu, H. Zhang, L. Zhao, Y . Wang, and C. Ruan, “Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding,” 2024. [Online]. Available:...

  75. [75]

    Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models,

    J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y . Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y . Cao, Y . Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y . He, T. Jiang, J. Luo, Y . Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y . Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. W...

  76. [76]
  77. [77]

    Global texture enhancement for fake face detection in the wild,

    Z. Liu, X. Qi, and P. H. Torr, “Global texture enhancement for fake face detection in the wild,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8060–8069

  78. [78]

    Fusing global and local features for generalized ai-synthesized image detection,

    Y . Ju, S. Jia, L. Ke, H. Xue, K. Nagano, and S. Lyu, “Fusing global and local features for generalized ai-synthesized image detection,” in2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022, pp. 3465–3469