pith. machine review for the scientific record.

arxiv: 2602.04939 · v2 · submitted 2026-02-04 · 💻 cs.CV

Recognition: no theorem link

SynthForensics: Benchmarking and Evaluating People-Centric Synthetic Video Deepfakes

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 07:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords deepfake detection · synthetic video · benchmark · generative models · T2V · I2V · detector evaluation · video forensics

The pith

A new benchmark of 20,445 synthetic people videos reveals that deepfake detectors lose an average of 27 AUC points relative to legacy benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SynthForensics to test detectors on realistic videos of people made by current text-to-video and image-to-video generators rather than older editing-based forgeries. It pairs these videos with real footage from established sources, validates them through two stages of human review, and supplies versions at multiple compression levels. Across 15 detectors, performance falls sharply on the new data, with further losses under compression, and retraining on synthetic videos improves results on the new benchmark only at the cost of accuracy on older benchmarks. The study finds that the features for spotting synthetic generation differ from those for spotting manipulations. The full dataset, pipeline, and code are released to support better evaluation.

Core claim

SynthForensics supplies 20,445 videos from eight text-to-video and seven image-to-video open-source generators, each paired with matching real clips from FF++ and DFD, produced through two-stage human validation and delivered in four compression settings with complete metadata. Evaluation of fifteen detectors across three protocols shows face-based methods lose 13 to 55 AUC points, averaging 27, when moving from FF++ to SynthForensics, with an additional average drop of 23 points under aggressive compression. Fine-tuning on the new benchmark narrows the gap yet reduces accuracy on legacy sets, while training from scratch indicates that synthetic-generation cues and manipulation cues remain largely disjoint for most detectors.
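
To make the delivery format concrete, here is a minimal sketch of what one paired benchmark entry could look like in code; the field names and example values are illustrative assumptions, not the released metadata schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class BenchmarkEntry:
    """One SynthForensics clip; fields are hypothetical, for illustration only."""
    video_id: str          # unique clip identifier
    source_real: str       # paired real clip drawn from FF++ or DFD
    generator: str         # one of the 8 T2V or 7 I2V open-source generators
    generator_type: str    # "T2V" or "I2V"
    compression: str       # "raw", "canonical", "crf23", or "crf40"
    prompt: str            # structured prompt used to drive generation
    human_validated: bool  # passed both human validation stages

entry = BenchmarkEntry(
    video_id="sf_000123",
    source_real="ffpp/original/000123.mp4",
    generator="Wan2.1",
    generator_type="T2V",
    compression="crf23",
    prompt="A news anchor speaking to camera in a studio ...",
    human_validated=True,
)
print(asdict(entry))
```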

What carries the argument

The SynthForensics benchmark itself, consisting of paired real-synthetic videos from fifteen specified T2V and I2V generators, two-stage human validation, and four compression variants.

If this is right

  • Face-based detectors exhibit AUC drops of 13-55 points (mean 27) when moving from FF++ to SynthForensics.
  • Aggressive compression adds a further average 23-point AUC decline on the new videos (see the compression sketch after this list).
  • Fine-tuning on SynthForensics raises performance on synthetic content but lowers it on legacy manipulation benchmarks.
  • Training from scratch on synthetic data shows detection features for generation artifacts are largely disjoint from manipulation features.
  • The benchmark supplies a harder, more current test set for assessing detector robustness to realistic people synthesis.
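
To ground the compression point flagged above: Figure 2 names Raw, Canonical, CRF23, and CRF40 versions, and a minimal sketch of producing the two CRF variants with ffmpeg follows. The H.264 encoder and flag choices are assumptions for illustration, not the released processing pipeline.

```python
import subprocess
from pathlib import Path

def make_crf_variants(src: Path, out_dir: Path) -> None:
    """Re-encode one clip at the two CRF levels named in the paper (23 and 40).

    The libx264 settings here are an assumption; the released pipeline may use
    a different codec, container, or preprocessing before re-encoding.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    for crf in (23, 40):
        dst = out_dir / f"{src.stem}_crf{crf}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(src),
             "-c:v", "libx264", "-crf", str(crf),
             "-an", str(dst)],
            check=True,
        )

# Hypothetical usage:
# make_crf_variants(Path("raw/sf_000123.mp4"), Path("variants/"))
```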

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separate training regimes or hybrid models may be needed to cover both generative synthesis and traditional editing artifacts at once.
  • Real-world systems for video verification could integrate this benchmark to stay effective against improving generative models.
  • The paired-source and validation approach can be reused in other media types to track how new generators evade detectors.
  • Benchmark metadata could help generative-model developers create targeted examples that expose current detector weaknesses.

Load-bearing premise

The eight text-to-video and seven image-to-video generators plus the two-stage human validation produce videos that stand in for realistic synthetic people content today and in the near future.

What would settle it

Testing the fifteen detectors on SynthForensics and measuring whether AUC falls by 13-55 points from their FF++ scores, or stays flat, directly checks the reported generalization failure.
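
A minimal sketch of that check, assuming per-video detector scores and binary real/fake labels are available for both test sets; the loader objects and detector callable are placeholders rather than the paper's released evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def video_auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """Video-level AUC in points (0-100); labels: 1 = synthetic/fake, 0 = real."""
    return 100.0 * roc_auc_score(labels, scores)

def auc_drop(detector, ffpp_set, synthforensics_set) -> float:
    """AUC points lost when moving the same detector from FF++ to SynthForensics.

    `detector` is any callable mapping a list of videos to per-video scores;
    the `*_set` objects expose `.videos` and `.labels` (placeholder interfaces).
    """
    auc_ffpp = video_auc(ffpp_set.labels, detector(ffpp_set.videos))
    auc_sf = video_auc(synthforensics_set.labels, detector(synthforensics_set.videos))
    return auc_ffpp - auc_sf

# The paper reports drops of 13-55 points (mean 27) for face-based detectors;
# a near-zero drop here would contradict the claimed generalization failure.
```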

Figures

Figures reproduced from arXiv: 2602.04939 by Claudio Vittorio Ragaglia, Francesco Guarnera, Luca Guarnera, Mirko Casu, Roberto Leotta, Salvatore Alfio Sambataro, Sebastiano Battiato, Yuri Petralia.

Figure 1: The paired-source data generation pipeline for SynthForensics. Real source videos (left) from FF++ and DFD are analyzed to …
Figure 2: The SynthForensics benchmark. Starting from 1,363 source videos (FF++, DFD), a VLM generates structured 7-field descriptions validated by human reviewers. Each prompt is optimized for five T2V models (CogVideoX, MAGI-1, Self-Forcing, SkyReels-V2, Wan2.1). Synthesized videos undergo human validation before processing into four compression versions (Raw, Canonical, CRF23, CRF40). The final benchmark compris…
Figure 3: Zero-shot performance (Video AUC %) of state-of-the …
Figure 4: The exact System Prompt utilized with VideoLLaMA 3 …
Figure 5: Representative example of the raw structured metadata …
Figure 6: The exact System Instruction used for the automated safety screening performed with the LLM …
Figure 7: Qualitative Failure Analysis of Excluded Models. We visualize representative artifacts that led to the exclusion of candidate …
Figure 8: Qualitative Comparison (Sample A - News Anchor). Frame sequences …
Figure 9: Qualitative Comparison (Sample B - Sports Broadcast). Frame sequences …
Figure 10: Fine-grained Generation Fidelity Analysis (Sample B - Sports Broadcast). Frame sequences (t = 0 to t = 4) comparing mag…
read the original abstract

Modern T2V/I2V generators synthesize people increasingly hard to distinguish from authentic footage, while current evaluation suites lag: legacy benchmarks target manipulation-based forgeries, and recent synthetic-video benchmarks prioritize scale over realistic human depiction. We introduce SynthForensics, a people-centric benchmark of 20,445 videos from 8 T2V and 7 I2V open-source generators, paired-source from FF++/DFD reals, two-stage human-validated, in four compression versions with full metadata. In our paired-comparison human study, raters prefer SynthForensics in 71-77% of head-to-head comparisons against each of nine existing synthetic-video benchmarks, while facial-quality metrics fall within the FF++/DFD baseline range. Across 15 detectors and three protocols, face-based methods drop 13-55 AUC points (mean 27) from FF++ to SynthForensics and a further 23 under aggressive compression; fine-tuning closes the gap at a backward cost on legacy benchmarks; training from scratch shows synthetic and manipulation features largely disjoint for most detectors. We release dataset, pipeline, and code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces SynthForensics, a people-centric benchmark of 20,445 videos generated from 8 T2V and 7 I2V open-source models, paired with real footage from FF++ and DFD. It includes two-stage human validation, four compression levels, and full metadata. Human raters prefer the new videos over nine prior synthetic benchmarks in 71-77% of comparisons. Across 15 detectors and three protocols, face-based detectors show AUC drops of 13-55 points (mean 27) from FF++ to SynthForensics, with an additional 23-point drop under aggressive compression; fine-tuning recovers performance on the new data at the cost of legacy benchmarks, while training from scratch indicates largely disjoint features between synthetic and manipulation detection.

Significance. If the benchmark videos are representative of realistic synthetic people content, the results demonstrate a clear generalization failure in existing detectors and support the claim that synthetic-video and manipulation features are largely disjoint for most models. The public release of the dataset, pipeline, and code is a clear strength that enables future work. The empirical scale (15 detectors, multiple protocols, compression variants) provides concrete evidence of the performance gap, though the magnitude of the reported drops depends on the chosen generators being broadly representative.

major comments (2)
  1. [Dataset construction] Dataset construction section: the selection of the specific 8 T2V and 7 I2V open-source generators is presented without detailed justification, ablation studies, or comparison to proprietary models. Because the headline AUC drops (13-55 points, mean 27) and the disjoint-features conclusion are only meaningful if these videos lie on the same distribution as current and near-future realistic synthetic people content, the lack of evidence that the observed artifacts are not generator-specific undermines the central generalization claim.
  2. [Evaluation protocols and results] Evaluation protocols and results sections: the two-stage human validation is described at a high level but lacks exact selection criteria, inter-rater agreement statistics, and full methodology details. In addition, the AUC results are reported without error bars or confidence intervals, so the mean drop of 27 points and the further 23-point compression drop cannot be assessed for statistical reliability.
minor comments (3)
  1. [Abstract] Abstract: the reported human preference rates (71-77%) and AUC ranges would benefit from explicit error bars or sample sizes to allow readers to gauge precision.
  2. [Evaluation protocols] The paper should clarify the exact definition of the three evaluation protocols (e.g., train/test splits, whether cross-generator or cross-compression) in a dedicated subsection or table.
  3. [Experiments] Figure captions and tables listing the 15 detectors should include the precise model names and any fine-tuning hyperparameters to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each of the major comments below and plan to incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the selection of the specific 8 T2V and 7 I2V open-source generators is presented without detailed justification, ablation studies, or comparison to proprietary models. Because the headline AUC drops (13-55 points, mean 27) and the disjoint-features conclusion are only meaningful if these videos lie on the same distribution as current and near-future realistic synthetic people content, the lack of evidence that the observed artifacts are not generator-specific undermines the central generalization claim.

    Authors: We selected the 8 T2V and 7 I2V models because they represent leading publicly available generators at the time of dataset creation, chosen for their high adoption in the research community, strong performance on established video quality benchmarks such as VBench, and demonstrated capability for realistic human synthesis. In the revision we will add an explicit selection-criteria subsection that lists model popularity metrics (GitHub stars, citations), qualitative human-preference rankings, and references to prior works that used the same models. We will also include an ablation that recomputes the key AUC drops on random subsets of the generators to demonstrate that the observed gaps are consistent rather than driven by any single model. Regarding proprietary models, our benchmark deliberately targets open-source generators to guarantee full reproducibility and accessibility; closed-source systems are not generally available for large-scale dataset construction by academic researchers. We will add a limitations paragraph acknowledging this scope and noting that the rapid convergence of open and closed model quality makes the reported generalization failure relevant to near-future realistic content. revision: yes
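
A sketch of the proposed subset ablation, assuming a detector's AUC has already been computed per generator; the subset size, trial count, and plain averaging across generators are illustrative choices rather than the authors' protocol.

```python
import random
import numpy as np

def generator_subset_ablation(per_generator_auc: dict[str, float],
                              ffpp_auc: float,
                              subset_size: int = 8,
                              trials: int = 200,
                              seed: int = 0) -> tuple[float, float]:
    """Mean and std of the FF++-to-subset AUC drop over random generator subsets.

    `per_generator_auc` maps each generator name to the detector's AUC (in
    points) on that generator's clips; a stable drop across subsets would
    argue against the gap being driven by any single model.
    """
    rng = random.Random(seed)
    names = list(per_generator_auc)
    drops = []
    for _ in range(trials):
        subset = rng.sample(names, subset_size)
        subset_auc = float(np.mean([per_generator_auc[g] for g in subset]))
        drops.append(ffpp_auc - subset_auc)
    return float(np.mean(drops)), float(np.std(drops))
```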

  2. Referee: [Evaluation protocols and results] Evaluation protocols and results sections: the two-stage human validation is described at a high level but lacks exact selection criteria, inter-rater agreement statistics, and full methodology details. In addition, the AUC results are reported without error bars or confidence intervals, so the mean drop of 27 points and the further 23-point compression drop cannot be assessed for statistical reliability.

    Authors: We will expand the human-validation section with the precise protocol: first-stage filtering required unanimous rejection of obvious artifacts by three raters; second-stage selection retained only clips that received at least three-out-of-five positive votes from a new rater pool. We will report inter-rater agreement using Fleiss’ kappa (target value >0.6) together with the exact rater instructions, compensation, and demographic summary. For all AUC figures we will recompute results with bootstrap resampling (1,000 iterations) over the test videos and add 95% confidence intervals to every reported drop, including the mean 27-point and 23-point compression drops. These additions will allow direct statistical assessment of the reliability of the performance gaps. revision: yes
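
A sketch of both additions, assuming rater decisions are stored as a clips-by-raters matrix of category ids and detector outputs as per-video scores; the random inputs below stand in for the actual validation logs and test sets.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Fleiss' kappa over the validation votes (rows = clips, columns = raters,
# entries = category id, e.g. 0 = reject, 1 = accept); random data as a stand-in.
ratings = np.random.default_rng(0).integers(0, 2, size=(100, 5))
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap 95% CI for video-level AUC (1,000 resamples)."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), len(labels))
        if np.unique(labels[idx]).size < 2:   # resample must contain both classes
            continue
        aucs.append(roc_auc_score(labels[idx], scores[idx]))
    return tuple(np.quantile(aucs, [alpha / 2, 1 - alpha / 2]))
```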

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

full rationale

The paper creates a new dataset (SynthForensics) from open-source T2V/I2V generators, performs human validation, and reports empirical AUC drops for 15 detectors across protocols compared to FF++ and DFD. No equations, derivations, predictions, or first-principles results are claimed. All performance numbers are direct measurements on held-out data; no parameters are fitted and then renamed as predictions. No self-citations are load-bearing for the central claims, and the representativeness of the chosen generators is an explicit modeling choice rather than a derived result. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters or invented entities; relies on standard computer-vision evaluation practices and human preference protocols.

axioms (1)
  • domain assumption Human raters provide reliable preference judgments in paired comparisons of synthetic versus real videos
    Central to the 71-77% preference claim and benchmark validation
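
As a rough gauge of the precision behind the 71-77% figures, a Wilson interval for an assumed number of head-to-head comparisons per rival benchmark; the comparison counts below are placeholders, since the per-benchmark sample sizes are not quoted on this page.

```python
from statsmodels.stats.proportion import proportion_confint

# Width of a 95% Wilson interval around an illustrative 74% preference rate
# at a few hypothetical comparison counts (the paper's counts are not shown here).
for n_comparisons in (200, 500, 1000):
    wins = round(0.74 * n_comparisons)
    lo, hi = proportion_confint(wins, n_comparisons, alpha=0.05, method="wilson")
    print(f"n={n_comparisons}: preference CI {lo:.2f}-{hi:.2f}")
```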

pith-pipeline@v0.9.0 · 5535 in / 1055 out tokens · 31732 ms · 2026-05-16T07:29:31.704773+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 16 internal anchors

  1. [1]

    MAGI-1: Autoregressive Video Generation at Scale

    Sand.ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, et al. MAGI-1: Autoregressive Video Generation at Scale. arXiv preprint arXiv:2505.13211, 2025. 3, 4, 6, 8, 2

  2. [2]

    Advance Fake Video Detection via Vision Transformers

    Joy Battocchio, Stefano Dell’Anna, Andrea Montibeller, and Giulia Boato. Advance Fake Video Detection via Vision Transformers. In Proceedings of the 2025 ACM Workshop on Information Hiding and Multimedia Security, page 1–11, New York, NY , USA, 2025. Association for Computing Ma- chinery. 3

  3. [3]

    Human Action CLIPs: Detecting AI-generated Human Motion

    Matyas Bohacek and Hany Farid. Human Action CLIPs: Detecting AI-generated Human Motion. arXiv preprint arXiv:2412.00526, 2025. 3

  4. [4]

    AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations

    Zhixi Cai, Kartik Kuckreja, Shreya Ghosh, Akanksha Chuchra, Muhammad Haris Khan, Usman Tariq, et al. AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations. In Proceedings of the 33rd ACM International Conference on Multimedia, page 13686–13691, New York, NY, USA, 2025. Association for Computing Machinery. 3

  5. [5]

    End-to-End Reconstruction-Classification Learning for Face Forgery Detection

    Junyi Cao, Chao Ma, Taiping Yao, Shen Chen, Shouhong Ding, and Xiaokang Yang. End-to-End Reconstruction-Classification Learning for Face Forgery Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4113–4122, 2022. 3, 5, 7, 8, 11

  6. [6]

    GenAI mirage: The impostor bias and the deepfake detection challenge in the era of artificial illusions

    Mirko Casu, Luca Guarnera, Pasquale Caponnetto, and Sebastiano Battiato. GenAI mirage: The impostor bias and the deepfake detection challenge in the era of artificial illusions. Forensic Science International: Digital Investigation, 50:301795, 2024. 2

  7. [7]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, et al. SkyReels-V2: Infinite-length Film Generative Model. arXiv preprint arXiv:2504.13074, 2025. 3, 4, 6, 8, 1

  8. [8]

    Trusted Media Challenge Dataset and User Study

    Weiling Chen, Sheng Lun Benjamin Chua, Stefan Winkler, and See-Kiong Ng. Trusted Media Challenge Dataset and User Study. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, page 3873–3877, New York, NY , USA, 2022. Association for Computing Machinery. 3, 7

  9. [9]

    X2-dfd: A framework for explainable and extendable deepfake detection

    Yize Chen, Zhiyuan Yan, Guangliang Cheng, Kangran Zhao, Siwei Lyu, and Baoyuan Wu. X2-dfd: A framework for explainable and extendable deepfake detection. arXiv preprint arXiv:2410.06126, 2024. 7

  10. [10]

    Can We Leave Deepfake Data Behind in Training Deepfake Detector?

    Jikang Cheng, Zhiyuan Yan, Ying Zhang, Yuhao Luo, Zhongyuan Wang, and Chen Li. Can We Leave Deepfake Data Behind in Training Deepfake Detector? In Advances in Neural Information Processing Systems, pages 21979–21998. Curran Associates, Inc., 2024. 3, 5, 7, 8, 11

  11. [11]

    GenConViT: Deepfake Video Detection Using Generative Convolutional Vision Transformer

    Deressa Wodajo Deressa, Hannes Mareen, Peter Lambert, Solomon Atnafu, Zahid Akhtar, and Glenn Van Wallendael. GenConViT: Deepfake Video Detection Using Generative Convolutional Vision Transformer. Applied Sciences , 15 (12), 2025. 3, 5, 7, 8, 11

  12. [12]

    The DeepFake Detection Challenge (DFDC) Dataset

    Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, et al. The DeepFake Detection Chal- lenge (DFDC) Dataset. arXiv preprint arXiv:2006.07397 ,

  13. [13]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3

  14. [14]

    Contributing Data to Deepfake Detection Research

    Nick Dufour and Andrew Gully. Contributing Data to Deepfake Detection Research. https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html, 2019. Accessed: 2025-11-05. 3, 6, 7, 8

  15. [15]

    Generative Adversarial Nets

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2014. 2

  16. [16]

    New Veo Updates and More AI Progress in Video and Audio

    Google. New Veo Updates and More AI Progress in Video and Audio. https://blog.google/technology/ai/veo-updates-flow/, 2024. Accessed: 2025-11-

  17. [17]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783,

  18. [18]

    Level up the deepfake detection: a method to effectively discriminate images generated by gan architectures and diffusion models

    Luca Guarnera, Oliver Giudice, and Sebastiano Battiato. Level up the deepfake detection: a method to effectively discriminate images generated by gan architectures and diffusion models. In Intelligent Systems Conference, pages 615–

  19. [19]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, et al. LTX- Video: Realtime Video Latent Diffusion. arXiv preprint arXiv:2501.00103, 2024. 2, 4, 3

  20. [20]

    Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model

    Yue-Hua Han, Tai-Ming Huang, Kai-Lung Hua, and Jun- Cheng Chen. Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22995–23005, 2025. 3, 5, 7, 8, 11

  21. [21]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. arXiv preprint arXiv:2207.12598, 2022. 4 9

  22. [22]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, pages 6840–6851. Curran Associates, Inc., 2020. 2, 4

  23. [23]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the Train- Test Gap in Autoregressive Video Diffusion. arXiv preprint arXiv:2506.08009, 2025. 3, 4, 6, 8, 2

  24. [24]

    Self-Forcing: Official Implementation

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self-Forcing: Official Implementation. https://github.com/guandeh17/Self-Forcing, 2025. Accessed: 2025-11-09. 2, 3, 5

  25. [25]

    VBench: Comprehensive Benchmark Suite for Video Generative Models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, et al. VBench: Comprehensive Benchmark Suite for Video Generative Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 5, 9

  26. [26]

    DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection

    Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. 3

  27. [27]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022. 4

  28. [28]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, et al. HunyuanVideo: A Systematic Frame- work For Large Video Generative Models. arXiv preprint arXiv:2412.03603, 2025. 2, 4, 3

  29. [29]

    DeepFakes: a New Threat to Face Recognition? Assessment and Detection

    Pavel Korshunov and Sebastien Marcel. DeepFakes: a New Threat to Face Recognition? Assessment and Detection. arXiv preprint arXiv:1812.08685, 2018. 3, 7

  30. [30]

    Kling

    Kuaishou Technology. Kling. https://klingai.com/global/, 2024. Accessed: 2025-11-05. 2, 4

  31. [31]

    Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

    Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, and Amit K. Roy-Chowdhury. Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28050–28060, 2025. 2

  32. [32]

    Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics

    Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. 2, 3, 6, 7

  33. [33]

    Celeb-DF++: A Large-scale Challenging Video DeepFake Benchmark for Generalizable Forensics

    Yuezun Li, Delong Zhu, Xinjie Cui, and Siwei Lyu. Celeb-DF++: A Large-scale Challenging Video DeepFake Benchmark for Generalizable Forensics. arXiv preprint arXiv:2507.18015, 2025. 3

  34. [34]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling. arXiv preprint arXiv:2210.02747, 2022. 4

  35. [35]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In 11th International Conference on Learning Representations, ICLR 2023, 2023. 2

  36. [36]

    VideoDPO: Omni-preference alignment for video diffusion generation

    Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. VideoDPO: Omni-preference alignment for video diffusion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8009–8019, 2025. 2

  37. [37]

    Beyond the Prior Forgery Knowledge: Mining Critical Clues for General Face Forgery Detection

    Anwei Luo, Chenqi Kong, Jiwu Huang, Yongjian Hu, Xiangui Kang, and Alex C. Kot. Beyond the Prior Forgery Knowledge: Mining Critical Clues for General Face Forgery Detection. IEEE Transactions on Information Forensics and Security, 19:1168–1182, 2024. 3, 5, 7, 11

  38. [38]

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, et al. Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model. arXiv preprint arXiv:2502.10248, 2025. 2, 4, 3

  39. [39]

    GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video

    Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, et al. GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video. arXiv preprint arXiv:2501.11340, 2025. 3

  40. [40]

    Sora 2

    OpenAI. Sora 2. https://openai.com/index/sora-2/, 2024. Accessed: 2025-11-05. 2, 4

  41. [41]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

  42. [42]

    Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

    Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, et al. Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k.arXiv preprint arXiv:2503.09642, 2025. 2, 4, 3

  43. [43]

    FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance Head-pose and Facial Expression Features

    Andre Rochow, Max Schwarz, and Sven Behnke. FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance Head-pose and Facial Expression Features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7716–7726, 2024. 2

  44. [44]

    U-Net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. 2

  45. [45]

    FaceForensics++: Learning to Detect Manipulated Facial Images

    Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Niessner. FaceForensics++: Learning to Detect Manipulated Facial Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019. 2, 3, 6, 7, 8

  46. [46]

    BlendFace: Re-designing Identity Encoders for Face-Swapping

    Kaede Shiohara, Xingchao Yang, and Takafumi Taketomi. BlendFace: Re-designing Identity Encoders for Face-Swapping. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7634–7644, 2023. 2

  47. [47]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 4

  48. [48]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 2

  49. [49]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, et al. Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314, 2025.

  50. [50]

    AltFreezing for More General Video Face Forgery Detection

    Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, and Houqiang Li. AltFreezing for More General Video Face Forgery Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4129–4138, 2023. 3, 5, 7, 8, 11

  51. [51]

    UCF: Uncovering Common Features for Generalizable Deepfake Detection

    Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. UCF: Uncovering Common Features for Generalizable Deepfake Detection. In Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision , pages 22412– 22423, 2023. 3, 5, 7, 8, 11

  52. [52]

    DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection

    Zhiyuan Yan, Yong Zhang, Xinhang Yuan, Siwei Lyu, and Baoyuan Wu. DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection. In Advances in Neural Information Processing Systems, pages 4534–4565. Curran Associates, Inc., 2023. 5

  53. [53]

    Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

    Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, et al. Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection. In Forty-second International Conference on Machine Learning, 2025. 3, 5, 7, 8, 11

  54. [54]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, et al. CogVideoX: Text-to-Video Dif- fusion Models with An Expert Transformer. arXiv preprint arXiv:2408.06072, 2025. 2, 4, 6, 8, 3

  55. [55]

    From Slow Bidirectional to Fast Autoregressive Video Diffusion Models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From Slow Bidirectional to Fast Autoregressive Video Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025. 2, 4, 3

  56. [56]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, et al. VideoLLaMA 3: Fron- tier Multimodal Foundation Models for Image and Video Understanding. arXiv preprint arXiv:2501.13106, 2025. 4, 1, 2, 3

  57. [57]

    SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

    Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, et al. SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 8652–8661, 2023. 2

  58. [58]

    Exploring Temporal Coherence for More General Video Face Forgery Detection

    Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, and Fang Wen. Exploring Temporal Coherence for More General Video Face Forgery Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15044–15054, 2021. 3, 5, 7, 8, 11

  59. [59]

    WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection

    Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection. In Proceedings of the 28th ACM International Conference on Multimedia, page 2382–2390, New York, NY, USA, 2020. Association for Computing Machinery. 3, 7

  60. [60]

    Technical Appendices Overview

    This document presents the technical implementation details and granular experimental analysis supporting the SynthForensics benchmark. The document is organized as follows: • Section 8 details the structured prompt generation and adaptation process, including the exact system instructions used for semantic alignment ...

  61. [61]

    The SynthForensics Benchmark: Extended Details

    This section presents the technical specifications and protocols governing the construction of the SynthForensics benchmark. We first detail the structured prompt generation pipeline, including the system instructions for semantic alignment and the negative prompt optimization strategies. Subsequently, we...

  62. [62]

    Prompt Rewriting: For failures in Semantic Coherence or Ethical Compliance, the textual prompt was manually rewritten to reduce ambiguity or enforce stricter safety constraints

  63. [63]

    Negative Prompt Augmentation: For Anatomical or Rendering artifacts, we augmented the model-specific negative prompts with targeted keywords (e.g., adding "deformed iris" or "glitch") to actively suppress the defect in the next generation pass

  64. [64]

    video-within-video

    Parameter Adjustment: For Temporal Artifacts or weak adherence to instructions, we fine-tuned the generation hyperparameters: • CFG Scale: Adjusted (typically +0.5 to +1.0) to strengthen semantic adherence. • Inference Steps: Increased (e.g., from 50 to 60) to resolve under-generated details. • Temporal Shift: Modified to stabilize motion trajectories...

  65. [65]

    Extended Results and Analysis In this section we present the evaluation methodology designed to execute a fair comparison across diverse detection architectures. We first outline the preprocessing and output standardization steps required to align frame-level and video-level models, and then define the primary metrics used to measure detection efficac...