pith. machine review for the scientific record.

arxiv: 2605.06912 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: no theorem link

Advancing Reliable Synthetic Video Detection: Insights from the SAFE Challenge

Kirill Trapeznikov, Gabriel Mancino-Ball, Jonathan Li, Paul Cummer, Jai Aslam, Danial Samadi Vahdati, Tai Nguyen, Matthew C. Stamm, Peter Bautista, Michael Davinroy, Laura Cassani, Jill Crisman


Pith reviewed 2026-05-11 00:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords synthetic video detection · generative video · deepfake detection · post-processing robustness · cross-generator generalization · media forensics · challenge evaluation · video authenticity

The pith

Synthetic video detectors advance across generators but not edits

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports results from the SAFE Synthetic Video Detection Challenge, which ran for 90 days with over 600 submissions testing algorithms on 6000 videos from 13 generative models matched to real footage. Detectors showed improved ability to identify synthetic content from previously unseen generators, yet performance dropped sharply when videos underwent post-processing such as resizing, re-compression, and motion blur. This evaluation matters because generative video tools are becoming widely accessible, making reliable detection essential for verifying media authenticity. The challenge used blind testing on a 20-hour dataset to measure both cross-generator generalization and robustness to common manipulations.
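To make the evaluation concrete: a blind challenge of this shape ultimately scores per-video detector outputs against held-back ground truth, reporting an overall ROC/AUC plus per-generator breakdowns. The sketch below is a minimal illustration of that kind of scoring, not the organizers' actual code; the input format, the "real" label convention, and the choice of TPR at 1% FPR are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_detector(scores, labels, generators):
    """Minimal challenge-style scoring sketch (hypothetical format).

    scores:     detector outputs in [0, 1], higher = more likely synthetic
    labels:     ground truth, 1 = synthetic, 0 = real
    generators: source name per video, "real" for genuine footage
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    generators = np.asarray(generators)

    overall_auc = roc_auc_score(labels, scores)

    # Per-generator AUC: each generator's synthetic videos against all
    # real videos, which is where cross-generator generalization surfaces.
    real = generators == "real"
    per_generator = {}
    for gen in np.unique(generators[~real]):
        mask = real | (generators == gen)
        per_generator[gen] = roc_auc_score(labels[mask], scores[mask])

    # TPR at a fixed low FPR, the regime emphasized when FPR is plotted
    # on a log scale (as in the challenge's ROC figure).
    fpr, tpr, _ = roc_curve(labels, scores)
    tpr_at_1pct_fpr = float(np.interp(0.01, fpr, tpr))
    return overall_auc, per_generator, tpr_at_1pct_fpr
```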

Core claim

The SAFE challenge demonstrates that contemporary synthetic video detection methods exhibit measurable progress in cross-generator generalization, successfully identifying content produced by diverse state-of-the-art models, while displaying persistent vulnerabilities when the same content is subjected to post-processing operations including resizing, re-compression, motion blur, and similar transformations.

What carries the argument

The two-task challenge design, consisting of synthetic content detection from diverse generators and detection after post-processing, evaluated under fully blind conditions on a matched real-synthetic dataset of 6000 videos.
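The Task 2 operations named in the paper (resizing, re-compression, motion blur) are standard video manipulations, so a rough augmentation pass can be sketched with off-the-shelf tools. The parameter values, the horizontal blur kernel, and the ffmpeg re-encode below are illustrative assumptions, not the challenge's actual pipeline:

```python
import subprocess
import cv2
import numpy as np

def motion_blur(frame, kernel_size=9):
    """Approximate horizontal motion blur with a 1-D averaging kernel."""
    kernel = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    kernel[kernel_size // 2, :] = 1.0 / kernel_size
    return cv2.filter2D(frame, -1, kernel)

def augment_video(src, dst, scale=0.5, crf=32):
    """Resize + motion-blur each frame, then re-compress with H.264.

    scale and crf are illustrative; harsher values degrade the video
    more aggressively (hypothetical defaults, not the challenge's).
    """
    cap = cv2.VideoCapture(src)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) * scale)
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) * scale)

    tmp = dst + ".raw.mp4"
    out = cv2.VideoWriter(tmp, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    ok, frame = cap.read()
    while ok:
        out.write(motion_blur(cv2.resize(frame, (w, h))))
        ok, frame = cap.read()
    cap.release()
    out.release()

    # Re-compression: a higher CRF means stronger lossy compression.
    subprocess.run(["ffmpeg", "-y", "-i", tmp, "-c:v", "libx264",
                    "-crf", str(crf), dst], check=True)
```

Harsher scale and crf settings tend to suppress the high-frequency traces many detectors key on, which is consistent with the sharp Task 2 performance drops the challenge reports.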

Load-bearing premise

The challenge dataset and its chosen post-processing operations sufficiently represent the range of real-world conditions that synthetic video detectors will encounter.

What would settle it

A new detector that maintained consistently high accuracy on post-processed test videos in an independent evaluation, one using different operations or additional unseen generators, would undermine the reported persistent vulnerabilities.

Figures

Figures reproduced from arXiv: 2605.06912 by Danial Samadi Vahdati, Gabriel Mancino-Ball, Jai Aslam, Jill Crisman, Jonathan Li, Kirill Trapeznikov, Laura Cassani, Matthew C. Stamm, Michael Davinroy, Paul Cummer, Peter Bautista, Tai Nguyen.

Figure 1: Real Video Sources: a sample frame from each video source.
Figure 2: Synthetic Video Sources: 10 different frame samples matching the content of the real videos for each TI2V generator (columns).
Figure 3: Generation pipeline used for creating the dataset…
Figure 4: A sample of frames showing the visual quality per augmentation…
Figure 5: Task 1 Results: [Left] ROC curve for all teams (with FPR plotted on a log scale). [Right] AUC vs. inference time (on a log scale).
Figure 6: True vs. false positive rates for each augmentation type.
Original abstract

The proliferation of generative video technologies has intensified the need for reliable methods to detect and characterize synthetic media. To address this challenge, we organized the SAFE: Synthetic Video Detection Challenge (https://safe-video-2025.dsri.org), co-located with the Authenticity and Provenance in the Age of Generative AI (APAI) Workshop at ICCV 2025. The competition invited participants to develop and evaluate algorithms capable of distinguishing real from synthetic videos under fully blind evaluation conditions with over 600 submissions from 12 teams over a 90 day span. Hosted on the Hugging Face platform, the challenge comprised two primary tasks: (1) detection of synthetic video content generated by diverse state-of-the-art models, and (2) detection of synthetic content following common post-processing operations such as resizing, re-compression, motion blur and others. The challenge data consisted of 13 modern high quality synthetic video models with generated content matched to real videos from 21 diverse and challenge sources, all adding up to 20 hours of 6,000 video samples. This paper describes the challenge design, dataset construction, evaluation methodology, and outcomes, offering insights into the generalization and robustness of contemporary synthetic video detection methods. Our findings highlight measurable progress in cross-generator generalization but also persistent vulnerabilities to post-processing artifacts. https://safe-video-2025.dsri.org

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript describes the organization and results of the SAFE: Synthetic Video Detection Challenge co-located with the APAI Workshop at ICCV 2025. It details the challenge design involving two tasks—synthetic video detection and detection under post-processing operations—using a dataset of 6,000 video samples (20 hours) from 13 synthetic generators matched to real videos from 21 sources. With over 600 submissions from 12 teams evaluated blindly on the Hugging Face platform, the paper reports measurable progress in cross-generator generalization while noting persistent vulnerabilities to post-processing artifacts such as resizing, re-compression, and motion blur.

Significance. If the detailed results and analysis support the high-level claims, this work is significant for providing a large-scale blind benchmark that evaluates both generalization across generators and robustness to common post-processing. Such benchmarks are timely for the synthetic media detection community and can inform the development of more reliable detectors. The public challenge platform and dataset scale add practical value beyond a standard methods paper.

major comments (1)
  1. [Abstract] The central claim of 'measurable progress in cross-generator generalization but also persistent vulnerabilities to post-processing artifacts' depends on the post-processing operations (resizing, re-compression, motion blur and others) serving as a valid proxy for real-world conditions. The manuscript provides no external validation, such as statistical comparisons of artifact distributions against platform-collected videos or ablations with stronger real-world transforms, leaving this assumption untested and load-bearing for the robustness conclusions. One form such a comparison could take is sketched after the minor comments below.
minor comments (2)
  1. [Abstract] The phrase '21 diverse and challenge sources' appears to contain a typo or unclear phrasing; clarify whether this refers to 'challenging sources' or another intended meaning.
  2. [Abstract] The abstract summarizes outcomes at a high level but does not include specific metrics, error bars, or statistical analysis; the full results section should provide these to allow verification of the 'measurable progress' claim.
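The validation asked for in the major comment could, for instance, compare a simple compression-artifact statistic between challenge-processed videos and platform-collected ones. The blockiness measure and two-sample KS test below are one hypothetical instantiation of that idea, not anything the paper or the referee specifies:

```python
import numpy as np
from scipy.stats import ks_2samp

def artifact_statistic(frames):
    """One crude compression-artifact proxy: 8x8 blockiness, i.e. the
    excess luminance discontinuity along codec block boundaries.
    frames is assumed to be a list of HxWx3 uint8 arrays."""
    stats = []
    for f in frames:
        gray = f.mean(axis=2).astype(np.float32)
        dcol = np.abs(np.diff(gray, axis=1))   # horizontal gradients
        boundary = dcol[:, 7::8].mean()        # gradients on the 8-px grid
        interior = dcol.mean()
        stats.append(boundary / (interior + 1e-6))
    return np.array(stats)

def compare_suites(challenge_frames, wild_frames):
    """Two-sample KS test: do the challenge's post-processing artifacts
    follow the same distribution as in-the-wild, platform-collected
    videos? A small p-value suggests the suites differ."""
    a = artifact_statistic(challenge_frames)
    b = artifact_statistic(wild_frames)
    return ks_2samp(a, b)
```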

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive feedback on our manuscript describing the SAFE challenge. We have carefully considered the major comment and provide our point-by-point response below. We are prepared to make revisions as outlined.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'measurable progress in cross-generator generalization but also persistent vulnerabilities to post-processing artifacts' depends on the post-processing operations (resizing, re-compression, motion blur and others) serving as a valid proxy for real-world conditions. The manuscript provides no external validation, such as statistical comparisons of artifact distributions against platform-collected videos or ablations with stronger real-world transforms, leaving this assumption untested and load-bearing for the robustness conclusions.

    Authors: We agree that stronger external validation of the post-processing operations as real-world proxies would strengthen the robustness claims. These operations were chosen because they represent frequently encountered manipulations in video distribution pipelines, drawing on prior work in media forensics. Nevertheless, we recognize the absence of direct artifact-distribution comparisons or additional ablations. In the revised manuscript, we will update the abstract to avoid overgeneralizing the real-world implications and add a subsection to the discussion that explicitly addresses the limitations of the chosen post-processing suite and proposes future directions for more comprehensive real-world testing. This revision will be made without altering the core challenge results.

    revision: partial

Circularity Check

0 steps flagged

Descriptive challenge report exhibits no circularity

full rationale

The paper is a descriptive summary of an external competition (SAFE challenge) including dataset construction, evaluation methodology, and empirical outcomes from participant submissions. It advances no mathematical derivations, predictions, fitted parameters, or first-principles results that could reduce to inputs by construction. Claims regarding cross-generator generalization and post-processing vulnerabilities are presented as observational findings from the blind evaluation on the provided test set, without self-definitional loops, ansatzes smuggled via citation, or load-bearing self-citations. The analysis is self-contained, benchmarked against the competition's own results rather than against external claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical challenge report paper with no mathematical derivations, fitted parameters, axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5594 in / 1065 out tokens · 84778 ms · 2026-05-11T00:48:39.567924+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 15 canonical work pages · 6 internal anchors

  [1] A. Rössler, et al., FaceForensics++: Learning to detect manipulated facial images, in: Proc. ICCV, 2019, pp. 1–11.

  [2] B. Dolhansky, et al., The DeepFake Detection Challenge (DFDC) dataset, arXiv (2020). arXiv:2006.07397.

  [3] L. Jiang, W. Wu, C. C. Li, C. Qian, C. C. Loy, DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection, in: Proc. CVPR, 2020, pp. 2889–2898.

  [4] P. Kwon, et al., KoDF: A large-scale Korean deepfake detection dataset, in: Proc. ICCV, 2021, pp. 10744–10753.

  [5] B. Zi, Z. Song, K. Zhang, W. Luo, WildDeepfake: A challenging real-world dataset for deepfake detection, arXiv (2021). arXiv:2101.01456.

  [6] H. Chen, X. Hong, et al., DeMamba: AI-generated video detection on million-scale GenVideo benchmark, arXiv (2024). arXiv:2405.19707.

  [7] Z. Yan, et al., DF40: Toward next-generation deepfake detection, in: NeurIPS Datasets and Benchmarks, 2024.

  [8] R. Bharadwaj, et al., VANE-Bench: Video anomaly evaluation benchmark for conversational LMMs, arXiv (2024). arXiv:2406.10326.

  [9] N. A. Chandra, et al., Deepfake-Eval-2024: A multimodal in-the-wild benchmark of deepfakes circulated in 2024, arXiv (2025). arXiv:2503.02857.

  [10] A. Batra, S. Kumar, et al., SocialDF: Benchmark dataset and detection model for mitigating deepfakes on social media, arXiv (2025). arXiv:2506.05538.

  [11] F.-A. Croitoru, et al., MAVOS-DD: Multilingual audio-video open-set deepfake detection benchmark, arXiv (2025). arXiv:2505.11109.

  [12] Partnership on AI, The Deepfake Detection Challenge: Insights and recommendations for AI and media integrity, Tech. rep., Partnership on AI, accessed 2025 (2020).

  [13] Hugging Face Competitions (2025). URL https://huggingface.co/docs/competitions/en/index

  [14] A. Yang, et al., Qwen3 technical report, arXiv (2025). arXiv:2505.09388.

  [15] Team Wan, et al., Wan: Open and advanced large-scale video generative models, arXiv (2025). arXiv:2503.20314.

  [16] Veo 2 text+image-to-video generator (2024). URL https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/veo/2-0-generate

  [17] Pixverse 4.5 text+image-to-video generator (2025). URL https://platform.pixverse.ai/onboard

  [18] Kling 2.0 text+image-to-video generator (2025). URL https://app.klingai.com/

  [19] W. Kong, et al., HunyuanVideo: A systematic framework for large video generative models, arXiv (2024). arXiv:2412.03603.

  [20] L. Zhang, et al., Frame context packing and drift prevention in next-frame-prediction video diffusion models, in: NeurIPS, 2025.

  [21] F. Bao, et al., Vidu: A highly consistent, dynamic and skilled text-to-video generator with diffusion models, arXiv (2024). arXiv:2405.04233.

  [22] Minimax video-01 text+image-to-video generator (2024). URL https://platform.minimax.io

  [23] Y. Gao, et al., Seedance 1.0: Exploring the boundaries of video generation models, arXiv (2025). arXiv:2506.09113.

  [24] Runway Gen-4.0 Turbo text+image-to-video generator (2025). URL https://docs.dev.runwayml.com/api/

  [25] Z. Tang, et al., The 9th AI City Challenge, in: Proc. ICCV Workshops, 2025, pp. 5467–5476.

  [26] S. Wang, et al., The 8th AI City Challenge, in: Proc. CVPR Workshops, 2024, pp. 7261–7272.

  [27] Y. Wang, et al., MCBLT: Multi-camera multi-object 3D tracking in long videos, arXiv (2024). arXiv:2412.00692.

  [28] Y. Chen, et al., HunyuanVideo-Avatar: High-fidelity audio-driven human animation for multiple characters, arXiv (2025). arXiv:2505.20156.

  [29] X. Ren, et al., Cosmos-Drive-Dreams: Scalable synthetic driving data generation with world foundation models, arXiv (2025). arXiv:2506.09042.

  [30] J. Bai, et al., Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, arXiv (2023). arXiv:2308.12966.

  [31] D. S. Vahdati, et al., Beyond deepfake images: Detecting AI-generated videos, in: CVPRW, 2024, pp. 4397–4408.

  [32] R. Corvi, et al., Seeing what matters: Generalizable AI-generated video detection with forensic-oriented augmentation, in: NeurIPS, 2025.