pith. machine review for the scientific record.

arxiv: 2604.04029 · v1 · submitted 2026-04-05 · 💻 cs.CV

Recognition: no theorem link

ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity

Chao Shen, Hang Wang, Lei Zhang, Zhi-Qi Cheng

Pith reviewed 2026-05-13 17:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated video detection · temporal self-similarity · multimodal fusion · video forensics · anomaly detection · generative models

The pith

AI-generated videos exhibit anomalous temporal self-similarity because they follow deterministic prompt-driven trajectories, unlike the stochastic dynamics of real videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that AI-generated videos contain a detectable fingerprint of anomalous temporal self-similarity arising from their deterministic generation processes. This fingerprint appears as unnaturally repetitive correlations in visual, semantic, and cross-modal domains over time. By constructing similarity matrices from frames and their frame-wise descriptions and fusing them with cross-attention, the method captures global temporal anomalies that local artifact detectors miss. If correct, this approach would provide a more robust way to distinguish generated videos from authentic ones across various generation models.
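To make the triple-similarity representation concrete, here is a minimal sketch assuming per-frame visual embeddings and frame-wise description embeddings that share a D-dimensional space (as with CLIP-style encoders); the function name, the use of cosine similarity, and the shapes are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def triple_similarity(frame_emb: torch.Tensor, text_emb: torch.Tensor):
    """Hypothetical triple-similarity representation.

    frame_emb: (T, D) per-frame visual embeddings from some vision encoder.
    text_emb:  (T, D) embeddings of the frame-wise descriptions (e.g., captions).
    Returns three (T, T) matrices: visual, textual, and cross-modal similarity.
    """
    v = F.normalize(frame_emb, dim=-1)  # unit-norm rows -> dot product = cosine similarity
    t = F.normalize(text_emb, dim=-1)
    s_vis = v @ v.T      # frame-to-frame visual self-similarity
    s_txt = t @ t.T      # description-to-description semantic self-similarity
    s_cross = v @ t.T    # frame-to-description cross-modal similarity
    return s_vis, s_txt, s_cross
```

Under the ATSS hypothesis, generated clips would show unusually strong and regular off-diagonal structure in all three matrices relative to real footage.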

Core claim

The paper's core claim is that AIGVs follow deterministic anchor-driven trajectories from text or image prompts, inducing unnaturally repetitive correlations across visual and semantic domains. These correlations can be quantified by visual, textual, and cross-modal similarity matrices, which are encoded by dedicated Transformer encoders and integrated via bidirectional cross-attentive fusion to detect generated videos with higher accuracy than prior methods.

What carries the argument

Anomalous temporal self-similarity measured through a triple-similarity representation of visual, textual, and cross-modal matrices built from frame-wise descriptions, processed by Transformer encoders and fused with cross-attention.
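To illustrate the fusion step, the sketch below shows one plausible form of bidirectional cross-attention between two encoded streams, say the outputs of the visual and textual Transformer encoders. The module name, dimensions, mean pooling, and single-logit head are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class BiCrossAttentiveFusion(nn.Module):
    """Illustrative bidirectional cross-attention over two encoded streams."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 1)  # real-vs-generated logit

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (B, T, dim) token sequences from two dedicated Transformer encoders,
        # e.g., encodings of the visual and textual similarity matrices.
        a_ctx, _ = self.a_to_b(query=a, key=b, value=b)  # stream a attends to stream b
        b_ctx, _ = self.b_to_a(query=b, key=a, value=a)  # stream b attends to stream a
        pooled = torch.cat([a_ctx.mean(dim=1), b_ctx.mean(dim=1)], dim=-1)
        return self.classifier(pooled)  # (B, 1) detection logit
```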

If this is right

  • Detection performance improves in AP, AUC, and ACC on large benchmarks including GenVideo and VidProM.
  • The approach generalizes better across different video generation models than local-artifact methods.
  • Emphasis shifts to global temporal evolution captured by multimodal similarity rather than short-term inconsistencies.
  • The fusion of intra- and inter-modal dynamics provides a unified framework for quantifying generative determinism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Video generation systems could add controlled temporal noise to reduce the repetitive self-similarity pattern and evade detection.
  • The similarity-matrix approach might adapt to spotting AI-generated sequences in audio or motion capture data by swapping the input modalities.
  • Hybrid videos containing both real and generated segments could be tested to determine how the anomaly signal degrades with partial replacement.

Load-bearing premise

AI-generated videos always follow deterministic trajectories from fixed prompts that create repetitive temporal correlations absent in real videos.

What would settle it

Construct the visual, textual, and cross-modal similarity matrices for both real videos and videos from the latest generators, then measure whether the resulting anomaly scores show clear separation or substantial overlap.
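One hedged way to run that test: reduce each clip's similarity matrices to a scalar anomaly statistic and measure how well the real and generated score distributions separate. The statistic below (mean off-diagonal similarity) is a placeholder of our own, and ROC AUC is used only to quantify overlap; an AUC near 0.5 would suggest the fingerprint does not survive the latest generators.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def anomaly_score(sim: np.ndarray) -> float:
    """Placeholder statistic: mean off-diagonal entry of a (T, T) similarity matrix."""
    mask = ~np.eye(sim.shape[0], dtype=bool)
    return float(sim[mask].mean())

def score_separation(real_mats: list, fake_mats: list) -> float:
    """ROC AUC of the placeholder score at separating real from generated clips.
    Near 1.0 means clear separation; near 0.5 means substantial overlap."""
    scores = [anomaly_score(s) for s in real_mats + fake_mats]
    labels = [0] * len(real_mats) + [1] * len(fake_mats)
    return roc_auc_score(labels, scores)
```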

Figures

Figures reproduced from arXiv: 2604.04029 by Chao Shen, Hang Wang, Lei Zhang, Zhi-Qi Cheng.

Figure 1: Motivation of the proposed ATSS framework. …
Figure 2: The overall framework of ATSS. Given a video with …
Figure 3: t-SNE visualizations of 10 subsets on the GenVideo dataset: (a) Crafter, (b) Gen2, (c) HotShot, (d) Lavie, (e) ModelScope, …
Figure 4: Visualization of Attention Density Maps. Each row displays the attention weights for the visual, textual, and cross-modal …
Original abstract

AI-generated videos (AIGVs) have achieved unprecedented photorealism, posing severe threats to digital forensics. Existing AIGV detectors focus mainly on localized artifacts or short-term temporal inconsistencies, thus often fail to capture the underlying generative logic governing global temporal evolution, limiting AIGV detection performance. In this paper, we identify a distinctive fingerprint in AIGVs, termed anomalous temporal self-similarity (ATSS). Unlike real videos that exhibit stochastic natural dynamics, AIGVs follow deterministic anchor-driven trajectories (e.g., text or image prompts), inducing unnaturally repetitive correlations across visual and semantic domains. To exploit this, we propose the ATSS method, a multimodal detection framework that exploits this insight via a triple-similarity representation and a cross-attentive fusion mechanism. Specifically, ATSS reconstructs semantic trajectories by leveraging frame-wise descriptions to construct visual, textual, and cross-modal similarity matrices, which jointly quantify the inherent temporal anomalies. These matrices are encoded by dedicated Transformer encoders and integrated via a bidirectional cross-attentive fusion module to effectively model intra- and inter-modal dynamics. Extensive experiments on four large-scale benchmarks, including GenVideo, EvalCrafter, VideoPhy, and VidProM, demonstrate that ATSS significantly outperforms state-of-the-art methods in terms of AP, AUC, and ACC metrics, exhibiting superior generalization across diverse video generation models. Code and models of ATSS will be released at https://github.com/hwang-cs-ime/ATSS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that AI-generated videos (AIGVs) exhibit a distinctive fingerprint called anomalous temporal self-similarity (ATSS) arising from deterministic anchor-driven trajectories (e.g., fixed text/image prompts), in contrast to the stochastic dynamics of real videos. It proposes a multimodal detection framework that builds visual, textual, and cross-modal similarity matrices from frame-wise descriptions, encodes them with dedicated Transformer encoders, and fuses them via a bidirectional cross-attentive module to capture intra- and inter-modal temporal anomalies, reporting superior AP, AUC, and ACC on the GenVideo, EvalCrafter, VideoPhy, and VidProM benchmarks with strong generalization across generation models.

Significance. If the central claim is substantiated, the work would advance AIGV detection by shifting focus from localized artifacts or short-term inconsistencies to global temporal evolution, potentially improving robustness and cross-model generalization in digital forensics. The multimodal triple-similarity representation and planned code release would support reproducibility and further development.

major comments (2)
  1. [§3] §3 (Method): The similarity matrices are constructed directly from input frames and frame-wise descriptions without any explicit control or matching for motion statistics (e.g., optical-flow magnitude distributions) between real and generated videos. This leaves the core claim—that repetitive correlations reflect generative determinism rather than reduced temporal variance in prompt-conditioned clips—unverified and load-bearing for the generalization results.
  2. [§4] §4 (Experiments): No ablation studies isolating the contribution of the cross-attentive fusion versus individual similarity matrices, nor any failure-mode analysis on videos with matched motion statistics, are described. Without these, it is impossible to confirm that ATSS captures an intrinsic fingerprint rather than a motion artifact.
minor comments (2)
  1. [Abstract] Abstract: The claim of outperforming SOTA methods is stated without any numerical values for AP, AUC, or ACC; a one-sentence quantitative summary would improve clarity.
  2. [§3.1] Notation: The precise mathematical definitions of the three similarity matrices and the bidirectional cross-attentive fusion are introduced late; moving the key equations to §3.1 would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the validation of our core claim regarding anomalous temporal self-similarity in AIGVs. We address each major point below and will revise the manuscript accordingly to strengthen the evidence that ATSS reflects generative determinism rather than motion variance alone.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The similarity matrices are constructed directly from input frames and frame-wise descriptions without any explicit control or matching for motion statistics (e.g., optical-flow magnitude distributions) between real and generated videos. This leaves the core claim—that repetitive correlations reflect generative determinism rather than reduced temporal variance in prompt-conditioned clips—unverified and load-bearing for the generalization results.

    Authors: We agree that explicit matching of motion statistics would provide stronger isolation of the ATSS fingerprint. The textual and cross-modal similarity matrices incorporate semantic trajectories from frame-wise descriptions, which capture prompt-driven determinism beyond low-level motion variance; this is supported by our cross-model generalization results on benchmarks with diverse motion profiles. However, we acknowledge the concern is valid and will add a dedicated analysis section with optical-flow magnitude comparisons between real and generated videos, plus matched-subset experiments where feasible. revision: partial

  2. Referee: [§4] §4 (Experiments): No ablation studies isolating the contribution of the cross-attentive fusion versus individual similarity matrices, nor any failure-mode analysis on videos with matched motion statistics, are described. Without these, it is impossible to confirm that ATSS captures an intrinsic fingerprint rather than a motion artifact.

    Authors: We will incorporate the requested ablations in the revised version, including quantitative comparisons of performance using visual-only, textual-only, cross-modal-only matrices, and the full bidirectional cross-attentive fusion. We will also add failure-mode analysis on motion-matched video subsets (e.g., via optical-flow histogram matching) to demonstrate that ATSS detection remains robust when motion statistics are controlled. These additions will directly address whether the gains stem from the multimodal fusion of temporal anomalies. revision: yes
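For reference, a motion-matched subset of the kind proposed above could be assembled by comparing optical-flow magnitude histograms between real and generated clips and keeping only pairs whose histograms are close. The sketch below assumes OpenCV's Farnebäck flow and a chi-square distance threshold; both are illustrative choices rather than the authors' planned protocol.

```python
import cv2
import numpy as np

def flow_magnitude_histogram(frames, bins: int = 32) -> np.ndarray:
    """Flow-magnitude histogram for a clip given as a list of grayscale uint8 frames."""
    mags = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).ravel())
    hist, _ = np.histogram(np.concatenate(mags), bins=bins,
                           range=(0.0, 20.0), density=True)
    return hist

def motion_matched(clip_a, clip_b, max_dist: float = 0.1) -> bool:
    """Keep a real/generated pair only if their flow histograms are close (chi-square distance)."""
    ha, hb = flow_magnitude_histogram(clip_a), flow_magnitude_histogram(clip_b)
    chi2 = 0.5 * np.sum((ha - hb) ** 2 / (ha + hb + 1e-8))
    return chi2 < max_dist
```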

Circularity Check

0 steps flagged

No significant circularity; core representations are derived directly from video content, independent of detection labels

Full rationale

The ATSS pipeline constructs visual, textual, and cross-modal similarity matrices directly from input frames and frame-wise descriptions. These matrices quantify temporal correlations without reference to detection labels or classifier outputs. Transformer encoders and bidirectional cross-attentive fusion operate on these fixed representations to produce detection scores. No equation or step reduces by construction to a fitted parameter renamed as prediction, nor does any load-bearing claim rely on self-citation chains that presuppose the target result. Performance is evaluated against external benchmarks, on held-out data from GenVideo, EvalCrafter, VideoPhy, and VidProM. The central hypothesis (deterministic vs. stochastic dynamics) is tested empirically rather than defined into existence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that generative models produce deterministic temporal trajectories unlike natural stochastic dynamics; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: AIGVs follow deterministic anchor-driven trajectories inducing unnaturally repetitive correlations
    Invoked as the basis for the anomalous self-similarity fingerprint.

pith-pipeline@v0.9.0 · 5570 in / 1127 out tokens · 49353 ms · 2026-05-13T17:11:54.995509+00:00 · methodology

