pith. machine review for the scientific record.

arxiv: 2605.00630 · v1 · submitted 2026-05-01 · 💻 cs.CV · cs.MM · eess.IV

Recognition: unknown

CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:18 UTC · model grok-4.3

classification 💻 cs.CV · cs.MM · eess.IV
keywords AI-generated video detection · cross-modal temporal artifacts · CMTA · video authenticity · CLIP · BLIP · temporal modeling · multimodal deepfake detection

The pith

AI-generated videos can be identified by their unnaturally stable alignment between visual content and textual semantics across frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that real videos show natural fluctuations in how well their frames match implied textual descriptions over time, while AI-generated videos keep this alignment too steady because they follow fixed input prompts. This difference, called the cross-modal temporal artifact, serves as a fingerprint that previous methods missed by focusing only on visual or motion patterns. The authors build a detection system that extracts frame captions with BLIP, aligns them to the visuals with CLIP, and then tracks both coarse changes with a GRU and fine inter-frame details with a Transformer. If the claim holds, detectors gain a more reliable way to spot synthetic video even when the generator or content changes.
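To make the described pipeline concrete, here is a minimal sketch, assuming off-the-shelf Hugging Face BLIP and CLIP checkpoints and a per-frame cosine-similarity trajectory; the checkpoint names and the trajectory definition are our assumptions, not the authors' released implementation.

```python
# Minimal sketch of a CMTA-style signal: caption each frame with BLIP, embed frame and
# caption with CLIP, and track the per-frame visual-textual alignment over time.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, CLIPProcessor, CLIPModel

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def alignment_trajectory(frames: list[Image.Image]) -> torch.Tensor:
    """Per-frame cosine similarity between CLIP image and caption embeddings."""
    sims = []
    for frame in frames:
        # 1) caption the frame with BLIP
        cap_ids = blip.generate(**blip_proc(images=frame, return_tensors="pt"), max_new_tokens=30)
        caption = blip_proc.decode(cap_ids[0], skip_special_tokens=True)
        # 2) embed the frame and its caption with CLIP
        img_emb = clip.get_image_features(**clip_proc(images=frame, return_tensors="pt"))
        txt_emb = clip.get_text_features(**clip_proc(text=[caption], return_tensors="pt", padding=True))
        sims.append(torch.nn.functional.cosine_similarity(img_emb, txt_emb).item())
    # Low temporal variance in this trajectory is the hypothesized AIGV signature.
    return torch.tensor(sims)
```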

Core claim

We identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. The CMTA framework captures these artifacts through joint cross-modal embedding and multi-grained temporal modeling with BLIP-generated captions, CLIP representations, a GRU branch for coarse alignment fluctuations, and a Transformer encoder for fine inter-frame variations.

What carries the argument

Cross-modal temporal artifact (CMTA), the unnaturally stable temporal trajectories of visual-textual semantic alignment in AI-generated videos versus natural fluctuations in real videos.

Load-bearing premise

The observed difference in cross-modal temporal stability is a reliable fingerprint that holds across diverse content, generators, and real-world conditions rather than being limited to specific datasets or prompts.

What would settle it

Finding AI-generated videos whose cross-modal alignment fluctuates at rates statistically indistinguishable from real videos, or real videos whose alignment stays artificially stable, would make the detector fail on those cases.
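A hedged sketch of what "statistically indistinguishable" could look like in practice, assuming the fluctuation statistic is the standard deviation of frame-to-frame alignment changes and using a Mann-Whitney U test; neither choice is specified by the paper.

```python
# Compare how much the alignment trajectory fluctuates in real vs. generated videos.
import numpy as np
from scipy.stats import mannwhitneyu

def fluctuation(trajectory: np.ndarray) -> float:
    """Magnitude of frame-to-frame changes in visual-textual alignment."""
    return float(np.std(np.diff(trajectory)))

def indistinguishable(real_trajs, fake_trajs, alpha: float = 0.05) -> bool:
    """True if generated videos do not fluctuate significantly less than real ones."""
    real_f = [fluctuation(t) for t in real_trajs]
    fake_f = [fluctuation(t) for t in fake_trajs]
    # One-sided test: do real videos fluctuate more than generated ones?
    _, p = mannwhitneyu(real_f, fake_f, alternative="greater")
    return p >= alpha  # failing to separate the groups would undercut the CMTA premise
```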

Figures

Figures reproduced from arXiv: 2605.00630 by Chao Shen, Chenhao Lin, Cong Wang, Hang Wang, Lei Zhang, Minghui Yang.

Figure 1: Motivation of the proposed CMTA framework. (Left) …
Figure 2: The pipeline of the proposed CMTA framework. Given an input video, CMTA first leverages BLIP to generate frame …
Figure 3: Illustration of the GRU cell for coarse-grained temporal …
Figure 4: t-SNE visualization of features extracted before the classification head. Blue and orange points represent real and …
Original abstract

The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. To bridge this gap, we propose the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically, CMTA leverages BLIP to generate frame-level image captions and utilizes CLIP to extract corresponding visual-textual representations. A coarse-grained temporal modeling branch is then designed to characterize temporal fluctuations in cross-modal alignment with a GRU. In parallel, a fine-grained branch is constructed to capture intricate inter-frame variations from integrated visual-textual features with a Transformer encoder. Extensive experiments on 40 subsets across four large-scale datasets, including GenVideo, EvalCrafter, VideoPhy, and VidProM, validate that our approach sets a new state-of-the-art while exhibiting superior cross-generator generalization. Code and models of CMTA will be released at https://github.com/hwang-cs-ime/CMTA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to identify a distinctive fingerprint in AI-generated videos (AIGVs) termed cross-modal temporal artifact (CMTA), arising from unnaturally stable semantic trajectories in cross-modal alignment (due to prompt guidance) versus natural temporal fluctuations in real videos. It proposes the CMTA framework that generates frame captions with BLIP, extracts aligned visual-textual embeddings with CLIP, applies a GRU branch for coarse-grained modeling of alignment fluctuations, and a Transformer encoder for fine-grained inter-frame variations in integrated features. Extensive experiments on 40 subsets across GenVideo, EvalCrafter, VideoPhy, and VidProM are said to demonstrate new state-of-the-art detection with superior cross-generator generalization.
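As a reading aid for the cross-generator claim summarized above, here is a hedged sketch of that evaluation pattern: train on videos from some generators, report AUC on videos from held-out generators. The scikit-learn-style detector interface and the AUC call are our assumptions; the paper's actual protocol, splits, and baselines live in its experiments section.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cross_generator_auc(detector, subsets: dict[str, tuple[np.ndarray, np.ndarray]],
                        held_out: set[str]) -> dict[str, float]:
    """subsets maps a generator name to (features, labels); labels are 1 for AIGV, 0 for real."""
    train_X = np.concatenate([X for g, (X, y) in subsets.items() if g not in held_out])
    train_y = np.concatenate([y for g, (X, y) in subsets.items() if g not in held_out])
    detector.fit(train_X, train_y)  # any classifier exposing fit / predict_proba
    return {g: roc_auc_score(y, detector.predict_proba(X)[:, 1])
            for g, (X, y) in subsets.items() if g in held_out}
```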

Significance. If the CMTA signal proves to be an intrinsic, generator-invariant property rather than a modeling artifact, the work could meaningfully advance AIGV detection by introducing cross-modal temporal cues that complement existing uni-modal and spatiotemporal approaches, with potential benefits for generalization. The explicit commitment to releasing code and models is a clear strength that supports reproducibility and community validation.

major comments (2)
  1. Abstract: The central claim of setting a new state-of-the-art with superior cross-generator generalization on 40 subsets is presented without any reference to the specific baselines, evaluation metrics, statistical significance tests, error bars, data-split protocols, or controls for confounds such as prompt leakage. This absence directly undermines assessment of whether the reported gains are robust and load-bearing for the generalization argument.
  2. Proposed approach (BLIP captioning step): The framework relies on BLIP (pretrained predominantly on real image-text pairs) to produce frame captions before CLIP alignment and temporal modeling. This creates a plausible alternative explanation that the observed cross-modal stability difference is at least partly induced by BLIP's systematic response to distribution shift in AIGVs (e.g., more generic or temporally consistent captions) rather than an intrinsic generator property. No ablation with alternative captioners or analysis of caption variability is mentioned, threatening the cross-generator results on the 40 subsets.
minor comments (1)
  1. Abstract: The integration of the coarse-grained GRU branch and fine-grained Transformer branch into the final classifier is described only at a high level; a brief statement on fusion (e.g., concatenation, attention) would improve clarity.
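Since the fusion step is only described at a high level, here is one plausible dual-branch head with concatenation fusion, written as a sketch under our own assumptions (hidden sizes, mean pooling, and the fusion choice itself), not as the authors' implementation.

```python
# A GRU over the scalar alignment trajectory (coarse branch), a Transformer encoder over
# fused per-frame visual-textual features (fine branch), concatenated into a classifier.
import torch
import torch.nn as nn

class DualBranchDetector(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        # coarse branch: GRU over the per-frame alignment scores (one scalar per frame)
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        # fine branch: Transformer encoder over per-frame integrated visual-textual features
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # fusion by concatenation, then a linear real-vs-generated classifier
        self.classifier = nn.Linear(hidden + feat_dim, 2)

    def forward(self, alignment: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # alignment: (B, T) cosine-similarity trajectory; frame_feats: (B, T, feat_dim)
        _, h = self.gru(alignment.unsqueeze(-1))        # h: (1, B, hidden)
        coarse = h[-1]                                  # (B, hidden)
        fine = self.encoder(frame_feats).mean(dim=1)    # (B, feat_dim), mean-pooled over frames
        return self.classifier(torch.cat([coarse, fine], dim=-1))
```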

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment in detail below, providing clarifications from the manuscript and outlining revisions where the concerns identify areas for improvement. We believe these responses strengthen the presentation of our claims regarding CMTA without altering the core contributions.

Point-by-point responses
  1. Referee: Abstract: The central claim of setting a new state-of-the-art with superior cross-generator generalization on 40 subsets is presented without any reference to the specific baselines, evaluation metrics, statistical significance tests, error bars, data-split protocols, or controls for confounds such as prompt leakage. This absence directly undermines assessment of whether the reported gains are robust and load-bearing for the generalization argument.

    Authors: We agree that the abstract, due to length constraints, summarizes the high-level claims without enumerating every experimental detail. The manuscript provides these specifics in Section 4 (Experiments), including comparisons against multiple baselines from prior AIGV detection literature, primary use of AUC as the metric, descriptions of the 40 subsets drawn from GenVideo, EvalCrafter, VideoPhy, and VidProM, and the cross-generator evaluation protocol that holds out generators and uses dataset-provided train/test splits. Prompt leakage is controlled via the standard held-out prompt setups in those datasets. While we report averaged results across runs, explicit error bars and formal significance tests can be added for clarity. We will revise the abstract to briefly reference the evaluation scope (e.g., 'outperforming prior methods across 40 subsets with superior cross-generator AUC') and direct readers to Section 4 for full protocols, metrics, and controls. This preserves abstract conciseness while improving assessability. revision: partial

  2. Referee: Proposed approach (BLIP captioning step): The framework relies on BLIP (pretrained predominantly on real image-text pairs) to produce frame captions before CLIP alignment and temporal modeling. This creates a plausible alternative explanation that the observed cross-modal stability difference is at least partly induced by BLIP's systematic response to distribution shift in AIGVs (e.g., more generic or temporally consistent captions) rather than an intrinsic generator property. No ablation with alternative captioners or analysis of caption variability is mentioned, threatening the cross-generator results on the 40 subsets.

    Authors: This is a substantive concern we should have addressed more explicitly. Our design choice of BLIP follows its common use for frame-level captioning in video tasks, and the CMTA hypothesis focuses on how prompt-guided AIGV generation produces visual content with unnaturally stable semantics that manifest in cross-modal alignment trajectories. Nevertheless, to rule out captioner-specific bias, we will incorporate an ablation in the revision that replaces BLIP with an alternative captioning pipeline (e.g., a fine-tuned LLaVA model or direct visual feature integration without explicit captions) on representative subsets. We will also add quantitative analysis of caption variability, such as frame-to-frame semantic embedding distances, to show that the stability difference persists and is larger in AIGVs than real videos regardless of the caption source. These additions will be placed in Section 3 and the experiments, directly supporting the cross-generator generalization claims. revision: yes
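A minimal sketch of the caption-variability analysis promised in the second response, assuming a generic sentence-embedding model and mean adjacent-frame cosine distance as the drift statistic; the authors' revision may use different embeddings or statistics.

```python
# Measure how much frame captions drift over time; under the CMTA hypothesis, real videos
# should show larger drift than AIGVs, and the gap should persist if BLIP is swapped out.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed, any sentence encoder works

def caption_variability(captions: list[str]) -> float:
    """Mean cosine distance between embeddings of captions from consecutive frames."""
    emb = encoder.encode(captions, normalize_embeddings=True)  # (T, D), unit-norm rows
    adjacent_cos = np.sum(emb[:-1] * emb[1:], axis=1)          # cosine similarity per adjacent pair
    return float(np.mean(1.0 - adjacent_cos))                  # higher means captions drift more
```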

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent cross-modal modeling

Full rationale

The paper presents an empirical detection method that extracts frame captions via off-the-shelf BLIP, aligns them with CLIP embeddings, then applies standard GRU and Transformer encoders to model temporal fluctuations in cross-modal alignment. No equations, parameters, or uniqueness claims are shown to reduce by construction to fitted inputs or self-citations. The central fingerprint (stable semantic trajectories in AIGVs) is posited as an observed phenomenon and then measured by the proposed architecture; the architecture does not define or presuppose the fingerprint. Experiments on 40 held-out subsets across four external datasets provide independent validation rather than tautological confirmation. This is a standard self-contained ML detection pipeline with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review yields limited visibility into exact training details. The framework relies on pre-trained vision-language models and standard temporal architectures whose assumptions are inherited rather than newly derived.

free parameters (1)
  • GRU and Transformer hyperparameters
    Trained end-to-end for the detection task; specific values not stated in abstract.
axioms (1)
  • domain assumption: Real videos exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations
    Stated directly as the contrast that defines the CMTA fingerprint.
invented entities (1)
  • Cross-modal temporal artifact (CMTA): no independent evidence
    purpose: Distinctive fingerprint for distinguishing AIGVs from real videos
    Newly named and operationalized in this work via the dual-branch temporal modeling.

pith-pipeline@v0.9.0 · 5580 in / 1281 out tokens · 47100 ms · 2026-05-09T20:18:08.928766+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 11 canonical work pages · 2 internal anchors

[1] Google DeepMind, “Veo,” https://deepmind.google/technologies/veo/, 2024.
[2] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman et al., “Video generation models as world simulators,” OpenAI Blog, vol. 1, no. 8, p. 1, 2024.
[3] A. Germanidis, “Gen-2: Generate novel videos with text, images or video clips,” 2023.
[4] J. Bai, M. Lin, G. Cao, and Z. Lou, “AI-generated video detection via spatial-temporal anomaly learning,” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2024, pp. 460–470.
[5] H. Chen, Y. Hong, Z. Huang, Z. Xu, Z. Gu, Y. Li, J. Lan, H. Zhu, J. Zhang, W. Wang et al., “DeMamba: AI-generated video detection on million-scale GenVideo benchmark,” arXiv preprint arXiv:2405.19707, 2024.
[6] L. Ma, Z. Yan, Q. Guo, Y. Liao, H. Yu, and P. Zhou, “Detecting AI-generated video via frame consistency,” 2025 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:270212645
[7] C. Zheng, R. Suo, C. Lin, Z. Zhao, L. Yang, S. Liu, M. Yang, C. Wang, and C. Shen, “D3: Training-free AI-generated video detection using second-order features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 12852–12862.
[8] S. Zhang, Z. Lian, J. Yang, D. Li, G. Pang, F. Liu, B. Han, S. Li, and M. Tan, “Physics-driven spatiotemporal modeling for AI-generated video detection,” in Advances in Neural Information Processing Systems, 2025.
[9] C. Internò, R. Geirhos, M. Olhofer, S. Liu, B. Hammer, and D. Klindt, “AI-generated video detection via perceptual straightening,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=LsmUgStXby
[10] Y. Zheng, J. Bao, D. Chen, M. Zeng, and F. Wen, “Exploring temporal coherence for more general video face forgery detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15044–15054.
[11] Z. Gu, Y. Chen, T. Yao, S. Ding, J. Li, F. Huang, and L. Ma, “Spatiotemporal inconsistency learning for deepfake video detection,” in Proceedings of the 29th ACM International Conference on Multimedia, ser. MM ’21, 2021, pp. 3473–3481.
[12] F. Nie, J. Ni, J. Zhang, B. Zhang, and W. Zhang, “DIP: Diffusion learning of inconsistency pattern for general deepfake detection,” IEEE Transactions on Multimedia, 2024.
[13] G. Pang, B. Zhang, Z. Teng, Z. Qi, and J. Fan, “MRE-Net: Multi-rate excitation network for deepfake video detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 8, pp. 3663–3676, 2023.
[14] Y. Chen, N. Akhtar, N. A. H. Haldar, and A. Mian, “Deepfake detection with spatio-temporal consistency and attention,” in 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2022, pp. 1–8.
[15] T. Kim, J. Choi, Y. Jeong, H. Noh, J. Yoo, S. Baek, and J. Choi, “Beyond spatial frequency: Pixel-wise temporal frequency-based deepfake video detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 11198–11207.
[16] J. Choi, T. Kim, Y. Jeong, S. Baek, and J. Choi, “Exploiting style latent flows for generalizing deepfake video detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1133–1143.
[17] Z. Chen, X. Liao, X. Wu, and Y. Chen, “Compressed deepfake video detection based on 3D spatiotemporal trajectories,” in 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2024, pp. 1–8.
[18] C. Zhao, C. Wang, G. Hu, H. Chen, C. Liu, and J. Tang, “ISTVT: Interpretable spatial-temporal video transformer for deepfake detection,” IEEE Transactions on Information Forensics and Security, vol. 18, pp. 1335–1348, 2023.
[19] Y. Xu, J. Liang, G. Jia, Z. Yang, Y. Zhang, and R. He, “TALL: Thumbnail layout for deepfake video detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22658–22668.
[20] D. A. Coccomini, G. K. Zilos, G. Amato, R. Caldelli, F. Falchi, S. Papadopoulos, and C. Gennaro, “MINTIME: Multi-identity size-invariant video deepfake detection,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 6084–6096, 2024.
[21] D. Nguyen, M. Astrid, A. Kacem, E. Ghorbel, and D. Aouada, “Vulnerability-aware spatio-temporal learning for generalizable and interpretable deepfake video detection,” arXiv preprint arXiv:2501.01184, 2025.
[22] Z. Yan, Y. Zhao, S. Chen, M. Guo, X. Fu, T. Yao, S. Ding, Y. Wu, and L. Yuan, “Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 12615–12625.
[23] Z. Wang, J. Bao, W. Zhou, W. Wang, and H. Li, “AltFreezing for more general video face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4129–4138.
[24] D. Zhang, Z. Xiao, S. Li, F. Lin, J. Li, and S. Ge, “Learning natural consistency representation for face forgery video detection,” in European Conference on Computer Vision. Springer, 2024, pp. 407–424.
[25] D. Cozzolino, A. Rössler, J. Thies, M. Nießner, and L. Verdoliva, “ID-Reveal: Identity-aware deepfake video detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15108–15117.
[26] Z. Xie and J. Luo, “Shaking the fake: Detecting deepfake videos in real time via active probes,” arXiv preprint arXiv:2409.10889, 2024.
[27] T. Oorloff, S. Koppisetti, N. Bonettini, D. Solanki, B. Colman, Y. Yacoob, A. Shahriyari, and G. Bharaj, “AVFF: Audio-visual feature fusion for video deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27102–27112.
[28] C. Feng, Z. Chen, and A. Owens, “Self-supervised video forensics by audio-visual anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10491–10503.
[29] R. Kundu, H. Xiong, V. Mohanty, A. Balachandran, and A. K. Roy-Chowdhury, “Towards a universal synthetic video detector: From face or background manipulations to fully AI-generated content,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 28050–28060.
[30] D. S. Vahdati, T. D. Nguyen, A. Azizpour, and M. C. Stamm, “Beyond deepfake images: Detecting AI-generated videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024, pp. 4397–4408.
[31] C. Chang, Z. Liu, X. Lyu, and X. Qi, “What matters in detecting AI-generated videos like Sora?” arXiv preprint arXiv:2406.19568, 2024.
[32] L. Ji, Y. Lin, Z. Huang, Y. Han, X. Xu, J. Wu, C. Wang, and Z. Liu, “Distinguish any fake videos: Unleashing the power of large-scale data and motion features,” arXiv preprint arXiv:2405.15343, 2024.
[33] Q. Liu, P. Shi, Y.-Y. Tsai, C. Mao, and J. Yang, “Turns out I’m not real: Towards robust detection of AI-generated videos,” arXiv preprint arXiv:2406.09601, 2024.
[34] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900.
[35] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[36] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct. 2014, pp. 1724–1734.
[37] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2021.
[38] B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. Fu, S. Xiang, and H. Ling, “Expanding language-image pretrained models for general video recognition,” in European Conference on Computer Vision. Springer, 2022, pp. 1–18.
[39] C. Zheng, C. Lin, Z. Zhao, H. Wang, X. Guo, S. Liu, and C. Shen, “Breaking semantic artifacts for generalized AI-generated image detection,” in Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, pp. 59570–59596.
[40] C. Tan, Y. Zhao, S. Wei, G. Gu, P. Liu, and Y. Wei, “Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 28130–28139.
[41] Y. Liu, X. Cun, X. Liu, X. Wang, Y. Zhang, H. Chen, Y. Liu, T. Zeng, R. Chan, and Y. Shan, “EvalCrafter: Benchmarking and evaluating large video generation models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22139–22149.
[42] H. Bansal, Z. Lin, T. Xie, Z. Zong, M. Yarom, Y. Bitton, C. Jiang, Y. Sun, K.-W. Chang, and A. Grover, “VideoPhy: Evaluating physical commonsense for video generation,” arXiv preprint arXiv:2406.03520, 2024.
[43] W. Wang and Y. Yang, “VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models,” 2024.
[44] H. Xu, Q. Ye, X. Wu, M. Yan, Y. Miao, J. Ye, G. Xu, A. Hu, Y. Shi, G. Xu et al., “Youku-mPLUG: A 10 million large-scale Chinese video-language dataset for pre-training and benchmarks,” arXiv preprint arXiv:2306.04362, 2023.
[45] Pika, “Pika.art,” https://pika.art/, 2022.
[46] J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang, “ModelScope text-to-video technical report,” arXiv preprint arXiv:2308.06571, 2023.
[47] Morph Studio, “Morph Studio,” https://www.morphstudio.com/, 2023.
[48] moonvalley.ai, “moonvalley.ai,” https://moonvalley.ai/, 2022.
[49] Hotshot, “Hotshot-XL,” https://huggingface.co/hotshotco/Hotshot-XL, 2023.
[50] D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou, “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” International Journal of Computer Vision, pp. 1–15, 2024.
[51] P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356.
[52] H. Chen, M. Xia, Y. He, Y. Zhang, X. Cun, S. Yang, J. Xing, Y. Liu, Q. Chen, X. Wang et al., “VideoCrafter1: Open diffusion models for high-quality video generation,” arXiv preprint arXiv:2310.19512, 2023.
[53] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang et al., “LaVie: High-quality video generation with cascaded latent diffusion models,” International Journal of Computer Vision, vol. 133, no. 5, pp. 3059–3078, 2025.
[54] Y. Wei, S. Zhang, Z. Qing, H. Yuan, Z. Liu, Y. Liu, Y. Zhang, J. Zhou, and H. Shan, “DreamVideo: Composing your dream videos with customized subject and motion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6537–6549.
[55] M. Feng, J. Liu, K. Yu, Y. Yao, Z. Hui, X. Guo, X. Lin, H. Xue, C. Shi, X. Li et al., “DreaMoving: A human video generation framework based on diffusion models,” arXiv preprint arXiv:2312.05107, 2023.
[56] Z. Xu, J. Zhang, J. H. Liew, H. Yan, J.-W. Liu, C. Zhang, J. Feng, and M. Z. Shou, “MagicAnimate: Temporally consistent human image animation using diffusion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1481–1490.
[57] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5288–5296.