pith. machine review for the scientific record.

arxiv: 2605.00630 · v1 · submitted 2026-05-01 · 💻 cs.CV · cs.MM · eess.IV

Recognition: unknown

CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:18 UTC · model grok-4.3

classification 💻 cs.CV · cs.MM · eess.IV
keywords AI-generated video detection · cross-modal temporal artifacts · CMTA · video authenticity · CLIP · BLIP · temporal modeling · multimodal deepfake detection

The pith

AI-generated videos can be identified by their unnaturally stable alignment between visual content and textual semantics across frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that real videos show natural fluctuations in how well their frames match implied textual descriptions over time, while AI-generated videos keep this alignment too steady because they follow fixed input prompts. This difference, called the cross-modal temporal artifact, serves as a fingerprint that previous methods missed by focusing only on visual or motion patterns. The authors build a detection system that extracts frame captions with BLIP, aligns them to the visuals with CLIP, and then tracks both coarse changes with a GRU and fine inter-frame details with a Transformer. If the claim holds, detectors gain a more reliable way to spot synthetic video even when the generator or content changes.
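To make the described pipeline concrete, here is a minimal sketch, assuming off-the-shelf Hugging Face BLIP and CLIP checkpoints and a per-frame cosine-similarity trajectory; the checkpoint names and the trajectory definition are our assumptions, not the authors' released implementation.

```python
# Minimal sketch of a CMTA-style signal: caption each frame with BLIP, embed frame and
# caption with CLIP, and track the per-frame visual-textual alignment over time.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, CLIPProcessor, CLIPModel

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def alignment_trajectory(frames: list[Image.Image]) -> torch.Tensor:
    """Per-frame cosine similarity between CLIP image and caption embeddings."""
    sims = []
    for frame in frames:
        # 1) caption the frame with BLIP
        cap_ids = blip.generate(**blip_proc(images=frame, return_tensors="pt"), max_new_tokens=30)
        caption = blip_proc.decode(cap_ids[0], skip_special_tokens=True)
        # 2) embed the frame and its caption with CLIP
        img_emb = clip.get_image_features(**clip_proc(images=frame, return_tensors="pt"))
        txt_emb = clip.get_text_features(**clip_proc(text=[caption], return_tensors="pt", padding=True))
        sims.append(torch.nn.functional.cosine_similarity(img_emb, txt_emb).item())
    # Low temporal variance in this trajectory is the hypothesized AIGV signature.
    return torch.tensor(sims)
```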

Core claim

We identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. The CMTA framework captures these artifacts through joint cross-modal embedding and multi-grained temporal modeling with BLIP-generated captions, CLIP representations, a GRU branch for coarse alignment fluctuations, and a Transformer encoder for fine inter-frame variations.

What carries the argument

Cross-modal temporal artifact (CMTA), the unnaturally stable temporal trajectories of visual-textual semantic alignment in AI-generated videos versus natural fluctuations in real videos.

Load-bearing premise

The observed difference in cross-modal temporal stability is a reliable fingerprint that holds across diverse content, generators, and real-world conditions rather than being limited to specific datasets or prompts.

What would settle it

Finding AI-generated videos whose cross-modal alignment fluctuates at rates statistically indistinguishable from real videos, or real videos whose alignment stays artificially stable, would make the detector fail on those cases.
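A hedged sketch of what "statistically indistinguishable" could look like in practice, assuming the fluctuation statistic is the standard deviation of frame-to-frame alignment changes and using a Mann-Whitney U test; neither choice is specified by the paper.

```python
# Compare how much the alignment trajectory fluctuates in real vs. generated videos.
import numpy as np
from scipy.stats import mannwhitneyu

def fluctuation(trajectory: np.ndarray) -> float:
    """Magnitude of frame-to-frame changes in visual-textual alignment."""
    return float(np.std(np.diff(trajectory)))

def indistinguishable(real_trajs, fake_trajs, alpha: float = 0.05) -> bool:
    """True if generated videos do not fluctuate significantly less than real ones."""
    real_f = [fluctuation(t) for t in real_trajs]
    fake_f = [fluctuation(t) for t in fake_trajs]
    # One-sided test: do real videos fluctuate more than generated ones?
    _, p = mannwhitneyu(real_f, fake_f, alternative="greater")
    return p >= alpha  # failing to separate the groups would undercut the CMTA premise
```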

Figures

Figures reproduced from arXiv: 2605.00630 by Chao Shen, Chenhao Lin, Cong Wang, Hang Wang, Lei Zhang, Minghui Yang.

Figure 1: Motivation of the proposed CMTA framework. (Left) …
Figure 2: The pipeline of the proposed CMTA framework. Given an input video, CMTA first leverages BLIP to generate frame …
Figure 3: Illustration of the GRU cell for coarse-grained temporal …
Figure 4: t-SNE visualization of features extracted before the classification head. Blue and orange points represent real and …
Original abstract

The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. To bridge this gap, we propose the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically, CMTA leverages BLIP to generate frame-level image captions and utilizes CLIP to extract corresponding visual-textual representations. A coarse-grained temporal modeling branch is then designed to characterize temporal fluctuations in cross-modal alignment with a GRU. In parallel, a fine-grained branch is constructed to capture intricate inter-frame variations from integrated visual-textual features with a Transformer encoder. Extensive experiments on 40 subsets across four large-scale datasets, including GenVideo, EvalCrafter, VideoPhy, and VidProM, validate that our approach sets a new state-of-the-art while exhibiting superior cross-generator generalization. Code and models of CMTA will be released at https://github.com/hwang-cs-ime/CMTA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to identify a distinctive fingerprint in AI-generated videos (AIGVs) termed cross-modal temporal artifact (CMTA), arising from unnaturally stable semantic trajectories in cross-modal alignment (due to prompt guidance) versus natural temporal fluctuations in real videos. It proposes the CMTA framework that generates frame captions with BLIP, extracts aligned visual-textual embeddings with CLIP, applies a GRU branch for coarse-grained modeling of alignment fluctuations, and a Transformer encoder for fine-grained inter-frame variations in integrated features. Extensive experiments on 40 subsets across GenVideo, EvalCrafter, VideoPhy, and VidProM are said to demonstrate new state-of-the-art detection with superior cross-generator generalization.
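As a reading aid for the cross-generator claim summarized above, here is a hedged sketch of that evaluation pattern: train on videos from some generators, report AUC on videos from held-out generators. The scikit-learn-style detector interface and the AUC call are our assumptions; the paper's actual protocol, splits, and baselines live in its experiments section.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cross_generator_auc(detector, subsets: dict[str, tuple[np.ndarray, np.ndarray]],
                        held_out: set[str]) -> dict[str, float]:
    """subsets maps a generator name to (features, labels); labels are 1 for AIGV, 0 for real."""
    train_X = np.concatenate([X for g, (X, y) in subsets.items() if g not in held_out])
    train_y = np.concatenate([y for g, (X, y) in subsets.items() if g not in held_out])
    detector.fit(train_X, train_y)  # any classifier exposing fit / predict_proba
    return {g: roc_auc_score(y, detector.predict_proba(X)[:, 1])
            for g, (X, y) in subsets.items() if g in held_out}
```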

Significance. If the CMTA signal proves to be an intrinsic, generator-invariant property rather than a modeling artifact, the work could meaningfully advance AIGV detection by introducing cross-modal temporal cues that complement existing uni-modal and spatiotemporal approaches, with potential benefits for generalization. The explicit commitment to releasing code and models is a clear strength that supports reproducibility and community validation.

major comments (2)
  1. Abstract: The central claim of setting a new state-of-the-art with superior cross-generator generalization on 40 subsets is presented without any reference to the specific baselines, evaluation metrics, statistical significance tests, error bars, data-split protocols, or controls for confounds such as prompt leakage. This absence directly undermines assessment of whether the reported gains are robust and load-bearing for the generalization argument.
  2. Proposed approach (BLIP captioning step): The framework relies on BLIP (pretrained predominantly on real image-text pairs) to produce frame captions before CLIP alignment and temporal modeling. This creates a plausible alternative explanation that the observed cross-modal stability difference is at least partly induced by BLIP's systematic response to distribution shift in AIGVs (e.g., more generic or temporally consistent captions) rather than an intrinsic generator property. No ablation with alternative captioners or analysis of caption variability is mentioned, threatening the cross-generator results on the 40 subsets.
minor comments (1)
  1. Abstract: The integration of the coarse-grained GRU branch and fine-grained Transformer branch into the final classifier is described only at a high level; a brief statement on fusion (e.g., concatenation, attention) would improve clarity.
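Since the fusion step is only described at a high level, here is one plausible dual-branch head with concatenation fusion, written as a sketch under our own assumptions (hidden sizes, mean pooling, and the fusion choice itself), not as the authors' implementation.

```python
# A GRU over the scalar alignment trajectory (coarse branch), a Transformer encoder over
# fused per-frame visual-textual features (fine branch), concatenated into a classifier.
import torch
import torch.nn as nn

class DualBranchDetector(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        # coarse branch: GRU over the per-frame alignment scores (one scalar per frame)
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        # fine branch: Transformer encoder over per-frame integrated visual-textual features
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # fusion by concatenation, then a linear real-vs-generated classifier
        self.classifier = nn.Linear(hidden + feat_dim, 2)

    def forward(self, alignment: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # alignment: (B, T) cosine-similarity trajectory; frame_feats: (B, T, feat_dim)
        _, h = self.gru(alignment.unsqueeze(-1))        # h: (1, B, hidden)
        coarse = h[-1]                                  # (B, hidden)
        fine = self.encoder(frame_feats).mean(dim=1)    # (B, feat_dim), mean-pooled over frames
        return self.classifier(torch.cat([coarse, fine], dim=-1))
```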

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment in detail below, providing clarifications from the manuscript and outlining revisions where the concerns identify areas for improvement. We believe these responses strengthen the presentation of our claims regarding CMTA without altering the core contributions.

Point-by-point responses
  1. Referee: Abstract: The central claim of setting a new state-of-the-art with superior cross-generator generalization on 40 subsets is presented without any reference to the specific baselines, evaluation metrics, statistical significance tests, error bars, data-split protocols, or controls for confounds such as prompt leakage. This absence directly undermines assessment of whether the reported gains are robust and load-bearing for the generalization argument.

    Authors: We agree that the abstract, due to length constraints, summarizes the high-level claims without enumerating every experimental detail. The manuscript provides these specifics in Section 4 (Experiments), including comparisons against multiple baselines from prior AIGV detection literature, primary use of AUC as the metric, descriptions of the 40 subsets drawn from GenVideo, EvalCrafter, VideoPhy, and VidProM, and the cross-generator evaluation protocol that holds out generators and uses dataset-provided train/test splits. Prompt leakage is controlled via the standard held-out prompt setups in those datasets. While we report averaged results across runs, explicit error bars and formal significance tests can be added for clarity. We will revise the abstract to briefly reference the evaluation scope (e.g., 'outperforming prior methods across 40 subsets with superior cross-generator AUC') and direct readers to Section 4 for full protocols, metrics, and controls. This preserves abstract conciseness while improving assessability. revision: partial

  2. Referee: Proposed approach (BLIP captioning step): The framework relies on BLIP (pretrained predominantly on real image-text pairs) to produce frame captions before CLIP alignment and temporal modeling. This creates a plausible alternative explanation that the observed cross-modal stability difference is at least partly induced by BLIP's systematic response to distribution shift in AIGVs (e.g., more generic or temporally consistent captions) rather than an intrinsic generator property. No ablation with alternative captioners or analysis of caption variability is mentioned, threatening the cross-generator results on the 40 subsets.

    Authors: This is a substantive concern we should have addressed more explicitly. Our design choice of BLIP follows its common use for frame-level captioning in video tasks, and the CMTA hypothesis focuses on how prompt-guided AIGV generation produces visual content with unnaturally stable semantics that manifest in cross-modal alignment trajectories. Nevertheless, to rule out captioner-specific bias, we will incorporate an ablation in the revision that replaces BLIP with an alternative captioning pipeline (e.g., a fine-tuned LLaVA model or direct visual feature integration without explicit captions) on representative subsets. We will also add quantitative analysis of caption variability, such as frame-to-frame semantic embedding distances, to show that the stability difference persists and is larger in AIGVs than real videos regardless of the caption source. These additions will be placed in Section 3 and the experiments, directly supporting the cross-generator generalization claims. revision: yes
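A minimal sketch of the caption-variability analysis promised in the second response, assuming a generic sentence-embedding model and mean adjacent-frame cosine distance as the drift statistic; the authors' revision may use different embeddings or statistics.

```python
# Measure how much frame captions drift over time; under the CMTA hypothesis, real videos
# should show larger drift than AIGVs, and the gap should persist if BLIP is swapped out.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed, any sentence encoder works

def caption_variability(captions: list[str]) -> float:
    """Mean cosine distance between embeddings of captions from consecutive frames."""
    emb = encoder.encode(captions, normalize_embeddings=True)  # (T, D), unit-norm rows
    adjacent_cos = np.sum(emb[:-1] * emb[1:], axis=1)          # cosine similarity per adjacent pair
    return float(np.mean(1.0 - adjacent_cos))                  # higher means captions drift more
```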

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent cross-modal modeling

Full rationale

The paper presents an empirical detection method that extracts frame captions via off-the-shelf BLIP, aligns them with CLIP embeddings, then applies standard GRU and Transformer encoders to model temporal fluctuations in cross-modal alignment. No equations, parameters, or uniqueness claims are shown to reduce by construction to fitted inputs or self-citations. The central fingerprint (stable semantic trajectories in AIGVs) is posited as an observed phenomenon and then measured by the proposed architecture; the architecture does not define or presuppose the fingerprint. Experiments on 40 held-out subsets across four external datasets provide independent validation rather than tautological confirmation. This is a standard self-contained ML detection pipeline with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review yields limited visibility into exact training details. The framework relies on pre-trained vision-language models and standard temporal architectures whose assumptions are inherited rather than newly derived.

free parameters (1)
  • GRU and Transformer hyperparameters
    Trained end-to-end for the detection task; specific values not stated in abstract.
axioms (1)
  • domain assumption: Real videos exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations
    Stated directly as the contrast that defines the CMTA fingerprint.
invented entities (1)
  • Cross-modal temporal artifact (CMTA): no independent evidence
    purpose: Distinctive fingerprint for distinguishing AIGVs from real videos
    Newly named and operationalized in this work via the dual-branch temporal modeling.

pith-pipeline@v0.9.0 · 5580 in / 1281 out tokens · 47100 ms · 2026-05-09T20:18:08.928766+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 11 canonical work pages · 2 internal anchors

[1] Google DeepMind, “Veo,” https://deepmind.google/technologies/veo/, 2024.
[2] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman et al., “Video generation models as world simulators,” OpenAI Blog, vol. 1, no. 8, p. 1, 2024.
[3] A. Germanidis, “Gen-2: Generate novel videos with text, images or video clips,” 2023.
[4] J. Bai, M. Lin, G. Cao, and Z. Lou, “AI-generated video detection via spatial-temporal anomaly learning,” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2024, pp. 460–470.
[5] H. Chen, Y. Hong, Z. Huang, Z. Xu, Z. Gu, Y. Li, J. Lan, H. Zhu, J. Zhang, W. Wang et al., “DeMamba: AI-generated video detection on million-scale GenVideo benchmark,” arXiv preprint arXiv:2405.19707, 2024.
[6] L. Ma, Z. Yan, Q. Guo, Y. Liao, H. Yu, and P. Zhou, “Detecting AI-generated video via frame consistency,” 2025 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:270212645
[7] C. Zheng, R. Suo, C. Lin, Z. Zhao, L. Yang, S. Liu, M. Yang, C. Wang, and C. Shen, “D3: Training-free AI-generated video detection using second-order features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 12852–12862.
[8] S. Zhang, Z. Lian, J. Yang, D. Li, G. Pang, F. Liu, B. Han, S. Li, and M. Tan, “Physics-driven spatiotemporal modeling for AI-generated video detection,” in Advances in Neural Information Processing Systems, 2025.
[9] C. Internò, R. Geirhos, M. Olhofer, S. Liu, B. Hammer, and D. Klindt, “AI-generated video detection via perceptual straightening,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=LsmUgStXby
[10] Y. Zheng, J. Bao, D. Chen, M. Zeng, and F. Wen, “Exploring temporal coherence for more general video face forgery detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15044–15054.
[11] Z. Gu, Y. Chen, T. Yao, S. Ding, J. Li, F. Huang, and L. Ma, “Spatiotemporal inconsistency learning for deepfake video detection,” in Proceedings of the 29th ACM International Conference on Multimedia, ser. MM ’21, 2021, pp. 3473–3481.
[12] F. Nie, J. Ni, J. Zhang, B. Zhang, and W. Zhang, “DIP: Diffusion learning of inconsistency pattern for general deepfake detection,” IEEE Transactions on Multimedia, 2024.
[13] G. Pang, B. Zhang, Z. Teng, Z. Qi, and J. Fan, “MRE-Net: Multi-rate excitation network for deepfake video detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 8, pp. 3663–3676, 2023.
[14] Y. Chen, N. Akhtar, N. A. H. Haldar, and A. Mian, “Deepfake detection with spatio-temporal consistency and attention,” in 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2022, pp. 1–8.
[15] T. Kim, J. Choi, Y. Jeong, H. Noh, J. Yoo, S. Baek, and J. Choi, “Beyond spatial frequency: Pixel-wise temporal frequency-based deepfake video detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 11198–11207.
[16] J. Choi, T. Kim, Y. Jeong, S. Baek, and J. Choi, “Exploiting style latent flows for generalizing deepfake video detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1133–1143.
[17] Z. Chen, X. Liao, X. Wu, and Y. Chen, “Compressed deepfake video detection based on 3D spatiotemporal trajectories,” in 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2024, pp. 1–8.
[18] C. Zhao, C. Wang, G. Hu, H. Chen, C. Liu, and J. Tang, “ISTVT: Interpretable spatial-temporal video transformer for deepfake detection,” IEEE Transactions on Information Forensics and Security, vol. 18, pp. 1335–1348, 2023.
[19] Y. Xu, J. Liang, G. Jia, Z. Yang, Y. Zhang, and R. He, “TALL: Thumbnail layout for deepfake video detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22658–22668.
[20] D. A. Coccomini, G. K. Zilos, G. Amato, R. Caldelli, F. Falchi, S. Papadopoulos, and C. Gennaro, “MINTIME: Multi-identity size-invariant video deepfake detection,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 6084–6096, 2024.
[21] D. Nguyen, M. Astrid, A. Kacem, E. Ghorbel, and D. Aouada, “Vulnerability-aware spatio-temporal learning for generalizable and interpretable deepfake video detection,” arXiv preprint arXiv:2501.01184, 2025.
[22] Z. Yan, Y. Zhao, S. Chen, M. Guo, X. Fu, T. Yao, S. Ding, Y. Wu, and L. Yuan, “Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 12615–12625.
[23] Z. Wang, J. Bao, W. Zhou, W. Wang, and H. Li, “AltFreezing for more general video face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4129–4138.
[24] D. Zhang, Z. Xiao, S. Li, F. Lin, J. Li, and S. Ge, “Learning natural consistency representation for face forgery video detection,” in European Conference on Computer Vision. Springer, 2024, pp. 407–424.
[25] D. Cozzolino, A. Rössler, J. Thies, M. Nießner, and L. Verdoliva, “ID-Reveal: Identity-aware deepfake video detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15108–15117.
[26] Z. Xie and J. Luo, “Shaking the fake: Detecting deepfake videos in real time via active probes,” arXiv preprint arXiv:2409.10889, 2024.
[27] T. Oorloff, S. Koppisetti, N. Bonettini, D. Solanki, B. Colman, Y. Yacoob, A. Shahriyari, and G. Bharaj, “AVFF: Audio-visual feature fusion for video deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27102–27112.
[28] C. Feng, Z. Chen, and A. Owens, “Self-supervised video forensics by audio-visual anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10491–10503.
[29] R. Kundu, H. Xiong, V. Mohanty, A. Balachandran, and A. K. Roy-Chowdhury, “Towards a universal synthetic video detector: From face or background manipulations to fully AI-generated content,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 28050–28060.
[30] D. S. Vahdati, T. D. Nguyen, A. Azizpour, and M. C. Stamm, “Beyond deepfake images: Detecting AI-generated videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024, pp. 4397–4408.
[31] C. Chang, Z. Liu, X. Lyu, and X. Qi, “What matters in detecting AI-generated videos like Sora?” arXiv preprint arXiv:2406.19568, 2024.
[32] L. Ji, Y. Lin, Z. Huang, Y. Han, X. Xu, J. Wu, C. Wang, and Z. Liu, “Distinguish any fake videos: Unleashing the power of large-scale data and motion features,” arXiv preprint arXiv:2405.15343, 2024.
[33] Q. Liu, P. Shi, Y.-Y. Tsai, C. Mao, and J. Yang, “Turns out I’m not real: Towards robust detection of AI-generated videos,” arXiv preprint arXiv:2406.09601, 2024.
[34] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900.
[35] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[36] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct. 2014, pp. 1724–1734.
[37] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2021.
[38] B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. Fu, S. Xiang, and H. Ling, “Expanding language-image pretrained models for general video recognition,” in European Conference on Computer Vision. Springer, 2022, pp. 1–18.
[39] C. Zheng, C. Lin, Z. Zhao, H. Wang, X. Guo, S. Liu, and C. Shen, “Breaking semantic artifacts for generalized AI-generated image detection,” in Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, pp. 59570–59596.
[40] C. Tan, Y. Zhao, S. Wei, G. Gu, P. Liu, and Y. Wei, “Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 28130–28139.
[41] Y. Liu, X. Cun, X. Liu, X. Wang, Y. Zhang, H. Chen, Y. Liu, T. Zeng, R. Chan, and Y. Shan, “EvalCrafter: Benchmarking and evaluating large video generation models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22139–22149.
[42] H. Bansal, Z. Lin, T. Xie, Z. Zong, M. Yarom, Y. Bitton, C. Jiang, Y. Sun, K.-W. Chang, and A. Grover, “VideoPhy: Evaluating physical commonsense for video generation,” arXiv preprint arXiv:2406.03520, 2024.
[43] W. Wang and Y. Yang, “VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models,” 2024.
[44] H. Xu, Q. Ye, X. Wu, M. Yan, Y. Miao, J. Ye, G. Xu, A. Hu, Y. Shi, G. Xu et al., “Youku-mPLUG: A 10 million large-scale Chinese video-language dataset for pre-training and benchmarks,” arXiv preprint arXiv:2306.04362, 2023.
[45] Pika, “Pika.art,” https://pika.art/, 2022.
[46] J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang, “ModelScope text-to-video technical report,” arXiv preprint arXiv:2308.06571, 2023.
[47] Morph Studio, “Morph Studio,” https://www.morphstudio.com/, 2023.
[48] moonvalley.ai, “moonvalley.ai,” https://moonvalley.ai/, 2022.
[49] Hotshot, “Hotshot-XL,” https://huggingface.co/hotshotco/Hotshot-XL, 2023.
[50] D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou, “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” International Journal of Computer Vision, pp. 1–15, 2024.
[51] P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356.
[52] H. Chen, M. Xia, Y. He, Y. Zhang, X. Cun, S. Yang, J. Xing, Y. Liu, Q. Chen, X. Wang et al., “VideoCrafter1: Open diffusion models for high-quality video generation,” arXiv preprint arXiv:2310.19512, 2023.
[53] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang et al., “LaVie: High-quality video generation with cascaded latent diffusion models,” International Journal of Computer Vision, vol. 133, no. 5, pp. 3059–3078, 2025.
[54] Y. Wei, S. Zhang, Z. Qing, H. Yuan, Z. Liu, Y. Liu, Y. Zhang, J. Zhou, and H. Shan, “DreamVideo: Composing your dream videos with customized subject and motion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6537–6549.
[55] M. Feng, J. Liu, K. Yu, Y. Yao, Z. Hui, X. Guo, X. Lin, H. Xue, C. Shi, X. Li et al., “DreaMoving: A human video generation framework based on diffusion models,” arXiv preprint arXiv:2312.05107, 2023.
[56] Z. Xu, J. Zhang, J. H. Liew, H. Yan, J.-W. Liu, C. Zhang, J. Feng, and M. Z. Shou, “MagicAnimate: Temporally consistent human image animation using diffusion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1481–1490.
[57] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5288–5296.