pith. machine review for the scientific record.

arxiv: 2604.04029 · v1 · submitted 2026-04-05 · 💻 cs.CV

Recognition: no theorem link

ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity

Chao Shen, Hang Wang, Lei Zhang, Zhi-Qi Cheng

Pith reviewed 2026-05-13 17:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated video detection · temporal self-similarity · multimodal fusion · video forensics · anomaly detection · generative models

The pith

AI-generated videos exhibit anomalous temporal self-similarity because they follow deterministic prompt-driven trajectories, unlike the stochastic dynamics of real videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that AI-generated videos contain a detectable fingerprint of anomalous temporal self-similarity arising from their deterministic generation processes. This fingerprint appears as unnaturally repetitive correlations in visual, semantic, and cross-modal domains over time. By constructing similarity matrices from frames and their frame-wise descriptions and fusing them with cross-attention, the method captures global temporal anomalies that local artifact detectors miss. If correct, this approach would provide a more robust way to distinguish generated videos from authentic ones across various generation models.
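To make the triple-similarity representation concrete, here is a minimal sketch assuming per-frame visual embeddings and frame-wise description embeddings that share a D-dimensional space (as with CLIP-style encoders); the function name, the use of cosine similarity, and the shapes are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def triple_similarity(frame_emb: torch.Tensor, text_emb: torch.Tensor):
    """Hypothetical triple-similarity representation.

    frame_emb: (T, D) per-frame visual embeddings from some vision encoder.
    text_emb:  (T, D) embeddings of the frame-wise descriptions (e.g., captions).
    Returns three (T, T) matrices: visual, textual, and cross-modal similarity.
    """
    v = F.normalize(frame_emb, dim=-1)  # unit-norm rows -> dot product = cosine similarity
    t = F.normalize(text_emb, dim=-1)
    s_vis = v @ v.T      # frame-to-frame visual self-similarity
    s_txt = t @ t.T      # description-to-description semantic self-similarity
    s_cross = v @ t.T    # frame-to-description cross-modal similarity
    return s_vis, s_txt, s_cross
```

Under the ATSS hypothesis, generated clips would show unusually strong and regular off-diagonal structure in all three matrices relative to real footage.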

Core claim

The paper's core claim is that AIGVs follow deterministic anchor-driven trajectories from text or image prompts, inducing unnaturally repetitive correlations across visual and semantic domains. These correlations can be quantified by visual, textual, and cross-modal similarity matrices, which are encoded by dedicated Transformer encoders and integrated via bidirectional cross-attentive fusion to detect generated videos with higher accuracy than prior methods.

What carries the argument

Anomalous temporal self-similarity measured through a triple-similarity representation of visual, textual, and cross-modal matrices built from frame-wise descriptions, processed by Transformer encoders and fused with cross-attention.
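To illustrate the fusion step, the sketch below shows one plausible form of bidirectional cross-attention between two encoded streams, say the outputs of the visual and textual Transformer encoders. The module name, dimensions, mean pooling, and single-logit head are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class BiCrossAttentiveFusion(nn.Module):
    """Illustrative bidirectional cross-attention over two encoded streams."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 1)  # real-vs-generated logit

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (B, T, dim) token sequences from two dedicated Transformer encoders,
        # e.g., encodings of the visual and textual similarity matrices.
        a_ctx, _ = self.a_to_b(query=a, key=b, value=b)  # stream a attends to stream b
        b_ctx, _ = self.b_to_a(query=b, key=a, value=a)  # stream b attends to stream a
        pooled = torch.cat([a_ctx.mean(dim=1), b_ctx.mean(dim=1)], dim=-1)
        return self.classifier(pooled)  # (B, 1) detection logit
```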

If this is right

  • Detection performance improves in AP, AUC, and ACC on large benchmarks including GenVideo and VidProM.
  • The approach generalizes better across different video generation models than local-artifact methods.
  • Emphasis shifts to global temporal evolution captured by multimodal similarity rather than short-term inconsistencies.
  • The fusion of intra- and inter-modal dynamics provides a unified framework for quantifying generative determinism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Video generation systems could add controlled temporal noise to reduce the repetitive self-similarity pattern and evade detection.
  • The similarity-matrix approach might adapt to spotting AI-generated sequences in audio or motion capture data by swapping the input modalities.
  • Hybrid videos containing both real and generated segments could be tested to determine how the anomaly signal degrades with partial replacement.

Load-bearing premise

AI-generated videos always follow deterministic trajectories from fixed prompts that create repetitive temporal correlations absent in real videos.

What would settle it

Construct the visual, textual, and cross-modal similarity matrices for both real videos and videos from the latest generators, then measure whether the resulting anomaly scores show clear separation or substantial overlap.
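One hedged way to run that test: reduce each clip's similarity matrices to a scalar anomaly statistic and measure how well the real and generated score distributions separate. The statistic below (mean off-diagonal similarity) is a placeholder of our own, and ROC AUC is used only to quantify overlap; an AUC near 0.5 would suggest the fingerprint does not survive the latest generators.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def anomaly_score(sim: np.ndarray) -> float:
    """Placeholder statistic: mean off-diagonal entry of a (T, T) similarity matrix."""
    mask = ~np.eye(sim.shape[0], dtype=bool)
    return float(sim[mask].mean())

def score_separation(real_mats: list, fake_mats: list) -> float:
    """ROC AUC of the placeholder score at separating real from generated clips.
    Near 1.0 means clear separation; near 0.5 means substantial overlap."""
    scores = [anomaly_score(s) for s in real_mats + fake_mats]
    labels = [0] * len(real_mats) + [1] * len(fake_mats)
    return roc_auc_score(labels, scores)
```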

Figures

Figures reproduced from arXiv: 2604.04029 by Chao Shen, Hang Wang, Lei Zhang, Zhi-Qi Cheng.

Figure 1: Motivation of the proposed ATSS framework. …
Figure 2: The overall framework of ATSS. Given a video with …
Figure 3: t-SNE visualizations of 10 subsets on the GenVideo dataset: (a) Crafter, (b) Gen2, (c) HotShot, (d) Lavie, (e) ModelScope, …
Figure 4: Visualization of Attention Density Maps. Each row displays the attention weights for the visual, textual, and cross-modal …
Original abstract

AI-generated videos (AIGVs) have achieved unprecedented photorealism, posing severe threats to digital forensics. Existing AIGV detectors focus mainly on localized artifacts or short-term temporal inconsistencies, thus often fail to capture the underlying generative logic governing global temporal evolution, limiting AIGV detection performance. In this paper, we identify a distinctive fingerprint in AIGVs, termed anomalous temporal self-similarity (ATSS). Unlike real videos that exhibit stochastic natural dynamics, AIGVs follow deterministic anchor-driven trajectories (e.g., text or image prompts), inducing unnaturally repetitive correlations across visual and semantic domains. To exploit this, we propose the ATSS method, a multimodal detection framework that exploits this insight via a triple-similarity representation and a cross-attentive fusion mechanism. Specifically, ATSS reconstructs semantic trajectories by leveraging frame-wise descriptions to construct visual, textual, and cross-modal similarity matrices, which jointly quantify the inherent temporal anomalies. These matrices are encoded by dedicated Transformer encoders and integrated via a bidirectional cross-attentive fusion module to effectively model intra- and inter-modal dynamics. Extensive experiments on four large-scale benchmarks, including GenVideo, EvalCrafter, VideoPhy, and VidProM, demonstrate that ATSS significantly outperforms state-of-the-art methods in terms of AP, AUC, and ACC metrics, exhibiting superior generalization across diverse video generation models. Code and models of ATSS will be released at https://github.com/hwang-cs-ime/ATSS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that AI-generated videos (AIGVs) exhibit a distinctive fingerprint called anomalous temporal self-similarity (ATSS) arising from deterministic anchor-driven trajectories (e.g., fixed text/image prompts), in contrast to the stochastic dynamics of real videos. It proposes a multimodal detection framework that builds visual, textual, and cross-modal similarity matrices from frame-wise descriptions, encodes them with dedicated Transformer encoders, and fuses them via a bidirectional cross-attentive module to capture intra- and inter-modal temporal anomalies, reporting superior AP, AUC, and ACC on the GenVideo, EvalCrafter, VideoPhy, and VidProM benchmarks with strong generalization across generation models.

Significance. If the central claim is substantiated, the work would advance AIGV detection by shifting focus from localized artifacts or short-term inconsistencies to global temporal evolution, potentially improving robustness and cross-model generalization in digital forensics. The multimodal triple-similarity representation and planned code release would support reproducibility and further development.

major comments (2)
  1. [§3] §3 (Method): The similarity matrices are constructed directly from input frames and frame-wise descriptions without any explicit control or matching for motion statistics (e.g., optical-flow magnitude distributions) between real and generated videos. This leaves the core claim—that repetitive correlations reflect generative determinism rather than reduced temporal variance in prompt-conditioned clips—unverified and load-bearing for the generalization results.
  2. [§4] §4 (Experiments): No ablation studies isolating the contribution of the cross-attentive fusion versus individual similarity matrices, nor any failure-mode analysis on videos with matched motion statistics, are described. Without these, it is impossible to confirm that ATSS captures an intrinsic fingerprint rather than a motion artifact.
minor comments (2)
  1. [Abstract] Abstract: The claim of outperforming SOTA methods is stated without any numerical values for AP, AUC, or ACC; a one-sentence quantitative summary would improve clarity.
  2. [§3.1] Notation: The precise mathematical definitions of the three similarity matrices and the bidirectional cross-attentive fusion are introduced late; moving the key equations to §3.1 would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the validation of our core claim regarding anomalous temporal self-similarity in AIGVs. We address each major point below and will revise the manuscript accordingly to strengthen the evidence that ATSS reflects generative determinism rather than motion variance alone.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The similarity matrices are constructed directly from input frames and frame-wise descriptions without any explicit control or matching for motion statistics (e.g., optical-flow magnitude distributions) between real and generated videos. This leaves the core claim—that repetitive correlations reflect generative determinism rather than reduced temporal variance in prompt-conditioned clips—unverified and load-bearing for the generalization results.

    Authors: We agree that explicit matching of motion statistics would provide stronger isolation of the ATSS fingerprint. The textual and cross-modal similarity matrices incorporate semantic trajectories from frame-wise descriptions, which capture prompt-driven determinism beyond low-level motion variance; this is supported by our cross-model generalization results on benchmarks with diverse motion profiles. However, we acknowledge the concern is valid and will add a dedicated analysis section with optical-flow magnitude comparisons between real and generated videos, plus matched-subset experiments where feasible. revision: partial

  2. Referee: [§4] §4 (Experiments): No ablation studies isolating the contribution of the cross-attentive fusion versus individual similarity matrices, nor any failure-mode analysis on videos with matched motion statistics, are described. Without these, it is impossible to confirm that ATSS captures an intrinsic fingerprint rather than a motion artifact.

    Authors: We will incorporate the requested ablations in the revised version, including quantitative comparisons of performance using visual-only, textual-only, cross-modal-only matrices, and the full bidirectional cross-attentive fusion. We will also add failure-mode analysis on motion-matched video subsets (e.g., via optical-flow histogram matching) to demonstrate that ATSS detection remains robust when motion statistics are controlled. These additions will directly address whether the gains stem from the multimodal fusion of temporal anomalies. revision: yes
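For reference, a motion-matched subset of the kind proposed above could be assembled by comparing optical-flow magnitude histograms between real and generated clips and keeping only pairs whose histograms are close. The sketch below assumes OpenCV's Farnebäck flow and a chi-square distance threshold; both are illustrative choices rather than the authors' planned protocol.

```python
import cv2
import numpy as np

def flow_magnitude_histogram(frames, bins: int = 32) -> np.ndarray:
    """Flow-magnitude histogram for a clip given as a list of grayscale uint8 frames."""
    mags = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).ravel())
    hist, _ = np.histogram(np.concatenate(mags), bins=bins,
                           range=(0.0, 20.0), density=True)
    return hist

def motion_matched(clip_a, clip_b, max_dist: float = 0.1) -> bool:
    """Keep a real/generated pair only if their flow histograms are close (chi-square distance)."""
    ha, hb = flow_magnitude_histogram(clip_a), flow_magnitude_histogram(clip_b)
    chi2 = 0.5 * np.sum((ha - hb) ** 2 / (ha + hb + 1e-8))
    return chi2 < max_dist
```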

Circularity Check

0 steps flagged

No significant circularity; core representations are derived directly from video content, independent of detection labels

Full rationale

The ATSS pipeline constructs visual, textual, and cross-modal similarity matrices directly from input frames and frame-wise descriptions. These matrices quantify temporal correlations without reference to detection labels or classifier outputs. Transformer encoders and bidirectional cross-attentive fusion operate on these fixed representations to produce detection scores. No equation or step reduces by construction to a fitted parameter renamed as prediction, nor does any load-bearing claim rely on self-citation chains that presuppose the target result. Performance is evaluated against external benchmarks, on held-out data from GenVideo, EvalCrafter, VideoPhy, and VidProM. The central hypothesis (deterministic vs. stochastic dynamics) is tested empirically rather than defined into existence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that generative models produce deterministic temporal trajectories unlike natural stochastic dynamics; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: AIGVs follow deterministic anchor-driven trajectories inducing unnaturally repetitive correlations
    Invoked as the basis for the anomalous self-similarity fingerprint.

pith-pipeline@v0.9.0 · 5570 in / 1127 out tokens · 49353 ms · 2026-05-13T17:11:54.995509+00:00 · methodology

