Recognition: no theorem link
ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity
Pith reviewed 2026-05-13 17:11 UTC · model grok-4.3
The pith
AI-generated videos exhibit anomalous temporal self-similarity: driven by fixed text or image prompts, they follow deterministic trajectories, unlike the stochastic dynamics of real videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper's core claim is that AIGVs follow deterministic anchor-driven trajectories from text or image prompts, inducing unnaturally repetitive correlations across visual and semantic domains. These correlations can be quantified by visual, textual, and cross-modal similarity matrices, which are encoded by dedicated Transformer encoders and integrated via bidirectional cross-attentive fusion to detect generated video more accurately than prior methods.
What carries the argument
Anomalous temporal self-similarity, measured through a triple-similarity representation: visual, textual, and cross-modal matrices built from frame-wise descriptions, processed by dedicated Transformer encoders, and fused with bidirectional cross-attention.
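The visual branch of this representation can be sketched with toy data: given one embedding per frame (illustrative 2-D vectors here, not the paper's learned features), the temporal self-similarity matrix is the grid of pairwise cosine similarities.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def self_similarity(frames):
    """T x T temporal self-similarity matrix over frame embeddings."""
    return [[cosine(fi, fj) for fj in frames] for fi in frames]

# Toy 2-D embeddings for a 4-frame clip drifting slowly from an anchor.
frames = [[1.0, 0.1 * t] for t in range(4)]
S = self_similarity(frames)
```

By construction the matrix has a unit diagonal and is symmetric; ATSS's claim is that for generated clips the off-diagonal mass stays anomalously high.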
If this is right
- Detection performance improves on AP, AUC, and ACC across large benchmarks, including GenVideo and VidProM.
- The approach generalizes better across different video generation models than local-artifact methods.
- Emphasis shifts to global temporal evolution captured by multimodal similarity rather than short-term inconsistencies.
- The fusion of intra- and inter-modal dynamics provides a unified framework for quantifying generative determinism.
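The fusion pattern described above can be sketched minimally, assuming single-head scaled dot-product attention without learned projections; the function names, shapes, and toy vectors are illustrative, not the paper's architecture.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    """Scaled dot-product attention: one head, no learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                     for k in keys])
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def bidirectional_fusion(visual, textual):
    """Each modality queries the other; both directions are returned."""
    vis_attends_txt = attend(visual, textual, textual)
    txt_attends_vis = attend(textual, visual, visual)
    return vis_attends_txt, txt_attends_vis

visual = [[1.0, 0.0], [0.8, 0.2]]
textual = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
v2t, t2v = bidirectional_fusion(visual, textual)
```

Each fused row is a convex combination of the other modality's rows, which is what lets the module mix intra- and inter-modal temporal evidence.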
Where Pith is reading between the lines
- Video generation systems could add controlled temporal noise to reduce the repetitive self-similarity pattern and evade detection.
- The similarity-matrix approach might adapt to spotting AI-generated sequences in audio or motion capture data by swapping the input modalities.
- Hybrid videos containing both real and generated segments could be tested to determine how the anomaly signal degrades with partial replacement.
Load-bearing premise
AI-generated videos always follow deterministic trajectories from fixed prompts that create repetitive temporal correlations absent in real videos.
What would settle it
Construct the visual, textual, and cross-modal similarity matrices for both real videos and videos from the latest generators, then measure whether the resulting anomaly scores show clear separation or substantial overlap.
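A toy version of this experiment, using synthetic stand-in embeddings and the mean off-diagonal self-similarity as a stand-in anomaly score, might look like:

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def atss_score(frames):
    """Mean off-diagonal self-similarity: higher = more repetitive dynamics."""
    T = len(frames)
    vals = [cosine(frames[i], frames[j])
            for i in range(T) for j in range(T) if i != j]
    return sum(vals) / len(vals)

random.seed(0)
# "Generated" clip: small drift around a fixed anchor direction.
gen = [[1.0 + 0.01 * t, 0.05 * t, 0.0, 0.0] for t in range(8)]
# "Real" clip: unconstrained dynamics, modeled here as random directions.
real = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
gen_score, real_score = atss_score(gen), atss_score(real)
```

On this caricature the generated clip scores near 1 while the "real" clip does not; the open empirical question is whether that separation survives on the latest generators with motion statistics matched.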
Original abstract
AI-generated videos (AIGVs) have achieved unprecedented photorealism, posing severe threats to digital forensics. Existing AIGV detectors focus mainly on localized artifacts or short-term temporal inconsistencies, thus often fail to capture the underlying generative logic governing global temporal evolution, limiting AIGV detection performance. In this paper, we identify a distinctive fingerprint in AIGVs, termed anomalous temporal self-similarity (ATSS). Unlike real videos that exhibit stochastic natural dynamics, AIGVs follow deterministic anchor-driven trajectories (e.g., text or image prompts), inducing unnaturally repetitive correlations across visual and semantic domains. To exploit this, we propose the ATSS method, a multimodal detection framework that exploits this insight via a triple-similarity representation and a cross-attentive fusion mechanism. Specifically, ATSS reconstructs semantic trajectories by leveraging frame-wise descriptions to construct visual, textual, and cross-modal similarity matrices, which jointly quantify the inherent temporal anomalies. These matrices are encoded by dedicated Transformer encoders and integrated via a bidirectional cross-attentive fusion module to effectively model intra- and inter-modal dynamics. Extensive experiments on four large-scale benchmarks, including GenVideo, EvalCrafter, VideoPhy, and VidProM, demonstrate that ATSS significantly outperforms state-of-the-art methods in terms of AP, AUC, and ACC metrics, exhibiting superior generalization across diverse video generation models. Code and models of ATSS will be released at https://github.com/hwang-cs-ime/ATSS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that AI-generated videos (AIGVs) exhibit a distinctive fingerprint called anomalous temporal self-similarity (ATSS) arising from deterministic anchor-driven trajectories (e.g., fixed text/image prompts), in contrast to the stochastic dynamics of real videos. It proposes a multimodal detection framework that builds visual, textual, and cross-modal similarity matrices from frame-wise descriptions, encodes them with dedicated Transformer encoders, and fuses them via a bidirectional cross-attentive module to capture intra- and inter-modal temporal anomalies, reporting superior AP, AUC, and ACC on the GenVideo, EvalCrafter, VideoPhy, and VidProM benchmarks with strong generalization across generation models.
Significance. If the central claim is substantiated, the work would advance AIGV detection by shifting focus from localized artifacts or short-term inconsistencies to global temporal evolution, potentially improving robustness and cross-model generalization in digital forensics. The multimodal triple-similarity representation and planned code release would support reproducibility and further development.
major comments (2)
- [§3] Method: The similarity matrices are constructed directly from input frames and frame-wise descriptions, without any explicit control or matching of motion statistics (e.g., optical-flow magnitude distributions) between real and generated videos. This leaves the core claim (that repetitive correlations reflect generative determinism rather than reduced temporal variance in prompt-conditioned clips) unverified, yet it is load-bearing for the generalization results.
- [§4] Experiments: No ablation studies isolating the contribution of the cross-attentive fusion versus the individual similarity matrices are described, nor any failure-mode analysis on videos with matched motion statistics. Without these, it is impossible to confirm that ATSS captures an intrinsic fingerprint rather than a motion artifact.
minor comments (2)
- [Abstract] The claim of outperforming state-of-the-art methods is stated without numerical values for AP, AUC, or ACC; a one-sentence quantitative summary would improve clarity.
- [§3.1] Notation: The precise mathematical definitions of the three similarity matrices and the bidirectional cross-attentive fusion are introduced late; moving the key equations to §3.1 would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the validation of our core claim regarding anomalous temporal self-similarity in AIGVs. We address each major point below and will revise the manuscript accordingly to strengthen the evidence that ATSS reflects generative determinism rather than motion variance alone.
Point-by-point responses
- Referee ([§3], Method): The similarity matrices are constructed directly from input frames and frame-wise descriptions, without any explicit control or matching of motion statistics (e.g., optical-flow magnitude distributions) between real and generated videos. This leaves the core claim (that repetitive correlations reflect generative determinism rather than reduced temporal variance in prompt-conditioned clips) unverified, yet it is load-bearing for the generalization results.
Authors: We agree that explicit matching of motion statistics would provide stronger isolation of the ATSS fingerprint. The textual and cross-modal similarity matrices incorporate semantic trajectories from frame-wise descriptions, which capture prompt-driven determinism beyond low-level motion variance; this is supported by our cross-model generalization results on benchmarks with diverse motion profiles. However, we acknowledge the concern is valid and will add a dedicated analysis section with optical-flow magnitude comparisons between real and generated videos, plus matched-subset experiments where feasible. revision: partial
- Referee ([§4], Experiments): No ablation studies isolating the contribution of the cross-attentive fusion versus the individual similarity matrices are described, nor any failure-mode analysis on videos with matched motion statistics. Without these, it is impossible to confirm that ATSS captures an intrinsic fingerprint rather than a motion artifact.
Authors: We will incorporate the requested ablations in the revised version, including quantitative comparisons of performance using visual-only, textual-only, cross-modal-only matrices, and the full bidirectional cross-attentive fusion. We will also add failure-mode analysis on motion-matched video subsets (e.g., via optical-flow histogram matching) to demonstrate that ATSS detection remains robust when motion statistics are controlled. These additions will directly address whether the gains stem from the multimodal fusion of temporal anomalies. revision: yes
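The motion-matching control promised here could look like the following sketch, assuming per-video mean optical-flow magnitudes have already been computed by some flow estimator; the ids, numbers, and tolerance below are invented for illustration, not the authors' protocol.

```python
def matched_pairs(real_motion, gen_motion, tol=0.25):
    """Greedily pair real/generated videos with similar motion statistics.

    real_motion, gen_motion: dicts mapping video id -> mean optical-flow
    magnitude (assumed precomputed). A generated video is paired with at
    most one real video, and only when the gap is within `tol`.
    """
    pairs = []
    used = set()
    for rid, rm in sorted(real_motion.items(), key=lambda kv: kv[1]):
        best, best_gap = None, tol
        for gid, gm in gen_motion.items():
            gap = abs(rm - gm)
            if gid not in used and gap <= best_gap:
                best, best_gap = gid, gap
        if best is not None:
            used.add(best)
            pairs.append((rid, best))
    return pairs

# Invented per-video motion statistics for illustration.
real_motion = {"r1": 0.9, "r2": 2.1, "r3": 3.3}
gen_motion = {"g1": 1.0, "g2": 2.0, "g3": 5.0}
pairs = matched_pairs(real_motion, gen_motion)
```

Evaluating ATSS only on the matched subset (here r1/g1 and r2/g2, with the high-motion g3 excluded) would isolate the claimed fingerprint from raw motion-variance differences.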
Circularity Check
No significant circularity; the core representations are derived from video content independently of detection labels
Full rationale
The ATSS pipeline constructs visual, textual, and cross-modal similarity matrices directly from input frames and frame-wise descriptions. These matrices quantify temporal correlations without reference to detection labels or classifier outputs. Transformer encoders and bidirectional cross-attentive fusion operate on these fixed representations to produce detection scores. No equation or step reduces by construction to a fitted parameter renamed as a prediction, nor does any load-bearing claim rely on self-citation chains that presuppose the target result. The method is evaluated against external benchmarks, with performance measured on held-out data from GenVideo, EvalCrafter, VideoPhy, and VidProM. The central hypothesis (deterministic vs. stochastic dynamics) is tested empirically rather than defined into existence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: AIGVs follow deterministic anchor-driven trajectories, inducing unnaturally repetitive correlations