Towards multi-modal forgery representation learning for AI-generated video detection and localization

· 2026 · cs.CV · arXiv 2605.07232

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including clips that are only partially manipulated across visual and audio channels, pose escalating risks of semantic distortion and misuse, motivating the need for reliable detection tools. Most existing AI-generated video detectors model only a single modality or a subset of modalities, and they lack fine-grained temporal forgery localization. To address these limitations, we introduce an architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated videos. Extensive experiments show that this approach outperforms existing state-of-the-art methods.
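
The abstract does not say how the three branches are combined or how localization is computed, so the following is only a minimal sketch under stated assumptions, not the paper's method. It assumes each branch (semantic, visual, audio) already yields one forgery score in [0, 1] per time step, fuses them with a weighted average, and merges consecutive above-threshold steps into temporal segments. The function names, the weights, the threshold of 0.5, and the one-second step are all hypothetical.

```python
from typing import List, Tuple


def fuse_scores(
    semantic: List[float],
    visual: List[float],
    audio: List[float],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> List[float]:
    """Weighted average of per-time-step scores from three branches.

    Assumption: simple late fusion; the paper's actual fusion is not
    described in the abstract.
    """
    if not (len(semantic) == len(visual) == len(audio)):
        raise ValueError("all branches must cover the same number of time steps")
    ws, wv, wa = weights
    total = ws + wv + wa
    return [
        (ws * s + wv * v + wa * a) / total
        for s, v, a in zip(semantic, visual, audio)
    ]


def localize(
    scores: List[float], threshold: float = 0.5, step_seconds: float = 1.0
) -> List[Tuple[float, float]]:
    """Merge consecutive above-threshold steps into (start, end) segments in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first step of the currently open segment
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            segments.append((start * step_seconds, i * step_seconds))
            start = None
    if start is not None:  # close a segment that runs to the end of the clip
        segments.append((start * step_seconds, len(scores) * step_seconds))
    return segments


def detect(scores: List[float], threshold: float = 0.5) -> bool:
    """Flag the whole video if any time step looks manipulated."""
    return any(score >= threshold for score in scores)
```

In this sketch, a clip with fused scores `[0.1, 0.8, 0.9, 0.2, 0.7]` would be flagged by `detect` and `localize` would report two segments, covering seconds 1 to 3 and 4 to 5. Deriving the video-level decision from the per-step scores is one way to get detection and localization from a single pass, which is the property the abstract emphasizes.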

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Towards multi-modal forgery representation learning for AI-generated video detection and localization

cs.CV · 2026-05-08 · unverdicted · novelty 5.0

A multi-modal model with LMM semantic, ST visual, and PS audio branches enables simultaneous detection and fine-grained temporal localization of partial AI video forgeries, outperforming prior methods.

citing papers explorer

Showing 1 of 1 citing paper.

Towards multi-modal forgery representation learning for AI-generated video detection and localization

cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor

A multi-modal model with LMM semantic, ST visual, and PS audio branches enables simultaneous detection and fine-grained temporal localization of partial AI video forgeries, outperforming prior methods.
