Recognition: unknown
Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes
Pith reviewed 2026-05-10 16:35 UTC · model grok-4.3
The pith
Training deepfake detectors solely on real videos plus self-generated pseudo-fakes improves detection of unseen deepfakes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that by creating self-generated Audio-Visual Pseudo-Fakes from authentic samples alone, which incorporate diverse cross-modal correspondence patterns typical of real-world deepfakes, and training detectors exclusively on authentic data combined with these pseudo-fakes, the models achieve notably enhanced generalizability, with an average performance improvement of up to 7.4% across multiple standard datasets.
What carries the argument
AVPF, a generation process that produces pseudo-fake video samples from real ones by altering audio-visual alignments to mimic common deepfake discrepancies, serving as synthetic training data to expose models to varied mismatch patterns.
If this is right
- Models achieve better results on unseen deepfakes without training on real fakes.
- The method relies only on authentic videos for both real and fake training examples.
- Generalizability improves due to exposure to a broader range of audio-visual correspondence issues.
- Performance gains are demonstrated on multiple standard deepfake datasets.
Where Pith is reading between the lines
- This technique might allow training in data-scarce environments where real deepfakes are not available for ethical or legal reasons.
- Similar pseudo-sample generation could be explored for other detection tasks involving multimodal data.
- If successful, it suggests that the key to generalization lies in simulating the artifacts rather than collecting them.
Load-bearing premise
The self-generated pseudo-fakes must accurately replicate the diverse audio-visual mismatch patterns that appear in actual deepfakes created by various methods.
What would settle it
Testing the trained model on a dataset of deepfakes featuring audio-visual inconsistencies that differ substantially from those in the pseudo-fakes and observing no performance gain or a decrease compared to baseline methods.
Figures
read the original abstract
Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed architectures with publicly available datasets. While they have shown promising results, their effectiveness often degrades in real-world scenarios, as the limited diversity of training datasets naturally restricts generalizability to unseen cases. To address this, we propose a simple yet effective method, called AVPF, which can notably enhance model generalizability by training with self-generated Audio-Visual Pseudo-Fakes.The key idea of AVPF is to create pseudo-fake training samples that contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes. We highlight that AVPF is generated solely from authentic samples, and training relies only on authentic data and AVPF, without requiring any real deepfakes.Extensive experiments on multiple standard datasets demonstrate the strong generalizability of the proposed method, achieving an average performance improvement of up to 7.4%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AVPF, a method to generate Audio-Visual Pseudo-Fakes solely from authentic video samples in order to train deepfake detectors. The central claim is that this training strategy, which avoids any real deepfake data, produces models with substantially better generalization to unseen deepfakes, yielding an average performance gain of up to 7.4% across multiple standard datasets.
Significance. If the empirical claims are substantiated, the work would offer a practical route to improving detector robustness without relying on scarce or generator-specific deepfake corpora. The core idea of synthesizing controllable cross-modal inconsistencies from real data is attractive for the field, but its value hinges on whether the generated artifacts actually transfer to the implicit synthesis errors produced by contemporary GAN- and diffusion-based pipelines.
major comments (2)
- Abstract: the stated 'average performance improvement of up to 7.4%' is presented without any accompanying description of the pseudo-fake generation procedure, the train/test splits, the baseline detectors, the evaluation metrics, or error bars. This absence leaves the central generalization claim without verifiable support in the provided text.
- Method description (inferred from abstract): the claim that self-generated AVPF 'contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes' requires explicit justification. Because generation begins from authentic samples, the resulting inconsistencies are produced by explicit, controllable operations (e.g., audio swapping or temporal misalignment). These may differ systematically from the implicit, model-specific artifacts left by real synthesis pipelines; the manuscript must demonstrate that detectors trained on the former still detect the latter on held-out generators.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the full paper content and indicating planned revisions where appropriate to enhance clarity and support for our claims.
read point-by-point responses
-
Referee: Abstract: the stated 'average performance improvement of up to 7.4%' is presented without any accompanying description of the pseudo-fake generation procedure, the train/test splits, the baseline detectors, the evaluation metrics, or error bars. This absence leaves the central generalization claim without verifiable support in the provided text.
Authors: We acknowledge that the abstract is intentionally concise and omits granular details due to typical length constraints. However, the full manuscript provides all requested information in dedicated sections: the AVPF generation procedure (including specific audio-visual manipulations from authentic samples) is described in Section 3; train/test splits and datasets are detailed in Section 4 along with cross-generator evaluation protocols; baseline detectors and metrics (AUC, EER) are specified in Section 4.1; and results include means with standard deviations (error bars) across runs in Tables 1-4. To address the concern directly, we will revise the abstract to include a brief supporting clause: 'AVPF creates pseudo-fakes via controllable audio-visual manipulations on real videos only, evaluated on standard datasets with held-out generators, achieving up to 7.4% average improvement.' This maintains brevity while adding context. revision: yes
-
Referee: Method description (inferred from abstract): the claim that self-generated AVPF 'contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes' requires explicit justification. Because generation begins from authentic samples, the resulting inconsistencies are produced by explicit, controllable operations (e.g., audio swapping or temporal misalignment). These may differ systematically from the implicit, model-specific artifacts left by real synthesis pipelines; the manuscript must demonstrate that detectors trained on the former still detect the latter on held-out generators.
Authors: We agree that explicit justification strengthens the contribution and that the distinction between explicit and implicit artifacts merits discussion. Our generation operations are deliberately chosen to replicate prevalent real-world deepfake inconsistencies (e.g., lip desynchronization, audio-visual mismatches) that appear across GAN- and diffusion-based pipelines. The primary demonstration of transfer is empirical: models trained exclusively on real data plus AVPF (no real deepfakes) are tested on held-out deepfake corpora from unseen generators, yielding consistent gains up to 7.4% as reported in the experiments. This indicates effective generalization to implicit artifacts. In revision, we will add a new paragraph in Section 3 explicitly linking each generation operation to observed real deepfake patterns and discussing the rationale for transferability. revision: partial
Circularity Check
No circularity: method is empirical data augmentation without derivations or load-bearing self-citations
full rationale
The paper proposes AVPF as a technique to generate pseudo-fake samples from authentic video data only, then trains detectors on authentic samples plus these pseudo-fakes. No equations, first-principles derivations, or mathematical reductions appear in the abstract or described method. Claims of improved generalizability rest on experimental results across standard datasets rather than any self-referential fitting or uniqueness theorem imported from prior author work. The central premise (that self-generated pseudo-fakes capture relevant cross-modal patterns) is an empirical hypothesis, not a definitional or fitted tautology. This is a standard non-circular ML method paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Self-generated pseudo-fakes from authentic data contain diverse audio-visual correspondence patterns observed in real deepfakes
invented entities (1)
-
AVPF (Audio-Visual Pseudo-Fakes)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Intra-modal and cross-modal synchroniza- tion for audio-visual deepfake detection and temporal local- ization
Ashutosh Anshul, Shreyas Gopal, Deepu Rajan, and Eng Siong Chng. Intra-modal and cross-modal synchroniza- tion for audio-visual deepfake detection and temporal local- ization. InIEEE International Conference on Computer Vi- sion, 2025. 1
2025
-
[2]
Aunet: Learning relations between action units for face forgery detection
Weiming Bai, Yufan Liu, Zhipeng Zhang, Bing Li, and Weiming Hu. Aunet: Learning relations between action units for face forgery detection. InIEEE Conference on Computer Vision and Pattern Recognition, 2023. 2
2023
-
[3]
Idiff-face: Synthetic-based face recognition through fizzy identity-conditioned diffusion models
Fadi Boutros, Jonas Henry Grebe, Arjan Kuijper, and Naser Damer. Idiff-face: Synthetic-based face recognition through fizzy identity-conditioned diffusion models. InIEEE Inter- national Conference on Computer Vision, 2023. 1
2023
-
[4]
Glitch in the matrix: A large scale benchmark for content driven audio–visual forgery detection and localization.Computer Vision and Im- age Understanding, 2023
Zhixi Cai, Shreya Ghosh, Abhinav Dhall, Tom Gedeon, Kalin Stefanov, and Munawar Hayat. Glitch in the matrix: A large scale benchmark for content driven audio–visual forgery detection and localization.Computer Vision and Im- age Understanding, 2023. 2, 5
2023
-
[5]
Av-deepfake1m: A large-scale llm-driven audio-visual deep- fake dataset
Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar 9 Hayat, Abhinav Dhall, Tom Gedeon, and Kalin Stefanov. Av-deepfake1m: A large-scale llm-driven audio-visual deep- fake dataset. InACM International Conference on Multime- dia, 2024. 4
2024
-
[6]
Self-supervised learning of adversarial exam- ple: Towards good generalizations for deepfake detection
Liang Chen, Yong Zhang, Yibing Song, Lingqiao Liu, and Jue Wang. Self-supervised learning of adversarial exam- ple: Towards good generalizations for deepfake detection. InIEEE Conference on Computer Vision and Pattern Recog- nition, 2022. 3
2022
-
[7]
V oice-face homogeneity tells deep- fake.ACM Transactions on Multimedia Computing, Com- munications, and Applications, 2023
Harry Cheng, Yangyang Guo, Tianyi Wang, Qi Li, Xiaojun Chang, and Liqiang Nie. V oice-face homogeneity tells deep- fake.ACM Transactions on Multimedia Computing, Com- munications, and Applications, 2023. 2, 5
2023
-
[8]
Not made for each other- audio-visual dissonance-based deepfake detection and localization
Komal Chugh, Parul Gupta, Abhinav Dhall, and Ramanathan Subramanian. Not made for each other- audio-visual dissonance-based deepfake detection and localization. In ACM International Conference on Multimedia, 2020. 2, 5
2020
-
[9]
Forensics adapter: Adapting clip for generalizable face forgery detection
Xinjie Cui, Yuezun Li, Ao Luo, Jiaran Zhou, and Junyu Dong. Forensics adapter: Adapting clip for generalizable face forgery detection. InIEEE Conference on Computer Vision and Pattern Recognition, 2025. 1
2025
-
[10]
Self- supervised video forensics by audio-visual anomaly detec- tion
Chao Feng, Ziyang Chen, and Andrew Owens. Self- supervised video forensics by audio-visual anomaly detec- tion. InIEEE Conference on Computer Vision and Pattern Recognition, 2023. 2, 5
2023
-
[11]
Social, legal, and ethical implications of ai-generated deepfake pornography on digital platforms: A systematic literature review.Social Sciences & Humanities Open, 2025
Furizal, Alfian Ma’arif, Hari Maghfiroh, Iswanto Suwarno, Denis Prayogi, Kariyamin, Syahrani Lonang, and Abdel- Nasser Sharkawy. Social, legal, and ethical implications of ai-generated deepfake pornography on digital platforms: A systematic literature review.Social Sciences & Humanities Open, 2025. 1
2025
-
[12]
St-sbv: Spatial-temporal self-blended videos for deep- fake detection
Weinan Guan, Wei Wang, Bo Peng, Jing Dong, and Tieniu Tan. St-sbv: Spatial-temporal self-blended videos for deep- fake detection. InChinese Conference on Pattern Recogni- tion and Computer Vision, 2024. 3
2024
-
[13]
Michael Hameleers, Toni G. L. A. van der Meer, and Tom Dobber. Distorting the truth versus blatant lies: The effects of different degrees of deception in domestic and foreign po- litical deepfakes.Computers in Human Behavior, 2024. 1
2024
-
[14]
Contextual cross- modal attention for audio-visual deepfake detection and lo- calization
Vinaya Sree Katamneni and Ajita Rattani. Contextual cross- modal attention for audio-visual deepfake detection and lo- calization. InIEEE International Joint Conference on Bio- metrics, 2024. 2, 5
2024
- [15]
-
[16]
Diffusion-driven gan inversion for multi- modal face image generation
Jihyun Kim, Changjae Oh, Hoseok Do, Soohyun Kim, and Kwanghoon Sohn. Diffusion-driven gan inversion for multi- modal face image generation. InIEEE Conference on Com- puter Vision and Pattern Recognition, 2024. 1
2024
-
[17]
Christos Koutlis and Symeon Papadopoulos. Dimodif: Discourse modality-information differentiation for audio- visual deepfake detection and localization.arXiv preprint arXiv:2411.10193, 2024. 2, 5
-
[18]
Face x-ray for more general face forgery detection
Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face x-ray for more general face forgery detection. InIEEE Conference on Com- puter Vision and Pattern Recognition, 2020. 3
2020
-
[19]
Spatio-temporal catcher: A self-supervised transformer for deepfake video detection
Maosen Li, Xurong Li, Kun Yu, Cheng Deng, Heng Huang, Feng Mao, Hui Xue, and Minghao Li. Spatio-temporal catcher: A self-supervised transformer for deepfake video detection. InACM International Conference on Multimedia, 2023
2023
-
[20]
Exposing deepfake videos by de- tecting face warping artifacts
Yuezun Li and Siwei Lyu. Exposing deepfake videos by de- tecting face warping artifacts. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2019. 3
2019
-
[21]
Speechforensics: Audio-visual speech representation learn- ing for face forgery detection
Yachao Liang, Min Yu, Gang Li, Jianguo Jiang, Boquan Li, Feng Yu, Ning Zhang, Xiang Meng, and Weiqing Huang. Speechforensics: Audio-visual speech representation learn- ing for face forgery detection. InAdvances in Neural Infor- mation Processing Systems, 2024. 2, 5
2024
-
[22]
Lips are lying: Spot- ting the temporal inconsistency between audio and visual in lip-syncing deepfakes
Weifeng Liu, Tianyi She, Jiawei Liu, Boheng Li, Dongyu Yao, Ziyou Liang, and Run Wang. Lips are lying: Spot- ting the temporal inconsistency between audio and visual in lip-syncing deepfakes. InAdvances in Neural Information Processing Systems, 2024. 2, 4, 5
2024
-
[23]
Beyond the prior forgery knowl- edge: Mining critical clues for general face forgery detec- tion.IEEE Transactions on Information Forensics and Secu- rity, 2024
Anwei Luo, Chenqi Kong, Jiwu Huang, Yongjian Hu, Xian- gui Kang, and Alex C Kot. Beyond the prior forgery knowl- edge: Mining critical clues for general face forgery detec- tion.IEEE Transactions on Information Forensics and Secu- rity, 2024. 2
2024
-
[24]
Multi-modal deepfake detection via multi-task audio-visual prompt learning
Hui Miao, Yuanfang Guo, Zeming Liu, and Yunhong Wang. Multi-modal deepfake detection via multi-task audio-visual prompt learning. InAAAI Conference on Artificial Intelli- gence, 2025. 1, 2, 5
2025
-
[25]
Vulnerability-aware spatio-temporal learning for generalizable deepfake video detection
Dat Nguyen, Marcella Astrid, Anis Kacem, Enjie Ghorbel, and Djamila Aouada. Vulnerability-aware spatio-temporal learning for generalizable deepfake video detection. InIEEE International Conference on Computer Vision, 2025. 2, 3
2025
-
[26]
Avff: Audio-visual feature fusion for video deepfake detection
Trevine Oorloff, Surya Koppisetti, Nicol `o Bonettini, Di- vyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, and Gaurav Bharaj. Avff: Audio-visual feature fusion for video deepfake detection. InIEEE Conference on Computer Vision and Pattern Recognition, 2024. 2, 5
2024
-
[27]
Pytorch: An imperative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 2019. 5
2019
-
[28]
Evaluating deepfake detectors in the wild
Viacheslav Pirogov. Evaluating deepfake detectors in the wild. InInternational Conference on Machine Learning workshop, 2025. 2
2025
-
[29]
Learning audio-visual speech representation by masked multimodal cluster prediction
Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, and Abdelrah- man Mohamed. Learning audio-visual speech representation by masked multimodal cluster prediction. InInternational Conference on Learning Representations, 2022. 4
2022
-
[30]
Detecting deep- fakes with self-blended images
Kaede Shiohara and Toshihiko Yamasaki. Detecting deep- fakes with self-blended images. InIEEE Conference on Computer Vision and Pattern Recognition, 2022. 2, 3
2022
-
[31]
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning
Stefan Smeu, Dragos-Alexandru Boldisor, Dan Oneata, and Elisabeta Oneata. Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning. In IEEE Conference on Computer Vision and Pattern Recogni- tion, 2025. 1, 2, 4, 5, 6 10
2025
-
[32]
Managing deepfakes with artificial intel- ligence: Introducing the business privacy calculus.Journal of Business Research, 2025
Giuseppe Vecchietti, Gajendra Liyanaarachchi, and Gi- ampaolo Viglia. Managing deepfakes with artificial intel- ligence: Introducing the business privacy calculus.Journal of Business Research, 2025. 1
2025
-
[33]
Audio–visual deepfake detec- tion using articulatory representation learning.Computer Vi- sion and Image Understanding, 2024
Yujia Wang and Hua Huang. Audio–visual deepfake detec- tion using articulatory representation learning.Computer Vi- sion and Image Understanding, 2024. 2
2024
-
[34]
Talkingheadbench: A multi-modal bench- mark & analysis of talking-head deepfake detection
Xinqi Xiong, Prakrut Patel, Qingyuan Fan, Amisha Wadhwa, Sarathy Selvam, Xiao Guo, Luchao Qi, Xiaoming Liu, and Roni Sengupta. Talkingheadbench: A multi-modal bench- mark & analysis of talking-head deepfake detection. InPro- ceedings of the IEEE/CVF Winter Conference on Applica- tions of Computer Vision, 2026. 4
2026
-
[35]
Transcending forgery specificity with latent space augmentation for generalizable deepfake detection
Zhiyuan Yan, Yuhao Luo, Siwei Lyu, Qingshan Liu, and Baoyuan Wu. Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. In IEEE Conference on Computer Vision and Pattern Recogni- tion, 2024. 1
2024
-
[36]
Generalizing deepfake video detection with plug- and-play: Video-level blending and spatiotemporal adapter tuning
Zhiyuan Yan, Yandan Zhao, Shen Chen, Mingyi Guo, Xinghe Fu, Taiping Yao, Shouhong Ding, Yunsheng Wu, and Li Yuan. Generalizing deepfake video detection with plug- and-play: Video-level blending and spatiotemporal adapter tuning. InIEEE Conference on Computer Vision and Pattern Recognition, 2025. 3
2025
-
[37]
Fine-grained multimodal deepfake classification via heterogeneous graphs.International Jour- nal of Computer Vision, 2024
Qilin Yin, Wei Lu, Xiaochun Cao, Xiangyang Luo, Yicong Zhou, and Jiwu Huang. Fine-grained multimodal deepfake classification via heterogeneous graphs.International Jour- nal of Computer Vision, 2024. 2, 5
2024
-
[38]
Unlocking the capabilities of large vision-language models for generalizable and explainable deepfake detection
Peipeng Yu, Jianwei Fei, Hui Gao, Xuan Feng, Zhihua Xia, and Chip Hong Chang. Unlocking the capabilities of large vision-language models for generalizable and explainable deepfake detection. InInternational Conference on Machine Learning, 2025. 1
2025
-
[39]
Facednerf: Semantics-driven face reconstruc- tion, prompt editing and relighting with diffusion models
Hao Zhang, Tianyuan Dai, Yanbo Xu, Yu-Wing Tai, and Chi- Keung Tang. Facednerf: Semantics-driven face reconstruc- tion, prompt editing and relighting with diffusion models. In Advances in Neural Information Processing Systems, 2023. 1
2023
-
[40]
Fast text-to-3d-aware face generation and manipulation via direct cross-modal mapping and geometric regularization
Jinlu Zhang, Yiyi Zhou, Qiancheng Zheng, Xiaoxiong Du, Gen Luo, Jun Peng, Xiaoshuai Sun, and Rongrong Ji. Fast text-to-3d-aware face generation and manipulation via direct cross-modal mapping and geometric regularization. InInter- national Conference on Machine Learning, 2024. 1
2024
-
[41]
I can hear you: Selective robust training for deepfake audio detec- tion
Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, and Chengzhi Mao. I can hear you: Selective robust training for deepfake audio detec- tion. InInternational Conference on Learning Representa- tions, 2025. 2
2025
-
[42]
Learning self-consistency for deepfake detection
Tianchen Zhao, Xiang Xu, Mingze Xu, Hui Ding, Yuanjun Xiong, and Wei Xia. Learning self-consistency for deepfake detection. InIEEE International Conference on Computer Vision, 2021. 3
2021
-
[43]
Joint audio-visual deepfake detection
Yipin Zhou and Ser-Nam Lim. Joint audio-visual deepfake detection. InIEEE International Conference on Computer Vision, 2021. 2, 5
2021
-
[44]
Slim: Style-linguistics mismatch model for generalized au- dio deepfake detection.Advances in Neural Information Pro- cessing Systems, 2024
Yi Zhu, Surya Koppisetti, Trang Tran, and Gaurav Bharaj. Slim: Style-linguistics mismatch model for generalized au- dio deepfake detection.Advances in Neural Information Pro- cessing Systems, 2024. 2
2024
-
[45]
Cross-modality and within- modality regularization for audio-visual deepfake detection
Heqing Zou, Meng Shen, Yuchen Hu, Chen Chen, Eng Siong Chng, and Deepu Rajan. Cross-modality and within- modality regularization for audio-visual deepfake detection. InIEEE International Conference on Acoustics, Speech and Signal Processing, 2024. 2, 5 11
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.