arxiv: 2604.09110 · v1 · submitted 2026-04-10 · 💻 cs.MM

Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes

Pith reviewed 2026-05-10 16:35 UTC · model grok-4.3

keywords: deepfake detection, video deepfakes, audio-visual correspondence, pseudo-fakes, generalizability, self-generated data, multimodal analysis, forgery detection

The pith

Training deepfake detectors solely on real videos plus self-generated pseudo-fakes improves detection of unseen deepfakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing deepfake detection methods struggle with generalizability because training datasets lack diversity in audio-visual mismatch patterns. The paper proposes AVPF to generate pseudo-fake samples from authentic videos that simulate those patterns found in real deepfakes. Models trained only on authentic data and these pseudo-fakes show better performance on standard benchmarks without ever seeing actual deepfakes. This approach matters as it could allow effective detectors to be built and updated without access to large sets of real forged videos.

Core claim

The paper claims that detectors trained exclusively on authentic data plus self-generated Audio-Visual Pseudo-Fakes generalize notably better to unseen deepfakes. The pseudo-fakes are created from authentic samples alone and incorporate diverse cross-modal correspondence patterns typical of real-world deepfakes. The reported average performance improvement is up to 7.4% across multiple standard datasets.

What carries the argument

AVPF, a generation process that produces pseudo-fake video samples from real ones by altering audio-visual alignments to mimic common deepfake discrepancies, serving as synthetic training data to expose models to varied mismatch patterns.
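As a rough illustration of what "altering audio-visual alignments" could look like, the sketch below builds one pseudo-fake per authentic clip by circularly shifting the audio features in time. This is an assumption made purely for illustration, not the paper's AVSB or AVSS strategy, which this page does not describe. The function names, the `(frames, dims)` feature layout, and `max_shift` are all hypothetical.

```python
import numpy as np

def shift_audio(audio_feats, shift):
    """Circularly shift audio features along the time axis (axis 0).

    Hypothetical stand-in for an alignment-breaking operation.
    """
    n_frames = audio_feats.shape[0]
    if shift % n_frames == 0:
        raise ValueError("shift must actually move the audio track")
    return np.roll(audio_feats, shift, axis=0)

def build_training_set(clips, rng, max_shift=5):
    """Return (video, audio, label) triples: 0 = authentic, 1 = pseudo-fake.

    `clips` is a list of (video_feats, audio_feats) pairs taken from
    authentic videos only; no real deepfake is involved.
    """
    samples = []
    for video, audio in clips:
        samples.append((video, audio, 0))  # untouched authentic clip
        largest = min(max_shift, audio.shape[0] - 1)
        if largest < 1:
            raise ValueError("clip too short to misalign")
        magnitude = int(rng.integers(1, largest + 1))  # 1..largest inclusive
        sign = int(rng.choice([-1, 1]))                # shift earlier or later
        samples.append((video, shift_audio(audio, sign * magnitude), 1))
    return samples
```

A detector would then be trained on `samples` as an ordinary binary classification set, with the pseudo-fakes standing in for the forged class.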

If this is right

Models achieve better results on unseen deepfakes without training on real fakes.
The method relies only on authentic videos for both real and fake training examples.
Generalizability improves due to exposure to a broader range of audio-visual correspondence issues.
Performance gains are demonstrated on multiple standard deepfake datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This technique might allow training in data-scarce environments where real deepfakes are not available for ethical or legal reasons.
Similar pseudo-sample generation could be explored for other detection tasks involving multimodal data.
If successful, it suggests that the key to generalization lies in simulating the artifacts rather than collecting them.

Load-bearing premise

The self-generated pseudo-fakes must accurately replicate the diverse audio-visual mismatch patterns that appear in actual deepfakes created by various methods.

What would settle it

Testing the trained model on a dataset of deepfakes featuring audio-visual inconsistencies that differ substantially from those in the pseudo-fakes and observing no performance gain or a decrease compared to baseline methods.
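The test described above amounts to scoring two detectors on the same held-out set and comparing them. A minimal sketch, assuming AUC as the metric and that higher scores mean "more likely fake"; `auc` and `generalization_gain` are hypothetical helper names, not functions from the paper.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    fake = scores[labels == 1]
    real = scores[labels == 0]
    if fake.size == 0 or real.size == 0:
        raise ValueError("need at least one real and one fake sample")
    diff = fake[:, None] - real[None, :]  # every (fake, real) score pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

def generalization_gain(labels, baseline_scores, candidate_scores):
    """AUC of the candidate detector minus AUC of the baseline.

    A value at or below zero on deepfakes whose inconsistencies differ
    from the pseudo-fakes would count against the central claim.
    """
    return auc(candidate_scores, labels) - auc(baseline_scores, labels)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sanity check but not for large benchmarks.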

Figures

Figures reproduced from arXiv: 2604.09110 by Yuezun Li, Zihe Wei.

**Figure 1.** ■ and ■ denote authentic and deepfake training samples with visual v_t and audio a_t modalities. (a) Training on known datasets suffers from limited generalizability, as the distribution of real-world deepfakes is far more complex. (b) Several recent works employ self-supervised learning to model the audio–visual correspondence of authentic videos and identify deepfakes by measuring deviations from the learne…

**Figure 2.** Overview of Audio-Visual Self-Blending (AVSB) strat…

**Figure 3.** Overview of Audio-Visual Self-Splicing (AVSS) strat…

**Figure 4.** (a,b,c,d) correspond to analysis results of authentic videos, easy deepfakes, naive pseudo-fakes, and our method. The top-to…

**Figure 5.** Distribution of prediction scores for AVH-Align (left column) and AVPF (right column) on the AV1M (top), FAVC (middle) and…

**Figure 6.** Robustness comparison between AVH-Align and AVPF (ours) under five different image degradations: JPEG compression,…

Original abstract

Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed architectures with publicly available datasets. While they have shown promising results, their effectiveness often degrades in real-world scenarios, as the limited diversity of training datasets naturally restricts generalizability to unseen cases. To address this, we propose a simple yet effective method, called AVPF, which can notably enhance model generalizability by training with self-generated Audio-Visual Pseudo-Fakes. The key idea of AVPF is to create pseudo-fake training samples that contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes. We highlight that AVPF is generated solely from authentic samples, and training relies only on authentic data and AVPF, without requiring any real deepfakes. Extensive experiments on multiple standard datasets demonstrate the strong generalizability of the proposed method, achieving an average performance improvement of up to 7.4%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AVPF generates pseudo-fakes from real videos to train detectors without real deepfakes, but the abstract gives almost no method or result details to evaluate the 7.4% gain.

The paper's main move is to build training examples called AVPF by altering real audio-visual clips to create mismatches, then train detectors on those plus the original real data. This avoids any need for actual deepfake samples, which is a practical angle for generalization work in this space. It directly targets the common issue that models overfit to the limited fakes in public datasets and then fail on new generators. That construction from authentic data only is the concrete new piece here, and it fits the ongoing push for cross-modal detectors that look at lip sync and timing issues. The reported average lift of 7.4% across standard datasets is the kind of number that would matter if the experiments hold up. The approach is straightforward enough that it could be tried quickly by groups already running audio-visual baselines.

The main weakness is that the abstract supplies no description of how the pseudo-fakes are actually made, no training protocol, no baseline tables, and no error bars. Without those, the performance claim sits on thin ground. The stress-test point also lands: simple swaps or misalignments on real clips are likely to leave different traces than the blending or temporal artifacts that come out of GAN or diffusion pipelines. If the detector ends up keying on the controllable generation steps instead of the subtler real-world inconsistencies, the generalization benefit will shrink on unseen cases.

This is for people already working on audio-visual deepfake detection who need ideas for data augmentation when real fakes are scarce. A reader who wants to test new training tricks might get value from the full version, but only if it includes ablations on the generation steps and transfer tests to multiple unseen synthesizers. It deserves a serious referee to check the experiments and see whether the pseudo-fakes actually close the gap or just create an easier detection task.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes AVPF, a method to generate Audio-Visual Pseudo-Fakes solely from authentic video samples in order to train deepfake detectors. The central claim is that this training strategy, which avoids any real deepfake data, produces models with substantially better generalization to unseen deepfakes, yielding an average performance gain of up to 7.4% across multiple standard datasets.

Significance. If the empirical claims are substantiated, the work would offer a practical route to improving detector robustness without relying on scarce or generator-specific deepfake corpora. The core idea of synthesizing controllable cross-modal inconsistencies from real data is attractive for the field, but its value hinges on whether the generated artifacts actually transfer to the implicit synthesis errors produced by contemporary GAN- and diffusion-based pipelines.

major comments (2)

Abstract: the stated 'average performance improvement of up to 7.4%' is presented without any accompanying description of the pseudo-fake generation procedure, the train/test splits, the baseline detectors, the evaluation metrics, or error bars. This absence leaves the central generalization claim without verifiable support in the provided text.
Method description (inferred from abstract): the claim that self-generated AVPF 'contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes' requires explicit justification. Because generation begins from authentic samples, the resulting inconsistencies are produced by explicit, controllable operations (e.g., audio swapping or temporal misalignment). These may differ systematically from the implicit, model-specific artifacts left by real synthesis pipelines; the manuscript must demonstrate that detectors trained on the former still detect the latter on held-out generators.
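To make one of the referee's examples of an explicit operation concrete, here is a minimal sketch of audio swapping within a batch, where every clip receives the audio track of a different clip. The function name `swap_audio`, the batched array layout `(clips, frames, dims)`, and the cyclic-offset trick are illustrative assumptions; nothing here is taken from the manuscript.

```python
import numpy as np

def swap_audio(batch_audio, rng):
    """Give each clip the audio of another clip in the same batch.

    A cyclic offset in 1..n_clips-1 guarantees that no clip keeps its
    own audio. Returns the swapped audio and the donor index per clip.
    """
    n_clips = batch_audio.shape[0]
    if n_clips < 2:
        raise ValueError("need at least two clips to swap audio")
    offset = int(rng.integers(1, n_clips))  # upper bound is exclusive
    donor = (np.arange(n_clips) + offset) % n_clips
    return batch_audio[donor], donor
```

A swap of this kind produces a gross, fully controllable mismatch, which is the sort of explicit artifact the referee suspects may differ from the subtler traces left by real synthesis pipelines.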

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the full paper content and indicating planned revisions where appropriate to enhance clarity and support for our claims.

Referee: Abstract: the stated 'average performance improvement of up to 7.4%' is presented without any accompanying description of the pseudo-fake generation procedure, the train/test splits, the baseline detectors, the evaluation metrics, or error bars. This absence leaves the central generalization claim without verifiable support in the provided text.

Authors: We acknowledge that the abstract is intentionally concise and omits granular details due to typical length constraints. However, the full manuscript provides all requested information in dedicated sections: the AVPF generation procedure (including specific audio-visual manipulations from authentic samples) is described in Section 3; train/test splits and datasets are detailed in Section 4 along with cross-generator evaluation protocols; baseline detectors and metrics (AUC, EER) are specified in Section 4.1; and results include means with standard deviations (error bars) across runs in Tables 1-4. To address the concern directly, we will revise the abstract to include a brief supporting clause: 'AVPF creates pseudo-fakes via controllable audio-visual manipulations on real videos only, evaluated on standard datasets with held-out generators, achieving up to 7.4% average improvement.' This maintains brevity while adding context.
Revision: yes
Referee: Method description (inferred from abstract): the claim that self-generated AVPF 'contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes' requires explicit justification. Because generation begins from authentic samples, the resulting inconsistencies are produced by explicit, controllable operations (e.g., audio swapping or temporal misalignment). These may differ systematically from the implicit, model-specific artifacts left by real synthesis pipelines; the manuscript must demonstrate that detectors trained on the former still detect the latter on held-out generators.

Authors: We agree that explicit justification strengthens the contribution and that the distinction between explicit and implicit artifacts merits discussion. Our generation operations are deliberately chosen to replicate prevalent real-world deepfake inconsistencies (e.g., lip desynchronization, audio-visual mismatches) that appear across GAN- and diffusion-based pipelines. The primary demonstration of transfer is empirical: models trained exclusively on real data plus AVPF (no real deepfakes) are tested on held-out deepfake corpora from unseen generators, yielding consistent gains up to 7.4% as reported in the experiments. This indicates effective generalization to implicit artifacts. In revision, we will add a new paragraph in Section 3 explicitly linking each generation operation to observed real deepfake patterns and discussing the rationale for transferability.
Revision: partial
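The rebuttal names AUC and EER as metrics and means with standard deviations across runs as the reporting format. For readers less familiar with EER, the sketch below finds the threshold where the false-positive and false-negative rates are closest and reports their average, then summarizes several runs. The helper names and the choice of the sample standard deviation (`ddof=1`) are assumptions; the manuscript's exact protocol is not given on this page.

```python
import numpy as np

def eer(scores, labels):
    """Approximate equal error rate; higher score = more likely fake."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    fake = labels == 1
    real = labels == 0
    if not fake.any() or not real.any():
        raise ValueError("need at least one real and one fake sample")
    best_gap, best_rate = np.inf, None
    # Try every observed score as a threshold, plus one above all scores.
    for threshold in np.concatenate([np.unique(scores), [np.inf]]):
        flagged = scores >= threshold
        fpr = (flagged & real).sum() / real.sum()   # real clips flagged as fake
        fnr = (~flagged & fake).sum() / fake.sum()  # fakes that slipped through
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, best_rate = gap, (fpr + fnr) / 2
    return float(best_rate)

def summarize_runs(values):
    """Mean and sample standard deviation of a metric over repeated runs."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))
```

With few test samples the two error rates may never cross exactly, so the value returned is the midpoint at the closest threshold, not an interpolated crossing.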

Circularity Check

0 steps flagged

No circularity: method is empirical data augmentation without derivations or load-bearing self-citations

The paper proposes AVPF as a technique to generate pseudo-fake samples from authentic video data only, then trains detectors on authentic samples plus these pseudo-fakes. No equations, first-principles derivations, or mathematical reductions appear in the abstract or described method. Claims of improved generalizability rest on experimental results across standard datasets rather than any self-referential fitting or uniqueness theorem imported from prior author work. The central premise (that self-generated pseudo-fakes capture relevant cross-modal patterns) is an empirical hypothesis, not a definitional or fitted tautology. This is a standard non-circular ML method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified premise that pseudo-fakes can faithfully reproduce the statistical distribution of real deepfake inconsistencies. The abstract specifies no free parameters; the axiom and invented entity listed below are inferred from its claims.

axioms (1)

domain assumption: Self-generated pseudo-fakes from authentic data contain diverse audio-visual correspondence patterns observed in real deepfakes
This assumption underpins the claim that training on real data plus pseudo-fakes yields improved generalization.

invented entities (1)

AVPF (Audio-Visual Pseudo-Fakes) · no independent evidence
purpose: Training samples that simulate real deepfake inconsistencies without using actual deepfakes
Newly introduced concept whose effectiveness is asserted but not demonstrated in the abstract.
