pith. machine review for the scientific record.

arxiv: 2604.12650 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.MM

Recognition: unknown

Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis


Pith reviewed 2026-05-10 15:49 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords listening deepfake detection · deepfake detection · motion inconsistency · audio-guided fusion · ListenForge dataset · MANet · multimodal forgery analysis

The pith

Existing deepfake detectors tuned for speaking faces fail on listening scenes, but a motion-aware network guided by audio succeeds on the first dedicated dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that deepfake detection has been limited to speaking scenarios and must now address listening states that appear in realistic interactive conversations. Attackers can alternate between falsifying speech and reactions to make scenarios more persuasive, yet current models show poor results when the subject is only listening. The authors introduce ListenForge, a dataset built from five listening head generation methods, and present MANet to detect subtle motion inconsistencies while using the speaker's audio to guide cross-modal analysis. Experiments confirm that speaking-focused detectors underperform here while MANet delivers markedly better accuracy. This shift opens detection to the full range of conversational behaviors rather than isolated speech.

Core claim

The central claim is that speaking-centric deepfake detection models perform poorly when applied to listening deepfakes. MANet, a Motion-aware and Audio-guided Network, captures subtle motion inconsistencies in listener videos and leverages speaker audio semantics for cross-modal fusion, achieving significantly superior performance on ListenForge, a dataset constructed from five listening head generation methods.

What carries the argument

MANet, the Motion-aware and Audio-guided Network that detects motion inconsistencies while using speaker audio to guide fusion across modalities.
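The review exposes no layer-level detail for MANet, so any concrete rendering is a guess. As a minimal sketch of the motion-aware idea only, assuming consecutive-frame differencing as the motion cue (module names and shapes below are invented for illustration, not the authors' design):

    import torch
    import torch.nn as nn

    class MotionCueEncoder(nn.Module):
        """Illustrative motion-aware encoder: consecutive-frame differences
        as a cheap motion cue, pooled by a small temporal convolution.
        Hypothetical sketch, not the MANet module from the paper."""

        def __init__(self, channels: int = 3, dim: int = 64):
            super().__init__()
            self.spatial = nn.Conv2d(channels, dim, kernel_size=3, padding=1)
            self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (B, T, C, H, W) listener video clip
            diffs = frames[:, 1:] - frames[:, :-1]              # (B, T-1, C, H, W)
            b, t, c, h, w = diffs.shape
            feat = self.spatial(diffs.reshape(b * t, c, h, w))  # per-step spatial features
            feat = feat.mean(dim=(2, 3)).reshape(b, t, -1)      # pool to (B, T-1, dim)
            feat = self.temporal(feat.transpose(1, 2))          # smooth over motion steps
            return feat.transpose(1, 2)                         # (B, T-1, dim)

Any detector in this family must decide what counts as "motion"; frame differencing is the cheapest choice, and the paper's actual module (Figure 4b) may use something richer, such as learned flow or landmark trajectories.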

If this is right

  • Speaking deepfake detection models will continue to underperform on listening scenarios in interactive settings.
  • MANet supplies an effective baseline that exploits motion inconsistencies and audio guidance for listening forgery detection.
  • The ListenForge dataset enables systematic study of multimodal forgeries that alternate between speaking and listening states.
  • Deepfake detection systems must incorporate both active speech and passive reaction analysis to cover realistic conversation flows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining listening and speaking detectors into a single pipeline could support continuous monitoring of full video conversations for deepfakes.
  • The motion and audio approach may extend to other non-verbal behaviors such as gestures or facial micro-expressions in forged interactions.
  • As listening synthesis improves, detection may shift emphasis toward longer temporal patterns or additional modalities like eye gaze.

Load-bearing premise

The assumption that current listening reaction syntheses contain detectable motion and cross-modal flaws that can be exploited before synthesis quality improves.

What would settle it

The claim would be falsified by a new listening deepfake dataset whose generation methods produce head motions and reactions so close to real listeners' that MANet loses its performance edge over speaking detectors.

Figures

Figures reproduced from arXiv: 2604.12650 by Fangda Wei, Jing Wang, Miao Liu, Xinyuan Qian.

Figure 1. A visualization comparison between common speak…
Figure 3. Distribution of the ListenForge dataset. (a) The ratio…
Figure 4. Illustration of the proposed MANet. (a) illustrates the overall framework. (b) illustrates the motion-aware module. (c) …
Figure 5. Illustration of the spatial and channel attention.
Figure 6. Visualization of the attention maps. The redder the…
read the original abstract

Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging speaker's audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the task of Listening Deepfake Detection (LDD) for forged listening reactions in interactive settings, presents the ListenForge dataset constructed using five Listening Head Generation (LHG) methods, and proposes MANet, a Motion-aware and Audio-guided Network that captures motion inconsistencies while using speaker audio for cross-modal guidance. It claims that existing Speaking Deepfake Detection (SDD) models perform poorly on ListenForge while MANet achieves significantly superior results, arguing for a shift beyond speaking-centric deepfake detection paradigms.

Significance. If the empirical results hold under rigorous controls, the work would be significant for opening a new direction in multimodal deepfake detection focused on conversational listening states, where synthesis limitations may provide detection opportunities. ListenForge serves as a valuable first benchmark dataset, and MANet offers a specialized architecture that could inspire further multimodal fusion techniques. The overall contribution strengthens the case for domain-specific detectors in interactive scenarios.

major comments (1)
  1. [§4 (Experiments)] The central claim that SDD models perform poorly in listening scenarios (and thus necessitate MANet and a new paradigm) is load-bearing but depends on a fair comparison. The manuscript must clarify whether SDD baselines were evaluated zero-shot or fine-tuned on ListenForge under the same training protocol as MANet; without adaptation results, poor performance may reflect domain shift between speaking and listening videos rather than a fundamental inability to detect listening forgeries.
minor comments (2)
  1. [Abstract] The abstract asserts superior performance for MANet and poor results for SDD models but gives no quantitative metrics, baseline names, dataset statistics, or error analysis that would allow immediate assessment of the claims.
  2. [§3 (Proposed Method)] The description of the audio-guided cross-modal fusion in MANet would benefit from an accompanying diagram or pseudocode clarifying how speaker audio semantics guide motion feature extraction (a hedged illustration follows below).
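For concreteness, one common way to realize "audio guides cross-modal fusion" is cross-attention with speaker-audio embeddings as queries over listener motion features. The sketch below is an assumption for illustration, not the mechanism the paper defines in its §3:

    import torch
    import torch.nn as nn

    class AudioGuidedFusion(nn.Module):
        """Hypothetical audio-guided fusion: speaker audio embeddings query
        listener motion features via cross-attention. A sketch only; the
        paper's actual fusion design is what the referee asks to see."""

        def __init__(self, dim: int = 64, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.head = nn.Linear(dim, 1)  # real-vs-fake logit

        def forward(self, audio: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
            # audio:  (B, Ta, dim) speaker audio embeddings (e.g. wav2vec-style)
            # motion: (B, Tv, dim) listener motion features
            fused, _ = self.attn(query=audio, key=motion, value=motion)
            return self.head(fused.mean(dim=1)).squeeze(-1)  # (B,) logits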

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback highlights an important aspect of our experimental design that we will clarify and strengthen in the revised manuscript.

read point-by-point responses
  1. Referee: The central claim that SDD models perform poorly in listening scenarios (and thus necessitate MANet and a new paradigm) is load-bearing but depends on a fair comparison. The manuscript must clarify whether SDD baselines were evaluated zero-shot or fine-tuned on ListenForge using the same training protocol as MANet; without adaptation results, poor performance may reflect domain shift between speaking and listening videos rather than fundamental inability to detect listening forgeries.

    Authors: We appreciate this observation. In the current manuscript, the SDD baselines were evaluated zero-shot on ListenForge (without fine-tuning) to demonstrate the domain gap between speaking-centric training data and listening scenarios, which underpins our argument for a new LDD paradigm. To address the concern directly, we will revise §4 to explicitly state the evaluation protocol and add a new set of results in which the SDD models are fine-tuned on ListenForge using the identical training protocol, data splits, and hyperparameters as MANet. These additional experiments will show that even after adaptation, the SDD models remain substantially inferior to MANet, confirming that the performance gap is not solely attributable to domain shift. revision: yes
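The disputed point reduces to a concrete protocol. A hedged sketch of what "identical training protocol" could mean in code, with all model and loader names as placeholders:

    import torch

    @torch.no_grad()
    def evaluate(model, loader) -> float:
        """Accuracy of a binary real/fake video detector on a labeled loader."""
        model.eval()
        correct = total = 0
        for clips, labels in loader:
            preds = (model(clips).sigmoid() > 0.5).long().view(-1)
            correct += (preds == labels.view(-1)).sum().item()
            total += labels.numel()
        return correct / max(total, 1)

    def compare_on_listenforge(sdd_baselines, manet, train_loader, test_loader, finetune):
        """Report zero-shot and fine-tuned SDD accuracy next to MANet's.
        `finetune` must reuse MANet's splits, schedule, and hyperparameters
        for the comparison to support the paper's claim."""
        results = {}
        for name, model in sdd_baselines.items():
            results[f"{name} (zero-shot)"] = evaluate(model, test_loader)
            results[f"{name} (fine-tuned)"] = evaluate(finetune(model, train_loader), test_loader)
        results["MANet"] = evaluate(manet, test_loader)
        return results

Under this protocol, a gap that persists after fine-tuning, as the authors promise to show, would isolate the architectural contribution from mere domain shift.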

Circularity Check

0 steps flagged

No circularity: new task, dataset, and empirical comparisons are self-contained

full rationale

The paper introduces Listening Deepfake Detection as a new task, constructs the ListenForge dataset from five existing LHG methods, proposes MANet for motion-aware audio-guided detection, and reports empirical results showing SDD models perform poorly while MANet performs better. No equations, derivations, or fitted parameters are present in the provided text. Central claims rest on new data creation and direct performance comparisons rather than any self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation. The derivation chain is independent of its inputs by construction, with no steps that reduce to tautology or prior author work invoked as uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the assumption that synthesized listening videos contain detectable motion artifacts distinct from real reactions and that speaker audio provides useful guidance for fusion; these are domain assumptions in multimodal computer vision rather than derived results.

axioms (2)
  • domain assumption Synthesized listening reactions exhibit subtle motion inconsistencies detectable by neural networks
    Invoked in the design of the motion-aware component of MANet and the motivation for the LDD task.
  • domain assumption Speaker audio semantics can reliably guide cross-modal fusion for listener video analysis
    Core premise of the audio-guided fusion mechanism in MANet.
invented entities (1)
  • MANet (no independent evidence)
    purpose: Motion-aware and audio-guided network for detecting listening deepfakes
    Newly proposed architecture whose effectiveness is demonstrated only within the paper's experiments.

pith-pipeline@v0.9.0 · 5558 in / 1326 out tokens · 170301 ms · 2026-05-10T15:49:20.778356+00:00 · methodology

