Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis
Pith reviewed 2026-05-10 15:49 UTC · model grok-4.3
The pith
Existing deepfake detectors tuned for speaking faces fail on listening scenes, but a motion-aware network guided by audio succeeds on the first dedicated dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that speaking-centric deepfake detection models perform poorly when applied to listening deepfakes. In contrast, MANet, a Motion-aware and Audio-guided Network, captures subtle motion inconsistencies in listener videos and leverages speaker audio semantics for cross-modal fusion, achieving significantly superior performance on ListenForge, a dataset constructed from five listening head generation methods.
What carries the argument
MANet, the Motion-aware and Audio-guided Network that detects motion inconsistencies while using speaker audio to guide fusion across modalities.
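The review text does not specify MANet's internals, but the "motion-aware" idea can be illustrated with a minimal sketch: frame-to-frame differences yield a motion signal whose temporal statistics can expose unnaturally smooth or jittery synthesized listener motion. The function name and the jerk statistic here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def motion_features(frames: np.ndarray) -> dict:
    """Summarize temporal motion in a video clip.

    frames: (T, H, W) grayscale clip. Returns simple statistics of the
    frame-difference signal; synthesized listeners may show motion that is
    too smooth (low jerk) or oddly periodic compared with real reactions.
    """
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))  # (T-1, H, W)
    energy = diffs.mean(axis=(1, 2))        # per-step motion energy
    velocity = np.diff(energy)              # change in motion energy
    jerk = np.abs(np.diff(velocity)).mean() if len(velocity) > 1 else 0.0
    return {
        "mean_energy": float(energy.mean()),
        "energy_std": float(energy.std()),
        "mean_abs_jerk": float(jerk),
    }

# Toy usage: a perfectly static clip has zero motion energy.
static_clip = np.ones((8, 4, 4))
feats = motion_features(static_clip)
```

A real detector would of course learn such features end to end; the sketch only pins down what "motion inconsistency" could mean operationally.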
If this is right
- Speaking deepfake detection models will continue to underperform on listening scenarios in interactive settings.
- MANet supplies an effective baseline that exploits motion inconsistencies and audio guidance for listening forgery detection.
- The ListenForge dataset enables systematic study of multimodal forgeries that alternate between speaking and listening states.
- Deepfake detection systems must incorporate both active speech and passive reaction analysis to cover realistic conversation flows.
Where Pith is reading between the lines
- Combining listening and speaking detectors into a single pipeline could support continuous monitoring of full video conversations for deepfakes.
- The motion and audio approach may extend to other non-verbal behaviors such as gestures or facial micro-expressions in forged interactions.
- As listening synthesis improves, detection may shift emphasis toward longer temporal patterns or additional modalities like eye gaze.
Load-bearing premise
The assumption that current listening reaction syntheses contain detectable motion and cross-modal flaws that can be exploited before synthesis quality improves.
What would settle it
The claim would be falsified by a new listening deepfake dataset generated with methods whose head motions and reactions match real listeners so closely that MANet loses its performance edge over speaking detectors.
Original abstract
Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging speaker's audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the task of Listening Deepfake Detection (LDD) for forged listening reactions in interactive settings, presents the ListenForge dataset constructed using five Listening Head Generation (LHG) methods, and proposes MANet, a Motion-aware and Audio-guided Network that captures motion inconsistencies while using speaker audio for cross-modal guidance. It claims that existing Speaking Deepfake Detection (SDD) models perform poorly on ListenForge while MANet achieves significantly superior results, arguing for a shift beyond speaking-centric deepfake detection paradigms.
Significance. If the empirical results hold under rigorous controls, the work would be significant for opening a new direction in multimodal deepfake detection focused on conversational listening states, where synthesis limitations may provide detection opportunities. ListenForge serves as a valuable first benchmark dataset, and MANet offers a specialized architecture that could inspire further multimodal fusion techniques. The overall contribution strengthens the case for domain-specific detectors in interactive scenarios.
Major comments (1)
- §4 (Experiments): The central claim that SDD models perform poorly in listening scenarios (and thus necessitate MANet and a new paradigm) is load-bearing but depends on a fair comparison. The manuscript must clarify whether SDD baselines were evaluated zero-shot or fine-tuned on ListenForge using the same training protocol as MANet; without adaptation results, poor performance may reflect domain shift between speaking and listening videos rather than a fundamental inability to detect listening forgeries.
Minor comments (2)
- Abstract: The abstract asserts superior performance of MANet and poor results from SDD models but provides no quantitative metrics, baseline names, dataset statistics, or error analysis that would allow immediate assessment of the claims.
- §3 (Proposed Method): The description of the audio-guided cross-modal fusion in MANet would benefit from an accompanying diagram or pseudocode to clarify how speaker audio semantics guide motion feature extraction.
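Absent the requested diagram, one plausible reading of "speaker audio semantics guide cross-modal fusion" is a single cross-attention step in which audio tokens query listener motion features. The projections, shapes, and seeding below are hypothetical, a generic sketch rather than the paper's architecture.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def audio_guided_fusion(audio_tokens: np.ndarray,
                        motion_tokens: np.ndarray,
                        seed: int = 0) -> np.ndarray:
    """Cross-attention: audio queries attend over listener motion features.

    audio_tokens:  (Ta, d) speaker-audio embeddings (e.g. from an SSL encoder)
    motion_tokens: (Tv, d) listener motion embeddings
    Returns (Ta, d) fused features where each audio step pools the listener
    motion it should plausibly have triggered.
    """
    rng = np.random.default_rng(seed)
    d = audio_tokens.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = audio_tokens @ Wq, motion_tokens @ Wk, motion_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))    # (Ta, Tv) audio-to-motion alignment
    return attn @ v                         # (Ta, d) audio-guided motion summary

fused = audio_guided_fusion(np.ones((5, 16)), np.ones((7, 16)))
```

The random projections stand in for learned weights; in a trained model the alignment matrix would be where audio semantics actually steer which motion segments dominate the fused representation.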
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The feedback highlights an important aspect of our experimental design that we will clarify and strengthen in the revised manuscript.
Point-by-point responses
Referee: The central claim that SDD models perform poorly in listening scenarios (and thus necessitate MANet and a new paradigm) is load-bearing but depends on a fair comparison. The manuscript must clarify whether SDD baselines were evaluated zero-shot or fine-tuned on ListenForge using the same training protocol as MANet; without adaptation results, poor performance may reflect domain shift between speaking and listening videos rather than fundamental inability to detect listening forgeries.
Authors: We appreciate this observation. In the current manuscript, the SDD baselines were evaluated zero-shot on ListenForge (without fine-tuning) to demonstrate the domain gap between speaking-centric training data and listening scenarios, which underpins our argument for a new LDD paradigm. To address the concern directly, we will revise §4 to explicitly state the evaluation protocol and add a new set of results in which the SDD models are fine-tuned on ListenForge using the identical training protocol, data splits, and hyperparameters as MANet. These additional experiments will show that even after adaptation, the SDD models remain substantially inferior to MANet, confirming that the performance gap is not solely attributable to domain shift.
Revision: yes
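The protocol distinction the rebuttal commits to can be pinned down in code. Below is a minimal harness under assumed interfaces (`model.fit` and `model.score` are hypothetical) where the only difference between the zero-shot and fine-tuned conditions is the adaptation step; AUC is computed with the rank-statistic formula so no external metrics library is assumed.

```python
import numpy as np

def auc(labels, scores) -> float:
    """Rank-based AUC: probability a positive outranks a negative (no ties)."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return float((ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2)
                 / (n_pos * n_neg))

def evaluate(model, train, test, finetune: bool) -> float:
    """Report AUC for one baseline under a shared protocol.

    `train`/`test` are (inputs, labels) pairs. When `finetune` is True the
    baseline is adapted on the same splits MANet uses, so a remaining gap
    cannot be attributed to domain shift alone.
    """
    if finetune:
        model.fit(*train)  # identical splits and hyperparameters as MANet
    return auc(test[1], model.score(test[0]))

# Usage of the metric alone: perfectly separated scores give AUC 1.0.
perfect = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```

Reporting both numbers per baseline, rather than zero-shot only, is what makes the "not solely domain shift" conclusion checkable.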
Circularity Check
No circularity: new task, dataset, and empirical comparisons are self-contained
Full rationale
The paper introduces Listening Deepfake Detection as a new task, constructs the ListenForge dataset from five existing LHG methods, proposes MANet for motion-aware audio-guided detection, and reports empirical results showing SDD models perform poorly while MANet performs better. No equations, derivations, or fitted parameters are present in the provided text. Central claims rest on new data creation and direct performance comparisons rather than any self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation. The derivation chain is independent of its inputs by construction, with no steps that reduce to tautology or prior author work invoked as uniqueness theorems.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Synthesized listening reactions exhibit subtle motion inconsistencies detectable by neural networks.
- Domain assumption: Speaker audio semantics can reliably guide cross-modal fusion for listener video analysis.
Invented entities (1)
- MANet (no independent evidence)