AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan
Pith reviewed 2026-05-10 17:34 UTC · model grok-4.3
The pith
A new grand challenge proposes benchmarks to detect deepfakes across speech, music, singing, and sound effects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes the AT-ADD Grand Challenge with two tracks: Robust Speech Deepfake Detection, which tests detectors under real-world conditions and against state-of-the-art unseen speech generators, and All-Type Audio Deepfake Detection, which requires type-agnostic performance across speech, sound, singing, and music using new datasets and reproducible baselines.
What carries the argument
The dual-track evaluation structure that isolates robustness testing for speech from generalization requirements across all audio types.
Load-bearing premise
That standardized datasets and protocols will drive development of detectors capable of generalizing to unseen audio types and real-world distortions.
What would settle it
Challenge results in which leading detectors still show high error rates on non-speech audio, or under common distortions such as compression and noise, would indicate that the proposed tracks have not produced the intended generalization.
Original abstract
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech-centric, often relying on speech-specific artifacts and exhibiting limited robustness to real-world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT-ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real-world scenarios and against unseen, state-of-the-art speech generation methods; and (2) All-Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type-agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT-ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026. It identifies limitations in existing speech-centric audio deepfake detection methods and outlines two tracks: (1) Robust Speech Deepfake Detection, evaluating under real-world distortions and unseen generation techniques, and (2) All-Type Audio Deepfake Detection, extending to heterogeneous audio including sound effects, singing, and music to promote type-agnostic generalization. The plan emphasizes standardized datasets, rigorous protocols, and reproducible baselines.
Significance. If implemented as described, the challenge could meaningfully advance the field by shifting focus from speech-specific artifacts to robust, generalizable detectors across audio types, supporting practical multimedia forensics applications. The emphasis on reproducible baselines and real-world scenarios is a constructive contribution to evaluation standards.
Major comments (2)
- [Abstract / Track 2] The central claim that the tracks will drive 'type-agnostic generalization' and address 'restricted generalization to heterogeneous audio types' is not supported by any concrete specification of held-out audio categories, dataset composition, or cross-type evaluation metrics; without these, the evaluation plan cannot be assessed against its stated goal.
- [Track 1] The protocol for 'unseen, state-of-the-art speech generation methods' lacks detail on how unseen methods are selected and partitioned from training data, which is load-bearing for the robustness claim.
Minor comments (1)
- [Datasets] The manuscript should include a dedicated section or table listing the exact datasets, their sizes, and sources to make the 'standardized datasets' claim verifiable.
Simulated Author's Rebuttal
Thank you for your detailed review and constructive feedback on our manuscript proposing the AT-ADD Grand Challenge. We appreciate the recognition of its potential significance for advancing audio deepfake detection. We address each major comment below and will revise the manuscript accordingly to provide the requested concrete specifications.
Point-by-point responses
- Referee: [Abstract / Track 2] The central claim that the tracks will drive 'type-agnostic generalization' and address 'restricted generalization to heterogeneous audio types' is not supported by any concrete specification of held-out audio categories, dataset composition, or cross-type evaluation metrics; without these, the evaluation plan cannot be assessed against its stated goal.
  Authors: We agree that the current abstract and Track 2 description would benefit from explicit details to support the type-agnostic generalization claims. In the revised manuscript, we will add concrete specifications, including: held-out audio categories (e.g., specific sound-effect classes such as environmental noises, singing styles such as operatic vs. pop vocals, and music genres such as classical vs. electronic that are absent from training), dataset composition with exact training/test splits and type proportions, and cross-type evaluation metrics (e.g., per-type equal error rate and a generalization score across unseen types). These additions will allow direct assessment of the plan's ability to test the stated goals. revision: yes
- Referee: [Track 1] The protocol for 'unseen, state-of-the-art speech generation methods' lacks detail on how unseen methods are selected and partitioned from training data, which is load-bearing for the robustness claim.
  Authors: We acknowledge that more detail is needed on the unseen-methods protocol to substantiate the robustness claim. The revised manuscript will specify the selection criteria for state-of-the-art speech generation methods (e.g., recent ALLM-based synthesizers released after a cutoff date), the partitioning approach used to ensure complete disjointness from the training data (such as source-based or temporal separation), and how this setup evaluates generalization to emerging techniques. This will clarify the evaluation design. revision: yes
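The per-type equal error rate the authors propose as a cross-type metric can be sketched as follows. This is a minimal illustrative implementation, not taken from the challenge plan; the function names and the label convention (1 = bonafide, 0 = fake; higher score = more likely bonafide) are assumptions for the sake of the example.

```python
# Sketch: per-type equal error rate (EER) for cross-type deepfake evaluation.
# Illustrative only; conventions here are assumptions, not the challenge spec.
import numpy as np

def eer(scores, labels):
    """EER: operating point where false-accept rate equals false-reject rate.

    scores: higher = more likely bonafide; labels: 1 = bonafide, 0 = fake.
    """
    order = np.argsort(scores)
    scores = np.asarray(scores, dtype=float)[order]
    labels = np.asarray(labels, dtype=float)[order]
    n_bona, n_fake = labels.sum(), (1 - labels).sum()
    # Sweep a threshold just above each sorted score: samples at or below
    # the threshold are rejected (called fake), the rest accepted.
    frr = np.cumsum(labels) / n_bona          # bonafide wrongly rejected
    far = 1 - np.cumsum(1 - labels) / n_fake  # fakes wrongly accepted
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2

def per_type_eer(scores, labels, types):
    """Compute EER separately per audio type (speech, sound, singing, music)."""
    return {t: eer([s for s, tt in zip(scores, types) if tt == t],
                   [l for l, tt in zip(labels, types) if tt == t])
            for t in set(types)}
```

A type-agnostic detector would then be summarized by, e.g., the worst or mean per-type EER, penalizing models that score well on speech but degrade on music or sound effects.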
Circularity Check
No significant circularity; descriptive challenge proposal only
Full rationale
This is an evaluation plan document proposing two challenge tracks, standardized datasets, and protocols for audio deepfake detection. It contains no derivations, equations, predictions, fitted parameters, or mathematical claims. Background statements about limitations of prior speech-centric methods are descriptive and do not reduce to any self-referential construction or self-citation chain. The forward-looking goals for generalization are aspirational design objectives rather than verifiable results that could be circular by construction.