pith. machine review for the scientific record.

arxiv: 2604.08184 · v1 · submitted 2026-04-09 · 💻 cs.SD · cs.AI

Recognition: unknown

AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

Guangtao Zhai, Haonan Cheng, Hengyan Huang, Jian Liu, Jiayi Zhou, Long Ye, Ruibo Fu, Tao Wang, Weiqiang Wang, Xiaopeng Wang, Xiaoxuan Guo, Xiaoying Huang, Yuankun Xie

Pith reviewed 2026-05-10 17:34 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords audio deepfake detection · all-type audio · speech deepfake · multimedia forensics · robust detection · generalization · challenge evaluation · synthetic audio

The pith

A new grand challenge proposes benchmarks to detect deepfakes across speech, music, singing, and sound effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current audio deepfake detection methods focus on speech and often fail when faced with other audio types or real-world distortions. The paper addresses this by introducing the AT-ADD challenge, which supplies standardized datasets and evaluation protocols. One track requires detectors to remain effective against unseen speech generation methods and practical distortions. The second track demands generalization to heterogeneous audio including sound effects, singing voices, and music. The overall aim is to produce type-agnostic detectors that support media verification and security.
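
Read as a protocol, the two tracks factor the problem into orthogonal stress tests. A schematic rendering of that structure as a config, with field names that are illustrative rather than the challenge's official schema:

```python
# Illustrative only: a schematic of the dual-track design described above,
# not the challenge's official configuration format.
AT_ADD_TRACKS = {
    "robust_speech_deepfake_detection": {
        "audio_types": ["speech"],
        "stress_tests": ["unseen_generation_methods", "real_world_distortions"],
    },
    "all_type_audio_deepfake_detection": {
        "audio_types": ["speech", "sound_effects", "singing", "music"],
        "stress_tests": ["unseen_audio_types"],
    },
}
```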

Core claim

The paper establishes the AT-ADD Grand Challenge with two tracks: Robust Speech Deepfake Detection, which tests detectors under real-world conditions and against state-of-the-art unseen speech generators, and All-Type Audio Deepfake Detection, which requires type-agnostic performance across speech, sound, singing, and music using new datasets and reproducible baselines.

What carries the argument

The dual-track evaluation structure that isolates robustness testing for speech from generalization requirements across all audio types.

Load-bearing premise

That standardized datasets and protocols will drive development of detectors capable of generalizing to unseen audio types and real-world distortions.

What would settle it

Challenge results would settle it: if leading detectors continue to show high error rates on non-speech audio or under common distortions such as compression and noise, the proposed tracks will not have produced the intended generalization.
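
In this literature, "error rate" usually means the equal error rate (EER), and "common distortions" can be simulated directly. A minimal sketch of how one might probe robustness, assuming a scalar detector score where higher means more likely bona fide; the detector interface and the white-noise model are illustrative, not the challenge's official protocol:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false-accept rate equals miss rate.

    labels: 1 = bona fide, 0 = deepfake; scores: higher = more bona fide.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[i] + fnr[i]) / 2.0)

def add_noise(wav, snr_db, rng=None):
    """Mix white noise into a waveform at a target SNR in dB."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(wav))
    gain = np.sqrt(np.mean(wav ** 2)) / (np.sqrt(np.mean(noise ** 2)) * 10 ** (snr_db / 20.0))
    return wav + gain * noise
```

Comparing the EER on clean audio against the EER after `add_noise` (or after a codec round-trip) quantifies exactly the robustness gap Track 1 is meant to close.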

Figures

Figures reproduced from arXiv: 2604.08184 by Guangtao Zhai, Haonan Cheng, Hengyan Huang, Jian Liu, Jiayi Zhou, Long Ye, Ruibo Fu, Tao Wang, Weiqiang Wang, Xiaopeng Wang, Xiaoxuan Guo, Xiaoying Huang, Yuankun Xie.

Figure 1. AT-ADD challenge overview. view at source ↗
Original abstract

The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech-centric, often relying on speech-specific artifacts and exhibiting limited robustness to real-world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT-ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real-world scenarios and against unseen, state-of-the-art speech generation methods; and (2) All-Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type-agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT-ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026. It identifies limitations in existing speech-centric audio deepfake detection methods and outlines two tracks: (1) Robust Speech Deepfake Detection, evaluating under real-world distortions and unseen generation techniques, and (2) All-Type Audio Deepfake Detection, extending to heterogeneous audio including sound effects, singing, and music to promote type-agnostic generalization. The plan emphasizes standardized datasets, rigorous protocols, and reproducible baselines.

Significance. If implemented as described, the challenge could meaningfully advance the field by shifting focus from speech-specific artifacts to robust, generalizable detectors across audio types, supporting practical multimedia forensics applications. The emphasis on reproducible baselines and real-world scenarios is a constructive contribution to evaluation standards.

major comments (2)
  1. [Abstract / Track 2] The central claim that the tracks will drive 'type-agnostic generalization' and address 'restricted generalization to heterogeneous audio types' is not supported by any concrete specification of held-out audio categories, dataset composition, or cross-type evaluation metrics; without these, the evaluation plan cannot be assessed against its stated goal. One possible cross-type metric is sketched after the comment lists.
  2. [Track 1] The protocol for 'unseen, state-of-the-art speech generation methods' lacks detail on how unseen methods are selected or partitioned from training data, which is load-bearing for the robustness claim.
minor comments (1)
  1. [Datasets] The manuscript should include a dedicated section or table listing the exact datasets, their sizes, and sources to make the 'standardized datasets' claim verifiable.
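
To make the cross-type metric in major comment 1 concrete: a per-type equal error rate with a worst-case aggregate is one natural choice, since averaging across types lets a speech-only detector hide weak music or sound-effect performance. A minimal sketch, reusing the `equal_error_rate` helper from the earlier sketch; the `audio_type` trial annotation is an assumption, not the paper's schema:

```python
from collections import defaultdict

def per_type_eer(trials):
    """trials: iterable of (audio_type, label, score) tuples.

    Returns {audio_type: EER} plus a worst-case aggregate, so a detector
    that is strong on speech but weak on music or sound effects is
    penalized rather than averaged away.
    """
    by_type = defaultdict(lambda: ([], []))
    for audio_type, label, score in trials:
        labels, scores = by_type[audio_type]
        labels.append(label)
        scores.append(score)
    eers = {t: equal_error_rate(l, s) for t, (l, s) in by_type.items()}
    eers["worst_case"] = max(eers.values())
    return eers
```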

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your detailed review and constructive feedback on our manuscript proposing the AT-ADD Grand Challenge. We appreciate the recognition of its potential significance for advancing audio deepfake detection. We address each major comment below and will revise the manuscript accordingly to provide the requested concrete specifications.

Point-by-point responses
  1. Referee: [Abstract / Track 2] The central claim that the tracks will drive 'type-agnostic generalization' and address 'restricted generalization to heterogeneous audio types' is not supported by any concrete specification of held-out audio categories, dataset composition, or cross-type evaluation metrics; without these, the evaluation plan cannot be assessed against its stated goal.

    Authors: We agree that the current abstract and Track 2 description would benefit from explicit details to support the type-agnostic generalization claims. In the revised manuscript, we will add concrete specifications including: held-out audio categories (e.g., specific sound-effect classes like environmental noises, singing styles such as operatic vs. pop vocals, and music genres like classical vs. electronic, none present in training), dataset composition with exact training/test splits and type proportions, and cross-type evaluation metrics (e.g., per-type equal error rate and a generalization score across unseen types). These additions will allow direct assessment of the plan's ability to test the stated goals. Revision: yes.

  2. Referee: [Track 1] The protocol for 'unseen, state-of-the-art speech generation methods' lacks detail on how unseen methods are selected or partitioned from training data, which is load-bearing for the robustness claim.

    Authors: We acknowledge that more detail is needed on the unseen-methods protocol to substantiate the robustness claim. The revised manuscript will specify the selection criteria for state-of-the-art speech generation methods (e.g., recent ALLM-based synthesizers released after a cutoff date), the partitioning approach that guarantees complete disjointness from training data (such as source-based or temporal separation), and how this setup evaluates generalization to emerging techniques; a minimal illustration of such a partition follows these responses. Revision: yes.
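
For the partition in response 2, a temporal cutoff on generator release dates is the simplest construction consistent with the rebuttal. A minimal sketch under that assumption; the `generators` records and `release_date` field are hypothetical stand-ins for whatever the revised plan specifies:

```python
from datetime import date

def split_by_cutoff(generators, cutoff):
    """Temporal partition: generators released before the cutoff may
    contribute training spoofs; later ones are held out as 'unseen'."""
    seen = [g for g in generators if g["release_date"] < cutoff]
    unseen = [g for g in generators if g["release_date"] >= cutoff]
    # Disjointness check: no generator may appear on both sides.
    assert not {g["name"] for g in seen} & {g["name"] for g in unseen}
    return seen, unseen

# e.g. split_by_cutoff(
#     [{"name": "GlowTTS", "release_date": date(2020, 5, 1)},
#      {"name": "CosyVoice", "release_date": date(2024, 7, 1)}],
#     cutoff=date(2024, 1, 1))
```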

Circularity Check

0 steps flagged

No significant circularity; descriptive challenge proposal only

Full rationale

This is an evaluation plan document proposing two challenge tracks, standardized datasets, and protocols for audio deepfake detection. It contains no derivations, equations, predictions, fitted parameters, or mathematical claims. Background statements about limitations of prior speech-centric methods are descriptive and do not reduce to any self-referential construction or self-citation chain. The forward-looking goals for generalization are aspirational design objectives rather than verifiable results that could be circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The document is an organizational proposal for a detection challenge and contains no scientific derivations, fitted parameters, background axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5603 in / 1111 out tokens · 98394 ms · 2026-05-10T17:34:59.715932+00:00 · methodology

