EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
Pith reviewed 2026-05-18 04:58 UTC · model grok-4.3
The pith
The EchoFake dataset adds physical replay recordings so that speech deepfake detectors trained on it generalize better to real-world conditions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a dataset containing both cutting-edge synthetic speech and its physical replays collected across devices and environments produces detection models that generalize more effectively, measured by reduced average equal error rates when the same models are evaluated on other datasets.
What carries the argument
The EchoFake dataset, which pairs zero-shot TTS speech with physical replay recordings collected under varied real-world device and environmental conditions.
If this is right
- Models trained on EchoFake exhibit lower average equal error rates across multiple evaluation datasets than models trained on existing collections.
- The approach directly targets the observed accuracy drop to 59.6 percent when prior models face replayed audio.
- The dataset supplies a more realistic training foundation for methods intended to operate against replay attacks in practical settings.
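The equal error rate (EER) that these points rely on is the operating point where the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch, not the paper's evaluation code: it assumes each detector emits one score per utterance with higher meaning "more likely bona fide", and the function names and the dataset dictionary layout are hypothetical.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by scanning every observed score as a threshold.

    Assumption: higher score means the utterance is more likely bona fide.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: a bona fide utterance scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed utterance scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


def average_eer(score_sets):
    """Mean EER over evaluation sets.

    score_sets maps a (hypothetical) dataset name to a pair
    (bonafide_scores, spoof_scores) produced by one trained model.
    """
    eers = [equal_error_rate(b, s) for b, s in score_sets.values()]
    return sum(eers) / len(eers)
```

Comparing `average_eer` for a model trained on EchoFake against the same architecture trained on another corpus, over the same evaluation sets, is the kind of comparison the generalization claim describes.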
Where Pith is reading between the lines
- Systems protecting telephone or voice-authentication services could incorporate similar replay material to reduce successful low-cost impersonation attempts.
- Extending the collection to additional acoustic environments or device types would further test whether the generalization benefit scales.
- Detection pipelines might combine EchoFake-style replay data with other attack vectors such as voice conversion to cover a wider range of real threats.
Load-bearing premise
The collected replay recordings from chosen devices and settings stand in for the replay attacks that detection systems will actually encounter once deployed.
What would settle it
Train models on EchoFake, then test them on a new collection of replayed audio made with entirely different devices, rooms, and speakers; if the equal error rate improvement disappears, the claimed generalization gain is refuted.
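A test of this kind is only meaningful if the new collection really shares no devices, rooms, or speakers with the training data. The sketch below shows one way to check that before running the evaluation. The `Recording` fields and function names are hypothetical and are not taken from the EchoFake metadata format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    # Hypothetical metadata fields; a real corpus may label these differently.
    speaker: str
    device: str
    room: str


def shared_conditions(train, held_out):
    """Return, for each attribute, the values present in both splits."""
    overlap = {}
    for field in ("speaker", "device", "room"):
        train_values = {getattr(r, field) for r in train}
        held_out_values = {getattr(r, field) for r in held_out}
        overlap[field] = train_values & held_out_values
    return overlap


def is_fully_disjoint(train, held_out):
    """True only if no speaker, device, or room appears in both splits."""
    return all(not values for values in shared_conditions(train, held_out).values())
```

Any non-empty entry in the result of `shared_conditions` names a condition that leaks between the splits and would weaken the test.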
Original abstract
The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EchoFake, a dataset exceeding 120 hours of audio from over 13,000 speakers that includes both zero-shot TTS-generated speech and physical replay recordings collected with varied devices and real-world environmental settings. It reports that models trained on prior datasets suffer severe degradation (average accuracy dropping to 59.6%) when evaluated on replayed audio, while models trained on EchoFake exhibit lower average EERs across datasets, indicating improved generalization for practical speech deepfake detection in scenarios such as telephone fraud.
Significance. If the physical replay component is shown to be representative of real-world attack conditions, EchoFake could serve as a useful benchmark resource for developing more robust anti-spoofing systems, directly addressing the documented gap between lab performance and practical replay threats.
major comments (2)
- Abstract: the claim that models trained on EchoFake achieve lower average EERs across datasets is presented without any numerical EER values, the identity or number of evaluation datasets, or statistical significance tests, which are required to support the central generalization result.
- Data collection description: the physical replay recordings are stated to have been gathered 'under varied devices and real-world environmental settings,' yet no quantitative characterization (e.g., acoustic feature overlap, channel impulse response statistics, or cross-corpus EER comparisons) is supplied to establish that these recordings capture the distortions encountered in deployment scenarios such as telephone-band filtering or fraud-call noise profiles.
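One simple quantitative check in the spirit of the second comment is to measure how much of a recording's energy falls inside the narrowband telephone range. The sketch below is illustrative and is not something the paper reports. It assumes the conventional 300 to 3400 Hz voice band as the definition of "telephone band", uses a plain discrete Fourier transform so that it needs only the standard library, and is intended for short frames. All function names are hypothetical.

```python
import cmath
import math


def band_energy_ratio(samples, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Fraction of non-DC spectral energy between low_hz and high_hz.

    Uses a naive O(n^2) DFT, so keep frames short (a few hundred samples).
    """
    n = len(samples)
    in_band = 0.0
    total = 0.0
    for k in range(1, n // 2 + 1):  # skip the DC bin
        freq = k * sample_rate / n
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)
        )
        power = abs(coeff) ** 2
        total += power
        if low_hz <= freq <= high_hz:
            in_band += power
    return in_band / total if total > 0 else 0.0


def sine(freq_hz, sample_rate, n):
    """Helper for the usage example: a pure tone of n samples."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


# Usage: a 1 kHz tone lies inside the band, a 6 kHz tone outside it.
inside = band_energy_ratio(sine(1000.0, 16000, 160), 16000)
outside = band_energy_ratio(sine(6000.0, 16000, 160), 16000)
```

A ratio close to 1 for a replay recording would suggest a band-limited channel, whereas wideband recordings keep noticeable energy above 3400 Hz. This is a coarse indicator only and does not replace impulse response measurements.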
minor comments (1)
- Abstract: the reported drop in average accuracy to 59.6% on replayed audio should specify the exact evaluation set, the baseline models, and whether speaker or device overlap was controlled.
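The significance testing requested in the first major comment could take several forms. One common option is a paired bootstrap over evaluation trials, sketched here under stated assumptions: both models score the same utterances, higher scores mean "more likely bona fide", and the interval is a simple percentile interval. The function names and defaults are hypothetical, and the paper does not state which test, if any, it uses.

```python
import random


def equal_error_rate(bonafide, spoof):
    """Approximate EER by scanning every observed score as a threshold."""
    best_gap, best_eer = None, None
    for t in sorted(set(bonafide) | set(spoof)):
        frr = sum(s < t for s in bonafide) / len(bonafide)
        far = sum(s >= t for s in spoof) / len(spoof)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


def bootstrap_eer_gap(labels, scores_a, scores_b, n_resamples=500, seed=0):
    """95% percentile interval for EER(model A) - EER(model B).

    labels[i] is True for bona fide trial i; scores_a and scores_b are the
    two models' scores for the same trials, in the same order.
    """
    rng = random.Random(seed)
    n = len(labels)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        bona = [i for i in idx if labels[i]]
        spoof = [i for i in idx if not labels[i]]
        if not bona or not spoof:
            continue  # a resample without both classes has no defined EER
        eer_a = equal_error_rate([scores_a[i] for i in bona],
                                 [scores_a[i] for i in spoof])
        eer_b = equal_error_rate([scores_b[i] for i in bona],
                                 [scores_b[i] for i in spoof])
        gaps.append(eer_a - eer_b)
    if not gaps:
        raise ValueError("no resample contained both classes")
    gaps.sort()
    lower = gaps[int(0.025 * len(gaps))]
    upper = gaps[min(len(gaps) - 1, int(0.975 * len(gaps)))]
    return lower, upper
```

If the whole interval lies below zero, model A has the lower EER on this evaluation set at roughly the 95% level. An interval that straddles zero means the observed difference could be sampling noise.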
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have addressed each major comment point by point below and will revise the manuscript to incorporate additional details that strengthen the presentation of our results and dataset characterization.
- Referee: Abstract: the claim that models trained on EchoFake achieve lower average EERs across datasets is presented without any numerical EER values, the identity or number of evaluation datasets, or statistical significance tests, which are required to support the central generalization result.
  Authors: We agree that the abstract would be strengthened by including more specific supporting details for the generalization claim. In the revised manuscript, we will update the abstract to report representative numerical EER values from our cross-dataset experiments, specify the identity and number of evaluation datasets, and note the consistent improvements across the three baseline models evaluated. Full statistical details and any significance testing will remain in the main experimental section, with a brief reference added to the abstract where space allows.
  Revision: yes
- Referee: Data collection description: the physical replay recordings are stated to have been gathered 'under varied devices and real-world environmental settings,' yet no quantitative characterization (e.g., acoustic feature overlap, channel impulse response statistics, or cross-corpus EER comparisons) is supplied to establish that these recordings capture the distortions encountered in deployment scenarios such as telephone-band filtering or fraud-call noise profiles.
  Authors: We acknowledge the value of quantitative characterization to better demonstrate representativeness for real-world conditions. In the revised manuscript, we will expand the data collection section to include quantitative metrics such as acoustic feature distributions and overlap statistics, channel impulse response characterizations for the devices and environments used, and additional cross-corpus EER comparisons illustrating performance on replayed audio relative to other datasets. These additions will directly address relevance to scenarios like telephone fraud.
  Revision: yes
Circularity Check
No significant circularity; dataset paper rests on new data collection
The paper introduces EchoFake as a new dataset of >120 hours of TTS and physical replay audio collected under varied devices and environments, then reports that models trained on it yield lower average EERs on cross-dataset evaluation. No equations, fitted parameters, or self-citation chains are used to derive any prediction or uniqueness result. The central claim is an empirical observation from training and testing on the collected data, which is externally falsifiable and does not reduce to a self-definition or prior self-citation by construction. This is the expected non-finding for a dataset contribution paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Machine learning classifiers can learn acoustic differences between real and synthetic speech even after replay distortions.
Forward citations
Cited by 1 Pith paper
- AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan. AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.
Reference graph
Works this paper leans on
- [1] EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
  INTRODUCTION. The recent advances in zero-shot text-to-speech (TTS) and large-scale audio language models (ALM) have dramatically lowered the barrier to generating high-quality synthetic speech. With only a few seconds of reference audio, these models can convincingly clone a speaker’s voice, producing speech that is perceptually indistinguishable from...
- [2] RELATED WORK. 2.1. Audio Deepfake Detection. Audio deepfake detection aims to distinguish bona fide utterances from spoofed or synthesized ones. Existing approaches can be broadly categorized into pipeline-based and end-to-end paradigms. 1) Pipeline-based detectors [3, 4, 5] typically adopt a two-stage strategy: handcrafted or pretrained features, such as m...
- [3] “#utt” denotes the number of utterances, “#spk
  ECHOFAKE DATASET. To tackle the mismatch between lab-generated synthetic datasets and real-world spoofing scenarios involving replay attacks, we construct EchoFake, a new dataset that broadens the scope of audio deepfake detection beyond synthetic samples alone. EchoFake consists of four subsets: training, development, closed-set evaluation, and open-set...
- [4] EXPERIMENTS. 4.1. Experimental Setup. We evaluate the proposed dataset using three representative baseline systems: RawNet2, AASIST, and Wav2Vec2. For RawNet2, the model is trained for 100 epochs with a batch size of 64, an initial learning rate of 10^-4. For AASIST, the model is trained for 60 epochs with a batch size of 32, an initial learning rate of 10^-4....
- [5] CONCLUSION. In this work, we present EchoFake, a novel and comprehensive dataset designed to advance ADD system development under realistic conditions. By integrating both zero-shot TTS speech and diverse physical replay recordings, EchoFake captures spoofing patterns overlooked in existing datasets. Evaluations on EchoFake reveal that current models suf...
- [6] Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, et al., “ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection,” arXiv preprint arXiv:2109.00537, 2021.
- [7] Nicolas M. Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, and Konstantin Böttinger, “Does audio deepfake detection generalize?,” arXiv preprint arXiv:2203.16263, 2022.
- [8] Galina Lavrentyeva, Sergey Novoselov, Andzhukaev Tseren, Marina Volkova, Artem Gorlanov, and Alexandr Kozlov, “STC antispoofing systems for the ASVspoof2019 challenge,” arXiv preprint arXiv:1904.05576, 2019.
- [9] Xin Wang and Junichi Yamagishi, “A comparative study on recent neural spoofing countermeasures for synthetic speech detection,” in Interspeech, 2021.
- [10] Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, and Junichi Yamagishi, “Attention back-end for automatic speaker verification with multiple enrollment utterances,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6717–6721.
- [11] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
- [12] Guang Hua, Andrew Beng Jin Teoh, and Haijian Zhang, “Towards end-to-end synthetic speech detection,” IEEE Signal Processing Letters, vol. 28, pp. 1265–1269, 2021.
- [13] Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, and Nicholas Evans, “AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6367–6371.
- [14] Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher, “End-to-end anti-spoofing with RawNet2,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6369–6373.
- [15] Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, et al., “ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, pp. 101114, 2020.
- [16] Jian Yi, Shuai Bai, et al., “ADD 2022: The first audio deep synthesis detection challenge,” in Proc. ICASSP, 2022.
- [17] Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, et al., “ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,” arXiv preprint arXiv:2408.08739, 2024.
- [18] Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, et al., “XTTS: a massively multilingual zero-shot text-to-speech model,” arXiv preprint arXiv:2406.04904, 2024.
- [19] Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” arXiv preprint arXiv:2410.06885, 2024.
- [20] Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, and Furu Wei, “SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022, pp. 2294–2308.
- [21] Zhen Ye and Xinfa Zhu et al., “Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,” arXiv, 2025. Note: Introduces single-layer VQ codec + Transformer architecture.
- [22] Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, and Yijin Xing, “Fish-Speech: Leveraging large language models for advanced multilingual text-to-speech synthesis,” arXiv preprint arXiv:2411.01156, 2024.
- [23] Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, and Nima Mesgarani, “StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” in Advances in Neural Information Processing Systems (NeurIPS), 2023. Note: Style diffusion + SLM discriminator.
- [24] Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, et al., “CosyVoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024. Note: Introduces FSQ and chunk-aware flow matching for real-time streaming.
- [25] Wei Deng et al., “IndexTTS: An industrial-level controllable and efficient zero-shot text-to-speech system,” arXiv preprint arXiv:2502.05512, 2025. Note: Improves XTTS + Tortoise with conformer encoder and BigVGAN2.
- [26] Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Zhizheng Wu, et al., “MaskGCT: Zero-shot text-to-speech with masked generative codec transformer,” in ICLR, 2025. Note: Fully non-autoregressive, mask-and-predict modeling with SSL tokens.
- [27] Zengyi Qin, Wenliang Zhao, Xumin Yu, and Xin Sun, “OpenVoice: Versatile instant voice cloning,” arXiv preprint arXiv:2312.01479, 2023.
- [28] Hao-Han Guo, Yao Hu, Kun Liu, Fei-Yu Shen, Xu Tang, Yi-Chen Wu, Feng-Long Xie, Kun Xie, and Kai-Tuo Xu, “FireRedTTS: A foundation text-to-speech framework for industry-level generative speech applications,” arXiv preprint arXiv:2409.03283, 2024.