EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

Tong Zhang; Yanzhen Ren; Yihuan Huang

arxiv: 2510.19414 · v2 · submitted 2025-10-22 · 📡 eess.AS · cs.AI · cs.SD

Pith reviewed 2026-05-18 04:58 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.SD

keywords speech deepfake detection · replay attacks · anti-spoofing · generalization · physical replay · TTS dataset · equal error rate

The pith

EchoFake dataset adds physical replay recordings to train speech deepfake detectors with better real-world generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current detectors degrade sharply when synthetic speech is played back through ordinary devices in ordinary rooms: models trained on existing datasets average only 59.6 percent accuracy on replayed audio. The paper creates EchoFake by generating modern zero-shot text-to-speech fakes and then capturing those fakes as they are replayed through many different loudspeakers and microphones under varied acoustic conditions, yielding more than 120 hours from over 13,000 speakers. When detection models are trained on this mixture instead of prior datasets, they record lower average equal error rates on multiple test collections. A sympathetic reader would see this as evidence that including realistic replay material during training narrows the gap between lab performance and practical deployment against low-cost replay attacks.

Core claim

The paper establishes that a dataset containing both cutting-edge synthetic speech and its physical replays collected across devices and environments produces detection models that generalize more effectively, measured by reduced average equal error rates when the same models are evaluated on other datasets.

What carries the argument

EchoFake dataset of paired TTS speech and physical replay recordings under varied real-world device and environmental conditions.

If this is right

Models trained on EchoFake exhibit lower average equal error rates across multiple evaluation datasets than models trained on existing collections.
The approach directly targets the observed accuracy drop to 59.6 percent when prior models face replayed audio.
The dataset supplies a more realistic training foundation for methods intended to operate against replay attacks in practical settings.
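
Because the claims above are stated in terms of equal error rate (EER), a minimal sketch of how that metric is commonly computed may help. It is not the authors' evaluation code: the function name `compute_eer`, the label convention (1 = bona fide, 0 = spoof), and the assumption that higher scores mean "more likely bona fide" are all illustrative choices made here.

```python
import numpy as np


def compute_eer(scores, labels):
    """Return (eer, threshold) from a simple threshold sweep.

    Assumed convention (not from the paper): labels are 1 for bona fide and
    0 for spoof, and a trial is accepted when score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bona = scores[labels == 1]
    spoof = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # False acceptance rate: fraction of spoofed trials that are accepted.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # False rejection rate: fraction of bona fide trials that are rejected.
    frr = np.array([(bona < t).mean() for t in thresholds])

    # The EER is read off where the two error rates are closest.
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0), float(thresholds[i])
```

A lower EER means the detector can reach a better balance between accepting spoofed audio and rejecting genuine audio, which is why the paper's "lower average EER" is read as better generalization.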

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Systems protecting telephone or voice-authentication services could incorporate similar replay material to reduce successful low-cost impersonation attempts.
Extending the collection to additional acoustic environments or device types would further test whether the generalization benefit scales.
Detection pipelines might combine EchoFake-style replay data with other attack vectors such as voice conversion to cover a wider range of real threats.

Load-bearing premise

The collected replay recordings from chosen devices and settings stand in for the replay attacks that detection systems will actually encounter once deployed.

What would settle it

Train models on EchoFake, then test them on a new collection of replayed audio made with entirely different devices, rooms, and speakers; if the equal error rate improvement disappears, the claimed generalization gain is refuted.
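
The comparison described above can be sketched as a small scoring harness. This is a hedged illustration, not the paper's protocol: the helper names (`mean_eer`, `generalization_gain`), the dictionary layout, and the label convention (1 = bona fide, 0 = spoof, higher score = more bona fide) are assumptions introduced here, and the detector scores would have to come from real models evaluated on the new replay collection.

```python
import numpy as np


def compute_eer(scores, labels):
    """EER from a threshold sweep (assumed: 1 = bona fide, 0 = spoof)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bona, spoof = scores[labels == 1], scores[labels == 0]
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)
    far = np.array([(spoof >= t).mean() for t in thresholds])  # spoof accepted
    frr = np.array([(bona < t).mean() for t in thresholds])    # bona fide rejected
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)


def mean_eer(per_dataset):
    """per_dataset maps a test-set name to (scores, labels)."""
    eers = {name: compute_eer(s, y) for name, (s, y) in per_dataset.items()}
    return eers, float(np.mean(list(eers.values())))


def generalization_gain(baseline, candidate):
    """Average EER of the baseline minus that of the candidate.

    Both arguments hold scores on the same held-out test sets. A positive
    value means the candidate (e.g. a model trained on EchoFake) generalizes
    better; a value near zero on unseen devices and rooms would count
    against the claimed gain.
    """
    _, base_avg = mean_eer(baseline)
    _, cand_avg = mean_eer(candidate)
    return base_avg - cand_avg
```

In practice one would pass several held-out collections per model and inspect the per-dataset EERs returned by `mean_eer` as well, since an average can hide a regression on a single test set.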

Original abstract

The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EchoFake adds a sizable replay-augmented dataset that targets a real practical gap in speech deepfake detection, but the collection and results details stay thin.

The punchline is that this paper releases EchoFake, a dataset of more than 120 hours from over 13,000 speakers that mixes zero-shot TTS deepfakes with physical replay recordings gathered under varied devices and real-world settings. The authors note that models trained on prior datasets drop to 59.6 percent accuracy on replayed audio and report that training on EchoFake yields lower average EERs across other test sets, pointing to better generalization for practical threats like telephone fraud.

Referee Report

2 major / 1 minor

Summary. The paper introduces EchoFake, a dataset exceeding 120 hours of audio from over 13,000 speakers that includes both zero-shot TTS-generated speech and physical replay recordings collected with varied devices and real-world environmental settings. It reports that models trained on prior datasets suffer severe degradation (average accuracy dropping to 59.6%) when evaluated on replayed audio, while models trained on EchoFake exhibit lower average EERs across datasets, indicating improved generalization for practical speech deepfake detection in scenarios such as telephone fraud.

Significance. If the physical replay component is shown to be representative of real-world attack conditions, EchoFake could serve as a useful benchmark resource for developing more robust anti-spoofing systems, directly addressing the documented gap between lab performance and practical replay threats.

major comments (2)

Abstract: the claim that models trained on EchoFake achieve lower average EERs across datasets is presented without any numerical EER values, the identity or number of evaluation datasets, or statistical significance tests, which are required to support the central generalization result.
Data collection description: the physical replay recordings are stated to have been gathered 'under varied devices and real-world environmental settings,' yet no quantitative characterization (e.g., acoustic feature overlap, channel impulse response statistics, or cross-corpus EER comparisons) is supplied to establish that these recordings capture the distortions encountered in deployment scenarios such as telephone-band filtering or fraud-call noise profiles.

minor comments (1)

Abstract: the reported drop to 59.6% average accuracy on replayed audio should specify the exact evaluation set, the baseline models, and whether speaker or device overlap was controlled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have addressed each major comment point by point below and will revise the manuscript to incorporate additional details that strengthen the presentation of our results and dataset characterization.

Referee: Abstract: the claim that models trained on EchoFake achieve lower average EERs across datasets is presented without any numerical EER values, the identity or number of evaluation datasets, or statistical significance tests, which are required to support the central generalization result.

Authors: We agree that the abstract would be strengthened by including more specific supporting details for the generalization claim. In the revised manuscript, we will update the abstract to report representative numerical EER values from our cross-dataset experiments, specify the identity and number of evaluation datasets, and note the consistent improvements across the three baseline models evaluated. Full statistical details and any significance testing will remain in the main experimental section, with a brief reference added to the abstract where space allows. revision: yes
Referee: Data collection description: the physical replay recordings are stated to have been gathered 'under varied devices and real-world environmental settings,' yet no quantitative characterization (e.g., acoustic feature overlap, channel impulse response statistics, or cross-corpus EER comparisons) is supplied to establish that these recordings capture the distortions encountered in deployment scenarios such as telephone-band filtering or fraud-call noise profiles.

Authors: We acknowledge the value of quantitative characterization to better demonstrate representativeness for real-world conditions. In the revised manuscript, we will expand the data collection section to include quantitative metrics such as acoustic feature distributions and overlap statistics, channel impulse response characterizations for the devices and environments used, and additional cross-corpus EER comparisons illustrating performance on replayed audio relative to other datasets. These additions will directly address relevance to scenarios like telephone fraud. revision: yes

Circularity Check

0 steps flagged

No significant circularity; dataset paper rests on new data collection

The paper introduces EchoFake as a new dataset of >120 hours of TTS and physical replay audio collected under varied devices and environments, then reports that models trained on it yield lower average EERs on cross-dataset evaluation. No equations, fitted parameters, or self-citation chains are used to derive any prediction or uniqueness result. The central claim is an empirical observation from training and testing on the collected data, which is externally falsifiable and does not reduce to a self-definition or prior self-citation by construction. This is the expected non-finding for a dataset contribution paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard assumptions from audio machine learning rather than introducing new free parameters, axioms, or invented entities.

axioms (1)

domain assumption: Machine learning classifiers can learn acoustic differences between real and synthetic speech even after replay distortions.
Implicit in the baseline model evaluations and generalization claims.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan
cs.SD 2026-04 unverdicted novelty 3.0

AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

INTRODUCTION The recent advances in zero-shot text-to-speech (TTS) and large-scale audio language models (ALM) have dramatically lowered the barrier to generating high-quality synthetic speech. With only a few seconds of reference audio, these models can convincingly clone a speaker’s voice, producing speech that is perceptually indistinguishable from...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Audio Deepfake Detection Audio deepfake detection aims to distinguish bona fide utterances from spoofed or synthesized ones

RELATED WORK 2.1. Audio Deepfake Detection Audio deepfake detection aims to distinguish bona fide utterances from spoofed or synthesized ones. Existing approaches can be broadly categorized into pipeline-based and end-to-end paradigms. 1) Pipeline-based detectors [3, 4, 5] typically adopt a two-stage strategy: handcrafted or pretrained features, such as m...

work page 2019
[3]

#utt” denotes the number of utterances, “#spk

ECHOFAKE DATASET To tackle the mismatch between lab-generated synthetic datasets and real-world spoofing scenarios involving replay attacks, we construct EchoFake, a new dataset that broadens the scope of audio deepfake detection beyond synthetic samples alone. EchoFake consists of four subsets: training, development, closed-set evaluation, and open-set...

work page 2019
[4]

Experimental Setup We evaluate the proposed dataset using three representative baseline systems: RawNet2, AASIST, and Wav2Vec2

EXPERIMENTS 4.1. Experimental Setup We evaluate the proposed dataset using three representative baseline systems: RawNet2, AASIST, and Wav2Vec2. For RawNet2, the model is trained for 100 epochs with a batch size of 64 and an initial learning rate of 10^-4. For AASIST, the model is trained for 60 epochs with a batch size of 32 and an initial learning rate of 10^-4....

work page
[5]

By integrating both zero-shot TTS speech and diverse physical replay recordings, EchoFake captures spoofing patterns overlooked in existing datasets

CONCLUSION In this work, we present EchoFake, a novel and comprehensive dataset designed to advance ADD system development under realistic conditions. By integrating both zero-shot TTS speech and diverse physical replay recordings, EchoFake captures spoofing patterns overlooked in existing datasets. Evaluations on EchoFake reveal that current models suf...

work page
[6]

ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection,

Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, et al., “ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection,” arXiv preprint arXiv:2109.00537, 2021

work page arXiv 2021
[7]

Does audio deepfake detection generalize?,

Nicolas M. Müller, Pavel Czempin, Franziska Dieckmann, Adam Froghyar, and Konstantin Böttinger, “Does audio deepfake detection generalize?,” arXiv preprint arXiv:2203.16263, 2022

work page arXiv 2022
[8]

STC Antispoofing Systems for the ASVspoof2019 Challenge

Galina Lavrentyeva, Sergey Novoselov, Andzhukaev Tseren, Marina Volkova, Artem Gorlanov, and Alexandr Kozlov, “STC antispoofing systems for the ASVspoof2019 challenge,” arXiv preprint arXiv:1904.05576, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[9]

A comparative study on recent neural spoofing countermeasures for synthetic speech detection,

Xin Wang and Junichi Yamagishi, “A comparative study on recent neural spoofing countermeasures for synthetic speech detection,” in Interspeech, 2021

work page 2021
[10]

Attention back-end for automatic speaker verification with multiple enrollment utterances,

Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, and Junichi Yamagishi, “Attention back-end for automatic speaker verification with multiple enrollment utterances,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6717–6721

work page 2022
[11]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020

work page 2020
[12]

Towards end-to-end synthetic speech detection,

Guang Hua, Andrew Beng Jin Teoh, and Haijian Zhang, “Towards end-to-end synthetic speech detection,” IEEE Signal Processing Letters, vol. 28, pp. 1265–1269, 2021

work page 2021
[13]

AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks,

Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, and Nicholas Evans, “AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6367–6371

work page 2022
[14]

End-to-end anti-spoofing with RawNet2,

Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher, “End-to-end anti-spoofing with RawNet2,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6369–6373

work page 2021
[15]

ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,

Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, et al., “ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, pp. 101114, 2020

work page 2019
[16]

ADD 2022: The first audio deep synthesis detection challenge,

Jian Yi, Shuai Bai, et al., “ADD 2022: The first audio deep synthesis detection challenge,” in Proc. ICASSP, 2022

work page 2022
[17]

ASVspoof 5: Crowdsourced speech data at scale,

Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, et al., “ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,” arXiv preprint arXiv:2408.08739, 2024

work page arXiv 2024
[18]

XTTS: a massively multilingual zero-shot text-to-speech model,

Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, et al., “XTTS: a massively multilingual zero-shot text-to-speech model,” arXiv preprint arXiv:2406.04904, 2024

work page arXiv 2024
[19]

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” arXiv preprint arXiv:2410.06885, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,

Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, and Furu Wei, “SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022, pp. 2294–2308

work page 2022
[21]

Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,

Zhen Ye, Xinfa Zhu, et al., “Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,” arXiv, 2025. Introduces single-layer VQ codec + Transformer architecture

work page 2025
[22]

Fish-Speech: Leveraging large language models for advanced multilingual text-to-speech synthesis,

Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, and Yijin Xing, “Fish-Speech: Leveraging large language models for advanced multilingual text-to-speech synthesis,” arXiv preprint arXiv:2411.01156, 2024

work page arXiv 2024
[23]

StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, and Nima Mesgarani, “StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” in Advances in Neural Information Processing Systems (NeurIPS), 2023. Style diffusion + SLM discriminator

work page 2023
[24]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, et al., “CosyVoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024. Introduces FSQ and chunk-aware flow matching for real-time streaming

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Wei Deng et al., “IndexTTS: An industrial-level controllable and efficient zero-shot text-to-speech,” arXiv preprint arXiv:2502.05512, 2025. Improves XTTS + Tortoise with conformer encoder and BigVGAN2

work page arXiv 2025
[26]

MaskGCT: Zero-shot text-to-speech with masked generative Codec Transformer,

Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Zhizheng Wu, et al., “MaskGCT: Zero-shot text-to-speech with masked generative Codec Transformer,” in ICLR, 2025. Fully non-autoregressive, mask-and-predict modeling with SSL tokens

work page 2025
[27]

OpenVoice: Versatile instant voice cloning,

Zengyi Qin, Wenliang Zhao, Xumin Yu, and Xin Sun, “OpenVoice: Versatile instant voice cloning,” arXiv preprint arXiv:2312.01479, 2023

work page arXiv 2023
[28]

FireRedTTS: A foundation text-to-speech framework for industry-level generative speech applications

Hao-Han Guo, Yao Hu, Kun Liu, Fei-Yu Shen, Xu Tang, Yi-Chen Wu, Feng-Long Xie, Kun Xie, and Kai-Tuo Xu, “FireRedTTS: A foundation text-to-speech framework for industry-level generative speech applications,” arXiv preprint arXiv:2409.03283, 2024

work page arXiv 2024
