Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
Pith reviewed 2026-05-21 02:28 UTC · model grok-4.3
The pith
Aggregating and filtering public speech data enables TTS models that match those trained on millions of hours of proprietary data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The work builds Raon-OpenTTS-Pool from publicly available corpora and web recordings, applies model-based filtering to obtain Raon-OpenTTS-Core, and trains DiT-based models on it. The resulting Raon-OpenTTS-1B achieves a word error rate of 1.78% and a speaker similarity of 0.749 on Seed-TTS-Eval and ranks first on both metrics on CV3-Hard-EN, performing comparably to models trained on several million hours of proprietary data.
What carries the argument
The model-based filtering pipeline that derives the high-quality Raon-OpenTTS-Core subset from the aggregated Raon-OpenTTS-Pool of 615K hours of public speech data.
If this is right
- Open-weight TTS models can reach competitive levels of naturalness and accuracy without access to proprietary speech corpora.
- Releasing the data pool, filtering pipeline, training code, and checkpoints enables full reproducibility of the results.
- The new Raon-OpenTTS-Eval benchmark supports structured testing of TTS robustness in clean, noisy, in-the-wild, and expressive conditions.
- Scaling DiT-based TTS models to 1B parameters with filtered public data produces strong results on both word error rate and speaker similarity.
Where Pith is reading between the lines
- The same aggregation and filtering approach could be tested on public corpora in languages other than English to create comparable open datasets.
- Prioritizing filtered data quality over raw volume may prove useful for training other generative audio models beyond TTS.
- Community extensions of the released resources could test whether further scaling or mixing with smaller additional datasets improves results.
- The work leaves open whether unfiltered larger public pools could achieve similar performance if training compute is increased accordingly.
Load-bearing premise
The model-based filtering pipeline selects a high-quality subset from public data that maintains diversity and avoids introducing biases that harm TTS performance.
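To make the premise concrete, here is a minimal sketch of what a threshold-based segment filter could look like. The paper's actual filter model and thresholds are not described on this page, so the `Segment` fields, the `quality` and `asr_wer` scores, and both cutoff values are hypothetical placeholders rather than the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seg_id: str
    hours: float
    quality: float  # hypothetical model-predicted quality score (higher is better)
    asr_wer: float  # hypothetical WER between the transcript and an ASR re-transcription


def keep(seg: Segment, min_quality: float = 3.0, max_wer: float = 0.2) -> bool:
    """Keep a segment only if it passes both (assumed) thresholds."""
    return seg.quality >= min_quality and seg.asr_wer <= max_wer


def filter_pool(pool, **thresholds):
    """Return the kept segments and the fraction of hours retained."""
    core = [s for s in pool if keep(s, **thresholds)]
    pool_hours = sum(s.hours for s in pool)
    core_hours = sum(s.hours for s in core)
    retained = core_hours / pool_hours if pool_hours else 0.0
    return core, retained


# Toy pool: "b" fails the quality cutoff, "c" fails the transcript-agreement cutoff.
pool = [
    Segment("a", 0.002, quality=3.8, asr_wer=0.05),
    Segment("b", 0.003, quality=2.1, asr_wer=0.02),
    Segment("c", 0.001, quality=3.5, asr_wer=0.40),
    Segment("d", 0.004, quality=3.2, asr_wer=0.10),
]
core, retained = filter_pool(pool)
print([s.seg_id for s in core], f"{retained:.0%} of hours retained")
```

Any filter of this shape decides diversity implicitly: whichever speakers, accents, or recording conditions score low under the chosen models are the ones removed, which is why the premise depends on the filter's details.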
What would settle it
Reproducing the training of Raon-OpenTTS-1B on the released Core dataset and obtaining word error rates substantially above, or speaker similarity scores substantially below, the reported values or those of the compared proprietary-data models would indicate the claim does not hold.
Original abstract
Recent advances in text-to-speech (TTS) models show impressive speech naturalness and quality, yet the role of large-scale open data in driving this progress remains underexplored. In this work, we introduce Raon-OpenTTS, an open TTS model that performs competitively with state-of-the-art closed-data TTS models, and Raon-OpenTTS-Pool, a large-scale open dataset for reproducible TTS training. Raon-OpenTTS-Pool consists of 615K hours of 240M speech segments aggregated from publicly available English speech corpora and web-sourced recordings. With a model-based filtering pipeline applied to Raon-OpenTTS-Pool, we derive Raon-OpenTTS-Core, a curated, high-quality subset of 510K hours and 194M speech segments. Using Raon-OpenTTS-Core, we train Raon-OpenTTS, a series of diffusion transformer (DiT)-based TTS models from 0.3B to 1B parameters. On multiple benchmarks, Raon-OpenTTS-1B shows comparable performance to state-of-the-art models such as Qwen3-TTS and CosyVoice 3, which are trained on several million hours of proprietary speech data. Notably, on Seed-TTS-Eval, Raon-OpenTTS-1B achieves a word error rate (WER) of 1.78% and a speaker similarity (SIM) of 0.749, ranking second on WER and first on SIM among recent open-weight TTS baselines. On CV3-Hard-EN, Raon-OpenTTS-1B achieves a WER of 6.15% and a SIM of 0.775, ranking first on both metrics. Furthermore, to support robust evaluation, we introduce Raon-OpenTTS-Eval, a structured benchmark for assessing TTS robustness across diverse acoustic conditions including clean, noisy, in-the-wild, and expressive speech. On Raon-OpenTTS-Eval, Raon-OpenTTS-1B achieves the best average WER and SIM among all evaluated models, and the second-best human preference, as measured by comparative mean opinion score (CMOS). Our data pool, filtering pipeline, training code, and checkpoints are publicly available at https://github.com/krafton-ai/RAON-OpenTTS.
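The two headline metrics in the abstract can be illustrated with a short sketch. WER is the word-level edit distance between a reference text and a transcript of the synthesized audio, divided by the reference length; SIM is commonly computed as the cosine similarity between speaker embeddings of the prompt and the generated speech. The ASR system and speaker encoder used by each benchmark are not specified on this page, so the strings and vectors below are stand-in inputs, and the simple lowercase-and-split normalization is an assumption.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
# Toy 3-dimensional "speaker embeddings".
print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))
```

Because WER is lower-is-better and SIM is higher-is-better, "ranking first on both metrics" means the lowest WER and the highest SIM among the compared systems.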
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Raon-OpenTTS-Pool, a 615K-hour aggregation of public English speech data, from which a model-based filtering pipeline yields the 510K-hour Raon-OpenTTS-Core subset. Diffusion transformer TTS models (0.3B–1B parameters) are trained on this core set. The 1B model is reported to match or exceed closed-data systems (Qwen3-TTS, CosyVoice 3) on Seed-TTS-Eval (WER 1.78%, SIM 0.749) and CV3-Hard-EN (WER 6.15%, SIM 0.775), with an additional robustness benchmark (Raon-OpenTTS-Eval) showing strong average WER/SIM and second-best CMOS. Data, filtering code, training code, and checkpoints are released.
Significance. If the filtering step produces an unbiased high-quality subset that preserves diversity, the result would demonstrate that large-scale open data can reach parity with proprietary-scale training, a meaningful step toward reproducible TTS research. The explicit public release of the 615K-hour pool, the filtering pipeline, training code, and model checkpoints is a concrete strength that directly supports follow-on work.
major comments (2)
- [Dataset curation] Dataset curation section: The model-based filtering pipeline that reduces Raon-OpenTTS-Pool (615K hours) to Raon-OpenTTS-Core (510K hours) is described only at the level of “model-based pipeline.” No architecture for the filter model, decision thresholds, or post-filter statistics (speaker-embedding entropy, SNR distribution, accent coverage, prosodic variance) are supplied. Because the headline comparability claims rest on the assumption that this subset remains diverse and unbiased, the omission is load-bearing for the central “open data suffices” narrative.
- [Training and evaluation] Training and evaluation sections: Hyperparameters for the 1B model (learning rate schedule, batch size, total steps, regularization) are not reported, nor are statistical tests or confidence intervals on the WER and SIM numbers. Without these, it is impossible to verify that the reported parity with Qwen3-TTS and CosyVoice 3 is robust rather than an artifact of a single run or favorable test conditions.
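The post-filter statistics requested in the first major comment can be partly illustrated from the numbers the abstract already reports. The sketch below computes the retention rates and mean segment durations implied by the rounded figures (615K hours and 240M segments before filtering, 510K hours and 194M segments after), and adds a normalized speaker-label entropy as one possible diversity statistic. The entropy function and the toy speaker labels are illustrative assumptions, not something the paper is known to report.

```python
import math
from collections import Counter

# Rounded figures quoted in the abstract.
POOL_HOURS, POOL_SEGMENTS = 615_000, 240_000_000
CORE_HOURS, CORE_SEGMENTS = 510_000, 194_000_000


def mean_segment_seconds(hours: float, segments: int) -> float:
    """Average segment duration in seconds."""
    return hours * 3600 / segments


def normalized_entropy(labels) -> float:
    """Shannon entropy of the label distribution, scaled to [0, 1]."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


print(f"hours retained:      {CORE_HOURS / POOL_HOURS:.1%}")
print(f"segments retained:   {CORE_SEGMENTS / POOL_SEGMENTS:.1%}")
print(f"mean segment (pool): {mean_segment_seconds(POOL_HOURS, POOL_SEGMENTS):.1f} s")
print(f"mean segment (core): {mean_segment_seconds(CORE_HOURS, CORE_SEGMENTS):.1f} s")

# Hypothetical speaker labels before and after filtering.
before = ["s1", "s2", "s3", "s4"] * 5
after = ["s1"] * 14 + ["s2"] * 2 + ["s3"] * 2 + ["s4"] * 2
print(normalized_entropy(before), normalized_entropy(after))
```

On the rounded figures, filtering keeps roughly 83% of the hours and 81% of the segments, so the retained segments are slightly longer on average (about 9.5 s versus 9.2 s). A drop in normalized entropy after filtering, as in the toy labels, would be one signal that the filter concentrates the data on fewer speakers.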
minor comments (1)
- [Abstract] Abstract and §4: The claim that comparison models were trained on “several million hours” would benefit from explicit citations or references to the known data scales of Qwen3-TTS and CosyVoice 3.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for acknowledging the significance of our open data and model releases. We address each major comment below and will revise the manuscript to enhance transparency and reproducibility.
-
Referee: [Dataset curation] Dataset curation section: The model-based filtering pipeline that reduces Raon-OpenTTS-Pool (615K hours) to Raon-OpenTTS-Core (510K hours) is described only at the level of “model-based pipeline.” No architecture for the filter model, decision thresholds, or post-filter statistics (speaker-embedding entropy, SNR distribution, accent coverage, prosodic variance) are supplied. Because the headline comparability claims rest on the assumption that this subset remains diverse and unbiased, the omission is load-bearing for the central “open data suffices” narrative.
Authors: We agree that additional detail on the filtering pipeline would strengthen the manuscript. The current description is intentionally high-level, with the full implementation (including model architecture, thresholds, and processing steps) provided in the publicly released code. In the revision we will expand the Dataset Curation section with a concise description of the filter model, the decision thresholds employed, and quantitative post-filter statistics covering speaker-embedding diversity, SNR distribution, accent coverage, and prosodic variance. These additions will be drawn directly from the released pipeline and our internal analysis logs, preserving the original results while improving transparency. revision: yes
-
Referee: [Training and evaluation] Training and evaluation sections: Hyperparameters for the 1B model (learning rate schedule, batch size, total steps, regularization) are not reported, nor are statistical tests or confidence intervals on the WER and SIM numbers. Without these, it is impossible to verify that the reported parity with Qwen3-TTS and CosyVoice 3 is robust rather than an artifact of a single run or favorable test conditions.
Authors: We acknowledge that explicit hyperparameter reporting and statistical analysis would allow readers to better assess robustness. In the revised manuscript we will add a dedicated training-details subsection (or table) listing the learning-rate schedule, batch size, total steps, optimizer settings, and regularization techniques used for the 1B model. We will also report results from multiple independent runs together with confidence intervals and the results of appropriate statistical tests comparing Raon-OpenTTS-1B against the closed-data baselines. These additions will be included without changing the reported point estimates. revision: yes
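One common way to attach a confidence interval to a WER comparison is a paired bootstrap over test utterances. The sketch below is an illustration under stated assumptions, not the analysis the authors commit to: the per-utterance error counts, the reference lengths, and the function names are all hypothetical.

```python
import random


def corpus_wer(errors, ref_lens, idx) -> float:
    """Corpus-level WER over the utterances selected by idx."""
    return sum(errors[i] for i in idx) / sum(ref_lens[i] for i in idx)


def paired_bootstrap_wer_diff(errs_a, errs_b, ref_lens,
                              n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for WER(A) - WER(B), resampling utterances with replacement."""
    rng = random.Random(seed)
    n = len(ref_lens)
    diffs = []
    for _ in range(n_resamples):
        # The same resampled utterances are used for both systems (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(corpus_wer(errs_a, ref_lens, idx) - corpus_wer(errs_b, ref_lens, idx))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: 200 utterances of 10 reference words each.
gen = random.Random(1)
ref_lens = [10] * 200
errs_a = [gen.choice([0, 0, 0, 1]) for _ in ref_lens]      # system A: fewer word errors
errs_b = [e + gen.choice([0, 1]) for e in errs_a]           # system B: never better than A
lower, upper = paired_bootstrap_wer_diff(errs_a, errs_b, ref_lens)
print(f"95% interval for WER(A) - WER(B): [{lower:.4f}, {upper:.4f}]")
```

An interval that excludes zero suggests the WER gap is not explained by which test utterances happened to be sampled. It does not capture run-to-run training variance, which is why the referee also asks for multiple independent runs.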
Circularity Check
No circularity: empirical results from external benchmarks
The paper aggregates public corpora into Raon-OpenTTS-Pool, applies a model-based filter to obtain Raon-OpenTTS-Core, trains DiT-based models, and reports WER/SIM/CMOS scores on two external benchmarks (Seed-TTS-Eval, CV3-Hard-EN) and on Raon-OpenTTS-Eval, which the paper itself introduces. These metrics are measured directly on held-out test data and compared to other published models; no quantity is defined in terms of itself, no prediction is a fitted parameter renamed, and no load-bearing premise reduces to a self-citation chain. The derivation chain is therefore self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Diffusion transformer architectures are suitable for high-quality text-to-speech synthesis.
A. Model Architecture Details
Table 11 summarizes the architectural configurations of the two model variants: Raon-OpenTTS-0.3B and Raon-OpenTTS-1B. The 0.3B model adopts the original configuration of F5-TTS without modification. For the larger variants, we scale up the model capacity by in...