
arxiv: 2605.20830 · v1 · pith:UJQ5BD55new · submitted 2026-05-20 · 📡 eess.AS

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

Pith reviewed 2026-05-21 02:28 UTC · model grok-4.3

classification 📡 eess.AS
keywords text-to-speech · open datasets · diffusion transformer · speech synthesis · data filtering · TTS benchmarks · robust evaluation · public speech data

The pith

Aggregating and filtering public speech data enables TTS models that match those trained on millions of hours of proprietary data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that public English speech corpora totaling 615K hours can be aggregated into Raon-OpenTTS-Pool and then filtered via a model-based pipeline into a 510K-hour high-quality Core subset. Training diffusion transformer models up to 1B parameters on this Core yields performance comparable to leading closed-data models on word error rate, speaker similarity, and robustness across clean, noisy, and expressive conditions. A sympathetic reader would care because the result points to a path for high-quality, reproducible text-to-speech that does not depend on restricted private datasets.

Core claim

By creating Raon-OpenTTS-Pool from publicly available corpora and web recordings, applying model-based filtering to obtain Raon-OpenTTS-Core, and training DiT-based models on it, the work shows that Raon-OpenTTS-1B achieves a word error rate of 1.78% and speaker similarity of 0.749 on Seed-TTS-Eval while ranking first on both metrics for CV3-Hard-EN, matching the performance of models trained on several million hours of proprietary data.
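For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between an ASR transcript of the synthesized audio and the target text, divided by the number of target words. The following is a minimal sketch of that standard definition. The function name and the lowercase, whitespace-based tokenization are assumptions for illustration; benchmark toolkits usually apply fuller text normalization, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

On this definition, the reported 1.78% corresponds to fewer than two word errors per hundred reference words.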

What carries the argument

The model-based filtering pipeline that derives the high-quality Raon-OpenTTS-Core subset from the aggregated Raon-OpenTTS-Pool of 615K hours of public speech data.
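This page does not describe the filter's internals, so the following is only a generic sketch of what a model-based, threshold-style filter can look like. Every name in it is hypothetical: the `Segment` fields, the two scores, and both threshold values are assumptions for illustration and are not taken from the paper or its released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    segment_id: str
    duration_s: float
    quality_score: float   # hypothetical model-predicted quality, higher is better
    transcript_wer: float  # hypothetical ASR-vs-transcript mismatch, lower is better


def filter_pool(pool, min_quality=3.0, max_wer=0.2):
    """Keep only segments that pass both (hypothetical) thresholds."""
    return [
        s for s in pool
        if s.quality_score >= min_quality and s.transcript_wer <= max_wer
    ]


def total_hours(segments):
    """Sum segment durations and convert seconds to hours."""
    return sum(s.duration_s for s in segments) / 3600.0


pool = [
    Segment("a", 9.0, 3.8, 0.05),
    Segment("b", 12.0, 2.1, 0.02),  # fails the quality threshold
    Segment("c", 7.5, 3.4, 0.45),   # fails the transcript threshold
    Segment("d", 10.5, 4.1, 0.00),
]
core = filter_pool(pool)
```

The design question the referee raises later is exactly about the pieces left abstract here: which models produce the scores and where the thresholds sit.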

If this is right

  • Open-weight TTS models can reach competitive levels of naturalness and accuracy without access to proprietary speech corpora.
  • Releasing the data pool, filtering pipeline, training code, and checkpoints enables full reproducibility of the results.
  • The new Raon-OpenTTS-Eval benchmark supports structured testing of TTS robustness in clean, noisy, in-the-wild, and expressive conditions.
  • Scaling DiT-based TTS models to 1B parameters with filtered public data produces strong results on both word error rate and speaker similarity.
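Speaker similarity in TTS evaluation is commonly computed as the cosine similarity between a speaker embedding of the reference prompt and one of the synthesized audio. The sketch below shows only that final step, under the assumption that embeddings are already available as plain lists of floats; the embedding model itself is not specified on this page and is outside the sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real speaker embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Values closer to 1 indicate that the synthesized voice sits closer to the reference speaker in embedding space.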

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same aggregation and filtering approach could be tested on public corpora in languages other than English to create comparable open datasets.
  • Prioritizing filtered data quality over raw volume may prove useful for training other generative audio models beyond TTS.
  • Community extensions of the released resources could test whether further scaling or mixing with smaller additional datasets improves results.
  • The work leaves open whether unfiltered larger public pools could achieve similar performance if training compute is increased accordingly.

Load-bearing premise

The model-based filtering pipeline selects a high-quality subset from public data that maintains diversity and avoids introducing biases that harm TTS performance.

What would settle it

Reproducing the training of Raon-OpenTTS-1B on the released Core dataset and obtaining results substantially worse than the reported values (higher word error rate or lower speaker similarity), or substantially worse than those of the compared proprietary-data models, would indicate the claim does not hold.

Original abstract

Recent advances in text-to-speech (TTS) models show impressive speech naturalness and quality, yet the role of large-scale open data in driving this progress remains underexplored. In this work, we introduce Raon-OpenTTS, an open TTS model that performs competitively with state-of-the-art closed-data TTS models, and Raon-OpenTTS-Pool, a large-scale open dataset for reproducible TTS training. Raon-OpenTTS-Pool consists of 615K hours of 240M speech segments aggregated from publicly available English speech corpora and web-sourced recordings. With a model-based filtering pipeline applied to Raon-OpenTTS-Pool, we derive Raon-OpenTTS-Core, a curated, high-quality subset of 510K hours and 194M speech segments. Using Raon-OpenTTS-Core, we train Raon-OpenTTS, a series of diffusion transformer (DiT)-based TTS models from 0.3B to 1B parameters. On multiple benchmarks, Raon-OpenTTS-1B shows comparable performance to state-of-the-art models such as Qwen3-TTS and CosyVoice 3, which are trained on several million hours of proprietary speech data. Notably, on Seed-TTS-Eval, Raon-OpenTTS-1B achieves a word error rate (WER) of 1.78% and a speaker similarity (SIM) of 0.749, ranking second on WER and first on SIM among recent open-weight TTS baselines. On CV3-Hard-EN, Raon-OpenTTS-1B achieves a WER of 6.15% and a SIM of 0.775, ranking first on both metrics. Furthermore, to support robust evaluation, we introduce Raon-OpenTTS-Eval, a structured benchmark for assessing TTS robustness across diverse acoustic conditions including clean, noisy, in-the-wild, and expressive speech. On Raon-OpenTTS-Eval, Raon-OpenTTS-1B achieves the best average WER and SIM among all evaluated models, and the second-best human preference, as measured by comparative mean opinion score (CMOS). Our data pool, filtering pipeline, training code, and checkpoints are publicly available at https://github.com/krafton-ai/RAON-OpenTTS.
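The dataset figures quoted in the abstract allow a quick back-of-the-envelope check of how much the filter removes and how long a typical segment is. The sketch below uses only the rounded numbers stated above, so the derived values are approximate; the variable and function names are illustrative.

```python
# Rounded figures as stated in the abstract.
POOL_HOURS, POOL_SEGMENTS = 615_000, 240_000_000
CORE_HOURS, CORE_SEGMENTS = 510_000, 194_000_000


def mean_segment_seconds(hours, segments):
    """Average segment duration in seconds."""
    return hours * 3600 / segments


hours_retained = CORE_HOURS / POOL_HOURS            # fraction of hours kept
segments_retained = CORE_SEGMENTS / POOL_SEGMENTS   # fraction of segments kept
pool_mean_s = mean_segment_seconds(POOL_HOURS, POOL_SEGMENTS)
core_mean_s = mean_segment_seconds(CORE_HOURS, CORE_SEGMENTS)

print(f"hours retained:    {hours_retained:.1%}")
print(f"segments retained: {segments_retained:.1%}")
print(f"mean segment: pool {pool_mean_s:.2f} s, core {core_mean_s:.2f} s")
```

With these inputs the filter keeps about 83% of the hours and about 81% of the segments, and the mean segment length moves from roughly 9.2 s in the Pool to roughly 9.5 s in the Core. Given the rounding in the source figures, this only suggests that the removed segments were somewhat shorter on average.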

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Raon-OpenTTS-Pool, a 615K-hour aggregation of public English speech data, from which a model-based filtering pipeline yields the 510K-hour Raon-OpenTTS-Core subset. Diffusion transformer TTS models (0.3B–1B parameters) are trained on this core set. The 1B model is reported to perform comparably to closed-data systems (Qwen3-TTS, CosyVoice 3) on Seed-TTS-Eval (WER 1.78%, SIM 0.749) and CV3-Hard-EN (WER 6.15%, SIM 0.775), with an additional robustness benchmark (Raon-OpenTTS-Eval) showing the best average WER and SIM and second-best CMOS. The data pool, filtering pipeline, training code, and checkpoints are released.

Significance. If the filtering step produces an unbiased high-quality subset that preserves diversity, the result would demonstrate that large-scale open data can reach parity with proprietary-scale training, a meaningful step toward reproducible TTS research. The explicit public release of the 615K-hour data pool, the filtering pipeline, training code, and model checkpoints is a concrete strength that directly supports follow-on work.

major comments (2)
  1. [Dataset curation] Dataset curation section: The model-based filtering pipeline that reduces Raon-OpenTTS-Pool (615K hours) to Raon-OpenTTS-Core (510K hours) is described only at the level of “model-based pipeline.” No architecture for the filter model, decision thresholds, or post-filter statistics (speaker-embedding entropy, SNR distribution, accent coverage, prosodic variance) are supplied. Because the headline comparability claims rest on the assumption that this subset remains diverse and unbiased, the omission is load-bearing for the central “open data suffices” narrative.
  2. [Training and evaluation] Training and evaluation sections: Hyperparameters for the 1B model (learning rate schedule, batch size, total steps, regularization) are not reported, nor are statistical tests or confidence intervals on the WER and SIM numbers. Without these, it is impossible to verify that the reported parity with Qwen3-TTS and CosyVoice 3 is robust rather than an artifact of a single run or favorable test conditions.
minor comments (1)
  1. [Abstract] Abstract and §4: The claim that comparison models were trained on “several million hours” would benefit from explicit citations or references to the known data scales of Qwen3-TTS and CosyVoice 3.
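One simple way to quantify the diversity concern in the first major comment is to compare the Shannon entropy of a categorical attribute (for example, accent labels) before and after filtering. The sketch below is illustrative only: the label lists are made-up toy data, and the choice of accent labels as the attribute is an assumption, not something reported by the paper.

```python
import math
from collections import Counter


def shannon_entropy_bits(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must be non-empty")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy data: the hypothetical filter drops half of the "uk" and "in" segments.
pool_accents = ["us"] * 4 + ["uk"] * 2 + ["in"] * 2
core_accents = ["us"] * 4 + ["uk"] * 1 + ["in"] * 1

pool_entropy = shannon_entropy_bits(pool_accents)
core_entropy = shannon_entropy_bits(core_accents)
entropy_drop = pool_entropy - core_entropy
```

A noticeable entropy drop after filtering would be one signal that the filter is concentrating the data on a narrower set of speakers or accents.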

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for acknowledging the significance of our open data and model releases. We address each major comment below and will revise the manuscript to enhance transparency and reproducibility.

Point-by-point responses
  1. Referee: [Dataset curation] Dataset curation section: The model-based filtering pipeline that reduces Raon-OpenTTS-Pool (615K hours) to Raon-OpenTTS-Core (510K hours) is described only at the level of “model-based pipeline.” No architecture for the filter model, decision thresholds, or post-filter statistics (speaker-embedding entropy, SNR distribution, accent coverage, prosodic variance) are supplied. Because the headline comparability claims rest on the assumption that this subset remains diverse and unbiased, the omission is load-bearing for the central “open data suffices” narrative.

    Authors: We agree that additional detail on the filtering pipeline would strengthen the manuscript. The current description is intentionally high-level, with the full implementation (including model architecture, thresholds, and processing steps) provided in the publicly released code. In the revision we will expand the Dataset Curation section with a concise description of the filter model, the decision thresholds employed, and quantitative post-filter statistics covering speaker-embedding diversity, SNR distribution, accent coverage, and prosodic variance. These additions will be drawn directly from the released pipeline and our internal analysis logs, preserving the original results while improving transparency. revision: yes

  2. Referee: [Training and evaluation] Training and evaluation sections: Hyperparameters for the 1B model (learning rate schedule, batch size, total steps, regularization) are not reported, nor are statistical tests or confidence intervals on the WER and SIM numbers. Without these, it is impossible to verify that the reported parity with Qwen3-TTS and CosyVoice 3 is robust rather than an artifact of a single run or favorable test conditions.

    Authors: We acknowledge that explicit hyperparameter reporting and statistical analysis would allow readers to better assess robustness. In the revised manuscript we will add a dedicated training-details subsection (or table) listing the learning-rate schedule, batch size, total steps, optimizer settings, and regularization techniques used for the 1B model. We will also report results from multiple independent runs together with confidence intervals and the results of appropriate statistical tests comparing Raon-OpenTTS-1B against the closed-data baselines. These additions will be included without changing the reported point estimates. revision: yes
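As an illustration of the kind of interval the second exchange asks for, the sketch below computes a percentile bootstrap confidence interval for corpus-level WER by resampling utterances with replacement. It is a generic technique, not the authors' stated procedure; the function name, the default resample count, and the toy inputs are assumptions. It captures test-set sampling uncertainty only, not the run-to-run training variance that multiple independent runs would address.

```python
import random


def bootstrap_wer_ci(errors, ref_lengths, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for corpus WER = sum(errors) / sum(ref_lengths)."""
    if len(errors) != len(ref_lengths) or not errors:
        raise ValueError("inputs must be non-empty and the same length")
    if any(n <= 0 for n in ref_lengths):
        raise ValueError("reference lengths must be positive")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(errors)
    stats = []
    for _ in range(n_resamples):
        # Resample utterance indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        total_errors = sum(errors[i] for i in idx)
        total_words = sum(ref_lengths[i] for i in idx)
        stats.append(total_errors / total_words)
    stats.sort()
    lower_idx = max(0, int((alpha / 2) * n_resamples))
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return stats[lower_idx], stats[upper_idx]


# Toy per-utterance word error counts and reference word counts.
toy_errors = [0, 1, 0, 2, 0, 0, 1, 0]
toy_lengths = [12, 15, 9, 20, 11, 14, 10, 13]
ci_low, ci_high = bootstrap_wer_ci(toy_errors, toy_lengths)
```

Comparing two systems would more naturally use a paired variant, resampling the same utterance indices for both and bootstrapping the difference in WER.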

Circularity Check

0 steps flagged

No circularity: empirical results from external benchmarks

full rationale

The paper aggregates public corpora into Raon-OpenTTS-Pool, applies a model-based filter to obtain Raon-OpenTTS-Core, trains DiT-based models, and reports WER/SIM/CMOS scores on two external benchmarks (Seed-TTS-Eval, CV3-Hard-EN) and on Raon-OpenTTS-Eval, which the paper itself introduces. These metrics are measured directly on held-out test data and compared to other published models; no quantity is defined in terms of itself, no prediction is a fitted parameter renamed, and no load-bearing premise reduces to a self-citation chain. The derivation chain is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that public corpora can be aggregated and filtered to match proprietary data quality, plus standard assumptions about diffusion model training for speech.

axioms (1)
  • domain assumption Diffusion transformer architectures are suitable for high-quality text-to-speech synthesis.
    The paper adopts DiT-based TTS without deriving its suitability from first principles.

pith-pipeline@v0.9.0 · 6023 in / 1150 out tokens · 37178 ms · 2026-05-21T02:28:42.792602+00:00 · methodology

