pith. sign in

arxiv: 2606.11681 · v2 · pith:2AG3QPC4new · submitted 2026-06-10 · 💻 cs.CL · cs.SD

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

Pith reviewed 2026-06-27 10:05 UTC · model grok-4.3

classification 💻 cs.CL cs.SD
keywords multilingual TTStext encoderRomanizationspeech token predictionUR-BERTgrapheme-to-phonemezero-shot generalization
0
0 comments X

The pith

UR-BERT builds TTS text encoders for 495 languages by unifying scripts through Romanization and adding speech token prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UR-BERT as a text encoder for massively multilingual TTS. It replaces limited grapheme-to-phoneme converters with a shared Romanization step that handles 495 languages and adds a speech token prediction task to learn better phonetic alignments during training. Experiments show that TTS systems using this encoder outperform recent baselines across many languages and data conditions while also working on languages never seen in training. Traditional approaches stop at roughly 100 languages because they depend on per-language phonetic resources that do not exist everywhere. A reader would care because the method removes a major barrier to building speech output for far more languages and speakers.

Core claim

UR-BERT processes Romanized versions of text from 495 languages and is trained with an added objective that predicts speech tokens; TTS systems built on this encoder deliver higher quality than prior text encoders across resource levels and generalize to unseen languages.

What carries the argument

Universal Romanization of input text paired with a speech token prediction training objective that produces speech-aware phonetic representations.

If this is right

  • TTS development becomes possible for languages that lack any G2P resources or tools.
  • Performance gains hold across both high-resource and low-resource language settings.
  • The shared representation supports zero-shot use on languages absent from training data.
  • The encoder acquires phonetic knowledge in a data-efficient way through the token prediction task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Romanization-plus-prediction pattern could apply to other speech tasks such as recognition where script diversity is a bottleneck.
  • Further gains might appear if the method is combined with larger pretrained language models for even lower-resource cases.
  • Limits would surface on languages where basic Romanization drops critical features like tone or vowel length.

Load-bearing premise

Romanization of text from any script must keep enough phonetic detail to support accurate pronunciation without large losses across all 495 languages.

What would settle it

A side-by-side pronunciation accuracy test on languages with tones or ambiguous Romanization forms, comparing error rates from UR-BERT TTS against a language-specific G2P baseline.

Figures

Figures reproduced from arXiv: 2606.11681 by Eekgyun Ahn, Hong-Goo Kang, Sangmin Lee, Woongjib Choi.

Figure 1
Figure 1. Figure 1: Overview of the UR-BERT showing pretraining and finetuning stage. 2. Related Work To extend monolingual text embeddings to multilingual TTS encoders, recent work has adopted BERT-style pretraining for text representations. An early effort in this direction is multilin￾gual PLBERT (m-PLBERT), introduced in the StyleTTS2 [24] framework.2 Following the original PL-BERT [23] design, m-PLBERT pretrains the text… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the CTC-based speech-text alignment. Unlike conventional TTS systems that require clean, cu￾rated speech data, our approach leverages large-scale ASR speech–text pairs by injecting speech-derived supervision into the text encoder through three steps: (1) extracting speech rep￾resentations from S3M, (2) aligning them to character-level text using forced alignment, and (3) discretizing the al… view at source ↗
Figure 3
Figure 3. Figure 3: MOS evaluation setup: (a) detailed instructions for quality assessment, and (b) a snapshot of the evaluation interface. A. Details on the Pretraining Dataset Preprocessing. We removed samples whose transcriptions con￾tained digits or parenthetical expressions. Digits may corre￾spond to different pronunciations across languages, leading to token–pronunciation mismatches, while parenthetical content is incon… view at source ↗
read the original abstract

We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes UR-BERT, a text encoder for TTS that unifies 495 languages via shared Romanization (bypassing language-specific G2P) and adds an auxiliary speech-token prediction objective to encourage phonetic representations. It claims that TTS systems built on this encoder consistently outperform recent text-encoder baselines across languages and resource levels while generalizing to unseen languages.

Significance. If the results hold, the work would meaningfully advance scaling of TTS to hundreds of languages by removing dependence on scarce G2P resources; the combination of Romanization and speech-token prediction could offer a practical route to data-efficient multilingual encoders.

major comments (2)
  1. [Abstract] Abstract: the central claim of consistent outperformance and generalization across 495 languages is stated without any quantitative metrics, baseline specifications, dataset sizes, or error breakdowns, leaving the empirical support for the headline result invisible even at the summary level.
  2. [Experiments] The universality claim rests on Romanization preserving sufficient phonetic contrast; no described ablation isolates pronunciation error rates on languages whose orthographies encode features absent from standard Romanization (tones, gemination, vowel harmony), which is load-bearing for whether performance gains are driven by the method or by easier languages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which highlight opportunities to strengthen the presentation of our empirical results and to more explicitly address potential limitations of Romanization. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of consistent outperformance and generalization across 495 languages is stated without any quantitative metrics, baseline specifications, dataset sizes, or error breakdowns, leaving the empirical support for the headline result invisible even at the summary level.

    Authors: We agree that the abstract would be more informative with concrete quantitative support. In the revised manuscript we will expand the abstract to report key metrics, including the number of languages evaluated, average relative improvements over the strongest baselines (with specific baseline names), dataset scale, and a brief note on generalization results to unseen languages. revision: yes

  2. Referee: [Experiments] The universality claim rests on Romanization preserving sufficient phonetic contrast; no described ablation isolates pronunciation error rates on languages whose orthographies encode features absent from standard Romanization (tones, gemination, vowel harmony), which is load-bearing for whether performance gains are driven by the method or by easier languages.

    Authors: This observation is correct and points to a genuine gap in our current analysis. Our experiments report aggregate TTS performance across the full set of 495 languages (including many with tonal and other non-Roman features) and show strong zero-shot generalization, but we do not provide a targeted ablation that isolates pronunciation error rates on the subset of languages whose orthographies contain features poorly captured by standard Romanization. We will add an explicit limitations paragraph acknowledging this and will include a qualitative discussion of language subsets; however, we do not have the per-feature pronunciation error annotations required for a quantitative ablation at this time. revision: partial

Circularity Check

0 steps flagged

No circularity detected; claims rest on external empirical comparisons

full rationale

The paper presents UR-BERT as a methodological proposal (Romanization unification plus auxiliary speech-token objective) whose value is asserted via direct experimental outperformance against recent text-encoder baselines on 495 languages and unseen-language generalization. No equations, fitted parameters, or self-citations are shown that would make any reported prediction equivalent to its own inputs by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Romanization maintains phonetic utility across 495 languages; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption Romanization can be applied uniformly to 495 languages while maintaining phonetic utility for TTS.
    This premise enables the scaling claim from ~100 to 495 languages.

pith-pipeline@v0.9.1-grok · 5671 in / 1059 out tokens · 31699 ms · 2026-06-27T10:05:05.036888+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 1 canonical work pages

  1. [1]

    Introduction Neural text-to-speech (TTS) systems have achieved substantial progress across languages and speaking styles. Most recent ap- proaches adopt encoder–decoder architectures, in which the en- coder produces linguistic representations that are transformed into acoustic features or speech waveforms by a decoder. While decoder models have advanced r...

  2. [2]

    Related Work To extend monolingual text embeddings to multilingual TTS encoders, recent work has adopted BERT-style pretraining for text representations. An early effort in this direction is multilin- gual PLBERT (m-PLBERT), introduced in the StyleTTS2 [24] framework.2 Following the original PL-BERT [23] design, m-PLBERT pretrains the text encoder on phon...

  3. [3]

    Architecture Overview The key distinctions of the proposed UR-BERT lie in its lan- guage scalability and training objectives

    Proposed Method 3.1. Architecture Overview The key distinctions of the proposed UR-BERT lie in its lan- guage scalability and training objectives. UR-BERT adopts Romanization as a unified text representation, enabling scal- able modeling across diverse writing systems without reliance on G2P systems. It is pretrained on speech–text paired data spanning 49...

  4. [4]

    Experiments 4.1. Pretraining We construct the pretraining corpus by combining three ASR datasets: FLEURS [43], which spans 102 read-speech lan- guages; Common V oice [44], a crowdsourced dataset covering 131 languages; and the Omnilingual ASR corpus [37], which 3https://huggingface.co/facebook/omniASR-W2V-300M Table 1:Comparison of baselines and UR-BERT. ...

  5. [5]

    Performance on High-Resource Languages Table 2 presents TTS evaluation results on high-resource lan- guages

    Results 5.1. Performance on High-Resource Languages Table 2 presents TTS evaluation results on high-resource lan- guages. While VITS achieves strong baseline performance in these settings, incorporating UR-BERT consistently improves both subjective and objective metrics across all evaluated lan- guages. In contrast, m-PLBERT exhibits notable performance d...

  6. [6]

    Conclusion In this paper, we propose UR-BERT, a multilingual and multi- modal pretrained text encoder for text-to-speech applications. By adopting Romanization as a unified text representation, UR-BERT overcomes the language coverage limitations of conventional G2P pipelines and enables scalable pretraining across 495 languages. To further enhance phoneti...

  7. [7]

    No generative AI tools were used in the development of research ideas, analytical procedures, or the creation of any substantive scientific content

    Generative AI Use Disclosure All co-authors attest that generative AI tools were employed exclusively to refine human-authored text and to support LaTeX formatting of the manuscript, including tables and figures. No generative AI tools were used in the development of research ideas, analytical procedures, or the creation of any substantive scientific cont...

  8. [8]

    Acknowledgement This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government Ministry of Science and ICT (MSIT) (RS-2026-25468664)

  9. [9]

    Tacotron: Towards end-to-end speech synthesis,

    Y . Wanget al., “Tacotron: Towards end-to-end speech synthesis,” inProc. INTERSPEECH, 2017, pp. 4006–4010

  10. [10]

    FastSpeech 2: Fast and high-quality end-to-end text to speech,

    Y . Renet al., “FastSpeech 2: Fast and high-quality end-to-end text to speech,” inProceedings of the International Conference on Learning Representations, 2021

  11. [11]

    Glow-TTS: A generative flow for text-to-speech via monotonic alignment search,

    J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-TTS: A generative flow for text-to-speech via monotonic alignment search,” inAd- vances in Neural Information Processing Systems, vol. 33, 2020, pp. 8067–8077

  12. [12]

    Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,

    J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” inPro- ceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 5530–5540

  13. [13]

    Grad-TTS: A diffusion probabilistic model for text-to- speech,

    V . Popov, I. V ovk, V . Gogoryan, T. Sadekova, and M. Kudi- nov, “Grad-TTS: A diffusion probabilistic model for text-to- speech,” inProceedings of the International Conference on Ma- chine Learning. PMLR, 2021, pp. 8599–8608

  14. [14]

    Diff- TTS: A denoising diffusion model for text-to-speech,

    M. Jeong, H. Kim, S. J. Cheon, B. J. Choi, and N. S. Kim, “Diff- TTS: A denoising diffusion model for text-to-speech,” inProc. INTERSPEECH, 2021, pp. 3605–3609

  15. [15]

    Matcha-TTS: A fast tts architecture with conditional flow match- ing,

    S. Mehta, R. Tu, J. Beskow, ´E. Sz ´ekely, and G. E. Henter, “Matcha-TTS: A fast tts architecture with conditional flow match- ing,” inProceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11 341–11 345

  16. [16]

    F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

    Y . Chenet al., “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” inProceedings of the Annual Meet- ing of the Association for Computational Linguistics (ACL), 2025, pp. 6255–6271

  17. [17]

    Neural codec language models are zero-shot text to speech synthesizers,

    C. Wanget al., “Neural codec language models are zero-shot text to speech synthesizers,”arXiv preprint arXiv:2301.02111, 2023

  18. [18]

    Speak, read and prompt: High-fidelity text- to-speech with minimal supervision,

    E. Kharitonovet al., “Speak, read and prompt: High-fidelity text- to-speech with minimal supervision,”Transactions of the Associa- tion for Computational Linguistics, vol. 11, pp. 1703–1718, 2023

  19. [19]

    BERT: Pre- training of deep bidirectional transformers for language under- standing,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- training of deep bidirectional transformers for language under- standing,” inProceedings of the Conference of the North Amer- ican Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 4171–4186

  20. [20]

    ALBERT: A lite bert for self-supervised learning of language representations,

    Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Sori- cut, “ALBERT: A lite bert for self-supervised learning of language representations,” inProceedings of the International Conference on Learning Representations, 2020

  21. [21]

    A robustly optimized BERT pre-training approach with post-training,

    L. Zhuang, L. Wayne, S. Ya, and Z. Jun, “A robustly optimized BERT pre-training approach with post-training,” inProceedings of the 20th Chinese National Conference on Computational Lin- guistics. Huhhot, China: Chinese Information Processing Soci- ety of China, Aug. 2021, pp. 1218–1227

  22. [22]

    wav2vec 2.0: A framework for self-supervised learning of speech representa- tions,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representa- tions,” inAdvances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 12 449–12 460

  23. [23]

    HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhut- dinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 29, pp. 3451–3460, 2021

  24. [24]

    WavLM: Large-scale self- supervised pre-training for full stack speech processing,

    S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiaoet al., “WavLM: Large-scale self- supervised pre-training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  25. [25]

    Pre-trained text embeddings for enhanced text-to- speech synthesis,

    T. Hayashi, S. Watanabe, T. Toda, K. Takeda, S. Toshniwal, and K. Livescu, “Pre-trained text embeddings for enhanced text-to- speech synthesis,” inProc. INTERSPEECH, 2019, pp. 4430– 4434

  26. [26]

    Improving the prosody of RNN-based English text-to-speech synthesis by incorporating a BERT model,

    T. Kenter, M. Sharma, and R. Clark, “Improving the prosody of RNN-based English text-to-speech synthesis by incorporating a BERT model,” inProc. INTERSPEECH, 2020, pp. 4412–4416

  27. [27]

    Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS,

    Y . Xiao, L. He, H. Ming, and F. K. Soong, “Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS,” inProceedings of the IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6704–6708

  28. [28]

    Im- proving prosody modelling with cross-utterance bert embeddings for end-to-end speech synthesis,

    G. Xu, W. Song, Z. Zhang, C. Zhang, X. He, and B. Zhou, “Im- proving prosody modelling with cross-utterance bert embeddings for end-to-end speech synthesis,” inProceedings of the IEEE In- ternational Conference on Acoustics, Speech and Signal Process- ing (ICASSP), 2021, pp. 6079–6083

  29. [29]

    PnG BERT: Aug- mented BERT on phonemes and graphemes for neural TTS,

    Y . Jia, H. Zen, J. Shen, Y . Zhang, and Y . Wu, “PnG BERT: Aug- mented BERT on phonemes and graphemes for neural TTS,” in Proc. INTERSPEECH, 2021, pp. 151–155

  30. [30]

    Mixed-Phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech,

    G. Zhanget al., “Mixed-Phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech,” inProc. INTERSPEECH, 2022, pp. 456–460

  31. [31]

    Phoneme-level BERT for enhanced prosody of text-to-speech with grapheme pre- dictions,

    Y . A. Li, C. Han, X. Jiang, and N. Mesgarani, “Phoneme-level BERT for enhanced prosody of text-to-speech with grapheme pre- dictions,” inProceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1– 5

  32. [32]

    StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,

    Y . A. Li, C. Han, V . Raghavan, G. Mischler, and N. Mesgarani, “StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” inAdvances in Neural Information Processing Systems, vol. 36, 2023, pp. 19 594–19 621

  33. [33]

    XPhoneBERT: A pre-trained multilingual model for phoneme representations for text-to-speech,

    L. The Nguyen, T. Pham, and D. Q. Nguyen, “XPhoneBERT: A pre-trained multilingual model for phoneme representations for text-to-speech,” inProc. INTERSPEECH, 2023, pp. 5506–5510

  34. [34]

    Phonemizer: Text to phones transcrip- tion for multiple languages in python,

    M. Bernard and H. Titeux, “Phonemizer: Text to phones transcrip- tion for multiple languages in python,”Journal of Open Source Software, vol. 6, no. 68, p. 3958, 2021

  35. [35]

    ByT5 model for massively multilingual grapheme-to-phoneme conversion,

    J. Zhu, C. Zhang, and D. Jurgens, “ByT5 model for massively multilingual grapheme-to-phoneme conversion,” inProc. INTER- SPEECH, 2022, pp. 446–450

  36. [36]

    Attention is all you need,

    A. Vaswaniet al., “Attention is all you need,” inAdvances in Neu- ral Information Processing Systems, vol. 30, 2017

  37. [37]

    Out-of-the-box universal Romanization tool uroman,

    U. Hermjakob, J. May, and K. Knight, “Out-of-the-box universal Romanization tool uroman,” inProceedings of the Annual Meet- ing of the Association for Computational Linguistics (ACL), Sys- tem Demonstrations, 2018, pp. 13–18

  38. [38]

    XTTS: a massively multilingual zero-shot text-to-speech model,

    E. Casanovaet al., “XTTS: a massively multilingual zero-shot text-to-speech model,” inProc. INTERSPEECH, 2024, pp. 4978– 4982

  39. [39]

    Scaling speech technology to 1,000+ languages,

    V . Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandiet al., “Scaling speech technology to 1,000+ languages,”Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024

  40. [40]

    LAMA-UT: Language agnostic multilingual ASR through orthography unification and language-specific transliteration,

    S. Lee, W. Chung, and H.-G. Kang, “LAMA-UT: Language agnostic multilingual ASR through orthography unification and language-specific transliteration,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 23, 2025, pp. 24 393–24 401

  41. [41]

    Cambridge University Press, 1999

    International Phonetic Association,Handbook of the Interna- tional Phonetic Association: A guide to the use of the Interna- tional Phonetic Alphabet. Cambridge University Press, 1999

  42. [42]

    Unsupervised cross-lingual representation learning for speech recognition,

    A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” inProc. INTERSPEECH, 2021, pp. 2426–2430

  43. [43]

    XLS-R: Self-supervised cross-lingual speech rep- resentation learning at scale,

    A. Babuet al., “XLS-R: Self-supervised cross-lingual speech rep- resentation learning at scale,” inProc. INTERSPEECH, 2022, pp. 2278–2282

  44. [44]

    Towards robust speech representation learning for thousands of languages,

    W. Chenet al., “Towards robust speech representation learning for thousands of languages,” inProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 10 205–10 224

  45. [45]

    Omnilingual ASR: Open-source multilin- gual speech recognition for 1600+ languages,

    G. Kerenet al., “Omnilingual ASR: Open-source multilin- gual speech recognition for 1600+ languages,”arXiv preprint arXiv:2511.09690, 2025

  46. [46]

    Layer-wise analysis of a self-supervised speech representation model,

    A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in2021 IEEE Auto- matic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 914–921

  47. [47]

    Comparative layer-wise analy- sis of self-supervised speech models,

    A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analy- sis of self-supervised speech models,” inProceedings of the IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), 2023, pp. 1–5

  48. [48]

    SELM: Speech enhancement using discrete to- kens and language models,

    Z. Wanget al., “SELM: Speech enhancement using discrete to- kens and language models,” inProceedings of the IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11 561–11 565

  49. [49]

    Differentiable K-means for fully-optimized discrete token-based asr,

    K. Onda, Y . Kashiwagi, E. Tsunoo, H. Futami, and S. Watanabe, “Differentiable K-means for fully-optimized discrete token-based asr,” inProc. INTERSPEECH, 2025, pp. 1223–1227

  50. [50]

    Geometric constraints on human speech sound inventories,

    E. Dunbar and E. Dupoux, “Geometric constraints on human speech sound inventories,”Frontiers in Psychology, vol. 7, p. 1061, 2016

  51. [51]

    FLEURS: Few-shot learning evaluation of universal representations of speech,

    A. Conneauet al., “FLEURS: Few-shot learning evaluation of universal representations of speech,” inProceedings of the IEEE Spoken Language Technology Workshop (SLT), 2022, pp. 798– 805

  52. [52]

    Common V oice: A massively-multilingual speech corpus,

    R. Ardilaet al., “Common V oice: A massively-multilingual speech corpus,” inProceedings of the Language Resources and Evaluation Conference (LREC), 2020, pp. 4218–4222

  53. [53]

    Decoupled weight decay regulariza- tion,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regulariza- tion,” inProceedings of the International Conference on Learning Representations, 2019

  54. [54]

    data2vec: A general framework for self-supervised learning in speech, vision and language,

    A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “data2vec: A general framework for self-supervised learning in speech, vision and language,” inProceedings of the International Conference on Machine Learning, vol. 162, 2022, pp. 1298–1312

  55. [55]

    The LJ Speech Dataset,

    K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito. com/LJ-Speech-Dataset/, 2017

  56. [56]

    11354328

    T. M ¨uller and D. Kreutz, “Thorsten-V oice Dataset 2022.10,” Nov. 2022. [Online]. Available: https://doi.org/10.5281/zenodo. 7265581

  57. [57]

    AISHELL-3: A multi- speaker Mandarin TTS corpus and the baselines,

    Y . Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A multi- speaker Mandarin TTS corpus and the baselines,”arXiv preprint arXiv:2010.11567, 2020

  58. [58]

    A step-by-step process for building TTS voices using open source data and framework for Bangla, Ja- vanese, Khmer, Nepali, Sinhala, and Sundanese,

    K. Sodimanaet al., “A step-by-step process for building TTS voices using open source data and framework for Bangla, Ja- vanese, Khmer, Nepali, Sinhala, and Sundanese,” inProceedings of the International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), 2018, pp. 66–70

  59. [59]

    High-quality sinhalese multi-speaker TTS corpus,

    Google, Inc., “High-quality sinhalese multi-speaker TTS corpus,” https://www.openslr.org/30/, 2016

  60. [60]

    Rapid development of TTS corpora for four South African languages,

    D. van Niekerket al., “Rapid development of TTS corpora for four South African languages,” inProc. INTERSPEECH, 2017, pp. 2178–2182

  61. [61]

    The V oiceMOS Challenge 2024: Beyond speech quality prediction,

    W.-C. Huanget al., “The V oiceMOS Challenge 2024: Beyond speech quality prediction,” inProceedings of the IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 803–810

  62. [62]

    The T05 system for the V oiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,

    K. Baba, W. Nakata, Y . Saito, and H. Saruwatari, “The T05 system for the V oiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,” inProceedings of the IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 818–824. Appendix (a) Instructions provided to the participants. (b)...