UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

Eekgyun Ahn; Hong-Goo Kang; Sangmin Lee; Woongjib Choi

arxiv: 2606.11681 · v2 · pith:2AG3QPC4new · submitted 2026-06-10 · 💻 cs.CL · cs.SD

Pith reviewed 2026-06-27 10:05 UTC · model grok-4.3

keywords: multilingual TTS, text encoder, Romanization, speech token prediction, UR-BERT, grapheme-to-phoneme, zero-shot generalization

The pith

UR-BERT builds TTS text encoders for 495 languages by unifying scripts through Romanization and adding speech token prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UR-BERT as a text encoder for massively multilingual TTS. It replaces limited grapheme-to-phoneme converters with a shared Romanization step that handles 495 languages and adds a speech token prediction task to learn better phonetic alignments during training. Experiments show that TTS systems using this encoder outperform recent baselines across many languages and data conditions while also working on languages never seen in training. Traditional approaches stop at roughly 100 languages because they depend on per-language phonetic resources that do not exist everywhere. A reader would care because the method removes a major barrier to building speech output for far more languages and speakers.

Core claim

UR-BERT processes Romanized versions of text from 495 languages and is trained with an added objective that predicts speech tokens; TTS systems built on this encoder deliver higher quality than prior text encoders across resource levels and generalize to unseen languages.

What carries the argument

Universal Romanization of input text paired with a speech token prediction training objective that produces speech-aware phonetic representations.
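To make the speech token prediction objective concrete, here is a minimal sketch of one plausible form of it: a cross-entropy loss between per-character logits from the text encoder and discrete speech-token targets. The function names, the toy vocabulary size, and the choice of one target token per Romanized character are assumptions for illustration; this page does not give the paper's exact loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def speech_token_loss(char_logits, target_tokens):
    """Mean cross-entropy between per-character logits and speech-token ids.

    Assumption: one target speech token per Romanized character.
    """
    assert len(char_logits) == len(target_tokens)
    losses = []
    for logits, target in zip(char_logits, target_tokens):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy example: 3 Romanized characters, hypothetical 4-token speech vocabulary.
logits = [
    [2.0, 0.0, 0.0, 0.0],  # confident and correct
    [0.0, 2.0, 0.0, 0.0],  # confident and correct
    [0.0, 0.0, 0.0, 0.0],  # uniform guess
]
targets = [0, 1, 3]
loss = speech_token_loss(logits, targets)
```

A uniform guess over four tokens costs log(4) per character, so any average below that means the logits carry some information about the speech tokens.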

If this is right

TTS development becomes possible for languages that lack any G2P resources or tools.
Performance gains hold across both high-resource and low-resource language settings.
The shared representation supports zero-shot use on languages absent from training data.
The encoder acquires phonetic knowledge in a data-efficient way through the token prediction task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same Romanization-plus-prediction pattern could apply to other speech tasks such as recognition where script diversity is a bottleneck.
Further gains might appear if the method is combined with larger pretrained language models for even lower-resource cases.
Limits would surface on languages where basic Romanization drops critical features like tone or vowel length.

Load-bearing premise

Romanization of text from any script must keep enough phonetic detail to support accurate pronunciation without large losses across all 495 languages.
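The risk in this premise can be illustrated with a small sketch. It uses naive ASCII folding (Unicode decomposition followed by dropping combining marks), which is an assumed stand-in and not the Romanizer used in the paper, to show how tone-bearing diacritics can collapse distinct syllables into one string.

```python
import unicodedata

def strip_diacritics(text):
    """Naive ASCII folding: decompose (NFD) and drop combining marks.

    This is an illustrative stand-in, not the paper's Romanization tool.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Pinyin for four Mandarin syllables that differ only in tone.
syllables = ["mā", "má", "mǎ", "mà"]
folded = [strip_diacritics(s) for s in syllables]

# Number of distinct inputs that became indistinguishable after folding.
collisions = len(syllables) - len(set(folded))
```

If a Romanization scheme behaves like this folding on tonal input, the encoder has to recover tone from context or from the speech-token supervision rather than from the text itself.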

What would settle it

A side-by-side pronunciation accuracy test on languages with tones or ambiguous Romanization forms, comparing error rates from UR-BERT TTS against a language-specific G2P baseline.
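A comparison like this needs a pronunciation error metric. The sketch below computes a phone error rate from edit distance over phone sequences; the phone labels and the two toy systems are made up for illustration and do not come from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def error_rate(pairs):
    """Total edits divided by total reference length over (ref, hyp) pairs."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    length = sum(len(ref) for ref, _ in pairs)
    return edits / length

# Hypothetical phone sequences with tone digits attached to vowels.
system_a = [(["m", "a1"], ["m", "a1"]), (["m", "a3"], ["m", "a4"])]  # one tone error
system_b = [(["m", "a1"], ["m", "a1"]), (["m", "a3"], ["m", "a3"])]  # no errors

rate_a = error_rate(system_a)
rate_b = error_rate(system_b)
```

Running the same metric on tonal and non-tonal test sets for both systems would show whether any gap is specific to the contrasts that Romanization may drop.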

Figures

Figures reproduced from arXiv: 2606.11681 by Eekgyun Ahn, Hong-Goo Kang, Sangmin Lee, Woongjib Choi.

**Figure 1.** Overview of UR-BERT, showing the pretraining and fine-tuning stages.

**Figure 2.** Illustration of the CTC-based speech-text alignment. Unlike conventional TTS systems that require clean, curated speech data, our approach leverages large-scale ASR speech–text pairs by injecting speech-derived supervision into the text encoder through three steps: (1) extracting speech representations from S3M, (2) aligning them to character-level text using forced alignment, and (3) discretizing the al…
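The three steps in the Figure 2 text can be sketched as follows. The sketch assumes that frame-level features and a frame-to-character alignment are already available, that frames are mean-pooled per character, and that discretization is a nearest-centroid lookup; the text above is truncated before it describes the discretization, so the pooling and quantization choices here are assumptions, and all function names are hypothetical.

```python
def pool_by_character(frames, alignment, num_chars):
    """Average the frame vectors assigned to each character index.

    frames: list of feature vectors (step 1 output, assumed given).
    alignment: one character index per frame (step 2 output, assumed given).
    Characters with no aligned frames map to None.
    """
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(num_chars)]
    counts = [0] * num_chars
    for frame, char_idx in zip(frames, alignment):
        counts[char_idx] += 1
        for d in range(dim):
            sums[char_idx][d] += frame[d]
    return [[s / c for s in vec] if c else None for vec, c in zip(sums, counts)]

def nearest_centroid(vector, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((v - c) ** 2 for v, c in zip(vector, cen)) for cen in centroids]
    return dists.index(min(dists))

def speech_tokens(frames, alignment, num_chars, centroids):
    """Step 3 (assumed form): one discrete token per character."""
    pooled = pool_by_character(frames, alignment, num_chars)
    return [None if vec is None else nearest_centroid(vec, centroids) for vec in pooled]

# Toy data: 4 frames of 2-dim features aligned to 2 characters.
frames = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.9, 1.1]]
alignment = [0, 0, 1, 1]
centroids = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical 2-entry codebook
tokens = speech_tokens(frames, alignment, 2, centroids)
```

The resulting per-character token ids are the kind of target a speech token prediction objective could be trained against.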

**Figure 3.** MOS evaluation setup: (a) detailed instructions for quality assessment, and (b) a snapshot of the evaluation interface. A. Details on the Pretraining Dataset. Preprocessing: We removed samples whose transcriptions contained digits or parenthetical expressions. Digits may correspond to different pronunciations across languages, leading to token–pronunciation mismatches, while parenthetical content is incon…
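The preprocessing rule quoted above (drop samples whose transcriptions contain digits or parenthetical expressions) can be sketched as a simple filter. The regular expressions, the inclusion of full-width parentheses, and the sample strings are assumptions for illustration; the authors' exact matching rules are not given here.

```python
import re

# \d matches Unicode decimal digits in Python 3 string patterns,
# so non-Latin digit characters are caught as well.
_DIGIT = re.compile(r"\d")
# ASCII and full-width parentheses (the full-width pair is an assumption).
_PAREN = re.compile(r"[()（）]")

def keep_sample(transcription):
    """Return True if the transcription has no digits and no parentheses."""
    return not (_DIGIT.search(transcription) or _PAREN.search(transcription))

samples = [
    "hello world",
    "call 911 now",        # contains digits
    "so (laughs) anyway",  # contains a parenthetical
    "नमस्ते",
]
kept = [s for s in samples if keep_sample(s)]
```

Filtering rather than normalizing digits avoids having to decide how each language reads a number aloud, which matches the stated motivation of avoiding token–pronunciation mismatches.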

Original abstract

We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UR-BERT claims to scale TTS encoders to 495 languages via Romanization plus speech token prediction, but the abstract supplies zero numbers so the performance claims stay untested.

The main thing to know is that this paper tries to get past the G2P bottleneck by Romanizing all input scripts into one shared form and training the encoder with an added speech token prediction loss. That combination is what they present as the route to 495 languages.

What stands out as new is the specific use of universal Romanization for TTS text encoders paired with the auxiliary objective to push phonetic awareness in a data-efficient way. The paper does a clear job naming the practical limit of existing G2P resources and showing why a single representation could help coverage.

The soft spot is the complete lack of numbers in the abstract. It states consistent outperformance and generalization to unseen languages but gives no metrics, baselines, language counts, or error breakdowns. Without those, it is impossible to judge whether the method actually works. The stress-test point about Romanization losing tones, vowel length, or aspiration is worth checking directly: if the experiments do not isolate pronunciation errors on languages where those contrasts matter or compare against language-specific G2P where it exists, the scaling result could rest on easier cases rather than true universality.

This is for people building multilingual TTS systems who need to move beyond the current 100-language ceiling. A reader focused on low-resource speech synthesis would get value from the method description and any released data or code, provided the full paper includes the missing quantitative checks. The thinking is straightforward and engages the real constraint in the literature.

I would send it to peer review if the full version contains solid experiments and ablations; the core problem is worth referee time even if the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The paper proposes UR-BERT, a text encoder for TTS that unifies 495 languages via shared Romanization (bypassing language-specific G2P) and adds an auxiliary speech-token prediction objective to encourage phonetic representations. It claims that TTS systems built on this encoder consistently outperform recent text-encoder baselines across languages and resource levels while generalizing to unseen languages.

Significance. If the results hold, the work would meaningfully advance scaling of TTS to hundreds of languages by removing dependence on scarce G2P resources; the combination of Romanization and speech-token prediction could offer a practical route to data-efficient multilingual encoders.

major comments (2)

[Abstract] Abstract: the central claim of consistent outperformance and generalization across 495 languages is stated without any quantitative metrics, baseline specifications, dataset sizes, or error breakdowns, leaving the empirical support for the headline result invisible even at the summary level.
[Experiments] The universality claim rests on Romanization preserving sufficient phonetic contrast; no described ablation isolates pronunciation error rates on languages whose orthographies encode features absent from standard Romanization (tones, gemination, vowel harmony), which is load-bearing for whether performance gains are driven by the method or by easier languages.
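The ablation requested in the second comment amounts to stratifying results by linguistic feature. A minimal sketch, using made-up language names, groups, and error rates, shows the aggregation; none of these numbers come from the paper.

```python
from collections import defaultdict

def mean_by_group(records):
    """Average error rate per feature group.

    records: iterable of (language, group, error_rate) tuples.
    """
    buckets = defaultdict(list)
    for _lang, group, err in records:
        buckets[group].append(err)
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}

# Toy values only; group labels are hypothetical annotations.
records = [
    ("lang_a", "tonal", 0.30),
    ("lang_b", "tonal", 0.20),
    ("lang_c", "non_tonal", 0.10),
    ("lang_d", "non_tonal", 0.10),
]
by_group = mean_by_group(records)
gap = by_group["tonal"] - by_group["non_tonal"]
```

A large gap for the encoder under test, with a smaller gap for a language-specific G2P baseline, would suggest that aggregate gains come mainly from languages where Romanization loses little.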

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which highlight opportunities to strengthen the presentation of our empirical results and to more explicitly address potential limitations of Romanization. We respond to each major comment below and indicate the revisions we will make.

Referee: [Abstract] Abstract: the central claim of consistent outperformance and generalization across 495 languages is stated without any quantitative metrics, baseline specifications, dataset sizes, or error breakdowns, leaving the empirical support for the headline result invisible even at the summary level.

Authors: We agree that the abstract would be more informative with concrete quantitative support. In the revised manuscript we will expand the abstract to report key metrics, including the number of languages evaluated, average relative improvements over the strongest baselines (with specific baseline names), dataset scale, and a brief note on generalization results to unseen languages. revision: yes
Referee: [Experiments] The universality claim rests on Romanization preserving sufficient phonetic contrast; no described ablation isolates pronunciation error rates on languages whose orthographies encode features absent from standard Romanization (tones, gemination, vowel harmony), which is load-bearing for whether performance gains are driven by the method or by easier languages.

Authors: This observation is correct and points to a genuine gap in our current analysis. Our experiments report aggregate TTS performance across the full set of 495 languages (including many with tonal and other non-Roman features) and show strong zero-shot generalization, but we do not provide a targeted ablation that isolates pronunciation error rates on the subset of languages whose orthographies contain features poorly captured by standard Romanization. We will add an explicit limitations paragraph acknowledging this and will include a qualitative discussion of language subsets; however, we do not have the per-feature pronunciation error annotations required for a quantitative ablation at this time. revision: partial

Circularity Check

0 steps flagged

No circularity detected; claims rest on external empirical comparisons

The paper presents UR-BERT as a methodological proposal (Romanization unification plus auxiliary speech-token objective) whose value is asserted via direct experimental outperformance against recent text-encoder baselines on 495 languages and unseen-language generalization. No equations, fitted parameters, or self-citations are shown that would make any reported prediction equivalent to its own inputs by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Romanization maintains phonetic utility across 495 languages; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption Romanization can be applied uniformly to 495 languages while maintaining phonetic utility for TTS.
This premise enables the scaling claim from ~100 to 495 languages.

Reference graph

Works this paper leans on

62 extracted references · 1 canonical work pages

[1]

Introduction. Neural text-to-speech (TTS) systems have achieved substantial progress across languages and speaking styles. Most recent approaches adopt encoder–decoder architectures, in which the encoder produces linguistic representations that are transformed into acoustic features or speech waveforms by a decoder. While decoder models have advanced r...

Pith/arXiv arXiv 2026
[2]

Related Work. To extend monolingual text embeddings to multilingual TTS encoders, recent work has adopted BERT-style pretraining for text representations. An early effort in this direction is multilingual PLBERT (m-PLBERT), introduced in the StyleTTS2 [24] framework. Following the original PL-BERT [23] design, m-PLBERT pretrains the text encoder on phon...
[3]

Architecture Overview. The key distinctions of the proposed UR-BERT lie in its language scalability and training objectives

Proposed Method. 3.1. Architecture Overview. The key distinctions of the proposed UR-BERT lie in its language scalability and training objectives. UR-BERT adopts Romanization as a unified text representation, enabling scalable modeling across diverse writing systems without reliance on G2P systems. It is pretrained on speech–text paired data spanning 49...
[4]

Experiments. 4.1. Pretraining. We construct the pretraining corpus by combining three ASR datasets: FLEURS [43], which spans 102 read-speech languages; Common Voice [44], a crowdsourced dataset covering 131 languages; and the Omnilingual ASR corpus [37], which [footnote 3: https://huggingface.co/facebook/omniASR-W2V-300M] [Table 1: Comparison of baselines and UR-BERT.] ...
[5]

Performance on High-Resource Languages. Table 2 presents TTS evaluation results on high-resource languages

Results. 5.1. Performance on High-Resource Languages. Table 2 presents TTS evaluation results on high-resource languages. While VITS achieves strong baseline performance in these settings, incorporating UR-BERT consistently improves both subjective and objective metrics across all evaluated languages. In contrast, m-PLBERT exhibits notable performance d...
[6]

Conclusion. In this paper, we propose UR-BERT, a multilingual and multimodal pretrained text encoder for text-to-speech applications. By adopting Romanization as a unified text representation, UR-BERT overcomes the language coverage limitations of conventional G2P pipelines and enables scalable pretraining across 495 languages. To further enhance phoneti...
[7]

No generative AI tools were used in the development of research ideas, analytical procedures, or the creation of any substantive scientific content

Generative AI Use Disclosure All co-authors attest that generative AI tools were employed exclusively to refine human-authored text and to support LaTeX formatting of the manuscript, including tables and figures. No generative AI tools were used in the development of research ideas, analytical procedures, or the creation of any substantive scientific cont...
[8]

Acknowledgement This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government Ministry of Science and ICT (MSIT) (RS-2026-25468664)

2026
[9]

Tacotron: Towards end-to-end speech synthesis

Y. Wang et al., “Tacotron: Towards end-to-end speech synthesis,” in Proc. INTERSPEECH, 2017, pp. 4006–4010

2017
[10]

FastSpeech 2: Fast and high-quality end-to-end text to speech

Y. Ren et al., “FastSpeech 2: Fast and high-quality end-to-end text to speech,” in Proceedings of the International Conference on Learning Representations, 2021

2021
[11]

Glow-TTS: A generative flow for text-to-speech via monotonic alignment search

J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-TTS: A generative flow for text-to-speech via monotonic alignment search,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 8067–8077

2020
[12]

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech

J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 5530–5540

2021
[13]

Grad-TTS: A diffusion probabilistic model for text-to-speech

V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, “Grad-TTS: A diffusion probabilistic model for text-to-speech,” in Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 8599–8608

2021
[14]

Diff-TTS: A denoising diffusion model for text-to-speech

M. Jeong, H. Kim, S. J. Cheon, B. J. Choi, and N. S. Kim, “Diff-TTS: A denoising diffusion model for text-to-speech,” in Proc. INTERSPEECH, 2021, pp. 3605–3609

2021
[15]

Matcha-TTS: A fast TTS architecture with conditional flow matching

S. Mehta, R. Tu, J. Beskow, É. Székely, and G. E. Henter, “Matcha-TTS: A fast TTS architecture with conditional flow matching,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11341–11345

2024
[16]

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Y. Chen et al., “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025, pp. 6255–6271

2025
[17]

Neural codec language models are zero-shot text to speech synthesizers

C. Wang et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023

Pith/arXiv arXiv 2023
[18]

Speak, read and prompt: High-fidelity text-to-speech with minimal supervision

E. Kharitonov et al., “Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 1703–1718, 2023

2023
[19]

BERT: Pre-training of deep bidirectional transformers for language understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 4171–4186

2019
[20]

ALBERT: A lite BERT for self-supervised learning of language representations

Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite BERT for self-supervised learning of language representations,” in Proceedings of the International Conference on Learning Representations, 2020

2020
[21]

A robustly optimized BERT pre-training approach with post-training

L. Zhuang, L. Wayne, S. Ya, and Z. Jun, “A robustly optimized BERT pre-training approach with post-training,” in Proceedings of the 20th Chinese National Conference on Computational Linguistics. Huhhot, China: Chinese Information Processing Society of China, Aug. 2021, pp. 1218–1227

2021
[22]

wav2vec 2.0: A framework for self-supervised learning of speech representations

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 12449–12460

2020
[23]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units

W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021

2021
[24]

WavLM: Large-scale self-supervised pre-training for full stack speech processing

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022
[25]

Pre-trained text embeddings for enhanced text-to-speech synthesis

T. Hayashi, S. Watanabe, T. Toda, K. Takeda, S. Toshniwal, and K. Livescu, “Pre-trained text embeddings for enhanced text-to-speech synthesis,” in Proc. INTERSPEECH, 2019, pp. 4430–4434

2019
[26]

Improving the prosody of RNN-based English text-to-speech synthesis by incorporating a BERT model

T. Kenter, M. Sharma, and R. Clark, “Improving the prosody of RNN-based English text-to-speech synthesis by incorporating a BERT model,” in Proc. INTERSPEECH, 2020, pp. 4412–4416

2020
[27]

Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS

Y. Xiao, L. He, H. Ming, and F. K. Soong, “Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6704–6708

2020
[28]

Improving prosody modelling with cross-utterance BERT embeddings for end-to-end speech synthesis

G. Xu, W. Song, Z. Zhang, C. Zhang, X. He, and B. Zhou, “Improving prosody modelling with cross-utterance BERT embeddings for end-to-end speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6079–6083

2021
[29]

PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS

Y. Jia, H. Zen, J. Shen, Y. Zhang, and Y. Wu, “PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS,” in Proc. INTERSPEECH, 2021, pp. 151–155

2021
[30]

Mixed-Phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech

G. Zhang et al., “Mixed-Phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech,” in Proc. INTERSPEECH, 2022, pp. 456–460

2022
[31]

Phoneme-level BERT for enhanced prosody of text-to-speech with grapheme predictions

Y. A. Li, C. Han, X. Jiang, and N. Mesgarani, “Phoneme-level BERT for enhanced prosody of text-to-speech with grapheme predictions,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

2023
[32]

StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models

Y. A. Li, C. Han, V. Raghavan, G. Mischler, and N. Mesgarani, “StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 19594–19621

2023
[33]

XPhoneBERT: A pre-trained multilingual model for phoneme representations for text-to-speech

L. The Nguyen, T. Pham, and D. Q. Nguyen, “XPhoneBERT: A pre-trained multilingual model for phoneme representations for text-to-speech,” in Proc. INTERSPEECH, 2023, pp. 5506–5510

2023
[34]

Phonemizer: Text to phones transcription for multiple languages in Python

M. Bernard and H. Titeux, “Phonemizer: Text to phones transcription for multiple languages in Python,” Journal of Open Source Software, vol. 6, no. 68, p. 3958, 2021

2021
[35]

ByT5 model for massively multilingual grapheme-to-phoneme conversion

J. Zhu, C. Zhang, and D. Jurgens, “ByT5 model for massively multilingual grapheme-to-phoneme conversion,” in Proc. INTERSPEECH, 2022, pp. 446–450

2022
[36]

Attention is all you need

A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017

2017
[37]

Out-of-the-box universal Romanization tool uroman

U. Hermjakob, J. May, and K. Knight, “Out-of-the-box universal Romanization tool uroman,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), System Demonstrations, 2018, pp. 13–18

2018
[38]

XTTS: a massively multilingual zero-shot text-to-speech model

E. Casanova et al., “XTTS: a massively multilingual zero-shot text-to-speech model,” in Proc. INTERSPEECH, 2024, pp. 4978–4982

2024
[39]

Scaling speech technology to 1,000+ languages

V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi et al., “Scaling speech technology to 1,000+ languages,” Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024

2024
[40]

LAMA-UT: Language agnostic multilingual ASR through orthography unification and language-specific transliteration

S. Lee, W. Chung, and H.-G. Kang, “LAMA-UT: Language agnostic multilingual ASR through orthography unification and language-specific transliteration,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 23, 2025, pp. 24393–24401

2025
[41]

Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet

International Phonetic Association, Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press, 1999

1999
[42]

Unsupervised cross-lingual representation learning for speech recognition

A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in Proc. INTERSPEECH, 2021, pp. 2426–2430

2021
[43]

XLS-R: Self-supervised cross-lingual speech representation learning at scale

A. Babu et al., “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” in Proc. INTERSPEECH, 2022, pp. 2278–2282

2022
[44]

Towards robust speech representation learning for thousands of languages

W. Chen et al., “Towards robust speech representation learning for thousands of languages,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 10205–10224

2024
[45]

Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages

G. Keren et al., “Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages,” arXiv preprint arXiv:2511.09690, 2025

arXiv 2025
[46]

Layer-wise analysis of a self-supervised speech representation model

A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 914–921

2021
[47]

Comparative layer-wise analysis of self-supervised speech models

A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analysis of self-supervised speech models,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

2023
[48]

SELM: Speech enhancement using discrete tokens and language models

Z. Wang et al., “SELM: Speech enhancement using discrete tokens and language models,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11561–11565

2024
[49]

Differentiable K-means for fully-optimized discrete token-based ASR

K. Onda, Y. Kashiwagi, E. Tsunoo, H. Futami, and S. Watanabe, “Differentiable K-means for fully-optimized discrete token-based ASR,” in Proc. INTERSPEECH, 2025, pp. 1223–1227

2025
[50]

Geometric constraints on human speech sound inventories

E. Dunbar and E. Dupoux, “Geometric constraints on human speech sound inventories,” Frontiers in Psychology, vol. 7, p. 1061, 2016

2016
[51]

FLEURS: Few-shot learning evaluation of universal representations of speech

A. Conneau et al., “FLEURS: Few-shot learning evaluation of universal representations of speech,” in Proceedings of the IEEE Spoken Language Technology Workshop (SLT), 2022, pp. 798–805

2022
[52]

Common Voice: A massively-multilingual speech corpus

R. Ardila et al., “Common Voice: A massively-multilingual speech corpus,” in Proceedings of the Language Resources and Evaluation Conference (LREC), 2020, pp. 4218–4222

2020
[53]

Decoupled weight decay regularization

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proceedings of the International Conference on Learning Representations, 2019

2019
[54]

data2vec: A general framework for self-supervised learning in speech, vision and language

A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “data2vec: A general framework for self-supervised learning in speech, vision and language,” in Proceedings of the International Conference on Machine Learning, vol. 162, 2022, pp. 1298–1312

2022
[55]

The LJ Speech Dataset

K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017

2017
[56]

Thorsten-Voice Dataset 2022.10

T. Müller and D. Kreutz, “Thorsten-Voice Dataset 2022.10,” Nov. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.7265581

work page doi:10.5281/zenodo.7265581 2022
[57]

AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines

Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines,” arXiv preprint arXiv:2010.11567, 2020

arXiv 2020
[58]

A step-by-step process for building TTS voices using open source data and framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese

K. Sodimana et al., “A step-by-step process for building TTS voices using open source data and framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese,” in Proceedings of the International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), 2018, pp. 66–70

2018
[59]

High-quality Sinhalese multi-speaker TTS corpus

Google, Inc., “High-quality Sinhalese multi-speaker TTS corpus,” https://www.openslr.org/30/, 2016

2016
[60]

Rapid development of TTS corpora for four South African languages

D. van Niekerk et al., “Rapid development of TTS corpora for four South African languages,” in Proc. INTERSPEECH, 2017, pp. 2178–2182

2017
[61]

The VoiceMOS Challenge 2024: Beyond speech quality prediction

W.-C. Huang et al., “The VoiceMOS Challenge 2024: Beyond speech quality prediction,” in Proceedings of the IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 803–810

2024
[62]

The T05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech

K. Baba, W. Nakata, Y. Saito, and H. Saruwatari, “The T05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,” in Proceedings of the IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 818–824

2024

[1] [1]

Introduction Neural text-to-speech (TTS) systems have achieved substantial progress across languages and speaking styles. Most recent ap- proaches adopt encoder–decoder architectures, in which the en- coder produces linguistic representations that are transformed into acoustic features or speech waveforms by a decoder. While decoder models have advanced r...

Pith/arXiv arXiv 2026

[2]

Related Work To extend monolingual text embeddings to multilingual TTS encoders, recent work has adopted BERT-style pretraining for text representations. An early effort in this direction is multilingual PLBERT (m-PLBERT), introduced in the StyleTTS2 [24] framework. Following the original PL-BERT [23] design, m-PLBERT pretrains the text encoder on phon...

[3]

Architecture Overview The key distinctions of the proposed UR-BERT lie in its language scalability and training objectives

Proposed Method 3.1. Architecture Overview The key distinctions of the proposed UR-BERT lie in its language scalability and training objectives. UR-BERT adopts Romanization as a unified text representation, enabling scalable modeling across diverse writing systems without reliance on G2P systems. It is pretrained on speech–text paired data spanning 49...

[4]

Experiments 4.1. Pretraining We construct the pretraining corpus by combining three ASR datasets: FLEURS [43], which spans 102 read-speech languages; Common Voice [44], a crowdsourced dataset covering 131 languages; and the Omnilingual ASR corpus [37], which [Footnote 3: https://huggingface.co/facebook/omniASR-W2V-300M] [Table 1: Comparison of baselines and UR-BERT.] ...

[5]

Performance on High-Resource Languages Table 2 presents TTS evaluation results on high-resource languages

Results 5.1. Performance on High-Resource Languages Table 2 presents TTS evaluation results on high-resource languages. While VITS achieves strong baseline performance in these settings, incorporating UR-BERT consistently improves both subjective and objective metrics across all evaluated languages. In contrast, m-PLBERT exhibits notable performance d...

[6]

Conclusion In this paper, we propose UR-BERT, a multilingual and multimodal pretrained text encoder for text-to-speech applications. By adopting Romanization as a unified text representation, UR-BERT overcomes the language coverage limitations of conventional G2P pipelines and enables scalable pretraining across 495 languages. To further enhance phoneti...

[7]

No generative AI tools were used in the development of research ideas, analytical procedures, or the creation of any substantive scientific content

Generative AI Use Disclosure All co-authors attest that generative AI tools were employed exclusively to refine human-authored text and to support LaTeX formatting of the manuscript, including tables and figures. No generative AI tools were used in the development of research ideas, analytical procedures, or the creation of any substantive scientific cont...

[8]

Acknowledgement This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government Ministry of Science and ICT (MSIT) (RS-2026-25468664)

2026

[9]

Tacotron: Towards end-to-end speech synthesis,

Y. Wang et al., “Tacotron: Towards end-to-end speech synthesis,” in Proc. INTERSPEECH, 2017, pp. 4006–4010

2017

[10]

FastSpeech 2: Fast and high-quality end-to-end text to speech,

Y. Ren et al., “FastSpeech 2: Fast and high-quality end-to-end text to speech,” in Proceedings of the International Conference on Learning Representations, 2021

2021

[11]

Glow-TTS: A generative flow for text-to-speech via monotonic alignment search,

J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-TTS: A generative flow for text-to-speech via monotonic alignment search,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 8067–8077

2020

[12]

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,

J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 5530–5540

2021

[13]

Grad-TTS: A diffusion probabilistic model for text-to-speech,

V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, “Grad-TTS: A diffusion probabilistic model for text-to-speech,” in Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 8599–8608

2021

[14]

Diff-TTS: A denoising diffusion model for text-to-speech,

M. Jeong, H. Kim, S. J. Cheon, B. J. Choi, and N. S. Kim, “Diff-TTS: A denoising diffusion model for text-to-speech,” in Proc. INTERSPEECH, 2021, pp. 3605–3609

2021

[15]

Matcha-TTS: A fast tts architecture with conditional flow matching,

S. Mehta, R. Tu, J. Beskow, É. Székely, and G. E. Henter, “Matcha-TTS: A fast tts architecture with conditional flow matching,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11 341–11 345

2024

[16]

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

Y. Chen et al., “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025, pp. 6255–6271

2025

[17]

Neural codec language models are zero-shot text to speech synthesizers,

C. Wang et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023

Pith/arXiv arXiv 2023

[18]

Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,

E. Kharitonov et al., “Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 1703–1718, 2023

2023

[19]

BERT: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 4171–4186

2019

[20]

ALBERT: A lite bert for self-supervised learning of language representations,

Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite bert for self-supervised learning of language representations,” in Proceedings of the International Conference on Learning Representations, 2020

2020

[21]

A robustly optimized BERT pre-training approach with post-training,

L. Zhuang, L. Wayne, S. Ya, and Z. Jun, “A robustly optimized BERT pre-training approach with post-training,” in Proceedings of the 20th Chinese National Conference on Computational Linguistics. Huhhot, China: Chinese Information Processing Society of China, Aug. 2021, pp. 1218–1227

2021

[22]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 12 449–12 460

2020

[23]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021

2021

[24]

WavLM: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022

[25]

Pre-trained text embeddings for enhanced text-to-speech synthesis,

T. Hayashi, S. Watanabe, T. Toda, K. Takeda, S. Toshniwal, and K. Livescu, “Pre-trained text embeddings for enhanced text-to-speech synthesis,” in Proc. INTERSPEECH, 2019, pp. 4430–4434

2019

[26]

Improving the prosody of RNN-based English text-to-speech synthesis by incorporating a BERT model,

T. Kenter, M. Sharma, and R. Clark, “Improving the prosody of RNN-based English text-to-speech synthesis by incorporating a BERT model,” in Proc. INTERSPEECH, 2020, pp. 4412–4416

2020

[27]

Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS,

Y. Xiao, L. He, H. Ming, and F. K. Soong, “Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6704–6708

2020

[28]

Improving prosody modelling with cross-utterance bert embeddings for end-to-end speech synthesis,

G. Xu, W. Song, Z. Zhang, C. Zhang, X. He, and B. Zhou, “Improving prosody modelling with cross-utterance bert embeddings for end-to-end speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6079–6083

2021

[29]

PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS,

Y. Jia, H. Zen, J. Shen, Y. Zhang, and Y. Wu, “PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS,” in Proc. INTERSPEECH, 2021, pp. 151–155

2021

[30]

Mixed-Phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech,

G. Zhang et al., “Mixed-Phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech,” in Proc. INTERSPEECH, 2022, pp. 456–460

2022

[31]

Phoneme-level BERT for enhanced prosody of text-to-speech with grapheme predictions,

Y. A. Li, C. Han, X. Jiang, and N. Mesgarani, “Phoneme-level BERT for enhanced prosody of text-to-speech with grapheme predictions,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

2023

[32]

StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,

Y. A. Li, C. Han, V. Raghavan, G. Mischler, and N. Mesgarani, “StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 19 594–19 621

2023

[33]

XPhoneBERT: A pre-trained multilingual model for phoneme representations for text-to-speech,

L. The Nguyen, T. Pham, and D. Q. Nguyen, “XPhoneBERT: A pre-trained multilingual model for phoneme representations for text-to-speech,” in Proc. INTERSPEECH, 2023, pp. 5506–5510

2023

[34]

Phonemizer: Text to phones transcription for multiple languages in python,

M. Bernard and H. Titeux, “Phonemizer: Text to phones transcription for multiple languages in python,” Journal of Open Source Software, vol. 6, no. 68, p. 3958, 2021

2021

[35]

ByT5 model for massively multilingual grapheme-to-phoneme conversion,

J. Zhu, C. Zhang, and D. Jurgens, “ByT5 model for massively multilingual grapheme-to-phoneme conversion,” in Proc. INTERSPEECH, 2022, pp. 446–450

2022

[36]

Attention is all you need,

A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017

2017

[37]

Out-of-the-box universal Romanization tool uroman,

U. Hermjakob, J. May, and K. Knight, “Out-of-the-box universal Romanization tool uroman,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), System Demonstrations, 2018, pp. 13–18

2018

[38]

XTTS: a massively multilingual zero-shot text-to-speech model,

E. Casanova et al., “XTTS: a massively multilingual zero-shot text-to-speech model,” in Proc. INTERSPEECH, 2024, pp. 4978–4982

2024

[39]

Scaling speech technology to 1,000+ languages,

V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi et al., “Scaling speech technology to 1,000+ languages,” Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024

2024

[40]

LAMA-UT: Language agnostic multilingual ASR through orthography unification and language-specific transliteration,

S. Lee, W. Chung, and H.-G. Kang, “LAMA-UT: Language agnostic multilingual ASR through orthography unification and language-specific transliteration,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 23, 2025, pp. 24 393–24 401

2025

[41]

Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet,

International Phonetic Association, Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press, 1999

1999

[42]

Unsupervised cross-lingual representation learning for speech recognition,

A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in Proc. INTERSPEECH, 2021, pp. 2426–2430

2021

[43]

XLS-R: Self-supervised cross-lingual speech representation learning at scale,

A. Babu et al., “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” in Proc. INTERSPEECH, 2022, pp. 2278–2282

2022

[44]

Towards robust speech representation learning for thousands of languages,

W. Chen et al., “Towards robust speech representation learning for thousands of languages,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 10 205–10 224

2024

[45]

Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages,

G. Keren et al., “Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages,” arXiv preprint arXiv:2511.09690, 2025

arXiv 2025

[46]

Layer-wise analysis of a self-supervised speech representation model,

A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 914–921

2021

[47]

Comparative layer-wise analysis of self-supervised speech models,

A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analysis of self-supervised speech models,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

2023

[48]

SELM: Speech enhancement using discrete tokens and language models,

Z. Wang et al., “SELM: Speech enhancement using discrete tokens and language models,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11 561–11 565

2024

[49]

Differentiable K-means for fully-optimized discrete token-based asr,

K. Onda, Y. Kashiwagi, E. Tsunoo, H. Futami, and S. Watanabe, “Differentiable K-means for fully-optimized discrete token-based asr,” in Proc. INTERSPEECH, 2025, pp. 1223–1227

2025

[50]

Geometric constraints on human speech sound inventories,

E. Dunbar and E. Dupoux, “Geometric constraints on human speech sound inventories,” Frontiers in Psychology, vol. 7, p. 1061, 2016

2016

[51]

FLEURS: Few-shot learning evaluation of universal representations of speech,

A. Conneau et al., “FLEURS: Few-shot learning evaluation of universal representations of speech,” in Proceedings of the IEEE Spoken Language Technology Workshop (SLT), 2022, pp. 798–805

2022

[52]

Common Voice: A massively-multilingual speech corpus,

R. Ardila et al., “Common Voice: A massively-multilingual speech corpus,” in Proceedings of the Language Resources and Evaluation Conference (LREC), 2020, pp. 4218–4222

2020

[53]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proceedings of the International Conference on Learning Representations, 2019

2019

[54]

data2vec: A general framework for self-supervised learning in speech, vision and language,

A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “data2vec: A general framework for self-supervised learning in speech, vision and language,” in Proceedings of the International Conference on Machine Learning, vol. 162, 2022, pp. 1298–1312

2022

[55]

The LJ Speech Dataset,

K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017

2017

[56]

Thorsten-Voice Dataset 2022.10,

T. Müller and D. Kreutz, “Thorsten-Voice Dataset 2022.10,” Nov. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.7265581

2022
