arxiv: 2606.21343 · v1 · pith:6VC3KMX3new · submitted 2026-06-19 · 📡 eess.AS · cs.CL · cs.SD

An Evaluation Framework for Text-to-Speech Voice Reconstruction

Pith reviewed 2026-06-26 13:14 UTC · model grok-4.3

classification 📡 eess.AS cs.CL cs.SD
keywords text-to-speech, voice reconstruction, evaluation framework, best worst scaling, speaker similarity, intelligibility, speech disorders, zero-shot TTS

The pith

An evaluation framework using Best Worst Scaling and a dual-reference measure reliably assesses TTS voice reconstruction where Mean Opinion Scores fall short.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a better way to judge text-to-speech systems that try to restore voices for people with speech disorders. These systems aim to keep the original speaker's identity while making speech clearer. Mean Opinion Score, the usual rating method, has limited sensitivity and reliability, and standard objective measures fail to predict reconstruction success for highly unintelligible speakers. The new approach uses Best Worst Scaling with situational framing for human judgments and a dual-reference distributional measure to capture the trade-off between intelligibility and speaker identity. An evaluation of 17 zero-shot TTS systems on 193 speakers supports its reliability and task alignment.

Core claim

The framework consists of subjective evaluation via Best Worst Scaling with situational framing to assess perceived intelligibility and speaker identity, paired with an objective dual-reference distributional measure that captures the trade-off between intelligibility and speaker identity. This approach addresses two limitations: the limited sensitivity and reliability of Mean Opinion Scores, and the failure of standard objective measures to predict reconstruction success for highly unintelligible speakers. Evaluation across 17 zero-shot TTS systems and 193 speakers demonstrates its reliability and task alignment.

What carries the argument

The central mechanism is the combination of Best Worst Scaling (BWS) with situational framing for subjective ratings and a novel dual-reference distributional measure for objective assessment of the intelligibility-speaker identity trade-off in voice reconstruction.
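As a rough illustration of how Best Worst Scaling responses can be turned into per-system scores, here is a minimal count-based sketch. It is not the paper's analysis: Figure 1 reports worth estimates with confidence intervals, which a simple count cannot provide. The trial format, the function name `bws_count_scores`, and the system labels are all assumptions made for illustration.

```python
from collections import Counter


def bws_count_scores(trials):
    """Return a best-minus-worst score in [-1, 1] for every system shown.

    `trials` is an iterable of (shown, best, worst) tuples, where `shown`
    lists the systems presented on one screen (hypothetical format).
    """
    best, worst, shown_count = Counter(), Counter(), Counter()
    for shown, picked_best, picked_worst in trials:
        if picked_best not in shown or picked_worst not in shown:
            raise ValueError("best/worst choices must be among the shown systems")
        if picked_best == picked_worst:
            raise ValueError("best and worst must be different systems")
        shown_count.update(shown)      # one appearance per system on the screen
        best[picked_best] += 1
        worst[picked_worst] += 1
    return {
        system: (best[system] - worst[system]) / shown_count[system]
        for system in shown_count
    }


# Made-up responses for four hypothetical systems A-D.
trials = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "B", "D"),
    (("A", "B", "C", "D"), "A", "C"),
]
scores = bws_count_scores(trials)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Each screen yields two pieces of information (a best and a worst pick), which is one reason BWS can separate systems with fewer screens than a rating scale. A count score like this gives a ranking only; significance testing needs a fitted choice model.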

If this is right

  • The framework enables more sensitive comparison of zero-shot TTS systems for voice reconstruction.
  • It highlights the limited sensitivity of MOS and the failure of standard objective measures for highly unintelligible speakers.
  • Results from 193 speakers support its use as a task-aligned evaluation method.
  • Objective and subjective components together provide a balanced view of reconstruction quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could guide development of TTS systems optimized for specific disorder types.
  • Similar methods might apply to evaluating other speech synthesis tasks like accent conversion.
  • Real-world user testing with actual patients could further validate the framework's practical utility.

Load-bearing premise

That the introduced Best Worst Scaling and dual-reference measure more accurately reflect successful voice reconstruction than traditional Mean Opinion Scores, particularly for unintelligible speakers.

What would settle it

Finding that Mean Opinion Scores from listeners correlate more strongly with actual communication success in real scenarios than the new framework's ratings would challenge the framework's superiority.
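A test of this kind could be run as a rank-correlation comparison. The sketch below computes Spearman's rank correlation in plain Python and applies it to made-up numbers; the per-system "communication success" values, the MOS values, and the framework scores are all hypothetical and do not come from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-system values, one entry per TTS system.
success = [0.2, 0.5, 0.6, 0.9]        # assumed real-world communication success
mos = [3.9, 3.1, 4.0, 3.5]            # assumed MOS ratings
framework = [-1.0, -0.2, 0.1, 0.8]    # assumed scores from the new framework

rho_mos = spearman(success, mos)
rho_framework = spearman(success, framework)
```

With these invented numbers the framework scores track success perfectly while MOS does not, which is the outcome the paper's claim predicts; the reverse pattern on real data is what would count against it.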

Figures

Figures reproduced from arXiv: 2606.21343 by Ariadna Sanchez, Christoph Minixhofer, Korin Richmond, Ondrej Klejch, Peter Bell, Simon King.

Figure 1: Subjective Best Worst Scaling (BWS) worth estimates with 95% confidence intervals, with log-worth scores relative to the original recording. Dotted lines represent systems without statistically significant difference to the recording (p ≥ 0.05). Figures (a, b) show all speakers, and (c, d) low intelligibility speakers.

Abstract

Voice reconstruction using Text-to-Speech (TTS) offers a communication method for people with speech disorders, which aims to retain their speaker identity while improving intelligibility. Previous work generally relies on Mean Opinion Score (MOS) to evaluate naturalness and speaker similarity, but this has limited sensitivity and reliability. We propose an evaluation framework with subjective and objective components. Subjectively, we evaluate perceived intelligibility and speaker identity using Best Worst Scaling (BWS) with situational framing. Objectively, we demonstrate that standard measures fail to predict reconstruction success for highly unintelligible speakers, so we introduce a novel dual-reference distributional measure to assess the trade-off between intelligibility and speaker identity. By evaluating the output of 17 zero-shot TTS systems for 193 speakers, we show that our framework provides a reliable and task-aligned approach for assessing voice reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes an evaluation framework for text-to-speech (TTS) voice reconstruction intended to retain speaker identity while improving intelligibility for individuals with speech disorders. It critiques standard Mean Opinion Score (MOS) measures for limited sensitivity and reliability, introduces Best Worst Scaling (BWS) with situational framing for subjective assessment of perceived intelligibility and speaker identity, and presents a novel dual-reference distributional objective measure to capture the intelligibility-speaker identity trade-off. The framework is evaluated on outputs from 17 zero-shot TTS systems for 193 speakers, with the claim that it provides a reliable and task-aligned approach.

Significance. If the empirical results hold, the framework could meaningfully advance evaluation practices in assistive speech technology by offering metrics better aligned with the reconstruction task than MOS. The scale of the evaluation (17 systems, 193 speakers) supplies independent empirical grounding for the task-alignment assertion and represents a strength of the work.

minor comments (2)
  1. The abstract states that standard measures fail for highly unintelligible speakers and that the new objective measure addresses this, but a brief quantitative summary of the failure (e.g., correlation values or prediction error) would strengthen the motivation paragraph.
  2. Clarify the precise formulation of the dual-reference distributional measure (e.g., how the two references are combined and what distance or divergence is used) in the methods section to allow replication.
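To make the second comment concrete, the sketch below shows one way two references could be combined. It is purely hypothetical and is not the paper's formulation, which the text here does not give. The assumptions are: each utterance is reduced to a single scalar feature, one reference is the speaker's own recordings, the other is a set of typical-speech recordings, and the distance is the one-dimensional Wasserstein-1 distance between equal-sized samples.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-sized 1-D samples.

    For equal sample sizes this is the mean absolute difference
    between the sorted values.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def dual_reference_position(synth, ref_speaker, ref_typical):
    """Place synthetic speech between two references (hypothetical measure).

    Returns a value in [0, 1]: 0 means the synthetic distribution matches
    the speaker reference, 1 means it matches the typical-speech reference.
    """
    d_speaker = wasserstein_1d(synth, ref_speaker)
    d_typical = wasserstein_1d(synth, ref_typical)
    total = d_speaker + d_typical
    if total == 0:
        raise ValueError("both references coincide with the synthetic sample")
    return d_speaker / total


# Made-up scalar features, one value per utterance.
ref_speaker = [1.0, 2.0, 3.0]   # assumed: the speaker's own recordings
ref_typical = [5.0, 6.0, 7.0]   # assumed: typical speech
synth = [2.0, 3.0, 4.0]         # assumed: TTS output for this speaker

position = dual_reference_position(synth, ref_speaker, ref_typical)
```

A replication-ready description would need to state exactly these choices: which features are used, what the two reference sets contain, which distance or divergence is computed, and how the two distances are combined into one number.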

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the accurate summary of the proposed framework, and the recommendation for minor revision. The scale of the evaluation (17 systems, 193 speakers) is indeed a strength, and we are pleased that the task-alignment of the metrics was recognized.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper motivates limitations of MOS for voice reconstruction evaluation, then introduces BWS with situational framing for subjective assessment and a dual-reference distributional objective measure. It validates the framework via independent large-scale evaluation on outputs from 17 zero-shot TTS systems across 193 speakers. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on external empirical results rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based solely on the abstract, the central claim rests on the domain assumption that MOS and standard objective measures are insufficient for this task, particularly for highly unintelligible speakers, and that the proposed methods address this. No free parameters are detailed, and the one invented entity comes without independent evidence.

axioms (1)
  • domain assumption: Mean Opinion Score has limited sensitivity and reliability for evaluating voice reconstruction, and standard objective measures fail to predict reconstruction success for highly unintelligible speakers
    Explicitly stated in the abstract as the motivation for the new framework
invented entities (1)
  • dual-reference distributional measure (no independent evidence)
    purpose: To assess the trade-off between intelligibility and speaker identity
    Introduced as a novel objective component in the abstract

Reference graph

Works this paper leans on

63 extracted references · 1 canonical work pages

  1. [1]

    Eventually, some of these people will become users of a voice output communication aid (VOCA), which often uses Text-to-Speech (TTS)

    Introduction. Speech disorders, caused by neurological conditions, can impact someone’s ability to communicate. Eventually, some of these people will become users of a voice output communication aid (VOCA), which often uses Text-to-Speech (TTS). Voice reconstruction is the task of creating a personalised TTS voice for speakers whose speech is already ...

  2. [2]

    This approach is difficult to apply to voice reconstruction due to the lack of pre-condition reference data [13, 14]

    and shown correlation with listener preferences [11–13]. This approach is difficult to apply to voice reconstruction due to the lack of pre-condition reference data [13, 14]. Despite widespread use, the ability of objective measures to predict listener preferences for voice reconstruction is unexplored. In previous work, TTS systems trained solely on ty...

  3. [3]

    Most previous work evaluates naturalness with MOS [2, 3, 15]

    Related work. TTS voice reconstruction has mostly been evaluated for naturalness, intelligibility, and similarity to the target speaker. Most previous work evaluates naturalness with MOS [2, 3, 15]. Recently, [16] proposed evaluating suitability of TTS output for ... (Footnote 1: https://minixc.github.io/sap/) Table 1: Breakdown of...

  4. [4]

    We only kept speakers whose first language was English to avoid adding foreign accents as an additional confound, as TTS systems have shown to underperform in those [31]

    Dataset. We selected a total of 193 speakers from the Speech Accessibility Project (SAP) dataset (December 2024 release) [30] with four different types of condition: Parkinson’s, Cerebral Palsy, ALS, or Down Syndrome. We only kept speakers whose first language was English to avoid adding foreign accents as an additional confound, as TTS systems have show...

  5. [5]

    Evaluating voice reconstruction. Voice reconstruction is a specific use-case for TTS which should not be evaluated with general naturalness and similarity evaluations. A number of dimensions are important in voice reconstruction, including for example: (1) the intelligibility of the output, (2) how well the speaker was reconstructed, i.e., evaluating ...

  6. [6]

    BWS has been shown to be a good alternative to MOS which can evaluate system preferences with fewer screens (and therefore fewer listeners) with statistical significance [17]

    Subjective evaluation. We chose Best Worst Scaling (BWS) to conduct INTELLIGIBILITY and RECONSTRUCTION subjective evaluations. BWS has been shown to be a good alternative to MOS which can evaluate system preferences with fewer screens (and therefore fewer listeners) with statistical significance [17]. For each type of evaluation, we conducted 5 separate li...

  7. [7]

    Objective evaluation. We investigate objective measures applied to previous voice reconstruction work by testing whether system rankings according to those measures correlate with the rankings derived from subjective evaluation. We compute WER by automatically transcribing the synthetic speech and recordings using Whisper [19] (turbo checkpoint) and P...

  8. [8]

    Objective evaluation methods sometimes fail to generalise to new domains [8] and have to be tested each time a new system or domain is introduced [52]

    Discussion and Conclusion. With advancements in generative speech synthesis, common subjective evaluation protocols such as MOS naturalness have become saturated, and their existing limitations have been amplified [17]. Objective evaluation methods sometimes fail to generalise to new domains [8] and have to be tested each time a new system or domain is i...

  9. [9]

    The first author was supported by the UKRI Centre for Doctoral Training in Natural Language Processing, funded by UKRI (grant EP/S022481/1)

    Acknowledgements. This study has been approved by the School of Informatics Ethics’ Committee at the University of Edinburgh, with reference number 997684. The first author was supported by the UKRI Centre for Doctoral Training in Natural Language Processing, funded by UKRI (grant EP/S022481/1). We want to thank Mark Hasegawa-Johnson and collaborators ...

  10. [10]

    Generative AI use disclosure. We did not use any generative AI for this work, except to generate the synthetic speech stimuli and parts of the demo page

  11. [11]

    A comparison of manual and automatic voice repair for individual with vocal disabilities,

    C. Veaux, J. Yamagishi, and S. King, “A comparison of manual and automatic voice repair for individual with vocal disabilities,” in Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, 2015, pp. 130–133

  12. [12]

    Creating personalized synthetic voices from articulation impaired speech using augmented reconstruction loss,

    Y. Tian, J. Li, and T. Lee, “Creating personalized synthetic voices from articulation impaired speech using augmented reconstruction loss,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11501–11505

  13. [13]

    Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning,

    Y. Jeon, S. Im, Y. Kim, and G. G. Lee, “Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning,” arXiv preprint arXiv:2508.10412, 2025

  14. [14]

    The limits of the mean opinion score for speech synthesis evaluation,

    S. Le Maguer, S. King, and N. Harte, “The limits of the mean opinion score for speech synthesis evaluation,” Computer Speech & Language, vol. 84, p. 101577, 2024

  15. [15]

    The State Of TTS: A Case Study with Human Fooling Rates,

    P. Srinivasa Varadhan, S. Thomas, S. Teja M S, S. Bhooshan, and M. M. Khapra, “The State Of TTS: A Case Study with Human Fooling Rates,” in Interspeech 2025. ISCA, 2025, pp. 2285–2289

  16. [16]

    Hot topics in speech synthesis evaluation,

    G. Bailly, E. André, E. Cooper, B. Cowan, J. Edlund, N. Harte, S. King, E. Klabbers, S. Le Maguer, Z. Malisz et al., “Hot topics in speech synthesis evaluation,” in Speech Synthesis Workshop. ISCA, 2025, pp. 1–7

  17. [17]

    Good practices for evaluation of synthesized speech,

    E. Cooper, S. L. Maguer, E. Klabbers, and J. Yamagishi, “Good practices for evaluation of synthesized speech,” arXiv preprint arXiv:2503.03250, 2025

  18. [18]

    A review on subjective and objective evaluation of synthetic speech,

    E. Cooper, W.-C. Huang, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “A review on subjective and objective evaluation of synthetic speech,” Acoustical Science and Technology, vol. 45, no. 4, pp. 161–183, 2024

  19. [19]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium,

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]....

  20. [20]

    Fréchet Audio Distance: A reference-free metric for evaluating music enhancement algorithms,

    K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet Audio Distance: A reference-free metric for evaluating music enhancement algorithms,” in Interspeech. ISCA, 2019

  21. [21]

    TTSDS-text-to-speech distribution score,

    C. Minixhofer, O. Klejch, and P. Bell, “TTSDS-text-to-speech distribution score,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 766–773

  22. [22]

    TTSDS2: Robust objective evaluation for human-quality synthetic speech,

    C. Minixhofer, O. Klejch, and P. Bell, “TTSDS2: Robust objective evaluation for human-quality synthetic speech,” in The 13th Speech Synthesis Workshop. ISCA, 2025, pp. 68–75

  23. [23]

    Layer-wise Analysis for Quality of Multilingual Synthesized Speech,

    E. Cooper, T. Okamoto, Y. Ohtani, T. Toda, and H. Kawai, “Layer-wise Analysis for Quality of Multilingual Synthesized Speech,” arXiv preprint arXiv:2509.04830, 2025

  24. [24]

    Progress and Challenges in DNN-Based Objective Quality Assessment of Synthesized Speech,

    E. Cooper, “Progress and Challenges in DNN-Based Objective Quality Assessment of Synthesized Speech,” in 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2025, pp. 2558–2563

  25. [25]

    Zero-shot voice cloning text-to-speech for dysphonia disorder speakers,

    K. Azizah, “Zero-shot voice cloning text-to-speech for dysphonia disorder speakers,” IEEE Access, vol. 12, pp. 63528–63547, 2024

  26. [26]

    Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication,

    É. Székely, P. Mihajlik, M. S. Kádár, and L. Tóth, “Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication,” in Interspeech. ISCA, 2025, pp. 2735–2739

  27. [27]

    Experimental evaluation of MOS, AB and BWS listening test designs,

    D. Wells, A. L. Aldana Blanco, C. Valentini, E. Cooper, A. Pine, J. Yamagishi, and K. Richmond, “Experimental evaluation of MOS, AB and BWS listening test designs,” in Interspeech 2024, 2024, pp. 2695–2699

  28. [28]

    Assessing the impact of contextual framing on subjective TTS quality,

    J. Edlund, C. Tånnander, S. Le Maguer, and P. Wagner, “Assessing the impact of contextual framing on subjective TTS quality,” in Interspeech. ISCA, 2024, pp. 1205–1209

  29. [29]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518

  30. [30]

    Universal phone recognition with a multilingual allophone system,

    X. Li, S. Dalmia, J. Li, M. Lee, P. Littell, J. Yao, A. Anastasopoulos, D. R. Mortensen, G. Neubig, A. W. Black et al., “Universal phone recognition with a multilingual allophone system,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 8249–8253

  31. [31]

    Wespeaker: A Research and Production Oriented Speaker Embedding Learning Toolkit,

    H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “Wespeaker: A Research and Production Oriented Speaker Embedding Learning Toolkit,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

  32. [32]

    UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,

    T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Interspeech 2022. ISCA, 2022, pp. 4521–4525

  33. [33]

    Reinforcement Learning-Driven Personalized Voice Similarity Enhancement in Dysarthric Speech Synthesis,

    C. Pu, F. Mu, J. Zhang, R. Huang, and H. Cheng, “Reinforcement Learning-Driven Personalized Voice Similarity Enhancement in Dysarthric Speech Synthesis,” in 2025 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2025, pp. 383–389

  34. [34]

    Can We Reconstruct a Dysarthric Voice with the Large Speech Model Parler TTS?

    A. Sanchez and S. King, “Can We Reconstruct a Dysarthric Voice with the Large Speech Model Parler TTS?” in Interspeech 2025, 2025, pp. 4138–4142

  35. [35]

    Using HMM-based Speech Synthesis to Reconstruct the Voice of Individuals with Degenerative Speech Disorders

    C. Veaux, J. Yamagishi, and S. King, “Using HMM-based Speech Synthesis to Reconstruct the Voice of Individuals with Degenerative Speech Disorders,” in Interspeech. ISCA, 2012, pp. 967–970

  36. [36]

    Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition,

    D. Wagner, I. Baumann, N. Engert, S. Lee, E. Nöth, K. Riedhammer, and T. Bocklet, “Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition,” in Interspeech 2025. ISCA, 2025, pp. 3294–3298

  37. [37]

    Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation,

    K. Rosero, E. Yeo, D. R. Mortensen, C. V. Slot, R. R. Hallac, and C. Busso, “Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation,” arXiv preprint arXiv:2509.19231, 2025

  38. [38]

    DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model,

    X. Chen, D. Yang, W. Wu, M. Wu, J. Xu, X. Wu, Z. Wu, and H. Meng, “DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model,” in Interspeech 2025. ISCA, 2025, pp. 2113–2117

  39. [39]

    CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction,

    X. Chen, D. Yang, D. Wang, X. Wu, Z. Wu, and H. Meng, “CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction,” in Interspeech

  40. [40]

    4129–4133

    ISCA, 2024, pp. 4129–4133

  41. [41]

    Community-supported shared infrastructure in support of speech accessibility,

    M. Hasegawa-Johnson, X. Zheng, H. Kim, C. Mendes, M. Dickinson, E. Hege, C. Zwilling, M. M. Channell, L. Mattie, H. Hodges et al., “Community-supported shared infrastructure in support of speech accessibility,” Journal of Speech, Language, and Hearing Research, vol. 67, no. 11, pp. 4162–4175, 2024

  42. [42]

    AccentBox: Towards High-Fidelity Zero-Shot Accent Generation,

    J. Zhong, K. Richmond, Z. Su, and S. Sun, “AccentBox: Towards High-Fidelity Zero-Shot Accent Generation,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  43. [43]

    IndexTTS2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,

    S. Zhou, Y. Zhou, Y. He, X. Zhou, J. Wang, W. Deng, and J. Shu, “IndexTTS2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025

  44. [44]

    Qwen3-TTS Technical Report,

    H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo et al., “Qwen3-TTS Technical Report,” arXiv preprint arXiv:2601.15621, 2026

  45. [45]

    E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS,

    S. E. Eskimez, X. Wang, M. Thakker, C. Li, C.-H. Tsai, Z. Xiao, H. Yang, Z. Zhu, M. Tang, X. Tan et al., “E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 682–689

  46. [46]

    Fish-speech: Leveraging large language models for advanced multilingual text-to-speech synthesis,

    S. Liao, Y. Wang, T. Li, Y. Cheng, R. Zhang, R. Zhou, and Y. Xing, “Fish-speech: Leveraging large language models for advanced multilingual text-to-speech synthesis,” arXiv preprint arXiv:2411.01156, 2024

  47. [47]

    F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

    Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6255–6271

  48. [48]

    Maskgct: Zero-shot text-to-speech with masked generative codec transformer,

    Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,” arXiv preprint arXiv:2409.00750, 2024

  49. [49]

    VibeVoice: Expressive Podcast Generation with Next-Token Diffusion,

    Z. Peng, J. Yu, W. Wang, Y. Chang, Y. Sun, L. Dong, Y. Zhu, W. Xu, H. Bao, Z. Wang et al., “VibeVoice: Expressive Podcast Generation with Next-Token Diffusion,” in The Fourteenth International Conference on Learning Representations

  50. [50]

    Voicecraft: Zero-shot speech editing and text-to-speech in the wild,

    P. Peng, P.-Y. Huang, S.-W. Li, A. Mohamed, and D. Harwath, “Voicecraft: Zero-shot speech editing and text-to-speech in the wild,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 12442–12462

  51. [51]

    GPT-SoVITS,

    RVC-Boss, “GPT-SoVITS,” https://github.com/RVC-Boss/GPT-SoVITS, 2024, GitHub repository. [Online]. Available: https://github.com/RVC-Boss/GPT-SoVITS

  52. [52]

    Hierspeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis,

    S.-H. Lee, S.-B. Kim, J.-H. Lee, E. Song, M.-J. Hwang, and S.-W. Lee, “Hierspeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis,” Advances in Neural Information Processing Systems, vol. 35, pp. 16624–16636, 2022

  53. [53]

    Styletts 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,

    Y. A. Li, C. Han, V. Raghavan, G. Mischler, and N. Mesgarani, “Styletts 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 19594–19621, 2023

  54. [54]

    Better speech synthesis through scaling,

    J. Betker, “Better speech synthesis through scaling,” arXiv preprint arXiv:2305.07243, 2023

  55. [55]

    Vevo: Controllable zero-shot voice imitation with self-supervised disentanglement,

    X. Zhang, X. Zhang, K. Peng, Z. Tang, V. Manohar, Y. Liu, J. Hwang, D. Li, Y. Wang, J. Chan et al., “Vevo: Controllable zero-shot voice imitation with self-supervised disentanglement,” arXiv preprint arXiv:2502.07243, 2025

  56. [56]

    MetaVoice-src,

    MetaVoice, “MetaVoice-src,” https://github.com/metavoiceio/metavoice-src, 2024, GitHub repository. [Online]. Available: https://github.com/metavoiceio/metavoice-src

  57. [57]

    XTTS: a massively multilingual zero-shot text-to-speech model,

    E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi et al., “XTTS: a massively multilingual zero-shot text-to-speech model,” arXiv preprint arXiv:2406.04904, 2024

  58. [58]

    WhisperSpeech,

    WhisperSpeech, “WhisperSpeech,” https://github.com/WhisperSpeech/WhisperSpeech, 2024, GitHub repository. [Online]. Available: https://github.com/WhisperSpeech/WhisperSpeech

  59. [59]

    Openvoice: Versatile instant voice cloning,

    Z. Qin, W. Zhao, X. Yu, and X. Sun, “Openvoice: Versatile instant voice cloning,” arXiv preprint arXiv:2312.01479, 2023

  60. [60]

    Modelling rankings in R: the PlackettLuce package,

    H. L. Turner, J. van Etten, D. Firth, and I. Kosmidis, “Modelling rankings in R: the PlackettLuce package,” Computational Statistics, vol. 35, no. 3, pp. 1027–1057, 2020

  61. [61]

    Automatic cross- and multi-lingual recognition of dysphonia by ensemble classification using deep speaker embedding models,

    D. Aziz and D. Sztahó, “Automatic cross- and multi-lingual recognition of dysphonia by ensemble classification using deep speaker embedding models,” Expert Systems, vol. 41, no. 10, p. e13660, 2024. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/exsy.13660

  62. [62]

    LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,

    H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,” in Interspeech 2019. ISCA, 2019, pp. 1526–1530

  63. [63]

    Quality prediction for synthesized speech: Comparison of approaches,

    S. Möller and T. H. Falk, “Quality prediction for synthesized speech: Comparison of approaches,” in International Conference on Acoustics, 2009, pp. 1168–1171