An Evaluation Framework for Text-to-Speech Voice Reconstruction
Pith reviewed 2026-06-26 13:14 UTC · model grok-4.3
The pith
An evaluation framework using Best Worst Scaling and a dual-reference measure reliably assesses TTS voice reconstruction where Mean Opinion Scores fall short.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework consists of subjective evaluation via Best Worst Scaling with situational framing to assess perceived intelligibility and speaker identity, paired with an objective dual-reference distributional measure that captures the trade-off between intelligibility and speaker identity. This approach addresses the limitations of Mean Opinion Scores, which fail to predict reconstruction success for highly unintelligible speakers. Evaluation across 17 zero-shot TTS systems and 193 speakers demonstrates its reliability and task alignment.
What carries the argument
The central mechanism is the combination of Best Worst Scaling (BWS) with situational framing for subjective ratings and a novel dual-reference distributional measure for objective assessment of the intelligibility-speaker identity trade-off in voice reconstruction.
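As a rough illustration of how Best Worst Scaling judgments can be turned into per-system scores, the sketch below uses simple best-minus-worst counting. This is a common BWS scoring scheme and not necessarily the analysis used in the paper (the reference list includes a Plackett-Luce ranking package, which suggests a model-based analysis may have been used). The function name `bws_scores` and the trial format are assumptions made for this example.

```python
from collections import defaultdict


def bws_scores(trials):
    """Best-minus-worst counting scores for Best Worst Scaling.

    `trials` is an iterable of (shown, best, worst) tuples, where `shown`
    is the collection of system names presented on one screen and `best`
    and `worst` are the listener's picks. This input format is hypothetical.
    Returns {system: (times_best - times_worst) / times_shown}, in [-1, 1].
    """
    best_count = defaultdict(int)
    worst_count = defaultdict(int)
    shown_count = defaultdict(int)

    for shown, best, worst in trials:
        # A valid trial picks two different systems from those on screen.
        if best not in shown or worst not in shown or best == worst:
            raise ValueError(f"invalid trial: {(shown, best, worst)!r}")
        for system in shown:
            shown_count[system] += 1
        best_count[best] += 1
        worst_count[worst] += 1

    return {
        system: (best_count[system] - worst_count[system]) / n_shown
        for system, n_shown in shown_count.items()
    }
```

A system chosen as best on every screen scores 1, one chosen as worst on every screen scores -1, and a system never singled out scores 0.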
If this is right
- The framework enables more sensitive comparison of zero-shot TTS systems for voice reconstruction.
- It highlights the shortcomings of standard MOS for unintelligible speakers.
- Results from 193 speakers support its use as a task-aligned evaluation method.
- Objective and subjective components together provide a balanced view of reconstruction quality.
Where Pith is reading between the lines
- This approach could guide development of TTS systems optimized for specific disorder types.
- Similar methods might apply to evaluating other speech synthesis tasks like accent conversion.
- Real-world user testing with actual patients could further validate the framework's practical utility.
Load-bearing premise
That the introduced Best Worst Scaling and dual-reference measure more accurately reflect successful voice reconstruction than traditional Mean Opinion Scores, particularly for unintelligible speakers.
What would settle it
If listener Mean Opinion Scores turned out to correlate more strongly with actual communication success in real scenarios than the new framework's ratings do, the framework's claimed superiority would be in doubt.
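Such a test comes down to asking which evaluation method's per-system scores order the systems more like an external success measure does. A minimal sketch using Spearman rank correlation follows. The helper names are hypothetical, no values from the paper are used, and how "communication success" would be measured is left open.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5
```

With one score per TTS system from each method, comparing `spearman(mos_scores, success)` against `spearman(framework_scores, success)` would indicate which method tracks the success measure more closely (variable names here are placeholders).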
Original abstract
Voice reconstruction using Text-to-Speech (TTS) offers a communication method for people with speech disorders, which aims to retain their speaker identity while improving intelligibility. Previous work generally relies on Mean Opinion Score (MOS) to evaluate naturalness and speaker similarity, but this has limited sensitivity and reliability. We propose an evaluation framework with subjective and objective components. Subjectively, we evaluate perceived intelligibility and speaker identity using Best Worst Scaling (BWS) with situational framing. Objectively, we demonstrate that standard measures fail to predict reconstruction success for highly unintelligible speakers, so we introduce a novel dual-reference distributional measure to assess the trade-off between intelligibility and speaker identity. By evaluating the output of 17 zero-shot TTS systems for 193 speakers, we show that our framework provides a reliable and task-aligned approach for assessing voice reconstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an evaluation framework for text-to-speech (TTS) voice reconstruction intended to retain speaker identity while improving intelligibility for individuals with speech disorders. It critiques standard Mean Opinion Score (MOS) measures for limited sensitivity and reliability, introduces Best Worst Scaling (BWS) with situational framing for subjective assessment of perceived intelligibility and speaker identity, and presents a novel dual-reference distributional objective measure to capture the intelligibility-speaker identity trade-off. The framework is evaluated on outputs from 17 zero-shot TTS systems for 193 speakers, with the claim that it provides a reliable and task-aligned approach.
Significance. If the empirical results hold, the framework could meaningfully advance evaluation practices in assistive speech technology by offering metrics better aligned with the reconstruction task than MOS. The scale of the evaluation (17 systems, 193 speakers) supplies independent empirical grounding for the task-alignment assertion and represents a strength of the work.
minor comments (2)
- The abstract states that standard measures fail for highly unintelligible speakers and that the new objective measure addresses this, but a brief quantitative summary of the failure (e.g., correlation values or prediction error) would strengthen the motivation paragraph.
- Clarify the precise formulation of the dual-reference distributional measure (e.g., how the two references are combined and what distance or divergence is used) in the methods section to allow replication.
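To make the second comment concrete, the sketch below shows one form a dual-reference distributional score could take. It is purely illustrative and is not the paper's measure: the choice of a 1-D Wasserstein distance, the ratio used to combine the two distances, and the names `wasserstein_1d` and `dual_reference_score` are all assumptions, and this page does not state which two references the paper uses.

```python
def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between two equal-sized 1-D samples.

    For equal sample sizes this is the mean absolute difference between
    the sorted samples.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def dual_reference_score(synth, ref_a, ref_b):
    """Place a synthetic-speech feature sample between two references.

    Hypothetical combination rule: returns a value in [0, 1] that is 1 when
    `synth` matches `ref_a` exactly and 0 when it matches `ref_b` exactly.
    """
    d_a = wasserstein_1d(synth, ref_a)
    d_b = wasserstein_1d(synth, ref_b)
    total = d_a + d_b
    if total == 0:
        raise ValueError("both references coincide with the synthetic sample")
    return d_b / total
```

A specification at this level of detail (which feature is extracted, which distance is used, and how the two distances are combined) is what replication would need.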
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work, the accurate summary of the proposed framework, and the recommendation for minor revision. The scale of the evaluation (17 systems, 193 speakers) is indeed a strength, and we are pleased that the task-alignment of the metrics was recognized.
Circularity Check
No significant circularity
The paper motivates limitations of MOS for voice reconstruction evaluation, then introduces BWS with situational framing for subjective assessment and a dual-reference distributional objective measure. It validates the framework via independent large-scale evaluation on outputs from 17 zero-shot TTS systems across 193 speakers. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on external empirical results rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Mean Opinion Score has limited sensitivity and reliability for evaluating voice reconstruction of highly unintelligible speakers
invented entities (1)
- dual-reference distributional measure: no independent evidence
Reference graph
Works this paper leans on
- [1] Introduction: Speech disorders, caused by neurological conditions, can impact someone's ability to communicate. Eventually, some of these people will become users of a voice output communication aid (VOCA), which often uses Text-to-Speech (TTS). Voice reconstruction is the task of creating a personalised TTS voice for speakers whose speech is already ...
- [2] and shown correlation with listener preferences [11–13]. This approach is difficult to apply to voice reconstruction due to the lack of pre-condition reference data [13, 14]. Despite widespread use, the ability of objective measures to predict listener preferences for voice reconstruction is unexplored. In previous work, TTS systems trained solely on ty...
- [3] Related work: TTS voice reconstruction has mostly been evaluated for naturalness, intelligibility, and similarity to the target speaker. Most previous work evaluates naturalness with MOS [2, 3, 15]. Recently, [16] proposed evaluating suitability of TTS output for ... (footnote 1: https://minixc.github.io/sap/; arXiv:2606.21343v1 [eess.AS], 19 Jun 2026; Table 1: Breakdown of...)
- [4] Dataset: We selected a total of 193 speakers from the Speech Accessibility Project (SAP) dataset (December 2024 release) [30] with four different types of condition: Parkinson's, Cerebral Palsy, ALS, or Down Syndrome. We only kept speakers whose first language was English to avoid adding foreign accents as an additional confound, as TTS systems have show...
- [5] Evaluating voice reconstruction: Voice reconstruction is a specific use-case for TTS which should not be evaluated with general naturalness and similarity evaluations. A number of dimensions are important in voice reconstruction, including for example: (1) the intelligibility of the output, (2) how well the speaker was reconstructed, i.e., evaluating ...
- [6] Subjective evaluation: We chose Best Worst Scaling (BWS) to conduct INTELLIGIBILITY and RECONSTRUCTION subjective evaluations. BWS has been shown to be a good alternative to MOS which can evaluate system preferences with fewer screens (and therefore fewer listeners) with statistical significance [17]. For each type of evaluation, we conducted 5 separate li...
- [7] Objective evaluation: We investigate objective measures applied to previous voice reconstruction work by testing whether system rankings according to those measures correlate with the rankings derived from subjective evaluation. We compute WER by automatically transcribing the synthetic speech and recordings using Whisper [19] (turbo checkpoint) and P...
- [8] Discussion and Conclusion: With advancements in generative speech synthesis, common subjective evaluation protocols such as MOS naturalness have become saturated, and their existing limitations have been amplified [17]. Objective evaluation methods sometimes fail to generalise to new domains [8] and have to be tested each time a new system or domain is i...
- [9] Acknowledgements: This study has been approved by the School of Informatics Ethics' Committee at the University of Edinburgh, with reference number 997684. The first author was supported by the UKRI Centre for Doctoral Training in Natural Language Processing, funded by UKRI (grant EP/S022481/1). We want to thank Mark Hasegawa-Johnson and collaborators ...
- [10] Generative AI use disclosure: We did not use any generative AI for this work, except to generate the synthetic speech stimuli and parts of the demo page.
- [11] C. Veaux, J. Yamagishi, and S. King, “A comparison of manual and automatic voice repair for individual with vocal disabilities,” in Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, 2015, pp. 130–133.
- [12] Y. Tian, J. Li, and T. Lee, “Creating personalized synthetic voices from articulation impaired speech using augmented reconstruction loss,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11501–11505.
- [13] Y. Jeon, S. Im, Y. Kim, and G. G. Lee, “Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning,” arXiv preprint arXiv:2508.10412, 2025.
- [14] S. Le Maguer, S. King, and N. Harte, “The limits of the mean opinion score for speech synthesis evaluation,” Computer Speech & Language, vol. 84, p. 101577, 2024.
- [15] P. Srinivasa Varadhan, S. Thomas, S. Teja M S, S. Bhooshan, and M. M. Khapra, “The State Of TTS: A Case Study with Human Fooling Rates,” in Interspeech 2025. ISCA, 2025, pp. 2285–2289.
- [16] G. Bailly, E. André, E. Cooper, B. Cowan, J. Edlund, N. Harte, S. King, E. Klabbers, S. Le Maguer, Z. Malisz et al., “Hot topics in speech synthesis evaluation,” in Speech Synthesis Workshop. ISCA, 2025, pp. 1–7.
- [17] E. Cooper, S. L. Maguer, E. Klabbers, and J. Yamagishi, “Good practices for evaluation of synthesized speech,” arXiv preprint arXiv:2503.03250, 2025.
- [18] E. Cooper, W.-C. Huang, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “A review on subjective and objective evaluation of synthetic speech,” Acoustical Science and Technology, vol. 45, no. 4, pp. 161–183, 2024.
- [19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]....
- [20] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet Audio Distance: A reference-free metric for evaluating music enhancement algorithms,” in Interspeech. ISCA, 2019.
- [21] C. Minixhofer, O. Klejch, and P. Bell, “TTSDS-text-to-speech distribution score,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 766–773.
- [22] C. Minixhofer, O. Klejch, and P. Bell, “TTSDS2: Robust objective evaluation for human-quality synthetic speech,” in The 13th Speech Synthesis Workshop. ISCA, 2025, pp. 68–75.
- [23] E. Cooper, T. Okamoto, Y. Ohtani, T. Toda, and H. Kawai, “Layer-wise Analysis for Quality of Multilingual Synthesized Speech,” arXiv preprint arXiv:2509.04830, 2025.
- [24] E. Cooper, “Progress and Challenges in DNN-Based Objective Quality Assessment of Synthesized Speech,” in 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2025, pp. 2558–2563.
- [25] K. Azizah, “Zero-shot voice cloning text-to-speech for dysphonia disorder speakers,” IEEE Access, vol. 12, pp. 63528–63547, 2024.
- [26] É. Székely, P. Mihajlik, M. S. Kádár, and L. Tóth, “Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication,” in Interspeech. ISCA, 2025, pp. 2735–2739.
- [27] D. Wells, A. L. Aldana Blanco, C. Valentini, E. Cooper, A. Pine, J. Yamagishi, and K. Richmond, “Experimental evaluation of MOS, AB and BWS listening test designs,” in Interspeech 2024, 2024, pp. 2695–2699.
- [28] J. Edlund, C. Tånnander, S. Le Maguer, and P. Wagner, “Assessing the impact of contextual framing on subjective TTS quality,” in Interspeech. ISCA, 2024, pp. 1205–1209.
- [29] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518.
- [30] X. Li, S. Dalmia, J. Li, M. Lee, P. Littell, J. Yao, A. Anastasopoulos, D. R. Mortensen, G. Neubig, A. W. Black et al., “Universal phone recognition with a multilingual allophone system,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 8249–8253.
- [31] H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “Wespeaker: A Research and Production Oriented Speaker Embedding Learning Toolkit,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- [32] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Interspeech 2022. ISCA, 2022, pp. 4521–4525.
- [33] C. Pu, F. Mu, J. Zhang, R. Huang, and H. Cheng, “Reinforcement Learning-Driven Personalized Voice Similarity Enhancement in Dysarthric Speech Synthesis,” in 2025 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2025, pp. 383–389.
- [34] A. Sanchez and S. King, “Can We Reconstruct a Dysarthric Voice with the Large Speech Model Parler TTS?” in Interspeech 2025, 2025, pp. 4138–4142.
- [35] C. Veaux, J. Yamagishi, and S. King, “Using HMM-based Speech Synthesis to Reconstruct the Voice of Individuals with Degenerative Speech Disorders,” in Interspeech. ISCA, 2012, pp. 967–970.
- [36] D. Wagner, I. Baumann, N. Engert, S. Lee, E. Nöth, K. Riedhammer, and T. Bocklet, “Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition,” in Interspeech 2025. ISCA, 2025, pp. 3294–3298.
- [37] K. Rosero, E. Yeo, D. R. Mortensen, C. V. Slot, R. R. Hallac, and C. Busso, “Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation,” arXiv preprint arXiv:2509.19231, 2025.
- [38] X. Chen, D. Yang, W. Wu, M. Wu, J. Xu, X. Wu, Z. Wu, and H. Meng, “DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model,” in Interspeech 2025. ISCA, 2025, pp. 2113–2117.
- [39] X. Chen, D. Yang, D. Wang, X. Wu, Z. Wu, and H. Meng, “CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction,” in Interspeech. ISCA, 2024, pp. 4129–4133.
- [41] M. Hasegawa-Johnson, X. Zheng, H. Kim, C. Mendes, M. Dickinson, E. Hege, C. Zwilling, M. M. Channell, L. Mattie, H. Hodges et al., “Community-supported shared infrastructure in support of speech accessibility,” Journal of Speech, Language, and Hearing Research, vol. 67, no. 11, pp. 4162–4175, 2024.
- [42] J. Zhong, K. Richmond, Z. Su, and S. Sun, “AccentBox: Towards High-Fidelity Zero-Shot Accent Generation,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
- [43] S. Zhou, Y. Zhou, Y. He, X. Zhou, J. Wang, W. Deng, and J. Shu, “IndexTTS2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025.
- [44] H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo et al., “Qwen3-TTS Technical Report,” arXiv preprint arXiv:2601.15621, 2026.
- [45] S. E. Eskimez, X. Wang, M. Thakker, C. Li, C.-H. Tsai, Z. Xiao, H. Yang, Z. Zhu, M. Tang, X. Tan et al., “E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS,” in 2024 IEEE spoken language technology workshop (SLT). IEEE, 2024, pp. 682–689.
- [46] S. Liao, Y. Wang, T. Li, Y. Cheng, R. Zhang, R. Zhou, and Y. Xing, “Fish-speech: Leveraging large language models for advanced multilingual text-to-speech synthesis,” arXiv preprint arXiv:2411.01156, 2024.
- [47] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6255–6271.
- [48] Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,” arXiv preprint arXiv:2409.00750, 2024.
- [49] Z. Peng, J. Yu, W. Wang, Y. Chang, Y. Sun, L. Dong, Y. Zhu, W. Xu, H. Bao, Z. Wang et al., “VibeVoice: Expressive Podcast Generation with Next-Token Diffusion,” in The Fourteenth International Conference on Learning Representations.
- [50] P. Peng, P.-Y. Huang, S.-W. Li, A. Mohamed, and D. Harwath, “Voicecraft: Zero-shot speech editing and text-to-speech in the wild,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 12442–12462.
- [51] RVC-Boss, “GPT-SoVITS,” https://github.com/RVC-Boss/GPT-SoVITS, 2024, GitHub repository. [Online]. Available: https://github.com/RVC-Boss/GPT-SoVITS
- [52] S.-H. Lee, S.-B. Kim, J.-H. Lee, E. Song, M.-J. Hwang, and S.-W. Lee, “Hierspeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis,” Advances in Neural Information Processing Systems, vol. 35, pp. 16624–16636, 2022.
- [53] Y. A. Li, C. Han, V. Raghavan, G. Mischler, and N. Mesgarani, “Styletts 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” Advances in neural information processing systems, vol. 36, pp. 19594–19621, 2023.
- [54] J. Betker, “Better speech synthesis through scaling,” arXiv preprint arXiv:2305.07243, 2023.
- [55] X. Zhang, X. Zhang, K. Peng, Z. Tang, V. Manohar, Y. Liu, J. Hwang, D. Li, Y. Wang, J. Chan et al., “Vevo: Controllable zero-shot voice imitation with self-supervised disentanglement,” arXiv preprint arXiv:2502.07243, 2025.
- [56] MetaVoice, “MetaVoice-src,” https://github.com/metavoiceio/metavoice-src, 2024, GitHub repository. [Online]. Available: https://github.com/metavoiceio/metavoice-src
- [57] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi et al., “XTTS: a massively multilingual zero-shot text-to-speech model,” arXiv preprint arXiv:2406.04904, 2024.
- [58] WhisperSpeech, “WhisperSpeech,” https://github.com/WhisperSpeech/WhisperSpeech, 2024, GitHub repository. [Online]. Available: https://github.com/WhisperSpeech/WhisperSpeech
- [59] Z. Qin, W. Zhao, X. Yu, and X. Sun, “Openvoice: Versatile instant voice cloning,” arXiv preprint arXiv:2312.01479, 2023.
- [60] H. L. Turner, J. van Etten, D. Firth, and I. Kosmidis, “Modelling rankings in R: the PlackettLuce package,” Computational Statistics, vol. 35, no. 3, pp. 1027–1057, 2020.
- [61] D. Aziz and D. Sztahó, “Automatic cross- and multi-lingual recognition of dysphonia by ensemble classification using deep speaker embedding models,” Expert Systems, vol. 41, no. 10, p. e13660, 2024. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/exsy.13660
- [62] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,” in Interspeech 2019. ISCA, 2019, pp. 1526–1530.
- [63] S. Möller and T. H. Falk, “Quality prediction for synthesized speech: Comparison of approaches,” in International Conference on Acoustics, 2009, pp. 1168–1171.