pith. machine review for the scientific record.

arxiv: 2605.00861 · v1 · submitted 2026-04-21 · 📡 eess.AS · cs.AI · eess.SP

Recognition: unknown

Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:02 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · eess.SP
keywords text-to-speech · TTS evaluation · voice mapping · cepstral peak prominence · acoustic metrics · voice quality · vocal effort · naturalness assessment

The pith

Voice range serves as the primary indicator of text-to-speech model capability while cepstral peak prominence values distinguish natural from robotic speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes voice mapping as an objective evaluation framework for text-to-speech quality that relies on acoustic metrics rather than listener judgments. It applies crest factor, spectrum balance, and cepstral peak prominence plus overall voice range to six models spanning Merlin to VITS. The analysis finds that voice range tracks model capability with VITS at the top, Glow-TTS handling soft phonation best despite narrower range, and specific CPPs thresholds separating natural from artificial output. A sympathetic reader would care because subjective listening tests are costly and variable, while repeatable metrics focused on vocal effort and dynamics could speed assessment and guide improvements in expressiveness.

Core claim

We show that voice mapping, built from crest factor, spectrum balance, cepstral peak prominence, and voice range, quantifies TTS quality and expressiveness. Across the six models, voice range emerges as the main marker of capability: VITS displays the largest range, Glow-TTS records higher spectrum balance indicating superior soft phonation, and CPPs values between 7 and 8 dB align with natural voice quality while values exceeding 10 dB correspond to robotic speech. These patterns underscore the value of voice mapping for capturing vocal effort and dynamic range in synthesized speech.

What carries the argument

Voice mapping as a metric-based evaluation framework that combines crest factor, spectrum balance, cepstral peak prominence (CPPs), and voice range to assess naturalness, vocal effort, and expressiveness in TTS outputs.
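Two of these metrics can be sketched directly from a waveform. The following is a minimal sketch, not the paper's pipeline (which is unspecified): the 2 kHz band split for spectrum balance is a common convention in voice-map work and is an assumption here.

```python
import numpy as np

def crest_factor_db(x):
    """Crest factor: ratio of peak amplitude to RMS level, in dB."""
    rms = np.sqrt(np.mean(x ** 2))
    return 20 * np.log10(np.max(np.abs(x)) / rms)

def spectrum_balance_db(x, sr, split_hz=2000.0):
    """Spectrum balance: high-band minus low-band spectral energy, in dB.
    The 2 kHz split point is an illustrative convention, not necessarily
    the paper's choice."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / sr)
    lo = spec[freqs < split_hz].sum()
    hi = spec[freqs >= split_hz].sum()
    return 10 * np.log10(hi / lo)

# Sanity check: a pure sine has a crest factor of sqrt(2), about 3.01 dB,
# and a low-frequency tone gives a strongly negative spectrum balance.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)
print(round(crest_factor_db(tone), 2))  # → 3.01
```

Higher crest factor tracks more dynamic, "peaky" phonation; a less negative spectrum balance indicates relatively more high-frequency energy, which the review associates with soft phonation performance.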

If this is right

  • VITS shows the largest voice range among the tested models, indicating strongest handling of vocal dynamics.
  • Glow-TTS achieves higher spectrum balance values, indicating better soft phonation despite its more limited range.
  • CPPs values between 7 and 8 dB mark natural voice quality, while values above 10 dB mark robotic speech.
  • Voice range functions as the leading single indicator of overall TTS model capability.
  • Objective voice mapping is required to evaluate vocal effort and expressiveness beyond what current metrics provide.
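If the CPPs thresholds hold, screening synthesized output reduces to a lookup. The sketch below encodes only the two bands the paper reports; the boundary handling and the fallback label for values between 8 and 10 dB are assumptions for illustration.

```python
def label_from_cpps(cpps_db):
    """Map a CPPs value (dB) onto the paper's reported quality bands.
    Only the 7-8 dB ("natural") and >10 dB ("robotic") ranges come from
    the paper; everything else here is an illustrative assumption."""
    if 7.0 <= cpps_db <= 8.0:
        return "natural"
    if cpps_db > 10.0:
        return "robotic"
    return "indeterminate"

print(label_from_cpps(7.5))   # → natural
print(label_from_cpps(11.2))  # → robotic
print(label_from_cpps(9.0))   # → indeterminate
```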

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same metric set could screen new TTS architectures for naturalness before any listener study is run.
  • Training objectives could be adjusted to target the 7-8 dB CPPs window identified for natural output.
  • The framework highlights design trade-offs, such as trading voice range for improved phonation control.
  • Direct comparison of the same metrics on human reference speech would test whether the thresholds generalize beyond the six models.

Load-bearing premise

That the three chosen acoustic metrics plus voice range fully capture perceived naturalness, vocal effort, and expressiveness without any subjective listening tests or comparison to human speech baselines.

What would settle it

A blind listening test in which human raters score the naturalness and expressiveness of the same TTS samples and the ratings fail to align with the reported CPPs thresholds or voice-range ordering.

read the original abstract

This study investigates voice mapping as an evaluation framework for text-to-speech (TTS) synthesis quality. The study analyzes six TTS models, including historical and recent ones. The metrics are crest factor, spectrum balance, and cepstral peak prominence (CPPs). We investigated 6 influential TTS models: Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, and VITS. The results demonstrate that voice range serves as a primary indicator of model capability, with VITS showing the largest range among tested models. Glow-TTS exhibited superior performance in soft phonation, indicated by higher spectrum balance, despite limited voice range. The results showed that the CPPs values between 7-8 dB indicate natural voice quality, while with CPPs exceeding 10 dB, the speech tends to sound robotic. These findings underscore the need for voice mapping to evaluate vocal effort, and capture how TTS systems handle voice dynamic and expressiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a voice-mapping evaluation framework for TTS synthesis that computes three acoustic metrics—crest factor, spectrum balance, and cepstral peak prominence (CPPs)—plus voice-range statistics on outputs from six systems (Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, VITS). It claims that voice range is the primary indicator of model capability (VITS largest), that Glow-TTS excels in soft phonation via spectrum balance, and that CPPs values of 7–8 dB correspond to natural voice quality while values >10 dB indicate robotic speech.

Significance. If the metric-to-perception mapping were validated, the approach would supply an objective, reference-free method for diagnosing vocal effort, dynamic range, and expressiveness in TTS, potentially reducing reliance on costly listening tests. The multi-model comparison across historical and modern architectures is a positive feature, but the absence of any anchoring data leaves the interpretive claims unsupported.

major comments (3)
  1. [Abstract / Results] Abstract and results section: the specific thresholds 'CPPs values between 7-8 dB indicate natural voice quality' and 'CPPs exceeding 10 dB, the speech tends to sound robotic' are stated without any accompanying measurements, statistical tests, error bars, raw data tables, or subjective listening tests that would anchor the numerical cutoffs to perceptual labels.
  2. [Abstract] Abstract: the assertion that 'voice range serves as a primary indicator of model capability' and that the three chosen metrics 'fully capture' vocal effort and expressiveness is presented without justification, ablation studies, or comparison against human-speech baselines, making the central interpretive claims rest on an untested assumption.
  3. [Abstract] Abstract: no description is given of how the metrics were computed (windowing, normalization, exact CPPs definition, or reference implementations), nor are any quantitative results, confidence intervals, or cross-model statistical comparisons supplied to support the ranking of VITS and Glow-TTS.
minor comments (2)
  1. [Abstract] The acronym CPPs is introduced without an explicit expansion on first use.
  2. [Results] The manuscript would benefit from a table summarizing the six models, their training data, and the exact metric values obtained for each.
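The referee's third major comment is about underspecified metric computation. To make the gap concrete, here is one minimal way CPPs can be computed; the window, the search band, and fitting the regression line only inside that band are simplifying assumptions (the standard definition fits over the whole quefrency range), not the paper's method.

```python
import numpy as np

def cpp_db(x, sr, f0_range=(60.0, 300.0)):
    """Minimal cepstral-peak-prominence sketch: height of the cepstral
    peak in the expected-F0 quefrency band above a straight-line fit.
    Windowing and band choices here are illustrative only."""
    n = len(x)
    w = x * np.hanning(n)
    log_spec = np.log10(np.abs(np.fft.fft(w)) ** 2 + 1e-12)
    cep = np.real(np.fft.ifft(log_spec))
    quef = np.arange(n) / sr  # quefrency in seconds
    band = (quef >= 1.0 / f0_range[1]) & (quef <= 1.0 / f0_range[0])
    # Linear "floor" fitted within the search band (simplification).
    slope, intercept = np.polyfit(quef[band], cep[band], 1)
    prominence = cep[band] - (slope * quef[band] + intercept)
    return 10.0 * float(np.max(prominence))  # log10-power units → dB

# Sanity check: a harmonic-rich 220 Hz signal should show a far higher
# cepstral peak prominence than white noise.
sr = 16000
t = np.arange(sr) / sr
harm = sum(np.sin(2 * np.pi * 220 * h * t) / h for h in range(1, 21))
noise = np.random.default_rng(1).standard_normal(sr)
print(round(cpp_db(harm, sr), 1), round(cpp_db(noise, sr), 1))
```

Every choice in this sketch (window length, normalization, quefrency band, regression span) shifts the resulting dB values, which is exactly why the referee asks for the paper's pipeline to be spelled out before its 7-8 dB and 10 dB cutoffs can be reproduced.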

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and results section: the specific thresholds 'CPPs values between 7-8 dB indicate natural voice quality' and 'CPPs exceeding 10 dB, the speech tends to sound robotic' are stated without any accompanying measurements, statistical tests, error bars, raw data tables, or subjective listening tests that would anchor the numerical cutoffs to perceptual labels.

    Authors: We acknowledge that the perceptual mapping of CPPs thresholds was presented without sufficient supporting data in the original submission. These ranges were derived from the distribution of computed values across the six TTS systems, where lower CPPs aligned with models producing more natural-sounding output in our internal checks. In the revision, we will add a table reporting mean CPPs values with standard deviations for each model, along with the underlying per-utterance data summary. We will also qualify the statements as observational findings based on the metric distributions rather than validated perceptual cutoffs, and cite relevant literature on CPPs in voice quality. No new listening tests will be added, as the study focuses on objective metrics. revision: partial

  2. Referee: [Abstract] Abstract: the assertion that 'voice range serves as a primary indicator of model capability' and that the three chosen metrics 'fully capture' vocal effort and expressiveness is presented without justification, ablation studies, or comparison against human-speech baselines, making the central interpretive claims rest on an untested assumption.

    Authors: The claim regarding voice range as a primary indicator stems from the comparative results, where VITS exhibited the widest range consistent with its established performance. We agree that the phrasing 'fully capture' is overstated and will revise the abstract and discussion to state that the metrics 'offer insights into' vocal effort and expressiveness. We will incorporate comparisons against human-speech baselines drawn from standard corpora (e.g., mean voice range and spectrum balance values from natural recordings) in the results section. Ablation studies on metric combinations are outside the current scope but will be noted as future work. revision: partial

  3. Referee: [Abstract] Abstract: no description is given of how the metrics were computed (windowing, normalization, exact CPPs definition, or reference implementations), nor are any quantitative results, confidence intervals, or cross-model statistical comparisons supplied to support the ranking of VITS and Glow-TTS.

    Authors: We will add a dedicated subsection in the methods describing the exact computation pipeline for each metric, including window length and overlap, normalization steps, the specific CPPs implementation (following the standard definition with 0.5 ms quefrency range), and references to open-source implementations. Quantitative results will be expanded into tables showing per-model means, standard deviations, and 95% confidence intervals, accompanied by statistical comparisons (ANOVA with Tukey post-hoc tests) to justify the reported rankings of VITS and Glow-TTS. revision: yes
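The statistical workflow the authors commit to can be sketched as follows. The per-model CPPs samples below are hypothetical numbers invented for illustration, not the paper's (unreported) measurements; only the omnibus ANOVA step is shown, with Tukey post-hoc noted in a comment.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Hypothetical per-utterance CPPs samples (dB) for three systems;
# the means and spreads are made up for illustration.
vits = rng.normal(7.5, 0.4, 30)
glow_tts = rng.normal(8.2, 0.4, 30)
merlin = rng.normal(10.8, 0.4, 30)

# One-way ANOVA across the model groups.
stat, p = f_oneway(vits, glow_tts, merlin)
print(f"F = {stat:.1f}, p = {p:.3g}")
# A significant omnibus F would then justify Tukey HSD pairwise
# comparisons (e.g. scipy.stats.tukey_hsd in SciPy >= 1.11, or
# statsmodels' pairwise_tukeyhsd) to support per-model rankings.
```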

Circularity Check

0 steps flagged

No circularity: empirical metric reporting with no derivation chain

full rationale

The paper is an observational study that computes three acoustic metrics (crest factor, spectrum balance, CPPs) on outputs from six TTS systems and reports ranges and thresholds. No equations, fitted parameters, predictions, or self-citations appear in the provided text. Interpretive statements equating CPPs ranges to 'natural' vs 'robotic' quality are direct observations from the data rather than reductions of any claimed derivation to its own inputs. This is self-contained empirical reporting against the chosen metrics; the absence of subjective listening tests is a validation weakness but not a circularity issue per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5473 in / 1140 out tokens · 59409 ms · 2026-05-10T01:02:10.943932+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 13 canonical work pages

  1. [1]

The Blizzard Challenge 2013

    King S, Karaiskos V. The Blizzard Challenge 2013. 2013

  2. [2]

    A Review on Subjective and Objective Evaluation of Synthetic Speech

    Cooper E, Huang W-C, Tsao Y, Wang H-m, Toda T, Yamagishi J. A Review on Subjective and Objective Evaluation of Synthetic Speech. Acoustical Science and Technology. 2024;45

  3. [3]

    Methods for Objective and Subjective Assessment of Quality

    Quality TT. Methods for Objective and Subjective Assessment of Quality. ITU-T Recommendation. 1996:830

  4. [4]

    The Limits of the Mean Opinion Score for Speech Synthesis Evaluation

    Le Maguer S, King S, Harte N. The Limits of the Mean Opinion Score for Speech Synthesis Evaluation. Computer Speech & Language. 2024;84:101577

  5. [5]

Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech

    Cooper E, Yamagishi J. Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech. 2023

  6. [6]

    Stuck in the Mos Pit: A Critical Analysis of Mos Test Methodology in Tts Evaluation

    Kirkland A, Mehta S, Lameris H, Henter GE, Székely E, Gustafson J. Stuck in the Mos Pit: A Critical Analysis of Mos Test Methodology in Tts Evaluation. 12th Speech Synthesis Workshop (SSW). 2023

  7. [7]

    Fastspeech 2: Fast and high-quality end-to-end text to speech,

    Ren Y, Hu C, Tan X, et al. Fastspeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv preprint arXiv:2006.04558. 2020

  8. [8]

    Natural Tts Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions

    Shen J, Pang R, Weiss RJ, et al. Natural Tts Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions. 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP): IEEE; 2018:4779-4783

  9. [9]

    Naturalspeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality

    Tan X, Chen J, Liu H, et al. Naturalspeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2024

  10. [10]

    Why We Should Report the Details in Subjective Evaluation of Tts More Rigorously

    Chiang C-H, Huang W-P, Lee H-y. Why We Should Report the Details in Subjective Evaluation of Tts More Rigorously. arXiv preprint arXiv:2306.02044. 2023

  11. [11]

    Deep Mos Predictor for Synthetic Speech Using Cluster- Based Modeling

    Choi Y, Jung Y, Kim H. Deep Mos Predictor for Synthetic Speech Using Cluster- Based Modeling. arXiv preprint arXiv:2008.03710. 2020

  12. [12]

    Mosnet: Deep Learning Based Objective Assessment for Voice Conversion

    Lo C-C, Fu S-W, Huang W-C, et al. Mosnet: Deep Learning Based Objective Assessment for Voice Conversion. arXiv preprint arXiv:1904.08352. 2019

  13. [13]

    Nisqa: A deep cnn- self-attention model for multidimensional speech quality prediction with crowdsourced datasets,

    Mittag G, Naderi B, Chehadi A, Möller S. Nisqa: A Deep Cnn-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets. arXiv preprint arXiv:2104.09494. 2021

  14. [14]

    Deepmos: Deep Posterior Mean- Opinion-Score of Speech

    Liang X, Cumlin F, Schüldt C, Chatterjee S. Deepmos: Deep Posterior Mean-Opinion-Score of Speech. Proceedings of INTERSPEECH. 2023:526-530

  15. [15]

    The Voicemos Challenge 2022

    Huang W-C, Cooper E, Tsao Y, Wang H-M, Toda T, Yamagishi J. The Voicemos Challenge 2022. arXiv preprint arXiv:2203.11389. 2022

  16. [16]

    Experimental Evaluation of Mos, Ab and Bws Listening Test Designs

    Wells D, Blanco ALA, Valentini-Botinhao C, et al. Experimental Evaluation of Mos, Ab and Bws Listening Test Designs. INTERSPEECH 2024: Speech and Beyond: International Speech Communication Association (ISCA); 2024

  17. [17]

    Mel-Cepstral Distance Measure for Objective Speech Quality Assessment

    Kubichek R. Mel-Cepstral Distance Measure for Objective Speech Quality Assessment. Proceedings of IEEE pacific rim conference on communications computers and signal processing. Vol 1: IEEE; 1993:125-128

  18. [18]

    Modified Estoi for Improving Speech Intelligibility Prediction

    Alghamdi A, Chan W-Y. Modified Estoi for Improving Speech Intelligibility Prediction. 2020 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE): IEEE; 2020:1-5

  19. [19]

    Perceptual Evaluation of Speech Quality (Pesq)-a New Method for Speech Quality Assessment of Telephone Networks and Codecs

    Rix AW, Beerends JG, Hollier MP, Hekstra AP. Perceptual Evaluation of Speech Quality (Pesq)-a New Method for Speech Quality Assessment of Telephone Networks and Codecs. 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221). Vol 2: IEEE; 2001:749-752

  20. [20]

    D4c, a Band-Aperiodicity Estimator for High-Quality Speech Synthesis

    Morise M. D4c, a Band-Aperiodicity Estimator for High-Quality Speech Synthesis. Speech Communication. 2016;84:57-65

  21. [21]

    Cepstrum-Based Pitch Detection Using a New Statistical V/Uv Classification Algorithm

    Ahmadi S, Spanias AS. Cepstrum-Based Pitch Detection Using a New Statistical V/Uv Classification Algorithm. IEEE Transactions on speech and audio processing. 1999;7:333-338

  22. [22]

    Analysis and Assessment of Controllability of an Expressive Deep Learning-Based Tts System

    Tits N, El Haddad K, Dutoit T. Analysis and Assessment of Controllability of an Expressive Deep Learning-Based Tts System. Informatics. Vol 8: MDPI; 2021:84

  23. [23]

    Relationship between Changes in Voice Pitch and Loudness

    Gramming P, Sundberg J, Ternström S, Leanderson R, Perkins WH. Relationship between Changes in Voice Pitch and Loudness. Journal of Voice. 1988;2:118-126

  24. [24]

    Loud Speech over Noise: Some Spectral Attributes, with Gender Differences

    Ternström S, Bohman M, Södersten M. Loud Speech over Noise: Some Spectral Attributes, with Gender Differences. The Journal of the Acoustical Society of America. 2006;119:1648-1665

  25. [25]

    Acoustic Effects of Variation in Vocal Effort by Men, Women, and Children

    Traunmüller H, Eriksson A. Acoustic Effects of Variation in Vocal Effort by Men, Women, and Children. The Journal of the Acoustical Society of America. 2000;107:3438-3451

  26. [26]

    Voice Maps as a Tool for Understanding and Dealing with Variability in the Voice

    Ternström S, Pabon P. Voice Maps as a Tool for Understanding and Dealing with Variability in the Voice. Applied Sciences. 2022;12

  27. [27]

    Acoustic Analysis, Electroglottography, and Voice Range Profile: A Measure for Outcome of Thyroplasty Type 1

    Agresti C, George E, Behrman A, Blumstein E. Acoustic Analysis, Electroglottography, and Voice Range Profile: A Measure for Outcome of Thyroplasty Type 1. Otolaryngology - Head and Neck Surgery. 1996;115:P116

  28. [28]

    The Phonetogram

    Damsté PH. The Phonetogram. Pract Otorhinolaryngol (Basel). 1970;32:185-187

  29. [29]

    Mapping Individual Voice Quality over the Voice Range : The Measurement Paradigm of the Voice Range Profile [Doctoral thesis, comprehensive summary]

    Pabon P. Mapping Individual Voice Quality over the Voice Range : The Measurement Paradigm of the Voice Range Profile [Doctoral thesis, comprehensive summary]. Stockholm: TRITA-EECS-AVL, KTH Royal Institute of Technology; 2018

  30. [30]

    Feature Maps of the Acoustic Spectrum of the Voice

    Pabon P, Ternström S. Feature Maps of the Acoustic Spectrum of the Voice. J Voice. 2020;34:161.e1-161.e26

  31. [31]

    Effects on Voice Quality of Thyroidectomy: A Qualitative and Quantitative Study Using Voice Maps

    Cai H, Ternström S, Chaffanjon P, Henrich Bernardoni N. Effects on Voice Quality of Thyroidectomy: A Qualitative and Quantitative Study Using Voice Maps. Journal of Voice. 2024

  32. [32]

    Effects of the Lung Volume on the Electroglottographic Waveform in Trained Female Singers

    Ternström S, D'Amario S, Selamtzis A. Effects of the Lung Volume on the Electroglottographic Waveform in Trained Female Singers. Journal of Voice. 2020;34:485.e1-485.e21

  33. [33]

    Quantifying the Cepstral Peak Prominence, a Measure of Dysphonia

    Heman-Ackah YD, Sataloff RT, Laureyns G, et al. Quantifying the Cepstral Peak Prominence, a Measure of Dysphonia. J Voice. 2014;28:783-788

  34. [34]

    A Model of Articulatory Dynamics and Control

    Coker CH. A Model of Articulatory Dynamics and Control. Proceedings of the IEEE. 1976;64

  35. [35]

    A Model of Articulatory Dynamics and Control

    Coker CH. A Model of Articulatory Dynamics and Control. Proceedings of the IEEE. 1976;64:452-460

  36. [36]

    Prospects for Articulatory Synthesis: A Position Paper

    Shadle CH, Damper RI. Prospects for Articulatory Synthesis: A Position Paper. 2002

  37. [37]

    Automatic Generation of Control Signals for a Parallel Formant Speech Synthesizer

    Seeviour P, Holmes J, Judd M. Automatic Generation of Control Signals for a Parallel Formant Speech Synthesizer. ICASSP'76. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol 1: IEEE; 1976:690-693

  38. [38]

    Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones

    Moulines E, Charpentier F. Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones. Speech communication. 1990;9:453-467

  39. [39]

    Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database

    Hunt AJ, Black AW. Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database. 1996 IEEE international conference on acoustics, speech, and signal processing conference proceedings. Vol 1: IEEE; 1996:373-376

  40. [40]

    Introduction to Digital Speech Processing

    Rabiner LR, Schafer RW. Introduction to Digital Speech Processing. Foundations and Trends® in Signal Processing. 2007;1:1-194

  41. [41]

    Simultaneous Modeling of Spectrum, Pitch and Duration in Hmm-Based Speech Synthesis

    Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T. Simultaneous Modeling of Spectrum, Pitch and Duration in Hmm-Based Speech Synthesis. Sixth European conference on speech communication and technology1999

  42. [42]

    Statistical Parametric Speech Synthesis

    Zen H, Tokuda K, Black AW. Statistical Parametric Speech Synthesis. speech communication. 2009;51:1039-1064

  43. [43]

    Kawahara H, Masuda-Katsuse I, de Cheveigné A. Restructuring Speech Representations Using a Pitch-Adaptive Time–Frequency Smoothing and an Instantaneous-Frequency-Based F0 Extraction: Possible Role of a Repetitive Structure in Sounds. Speech Communication. 1999;27:187-207

  44. [44]

    A Survey on Neural Speech Synthesis

    Tan X, Qin T, Soong F, Liu T-Y. A Survey on Neural Speech Synthesis. arXiv preprint arXiv:2106.15561. 2021

  45. [45]

    Statistical Parametric Speech Synthesis Using Deep Neural Networks

    Zen H, Senior A, Schuster M. Statistical Parametric Speech Synthesis Using Deep Neural Networks. 2013 ieee international conference on acoustics, speech and signal processing: IEEE; 2013:7962-7966

  46. [46]

    Merlin: An Open Source Neural Network Speech Synthesis System

    Wu Z, Watts O, King S. Merlin: An Open Source Neural Network Speech Synthesis System. 9th ISCA Speech Synthesis Workshop. 2016:202-207

  47. [47]

    Wavenet: A Generative Model for Raw Audio

    Oord A, Dieleman S, Zen H, et al. Wavenet: A Generative Model for Raw Audio. 2016

  48. [48]

    Tacotron: Towards end-to-end speech synthesis,

    Wang Y, Skerry-Ryan R, Stanton D, et al. Tacotron: Towards End-to-End Speech Synthesis. arXiv preprint arXiv:1703.10135. 2017

  49. [49]

    Neural Speech Synthesis with Transformer Network

    Li N, Liu S, Liu Y, Zhao S, Liu M. Neural Speech Synthesis with Transformer Network. Proceedings of the AAAI conference on artificial intelligence. Vol 33. 2019:6706-6713

  50. [50]

    Fastpitch: Parallel Text-to-Speech with Pitch Prediction

    Łańcucki A. Fastpitch: Parallel Text-to-Speech with Pitch Prediction. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): IEEE; 2021:6588-6592

  51. [51]

    Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

    Kim J, Kong J, Son J. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. International Conference on Machine Learning: PMLR; 2021:5530-5540

  52. [52]

    Prosody-Aware Speecht5 for Expressive Neural Tts

    Deng Y, Zhou L, Yi Y, Liu S, He L. Prosody-Aware Speecht5 for Expressive Neural Tts. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): IEEE; 2023:1-5

  53. [53]

    Naturalspeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality

    Tan X, Chen J, Liu H, et al. Naturalspeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2024;46:4234-4245

  54. [54]

    Naturalspeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

    Ju Z, Wang Y, Shen K, et al. Naturalspeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. arXiv preprint arXiv:2403.03100. 2024

  55. [55]

    F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

    Chen Y, Niu Z, Ma Z, et al. F5-Tts: A Fairytaler That Fakes Fluent and Faithful Speech with Flow Matching. arXiv preprint arXiv:2410.06885. 2024

  56. [56]

    Natural language guidance of high-fidelity text-to-speech with synthetic annotations

    Lyth D, King S. Natural Language Guidance of High-Fidelity Text-to-Speech with Synthetic Annotations. arXiv preprint arXiv:2402.01912. 2024

  57. [57]

The LJ Speech Dataset

    Ito K, Johnson L. The LJ Speech Dataset. 2017

  58. [58]

ESPnet2-TTS: Extending the Edge of TTS Research

    Hayashi T, Yamamoto R, Yoshimura T, et al. ESPnet2-TTS: Extending the Edge of TTS Research. arXiv preprint arXiv:2110.07840. 2021

  59. [59]

    Multi-Band Melgan: Faster Waveform Generation for High-Quality Text-to-Speech

    Yang G, Yang S, Liu K, Fang P, Chen W, Xie L. Multi-Band Melgan: Faster Waveform Generation for High-Quality Text-to-Speech. 2021 IEEE Spoken Language Technology Workshop (SLT): IEEE; 2021:492-498

  60. [60]

    Univnet: A Neural Vocoder with Multi- Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

    Jang W, Lim D, Yoon J, Kim B, Kim J. Univnet: A Neural Vocoder with Multi- Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation. arXiv preprint arXiv:2106.07889. 2021

  61. [61]

    Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women with Structural Dysphonia before and after Treatment

    Iob NA, He L, Ternström S, Cai H, Brockmann-Bauser M. Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women with Structural Dysphonia before and after Treatment. Journal of Speech, Language, and Hearing Research. 2023

  62. [62]

    Update 3.1 to Fonadyn : A System for Real-Time Analysis of the Electroglottogram, over the Voice Range

    Ternström S. Update 3.1 to Fonadyn: A System for Real-Time Analysis of the Electroglottogram, over the Voice Range. SoftwareX. 2024;26

  63. [63]

    Normalized Time-Domain Parameters for Electroglottographic Waveforms

    Ternström S. Normalized Time-Domain Parameters for Electroglottographic Waveforms. J Acoust Soc Am. 2019;146:EL65

  64. [64]

    Spectral-Cepstral Estimation of Dysphonia Severity: External Validation

    Awan SN, Solomon NP, Helou LB, Stojadinovic A. Spectral-Cepstral Estimation of Dysphonia Severity: External Validation. Ann Otol Rhinol Laryngol. 2013;122:40-48

  65. [65]

    Software Automatic Mouth (Sam)

    Barton M. Software Automatic Mouth (Sam). Los Angeles: Don't Ask Software; 1982

  66. [66]

    Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women with Structural Dysphonia before and after Treatment

    Iob NA, He L, Ternströ m S, Cai H, Brockmann-Bauser M. Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women with Structural Dysphonia before and after Treatment. Journal of Speech, Language, and Hearing Research. 2024;67:1660-1681