pith. machine review for the scientific record.

arxiv: 2605.00861 · v1 · submitted 2026-04-21 · 📡 eess.AS · cs.AI · eess.SP

Recognition: unknown

Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:02 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · eess.SP
keywords text-to-speech · TTS evaluation · voice mapping · cepstral peak prominence · acoustic metrics · voice quality · vocal effort · naturalness assessment

The pith

Voice range serves as the primary indicator of text-to-speech model capability while cepstral peak prominence values distinguish natural from robotic speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes voice mapping as an objective evaluation framework for text-to-speech quality that relies on acoustic metrics rather than listener judgments. It applies crest factor, spectrum balance, and cepstral peak prominence plus overall voice range to six models spanning Merlin to VITS. The analysis finds that voice range tracks model capability with VITS at the top, Glow-TTS handling soft phonation best despite narrower range, and specific CPPs thresholds separating natural from artificial output. A sympathetic reader would care because subjective listening tests are costly and variable, while repeatable metrics focused on vocal effort and dynamics could speed assessment and guide improvements in expressiveness.

Core claim

We show that voice mapping, built from crest factor, spectrum balance, cepstral peak prominence, and voice range, quantifies TTS quality and expressiveness. Across the six models, voice range emerges as the main marker of capability: VITS displays the largest range, Glow-TTS records higher spectrum balance indicating superior soft phonation, and CPPs values between 7 and 8 dB align with natural voice quality while values exceeding 10 dB correspond to robotic speech. These patterns underscore the value of voice mapping for capturing vocal effort and dynamic range in synthesized speech.

What carries the argument

Voice mapping as a metric-based evaluation framework that combines crest factor, spectrum balance, cepstral peak prominence (CPPs), and voice range to assess naturalness, vocal effort, and expressiveness in TTS outputs.
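Two of these metrics can be sketched directly from a waveform. The following is a minimal sketch, not the paper's pipeline (which is unspecified): the 2 kHz band split for spectrum balance is a common convention in voice-map work and is an assumption here.

```python
import numpy as np

def crest_factor_db(x):
    """Crest factor: ratio of peak amplitude to RMS level, in dB."""
    rms = np.sqrt(np.mean(x ** 2))
    return 20 * np.log10(np.max(np.abs(x)) / rms)

def spectrum_balance_db(x, sr, split_hz=2000.0):
    """Spectrum balance: high-band minus low-band spectral energy, in dB.
    The 2 kHz split point is an illustrative convention, not necessarily
    the paper's choice."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / sr)
    lo = spec[freqs < split_hz].sum()
    hi = spec[freqs >= split_hz].sum()
    return 10 * np.log10(hi / lo)

# Sanity check: a pure sine has a crest factor of sqrt(2), about 3.01 dB,
# and a low-frequency tone gives a strongly negative spectrum balance.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)
print(round(crest_factor_db(tone), 2))  # → 3.01
```

Higher crest factor tracks more dynamic, "peaky" phonation; a less negative spectrum balance indicates relatively more high-frequency energy, which the review associates with soft phonation performance.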

If this is right

  • VITS shows the largest voice range among the tested models, indicating strongest handling of vocal dynamics.
  • Glow-TTS achieves higher spectrum balance values, indicating better soft phonation despite its more limited range.
  • CPPs values between 7 and 8 dB mark natural voice quality, while values above 10 dB mark robotic speech.
  • Voice range functions as the leading single indicator of overall TTS model capability.
  • Objective voice mapping is required to evaluate vocal effort and expressiveness beyond what current metrics provide.
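If the CPPs thresholds hold, screening synthesized output reduces to a lookup. The sketch below encodes only the two bands the paper reports; the boundary handling and the fallback label for values between 8 and 10 dB are assumptions for illustration.

```python
def label_from_cpps(cpps_db):
    """Map a CPPs value (dB) onto the paper's reported quality bands.
    Only the 7-8 dB ("natural") and >10 dB ("robotic") ranges come from
    the paper; everything else here is an illustrative assumption."""
    if 7.0 <= cpps_db <= 8.0:
        return "natural"
    if cpps_db > 10.0:
        return "robotic"
    return "indeterminate"

print(label_from_cpps(7.5))   # → natural
print(label_from_cpps(11.2))  # → robotic
print(label_from_cpps(9.0))   # → indeterminate
```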

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same metric set could screen new TTS architectures for naturalness before any listener study is run.
  • Training objectives could be adjusted to target the 7-8 dB CPPs window identified for natural output.
  • The framework highlights design trade-offs, such as trading voice range for improved phonation control.
  • Direct comparison of the same metrics on human reference speech would test whether the thresholds generalize beyond the six models.

Load-bearing premise

That the three chosen acoustic metrics plus voice range fully capture perceived naturalness, vocal effort, and expressiveness without any subjective listening tests or comparison to human speech baselines.

What would settle it

A blind listening test in which human raters score the naturalness and expressiveness of the same TTS samples and the ratings fail to align with the reported CPPs thresholds or voice-range ordering.

read the original abstract

This study investigates voice mapping as an evaluation framework for text-to-speech (TTS) synthesis quality. The study analyzes six TTS models, including historical and recent ones. The metrics are crest factor, spectrum balance, and cepstral peak prominence (CPPs). We investigated 6 influential TTS models: Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, and VITS. The results demonstrate that voice range serves as a primary indicator of model capability, with VITS showing the largest range among tested models. Glow-TTS exhibited superior performance in soft phonation, indicated by higher spectrum balance, despite limited voice range. The results showed that the CPPs values between 7-8 dB indicate natural voice quality, while with CPPs exceeding 10 dB, the speech tends to sound robotic. These findings underscore the need for voice mapping to evaluate vocal effort, and capture how TTS systems handle voice dynamic and expressiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a voice-mapping evaluation framework for TTS synthesis that computes three acoustic metrics—crest factor, spectrum balance, and cepstral peak prominence (CPPs)—plus voice-range statistics on outputs from six systems (Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, VITS). It claims that voice range is the primary indicator of model capability (VITS largest), that Glow-TTS excels in soft phonation via spectrum balance, and that CPPs values of 7–8 dB correspond to natural voice quality while values >10 dB indicate robotic speech.

Significance. If the metric-to-perception mapping were validated, the approach would supply an objective, reference-free method for diagnosing vocal effort, dynamic range, and expressiveness in TTS, potentially reducing reliance on costly listening tests. The multi-model comparison across historical and modern architectures is a positive feature, but the absence of any anchoring data leaves the interpretive claims unsupported.

major comments (3)
  1. [Abstract / Results] Abstract and results section: the specific thresholds 'CPPs values between 7-8 dB indicate natural voice quality' and 'CPPs exceeding 10 dB, the speech tends to sound robotic' are stated without any accompanying measurements, statistical tests, error bars, raw data tables, or subjective listening tests that would anchor the numerical cutoffs to perceptual labels.
  2. [Abstract] Abstract: the assertion that 'voice range serves as a primary indicator of model capability' and that the three chosen metrics 'fully capture' vocal effort and expressiveness is presented without justification, ablation studies, or comparison against human-speech baselines, making the central interpretive claims rest on an untested assumption.
  3. [Abstract] Abstract: no description is given of how the metrics were computed (windowing, normalization, exact CPPs definition, or reference implementations), nor are any quantitative results, confidence intervals, or cross-model statistical comparisons supplied to support the ranking of VITS and Glow-TTS.
minor comments (2)
  1. [Abstract] The acronym CPPs is introduced without an explicit expansion on first use.
  2. [Results] The manuscript would benefit from a table summarizing the six models, their training data, and the exact metric values obtained for each.
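The referee's third major comment is about underspecified metric computation. To make the gap concrete, here is one minimal way CPPs can be computed; the window, the search band, and fitting the regression line only inside that band are simplifying assumptions (the standard definition fits over the whole quefrency range), not the paper's method.

```python
import numpy as np

def cpp_db(x, sr, f0_range=(60.0, 300.0)):
    """Minimal cepstral-peak-prominence sketch: height of the cepstral
    peak in the expected-F0 quefrency band above a straight-line fit.
    Windowing and band choices here are illustrative only."""
    n = len(x)
    w = x * np.hanning(n)
    log_spec = np.log10(np.abs(np.fft.fft(w)) ** 2 + 1e-12)
    cep = np.real(np.fft.ifft(log_spec))
    quef = np.arange(n) / sr  # quefrency in seconds
    band = (quef >= 1.0 / f0_range[1]) & (quef <= 1.0 / f0_range[0])
    # Linear "floor" fitted within the search band (simplification).
    slope, intercept = np.polyfit(quef[band], cep[band], 1)
    prominence = cep[band] - (slope * quef[band] + intercept)
    return 10.0 * float(np.max(prominence))  # log10-power units → dB

# Sanity check: a harmonic-rich 220 Hz signal should show a far higher
# cepstral peak prominence than white noise.
sr = 16000
t = np.arange(sr) / sr
harm = sum(np.sin(2 * np.pi * 220 * h * t) / h for h in range(1, 21))
noise = np.random.default_rng(1).standard_normal(sr)
print(round(cpp_db(harm, sr), 1), round(cpp_db(noise, sr), 1))
```

Every choice in this sketch (window length, normalization, quefrency band, regression span) shifts the resulting dB values, which is exactly why the referee asks for the paper's pipeline to be spelled out before its 7-8 dB and 10 dB cutoffs can be reproduced.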

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and results section: the specific thresholds 'CPPs values between 7-8 dB indicate natural voice quality' and 'CPPs exceeding 10 dB, the speech tends to sound robotic' are stated without any accompanying measurements, statistical tests, error bars, raw data tables, or subjective listening tests that would anchor the numerical cutoffs to perceptual labels.

    Authors: We acknowledge that the perceptual mapping of CPPs thresholds was presented without sufficient supporting data in the original submission. These ranges were derived from the distribution of computed values across the six TTS systems, where lower CPPs aligned with models producing more natural-sounding output in our internal checks. In the revision, we will add a table reporting mean CPPs values with standard deviations for each model, along with the underlying per-utterance data summary. We will also qualify the statements as observational findings based on the metric distributions rather than validated perceptual cutoffs, and cite relevant literature on CPPs in voice quality. No new listening tests will be added, as the study focuses on objective metrics. revision: partial

  2. Referee: [Abstract] Abstract: the assertion that 'voice range serves as a primary indicator of model capability' and that the three chosen metrics 'fully capture' vocal effort and expressiveness is presented without justification, ablation studies, or comparison against human-speech baselines, making the central interpretive claims rest on an untested assumption.

    Authors: The claim regarding voice range as a primary indicator stems from the comparative results, where VITS exhibited the widest range consistent with its established performance. We agree that the phrasing 'fully capture' is overstated and will revise the abstract and discussion to state that the metrics 'offer insights into' vocal effort and expressiveness. We will incorporate comparisons against human-speech baselines drawn from standard corpora (e.g., mean voice range and spectrum balance values from natural recordings) in the results section. Ablation studies on metric combinations are outside the current scope but will be noted as future work. revision: partial

  3. Referee: [Abstract] Abstract: no description is given of how the metrics were computed (windowing, normalization, exact CPPs definition, or reference implementations), nor are any quantitative results, confidence intervals, or cross-model statistical comparisons supplied to support the ranking of VITS and Glow-TTS.

    Authors: We will add a dedicated subsection in the methods describing the exact computation pipeline for each metric, including window length and overlap, normalization steps, the specific CPPs implementation (following the standard definition with 0.5 ms quefrency range), and references to open-source implementations. Quantitative results will be expanded into tables showing per-model means, standard deviations, and 95% confidence intervals, accompanied by statistical comparisons (ANOVA with Tukey post-hoc tests) to justify the reported rankings of VITS and Glow-TTS. revision: yes
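The statistical workflow the authors commit to can be sketched as follows. The per-model CPPs samples below are hypothetical numbers invented for illustration, not the paper's (unreported) measurements; only the omnibus ANOVA step is shown, with Tukey post-hoc noted in a comment.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Hypothetical per-utterance CPPs samples (dB) for three systems;
# the means and spreads are made up for illustration.
vits = rng.normal(7.5, 0.4, 30)
glow_tts = rng.normal(8.2, 0.4, 30)
merlin = rng.normal(10.8, 0.4, 30)

# One-way ANOVA across the model groups.
stat, p = f_oneway(vits, glow_tts, merlin)
print(f"F = {stat:.1f}, p = {p:.3g}")
# A significant omnibus F would then justify Tukey HSD pairwise
# comparisons (e.g. scipy.stats.tukey_hsd in SciPy >= 1.11, or
# statsmodels' pairwise_tukeyhsd) to support per-model rankings.
```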

Circularity Check

0 steps flagged

No circularity: empirical metric reporting with no derivation chain

full rationale

The paper is an observational study that computes three acoustic metrics (crest factor, spectrum balance, CPPs) on outputs from six TTS systems and reports ranges and thresholds. No equations, fitted parameters, predictions, or self-citations appear in the provided text. Interpretive statements equating CPPs ranges to 'natural' vs 'robotic' quality are direct observations from the data rather than reductions of any claimed derivation to its own inputs. This is self-contained empirical reporting against the chosen metrics; the absence of subjective listening tests is a validation weakness but not a circularity issue per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5473 in / 1140 out tokens · 59409 ms · 2026-05-10T01:02:10.943932+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 13 canonical work pages

  1. [1]

The Blizzard Challenge 2013

    King S, Karaiskos V. The Blizzard Challenge 2013. 2013

  2. [2]

    A Review on Subjective and Objective Evaluation of Synthetic Speech

    Cooper E, Huang W-C, Tsao Y, Wang H-m, Toda T, Yamagishi J. A Review on Subjective and Objective Evaluation of Synthetic Speech. Acoustical Science and Technology. 2024;45

  3. [3]

    Methods for Objective and Subjective Assessment of Quality

    Quality TT. Methods for Objective and Subjective Assessment of Quality. ITU-T Recommendation. 1996:830

  4. [4]

    The Limits of the Mean Opinion Score for Speech Synthesis Evaluation

    Le Maguer S, King S, Harte N. The Limits of the Mean Opinion Score for Speech Synthesis Evaluation. Computer Speech & Language. 2024;84:101577

  5. [5]

Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech

    Cooper E, Yamagishi J. Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech. 2023

  6. [6]

    Stuck in the Mos Pit: A Critical Analysis of Mos Test Methodology in Tts Evaluation

    Kirkland A, Mehta S, Lameris H, Henter GE, Székely E, Gustafson J. Stuck in the Mos Pit: A Critical Analysis of Mos Test Methodology in Tts Evaluation. 12th Speech Synthesis Workshop (SSW). 2023

  7. [7]

    Fastspeech 2: Fast and high-quality end-to-end text to speech,

    Ren Y, Hu C, Tan X, et al. Fastspeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv preprint arXiv:2006.04558. 2020

  8. [8]

    Natural Tts Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions

    Shen J, Pang R, Weiss RJ, et al. Natural Tts Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions. 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP): IEEE; 2018:4779-4783

  9. [9]

    Naturalspeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality

    Tan X, Chen J, Liu H, et al. Naturalspeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2024

  10. [10]

    Why We Should Report the Details in Subjective Evaluation of Tts More Rigorously

    Chiang C-H, Huang W-P, Lee H-y. Why We Should Report the Details in Subjective Evaluation of Tts More Rigorously. arXiv preprint arXiv:2306.02044. 2023

  11. [11]

    Deep Mos Predictor for Synthetic Speech Using Cluster- Based Modeling

    Choi Y, Jung Y, Kim H. Deep Mos Predictor for Synthetic Speech Using Cluster- Based Modeling. arXiv preprint arXiv:2008.03710. 2020

  12. [12]

    Mosnet: Deep Learning Based Objective Assessment for Voice Conversion

    Lo C-C, Fu S-W, Huang W-C, et al. Mosnet: Deep Learning Based Objective Assessment for Voice Conversion. arXiv preprint arXiv:1904.08352. 2019

  13. [13]

    Nisqa: A deep cnn- self-attention model for multidimensional speech quality prediction with crowdsourced datasets,

    Mittag G, Naderi B, Chehadi A, Möller S. Nisqa: A Deep Cnn-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets. arXiv preprint arXiv:2104.09494. 2021

  14. [14]

    Deepmos: Deep Posterior Mean- Opinion-Score of Speech

    Liang X, Cumlin F, Schüldt C, Chatterjee S. Deepmos: Deep Posterior Mean-Opinion-Score of Speech. Proceedings of INTERSPEECH. 2023:526-530

  15. [15]

    The Voicemos Challenge 2022

    Huang W-C, Cooper E, Tsao Y, Wang H-M, Toda T, Yamagishi J. The Voicemos Challenge 2022. arXiv preprint arXiv:2203.11389. 2022

  16. [16]

    Experimental Evaluation of Mos, Ab and Bws Listening Test Designs

    Wells D, Blanco ALA, Valentini-Botinhao C, et al. Experimental Evaluation of Mos, Ab and Bws Listening Test Designs. INTERSPEECH 2024: Speech and Beyond: International Speech Communication Association (ISCA); 2024

  17. [17]

    Mel-Cepstral Distance Measure for Objective Speech Quality Assessment

    Kubichek R. Mel-Cepstral Distance Measure for Objective Speech Quality Assessment. Proceedings of IEEE pacific rim conference on communications computers and signal processing. Vol 1: IEEE; 1993:125-128

  18. [18]

    Modified Estoi for Improving Speech Intelligibility Prediction

    Alghamdi A, Chan W-Y. Modified Estoi for Improving Speech Intelligibility Prediction. 2020 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE): IEEE; 2020:1-5

  19. [19]

    Perceptual Evaluation of Speech Quality (Pesq)-a New Method for Speech Quality Assessment of Telephone Networks and Codecs

    Rix AW, Beerends JG, Hollier MP, Hekstra AP. Perceptual Evaluation of Speech Quality (Pesq)-a New Method for Speech Quality Assessment of Telephone Networks and Codecs. 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221). Vol 2: IEEE; 2001:749-752

  20. [20]

    D4c, a Band-Aperiodicity Estimator for High-Quality Speech Synthesis

    Morise M. D4c, a Band-Aperiodicity Estimator for High-Quality Speech Synthesis. Speech Communication. 2016;84:57-65

  21. [21]

    Cepstrum-Based Pitch Detection Using a New Statistical V/Uv Classification Algorithm

    Ahmadi S, Spanias AS. Cepstrum-Based Pitch Detection Using a New Statistical V/Uv Classification Algorithm. IEEE Transactions on speech and audio processing. 1999;7:333-338

  22. [22]

    Analysis and Assessment of Controllability of an Expressive Deep Learning-Based Tts System

    Tits N, El Haddad K, Dutoit T. Analysis and Assessment of Controllability of an Expressive Deep Learning-Based Tts System. Informatics. Vol 8: MDPI; 2021:84

  23. [23]

    Relationship between Changes in Voice Pitch and Loudness

    Gramming P, Sundberg J, Ternström S, Leanderson R, Perkins WH. Relationship between Changes in Voice Pitch and Loudness. Journal of Voice. 1988;2:118-126

  24. [24]

    Loud Speech over Noise: Some Spectral Attributes, with Gender Differences

    Ternström S, Bohman M, Södersten M. Loud Speech over Noise: Some Spectral Attributes, with Gender Differences. The Journal of the Acoustical Society of America. 2006;119:1648-1665

  25. [25]

    Acoustic Effects of Variation in Vocal Effort by Men, Women, and Children

    Traunmüller H, Eriksson A. Acoustic Effects of Variation in Vocal Effort by Men, Women, and Children. The Journal of the Acoustical Society of America. 2000;107:3438-3451

  26. [26]

    Voice Maps as a Tool for Understanding and Dealing with Variability in the Voice

    Ternström S, Pabon P. Voice Maps as a Tool for Understanding and Dealing with Variability in the Voice. Applied Sciences. 2022;12

  27. [27]

    Acoustic Analysis, Electroglottography, and Voice Range Profile: A Measure for Outcome of Thyroplasty Type 1

    Agresti C, George E, Behrman A, Blumstein E. Acoustic Analysis, Electroglottography, and Voice Range Profile: A Measure for Outcome of Thyroplasty Type 1. Otolaryngology - Head and Neck Surgery. 1996;115:P116

  28. [28]

    The Phonetogram

    Damsté PH. The Phonetogram. Pract Otorhinolaryngol (Basel). 1970;32:185-187

  29. [29]

    Mapping Individual Voice Quality over the Voice Range : The Measurement Paradigm of the Voice Range Profile [Doctoral thesis, comprehensive summary]

    Pabon P. Mapping Individual Voice Quality over the Voice Range : The Measurement Paradigm of the Voice Range Profile [Doctoral thesis, comprehensive summary]. Stockholm: TRITA-EECS-AVL, KTH Royal Institute of Technology; 2018

  30. [30]

    Feature Maps of the Acoustic Spectrum of the Voice

    Pabon P, Ternström S. Feature Maps of the Acoustic Spectrum of the Voice. J Voice. 2020;34:161.e1-161.e26

  31. [31]

    Effects on Voice Quality of Thyroidectomy: A Qualitative and Quantitative Study Using Voice Maps

    Cai H, Ternström S, Chaffanjon P, Henrich Bernardoni N. Effects on Voice Quality of Thyroidectomy: A Qualitative and Quantitative Study Using Voice Maps. Journal of Voice. 2024

  32. [32]

    Effects of the Lung Volume on the Electroglottographic Waveform in Trained Female Singers

    Ternström S, D'Amario S, Selamtzis A. Effects of the Lung Volume on the Electroglottographic Waveform in Trained Female Singers. Journal of Voice. 2020;34:485.e1-485.e21

  33. [33]

    Quantifying the Cepstral Peak Prominence, a Measure of Dysphonia

    Heman-Ackah YD, Sataloff RT, Laureyns G, et al. Quantifying the Cepstral Peak Prominence, a Measure of Dysphonia. J Voice. 2014;28:783-788

  34. [34]

    A Model of Articulatory Dynamics and Control

    Coker CH. A Model of Articulatory Dynamics and Control. Proceedings of the IEEE. 1976;64

  35. [35]

    A Model of Articulatory Dynamics and Control

    Coker CH. A Model of Articulatory Dynamics and Control. Proceedings of the IEEE. 1976;64:452-460

  36. [36]

    Prospects for Articulatory Synthesis: A Position Paper

    Shadle CH, Damper RI. Prospects for Articulatory Synthesis: A Position Paper. 2002

  37. [37]

    Automatic Generation of Control Signals for a Parallel Formant Speech Synthesizer

    Seeviour P, Holmes J, Judd M. Automatic Generation of Control Signals for a Parallel Formant Speech Synthesizer. ICASSP'76. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol 1: IEEE; 1976:690-693

  38. [38]

    Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones

    Moulines E, Charpentier F. Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones. Speech communication. 1990;9:453-467

  39. [39]

    Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database

    Hunt AJ, Black AW. Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database. 1996 IEEE international conference on acoustics, speech, and signal processing conference proceedings. Vol 1: IEEE; 1996:373-376

  40. [40]

    Introduction to Digital Speech Processing

    Rabiner LR, Schafer RW. Introduction to Digital Speech Processing. Foundations and Trends® in Signal Processing. 2007;1:1-194

  41. [41]

    Simultaneous Modeling of Spectrum, Pitch and Duration in Hmm-Based Speech Synthesis

    Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T. Simultaneous Modeling of Spectrum, Pitch and Duration in Hmm-Based Speech Synthesis. Sixth European conference on speech communication and technology1999

  42. [42]

    Statistical Parametric Speech Synthesis

    Zen H, Tokuda K, Black AW. Statistical Parametric Speech Synthesis. speech communication. 2009;51:1039-1064

  43. [43]

    Kawahara H, Masuda-Katsuse I, de Cheveigné A. Restructuring Speech Representations Using a Pitch-Adaptive Time–Frequency Smoothing and an Instantaneous-Frequency-Based F0 Extraction: Possible Role of a Repetitive Structure in Sounds. Speech Communication. 1999;27:187-207

  44. [44]

    A Survey on Neural Speech Synthesis

    Tan X, Qin T, Soong F, Liu T-Y. A Survey on Neural Speech Synthesis. arXiv preprint arXiv:2106.15561. 2021

  45. [45]

    Statistical Parametric Speech Synthesis Using Deep Neural Networks

    Zen H, Senior A, Schuster M. Statistical Parametric Speech Synthesis Using Deep Neural Networks. 2013 ieee international conference on acoustics, speech and signal processing: IEEE; 2013:7962-7966

  46. [46]

    Merlin: An Open Source Neural Network Speech Synthesis System

    Wu Z, Watts O, King S. Merlin: An Open Source Neural Network Speech Synthesis System. 9th ISCA Speech Synthesis Workshop. 2016:202-207

  47. [47]

    Wavenet: A Generative Model for Raw Audio

    Oord A, Dieleman S, Zen H, et al. Wavenet: A Generative Model for Raw Audio. 2016

  48. [48]

    Tacotron: Towards end-to-end speech synthesis,

    Wang Y, Skerry-Ryan R, Stanton D, et al. Tacotron: Towards End-to-End Speech Synthesis. arXiv preprint arXiv:1703.10135. 2017

  49. [49]

    Neural Speech Synthesis with Transformer Network

    Li N, Liu S, Liu Y, Zhao S, Liu M. Neural Speech Synthesis with Transformer Network. Proceedings of the AAAI conference on artificial intelligence. Vol 33. 2019:6706-6713

  50. [50]

    Fastpitch: Parallel Text-to-Speech with Pitch Prediction

    Łańcucki A. Fastpitch: Parallel Text-to-Speech with Pitch Prediction. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): IEEE; 2021:6588-6592

  51. [51]

    Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

    Kim J, Kong J, Son J. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. International Conference on Machine Learning: PMLR; 2021:5530-5540

  52. [52]

    Prosody-Aware Speecht5 for Expressive Neural Tts

    Deng Y, Zhou L, Yi Y, Liu S, He L. Prosody-Aware Speecht5 for Expressive Neural Tts. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): IEEE; 2023:1-5

  53. [53]

    Naturalspeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality

    Tan X, Chen J, Liu H, et al. Naturalspeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2024;46:4234-4245

  54. [54]

    Naturalspeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

    Ju Z, Wang Y, Shen K, et al. Naturalspeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. arXiv preprint arXiv:2403.03100. 2024

  55. [55]

    F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

    Chen Y, Niu Z, Ma Z, et al. F5-Tts: A Fairytaler That Fakes Fluent and Faithful Speech with Flow Matching. arXiv preprint arXiv:2410.06885. 2024

  56. [56]

    Natural language guidance of high-fidelity text-to-speech with synthetic annotations

    Lyth D, King S. Natural Language Guidance of High-Fidelity Text-to-Speech with Synthetic Annotations. arXiv preprint arXiv:2402.01912. 2024

  57. [57]

The LJ Speech Dataset

    Ito K, Johnson L. The LJ Speech Dataset. 2017

  58. [58]

ESPnet2-TTS: Extending the Edge of TTS Research

    Hayashi T, Yamamoto R, Yoshimura T, et al. ESPnet2-TTS: Extending the Edge of TTS Research. arXiv preprint arXiv:2110.07840. 2021

  59. [59]

    Multi-Band Melgan: Faster Waveform Generation for High-Quality Text-to-Speech

    Yang G, Yang S, Liu K, Fang P, Chen W, Xie L. Multi-Band Melgan: Faster Waveform Generation for High-Quality Text-to-Speech. 2021 IEEE Spoken Language Technology Workshop (SLT): IEEE; 2021:492-498

  60. [60]

    Univnet: A Neural Vocoder with Multi- Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

    Jang W, Lim D, Yoon J, Kim B, Kim J. Univnet: A Neural Vocoder with Multi- Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation. arXiv preprint arXiv:2106.07889. 2021

  61. [61]

    Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women with Structural Dysphonia before and after Treatment

    Iob NA, He L, Ternström S, Cai H, Brockmann-Bauser M. Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women with Structural Dysphonia before and after Treatment. Journal of Speech, Language, and Hearing Research. 2023

  62. [62]

    Update 3.1 to Fonadyn : A System for Real-Time Analysis of the Electroglottogram, over the Voice Range

    Ternström S. Update 3.1 to Fonadyn: A System for Real-Time Analysis of the Electroglottogram, over the Voice Range. SoftwareX. 2024;26

  63. [63]

    Normalized Time-Domain Parameters for Electroglottographic Waveforms

    Ternström S. Normalized Time-Domain Parameters for Electroglottographic Waveforms. J Acoust Soc Am. 2019;146:EL65

  64. [64]

    Spectral-Cepstral Estimation of Dysphonia Severity: External Validation

    Awan SN, Solomon NP, Helou LB, Stojadinovic A. Spectral-Cepstral Estimation of Dysphonia Severity: External Validation. Ann Otol Rhinol Laryngol. 2013;122:40-48

  65. [65]

    Software Automatic Mouth (Sam)

    Barton M. Software Automatic Mouth (Sam). Los Angeles: Don't Ask Software; 1982

  66. [66]

    Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women with Structural Dysphonia before and after Treatment

    Iob NA, He L, Ternströ m S, Cai H, Brockmann-Bauser M. Effects of Speech Characteristics on Electroglottographic and Instrumental Acoustic Voice Analysis Metrics in Women with Structural Dysphonia before and after Treatment. Journal of Speech, Language, and Hearing Research. 2024;67:1660-1681