Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

Masato Takagi; Masaya Kawamura; Reo Shimizu; Yuma Shirahata

arXiv: 2606.19951 · v1 · pith:LDABRGJHnew · submitted 2026-06-18 · 📡 eess.AS · cs.CL · cs.LG · cs.SD

Pith reviewed 2026-06-26 15:54 UTC · model grok-4.3

classification 📡 eess.AS · cs.CL · cs.LG · cs.SD

keywords MOS prediction · speech quality assessment · prosodic errors · acoustic degradation · human-model discrepancies · TTS evaluation · fundamental frequency · speaking rate

The pith

Speech quality models track acoustic degradation but ignore prosodic errors and speaking rate changes that lower human scores, while overreacting to mean pitch shifts humans ignore.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests popular MOS prediction models against human listeners by applying controlled changes to speech samples. It introduces acoustic degradations, prosodic mistakes, and alterations to speaker traits such as average pitch and speaking rate, then compares the resulting scores. Most models align with humans when acoustics degrade, yet none respond to prosodic errors even when humans assign much lower ratings. Models also show a mismatch on speaker traits, reacting strongly to mean fundamental frequency shifts that leave human ratings unchanged while missing the effects of rate and pitch variation that humans detect.

Core claim

Most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

What carries the argument

Controlled perturbations of acoustic degradation, prosodic errors, and speaker-specific characteristics (mean F0, speaking rate, F0 variability) applied to speech samples to compare human and model MOS responses.
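
As a rough illustration of the acoustic-degradation class, the sketch below applies two signal-level distortions (hard clipping and additive white noise at a target SNR) to a toy sine tone. The function names, the clipping threshold, the 10 dB target, and the synthetic tone are all assumptions made for illustration; the paper's actual perturbation settings are not stated on this page.

```python
import math
import random


def hard_clip(signal, threshold):
    """Limit every sample to the range [-threshold, +threshold]."""
    return [max(-threshold, min(threshold, s)) for s in signal]


def add_white_noise(signal, target_snr_db, seed=0):
    """Add Gaussian noise scaled so the result has the requested SNR."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose the gain so that signal_power / (gain^2 * noise_power) hits the target.
    gain = math.sqrt(signal_power / (noise_power * 10 ** (target_snr_db / 10)))
    return [s + gain * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, degraded):
    """SNR of `degraded` relative to `clean`, in dB."""
    signal_energy = sum(c * c for c in clean)
    error_energy = sum((d - c) ** 2 for c, d in zip(clean, degraded))
    return 10 * math.log10(signal_energy / error_energy)


# Toy input: one second of a 220 Hz tone at 16 kHz (a stand-in for real speech).
SAMPLE_RATE = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

noisy = add_white_noise(tone, target_snr_db=10.0)
clipped = hard_clip(tone, threshold=0.25)

print(f"measured SNR of noisy tone: {measured_snr_db(tone, noisy):.2f} dB")
print(f"peak after clipping: {max(abs(s) for s in clipped):.2f}")
```

Scaling the noise from its realized power, rather than its nominal variance, makes the achieved SNR match the target for every utterance, which keeps a degradation level comparable across samples.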

If this is right

Scalar MOS models remain reliable proxies for acoustic fidelity checks in TTS evaluation.
Prosodic quality in generated speech cannot be assessed reliably with current models alone.
Speaker trait modeling in quality predictors requires adjustment to remove mean F0 bias and capture rate and variability effects.
TTS systems optimized using these models may overlook prosody and temporal issues that affect perceived quality.
The observed double dissociation indicates distinct model limitations for pitch versus temporal speech features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models might be improved by adding explicit training signals for prosodic and rate features rather than relying on scalar scores alone.
The same perturbation method could be applied to other audio tasks such as music or environmental sound assessment to check for similar gaps.
Real-world TTS outputs may show different patterns than the clean perturbed samples used here, suggesting a need for follow-up tests on generated speech.
Multi-dimensional quality metrics that separate acoustic, prosodic, and speaker dimensions could address the single-score limitations identified.

Load-bearing premise

The controlled perturbations isolate the targeted perceptual dimensions without introducing confounding effects that would invalidate direct human-model comparisons.

What would settle it

A model whose predicted scores drop in response to prosodic errors at rates matching the human score drops, or that shows no change with mean F0 shifts while decreasing with speaking rate changes.
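
One way to make this criterion concrete is to compare the score drop a model predicts with the drop human listeners report for the same perturbation. The sketch below is a minimal version of that comparison; the helper names `score_drop` and `drop_ratio` are hypothetical, and the numbers are placeholders that do not come from the paper.

```python
from statistics import mean


def score_drop(reference_scores, perturbed_scores):
    """Mean score lost when moving from reference to perturbed samples."""
    return mean(reference_scores) - mean(perturbed_scores)


def drop_ratio(human_ref, human_pert, model_ref, model_pert):
    """Model drop divided by human drop (1.0 = matched, 0.0 = insensitive)."""
    human_drop = score_drop(human_ref, human_pert)
    if human_drop == 0:
        raise ValueError("human scores did not change; ratio is undefined")
    return score_drop(model_ref, model_pert) / human_drop


# Placeholder numbers for illustration only; they are NOT results from the paper.
human_ref, human_pert = [4.5, 4.0, 4.5], [3.0, 2.5, 3.5]
model_ref, model_pert = [4.0, 4.0, 4.0], [4.0, 3.9, 4.1]

ratio = drop_ratio(human_ref, human_pert, model_ref, model_pert)
print(f"model/human drop ratio: {ratio:.2f}")
```

A ratio near 1 for prosodic-error conditions would indicate a model that tracks the human response; a ratio near 0 corresponds to the insensitivity the paper reports.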

Figures

Figures reproduced from arXiv: 2606.19951 by Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata.

**Figure 1.** Scatter plots of MOS against speaker characteristics (mean log F0, speaking rate) for Groups C-1–C-3. MOS showed moderate associations with speaking rate and log F0 variability, whereas most models displayed strong correlations with mean log F0, a trend largely absent in human ratings. These results indicate that MOS prediction models do not replicate the perceptual structure underlying human judgments o…
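
The associations described in this caption can be quantified with a correlation coefficient between a per-speaker characteristic and the corresponding score. The sketch below computes Pearson's r with the standard library; the per-speaker values are invented placeholders, not data from the paper, and the page does not say which correlation statistic the authors used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-speaker values for illustration only; NOT data from the paper.
mean_f0_hz = [110.0, 140.0, 190.0, 230.0, 260.0]
mean_log_f0 = [math.log(f) for f in mean_f0_hz]
human_mos = [4.1, 3.9, 4.2, 4.0, 4.1]
model_mos = [3.2, 3.5, 4.0, 4.3, 4.6]

print(f"human MOS vs mean log F0: r = {pearson_r(mean_log_f0, human_mos):+.2f}")
print(f"model MOS vs mean log F0: r = {pearson_r(mean_log_f0, model_mos):+.2f}")
```

With these toy values the model scores rise steadily with mean log F0 while the human scores do not, which is the shape of the mismatch the caption describes.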

Original abstract

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MOS models track acoustics but miss prosody, with a double dissociation on speaker traits, though perturbation orthogonality is unverified.

The main thing to know is that this paper reports MOS models following human judgments on acoustic degradation but showing complete insensitivity to prosodic errors that humans rate much lower, plus a double dissociation where models bias toward mean F0 changes that humans ignore while missing speaking rate and F0 variability that humans notice.

The work does a reasonable job defining three perturbation classes and running the human-model comparison. The targeted breakdown into acoustic, prosodic, and speaker-characteristic changes produces specific observations rather than vague claims about model gaps, and the double dissociation on speaker traits is a clean empirical point that prior work on MOS limitations has not emphasized in this way.

The soft spots are real but not fatal. The abstract supplies no evidence that prosodic or rate perturbations were checked for side effects on spectral envelope, energy, or other low-level features the models already track. If those side effects exist, the claimed insensitivity to prosody cannot be isolated from acoustic cues. Sample sizes, exact models tested, statistical tests, and raw score distributions are also absent, so the strength of the reported gaps cannot be assessed yet. The assumption that the three perturbation types are orthogonal is load-bearing and unaddressed.

This is for TTS researchers who rely on automated MOS predictors or want to improve them. A reader focused on metric blind spots would get usable observations if the methods section confirms the perturbation controls.

I would bring the full paper to a reading group to examine the implementation details. It is not yet something I would cite, but the empirical direction is worth referee time to verify the orthogonality checks and statistics.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates discrepancies between human mean opinion scores (MOS) and predictions from speech quality assessment models by applying three classes of controlled perturbations to speech samples: acoustic degradation, prosodic errors, and manipulations of speaker characteristics (mean F0, speaking rate, F0 variability). Human listeners and multiple models rate the perturbed samples; results indicate that most models track acoustic degradation, all models are insensitive to prosodic errors (despite large human score drops), and models show a double dissociation on speaker traits (strong mean-F0 bias absent in humans, but insensitivity to rate and variability that humans detect). The abstract frames this as evidence that scalar MOS models are limited beyond acoustic fidelity.

Significance. If the perturbation classes are shown to be orthogonal, the work supplies concrete, falsifiable evidence of model limitations that could inform better TTS evaluation metrics. The empirical design (human-model comparison under matched conditions) is a positive feature; however, the absence of reported orthogonality checks in the provided abstract leaves the attribution of insensitivity to prosody unverified.

major comments (2)

[Methods] Methods (perturbation implementation): the claim that prosodic-error insensitivity is the source of the human-model gap requires evidence that prosodic manipulations (F0 contour, duration) leave low-level acoustic features (spectral envelope, energy, SNR) unchanged. No such verification (e.g., objective acoustic metric deltas before/after perturbation) is described; without it the observed gap could be an artifact of unintended acoustic side-effects that models are known to track.
[Results] Results (speaker-characteristic section): the double-dissociation claim (models biased by mean F0 but insensitive to rate/variability) is load-bearing for the paper's central narrative, yet the abstract supplies no listener-pool size, number of models, statistical tests, or effect-size reporting that would allow assessment of whether the human-model divergence is reliable or merely descriptive.

minor comments (2)

[Abstract] Abstract: the phrase 'most models track acoustic degradation well' is vague; specify how many models were tested and what quantitative criterion defines 'well'.
Notation: 'prosodic errors' and 'speaker-specific characteristics' should be defined with explicit parameter ranges or perturbation magnitudes in the main text to allow replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the strength of our claims. We respond to each major point below.

Referee: [Methods] Methods (perturbation implementation): the claim that prosodic-error insensitivity is the source of the human-model gap requires evidence that prosodic manipulations (F0 contour, duration) leave low-level acoustic features (spectral envelope, energy, SNR) unchanged. No such verification (e.g., objective acoustic metric deltas before/after perturbation) is described; without it the observed gap could be an artifact of unintended acoustic side-effects that models are known to track.

Authors: We agree that explicit verification of acoustic-feature invariance is needed to support attribution of the gap specifically to prosody. The perturbations were generated with standard, targeted signal-processing methods (e.g., Praat-based F0 and duration editing) intended to isolate prosodic dimensions, yet we did not include quantitative before/after acoustic-metric comparisons in the submitted manuscript. In revision we will add these checks (spectral-envelope distance, energy, SNR deltas) for the prosodic-error conditions to confirm orthogonality.
Revision: yes

Referee: [Results] Results (speaker-characteristic section): the double-dissociation claim (models biased by mean F0 but insensitive to rate/variability) is load-bearing for the paper's central narrative, yet the abstract supplies no listener-pool size, number of models, statistical tests, or effect-size reporting that would allow assessment of whether the human-model divergence is reliable or merely descriptive.

Authors: Listener-pool size, model count, statistical tests, and effect sizes are reported in full in Sections 3.2 (human listening test) and 4–5 (model evaluations and statistical analysis). The abstract is a concise summary and therefore omits these numbers; the empirical support for the double dissociation is contained in the body of the manuscript. If the editor prefers, we can add a sentence to the abstract summarizing the key quantitative details.
Revision: partial
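
The before/after acoustic checks discussed in the first response could look roughly like the following sketch, which reports an energy delta and a log-spectral distance between an original and a perturbed frame. It is an illustration under stated assumptions rather than the authors' procedure: the function names are hypothetical, a plain magnitude spectrum stands in for a spectral envelope, the two frames are assumed to be time-aligned (which holds for F0-only edits but not for duration edits), and the toy perturbation is a simple gain change.

```python
import cmath
import math


def rms_db(signal):
    """Root-mean-square level in dB (floored to avoid log of zero)."""
    rms = math.sqrt(sum(s * s for s in signal) / len(signal))
    return 20 * math.log10(max(rms, 1e-12))


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def log_spectral_distance_db(frame_a, frame_b, eps=1e-8):
    """RMS difference of the two log-magnitude spectra, in dB."""
    spec_a = magnitude_spectrum(frame_a)
    spec_b = magnitude_spectrum(frame_b)
    sq = [(20 * math.log10((a + eps) / (b + eps))) ** 2 for a, b in zip(spec_a, spec_b)]
    return math.sqrt(sum(sq) / len(sq))


# Toy frames (stand-ins for time-aligned speech frames before/after editing).
N = 128
original = [0.5 * math.sin(2 * math.pi * 8 * t / N) for t in range(N)]
perturbed = [0.5 * s for s in original]  # pure gain change, for illustration

energy_delta = rms_db(perturbed) - rms_db(original)
lsd = log_spectral_distance_db(original, perturbed)
print(f"energy delta: {energy_delta:.2f} dB, log-spectral distance: {lsd:.2f} dB")
```

Deltas close to zero on measures like these would support the claim that a prosodic manipulation left the low-level acoustics untouched; large deltas would point to the confound the referee describes.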

Circularity Check

0 steps flagged

Empirical comparison study; no derivations or fitted predictions present

The paper performs controlled perturbations on speech (acoustic degradation, prosodic errors, speaker characteristics) and directly compares human MOS ratings against model outputs. No equations, derivations, parameter fittings, or self-citation chains are described that would reduce any claim to its own inputs by construction. The central findings rest on empirical differences observed in the data, not on any self-referential prediction or ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the experimental validity of the perturbations and the assumption that the tested models and human raters are representative of broader MOS prediction and perception.

axioms (1)

domain assumption: Controlled perturbations can independently manipulate acoustic quality, prosody, and speaker characteristics without unintended interactions.
Required to attribute rating differences specifically to each factor rather than confounds.

Reference graph

Works this paper leans on

42 extracted references · 7 canonical work pages · 5 internal anchors

[1]

Introduction Modern text-to-speech (TTS) systems [1–3] have reached a level of quality that narrows the gap between synthesized and natural speech. As coarse acoustic artifacts become less prevalent, the quality differentiators increasingly lie in fine-grained aspects such as prosodic naturalness, accentuation accuracy, and speaker-specific characterist...
[2]

Hypotheses We formulate three hypotheses on how MOS prediction models and human listeners differ in their sensitivity to three quality dimensions: acoustic degradation, prosodic errors, and speaker characteristics. H1: Comparable sensitivity to acoustic degradation. We hypothesize that MOS prediction models and human listeners will show similar sensitiv...
[3]

Experiments 3.1. Experimental Design To verify the hypotheses presented in Section 2, we designed three groups of evaluation samples, each targeting a different quality dimension: • Group A (Acoustic Degradation): natural speech with controlled signal-level distortions (clipping, additive noise, low-bitrate MP3 compression), for testing H1. • Group B (Pr...
[4]

Across conditions, model predictions exhibited systematic divergence from human perceptual patterns

Time stretching reduces within-segment spectral variation, which these CNNs likely interpret as higher quality. Across conditions, model predictions exhibited systematic divergence from human perceptual patterns. While human [figure residue: panel "C-1 (Natural) mean F0", x-axis "Mean F0 (Hz) per speaker", y-axis "Score"; next axis label truncated: "Mean spe..."]
[5]

Collectively, these findings demonstrate that current MOS prediction models do not replicate the perceptual structure underlying human quality judgments

Conclusion Under controlled acoustic-prosodic perturbations, MOS prediction models reliably tracked signal-level acoustic degradations but were insensitive to linguistically meaningful prosodic errors and exhibited systematically misaligned sensitivity to speaker-related characteristics. Collectively, these findings demonstrate that current MOS predict...
[6]

All authors have reviewed the final version and take full responsibility for the scientific content, experimental design, results, and conclusions

Generative AI Use Disclosure Generative AI tools were used to assist in English language editing of the manuscript. All authors have reviewed the final version and take full responsibility for the scientific content, experimental design, results, and conclusions
[7]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

C. Wang et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023

2023
[8]

NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Z. Ju et al., “NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models,” in Proc. ICML, 2024, pp. 22605–22623

2024
[9]

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proc. ACL, 2025, pp. 6255–6271

2025
[10]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Z. Du et al., “CosyVoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024

2024
[11]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

P. Anastassiou et al., “Seed-TTS: A family of high-quality versatile speech generation models,” arXiv preprint arXiv:2406.02430, 2024

2024
[12]

Recommendation P.800: Methods for subjective determination of transmission quality

ITU-T, “Recommendation P.800: Methods for subjective determination of transmission quality,” International Telecommunication Union, Tech. Rep., 1996

1996
[13]

A review on subjective and objective evaluation of synthetic speech

E. Cooper, W.-C. Huang, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “A review on subjective and objective evaluation of synthetic speech,” Acoustical Science and Technology, vol. 45, no. 4, pp. 161–183, 2024

2024
[14]

Generalization ability of MOS prediction networks

E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of MOS prediction networks,” in Proc. ICASSP, 2022, pp. 8442–8446

2022
[15]

UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022

T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4521–4525

2022
[16]

The VoiceMOS Challenge 2024: Beyond speech quality prediction

W.-C. Huang, S.-W. Fu, E. Cooper, R. E. Zezario, T. Toda, H.-M. Wang, J. Yamagishi, and Y. Tsao, “The VoiceMOS Challenge 2024: Beyond speech quality prediction,” in Proc. IEEE SLT, 2024, pp. 803–810

2024
[17]

The limits of the mean opinion score for speech synthesis evaluation

S. Le Maguer, S. King, and N. Harte, “The limits of the mean opinion score for speech synthesis evaluation,” Computer Speech & Language, vol. 84, p. 101577, 2024

2024
[18]

Uni-VERSA: Versatile speech assessment with a unified network

J. Shi, H.-J. Shim, and S. Watanabe, “Uni-VERSA: Versatile speech assessment with a unified network,” in Proc. Interspeech, 2025

2025
[19]

TTSDS2: Robust objective evaluation for human-quality synthetic speech

C. Minixhofer, O. Klejch, and P. Bell, “TTSDS2: Robust objective evaluation for human-quality synthetic speech,” in Proc. 13th SSW, 2025

2025
[20]

Investigating content-aware neural text-to-speech MOS prediction using prosodic and linguistic features

A. Vioni, G. Maniati, N. Ellinas, J. S. Sung, I. Hwang, A. Chalamandaris, and P. Tsialoulis, “Investigating content-aware neural text-to-speech MOS prediction using prosodic and linguistic features,” in Proc. ICASSP, 2023, pp. 1–5

2023
[21]

Measuring prosody diversity in zero-shot TTS: A new metric, benchmark, and exploration

Y. Yang, B. Han, H. Wang, L. Zhou, W. Wang, M. Cui, X. Tan, and X. Chen, “Measuring prosody diversity in zero-shot TTS: A new metric, benchmark, and exploration,” arXiv preprint arXiv:2509.19928, 2025

2025
[22]

Pitch-and-Spectrum-Aware singing quality assessment with bias correction and model fusion

Y.-F. Shi, Y. Ai, Y.-X. Lu, H.-P. Du, and Z.-H. Ling, “Pitch-and-Spectrum-Aware singing quality assessment with bias correction and model fusion,” in Proc. IEEE SLT, 2024

2024
[23]

Adaptive end-to-end text-to-speech synthesis based on error correction feedback from humans

K. Fujii, Y. Saito, and H. Saruwatari, “Adaptive end-to-end text-to-speech synthesis based on error correction feedback from humans,” in Proc. APSIPA ASC, 2022

2022
[24]

Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis

J. Williams, J. Rownicka, P. Oplustil, and S. King, “Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis,” in The Speaker and Language Recognition Workshop, 2020, pp. 222–229

2020
[25]

NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets

G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Proc. Interspeech, 2021, pp. 2127–2131

2021
[26]

DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

C. K. A. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proc. ICASSP, 2021, pp. 6493–6497

2021
[27]

JSUT and JVS: Free Japanese voice corpora for accelerating speech synthesis research

S. Takamichi, R. Sonobe, K. Mitsui, Y. Saito, T. Koriyama, N. Tanji, and H. Saruwatari, “JSUT and JVS: Free Japanese voice corpora for accelerating speech synthesis research,” Acoustical Science and Technology, vol. 41, no. 5, pp. 761–768, 2020

2020
[28]

NANSY++: Unified voice synthesis with neural analysis and synthesis

H.-S. Choi, J. Yang, J. Lee, and H. Kim, “NANSY++: Unified voice synthesis with neural analysis and synthesis,” in Proc. ICLR, 2023

2023
[29]

A unified accent estimation method based on multi-task learning for Japanese text-to-speech

B. Park, R. Yamamoto, and K. Tachibana, “A unified accent estimation method based on multi-task learning for Japanese text-to-speech,” in Proc. Interspeech, 2022, pp. 1931–1935

2022
[30]

Source-Filter HiFi-GAN: Fast and pitch controllable high-fidelity neural vocoder

R. Yoneyama, Y.-C. Wu, and T. Toda, “Source-Filter HiFi-GAN: Fast and pitch controllable high-fidelity neural vocoder,” in Proc. ICASSP, 2023

2023
[31]

WORLD: A vocoder-based high-quality speech synthesis system for real-time applications

M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. E99-D, no. 7, pp. 1877–1884, 2016

2016
[32]

VERSA: A versatile evaluation toolkit for speech, audio, and music

J. Shi et al., “VERSA: A versatile evaluation toolkit for speech, audio, and music,” in Proc. NAACL-HLT (System Demonstrations), 2025

2025
[33]

ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech

J. Shi et al., “ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech,” in Proc. IEEE SLT, 2024, pp. 562–569

2024
[34]

SHEET: A multi-purpose open-source speech human evaluation estimation toolkit

W.-C. Huang, E. Cooper, and T. Toda, “SHEET: A multi-purpose open-source speech human evaluation estimation toolkit,” in Proc. Interspeech, 2025

2025
[35]

WavLM: Large-scale self-supervised pre-training for full stack speech processing

S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022
[36]

MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

W.-C. Huang, E. Cooper, and T. Toda, “MOS-Bench: Benchmarking generalization abilities of subjective speech quality assessment models,” arXiv preprint arXiv:2411.03715, 2024

2024
[37]

Data for the VoiceMOS Challenge 2022

E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Data for the VoiceMOS Challenge 2022,” 2022

2022
[38]

SingMOS: An extensive open-source singing voice dataset for MOS prediction

Y. Tang, J. Shi, Y. Wu, and Q. Jin, “SingMOS: An extensive open-source singing voice dataset for MOS prediction,” arXiv preprint arXiv:2406.10911, 2024

2024
[39]

wav2vec 2.0: A framework for self-supervised learning of speech representations

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, vol. 33, 2020, pp. 12449–12460

2020
[40]

The T05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech

K. Baba, W. Nakata, Y. Saito, and H. Saruwatari, “The T05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,” in Proc. IEEE SLT, 2024

2024
[41]

EfficientNetV2: Smaller models and faster training

M. Tan and Q. Le, “EfficientNetV2: Smaller models and faster training,” in Proc. ICML, 2021, pp. 10096–10106

2021
[42]

Analytic study of text-free speech synthesis for raw audio using a self-supervised learning model

J. Park, D. Saito, and N. Minematsu, “Analytic study of text-free speech synthesis for raw audio using a self-supervised learning model,” in Proc. APSIPA ASC, 2024, pp. 1–6

2024

[1] [1]

Introduction Modern text-to-speech (TTS) systems [1–3] have reached a level of quality that narrows the gap between synthesized and natural speech. As coarse acoustic artifacts become less preva- lent, the quality differentiators increasingly lie in fine-grained aspects such as prosodic naturalness, accentuation accuracy, and speaker-specific characterist...

[2] [2]

Hypotheses We formulate three hypotheses on how MOS prediction models and human listeners differ in their sensitivity to three quality dimensions: acoustic degradation, prosodic errors, and speaker- characteristics. H1: Comparable sensitivity to acoustic degradation.We hy- pothesize that MOS prediction models and human listeners will show similar sensitiv...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Experiments 3.1. Experimental Design To verify the hypotheses presented in Section 2, we designed three groups of evaluation samples, each targeting a different quality dimension: •Group A (Acoustic Degradation): natural speech with con- trolled signal-level distortions (clipping, additive noise, low- bitrate MP3 compression), for testing H1. •Group B (Pr...

[4] [4]

Across conditions, model predictions exhibited systematic divergence from human perceptual patterns

Time stretching reduces within-segment spectral variation, which these CNNs likely interpret as higher quality. Across conditions, model predictions exhibited systematic divergence from human perceptual patterns. While human 100 150 200 250 300 Mean F0 (Hz) per speaker 2.0 2.5 3.0 3.5 4.0 4.5 5.0Score C-1 (Natural) mean F0 5.5 6.0 6.5 7.0 7.5 8.0 Mean spe...

[5] [5]

Collectively, these findings demonstrate that current MOS prediction models do not replicate the per- ceptual structure underlying human quality judgments

Conclusion Under controlled acoustic-prosodic perturbations, MOS predic- tion models reliably tracked signal-level acoustic degradations but were insensitive to linguistically meaningful prosodic errors and exhibited systematically misaligned sensitivity to speaker- related characteristics. Collectively, these findings demonstrate that current MOS predict...

[6] [6]

All authors have reviewed the final ver- sion and take full responsibility for the scientific content, exper- imental design, results, and conclusions

Generative AI Use Disclosure Generative AI tools were used to assist in English language edit- ing of the manuscript. All authors have reviewed the final ver- sion and take full responsibility for the scientific content, exper- imental design, results, and conclusions

[7] [7]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

C. Wanget al., “Neural codec language models are zero-shot text to speech synthesizers,”arXiv preprint arXiv:2301.02111, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models,

Z. Juet al., “NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models,” inProc. ICML, 2024, pp. 22 605–22 623

2024

[9] [9]

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

Y . Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” inProc. ACL, 2025, pp. 6255–6271

2025

[10] [10]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Z. Duet al., “CosyV oice 2: Scalable streaming speech synthesis with large language models,”arXiv preprint arXiv:2412.10117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

P. Anastassiouet al., “Seed-TTS: A family of high-quality versa- tile speech generation models,”arXiv preprint arXiv:2406.02430, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Recommendation P.800: Methods for subjective determi- nation of transmission quality,

ITU-T, “Recommendation P.800: Methods for subjective determi- nation of transmission quality,” International Telecommunication Union, Tech. Rep., 1996

1996

[13] [13]

A review on subjective and objective evaluation of syn- thetic speech,

E. Cooper, W.-C. Huang, Y . Tsao, H.-M. Wang, T. Toda, and J. Ya- magishi, “A review on subjective and objective evaluation of syn- thetic speech,”Acoustical Science and Technology, vol. 45, no. 4, pp. 161–183, 2024

2024

[14] [14]

Generaliza- tion ability of MOS prediction networks,

E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generaliza- tion ability of MOS prediction networks,” inProc. ICASSP, 2022, pp. 8442–8446

2022

[15] [15]

UTMOS: UTokyo-SaruLab system for V oice- MOS Challenge 2022,

T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for V oice- MOS Challenge 2022,” inProc. Interspeech, 2022, pp. 4521– 4525

2022

[16] [16]

The V oiceMOS Challenge 2024: Beyond speech quality prediction,

W.-C. Huang, S.-W. Fu, E. Cooper, R. E. Zezario, T. Toda, H.- M. Wang, J. Yamagishi, and Y . Tsao, “The V oiceMOS Challenge 2024: Beyond speech quality prediction,” inProc. IEEE SLT, 2024, pp. 803–810

2024

[17] [17]

The limits of the mean opinion score for speech synthesis evaluation,

S. Le Maguer, S. King, and N. Harte, “The limits of the mean opinion score for speech synthesis evaluation,”Computer Speech & Language, vol. 84, p. 101577, 2024

2024

[18] J. Shi, H.-J. Shim, and S. Watanabe, “Uni-VERSA: Versatile speech assessment with a unified network,” in Proc. Interspeech, 2025

[19] C. Minixhofer, O. Klejch, and P. Bell, “TTSDS2: Robust objective evaluation for human-quality synthetic speech,” in Proc. 13th SSW, 2025

[20] A. Vioni, G. Maniati, N. Ellinas, J. S. Sung, I. Hwang, A. Chalamandaris, and P. Tsiakoulis, “Investigating content-aware neural text-to-speech MOS prediction using prosodic and linguistic features,” in Proc. ICASSP, 2023, pp. 1–5

[21] Y. Yang, B. Han, H. Wang, L. Zhou, W. Wang, M. Cui, X. Tan, and X. Chen, “Measuring prosody diversity in zero-shot TTS: A new metric, benchmark, and exploration,” arXiv preprint arXiv:2509.19928, 2025

[22] Y.-F. Shi, Y. Ai, Y.-X. Lu, H.-P. Du, and Z.-H. Ling, “Pitch-and-Spectrum-Aware singing quality assessment with bias correction and model fusion,” in Proc. IEEE SLT, 2024

[23] K. Fujii, Y. Saito, and H. Saruwatari, “Adaptive end-to-end text-to-speech synthesis based on error correction feedback from humans,” in Proc. APSIPA ASC, 2022

[24] J. Williams, J. Rownicka, P. Oplustil, and S. King, “Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis,” in The Speaker and Language Recognition Workshop, 2020, pp. 222–229

[25] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Proc. Interspeech, 2021, pp. 2127–2131

[26] C. K. A. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proc. ICASSP, 2021, pp. 6493–6497

[27] S. Takamichi, R. Sonobe, K. Mitsui, Y. Saito, T. Koriyama, N. Tanji, and H. Saruwatari, “JSUT and JVS: Free Japanese voice corpora for accelerating speech synthesis research,” Acoustical Science and Technology, vol. 41, no. 5, pp. 761–768, 2020

[28] H.-S. Choi, J. Yang, J. Lee, and H. Kim, “NANSY++: Unified voice synthesis with neural analysis and synthesis,” in Proc. ICLR, 2023

[29] B. Park, R. Yamamoto, and K. Tachibana, “A unified accent estimation method based on multi-task learning for Japanese text-to-speech,” in Proc. Interspeech, 2022, pp. 1931–1935

[30] R. Yoneyama, Y.-C. Wu, and T. Toda, “Source-Filter HiFi-GAN: Fast and pitch controllable high-fidelity neural vocoder,” in Proc. ICASSP, 2023

[31] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. E99-D, no. 7, pp. 1877–1884, 2016

[32] J. Shi et al., “VERSA: A versatile evaluation toolkit for speech, audio, and music,” in Proc. NAACL-HLT (System Demonstrations), 2025

[33] J. Shi et al., “ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech,” in Proc. IEEE SLT, 2024, pp. 562–569

[34] W.-C. Huang, E. Cooper, and T. Toda, “SHEET: A multi-purpose open-source speech human evaluation estimation toolkit,” in Proc. Interspeech, 2025

[35] S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

[36] W.-C. Huang, E. Cooper, and T. Toda, “MOS-Bench: Benchmarking generalization abilities of subjective speech quality assessment models,” arXiv preprint arXiv:2411.03715, 2024

[37] E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Data for the VoiceMOS Challenge 2022,” 2022

[38] Y. Tang, J. Shi, Y. Wu, and Q. Jin, “SingMOS: An extensive open-source singing voice dataset for MOS prediction,” arXiv preprint arXiv:2406.10911, 2024

[39] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, vol. 33, 2020, pp. 12 449–12 460

[40] K. Baba, W. Nakata, Y. Saito, and H. Saruwatari, “The T05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,” in Proc. IEEE SLT, 2024

[41] M. Tan and Q. Le, “EfficientNetV2: Smaller models and faster training,” in Proc. ICML, 2021, pp. 10 096–10 106

[42] J. Park, D. Saito, and N. Minematsu, “Analytic study of text-free speech synthesis for raw audio using a self-supervised learning model,” in Proc. APSIPA ASC, 2024, pp. 1–6