Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models
Pith reviewed 2026-05-10 12:04 UTC · model grok-4.3
The pith
GatherMOS uses an LLM to combine acoustic descriptors and pseudo-labels for more accurate speech quality prediction than existing methods when labeled data is scarce.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By feeding an LLM heterogeneous acoustic descriptors together with pseudo-labels from DNSMOS and VQScore, GatherMOS enables the model to infer accurate perceptual MOS values. Zero-shot operation remains stable across conditions, while few-shot guidance produces large gains once support samples are chosen to match the test distribution. Experiments confirm that this aggregation strategy consistently surpasses both non-intrusive baselines and learning-based models trained under the same low-label regime on VoiceBank-DEMAND.
What carries the argument
GatherMOS, the LLM meta-evaluator that aggregates lightweight acoustic descriptors with DNSMOS and VQScore pseudo-labels to infer MOS.
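The load-bearing step is flattening heterogeneous signals into a single text prompt the LLM can reason over. A minimal sketch of what that assembly could look like; the descriptor names, prompt wording, and formatting are illustrative assumptions, not the authors' actual template.

```python
# Illustrative sketch only: field names and prompt wording are assumptions,
# not GatherMOS's actual template.

def build_prompt(descriptors, dnsmos, vqscore, support_examples=None):
    """Flatten acoustic descriptors and pseudo-labels into one text prompt."""
    lines = ["Rate the perceptual quality of this speech clip on a 1.0-5.0 MOS scale."]
    for ex in (support_examples or []):  # few-shot: prepend labeled examples
        lines.append(ex)
    lines.append("Acoustic descriptors: " +
                 ", ".join(f"{name}={value:.3f}" for name, value in descriptors.items()))
    lines.append(f"Pseudo-labels: DNSMOS={dnsmos:.2f}, VQScore={vqscore:.2f}")
    lines.append("Answer with a single number.")
    return "\n".join(lines)

prompt = build_prompt({"rms": 0.041, "zcr": 0.093, "duration_s": 3.2}, 3.21, 0.72)
```

In the zero-shot setup `support_examples` stays empty; the few-shot setup prepends matched labeled utterances rendered the same way.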
If this is right
- Few-shot performance improves markedly when support examples match the acoustic conditions of the test utterances.
- Zero-shot GatherMOS already exceeds several established non-intrusive metrics without any labeled examples.
- The approach reduces dependence on large human-labeled MOS collections for training quality predictors.
- LLM-based aggregation can be applied directly to new domains once suitable pseudo-label generators exist.
Where Pith is reading between the lines
- Similar meta-evaluation could be tested on music or environmental sound quality where pseudo-label generators are available.
- If LLM reasoning proves robust, human listener panels for routine MOS collection might be scaled down.
- Bias checks on the LLM outputs would be needed before deployment in regulatory or clinical speech applications.
Load-bearing premise
The LLM can reliably combine the supplied acoustic descriptors and pseudo-labels into accurate MOS predictions without hallucination or bias introduced by the in-context examples.
What would settle it
A decisive negative result: if, on a held-out test set, GatherMOS with matched few-shot examples produced a lower correlation with human MOS than DNSMOS alone, the central claim would be refuted.
read the original abstract
In this paper, we introduce GatherMOS, a novel framework that leverages large language models (LLM) as meta-evaluators to aggregate diverse signals into quality predictions. GatherMOS integrates lightweight acoustic descriptors with pseudo-labels from DNSMOS and VQScore, enabling the LLM to reason over heterogeneous inputs and infer perceptual mean opinion scores (MOS). We further explore both zero-shot and few-shot in-context learning setups, showing that zero-shot GatherMOS maintains stable performance across diverse conditions, while few-shot guidance yields large gains when support samples match the test conditions. Experiments on the VoiceBank-DEMAND dataset demonstrate that GatherMOS consistently outperforms DNSMOS, VQScore, naive score averaging, and even learning-based models such as CNN-BLSTM and MOS-SSL when trained under limited labeled-data conditions. These results highlight the potential of LLM-based aggregation as a practical strategy for non-intrusive speech quality evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GatherMOS, a framework that uses large language models as meta-evaluators to aggregate lightweight acoustic descriptors with pseudo-labels from DNSMOS and VQScore for inferring perceptual MOS in speech quality evaluation. It examines zero-shot and few-shot in-context learning, claiming stable zero-shot performance and large few-shot gains when support samples match test conditions. Experiments on VoiceBank-DEMAND are said to show GatherMOS outperforming DNSMOS, VQScore, naive averaging, and limited-data supervised models such as CNN-BLSTM and MOS-SSL.
Significance. If substantiated with quantitative evidence, the work would indicate that LLMs can usefully aggregate heterogeneous signals for non-intrusive speech quality assessment, potentially lowering the labeled-data requirements compared to traditional supervised approaches. The few-shot component with matched supports could offer practical gains in targeted scenarios, while the zero-shot variant provides a baseline for broader applicability; the approach explicitly builds on existing pseudo-labelers rather than replacing them.
major comments (3)
- [Abstract] Abstract: the claim that GatherMOS 'consistently outperforms' DNSMOS, VQScore, naive averaging, and limited-data models (CNN-BLSTM, MOS-SSL) is presented without any quantitative results, error bars, or methodology details, preventing assessment of the magnitude or reliability of the reported gains.
- [Abstract] Abstract: the few-shot gains are conditioned on 'support samples match the test conditions,' yet no mechanism is described for selecting or obtaining such matched supports without additional labeled data from the target distribution; this assumption is load-bearing for the central claim of advantage over baselines and risks making the few-shot setup oracle-assisted rather than generally applicable.
- [Method] Method (inferred from abstract description of integration): the LLM receives pseudo-labels from DNSMOS and VQScore as inputs; without an ablation isolating whether the LLM's reasoning adds value beyond a simple combination or regression on these pre-existing scores, it remains unclear if the meta-evaluation step is load-bearing or reduces to reweighting the supplied signals.
minor comments (3)
- [Abstract] The acronym GatherMOS should be defined on first use in the abstract.
- Include specific numerical tables (e.g., Pearson correlation, MSE) comparing all methods, plus details on the LLM variant, prompt templates, and number of in-context examples to support reproducibility.
- Clarify how acoustic descriptors are extracted and formatted for the LLM input to avoid ambiguity in the heterogeneous-signal aggregation step.
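The agreement metrics requested above are standard. A self-contained sketch of Pearson correlation and MSE between predicted and human MOS; the score lists are made-up placeholders, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                     sum((y - mean_y) ** 2 for y in ys))
    return cov / norm

def mse(pred, target):
    """Mean squared error between predicted and reference MOS."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

human     = [3.8, 2.1, 4.5, 3.0]   # placeholder listener MOS
predicted = [3.6, 2.4, 4.4, 3.2]   # placeholder system outputs
lcc, err = pearson(predicted, human), mse(predicted, human)
```

Spearman rank correlation (reference [25] in the bibliography below) is the usual companion metric and is just Pearson applied to the ranks of the scores.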
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to improve clarity and rigor.
read point-by-point responses
- Referee: [Abstract] Abstract: the claim that GatherMOS 'consistently outperforms' DNSMOS, VQScore, naive averaging, and limited-data models (CNN-BLSTM, MOS-SSL) is presented without any quantitative results, error bars, or methodology details, preventing assessment of the magnitude or reliability of the reported gains.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revision we will incorporate specific metrics (e.g., Pearson correlation and MSE values on VoiceBank-DEMAND) and a brief note on the evaluation protocol. Full tables with error bars from multiple runs and detailed methodology already appear in the experimental section; we will ensure the abstract points readers to these results. revision: yes
- Referee: [Abstract] Abstract: the few-shot gains are conditioned on 'support samples match the test conditions,' yet no mechanism is described for selecting or obtaining such matched supports without additional labeled data from the target distribution; this assumption is load-bearing for the central claim of advantage over baselines and risks making the few-shot setup oracle-assisted rather than generally applicable.
  Authors: The observation is correct: the reported few-shot gains presuppose access to matched support samples. The manuscript demonstrates the magnitude of improvement when such samples are available (as may occur in targeted deployment settings), but does not supply an automatic, label-free selection procedure. We will revise the abstract and discussion sections to state this assumption explicitly, qualify the scope of the claim, and outline practical acquisition strategies such as collecting a small matched validation set. This renders the contribution more precise without overstating generality. revision: partial
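One concrete acquisition strategy of the kind the authors mention, offered purely as our illustration rather than the paper's procedure: select supports whose acoustic descriptors lie nearest the test utterance's, so matching requires descriptor extraction but no new labels beyond a small pre-labeled pool.

```python
import math

def select_supports(test_descriptors, labeled_pool, k=3):
    """Pick the k pool items closest to the test clip in descriptor space.

    labeled_pool: list of (descriptor_dict, mos) pairs; descriptor dicts are
    assumed to share the keys of test_descriptors (illustrative assumption).
    """
    def distance(desc):
        return math.sqrt(sum((desc[key] - test_descriptors[key]) ** 2
                             for key in test_descriptors))
    return sorted(labeled_pool, key=lambda item: distance(item[0]))[:k]

# Hypothetical pool of labeled utterances and a test clip's descriptors.
pool = [({"rms": 0.04, "zcr": 0.09}, 3.5),
        ({"rms": 0.30, "zcr": 0.40}, 1.8),
        ({"rms": 0.05, "zcr": 0.10}, 3.7)]
supports = select_supports({"rms": 0.045, "zcr": 0.095}, pool, k=2)
```

Whether descriptor-space proximity is a good proxy for "matched test conditions" is exactly the open question the referee raises.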
- Referee: [Method] Method (inferred from abstract description of integration): the LLM receives pseudo-labels from DNSMOS and VQScore as inputs; without an ablation isolating whether the LLM's reasoning adds value beyond a simple combination or regression on these pre-existing scores, it remains unclear if the meta-evaluation step is load-bearing or reduces to reweighting the supplied signals.
  Authors: We acknowledge the value of a more targeted ablation. The current experiments already compare GatherMOS against naive averaging of the supplied pseudo-labels, showing consistent gains. To further isolate the contribution of LLM reasoning, we will add an ablation that replaces the LLM with a lightweight regression model (linear or small MLP) operating on the same acoustic descriptors and pseudo-labels. Results of this comparison will be included in the revised manuscript. revision: yes
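The proposed linear ablation reduces to an ordinary least-squares fit on the same inputs the LLM receives. A minimal sketch with synthetic placeholder data (the real ablation would use the paper's descriptors and data splits):

```python
import numpy as np

def fit_linear_baseline(features, mos):
    """Least-squares weights for MOS ~ features (bias column appended)."""
    X = np.hstack([features, np.ones((len(features), 1))])
    w, *_ = np.linalg.lstsq(X, mos, rcond=None)
    return w

def predict(w, features):
    X = np.hstack([features, np.ones((len(features), 1))])
    return X @ w

# Synthetic placeholders: columns are (DNSMOS, VQScore) per utterance.
X_train = np.array([[3.1, 0.6], [2.0, 0.4], [4.2, 0.8], [3.5, 0.7]])
y_train = np.array([3.2, 2.1, 4.3, 3.6])
w = fit_linear_baseline(X_train, y_train)
```

If this baseline matched GatherMOS on held-out data, the LLM step would indeed reduce to reweighting; a persistent gap would support the meta-evaluation claim.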
Circularity Check
No significant circularity; framework is empirical aggregation with external baselines
full rationale
The paper introduces GatherMOS as an LLM-based meta-evaluator that takes lightweight acoustic descriptors plus explicit pseudo-labels from DNSMOS and VQScore as inputs, then uses zero-shot or few-shot prompting to produce MOS predictions. Performance is evaluated empirically on VoiceBank-DEMAND against those same baselines plus averaging and supervised models, with the abstract stating outperformance over naive averaging. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The method is presented as an additional reasoning layer rather than a closed-form reduction to its inputs, and the few-shot matching requirement is a stated experimental condition rather than a definitional equivalence. The derivation chain is therefore self-contained against the external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- GatherMOS: no independent evidence
Reference graph
Works this paper leans on
- [1] Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models (internal anchor to this work's page; Pith/arXiv, 2026). Excerpt: "INTRODUCTION. Speech quality has long been an important indicator for evaluating various speech processing applications, including speech enhancement [1], hearing aid (HA) devices [2], and telecommunications [3]. While human-based evaluation remains the gold standard, it requires a sufficient number of listeners to obtain reliable and generalized scores..."
- [2] Zero-shot GatherMOS (Sec. 2.1). Excerpt: "Given an input waveform x ∈ R^T, we extract several acoustic descriptors that summarize temporal, spectral, and perceptual information. The RMS reflects the overall signal energy, while the ZCR indicates the level of noisiness or voicing. Duration and clipping detection are included to capture temporal length and ..."
- [3] Experimental setup (Sec. 3.1, 2024). Excerpt: "The proposed approaches are evaluated on the VoiceBank-DEMAND dataset [15], which is also included in the test evaluation of the VoiceMOS Challenge 2024 [16]. We select a test set of 200 utterances, consisting of clean speech, noisy speech corrupted by four noise types (car, white, music, and cafeteria) at an SNR of ..."
- [4] Conclusion. Excerpt: "In this work, we explored the use of generative AI for non-intrusive speech quality estimation and introduced GatherMOS, a framework that acts as a meta-evaluator by combining acoustic feature representations with pseudo-labels from DNSMOS and VQScore. By leveraging the reasoning capabilities of large language models, GatherMOS integrates these diverse signals to produce more reliable MOS predictions."
- [5] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2013.
- [6] J. M. Kates and K. H. Arehart, "The Hearing-Aid Speech Quality Index (HASQI) Version 2," Journal of the Audio Engineering Society, vol. 62, no. 3, pp. 99–117, 2014.
- [7] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl, "Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I: temporal alignment," Journal of the Audio Engineering Society, vol. 61, no. 6, pp. 366–384, 2013.
- [8] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, "MOSNet: Deep learning-based objective assessment for voice conversion," in Proc. Interspeech, 2019, pp. 1541–1545.
- [9] R. E. Zezario, S.-W. Fu, F. Chen, C.-S. Fuh, H.-M. Wang, and Y. Tsao, "Deep learning-based non-intrusive multi-objective speech assessment model with cross-domain features," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 54–70, 2023.
- [10] S.-W. Fu, K.-H. Hung, Y. Tsao, and Y.-C. F. Wang, "Self-supervised speech quality estimation and enhancement using only clean speech," in Proc. ICLR, 2024, pp. 1–22.
- [11] E. Cooper, W.-H. Huang, T. Toda, and J. Yamagishi, "Generalization ability of MOS prediction networks," in Proc. ICASSP, 2022, pp. 8442–8446.
- [12] S. Wang, W. Yu, Y. Yang, C. Tang, Y. Li, J. Zhuang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, and C. Zhang, "Enabling auditory large language models for automatic speech quality evaluation," in Proc. ICASSP, 2025, pp. 1–5.
- [13] C. Chen, Y. Hu, S. Wang, H. Wang, Z. Chen, C. Zhang, C.-H. H. Yang, and E. Chng, "Audio large language models can be descriptive speech quality evaluators," in Proc. ICLR, 2025, pp. 1–15.
- [14] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. NeurIPS, 2020, pp. 12449–12460.
- [15] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023, pp. 28492–28518.
- [16] M. Amiri, H. O. Shahreza, and I. Kodrasi, "Exploring in-context learning capabilities of ChatGPT for pathological speech detection," arXiv:2503.23873, 2025.
- [17] R. E. Zezario, S. M. Siniscalchi, H.-M. Wang, and Y. Tsao, "A study on zero-shot non-intrusive speech assessment using large language models," in Proc. ICASSP, 2025, pp. 1–5.
- [18] C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proc. ICASSP, 2021, pp. 6493–6497.
- [19] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in Proc. SSW, 2016, pp. 146–152.
- [20] W.-C. Huang, S.-W. Fu, E. Cooper, R. E. Zezario, T. Toda, H.-M. Wang, J. Yamagishi, and Y. Tsao, "The VoiceMOS Challenge 2024: Beyond speech quality prediction," in Proc. SLT, 2024, pp. 803–810.
- [21] K.-H. Hung, S.-W. Fu, H.-H. Tseng, H.-T. Chiang, Y. Tsao, and C.-W. Lin, "Boosting self-supervised embeddings for speech enhancement," in Proc. Interspeech, 2022, pp. 186–190.
- [22] Y.-X. Lu, Y. Ai, and Z.-H. Ling, "MP-SENet: A speech enhancement model with parallel denoising of magnitude and phase spectra," in Proc. Interspeech, 2023, pp. 3834–3838.
- [23] R. Cao, S. Abdulatif, and B. Yang, "CMGAN: Conformer-based metric GAN for speech enhancement," in Proc. Interspeech, 2022, pp. 936–940.
- [24] A. Défossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," in Proc. Interspeech, 2020, pp. 3291–3295.
- [25] C. Spearman, "The proof and measurement of association between two things," The American Journal of Psychology, vol. 15, no. 1, pp. 72–101, 1904.
- [26] S. Leglaive, L. Borne, E. Tzinis, M. Sadeghi, M. Fraticelli, S. Wisdom, M. Pariente, D. Pressnitzer, and J. R. Hershey, "The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement," in Proc. CHiME, 2023, pp. 1–7.
- [27] S. Leglaive, M. Fraticelli, H. ElGhazaly, L. Borne, M. Sadeghi, S. Wisdom, M. Pariente, J. R. Hershey, D. Pressnitzer, and J. P. Barker, "Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge," Computer Speech and Language, vol. 89, p. 101685, 2025.
discussion (0)