Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models
Pith reviewed 2026-05-10 12:04 UTC · model grok-4.3
The pith
GatherMOS uses an LLM to combine acoustic descriptors and pseudo-labels for more accurate speech quality prediction than existing methods when labeled data is scarce.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By feeding an LLM heterogeneous acoustic descriptors together with pseudo-labels from DNSMOS and VQScore, GatherMOS enables the model to infer accurate perceptual MOS values. Zero-shot operation remains stable across conditions, while few-shot guidance produces large gains once support samples are chosen to match the test distribution. Experiments confirm that this aggregation strategy consistently surpasses both non-intrusive baselines and learning-based models trained under the same low-label regime on VoiceBank-DEMAND.
What carries the argument
GatherMOS, the LLM meta-evaluator that aggregates lightweight acoustic descriptors with DNSMOS and VQScore pseudo-labels to infer MOS.
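The load-bearing step is flattening heterogeneous signals into a single text prompt the LLM can reason over. A minimal sketch of what that assembly could look like; the descriptor names, prompt wording, and formatting are illustrative assumptions, not the authors' actual template.

```python
# Illustrative sketch only: field names and prompt wording are assumptions,
# not GatherMOS's actual template.

def build_prompt(descriptors, dnsmos, vqscore, support_examples=None):
    """Flatten acoustic descriptors and pseudo-labels into one text prompt."""
    lines = ["Rate the perceptual quality of this speech clip on a 1.0-5.0 MOS scale."]
    for ex in (support_examples or []):  # few-shot: prepend labeled examples
        lines.append(ex)
    lines.append("Acoustic descriptors: " +
                 ", ".join(f"{name}={value:.3f}" for name, value in descriptors.items()))
    lines.append(f"Pseudo-labels: DNSMOS={dnsmos:.2f}, VQScore={vqscore:.2f}")
    lines.append("Answer with a single number.")
    return "\n".join(lines)

prompt = build_prompt({"rms": 0.041, "zcr": 0.093, "duration_s": 3.2}, 3.21, 0.72)
```

In the zero-shot setup `support_examples` stays empty; the few-shot setup prepends matched labeled utterances rendered the same way.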
If this is right
- Few-shot performance improves markedly when support examples match the acoustic conditions of the test utterances.
- Zero-shot GatherMOS already exceeds several established non-intrusive metrics without any labeled examples.
- The approach reduces dependence on large human-labeled MOS collections for training quality predictors.
- LLM-based aggregation can be applied directly to new domains once suitable pseudo-label generators exist.
Where Pith is reading between the lines
- Similar meta-evaluation could be tested on music or environmental sound quality where pseudo-label generators are available.
- If LLM reasoning proves robust, human listener panels for routine MOS collection might be scaled down.
- Bias checks on the LLM outputs would be needed before deployment in regulatory or clinical speech applications.
Load-bearing premise
The LLM can reliably combine the supplied acoustic descriptors and pseudo-labels into accurate MOS predictions without hallucination or bias introduced by the in-context examples.
What would settle it
A decisive negative result: if, on a held-out test set, GatherMOS with matched few-shot examples produced a lower correlation with human MOS than DNSMOS alone, the central claim would be refuted.
read the original abstract
In this paper, we introduce GatherMOS, a novel framework that leverages large language models (LLM) as meta-evaluators to aggregate diverse signals into quality predictions. GatherMOS integrates lightweight acoustic descriptors with pseudo-labels from DNSMOS and VQScore, enabling the LLM to reason over heterogeneous inputs and infer perceptual mean opinion scores (MOS). We further explore both zero-shot and few-shot in-context learning setups, showing that zero-shot GatherMOS maintains stable performance across diverse conditions, while few-shot guidance yields large gains when support samples match the test conditions. Experiments on the VoiceBank-DEMAND dataset demonstrate that GatherMOS consistently outperforms DNSMOS, VQScore, naive score averaging, and even learning-based models such as CNN-BLSTM and MOS-SSL when trained under limited labeled-data conditions. These results highlight the potential of LLM-based aggregation as a practical strategy for non-intrusive speech quality evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GatherMOS, a framework that uses large language models as meta-evaluators to aggregate lightweight acoustic descriptors with pseudo-labels from DNSMOS and VQScore for inferring perceptual MOS in speech quality evaluation. It examines zero-shot and few-shot in-context learning, claiming stable zero-shot performance and large few-shot gains when support samples match test conditions. Experiments on VoiceBank-DEMAND are said to show GatherMOS outperforming DNSMOS, VQScore, naive averaging, and limited-data supervised models such as CNN-BLSTM and MOS-SSL.
Significance. If substantiated with quantitative evidence, the work would indicate that LLMs can usefully aggregate heterogeneous signals for non-intrusive speech quality assessment, potentially lowering the labeled-data requirements compared to traditional supervised approaches. The few-shot component with matched supports could offer practical gains in targeted scenarios, while the zero-shot variant provides a baseline for broader applicability; the approach explicitly builds on existing pseudo-labelers rather than replacing them.
major comments (3)
- [Abstract] Abstract: the claim that GatherMOS 'consistently outperforms' DNSMOS, VQScore, naive averaging, and limited-data models (CNN-BLSTM, MOS-SSL) is presented without any quantitative results, error bars, or methodology details, preventing assessment of the magnitude or reliability of the reported gains.
- [Abstract] Abstract: the few-shot gains are conditioned on 'support samples match the test conditions,' yet no mechanism is described for selecting or obtaining such matched supports without additional labeled data from the target distribution; this assumption is load-bearing for the central claim of advantage over baselines and risks making the few-shot setup oracle-assisted rather than generally applicable.
- [Method] Method (inferred from abstract description of integration): the LLM receives pseudo-labels from DNSMOS and VQScore as inputs; without an ablation isolating whether the LLM's reasoning adds value beyond a simple combination or regression on these pre-existing scores, it remains unclear if the meta-evaluation step is load-bearing or reduces to reweighting the supplied signals.
minor comments (3)
- [Abstract] The acronym GatherMOS should be defined on first use in the abstract.
- Include specific numerical tables (e.g., Pearson correlation, MSE) comparing all methods, plus details on the LLM variant, prompt templates, and number of in-context examples to support reproducibility.
- Clarify how acoustic descriptors are extracted and formatted for the LLM input to avoid ambiguity in the heterogeneous-signal aggregation step.
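The agreement metrics requested above are standard. A self-contained sketch of Pearson correlation and MSE between predicted and human MOS; the score lists are made-up placeholders, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                     sum((y - mean_y) ** 2 for y in ys))
    return cov / norm

def mse(pred, target):
    """Mean squared error between predicted and reference MOS."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

human     = [3.8, 2.1, 4.5, 3.0]   # placeholder listener MOS
predicted = [3.6, 2.4, 4.4, 3.2]   # placeholder system outputs
lcc, err = pearson(predicted, human), mse(predicted, human)
```

Spearman rank correlation (reference [25] in the bibliography below) is the usual companion metric and is just Pearson applied to the ranks of the scores.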
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to improve clarity and rigor.
read point-by-point responses
- Referee: [Abstract] Abstract: the claim that GatherMOS 'consistently outperforms' DNSMOS, VQScore, naive averaging, and limited-data models (CNN-BLSTM, MOS-SSL) is presented without any quantitative results, error bars, or methodology details, preventing assessment of the magnitude or reliability of the reported gains.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revision we will incorporate specific metrics (e.g., Pearson correlation and MSE values on VoiceBank-DEMAND) and a brief note on the evaluation protocol. Full tables with error bars from multiple runs and detailed methodology already appear in the experimental section; we will ensure the abstract points readers to these results. revision: yes
- Referee: [Abstract] Abstract: the few-shot gains are conditioned on 'support samples match the test conditions,' yet no mechanism is described for selecting or obtaining such matched supports without additional labeled data from the target distribution; this assumption is load-bearing for the central claim of advantage over baselines and risks making the few-shot setup oracle-assisted rather than generally applicable.
  Authors: The observation is correct: the reported few-shot gains presuppose access to matched support samples. The manuscript demonstrates the magnitude of improvement when such samples are available (as may occur in targeted deployment settings), but does not supply an automatic, label-free selection procedure. We will revise the abstract and discussion sections to state this assumption explicitly, qualify the scope of the claim, and outline practical acquisition strategies such as collecting a small matched validation set. This renders the contribution more precise without overstating generality. revision: partial
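One concrete acquisition strategy of the kind the authors mention, offered purely as our illustration rather than the paper's procedure: select supports whose acoustic descriptors lie nearest the test utterance's, so matching requires descriptor extraction but no new labels beyond a small pre-labeled pool.

```python
import math

def select_supports(test_descriptors, labeled_pool, k=3):
    """Pick the k pool items closest to the test clip in descriptor space.

    labeled_pool: list of (descriptor_dict, mos) pairs; descriptor dicts are
    assumed to share the keys of test_descriptors (illustrative assumption).
    """
    def distance(desc):
        return math.sqrt(sum((desc[key] - test_descriptors[key]) ** 2
                             for key in test_descriptors))
    return sorted(labeled_pool, key=lambda item: distance(item[0]))[:k]

# Hypothetical pool of labeled utterances and a test clip's descriptors.
pool = [({"rms": 0.04, "zcr": 0.09}, 3.5),
        ({"rms": 0.30, "zcr": 0.40}, 1.8),
        ({"rms": 0.05, "zcr": 0.10}, 3.7)]
supports = select_supports({"rms": 0.045, "zcr": 0.095}, pool, k=2)
```

Whether descriptor-space proximity is a good proxy for "matched test conditions" is exactly the open question the referee raises.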
- Referee: [Method] Method (inferred from abstract description of integration): the LLM receives pseudo-labels from DNSMOS and VQScore as inputs; without an ablation isolating whether the LLM's reasoning adds value beyond a simple combination or regression on these pre-existing scores, it remains unclear if the meta-evaluation step is load-bearing or reduces to reweighting the supplied signals.
  Authors: We acknowledge the value of a more targeted ablation. The current experiments already compare GatherMOS against naive averaging of the supplied pseudo-labels, showing consistent gains. To further isolate the contribution of LLM reasoning, we will add an ablation that replaces the LLM with a lightweight regression model (linear or small MLP) operating on the same acoustic descriptors and pseudo-labels. Results of this comparison will be included in the revised manuscript. revision: yes
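The proposed linear ablation reduces to an ordinary least-squares fit on the same inputs the LLM receives. A minimal sketch with synthetic placeholder data (the real ablation would use the paper's descriptors and data splits):

```python
import numpy as np

def fit_linear_baseline(features, mos):
    """Least-squares weights for MOS ~ features (bias column appended)."""
    X = np.hstack([features, np.ones((len(features), 1))])
    w, *_ = np.linalg.lstsq(X, mos, rcond=None)
    return w

def predict(w, features):
    X = np.hstack([features, np.ones((len(features), 1))])
    return X @ w

# Synthetic placeholders: columns are (DNSMOS, VQScore) per utterance.
X_train = np.array([[3.1, 0.6], [2.0, 0.4], [4.2, 0.8], [3.5, 0.7]])
y_train = np.array([3.2, 2.1, 4.3, 3.6])
w = fit_linear_baseline(X_train, y_train)
```

If this baseline matched GatherMOS on held-out data, the LLM step would indeed reduce to reweighting; a persistent gap would support the meta-evaluation claim.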
Circularity Check
No significant circularity; framework is empirical aggregation with external baselines
full rationale
The paper introduces GatherMOS as an LLM-based meta-evaluator that takes lightweight acoustic descriptors plus explicit pseudo-labels from DNSMOS and VQScore as inputs, then uses zero-shot or few-shot prompting to produce MOS predictions. Performance is evaluated empirically on VoiceBank-DEMAND against those same baselines plus averaging and supervised models, with the abstract stating outperformance over naive averaging. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The method is presented as an additional reasoning layer rather than a closed-form reduction to its inputs, and the few-shot matching requirement is a stated experimental condition rather than a definitional equivalence. The derivation chain is therefore self-contained against the external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- GatherMOS: no independent evidence
Reference graph
Works this paper leans on
- [1] Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models (internal anchor to this work's page; Pith/arXiv, 2026). Excerpt: "INTRODUCTION. Speech quality has long been an important indicator for evaluating various speech processing applications, including speech enhancement [1], hearing aid (HA) devices [2], and telecommunications [3]. While human-based evaluation remains the gold standard, it requires a sufficient number of listeners to obtain reliable and generalized scores..."
- [2] Zero-shot GatherMOS (Sec. 2.1). Excerpt: "Given an input waveform x ∈ R^T, we extract several acoustic descriptors that summarize temporal, spectral, and perceptual information. The RMS reflects the overall signal energy, while the ZCR indicates the level of noisiness or voicing. Duration and clipping detection are included to capture temporal length and ..."
- [3] Experimental setup (Sec. 3.1, 2024). Excerpt: "The proposed approaches are evaluated on the VoiceBank-DEMAND dataset [15], which is also included in the test evaluation of the VoiceMOS Challenge 2024 [16]. We select a test set of 200 utterances, consisting of clean speech, noisy speech corrupted by four noise types (car, white, music, and cafeteria) at an SNR of ..."
- [4] Conclusion. Excerpt: "In this work, we explored the use of generative AI for non-intrusive speech quality estimation and introduced GatherMOS, a framework that acts as a meta-evaluator by combining acoustic feature representations with pseudo-labels from DNSMOS and VQScore. By leveraging the reasoning capabilities of large language models, GatherMOS integrates these diverse signals to produce more reliable MOS predictions."
- [5] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2013.
- [6] J. M. Kates and K. H. Arehart, "The Hearing-Aid Speech Quality Index (HASQI) Version 2," Journal of the Audio Engineering Society, vol. 62, no. 3, pp. 99–117, 2014.
- [7] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl, "Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I: temporal alignment," Journal of the Audio Engineering Society, vol. 61, no. 6, pp. 366–384, 2013.
- [8] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, "MOSNet: Deep learning-based objective assessment for voice conversion," in Proc. Interspeech, 2019, pp. 1541–1545.
- [9] R. E. Zezario, S.-W. Fu, F. Chen, C.-S. Fuh, H.-M. Wang, and Y. Tsao, "Deep learning-based non-intrusive multi-objective speech assessment model with cross-domain features," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 54–70, 2023.
- [10] S.-W. Fu, K.-H. Hung, Y. Tsao, and Y.-C. F. Wang, "Self-supervised speech quality estimation and enhancement using only clean speech," in Proc. ICLR, 2024, pp. 1–22.
- [11] E. Cooper, W.-H. Huang, T. Toda, and J. Yamagishi, "Generalization ability of MOS prediction networks," in Proc. ICASSP, 2022, pp. 8442–8446.
- [12] S. Wang, W. Yu, Y. Yang, C. Tang, Y. Li, J. Zhuang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, and C. Zhang, "Enabling auditory large language models for automatic speech quality evaluation," in Proc. ICASSP, 2025, pp. 1–5.
- [13] C. Chen, Y. Hu, S. Wang, H. Wang, Z. Chen, C. Zhang, C.-H. H. Yang, and E. Chng, "Audio large language models can be descriptive speech quality evaluators," in Proc. ICLR, 2025, pp. 1–15.
- [14] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. NeurIPS, 2020, pp. 12449–12460.
- [15] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023, pp. 28492–28518.
- [16] M. Amiri, H. O. Shahreza, and I. Kodrasi, "Exploring in-context learning capabilities of ChatGPT for pathological speech detection," arXiv:2503.23873, 2025.
- [17] R. E. Zezario, S. M. Siniscalchi, H.-M. Wang, and Y. Tsao, "A study on zero-shot non-intrusive speech assessment using large language models," in Proc. ICASSP, 2025, pp. 1–5.
- [18] C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proc. ICASSP, 2021, pp. 6493–6497.
- [19] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in Proc. SSW, 2016, pp. 146–152.
- [20] W.-C. Huang, S.-W. Fu, E. Cooper, R. E. Zezario, T. Toda, H.-M. Wang, J. Yamagishi, and Y. Tsao, "The VoiceMOS Challenge 2024: Beyond speech quality prediction," in Proc. SLT, 2024, pp. 803–810.
- [21] K.-H. Hung, S.-W. Fu, H.-H. Tseng, H.-T. Chiang, Y. Tsao, and C.-W. Lin, "Boosting self-supervised embeddings for speech enhancement," in Proc. Interspeech, 2022, pp. 186–190.
- [22] Y.-X. Lu, Y. Ai, and Z.-H. Ling, "MP-SENet: A speech enhancement model with parallel denoising of magnitude and phase spectra," in Proc. Interspeech, 2023, pp. 3834–3838.
- [23] R. Cao, S. Abdulatif, and B. Yang, "CMGAN: Conformer-based metric GAN for speech enhancement," in Proc. Interspeech, 2022, pp. 936–940.
- [24] A. Défossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," in Proc. Interspeech, 2020, pp. 3291–3295.
- [25] C. Spearman, "The proof and measurement of association between two things," The American Journal of Psychology, vol. 15, no. 1, pp. 72–101, 1904.
- [26] S. Leglaive, L. Borne, E. Tzinis, M. Sadeghi, M. Fraticelli, S. Wisdom, M. Pariente, D. Pressnitzer, and J. R. Hershey, "The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement," in Proc. CHiME, 2023, pp. 1–7.
- [27] S. Leglaive, M. Fraticelli, H. ElGhazaly, L. Borne, M. Sadeghi, S. Wisdom, M. Pariente, J. R. Hershey, D. Pressnitzer, and J. P. Barker, "Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge," Computer Speech and Language, vol. 89, p. 101685, 2025.
discussion (0)