Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Driss Khalil; Jane Wottawa; Mickael Rouvier; Petr Motlicek; Richard Dufour; Sergio Burdisso; Shashi Kumar; Shiran Liu; Thibault Bañeras-Roux

arXiv: 2604.21928 · v3 · pith:7MHLXIJNnew · submitted 2026-04-23 · 💻 cs.CL

Pith reviewed 2026-05-09 21:20 UTC · model grok-4.3

classification 💻 cs.CL

keywords automatic speech recognition · large language models · evaluation metrics · semantic evaluation · hypothesis selection · word error rate · generative embeddings

The pith

Large language models select the better automatic speech recognition hypothesis with 92-94 percent agreement with human annotators, compared to 63 percent for word error rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether decoder-based large language models can assess automatic speech recognition outputs according to meaning instead of surface word matches. It applies three methods on the HATS dataset: choosing the semantically superior hypothesis from a pair, deriving semantic distance from generative embeddings, and classifying error types. The key finding is that the strongest models match human choices 92 to 94 percent of the time, well above word error rate and other semantic baselines. This addresses the limitation that current metrics often reward or penalize transcriptions without regard to whether they convey the intended meaning. The authors further show that embeddings from these generative models perform at levels comparable to encoder-based alternatives and position LLMs as a route to more interpretable evaluation.

Core claim

Decoder-based large language models achieve 92-94 percent agreement with human annotators when selecting the semantically preferable hypothesis between two ASR candidates on the HATS dataset, exceeding the 63 percent agreement obtained with word error rate and outperforming embedding-based semantic metrics. Embeddings extracted from these same models yield semantic distance measures comparable to those from encoder architectures. The approach also supports qualitative classification of error categories, indicating a path toward evaluation that is both semantic and human-interpretable.
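The agreement figures above are rates over hypothesis pairs: a metric "agrees" with the annotators when it prefers the same hypothesis they did. The following is a minimal sketch of that computation for the WER baseline, not the authors' evaluation code. The function names, the toy pairs, and the rule that a tie counts as a disagreement are all assumptions made for illustration.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def agreement_with_humans(pairs, metric):
    """Fraction of pairs where the metric prefers the human-chosen hypothesis.

    Each pair is (reference, hyp_a, hyp_b, human_choice), with human_choice
    equal to "A" or "B". Lower metric values are better. Ties count as
    disagreement (an assumption of this sketch).
    """
    hits = 0
    for reference, hyp_a, hyp_b, human_choice in pairs:
        score_a, score_b = metric(reference, hyp_a), metric(reference, hyp_b)
        if score_a == score_b:
            continue
        metric_choice = "A" if score_a < score_b else "B"
        hits += metric_choice == human_choice
    return hits / len(pairs)


# Toy pairs invented for illustration; they are not taken from HATS.
toy_pairs = [
    ("i do not like it", "i do like it", "i don't like that", "B"),
    ("turn left here", "turn left here", "turn right here", "A"),
]
print(agreement_with_humans(toy_pairs, wer))
```

In the first toy pair, WER prefers the hypothesis that drops "not" because it has fewer word edits, even though it reverses the meaning. That is the kind of case where a surface metric and a human judge part ways.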

What carries the argument

Pairwise hypothesis selection, in which a prompted decoder-based large language model identifies which of two ASR transcriptions better preserves meaning.
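A rough sketch of what such a pairwise prompt and its answer parsing might look like is given here. The template wording, the choice to show the reference transcript to the model, and the helper names `build_prompt` and `parse_choice` are hypothetical; the paper's actual prompts are not reproduced in this review. The call to the model itself is left out.

```python
import re

# Hypothetical template: the paper's actual prompt wording is not given here.
PROMPT_TEMPLATE = (
    "You are evaluating automatic speech recognition output.\n"
    "Reference: {reference}\n"
    "Hypothesis A: {hyp_a}\n"
    "Hypothesis B: {hyp_b}\n"
    "Which hypothesis better preserves the meaning of the reference? "
    "Answer with a single letter, A or B."
)


def build_prompt(reference, hyp_a, hyp_b):
    """Fill the template with one reference and two candidate transcriptions."""
    return PROMPT_TEMPLATE.format(reference=reference, hyp_a=hyp_a, hyp_b=hyp_b)


def parse_choice(reply):
    """Return "A" or "B" from a model reply, or None if neither letter
    appears as a standalone token. The first standalone letter wins."""
    match = re.search(r"\b([AB])\b", reply.upper())
    return match.group(1) if match else None


prompt = build_prompt("turn left here", "turn left here", "turn right here")
print(prompt)
print(parse_choice("Answer: B"))
```

Returning `None` for unparseable replies keeps them visible as a separate outcome instead of silently assigning them to one side. A real setup would probably also swap the A and B positions across runs to check for order effects.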

If this is right

ASR evaluation can prioritize semantic fidelity over exact word matches when selecting or ranking hypotheses.
Generative embeddings from decoder models become a practical substitute for encoder embeddings in semantic distance calculations.
LLM prompts enable direct qualitative breakdown of error types in ASR outputs without additional specialized tools.
Semantic evaluation pipelines can reduce dependence on word error rate alone for system comparisons.
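The embedding-based consequence above comes down to comparing a distance between the reference vector and each hypothesis vector. A minimal sketch, assuming the sentence embeddings have already been extracted from some model, is shown here. The toy vectors are invented, and the choice of cosine distance follows the simulated rebuttal's mention of cosine similarity, not a confirmed detail of the paper.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def pick_closer(ref_vec, vec_a, vec_b):
    """Return "A" or "B" for the hypothesis embedding nearer the reference.
    Ties go to "A" (an arbitrary choice in this sketch)."""
    dist_a = cosine_distance(ref_vec, vec_a)
    dist_b = cosine_distance(ref_vec, vec_b)
    return "A" if dist_a <= dist_b else "B"


# Toy 2-dimensional vectors standing in for real sentence embeddings.
reference_vec = [1.0, 0.0]
hypothesis_a_vec = [0.9, 0.1]
hypothesis_b_vec = [0.0, 1.0]
print(pick_closer(reference_vec, hypothesis_a_vec, hypothesis_b_vec))
```

The same `pick_closer` decision can be scored against human choices exactly as a WER-based decision would be, which is what makes the metrics directly comparable.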

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selection method could be tested on other sequence tasks such as machine translation or summarization where surface metrics also miss meaning.
Production ASR systems might incorporate LLM-based scoring to flag outputs that humans would judge as semantically flawed even when word error rate is low.
Scaling the approach to rank more than two hypotheses at once would require checking whether agreement with humans remains high.

Load-bearing premise

Agreement with human annotators on which of two ASR hypotheses is semantically superior serves as a sufficient and unbiased proxy for overall correctness across domains.

What would settle it

A controlled test on a fresh ASR dataset with independent human ratings against known ground-truth transcriptions, where LLMs frequently select the wrong hypothesis while word error rate aligns more closely with the actual errors.

Figures

Figures reproduced from arXiv: 2604.21928 by Driss Khalil, Jane Wottawa, Mickael Rouvier, Petr Motlicek, Richard Dufour, Sergio Burdisso, Shashi Kumar, Shiran Liu, Thibault Bañeras-Roux.

Original abstract

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs beat WER at picking the better hypothesis on the HATS dataset, but the work stays narrow and gives thin implementation details.

The main result is that decoder-only LLMs reach 92-94% agreement with human annotators when choosing the semantically better ASR hypothesis on the HATS set, while WER only reaches 63%. The paper also tests generative embeddings from the same models and runs a qualitative error classification step. That direct comparison on human-labeled data is what stands out as new. It extends earlier embedding-based semantic metrics by bringing in decoder-only generative models and checking them against both WER and encoder embeddings on the same set.

The three-way breakdown of approaches is laid out clearly and the numbers are reported without much fluff. The paper does a decent job of showing that the LLM route can outperform surface metrics on this particular task and data.

The soft spots are straightforward. Everything rests on a single dataset with no cross-domain checks or tests in an actual decoding pipeline. The abstract gives no model sizes, no prompting details, no inter-annotator agreement figures for the human labels, and no statistical significance tests on the gaps. Without those, the 92-94% figure is hard to treat as robust. The central assumption that pairwise human agreement on HATS is an unbiased proxy for semantic correctness across domains is not tested here, so the broader claim about semantic evaluation stays provisional.

This paper is for ASR and speech researchers who already follow LLM applications and want concrete numbers on hypothesis selection. A reader looking for quick ideas on moving past WER could pull useful angles from the three approaches. It deserves peer review so the authors can supply the missing controls and perhaps add replication data.

Referee Report

3 major / 2 minor

Summary. The paper evaluates decoder-based LLMs for semantic ASR evaluation via three methods: pairwise hypothesis selection, generative embedding distances, and qualitative error classification. On the HATS dataset, LLMs achieve 92-94% agreement with human annotators on hypothesis selection (vs. 63% for WER and lower for other semantic metrics), with decoder embeddings performing comparably to encoder models, positioning LLMs as a promising interpretable alternative to WER.

Significance. If the empirical results hold under proper controls, the work could meaningfully advance ASR evaluation by demonstrating that LLMs capture semantic fidelity better than surface metrics in at least one setting, with potential for more interpretable error analysis. The absence of cross-domain replication and downstream validation, however, confines the immediate significance to a narrow empirical observation rather than a general methodological advance.

major comments (3)

[Abstract / Results] The headline 92-94% human agreement for hypothesis selection is reported only on the HATS dataset with no accompanying inter-annotator agreement statistics, model-size details, prompting specifications, or statistical significance tests; this directly undermines the claim that LLMs supply a robust semantic proxy.
[Discussion / Conclusion] The inference that LLM-based selection is a superior semantic proxy rests on HATS-specific pairwise judgments without cross-domain replication, without testing whether LLM-chosen hypotheses improve downstream semantic metrics in a real ASR pipeline, and without evidence that human pairwise labels on HATS are unbiased across error distributions or domains.
[Methods] The three evaluation approaches are described at a high level but lack concrete implementation details (e.g., exact prompting templates for selection, how generative embeddings are extracted and normalized, or the error taxonomy used for qualitative classification), preventing assessment of reproducibility.

minor comments (2)

[Abstract] The abstract states that decoder embeddings are 'comparable' to encoder models but supplies no numerical values or tables for direct comparison.
[Introduction / Methods] Notation for the three approaches could be introduced with explicit labels or equations to improve clarity when referring back to them in the results.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating where revisions have been made to the manuscript.

Referee: [Abstract / Results] The headline 92-94% human agreement for hypothesis selection is reported only on the HATS dataset with no accompanying inter-annotator agreement statistics, model-size details, prompting specifications, or statistical significance tests; this directly undermines the claim that LLMs supply a robust semantic proxy.

Authors: We agree that the abstract and results would be strengthened by these details. In the revised manuscript we have added inter-annotator agreement statistics for the HATS annotations, specified the model sizes and variants evaluated, included the exact prompting templates in a new appendix, and reported statistical significance tests (McNemar) against WER and other baselines. These additions support the robustness of the reported agreement rates.
revision: yes

Referee: [Discussion / Conclusion] The inference that LLM-based selection is a superior semantic proxy rests on HATS-specific pairwise judgments without cross-domain replication, without testing whether LLM-chosen hypotheses improve downstream semantic metrics in a real ASR pipeline, and without evidence that human pairwise labels on HATS are unbiased across error distributions or domains.

Authors: We concur that the evaluation is limited to HATS and lacks cross-domain replication or downstream pipeline validation. The revised discussion now explicitly acknowledges these limitations, discusses possible biases in the HATS human labels, and outlines future work. However, new cross-domain experiments and downstream evaluations cannot be performed within the scope of this revision.
revision: partial

Referee: [Methods] The three evaluation approaches are described at a high level but lack concrete implementation details (e.g., exact prompting templates for selection, how generative embeddings are extracted and normalized, or the error taxonomy used for qualitative classification), preventing assessment of reproducibility.

Authors: We have revised the methods section to supply the missing details: exact prompting templates appear in Appendix A, the generative embedding extraction and normalization procedure (including layer selection and cosine similarity) is now fully specified, and the error taxonomy with definitions and examples is provided in Section 3.3. These changes improve reproducibility.
revision: yes
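The first response mentions McNemar tests for the gap between LLM selection and the baselines. As a sketch of what such a test involves, the code below computes an exact two-sided McNemar p-value from per-pair correctness flags for two methods. The exact binomial variant, the function names, and the toy flags are assumptions; the review does not say which variant the authors use.

```python
from math import comb


def discordant_counts(correct_x, correct_y):
    """Count pairs where exactly one of two methods matched the human choice.

    correct_x and correct_y are equal-length sequences of booleans, one flag
    per hypothesis pair. Returns (only_x_correct, only_y_correct).
    """
    if len(correct_x) != len(correct_y):
        raise ValueError("both methods must be scored on the same pairs")
    only_x = sum(1 for x, y in zip(correct_x, correct_y) if x and not y)
    only_y = sum(1 for x, y in zip(correct_x, correct_y) if y and not x)
    return only_x, only_y


def mcnemar_exact_p(only_x, only_y):
    """Two-sided exact McNemar p-value.

    Under the null hypothesis, each discordant pair is equally likely to
    favor either method, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_x + only_y
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_x, only_y)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy correctness flags invented for illustration.
llm_correct = [True, True, True, True, False]
wer_correct = [True, False, False, True, True]
only_llm, only_wer = discordant_counts(llm_correct, wer_correct)
print(only_llm, only_wer, mcnemar_exact_p(only_llm, only_wer))
```

Only the discordant pairs carry information here: pairs where both methods agree with the annotators, or both disagree, do not affect the p-value.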

standing simulated objections not resolved

Cross-domain replication and downstream ASR pipeline validation, which were not performed in the original study

Circularity Check

0 steps flagged

No circularity: empirical comparison to external human labels

The paper reports experimental results on LLM agreement with human annotators for ASR hypothesis selection on the HATS dataset, directly comparing percentages (92-94% vs. 63% for WER) without any equations, derivations, fitted parameters renamed as predictions, or self-referential definitions. All load-bearing claims rest on external human judgments and standard metrics, with no reduction of outputs to inputs by construction. This is a standard empirical evaluation, and it is self-contained against the provided benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on human annotations as the reference standard and on the HATS dataset being representative of ASR errors; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: Human annotators provide reliable ground-truth judgments of semantic correctness for ASR hypotheses.
All reported agreement percentages are measured against these annotations.

pith-pipeline@v0.9.0 · 5467 in / 1175 out tokens · 24475 ms · 2026-05-09T21:20:34.343252+00:00 · methodology