Evaluation of Automatic Speech Recognition Using Generative Large Language Models
Pith reviewed 2026-05-09 21:20 UTC · model grok-4.3
The pith
Large language models pick the better of two automatic speech recognition hypotheses in 92-94 percent agreement with human annotators, compared to 63 percent for word error rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decoder-based large language models achieve 92-94 percent agreement with human annotators when selecting the semantically preferable hypothesis between two ASR candidates on the HATS dataset, exceeding the 63 percent agreement obtained with word error rate and outperforming embedding-based semantic metrics. Embeddings extracted from these same models yield semantic distance measures comparable to those from encoder architectures. The approach also supports qualitative classification of error categories, indicating a path toward evaluation that is both semantic and human-interpretable.
What carries the argument
Pairwise hypothesis selection, in which a prompted decoder-based large language model identifies which of two ASR transcriptions better preserves meaning.
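As a rough illustration of how such pairwise agreement can be scored, the sketch below implements the word error rate baseline and an agreement counter. It is not the paper's code: the function names, the tuple layout `(reference, hyp_a, hyp_b, human_choice)`, and the tie-breaking rule are assumptions made for this example. An LLM-based selector would be any callable with the same signature as `select_by_wer`; how it is prompted is not specified here.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical item layout: (reference, hypothesis A, hypothesis B, human choice "A"/"B")
Item = Tuple[str, str, str, str]
Selector = Callable[[str, str, str], str]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def select_by_wer(reference: str, hyp_a: str, hyp_b: str) -> str:
    """Pick the hypothesis with the lower WER; ties go to "A" (arbitrary choice)."""
    wer_a = word_error_rate(reference, hyp_a)
    wer_b = word_error_rate(reference, hyp_b)
    return "A" if wer_a <= wer_b else "B"


def agreement(selector: Selector, items: Sequence[Item]) -> float:
    """Fraction of items on which the selector matches the human choice."""
    if not items:
        raise ValueError("need at least one annotated pair")
    hits = sum(selector(ref, a, b) == human for ref, a, b, human in items)
    return hits / len(items)
```

Calling `agreement(select_by_wer, items)` gives the baseline figure; passing a different selector with the same signature gives a directly comparable number on the same pairs.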
If this is right
- ASR evaluation can prioritize semantic fidelity over exact word matches when selecting or ranking hypotheses.
- Generative embeddings from decoder models become a practical substitute for encoder embeddings in semantic distance calculations.
- LLM prompts enable direct qualitative breakdown of error types in ASR outputs without additional specialized tools.
- Semantic evaluation pipelines can reduce dependence on word error rate alone for system comparisons.
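The semantic distance idea in the list above can be sketched as a cosine distance between sentence vectors. This is a minimal sketch under stated assumptions: the vectors are taken as given (how they are extracted from a decoder model is not shown), and mean pooling over token vectors is an assumption for illustration, not a detail confirmed by the source. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one sentence vector (assumed pooling)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tv[d] for tv in token_vectors) / n for d in range(dim)]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for same direction, 2 for opposite direction."""
    u_n, v_n = l2_normalize(u), l2_normalize(v)
    return 1.0 - sum(a * b for a, b in zip(u_n, v_n))
```

Under this setup, the hypothesis whose vector lies at the smaller cosine distance from the reference vector would be the one the metric prefers.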
Where Pith is reading between the lines
- The same selection method could be tested on other sequence tasks such as machine translation or summarization where surface metrics also miss meaning.
- Production ASR systems might incorporate LLM-based scoring to flag outputs that humans would judge as semantically flawed even when word error rate is low.
- Scaling the approach to rank more than two hypotheses at once would require checking whether agreement with humans remains high.
Load-bearing premise
Agreement with human annotators on which of two ASR hypotheses is semantically superior serves as a sufficient and unbiased proxy for overall correctness across domains.
What would settle it
A controlled test on a fresh ASR dataset with independent human ratings against known ground-truth transcriptions, where LLMs frequently select the wrong hypothesis while word error rate aligns more closely with the actual errors.
Original abstract
Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates decoder-based LLMs for semantic ASR evaluation via three methods: pairwise hypothesis selection, generative embedding distances, and qualitative error classification. On the HATS dataset, LLMs achieve 92-94% agreement with human annotators on hypothesis selection (vs. 63% for WER and lower for other semantic metrics), with decoder embeddings performing comparably to encoder models, positioning LLMs as a promising interpretable alternative to WER.
Significance. If the empirical results hold under proper controls, the work could meaningfully advance ASR evaluation by demonstrating that LLMs capture semantic fidelity better than surface metrics in at least one setting, with potential for more interpretable error analysis. The absence of cross-domain replication and downstream validation, however, confines the immediate significance to a narrow empirical observation rather than a general methodological advance.
major comments (3)
- [Abstract / Results] Abstract and results section: The headline 92-94% human agreement for hypothesis selection is reported only on the HATS dataset with no accompanying inter-annotator agreement statistics, model-size details, prompting specifications, or statistical significance tests; this directly undermines the claim that LLMs supply a robust semantic proxy.
- [Discussion / Conclusion] Discussion / Conclusion: The inference that LLM-based selection is a superior semantic proxy rests on HATS-specific pairwise judgments without cross-domain replication, without testing whether LLM-chosen hypotheses improve downstream semantic metrics in a real ASR pipeline, and without evidence that human pairwise labels on HATS are unbiased across error distributions or domains.
- [Methods] Methods: The three evaluation approaches are described at a high level but lack concrete implementation details (e.g., exact prompting templates for selection, how generative embeddings are extracted and normalized, or the error taxonomy used for qualitative classification), preventing assessment of reproducibility.
minor comments (2)
- [Abstract] The abstract states that decoder embeddings are 'comparable' to encoder models but supplies no numerical values or tables for direct comparison.
- [Introduction / Methods] Notation for the three approaches could be introduced with explicit labels or equations to improve clarity when referring back to them in the results.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, indicating where revisions have been made to the manuscript.
- Referee: [Abstract / Results] Abstract and results section: The headline 92-94% human agreement for hypothesis selection is reported only on the HATS dataset with no accompanying inter-annotator agreement statistics, model-size details, prompting specifications, or statistical significance tests; this directly undermines the claim that LLMs supply a robust semantic proxy.
  Authors: We agree that the abstract and results would be strengthened by these details. In the revised manuscript we have added inter-annotator agreement statistics for the HATS annotations, specified the model sizes and variants evaluated, included the exact prompting templates in a new appendix, and reported statistical significance tests (McNemar) against WER and other baselines. These additions support the robustness of the reported agreement rates.
  Revision: yes
- Referee: [Discussion / Conclusion] Discussion / Conclusion: The inference that LLM-based selection is a superior semantic proxy rests on HATS-specific pairwise judgments without cross-domain replication, without testing whether LLM-chosen hypotheses improve downstream semantic metrics in a real ASR pipeline, and without evidence that human pairwise labels on HATS are unbiased across error distributions or domains.
  Authors: We concur that the evaluation is limited to HATS and lacks cross-domain replication or downstream pipeline validation. The revised discussion now explicitly acknowledges these limitations, discusses possible biases in the HATS human labels, and outlines future work. However, new cross-domain experiments and downstream evaluations cannot be performed within the scope of this revision.
  Revision: partial
- Referee: [Methods] Methods: The three evaluation approaches are described at a high level but lack concrete implementation details (e.g., exact prompting templates for selection, how generative embeddings are extracted and normalized, or the error taxonomy used for qualitative classification), preventing assessment of reproducibility.
  Authors: We have revised the methods section to supply the missing details: exact prompting templates appear in Appendix A, the generative embedding extraction and normalization procedure (including layer selection and cosine similarity) is now fully specified, and the error taxonomy with definitions and examples is provided in Section 3.3. These changes improve reproducibility.
  Revision: yes
- Cross-domain replication and downstream ASR pipeline validation, which were not performed in the original study
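The rebuttal names McNemar's test for comparing selectors against WER on the same pairs but gives no details. A minimal sketch of one common variant, the two-sided exact (binomial) form, is shown below. The choice of the exact form over the chi-square approximation, the function name, and the boolean input layout are assumptions for illustration, not the authors' procedure.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(correct_a: Sequence[bool],
                  correct_b: Sequence[bool]) -> Tuple[int, int, float]:
    """Two-sided exact McNemar test on paired outcomes.

    correct_a[i] / correct_b[i] say whether method A / B matched the human
    choice on item i. Returns (b, c, p_value), where b counts items only A
    got right and c counts items only B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be scored on the same items")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Only the discordant pairs enter the test, so two methods with very different agreement rates can still yield a large p-value if few items separate them.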
Circularity Check
No circularity: empirical comparison to external human labels
The paper reports experimental results on LLM agreement with human annotators for ASR hypothesis selection on the HATS dataset, directly comparing percentages (92-94% for LLMs vs. 63% for WER) without any equations, derivations, fitted parameters renamed as predictions, or self-referential definitions. All load-bearing claims rest on external human judgments and standard metrics, with no reduction of outputs to inputs by construction. This is a standard empirical evaluation whose conclusions depend only on the external benchmark it reports.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human annotators provide reliable ground-truth judgments of semantic correctness for ASR hypotheses