From Text Metrics to Model Internals: A Study of Whisper ASR Hallucination Detection
Pith reviewed 2026-06-26 06:38 UTC · model grok-4.3
The pith
Internal decoder probing detects Whisper ASR hallucinations more effectively than text or LLM methods without references.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Probing Whisper's decoder representations, without a ground-truth reference, yields the strongest performance, revealing that hallucination traits are encoded across intermediate decoding layers. A late-fusion meta-classifier combining text and internal-state outputs achieves the best overall detection performance.
What carries the argument
Late-fusion meta-classifier that combines outputs from text-based metrics and internal decoder state probing to classify hallucinations.
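As a rough illustration of what a late-fusion meta-classifier can look like, the sketch below fits a small logistic regression over the scores of two upstream detectors. It is a minimal sketch under stated assumptions, not the paper's implementation: the choice of logistic regression, the function names `fit_late_fusion` and `fuse`, the two-score input layout, and the toy scores are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_late_fusion(score_rows, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-classifier with batch gradient descent.

    score_rows: list of [text_score, probe_score] pairs (hypothetical layout).
    labels: list of 0/1 hallucination labels.
    """
    n = len(score_rows)
    n_features = len(score_rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(score_rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def fuse(weights, bias, scores):
    # Fused probability that the utterance is a hallucination.
    return sigmoid(sum(w * s for w, s in zip(weights, scores)) + bias)


# Toy scores, invented for illustration only (not data from the paper).
train_scores = [[0.9, 0.8], [0.7, 0.9], [0.8, 0.3], [0.2, 0.1],
                [0.6, 0.2], [0.1, 0.7], [0.3, 0.2], [0.4, 0.9]]
train_labels = [1, 1, 0, 0, 0, 1, 0, 1]

weights, bias = fit_late_fusion(train_scores, train_labels)
print(round(fuse(weights, bias, [0.5, 0.9]), 3))
```

The point of late fusion in this form is that each upstream detector stays independent, and only its scalar output is combined, so a detector can be swapped out without retraining the others.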
If this is right
- Text-based detection degrades significantly without reference transcripts.
- Hallucination traits are encoded in intermediate decoder layers rather than only at the end.
- Internal state probing enables reference-free detection that outperforms both text and LLM methods.
- Combining multiple paradigms via late fusion produces the highest detection performance.
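The layer-wise claim above can be made concrete with a toy probing loop: train one simple probe per layer and compare held-out ROC AUC across layers. This is a sketch only. The hidden states are synthetic Gaussian vectors, the probe is a mean-difference linear probe (swapped in for whatever classifier the paper uses), and the per-layer signal strengths are set by hand, so the output says nothing about Whisper itself. All function names are hypothetical.

```python
import random


def mean_vector(rows):
    dim = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]


def fit_mean_difference_probe(features, labels):
    # Linear probe direction: class-1 mean minus class-0 mean.
    pos = [x for x, y in zip(features, labels) if y == 1]
    neg = [x for x, y in zip(features, labels) if y == 0]
    return [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]


def probe_score(direction, x):
    return sum(d * xi for d, xi in zip(direction, x))


def roc_auc(scores, labels):
    # Probability that a random positive outranks a random negative.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def synthetic_layer_states(labels, signal, dim, rng):
    # Toy stand-in for pooled decoder hidden states: the label shifts
    # only the first coordinate, by `signal`; the rest is pure noise.
    states = []
    for y in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += signal * y
        states.append(row)
    return states


def layerwise_auc(signal_per_layer, n_train=200, n_test=200, dim=8, seed=0):
    rng = random.Random(seed)
    train_y = [i % 2 for i in range(n_train)]
    test_y = [i % 2 for i in range(n_test)]
    aucs = []
    for signal in signal_per_layer:
        train_x = synthetic_layer_states(train_y, signal, dim, rng)
        test_x = synthetic_layer_states(test_y, signal, dim, rng)
        direction = fit_mean_difference_probe(train_x, train_y)
        scores = [probe_score(direction, x) for x in test_x]
        aucs.append(roc_auc(scores, test_y))
    return aucs


# Hand-picked signal strengths: weak, strong, moderate (by construction).
for layer, auc in enumerate(layerwise_auc([0.0, 3.0, 0.5])):
    print(layer, round(auc, 3))
```

With real data, `synthetic_layer_states` would be replaced by hidden states extracted from each decoder layer, and the probe would be evaluated on a held-out split exactly as here.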
Where Pith is reading between the lines
- Similar internal probing could be applied to other ASR models to check if hallucination encoding is a general property.
- Real-time systems might use these internal signals to flag or correct potential hallucinations during decoding.
- Annotation protocols for hallucinations may need standardization if performance patterns vary across datasets.
Load-bearing premise
The human-annotated real-speech dataset accurately identifies true hallucinations and the performance patterns observed on Whisper large v3 will hold for other models, domains, or annotation protocols.
What would settle it
A replication study on a different ASR model or a new human-annotated dataset where internal probing does not outperform text-based methods or the fusion does not achieve the best results would falsify the central claims.
Original abstract
Hallucinations of ASR models - fluent transcriptions with no basis in audio - degrade system performance and pose risks in downstream applications. Robust detection of such errors remains a challenge. This paper studies Whisper large v3 hallucination detection on real-speech human-annotated data across three paradigms: text-based, LLM-based, and internal decoder state probing. Text classifiers utilizing metrics for text evaluation achieve high recall but degrade without reference transcripts. LLM-based detection improves precision with domain-specific prompt conditioning, yet remains less competitive than the lightweight text-based methods. Probing Whisper's decoder representations, without a ground-truth reference, yields the strongest performance, revealing that hallucination traits are encoded across intermediate decoding layers. A late-fusion meta-classifier combining text and internal-state outputs achieves the best overall detection performance.
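To see why text-based detection depends on a reference transcript, it helps to look at a typical reference-dependent metric. The sketch below computes word error rate (WER) from a word-level Levenshtein distance. It assumes WER is representative of the oracle metrics the abstract refers to, which this page does not confirm; the function names and the example sentences are hypothetical.

```python
def word_edit_distance(reference_words, hypothesis_words):
    # Classic dynamic-programming Levenshtein distance over words,
    # keeping only the previous row of the table.
    prev = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


# Made-up example: the hypothesis appends words that are not in the reference.
print(word_error_rate("the cat sat", "the cat sat thank you for watching"))
```

Without the `reference` argument this feature cannot be computed at all, which is the practical obstacle that reference-free detectors have to work around.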
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines hallucination detection for Whisper large v3 on real-speech human-annotated data across three paradigms: text-based classifiers using evaluation metrics, LLM-based detection with domain-specific prompting, and probing of decoder internal states. It reports that text methods achieve high recall but suffer without references, LLM methods improve precision but lag, decoder probing yields the strongest reference-free performance with traits encoded in intermediate layers, and a late-fusion meta-classifier combining text and internal outputs performs best overall.
Significance. If the quantitative results and label validity hold, the work is significant for ASR reliability: it shows that internal representations encode hallucination signals without needing reference transcripts and that fusion can improve detection. The empirical comparison of paradigms on real data is a useful contribution, though the single-model scope limits broader impact.
major comments (3)
- [§3 (Dataset and Annotation)] The central performance rankings rest on human-annotated hallucination labels, yet the manuscript provides no inter-annotator agreement statistics, annotation guidelines, or error analysis on the real-speech dataset. This is load-bearing because label noise would confound all three paradigms equally and invalidate the claim that decoder probing is strongest.
- [§5 (Experiments and Results)] All reported results and layer-wise findings are restricted to Whisper large v3. The manuscript should test at least one additional Whisper size or non-Whisper ASR architecture to substantiate the generalization implied by the title and abstract.
- [Abstract] The abstract states clear performance orderings but supplies no numerical metrics, baseline values, dataset sizes, or statistical significance tests. Even if the full paper contains these, the absence of any quantitative anchor in the summary prevents assessment of effect sizes or robustness.
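The first major comment asks for inter-annotator agreement statistics. One common choice for two annotators is Cohen's kappa, sketched below. The report does not say which statistic the referee has in mind, so the choice of kappa, the function name, and the toy labels are assumptions for illustration.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Toy binary labels (1 = hallucination), invented for illustration.
annotator_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

Kappa corrects raw agreement for agreement expected by chance, which matters when hallucinations are rare and two annotators could agree often just by labelling most items as non-hallucinated.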
minor comments (2)
- [§4 (Methods)] Notation for the three paradigms and the late-fusion meta-classifier should be introduced with explicit equations or pseudocode in the methods section for reproducibility.
- [§5 (Experiments and Results)] Figure captions for layer-wise probing results should include the exact classifier architecture and training details used for each layer.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, proposing revisions where they strengthen the manuscript without misrepresenting our work.
- Referee: [§3 (Dataset and Annotation)] The central performance rankings rest on human-annotated hallucination labels, yet the manuscript provides no inter-annotator agreement statistics, annotation guidelines, or error analysis on the real-speech dataset. This is load-bearing because label noise would confound all three paradigms equally and invalidate the claim that decoder probing is strongest.
  Authors: We agree this is a substantive concern. The revised manuscript will include the annotation guidelines and a detailed description of the annotation process. We will also expand the existing error analysis with additional examples and a discussion of potential label noise effects on the reported rankings. revision: yes
- Referee: [§5 (Experiments and Results)] All reported results and layer-wise findings are restricted to Whisper large v3. The manuscript should test at least one additional Whisper size or non-Whisper ASR architecture to substantiate the generalization implied by the title and abstract.
  Authors: The title and abstract explicitly limit the scope to Whisper large v3; no broader generalization is claimed. We will add an explicit limitations paragraph noting the single-model focus and outlining future work on other architectures. New experiments on additional models cannot be completed for this revision. revision: partial
- Referee: [Abstract] The abstract states clear performance orderings but supplies no numerical metrics, baseline values, dataset sizes, or statistical significance tests. Even if the full paper contains these, the absence of any quantitative anchor in the summary prevents assessment of effect sizes or robustness.
  Authors: We will revise the abstract to include the dataset size, key F1 scores for the best text, LLM, and probing methods, and a note on statistical significance of the main comparisons. revision: yes
- Not addressed in this revision: the request to run experiments on at least one additional Whisper size or non-Whisper architecture, because this requires new data, annotation, and compute beyond the current study's scope.
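The rebuttal promises a note on the statistical significance of the main comparisons without naming a test. One option that fits paired detector outputs on the same utterances is a paired bootstrap over F1, sketched below. The choice of test, the function names, and the toy predictions are assumptions, not the authors' procedure.

```python
import random


def f1_score(preds, labels):
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap_f1(preds_a, preds_b, labels, n_resamples=1000, seed=0):
    """Return (observed F1 difference, fraction of resamples where A beats B).

    The same resampled indices are used for both detectors, which keeps
    the comparison paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    observed = f1_score(preds_a, labels) - f1_score(preds_b, labels)
    a_wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        f1_a = f1_score([preds_a[i] for i in idx], sample_labels)
        f1_b = f1_score([preds_b[i] for i in idx], sample_labels)
        if f1_a > f1_b:
            a_wins += 1
    return observed, a_wins / n_resamples


# Toy setup: detector A is perfect, detector B flags every utterance.
toy_labels = [1, 0] * 50
toy_preds_a = list(toy_labels)
toy_preds_b = [1] * len(toy_labels)
print(paired_bootstrap_f1(toy_preds_a, toy_preds_b, toy_labels))
```

The second returned value is a win rate across resamples rather than a formal p-value; a value close to 1.0 suggests the ordering of the two detectors is stable under resampling of the test set.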
Circularity Check
No circularity: empirical comparisons on external annotations with no derivations or self-referential fits.
The paper is an empirical study comparing text metrics, LLM prompts, and decoder-state probing for hallucination detection on human-annotated real-speech data. No equations, derivations, parameter fits renamed as predictions, or self-citation chains appear in the provided abstract or described content. All performance claims rest on direct experimental rankings against the same external labels, with no load-bearing step that reduces to a definition or prior self-result by construction. This matches the default non-circular case for data-driven papers without mathematical self-reference.
Reference graph
Works this paper leans on
- [1] Introduction. Hallucinations in Automatic Speech Recognition (ASR) systems present a critical challenge, amplified by the widespread deployment of large-scale models trained on weakly supervised data [1, 2]. As these training datasets often contain weak or machine-generated labels, models can learn incorrect acoustic event matchings, producing fluent tex...
- [2] Experimental Setup. Previous investigations into the detection of ASR hallucinations have often relied on non-speech audio predictions [3, 12], synthetic noise injections [3, 4] or proxy metrics [7, 8, 19, 20] to judge effectiveness due to a lack of human-verified, real-speech hallucination data. In this study, we utilize the recently introduced HALAS ...
- [3] Proposed Detection Frameworks. 3.1. Analysis based on text metrics. First, we investigate the discriminative power of text-based metrics for hallucination estimation, categorizing them into oracle (reference-dependent) and reference-free methods. 3.1.1. Feature definitions. The most commonly utilized metrics require a ground-truth transcript to measure dev...
- [4] ... confidence scores to measure the temporal consistency between the generated text and the acoustic signal. Finally, we implement a Naive CHP detector (NCHP). Unlike its oracle counterpart, the naive approach simply checks for the presence of common erroneous phrases in the ASR prediction. Table 1: Hallu. detection ROC AUC for individual text features....
- [5] Detector Fusion. 4.1. Comparison of different detection paradigms. We analyzed detection overlap using pairwise agreement, revealing that the paradigms capture non-overlapping signals (agreement: 0.64-0.73). The text classifier maximizes Recall (0.73) at the cost of Precision (0.53), while the LLM is the most conservative, with the highest Precision (0.64...
- [6] Conclusions. Robust detection of ASR hallucinations remains challenging, particularly in zero-shot deployments where ground-truth references are unavailable. Our investigation across three paradigms reveals that while text-based classifiers and LLMs achieve strong oracle performance, both suffer performance collapses in strictly reference-free settings. ...
- [7] Acknowledgments. This research was supported by the National Science Centre, Poland under Grants 2021/42/E/ST7/00452 and 2023/49/B/ST7/04100, and by program “Excellence initiative – research university” for the AGH University of Krakow. We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for p...
- [8] Generative AI Use Disclosure. The authors used large language models (ChatGPT, Gemini, Claude) to assist with language editing. All experimental design, data processing, statistical analysis, and scientific conclusions were independently conducted and verified by the authors. The authors take full responsibility for the content of this manuscript.
- [9] A. Radford, J. W. Kim, T. Xu, G. Brockman et al., “Robust Speech Recognition via Large-Scale Weak Supervision,” in Proceedings of the 40th International Conference on Machine Learning, vol. 202, 2023.
- [10] G. Saon, A. Dekel, A. Brooks, T. Nagano et al., “Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities,” arXiv preprint arXiv:2505.08699, 2025.
- [11] M. Barański, J. Jasiński, J. Bartolewska, S. Kacprzak et al., “Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
- [12] R. Frieske and B. E. Shi, “Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models,” arXiv preprint arXiv:2401.01572, 2024.
- [13] A. Koenecke, A. S. G. Choi, K. X. Mei, H. Schellmann, and M. Sloane, “Careless Whisper: Speech-to-Text Hallucination Harms,” in Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, 2024.
- [14] P. Szymański, L. Augustyniak, M. Morzy, A. Szymczak et al., “Why Aren’t We NER Yet? Artifacts of ASR Errors in Named Entity Recognition in Spontaneous Speech Transcripts,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
- [15] Y. Fang, B. Chen, J. Peng, X. Li et al., “Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction,” arXiv preprint arXiv:2505.24347, 2025.
- [16] M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time-Accurate Speech Transcription of Long-Form Audio,” arXiv preprint arXiv:2303.00747, 2023.
- [17] Z. Sasindran, H. Yelchuri, and T. V. Prabhakar, “SeMaScore: A new evaluation metric for automatic speech recognition tasks,” in Proceedings of Interspeech, 2024.
- [18] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” arXiv preprint arXiv:1904.09675, 2020.
- [19] A. Koudounas, M. L. Quatra, M. Giollo, S. M. Siniscalchi et al., “Hallucination Benchmark for Speech Foundation Models,” arXiv preprint arXiv:2510.16567, 2025.
- [20] Y. Wang, A. Alhmoud, S. Alsahly, M. Alqurishi et al., “Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down,” in Proceedings of Interspeech, 2025.
- [21] V. Wilmet and J. Du, “What does it take to get state of the art in simultaneous speech-to-speech translation?” arXiv preprint arXiv:2409.00965, 2024.
- [22] A. Radford, J. Wu, R. Child, D. Luan et al., “Language Models are Unsupervised Multitask Learners,” OpenAI, 2019, accessed: 2024-11-15. [Online]. Available: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- [23] S. Pulikodan, S. K, P. K. Ghosh, V. Sanka et al., “An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applications,” arXiv preprint arXiv:2507.16456, 2025.
- [24] H. Atwany, A. Waheed, R. Singh, M. Choudhury et al., “Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models,” arXiv preprint arXiv:2502.12414, 2025.
- [25] N. Glazer, Y. Segal-Feldman, H. Segev, A. Shamsian et al., “Beyond Transcription: Mechanistic Interpretability in ASR,” arXiv preprint arXiv:2508.15882, 2025.
- [26] M. Barański, J. Jasiński, J. Bartolewska, M. Witkowski et al., “HALAS: A Human-Annotated Dataset of Hallucinations of Modern ASR Systems,” in Proceedings of Interspeech, 2026.
- [27] K. Tripathi, A. S. Menon, A. Gaurav, R. P. Gohil et al., “Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation,” arXiv preprint arXiv:2511.14219, 2025.
- [28] L. Wagner, B. Thallinger, and M. Zusag, “CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions,” in Proceedings of Interspeech, 2024.
- [29] M. D. Rio, P. Ha, Q. McNamara, C. Miller, and S. Chandra, “Earnings-22: A Practical Benchmark for Accents in the Wild,” 2022.
- [30] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet Physics Doklady, vol. 10, no. 8, 1966.
- [31] A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky et al., “Beyond English-Centric Multilingual Machine Translation,” Journal of Machine Learning Research, vol. 22, no. 1, 2021.
- [32] Q. Li, D. Qiu, Y. Zhang, B. Li et al., “Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
- [33] D. R. Cox, “The Regression Analysis of Binary Sequences,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 20, no. 2, 1958.
- [34] A. Liaw and M. Wiener, “Classification and Regression by randomForest,” R News, vol. 2, no. 3, 2002.
- [35] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
- [36] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene Selection for Cancer Classification Using Support Vector Machines,” Machine Learning, vol. 46, 2002.
- [37] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5, 2005.