Hallucinations in neural automatic speech recognition: Identifying errors and hallucinatory models
4 Pith papers cite this work; polarity classification is still indexing.
2026: 4 representative citing papers, all currently unverdicted.
Citing papers explorer:
- HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
  HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
- Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition
  LLM decoders in speech recognition show no racial bias amplification and fewer repetition hallucinations under audio degradation than Whisper, with audio encoder design mattering more than model scale for fairness and robustness.
- Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps
  Four attention-based metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs, with PR-AUC gains of up to 0.23 over baselines on ASR and translation tasks.
- Too Good to Be True: A Study on Modern Automatic Speech Recognition for the Evaluation of Speech Enhancement
  Modern ASR models trained on noisy data and equipped with language models correlate better with human WER for speech enhancement evaluation than simpler models, yet their very robustness makes them less suitable for purely acoustic assessments.