HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
Pith reviewed 2026-05-10 01:30 UTC · model grok-4.3
The pith
A new benchmark called HalluAudio shows that large audio-language models frequently generate responses unsupported by the input audio across speech, environmental sounds, and music.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce HalluAudio, the first large-scale benchmark for evaluating hallucinations across speech, environmental sound, and music, comprising over 5K human-verified QA pairs that cover binary judgments, multi-choice reasoning, attribute verification, and open-ended QA. Adversarial prompts and mixed-audio conditions are designed to induce hallucinations, and the protocol measures hallucination rate, yes/no bias, error-type breakdown, and refusal rate. Benchmarking open-source and proprietary models reveals significant deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding.
What carries the argument
HalluAudio, a collection of human-verified QA pairs together with an evaluation protocol that quantifies hallucination rate, bias, error types, and refusal behavior in large audio-language models.
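The page names the protocol's metrics (hallucination rate, yes/no bias, refusal rate) but not their formulas. A minimal sketch of one plausible operationalization; the `Response` record fields and the exact definitions below are assumptions for illustration, not the paper's own:

```python
from dataclasses import dataclass

@dataclass
class Response:
    answer: str         # model output, e.g. "yes", "no", or free text
    gold: str           # reference answer
    hallucinated: bool  # human label: semantically incorrect or acoustically unsupported
    refused: bool       # model declined to answer

def protocol_metrics(responses):
    """Compute assumed versions of the three headline protocol metrics."""
    n = len(responses)
    answered = [r for r in responses if not r.refused]
    binary = [r for r in answered if r.gold in ("yes", "no")]
    yes_pred = sum(r.answer == "yes" for r in binary)
    yes_gold = sum(r.gold == "yes" for r in binary)
    return {
        # share of answered items whose response is unsupported by the audio
        "hallucination_rate": sum(r.hallucinated for r in answered) / max(len(answered), 1),
        # positive values mean the model says "yes" more often than the gold labels do
        "yes_no_bias": (yes_pred - yes_gold) / max(len(binary), 1),
        # share of all items the model declined to answer
        "refusal_rate": sum(r.refused for r in responses) / n,
    }
```

Under these definitions a model that refuses everything scores zero hallucination but a refusal rate of one, which is why the protocol reports both together.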
If this is right
- Current large audio-language models exhibit notable shortfalls in grounding their outputs to actual acoustic content.
- Temporal reasoning over audio sequences remains a consistent area of weakness across tested models.
- Music attribute understanding lags behind performance on speech and environmental sound tasks.
- Both open-source and proprietary models require targeted improvements to reduce unsupported generations.
- The multi-dimensional protocol enables diagnosis of specific failure modes such as yes/no bias and refusal patterns.
Where Pith is reading between the lines
- The adversarial prompt designs could be adapted as stress tests during the training of future audio models to build greater robustness.
- Extending similar benchmarks to combined audio-visual inputs might uncover additional cross-modal hallucination behaviors not visible in audio alone.
- Widespread use of this evaluation approach could shift model development priorities toward reliability metrics alongside raw task accuracy.
Load-bearing premise
Human verification of the QA pairs reliably distinguishes responses that are semantically incorrect or acoustically unsupported, and the adversarial prompts plus mixed-audio conditions induce hallucinations without adding unrelated artifacts or biases.
What would settle it
Independent re-annotation of a subset of the QA pairs by new human reviewers that yields substantially different hallucination labels, or a model that scores low hallucination rates on HalluAudio yet produces many unsupported responses on fresh audio examples outside the benchmark.
Original abstract
Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain. Existing hallucination benchmarks mainly focus on text or vision, while the few audio-oriented studies are limited in scale, modality coverage, and diagnostic depth. We therefore introduce HalluAudio, the first large-scale benchmark for evaluating hallucinations across speech, environmental sound, and music. HalluAudio comprises over 5K human-verified QA pairs and spans diverse task types, including binary judgments, multi-choice reasoning, attribute verification, and open-ended QA. To systematically induce hallucinations, we design adversarial prompts and mixed-audio conditions. Beyond accuracy, our evaluation protocol measures hallucination rate, yes/no bias, error-type analysis, and refusal rate, enabling a fine-grained analysis of LALM failure modes. We benchmark a broad range of open-source and proprietary models, providing the first large-scale comparison across speech, sound, and music. Our results reveal significant deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding, underscoring the need for reliable and robust LALMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HalluAudio as the first large-scale benchmark for hallucination detection in Large Audio-Language Models (LALMs), comprising over 5K human-verified QA pairs spanning speech, environmental sound, and music. It covers diverse tasks (binary judgment, multi-choice, attribute verification, open-ended QA), employs adversarial prompts and mixed-audio conditions to induce hallucinations, and evaluates a range of open-source and proprietary models using metrics including hallucination rate, yes/no bias, error-type analysis, and refusal rate. Results highlight deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding.
Significance. If the human verification process proves reliable, HalluAudio would fill a notable gap by providing the first comprehensive, multi-domain audio hallucination benchmark with scale and diagnostic depth beyond existing text- or vision-focused resources. The inclusion of mixed-audio conditions, fine-grained error analysis, and broad model comparisons could serve as a useful foundation for future LALM robustness research.
Major comments (2)
- [Benchmark Construction] The manuscript provides no quantitative validation of the human verification process for the >5K QA pairs (e.g., inter-annotator agreement metrics, error rates on the benchmark itself, or confirmation that annotators had access to the raw audio files). This is load-bearing for the central claim: the benchmark's utility depends on labels that correctly identify semantically incorrect or acoustically unsupported outputs, especially under mixed-audio conditions.
- [Abstract and Introduction] Hallucinations are defined as 'semantically incorrect or acoustically unsupported' responses, but no explicit annotation rubrics, guidelines for distinguishing acoustic from semantic errors, or operational details for adversarial/mixed-audio cases are supplied. Without these, downstream metrics (hallucination rates, model comparisons) risk being undermined by label noise or bias.
Minor comments (2)
- [Evaluation Protocol] Provide explicit formulas or pseudocode for computing yes/no bias and refusal rate to ensure reproducibility.
- [Related Work] Add citations to any recent audio-specific hallucination studies not currently referenced, to strengthen the positioning of HalluAudio as the first large-scale effort.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of benchmark reliability and clarity. We address each major comment below and will revise the manuscript to incorporate additional details on the human verification process and annotation guidelines.
Point-by-point responses
- Referee: [Benchmark Construction] The manuscript provides no quantitative validation of the human verification process for the >5K QA pairs (e.g., inter-annotator agreement metrics, error rates on the benchmark itself, or confirmation that annotators had access to the raw audio files). This is load-bearing for the central claim: the benchmark's utility depends on labels that correctly identify semantically incorrect or acoustically unsupported outputs, especially under mixed-audio conditions.
  Authors: We agree that quantitative validation of the human verification process is essential. In the revised manuscript, we will add a dedicated subsection (or appendix) describing the verification procedure in detail, including inter-annotator agreement metrics (e.g., Fleiss' kappa), the number and qualifications of annotators, explicit confirmation that all annotators had access to the raw audio files, and any available error rates or post-verification analysis. These additions will directly address concerns about label reliability, especially for mixed-audio conditions. Revision: yes.
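The rebuttal proposes Fleiss' kappa as the inter-annotator agreement statistic. As a reference point, here is a minimal self-contained sketch of the standard computation; this is the textbook definition, not code from the paper:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for multi-rater categorical agreement.

    counts[i][j] = number of annotators who assigned item i to category j;
    every row must sum to the same number of annotators n.
    """
    N = len(counts)           # number of items
    n = sum(counts[0])        # annotators per item
    k = len(counts[0])        # number of categories
    # observed agreement: mean per-item probability that two raters agree
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields kappa = 1, while chance-level labeling yields values near 0; benchmark papers commonly report values above roughly 0.6 as substantial agreement.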
- Referee: [Abstract and Introduction] Hallucinations are defined as 'semantically incorrect or acoustically unsupported' responses, but no explicit annotation rubrics, guidelines for distinguishing acoustic from semantic errors, or operational details for adversarial/mixed-audio cases are supplied. Without these, downstream metrics (hallucination rates, model comparisons) risk being undermined by label noise or bias.
  Authors: We acknowledge that the current manuscript lacks explicit annotation rubrics and operational details. In the revision, we will expand the Methods section and add an appendix with the full annotation guidelines, specifying criteria for distinguishing acoustic errors (e.g., misidentification of sound attributes or temporal events) from semantic errors (e.g., factual inaccuracies unsupported by the audio), along with precise procedures for annotating responses under adversarial prompts and mixed-audio conditions. We believe these clarifications will strengthen the validity of the reported metrics. Revision: yes.
Circularity Check
No circularity: benchmark is new data construction with external verification
Full rationale
The paper introduces HalluAudio as a new benchmark comprising over 5K human-verified QA pairs spanning speech, sound, and music, with adversarial prompts and mixed-audio conditions to induce hallucinations. No mathematical derivations, fitted parameters, predictions, or self-citations are present in the abstract or described construction process. The central claim rests on independent data collection and human verification steps rather than reducing to prior inputs by definition or self-reference. Model evaluations are performed separately on this benchmark. This is a standard empirical benchmark paper with no load-bearing steps that collapse to the inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: human annotators can reliably identify hallucinations as semantically incorrect or acoustically unsupported responses in audio contexts.
Invented entities (1)
- HalluAudio benchmark (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
  Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...
Discussion (0)