Investigating Faithfulness in Large Audio Language Models

Cem Subakan; Lovenya Jain; Mirco Ravanelli; Pooneh Mousavi

arxiv: 2509.22363 · v4 · pith:KOA2A6CInew · submitted 2025-09-26 · 💻 cs.LG · eess.AS

keywords audio · faithfulness · models · language · large · reasoning · final · lalms

Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness. (The benchmarking interface and evaluation results are available at https://poonehmousavi.github.io/faithfulness/.) Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
cs.SD 2026-05 unverdicted novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition
cs.SD 2026-06 unverdicted novelty 4.0

Aligned acoustic concept tokens from eGeMAPS improve UAR in ALM-based SER on FAU-Aibo and IEMOCAP while shuffled or corrupted tokens reduce performance without collapsing predictions, indicating partial anchoring to audio.