
arxiv: 2509.22363 · v4 · pith:KOA2A6CInew · submitted 2025-09-26 · 💻 cs.LG · eess.AS

Investigating Faithfulness in Large Audio Language Models

classification 💻 cs.LG eess.AS
keywords audio · faithfulness · models · language · large · reasoning · final · lalms

Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness. (The benchmarking interface and evaluation results are available at https://poonehmousavi.github.io/faithfulness/.) Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.
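To make the idea of a CoT intervention concrete, the following is a minimal sketch of one way such a check could be scored. It is an illustration only, not the paper's benchmark: `predict`, `intervene`, `answer_change_rate`, and the toy model are all hypothetical names introduced here. The assumption is that a model whose final answer never changes when its reasoning is altered is probably not relying on that reasoning.

```python
from typing import Callable, Sequence, Tuple


def answer_change_rate(
    predict: Callable[[str, str], str],
    examples: Sequence[Tuple[str, str]],
    intervene: Callable[[str], str],
) -> float:
    """Fraction of examples whose final answer changes after a CoT intervention.

    predict(audio_id, cot) is a hypothetical stand-in for a model call that
    returns a final answer conditioned on an audio clip and a reasoning chain.
    """
    if not examples:
        raise ValueError("examples must be non-empty")
    changed = 0
    for audio_id, cot in examples:
        before = predict(audio_id, cot)
        after = predict(audio_id, intervene(cot))
        changed += int(before != after)
    return changed / len(examples)


# Toy stand-in model (assumption): answers "dog" only if the CoT mentions a bark.
def toy_predict(audio_id: str, cot: str) -> str:
    return "dog" if "bark" in cot else "unknown"


# Toy intervention (assumption): corrupt the key acoustic evidence in the CoT.
def corrupt_evidence(cot: str) -> str:
    return cot.replace("bark", "noise")


examples = [
    ("clip_1", "I hear a bark. So it is a dog."),
    ("clip_2", "There is music. It is a piano."),
]
rate = answer_change_rate(toy_predict, examples, corrupt_evidence)
```

A rate near 0 under a meaning-changing intervention would suggest that the prediction does not depend on the stated reasoning. An analogous check with an audio intervention (for example, swapping the clip while holding the question fixed) would probe grounding in the audio instead.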

This paper has not been read by Pith yet.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

    cs.SD · 2026-05 · unverdicted · novelty 5.0

    A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

  2. Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

    cs.SD · 2026-06 · unverdicted · novelty 4.0

    Aligned acoustic concept tokens from eGeMAPS improve UAR in ALM-based SER on FAU-Aibo and IEMOCAP while shuffled or corrupted tokens reduce performance without collapsing predictions, indicating partial anchoring to audio.