arxiv: 2604.02605 · v1 · submitted 2026-04-03 · 💻 cs.AI · cs.SD

Recognition: 2 theorem links

· Lean Theorem

Do Audio-Visual Large Language Models Really See and Hear?

Ramaneswaran Selvakumar , Kaousheik Jayakumar , S Sakshi , Sreyan Ghosh , Ruohan Gao , Dinesh Manocha

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:53 UTC · model grok-4.3

classification 💻 cs.AI cs.SD

keywords audio-visual large language modelsmodality biasmechanistic interpretabilitymultimodal fusionvision dominancelayer-wise probingaudio semantics suppression

0 comments

The pith

AVLLMs encode audio semantics in intermediate layers but vision suppresses them in final text outputs on conflicts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts the first mechanistic study of how audio-visual large language models process and combine sight and sound information across their layers. It establishes that rich audio understanding is captured in the middle layers of these models. However, when audio and vision disagree, the final generated text follows the visual input instead of using the audio information. The study traces this to deeper layers that prioritize visual features and to training that does not add much audio-specific alignment beyond the original vision-language model. A sympathetic reader would care because these models are promoted as unified interfaces for multimodal perception, yet they may not truly integrate both modalities equally in practice.

Core claim

Although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. The AVLLM's audio behavior strongly matches its vision-language base model, indicating limited additional alignment to audio supervision.

What carries the argument

Layer-wise probing of evolving audio and visual features combined with tests where audio and vision inputs conflict to observe which modality influences the final output.

If this is right

AVLLMs will produce vision-dominated responses in scenarios with conflicting audio-visual information.
Latent audio representations exist but are not utilized in the model's text generation process.
Current training approaches for AVLLMs provide insufficient audio alignment relative to their vision-language foundations.
Modality fusion occurs unevenly, with visual information overriding audio in deeper layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers might need to introduce explicit balancing mechanisms during training to ensure audio contributes equally.
Similar probing techniques could reveal biases in other multimodal models handling text, images, or video.
This finding points to potential improvements by intervening at the fusion layers to amplify suppressed audio signals.

Load-bearing premise

The specific layer-wise probing and audio-visual conflict tests used are sufficient to reveal the general fusion mechanism in AVLLMs rather than being limited to the models and cases tested.

What would settle it

If an AVLLM is retrained with additional audio-focused supervision and the conflict tests then show final outputs incorporating audio semantics instead of being suppressed by vision.

Figures

Figures reproduced from arXiv: 2604.02605 by Dinesh Manocha, Kaousheik Jayakumar, Ramaneswaran Selvakumar, Ruohan Gao, Sreyan Ghosh, S Sakshi.

**Figure 1.** Figure 1: Illustration of visual bias. AVLLMs exhibit a critical modality bias, often prioritizing visual cues over vital audio cues. The diagram illustrates a counterfactual scene, visible objects (a blue car and a woman walking a dog) are silent and the only audible sound is an out-of-view ambulance siren. When prompted to describe the scene, the AVLLM hallucinates audio events (car engine, dog barking) and misse… view at source ↗

**Figure 2.** Figure 2: Audio Understanding Performance. Audio understanding severely degrades under audio-visual conflict. To address these issues, we systematically analyze why AVLLMs fail to utilize audio inputs effectively. Most AVLLMs adopt an adapter-based architecture [9], where pretrained audio and vision encoders feed representations through learned adapters into the LLM token spaceextending designs such as Llava [33]… view at source ↗

**Figure 4.** Figure 4: Mean attention from generated to input tokens. Generated tokens allocate high attention to audio in early layers (40- 50% in layers 0-5), which drops to near-zero afterward. Video attention steadily increases through deeper layers, reaching 20-40% in layers 15-30. and visual captions. Traditional metrics (BLEU [48], METEOR [6], CIDEr [6], ROUGE [32]) fail to capture semantic variability, while embeddin… view at source ↗

**Figure 5.** Figure 5: Probing Audio Representations. We decode intermediate layer audio representations using the base LLM’s unembedding matrix and observe that they decode into meaningful concepts describing sound events and their sources, in multiple languages (e.g., 键盘/keyboard, typing). ing audio-visual captioning. We prompt models to ”describe what you see and hear” and track the average attention allocated by generate… view at source ↗

**Figure 6.** Figure 6: Blocking audio-visual information flow to generated tokens We block attention from generated tokens to vision (G↛V) or audio (G↛A) and measure impact on caption fidelity. 1) G↛A for factual samples does not degrade video understanding⃝A as expected (ideally audio cues should not be used for video understanding) 2) surprisingly audio understanding⃝B does not degrade either, (AVLLM instead correlates audio c… view at source ↗

**Figure 7.** Figure 7: Illustration of AVLLM captions and comparison with LVLMs. We compare token distributions between Qwen2.5Omni and Qwen2.5VL, and measure shift (η) in token distribution unshifted tokens (η = 1), marginal tokens (1 < η ≤ 3), and shifted tokens (η > 3) (Refer Sec 8). Left: The model strongly attends to visual cues and predicts “helicopter” sound event confidently asserting it’s ”clear and distinct.”. Notably,… view at source ↗

**Figure 8.** Figure 8: Factual Sample A video where the visual content and audio content are highly correlated. The audio events in a factual sample can potentially be inferred by using visual cues [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Counterfactual Sample A video where the visual content and audio content are in conflict. The audio events cannot be inferred from visual cues To obtain visual descriptions, we generate captions using GPT-4.1 [46]. We manually review and correct these generated captions where necessary. However, we find that videos in AudioCaps tend to feature relatively simple visual scenes, and GPT-4.1’s outputs are g… view at source ↗

**Figure 10.** Figure 10: World Sense Task An example task from World Sense. These tasks couple perception and reasoning. with minimal intervention required [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: AVHBench Task An example task from AVHBench. These tasks evaluate perception capabilities. ple in [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

**Figure 12.** Figure 12: Comparison of Information Flow Blocking. Blocking audio-visual information flow, for both factual and counterfactual samples in VideoLlama 2.1 (top) versus MiniCPM-o2.6 (bottom) [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗

**Figure 13.** Figure 13: The prompt used to measure audio caption fidelity. [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗

**Figure 14.** Figure 14: The prompt used to measure video caption fidelity. [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

**Figure 15.** Figure 15: Qualitative examples. We illustrate some outputs generated by VideoLlama along with its LLM-judge score and reasoning, for factual and counterfactal samples and under attention knockout. The examples in the first row capture audio fidelity and the ones below capture video fidelity [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗

**Figure 16.** Figure 16: Qualitative examples. We illustrate some outputs generated by MiniCPM-2.6-o along with its LLM-judge score and reasoning, for factual and counterfactal samples and under attention knockout. The examples in the first row capture audio fidelity and the ones below capture video fidelity [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗

read the original abstract

Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior strongly matches its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AVLLMs encode audio semantics mid-model but suppress them in final outputs when vision conflicts, with the bias inherited from vision-language pretraining rather than fixed in multimodal training.

read the letter

The main thing to know is that this paper reports audio features get encoded in intermediate layers of AVLLMs but then get overridden by vision when the two inputs disagree, and the authors link that override to the models' vision-language base rather than to insufficient audio fine-tuning. They position the work as the first mechanistic interpretability study of these models, using layer-wise probes and conflict tests to show the imbalance. That tracing to the base model is the clearest new piece: it suggests the audio alignment step did not override the earlier vision-dominant behavior. The probing results themselves look like a useful empirical observation for anyone trying to diagnose why current AVLLMs can be unreliable in real scenes where sound and sight clash. The comparison to the base model gives a concrete handle on where the problem originates. The soft spots are mostly about verification. The abstract describes the probes and the training-match analysis, but the exact construction of conflict cases and the quantitative thresholds for calling something suppressed are not spelled out here, so it is still possible the pattern is tied to the particular checkpoints or input formatting rather than a general fusion rule. The stress-test concern about probe artifacts is reasonable to check against the full methods. This paper is aimed at people working on multimodal interpretability and alignment. Readers who already care about modality bias will find the layer-wise evidence and base-model comparison worth looking at, even if they want to rerun the tests themselves. It shows straightforward engagement with the existing literature on vision dominance and does not rely on circular definitions or unfalsifiable claims. I would send it to peer review rather than desk-reject; the core observation is concrete enough that referees can ask for the missing details and test generality without starting from scratch.

Referee Report

3 major / 2 minor

Summary. The paper presents the first mechanistic interpretability study of Audio-Visual Large Language Models (AVLLMs). Through layer-wise probing and audio-visual conflict tests, it claims that rich audio semantics are encoded at intermediate layers but largely fail to surface in final text outputs when audio conflicts with vision; this is traced to vision-dominant fusion layers and limited additional audio alignment during training, as the AVLLM's behavior closely matches its vision-language base model.

Significance. If the central claims hold after methodological clarification, the work would be significant as the first detailed mechanistic analysis of audio-visual fusion in LLMs. It identifies a modality bias with potential implications for training more balanced multimodal models and supplies concrete evidence via probing and training-match comparisons that could guide future alignment techniques.

major comments (3)

[Conflict Tests] The description of conflict-test construction (how audio-visual pairs are selected, formatted, and balanced) is insufficient to rule out test-specific artifacts favoring vision; without this, the claim that audio is actively suppressed by fusion layers rather than by input design remains hard to verify.
[Probing Analyses] Layer-wise probing results lack details on probe training (architecture, data, regularization) and quantitative metrics for 'rich audio semantics' and 'suppression'; this makes it difficult to assess whether the intermediate-layer encoding is robust or probe-dependent.
[Training Comparison] The training-match analysis to the vision-language base model is presented without explicit similarity metrics or controls for shared pre-training data; this weakens the causal link to 'limited additional alignment to audio supervision'.

minor comments (2)

[Figures] Figure captions and axis labels for layer-wise activation plots should explicitly state the probe type and normalization used.
[Abstract] The term 'parameter-free' is used in the abstract for certain ratios; clarify whether any hyperparameters were tuned in the probing setup.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify key aspects of our methodology. We address each major point below and will revise the manuscript to incorporate additional details and controls as outlined.

read point-by-point responses

Referee: [Conflict Tests] The description of conflict-test construction (how audio-visual pairs are selected, formatted, and balanced) is insufficient to rule out test-specific artifacts favoring vision; without this, the claim that audio is actively suppressed by fusion layers rather than by input design remains hard to verify.

Authors: We agree that the original description of conflict-test construction was insufficiently detailed. In the revised manuscript, we will expand Section 3.2 to explicitly describe the pair selection process (drawing from balanced subsets of AudioSet and Visual Genome with explicit cross-modal conflict criteria), formatting steps (including prompt templates and tokenization), and balancing procedures (ensuring equal numbers of vision-dominant, audio-dominant, and neutral pairs with statistical verification of label distributions). We will also add controls such as vision-only and audio-only baselines to demonstrate that observed suppression is not an artifact of input design but arises from the fusion layers. revision: yes
Referee: [Probing Analyses] Layer-wise probing results lack details on probe training (architecture, data, regularization) and quantitative metrics for 'rich audio semantics' and 'suppression'; this makes it difficult to assess whether the intermediate-layer encoding is robust or probe-dependent.

Authors: We acknowledge this gap in methodological transparency. The revised version will include a new appendix detailing the probe architecture (linear classifiers and 2-layer MLPs), training data (held-out audio classification subsets with 10k examples), regularization (L2 with coefficient 0.01 and early stopping), and quantitative metrics (layer-wise accuracy, precision-recall AUC, and a suppression index defined as the drop in audio probe performance from middle to final layers). These additions will allow direct assessment of robustness independent of specific probe choices. revision: yes
Referee: [Training Comparison] The training-match analysis to the vision-language base model is presented without explicit similarity metrics or controls for shared pre-training data; this weakens the causal link to 'limited additional alignment to audio supervision'.

Authors: We will strengthen the training comparison analysis by adding explicit similarity metrics, including cosine similarity between corresponding layer activations on audio-only inputs and KL divergence between output distributions on conflict tasks. We will also include a discussion of controls for shared pre-training data (noting the base model's vision-language corpus and the limited audio fine-tuning steps in the AVLLM), along with ablation results comparing behavior before and after audio alignment phases. This will provide stronger evidence for the limited additional audio supervision claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical probing and comparison study

full rationale

The paper conducts a mechanistic interpretability analysis using layer-wise probing, conflict tests, and direct comparisons between the AVLLM and its vision-language base model. No mathematical derivations, equations, or self-referential definitions appear in the claims; results are presented as empirical observations from model activations and outputs. These are independently verifiable by replicating the probes on the same or similar checkpoints, with no reduction of predictions to fitted inputs or load-bearing self-citations. The central finding on modality bias follows directly from the observed layer behaviors without circular construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that standard probing techniques faithfully extract latent audio semantics and that the tested conflict cases are representative of general AVLLM behavior.

axioms (1)

domain assumption Probing classifiers can accurately recover latent audio information from intermediate layers without introducing artifacts
Invoked when claiming rich audio semantics are present but suppressed.

pith-pipeline@v0.9.0 · 5466 in / 1182 out tokens · 33449 ms · 2026-05-13T20:53:25.546330+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues... attention knockout experiments... blocking visual pathways recovers audio understanding
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We observe that these decoded tokens meaningfully capture sound sources... meaningful audio information emerging in the last 5 layers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
cs.SD 2026-05 unverdicted novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

Accessed: 2025-11-12

Internomni: Extending internvl with audio modality, 2024. Accessed: 2025-11-12. 2, 14

work page 2024
[2]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding inter- mediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016. 3

work page internal anchor Pith review Pith/arXiv arXiv 2016
[3]

Rana AlShaikh, Norah Al-Malki, and Maida Almasre. The implementation of the cognitive theory of multimedia learn- ing in the design and evaluation of an ai educational video assistant utilising large language models.Heliyon, 10(3): e25361, 2024. 1

work page 2024
[4]

Is your large language model knowledgeable or a choices-only cheater?ArXiv, abs/2407.01992, 2024

Nishant Balepur and Rachel Rudinger. Is your large language model knowledgeable or a choices-only cheater?ArXiv, abs/2407.01992, 2024. 3

work page arXiv 2024
[5]

Nishant Balepur, Abhilasha Ravichander, and Rachel Rudinger. Artifacts or abduction: How do llms answer multiple-choice questions without the question? InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10308–10330, 2024. 3

work page 2024
[6]

Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments

Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments. InProceedings of the acl workshop on in- trinsic and extrinsic evaluation measures for machine trans- lation and/or summarization, pages 65–72, 2005. 4

work page 2005
[7]

Lvlm-intrepret: An interpretability tool for large vision-language models

Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, and Vasudev Lal. Lvlm-intrepret: An interpretability tool for large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8182–8187, ...

work page 2024
[8]

Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Ma˜nas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, M...

work page 2024
[9]

The revo- lution of multimodal large language models: A survey

Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. The revo- lution of multimodal large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13590–13618, Bangkok, Thailand, 2024. Association for Comput...

work page 2024
[10]

Clair: Evaluating image captions with large language models.arXiv preprint arXiv:2310.12971, 2023

David Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny. Clair: Evaluating image captions with large language models.arXiv preprint arXiv:2310.12971, 2023. 4

work page arXiv 2023
[11]

Emova: Empowering language models to see, hear and speak with vivid emotions

Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, et al. Emova: Empowering language models to see, hear and speak with vivid emotions. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5455–5466, 2025. 1

work page 2025
[12]

Om- nixr: Evaluating omni-modality language models on reason- ing across modalities

Lichang Chen, Hexiang Hu, Mingda Zhang, Yiwen Chen, Zifeng Wang, Y ANDONG LI, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang, and Boqing Gong. Om- nixr: Evaluating omni-modality language models on reason- ing across modalities. InThe Thirteenth International Con- ference on Learning Representations, 2025. 1

work page 2025
[13]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 2, 15

work page 2024
[14]

Avtrustbench: Assessing and enhancing relia- bility and robustness in audio-visual llms.arXiv preprint arXiv:2501.02135, 2025

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Yaot- ing Wang, Mohamed Elhoseiny, Ruohan Gao, and Dinesh Manocha. Avtrustbench: Assessing and enhancing relia- bility and robustness in audio-visual llms.arXiv preprint arXiv:2501.02135, 2025. 4

work page arXiv 2025
[15]

Clotho: An audio captioning dataset

Konstantinos Drossos, Samuel Lipping, and Tuomas Virta- nen. Clotho: An audio captioning dataset. InICASSP 2020- 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 736–740. IEEE,

work page 2020
[16]

Transcoders find interpretable llm feature circuits

Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable llm feature circuits. InAd- vances in Neural Information Processing Systems, pages 24375–24410. Curran Associates, Inc., 2024. 1

work page 2024
[17]

A mathemati- cal framework for transformer circuits.Transformer Circuits Thread, 1(1):12, 2021

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathemati- cal framework for transformer circuits.Transformer Circuits Thread, 1(1):12, 2021. 2, 5

work page 2021
[18]

Transformer feed-forward layers are key-value memories

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021. 2

work page 2021
[19]

2023 , archivePrefix=

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing.arXiv preprint arXiv:2305.01610, 2023. 1

work page arXiv 2023
[20]

Clipscore: A reference-free evaluation met- ric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. InProceedings of the 2021 confer- ence on empirical methods in natural language processing, pages 7514–7528, 2021. 4

work page 2021
[21]

Musashi Hinck, Carolin Holtermann, Matthew Lyle Olson, Florian Schneider, Sungduk Yu, Anahita Bhiwandiwalla, Anne Lauscher, Shao-Yen Tseng, and Vasudev Lal. Why do LLaV A vision-language models reply to images in English? InFindings of the Association for Computational Linguis- tics: EMNLP 2024, pages 13402–13421, Miami, Florida, USA, 2024. Association fo...

work page 2024
[22]

Omnivla: An omni-modal vision- language-action model for robot navigation.arXiv preprint arXiv:2509.19480, 2025

Noriaki Hirose, Catherine Glossop, Dhruv Shah, and Sergey Levine. Omnivla: An omni-modal vision- language-action model for robot navigation.arXiv preprint arXiv:2509.19480, 2025. 1

work page arXiv 2025
[23]

Worldsense: Evaluating real-world omnimodal understanding for multimodal llms, 2025

Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omni- modal understanding for multimodal llms.arXiv preprint arXiv:2502.04326, 2025. 13

work page arXiv 2025
[24]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies.arXiv preprint arXiv:2404.06395, 2024. 2, 4, 14

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Mmevalpro: Calibrating multimodal benchmarks towards trustworthy and efficient evaluation.ArXiv, abs/2407.00468, 2024

Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, and Ming Zhang. Mmevalpro: Calibrating multimodal benchmarks towards trustworthy and efficient evaluation.ArXiv, abs/2407.00468, 2024. 3

work page arXiv 2024
[26]

Vlind- bench: Measuring language priors in large vision-language models

Kang il Lee, Minbeom Kim, Seunghyun Yoon, Minsu Kim, Dongryeol Lee, Hyukhun Koh, and Kyomin Jung. Vlind- bench: Measuring language priors in large vision-language models. InNorth American Chapter of the Association for Computational Linguistics, 2024. 4

work page 2024
[27]

What’s in the im- age? a deep-dive into the vision of vision language models

Omri Kaduri, Shai Bagon, and Tali Dekel. What’s in the im- age? a deep-dive into the vision of vision language models. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 14549–14558, 2025. 1, 2, 4

work page 2025
[28]

Text encoders bottleneck compositionality in contrastive vision- language models

Amita Kamath, Jack Hessel, and Kai-Wei Chang. Text encoders bottleneck compositionality in contrastive vision- language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4933–4944, 2023. 4

work page 2023
[29]

Audiocaps: Generating captions for audios in the wild

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. InNAACL-HLT, 2019. 4, 12

work page 2019
[30]

Prob- ing classifiers are unreliable for concept removal and detec- tion.Advances in Neural Information Processing Systems, 35:17994–18008, 2022

Abhinav Kumar, Chenhao Tan, and Amit Sharma. Prob- ing classifiers are unreliable for concept removal and detec- tion.Advances in Neural Information Processing Systems, 35:17994–18008, 2022. 3

work page 2022
[31]

The unlocking spell on base llms: Re- thinking alignment via in-context learning

Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavat- ula, and Yejin Choi. The unlocking spell on base llms: Re- thinking alignment via in-context learning. InThe Twelfth International Conference on Learning Representations. 7

work page
[32]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 4

work page 2004
[33]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 2

work page 2023
[34]

Recall: A benchmark for llms robustness against external counterfac- tual knowledge.ArXiv, abs/2311.08147, 2023

Yi Liu, Lianzhe Huang, Shicheng Li, Sishuo Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Recall: A benchmark for llms robustness against external counterfac- tual knowledge.ArXiv, abs/2311.08147, 2023. 4

work page arXiv 2023
[35]

arXiv preprint arXiv:2306.09093 , year=

Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration.arXiv preprint arXiv:2306.09093, 2023. 1

work page arXiv 2023
[36]

Behind the scenes: Mechanistic interpretability of lora-adapted whis- per for speech emotion recognition.arXiv preprint arXiv:2509.08454, 2025

Yujian Ma, Jinqiu Sang, and Ruizhe Li. Behind the scenes: Mechanistic interpretability of lora-adapted whis- per for speech emotion recognition.arXiv preprint arXiv:2509.08454, 2025. 1

work page arXiv 2025
[37]

Locating and editing factual associations in gpt.Ad- vances in neural information processing systems, 35:17359– 17372, 2022

Kevin Meng, David Bau, Alex Andonian, and Yonatan Be- linkov. Locating and editing factual associations in gpt.Ad- vances in neural information processing systems, 35:17359– 17372, 2022. 2

work page 2022
[38]

The impact of multimodal large language models on health care’s future.Journal of medical Internet research, 25:e52865, 2023

Bertalan Mesk ´o. The impact of multimodal large language models on health care’s future.Journal of medical Internet research, 25:e52865, 2023. 1

work page 2023
[39]

The quest for the right mediator: A history, survey, and the- oretical grounding of causal interpretability.arXiv preprint arXiv:2408.01416, 2024

Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, et al. The quest for the right mediator: A history, survey, and the- oretical grounding of causal interpretability.arXiv preprint arXiv:2408.01416, 2024. 3

work page arXiv 2024
[40]

Soundscapes and deep learning enable tracking biodiversity recovery in tropical forests.Nature communications, 14(1):6191, 2023

J ¨org M ¨uller, Oliver Mitesser, H Martin Schaefer, Sebastian Seibold, Annika Busse, Peter Kriegel, Dominik Rabl, Rudy Gelis, Alejandro Arteaga, Juan Freile, et al. Soundscapes and deep learning enable tracking biodiversity recovery in tropical forests.Nature communications, 14(1):6191, 2023. 1

work page 2023
[41]

Fact finding: Attempting to reverse-engineer factual recall on the neuron level

Neel Nanda, Senthooran Rajamanoharan, Janos Kramar, and Rohin Shah. Fact finding: Attempting to reverse-engineer factual recall on the neuron level. InAlignment Forum, page 6, 2023. 2

work page 2023
[42]

Towards interpreting visual in- formation processing in vision-language models

Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual in- formation processing in vision-language models. InThe Thirteenth International Conference on Learning Represen- tations. 1, 2, 5

work page
[43]

Sfr-rag: Towards contextually faithful llms.ArXiv, abs/2409.09916, 2024

Xuan-Phi Nguyen, Shrey Pandit, Senthil Purushwalkam, Austin Xu, Hailin Chen, Yifei Ming, Zixuan Ke, Silvio Savarese, Caiming Xong, and Shafiq Joty. Sfr-rag: Towards contextually faithful llms.ArXiv, abs/2409.09916, 2024. 4

work page arXiv 2024
[44]

Interpreting gpt: The logit lens

Nostalgebraist. Interpreting gpt: The logit lens. https : / / www . alignmentforum . org / posts / AcKRB8wDpdaN6v6ru/interpreting- gpt- the- logit-lens, 2020. Accessed: 2024-09-23. 3

work page 2020
[45]

In-context learning and induction heads.CoRR, 2022

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads.CoRR, 2022. 2

work page 2022
[46]

Gpt-4 technical report, 2024

OpenAI. Gpt-4 technical report, 2024. 12

work page 2024
[47]

Learning interpretable features in audio latent spaces via sparse autoencoders

Nathan Paek, Yongyi Zang, Qihui Yang, and Randal Leis- tikow. Learning interpretable features in audio latent spaces via sparse autoencoders. InMechanistic Interpretability Workshop at NeurIPS 2025. 1

work page 2025
[48]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

work page
[49]

Object hallucination in image cap- tioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image cap- tioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045,

work page 2018
[50]

Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023. 2

work page arXiv 2023
[51]

video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024

Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024. 1

work page arXiv 2024
[52]

Avhbench: A cross- modal hallucination benchmark for audio-visual large lan- guage models.arXiv preprint arXiv:2410.18325, 2024

Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, and Tae-Hyun Oh. Avhbench: A cross- modal hallucination benchmark for audio-visual large lan- guage models.arXiv preprint arXiv:2410.18325, 2024. 4, 13

work page arXiv 2024
[53]

Causal mediation analysis for in- terpreting neural nlp: The case of gender bias.arXiv preprint arXiv:2004.12265, 2020

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber. Causal mediation analysis for in- terpreting neural nlp: The case of gender bias.arXiv preprint arXiv:2004.12265, 2020. 3

work page arXiv 2004
[54]

Interpretability in the wild: a circuit for indirect object identification in gpt-2 small

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. In The Eleventh International Conference on Learning Repre- sentations. 2

work page
[55]

Do llamas work in English? on the latent language of multilingual transformers

Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. Do llamas work in English? on the latent language of multilingual transformers. InProceedings of the 62nd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 15366– 15394, Bangkok, Thailand, 2024. Association for Computa- tional Linguistics. 5

work page 2024
[56]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2. 5-omni technical report.arXiv preprint arXiv:2503.20215, 2025. 1, 2, 4, 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Qwen3-omni technical report, 2025

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

work page 2025
[58]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

Audiolens: A closer look at auditory attribute percep- tion of large audio-language models.arXiv preprint arXiv:2506.05140, 2025

Chih-Kai Yang, Neo Ho, Yi-Jyun Lee, and Hung-yi Lee. Audiolens: A closer look at auditory attribute percep- tion of large audio-language models.arXiv preprint arXiv:2506.05140, 2025. 1, 2

work page arXiv 2025
[60]

Roboego system card: An omnimodal model with native full duplexity.arXiv preprint arXiv:2506.01934, 2025

Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Aixin Sun, and Yequan Wang. Roboego system card: An omnimodal model with native full duplexity.arXiv preprint arXiv:2506.01934, 2025. 1

work page arXiv 2025
[61]

When and why vision- language models behave like bags-of-words, and what to do about it? InThe Eleventh International Conference on Learning Representations

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision- language models behave like bags-of-words, and what to do about it? InThe Eleventh International Conference on Learning Representations. 4

work page
[62]

2023 , archivePrefix=

Fred Zhang and Neel Nanda. Towards best practices of ac- tivation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042, 2023. 1

work page arXiv 2023
[63]

Video-llama: An instruction-tuned audio-visual language model for video un- derstanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video un- derstanding. InProceedings of the 2023 Conference on Em- pirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, 2023. 1, 2, 4, 14

work page 2023
[64]

Qwen3 em- bedding: Advancing text embedding and reranking through foundation models, 2025

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 em- bedding: Advancing text embedding and reranking through foundation models, 2025. 13

work page 2025
[65]

Cross-modal information flow in multimodal large language models

Zhi Zhang, Srishti Yadav, Fengze Han, and Ekaterina Shutova. Cross-modal information flow in multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19781–19791, 2025. 2

work page 2025
[66]

Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025

Ziwei Zhou, Rui Wang, and Zuxuan Wu. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities.arXiv preprint arXiv:2505.17862, 2025. 1, 4 Do Audio-Visual Large Language Models Really See and Hear? Supplementary Material Contents

work page arXiv 2025
[67]

Experimental Setup 3

work page
[68]

Investigating Attention Pattern 4

work page
[69]

Probing Audio Representations 5

work page
[70]

Factual Audio-Visual Understanding

Investigating Information Flow 5 7.1. Factual Audio-Visual Understanding . . . . 6 7.2. Counter-Factual Audio-Visual Understanding 7

work page
[71]

Investigating Origins of Visual Bias 7

work page
[72]

Dataset 12 A.1

Conclusion, Limitations, and Future Work 8 A . Dataset 12 A.1 . Data Source . . . . . . . . . . . . . . . . . . 12 A.2 . Counterfactual Sample Curation . . . . . . . 13 A.3 . Existing Benchmarks . . . . . . . . . . . . . 13 B . Evaluation 14 B.1. Human Evaluation Study . . . . . . . . . . . 14 B.2. LLM Judge Prompts . . . . . . . . . . . . . 14 C . Qualit...

work page