From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
Pith reviewed 2026-06-27 16:10 UTC · model grok-4.3
The pith
AVLLMs follow sequential information flow for audio-visual videos but switch to parallel streams for interleaved items, and can discard sensory tokens after transfer to the LLM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model's prediction or even slight improvement.
What carries the argument
Tracing of information flow through the network layers to identify how audio and visual tokens are routed and integrated into the final decision.
If this is right
- Audio-visual tokens can be pruned after information transfer for more efficient inference without significant performance loss.
- The routing adapts to input configuration, using sequential paths for unified video and parallel for separate items.
- Modality contributions scale with how much each is needed for the specific task.
- The patterns hold across different model sizes and types, suggesting a general structure in AVLLMs.
Where Pith is reading between the lines
- These flow patterns could inform the design of more efficient multimodal architectures by prioritizing certain connections.
- Similar tracing methods might be applied to understand information flow in other combinations of modalities.
- Token discarding opens possibilities for dynamic input management during model operation.
Load-bearing premise
The technique for tracing information flow correctly identifies the causal contributions of audio and visual inputs rather than just their statistical associations with the output.
What would settle it
Ablating the attention or activation paths identified as carrying the audio-visual information and checking whether the model's accuracy on relevant tasks decreases as expected.
Figures
read the original abstract
Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines internal information flow in Audio-Visual Large Language Models (AVLLMs) for two configurations: audio-visual video inputs and multiple interleaved audio-visual items. It reports that AVLLMs follow the sequential pathway established for VLMs/VideoLLMs in the video case, with audio/visual contributions scaling proportionally to task modality reliance; routing shifts to parallel streams under interleaving. It further claims that audio-visual and other token types can be discarded after information transfer to the LLM with minimal or no performance degradation (sometimes improvement), generalizing across Qwen2.5-Omni and Video-SALMONN2 Plus at 3B/7B scales and multiple tasks/datasets.
Significance. If the tracing procedure is shown to be causal, the work supplies the first coherent empirical map of audio-visual integration inside AVLLMs and directly supports efficiency gains via post-transfer pruning. Cross-model and cross-scale consistency is a positive feature. The correlational risk highlighted in the stress-test note, however, keeps the immediate significance modest until the methods are clarified.
major comments (1)
- [Methods (information flow tracing)] The central claims about sequential/parallel routing and the precise 'transfer points' at which tokens can be discarded rest on the tracing technique. The abstract and available description give no indication that the procedure employs causal interventions (activation patching, counterfactual replacement, or layer-wise ablation) rather than attention weights or activation correlations. Without such interventions, the reported pathways and discarding results remain correlational and the load-bearing claims about information flow are not yet secured.
minor comments (1)
- [Abstract] The abstract states results 'hold across multiple models and scales' but does not list the exact tasks/datasets or report error bars; adding these details would strengthen readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the tracing methodology. We address the major comment below and will revise the manuscript to improve clarity and transparency.
read point-by-point responses
-
Referee: The central claims about sequential/parallel routing and the precise 'transfer points' at which tokens can be discarded rest on the tracing technique. The abstract and available description give no indication that the procedure employs causal interventions (activation patching, counterfactual replacement, or layer-wise ablation) rather than attention weights or activation correlations. Without such interventions, the reported pathways and discarding results remain correlational and the load-bearing claims about information flow are not yet secured.
Authors: We agree that the tracing procedure described in the manuscript relies on attention weight analysis and activation correlations rather than causal interventions such as activation patching or layer-wise ablation. This makes the reported pathways and transfer points correlational in nature. In the revised version we will (1) provide a more detailed description of the exact tracing steps, (2) explicitly state the correlational character of the evidence, and (3) moderate the language of the claims about information flow to align with the strength of the supporting observations. These changes will not alter the empirical results but will make the evidential basis transparent. revision: yes
Circularity Check
No circularity: empirical measurements on existing models
full rationale
The paper reports observational findings from tracing information flow in AVLLMs (sequential pathways for video, parallel for interleaved inputs, and token discarding post-transfer). These are direct measurements on Qwen2.5-Omni and Video-SALMONN2 models at multiple scales, not quantities derived from equations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing step reduces by construction to its own inputs; the work is self-contained against external benchmarks via generalization across tasks and datasets.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
X. An, Y . Xie, K. Yang, W. Zhang, X. Zhao, Z. Cheng, Y . Wang, S. Xu, C. Chen, D. Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025
Pith/arXiv arXiv 2025
-
[2]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Pith/arXiv arXiv 2025
-
[3]
S. Basu, M. Grayson, C. Morrison, B. Nushi, S. Feizi, and D. Massiceti. Understanding information storage and transfer in multi-modal large language models.Advances in Neural Information Processing Systems, 37:7400–7426, 2024
2024
-
[4]
Z. Cheng, S. Leng, H. Zhang, Y . Xin, X. Li, G. Chen, Y . Zhu, W. Zhang, Z. Luo, D. Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024
Pith/arXiv arXiv 2024
-
[5]
Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024
Pith/arXiv arXiv 2024
-
[6]
A. Das, A. Bulat, A. Baldrati, I. M. Metaxas, B. Schiele, G. Tzimiropoulos, and B. Martinez. More images, more problems? a controlled analysis of vlm failure modes.arXiv preprint arXiv:2601.07812, 2026
arXiv 2026
-
[7]
Y . Ding, Y . Ji, J. Li, X. Liu, X. Chen, J. Wu, B. Li, B. Zeng, Y . Shi, Y . Guan, et al. Omnisift: Modality-asymmetric token compression for efficient omni-modal large language models.arXiv preprint arXiv:2602.04804, 2026
Pith/arXiv arXiv 2026
-
[8]
Elhage, N
N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y . Bai, A. Chen, T. Conerly, et al. A mathematical framework for transformer circuits.Transformer Circuits Thread, 1(1):12, 2021
2021
-
[9]
C. Fu, H. Lin, X. Wang, Y .-F. Zhang, Y . Shen, X. Liu, H. Cao, Z. Long, H. Gao, K. Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025
Pith/arXiv arXiv 2025
-
[10]
M. Geva, J. Bastings, K. Filippova, and A. Globerson. Dissecting recall of factual associations in auto- regressive language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, 2023
2023
-
[11]
Ghosh, S
S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha. Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6288–6313, 2024
2024
-
[12]
A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models.arXiv preprint arXiv:2507.08128, 2025
Pith/arXiv arXiv 2025
-
[13]
C. Gong, D. Wang, Z. Wei, Y . Guo, H. Zhu, and J. Chen. Echoingpixels: Cross-modal adaptive token reduction for efficient audio-visual llms.arXiv preprint arXiv:2512.10324, 2025
Pith/arXiv arXiv 2025
-
[14]
K. Gong, K. Feng, B. Li, Y . Wang, M. Cheng, S. Yang, J. Han, B. Wang, Y . Bai, Z. Yang, et al. Av- odyssey bench: Can your multimodal llms really understand audio-visual information?arXiv preprint arXiv:2412.02611, 2024. 10
arXiv 2024
-
[15]
Y . Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. Glass. Listen, think, and understand.arXiv preprint arXiv:2305.10790, 2023
arXiv 2023
-
[16]
X. Gu, T. Pang, C. Du, Q. Liu, F. Zhang, C. Du, Y . Wang, and M. Lin. When attention sink emerges in language models: An empirical view.arXiv preprint arXiv:2410.10781, 2024
Pith/arXiv arXiv 2024
-
[17]
J. Hong, S. Yan, J. Cai, X. Jiang, Y . Hu, and W. Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms.arXiv preprint arXiv:2502.04326, 2025
Pith/arXiv arXiv 2025
-
[18]
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024
Pith/arXiv arXiv 2024
-
[19]
Kaduri, S
O. Kaduri, S. Bagon, and T. Dekel. What’s in the image? a deep-dive into the vision of vision language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14549–14558, 2025
2025
-
[20]
S. Kang, J. Kim, J. Kim, and S. J. Hwang. See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321, 2025
arXiv 2025
-
[21]
M. Kim, T. Kim, and B. Han. Map the flow: Revealing hidden pathways of information in videollms. arXiv preprint arXiv:2510.13251, 2025
arXiv 2025
-
[22]
M. Lee, Y . Park, D. Hwang, Y . Kim, S. J. Oh, and J. Choe. Enhancing multi-image understanding through delimiter token scaling.arXiv preprint arXiv:2602.01984, 2026
arXiv 2026
- [23]
-
[24]
B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y . Li, Z. Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024
Pith/arXiv arXiv 2024
-
[25]
C. Li, Y . Chen, Y . Ji, J. Xu, Z. Cui, S. Li, Y . Zhang, W. Wang, Z. Song, D. Zhang, et al. Omnivideobench: Towards audio-visual understanding evaluation for omni mllms.arXiv preprint arXiv:2510.10689, 2025
arXiv 2025
-
[26]
Y . Li, X. Chen, S. Jiang, H. Shi, Z. Liu, X. Zhang, N. Deng, Z. Xu, Y . Ma, M. Zhang, et al. Uni-moe- 2.0-omni: Scaling language-centric omnimodal large model with advanced moe, training and data.arXiv preprint arXiv:2511.12609, 2025
arXiv 2025
-
[27]
Y . Li, Y . Ma, G. Zhang, R. Yuan, K. Zhu, H. Guo, Y . Liang, J. Liu, Z. Wang, J. Yang, et al. Omnibench: Towards the future of universal omni-language models.arXiv preprint arXiv:2409.15272, 2024
arXiv 2024
-
[28]
H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024
2024
-
[29]
Z. Liu, Y . Dong, J. Wang, Z. Liu, W. Hu, J. Lu, and Y . Rao. Ola: Pushing the frontiers of omni-modal language model.arXiv preprint arXiv:2502.04328, 2025
arXiv 2025
- [30]
-
[31]
N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability.arXiv preprint arXiv:2301.05217, 2023
Pith/arXiv arXiv 2023
-
[32]
C. Neo, L. Ong, P. Torr, M. Geva, D. Krueger, and F. Barez. Towards interpreting visual information processing in vision-language models.arXiv preprint arXiv:2410.07149, 2024
arXiv 2024
-
[33]
L. T. P. Nguyen, Z. Yu, S. L. Y . Hang, S. An, J. Lee, Y . Ban, S. Chung, T.-H. Nguyen, J. Maeng, S. Lee, et al. See, hear, and understand: Benchmarking audiovisual human speech understanding in multimodal large language models.arXiv preprint arXiv:2512.02231, 2025
Pith/arXiv arXiv 2025
-
[34]
Y . Nikankin, D. Arad, Y . Gandelsman, and Y . Belinkov. Same task, different circuits: Disentangling modality-specific mechanisms in vlms.arXiv preprint arXiv:2506.09047, 2025
arXiv 2025
-
[35]
Y . Park, M. Lee, S. Chun, and J. Choe. Mitigating cross-image information leakage in lvlms for multi-image tasks.arXiv preprint arXiv:2508.13744, 2025
Pith/arXiv arXiv 2025
-
[36]
D. Rai, Y . Zhou, S. Feng, A. Saparov, and Z. Yao. A practical review of mechanistic interpretability for transformer-based language models.arXiv preprint arXiv:2407.02646, 2024. 11
arXiv 2024
-
[37]
R. Selvakumar, K. Jayakumar, S. Sakshi, S. Ghosh, R. Gao, and D. Manocha. Do audio-visual large language models really see and hear?arXiv preprint arXiv:2604.02605, 2026
Pith/arXiv arXiv 2026
-
[38]
L. Sharkey, B. Chughtai, J. Batson, J. Lindsey, J. Wu, L. Bushnaq, N. Goldowsky-Dill, S. Heimersheim, A. Ortega, J. Bloom, et al. Open problems in mechanistic interpretability.arXiv preprint arXiv:2501.16496, 2025
Pith/arXiv arXiv 2025
-
[39]
M. Sun, X. Chen, J. Z. Kolter, and Z. Liu. Massive activations in large language models.arXiv preprint arXiv:2402.17762, 2024
Pith/arXiv arXiv 2024
-
[40]
C. Tang, Y . Li, Y . Yang, J. Zhuang, G. Sun, W. Li, Z. Ma, and C. Zhang. video-salmonn 2: Caption- enhanced audio-visual large language models.arXiv preprint arXiv:2506.15220, 2025
arXiv 2025
-
[41]
C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang. Salmonn: Towards generic hearing abilities for large language models.arXiv preprint arXiv:2310.13289, 2023
Pith/arXiv arXiv 2023
-
[42]
K. Tao, K. Shao, B. Yu, W. Wang, H. Wang, et al. Omnizip: Audio-guided dynamic token compression for fast omnimodal large language models.arXiv preprint arXiv:2511.14582, 2025
Pith/arXiv arXiv 2025
-
[43]
G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
Pith/arXiv arXiv 2023
-
[44]
Q. Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026
Pith/arXiv arXiv 2026
-
[45]
S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024
2024
-
[46]
W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025
Pith/arXiv arXiv 2025
-
[47]
Y . Wei, Y . Miao, D. Zhou, and D. Hu. Moka: Multimodal low-rank adaptation for mllms.arXiv preprint arXiv:2506.05191, 2025
arXiv 2025
-
[48]
G. Xiao, Y . Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023
Pith/arXiv arXiv 2023
-
[49]
J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y . Fan, K. Dang, et al. Qwen2. 5-omni technical report.arXiv preprint arXiv:2503.20215, 2025
Pith/arXiv arXiv 2025
-
[50]
J. Xu, Z. Guo, H. Hu, Y . Chu, X. Wang, J. He, Y . Wang, X. Shi, T. He, X. Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025
Pith/arXiv arXiv 2025
-
[51]
Y . Yang, J. Zhuang, G. Sun, C. Tang, Y . Li, P. Li, Y . Jiang, W. Li, Z. Ma, and C. Zhang. Audio-centric video understanding benchmark without text shortcut. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6580–6598, 2025
2025
- [52]
-
[53]
Q. Ye, Z. Yu, R. Shao, X. Xie, P. Torr, and X. Cao. Cat: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios. InEuropean Conference on Computer Vision, pages 146–164. Springer, 2024
2024
-
[54]
B. Zhang, K. Li, Z. Cheng, Z. Hu, Y . Yuan, G. Chen, S. Leng, Y . Jiang, H. Zhang, X. Li, et al. Vide- ollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025
Pith/arXiv arXiv 2025
-
[55]
Zhang, S
Z. Zhang, S. Yadav, F. Han, and E. Shutova. Cross-modal information flow in multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19781–19791, 2025
2025
-
[56]
after the man in the grey shirt wiggles his fingers
Z. Zhou, R. Wang, Z. Wu, and Y .-G. Jiang. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities.arXiv preprint arXiv:2505.17862, 2025. 12 A Related works A.1 Audio-visual large language models (A VLLMs) Audio-visual large language models (A VLLMs) extend the multimodal LLM paradigm to jointly pro- cess audio and visual inpu...
arXiv 2025
-
[57]
Parse media placeholders.We scan the prompt to locate each media placeholder (e.g., [img1], [audio1]) and identify the single reference media along with the candidate media in the opposite modality
-
[58]
Split into ordered segments.Using the located media placeholders as boundaries, we split the prompt into an ordered sequence of segments alternating between media spans ( candidates and reference ) and text spans
-
[59]
which best matches
Identify the question text.Among the text segments, we assign the role of question textto the segment that describes the actual matching task. If the reference media is the final media in the prompt, we prefer the text segment immediately preceding it (which typically contains the matching prompt, e.g.,“which best matches”); otherwise, we select 16 the lo...
-
[60]
Construct the structure label.The final segment ordering, for example candidates , question , reference , is recorded as the sample’s structure label. We use the resulting structure labels to bucket samples for both the main-paper analysis (which uses the most common ordering, candidates , question , reference ) and the per-ordering breakdowns reported in...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.