From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Muhammad Awais; Sara Atito; Wish Suharitdamrong; Xiatian Zhu

arXiv: 2606.10147 · v1 · pith:6HWKNL52new · submitted 2026-06-08 · 💻 cs.AI · cs.CL · cs.CV · cs.SD

Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

Pith reviewed 2026-06-27 16:10 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV · cs.SD

keywords audio-visual LLMs · information flow · multimodal perception · model interpretability · token pruning · AVLLMs · sequential routing

The pith

AVLLMs route information sequentially for audio-visual videos but switch to parallel streams for interleaved items, and their sensory tokens can be discarded once their information has been transferred to the LLM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces how audio and visual signals travel through Audio-Visual Large Language Models to affect final predictions. It studies two input setups: single audio-visual videos and multiple interleaved audio-visual items. In the video case, the models use the same sequential pathway as vision-language models, with each modality contributing according to task demands. Interleaved inputs lead to information traveling in separate parallel streams. The findings also show that audio-visual tokens can be removed after their information has been passed to the language model, causing little or no drop in performance across tasks.

Core claim

For audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model's prediction or even slight improvement.

What carries the argument

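Layer-wise tracing of information flow, which the figure captions describe as attention knockout (blocking attention edges from source tokens to target tokens and measuring the change in prediction probability), used to identify how audio and visual tokens are routed and integrated into the final decision.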
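
The figure captions on this page describe the tracing as attention knockout, where `Source̸→Target` means blocking attention edges from source tokens to target tokens. A minimal sketch of that masking step in a single toy attention head, written in plain Python for illustration; the function names, the four-token layout, and the vectors are assumptions and are not taken from the paper's code:

```python
import math


def softmax(scores):
    """Numerically stable softmax; -inf scores receive zero weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v, blocked=frozenset()):
    """Single-head causal attention over lists of vectors.

    `blocked` holds (target, source) position pairs whose score is set to
    -inf before the softmax. Each target must keep at least one source
    (for example itself), otherwise the softmax is undefined.
    """
    n, d = len(q), len(q[0])
    out = []
    for t in range(n):
        scores = []
        for s in range(n):
            if s > t or (t, s) in blocked:
                scores.append(-math.inf)  # future token or knocked-out edge
            else:
                dot = sum(a * b for a, b in zip(q[t], k[s]))
                scores.append(dot / math.sqrt(d))
        w = softmax(scores)
        out.append([sum(w[s] * v[s][j] for s in range(n)) for j in range(len(v[0]))])
    return out


# Hypothetical layout: positions 0-1 are sensory tokens, 2 is the question,
# 3 is the last token. The same toy vectors serve as queries, keys and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

base = causal_attention(x, x, x)
# Knock out Sensory -/-> Last: the last token can no longer read positions 0-1.
ko = causal_attention(x, x, x, blocked={(3, 0), (3, 1)})

print("last token, baseline:", base[3])
print("last token, knockout:", ko[3])
```

Only the row for the last token changes; the other positions are untouched because none of their edges were blocked. In a real AVLLM the same mask would have to be applied to every head across a chosen window of layers, which is not shown here.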

If this is right

Audio-visual tokens can be pruned after information transfer for more efficient inference without significant performance loss.
The routing adapts to the input configuration: a sequential pathway for a single audio-visual video, parallel streams for multiple interleaved items.
Modality contributions scale with how much each is needed for the specific task.
The patterns hold across two model families (Qwen2.5-Omni and Video-SALMONN2 Plus) at 3B and 7B scales, suggesting a general structure in AVLLMs.
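
The pruning consequence can be pictured as dropping sensory positions from the sequence once a chosen layer has run. A minimal sketch, assuming a hypothetical `transfer_layer` index and toy stand-in layers; how the paper selects the transfer point, and how a real decoder would update its KV cache and position ids, is not shown on this page:

```python
def run_with_pruning(layers, hidden, token_types, transfer_layer, drop_types):
    """Run `layers` in order; after layer `transfer_layer`, drop every
    position whose token type is listed in `drop_types`."""
    types = list(token_types)
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        if idx == transfer_layer:
            keep = [i for i, t in enumerate(types) if t not in drop_types]
            hidden = [hidden[i] for i in keep]
            types = [types[i] for i in keep]
    return hidden, types


def causal_pairs(n):
    """Number of (target, source) attention pairs for n tokens under a causal mask."""
    return n * (n + 1) // 2


# Hypothetical 15-token prompt: system, video, audio, question, last token.
token_types = ["system"] + ["video"] * 6 + ["audio"] * 3 + ["question"] * 4 + ["last"]
hidden = [[0.0] for _ in token_types]

# Stand-in "layers" that add 1 to every value, so the layer count stays visible.
layers = [lambda h: [[x + 1.0 for x in row] for row in h] for _ in range(4)]

out, kept_types = run_with_pruning(
    layers, hidden, token_types, transfer_layer=1, drop_types=("video", "audio")
)

print(len(token_types), "->", len(out), "tokens after pruning")
print(causal_pairs(len(token_types)), "->", causal_pairs(len(out)), "attention pairs per later layer")
```

The pair count is only a rough proxy for attention cost; actual savings depend on the implementation.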

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These flow patterns could inform the design of more efficient multimodal architectures by prioritizing certain connections.
Similar tracing methods might be applied to understand information flow in other combinations of modalities.
Token discarding opens possibilities for dynamic input management during model operation.

Load-bearing premise

The technique for tracing information flow correctly identifies the causal contributions of audio and visual inputs rather than just their statistical associations with the output.

What would settle it

Ablating the attention or activation paths identified as carrying the audio-visual information and checking whether the model's accuracy on relevant tasks decreases as expected.
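
The check described above amounts to comparing the probability of the correct answer with and without a blocked pathway. A minimal sketch, following the figure captions' convention that `Source̸→Target` blocks attention edges from source tokens to target tokens; the token layout, span names, and probabilities are hypothetical placeholders for real forward passes:

```python
def knockout_edges(source, target):
    """Causal (target_pos, source_pos) pairs to block for Source -/-> Target."""
    return {(t, s) for t in target for s in source if s <= t}


def pct_change(p_base, p_knockout):
    """Relative change in the correct answer's probability, in percent."""
    return 100.0 * (p_knockout - p_base) / p_base


# Hypothetical token layout: video, then audio, then question, then last token.
spans = {
    "video": range(0, 4),
    "audio": range(4, 6),
    "question": range(6, 9),
    "last": range(9, 10),
}

video_to_last = knockout_edges(spans["video"], spans["last"])
video_to_question = knockout_edges(spans["video"], spans["question"])

# Made-up probabilities standing in for two forward passes of a real model.
p_base, p_knockout = 0.80, 0.60
drop = pct_change(p_base, p_knockout)

print(len(video_to_last), "edges blocked for Video -/-> Last")
print(len(video_to_question), "edges blocked for Video -/-> Question")
print("change in probability (%):", round(drop, 1))
```

A large negative change for one pathway and a near-zero change for another is the kind of contrast the knockout figures report per layer.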

Figures

Figures reproduced from arXiv: 2606.10147 by Muhammad Awais, Sara Atito, Wish Suharitdamrong, Xiatian Zhu.

**Figure 2.** Vision sinks share the same hidden-state activation as the language sinks. (Left) Hidden state L2 norm at layer 31, with vision sink tokens marked by red circles. (Right) Magnitude per hidden dimension for a system sink and a vision sink, with massive activation peaks at dimensions 318, 1874, and 1819 for both, on Qwen2.5-Omni 3B. Finding 1: Attention allocation is not a reliable indicator of information f…

**Figure 3.** Within- and cross-modal interactions concentrate at early-to-middle layers. Change in prediction probability when disconnecting within-modality (Cross-frame, Cross-audio) and direct cross-modal (Audio̸↔Video) pathways, across layers and five AV-SpeakerBench tasks. Cross-frame attention contributes across all tasks, while cross-modal effects vary by task. First, we investigate whether and where the two moda…

**Figure 4.** Overall audio-visual information flow in AVLLMs. Change in prediction probability across knockouts targeting the question and the last token. Source̸→Target indicates blocking attention edges from source tokens to target tokens. The flow follows a single sequential pathway with no direct flow from the modalities to the last token. Next, we trace how audio and visual information reach the prediction …

**Figure 5.** Multi-input interleaved information aggregates at the late-positioned token. At mid layers, candidates interact among themselves (Cross-Candidate), and both candidates and question transfer their content to the reference. At late layers, only the reference reaches the last token. To trace this flow, we apply attention knockout (Section 4.1) across candidates, question, and reference …

**Figure 6.** The prediction flows through the option letters. (a-b) At mid layers, the correct option letter (CorrectOpt) aggregates from the correct and incorrect candidates (CorrectCand, IncorrectCand) and the reference. (c-d) At late layers, the last token reads from both correct and incorrect option letters, with the competition between them driving the prediction. Next, we trace how the model selects the correct o…

**Figure 7.** L2 norm distribution across token positions at four representative layers (0, 15, 30, 31).

**Figure 8.** Window size ablation on the Speaker Recognition task. Each panel shows the relative change …

**Figure 9.** Additional knockout analysis on Qwen2.5-Omni 3B.

**Figure 10.** Component-level question-internal knockouts on Qwen2.5-Omni 3B.

**Figure 11.** Qwen2.5-Omni 3B on WorldSense. Knockout of within- and cross-modal pathways (Cross-frame, Cross-audio segment, Audio↔Video) and of modality and question pathways into the last token (Video̸→Question, Audio̸→Question, Question̸→Last, Video̸→Last, Audio̸→Last). Source̸→Target indicates blocking attention edges from source tokens to target tokens.

**Figure 12.** Qwen2.5-Omni 3B on WorldSense. Modality and question pathways into the correct option letter (Video̸→TrueOpt, Audio̸→TrueOpt, NonOptQ̸→TrueOpt, V+A̸→TrueOpt); modality pathways into the non-option question text (Video̸→NonOptQ, Audio̸→NonOptQ, V+A̸→NonOptQ); and question-internal pathways into the last token (TrueOpt̸→Last, FalseOpt̸→Last, NonOptQ̸→Last). Source̸→Target indicates blocking attention edges …

**Figure 13.** Covers pathways into the Reference, the Question, and the last token, while …

**Figure 14.** Qwen2.5-Omni 3B per-task multi-input knockout (AV-Odyssey). Pathways into the correct option letter (CorrectCand̸→CorrectOpt, IncorrectCand̸→CorrectOpt, Reference̸→CorrectOpt, Question̸→CorrectOpt) and finer-grained pathways into the last token (CorrectCand̸→Last, IncorrectCand̸→Last, CorrectOpt̸→Last, IncorrectOpt̸→Last). Each panel shows one task under one input ordering (I Ref → A Cand or A Ref → I Ca…

**Figure 15.** Qwen2.5-Omni 7B on AV-SpeakerBench. Knockout of within- and cross-modal pathways (Cross-frame, Cross-audio segment, Audio↔Video) and of modality and question pathways into the last token (Video̸→Question, Audio̸→Question, Question̸→Last, Video̸→Last, Audio̸→Last). Source̸→Target indicates blocking attention edges from source tokens to target tokens.

**Figure 16.**

**Figure 17.** Qwen2.5-Omni 7B on WorldSense. Knockout of within- and cross-modal pathways (Cross-frame, Cross-audio segment, Audio↔Video) and of modality and question pathways into the last token (Video̸→Question, Audio̸→Question, Question̸→Last, Video̸→Last, Audio̸→Last). Source̸→Target indicates blocking attention edges from source tokens to target tokens.

**Figure 18.** Qwen2.5-Omni 7B on WorldSense. Modality and question pathways into the correct option letter (Video̸→TrueOpt, Audio̸→TrueOpt, NonOptQ̸→TrueOpt, V+A̸→TrueOpt); modality pathways into the non-option question text (Video̸→NonOptQ, Audio̸→NonOptQ, V+A̸→NonOptQ); and question-internal pathways into the last token (TrueOpt̸→Last, FalseOpt̸→Last, NonOptQ̸→Last). Source̸→Target indicates blocking attention edges …

**Figure 19.** Reports the multi-input knockout averaged across tasks, matching the reporting style of the main paper for Qwen2.5-Omni 3B. The same set of pathways into the Reference, Question, correct option letter, and last token is shown. Panels: (a) I Ref → A Cand, (b) A Ref → I Cand. Axes: Layer (x), % Change in Probability (y).

**Figure 20.** Qwen2.5-Omni 7B per-task multi-input knockout (AV-Odyssey). Pathways into the Reference and Question (Cross-Candidate, Candidates̸→Reference, Question̸→Reference, Candidates̸→Question) and pathways into the last token (Reference̸→Last, Candidates̸→Last, Question̸→Last). Each panel shows one task under one input ordering (I Ref → A Cand or A Ref → I Cand). Source̸→Target indicates blocking attention edges …

**Figure 21.** Qwen2.5-Omni 7B per-task multi-input knockout (AV-Odyssey). Pathways into the correct option letter (CorrectCand̸→CorrectOpt, IncorrectCand̸→CorrectOpt, Reference̸→CorrectOpt, Question̸→CorrectOpt) and finer-grained pathways into the last token (CorrectCand̸→Last, IncorrectCand̸→Last, CorrectOpt̸→Last, IncorrectOpt̸→Last). Each panel shows one task under one input ordering (I Ref → A Cand or A Ref → I Ca…

**Figure 22.** Video-SALMONN2 3B Plus on AV-SpeakerBench. Knockout of within- and cross-modal pathways (Cross-frame, Cross-audio segment, Audio↔Video) and of modality and question pathways into the last token (Video̸→Question, Audio̸→Question, Question̸→Last, Video̸→Last, Audio̸→Last). Source̸→Target indicates blocking attention edges from source tokens to target tokens.

**Figure 23.**

**Figure 24.** Video-SALMONN2 3B Plus on WorldSense. Knockout of within- and cross-modal pathways (Cross-frame, Cross-audio segment, Audio↔Video) and of modality and question pathways into the last token (Video̸→Question, Audio̸→Question, Question̸→Last, Video̸→Last, Audio̸→Last). Source̸→Target indicates blocking attention edges from source tokens to target tokens.

**Figure 25.** Video-SALMONN2 3B Plus on WorldSense. Modality and question pathways into the correct option letter (Video̸→TrueOpt, Audio̸→TrueOpt, NonOptQ̸→TrueOpt, V+A̸→TrueOpt); modality pathways into the non-option question text (Video̸→NonOptQ, Audio̸→NonOptQ, V+A̸→NonOptQ); and question-internal pathways into the last token (TrueOpt̸→Last, FalseOpt̸→Last, NonOptQ̸→Last). Source̸→Target indicates blocking attentio…

**Figure 26.** Video-SALMONN2 7B Plus on AV-SpeakerBench. Knockout of within- and cross-modal pathways (Cross-frame, Cross-audio segment, Audio↔Video) and of modality and question pathways into the last token (Video̸→Question, Audio̸→Question, Question̸→Last, Video̸→Last, Audio̸→Last). Source̸→Target indicates blocking attention edges from source tokens to target tokens.

**Figure 27.**

**Figure 28.** Video-SALMONN2 7B Plus on WorldSense. Knockout of within- and cross-modal pathways (Cross-frame, Cross-audio segment, Audio↔Video) and of modality and question pathways into the last token (Video̸→Question, Audio̸→Question, Question̸→Last, Video̸→Last, Audio̸→Last). Source̸→Target indicates blocking attention edges from source tokens to target tokens.

**Figure 29.** Video-SALMONN2 7B Plus on WorldSense. Modality and question pathways into the correct option letter (Video̸→TrueOpt, Audio̸→TrueOpt, NonOptQ̸→TrueOpt, V+A̸→TrueOpt); modality pathways into the non-option question text (Video̸→NonOptQ, Audio̸→NonOptQ, V+A̸→NonOptQ); and question-internal pathways into the last token (TrueOpt̸→Last, FalseOpt̸→Last, NonOptQ̸→Last). Source̸→Target indicates blocking attentio…
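
The caption of Figure 2 describes sink tokens as having large hidden-state L2 norms, with peaks concentrated in a few shared dimensions. A minimal sketch of how such tokens might be flagged, assuming a norm threshold of ten times the median (an arbitrary choice for illustration, not a value from the paper) and toy hidden states with made-up dimensions:

```python
import math
from statistics import median


def l2_norms(hidden):
    """L2 norm of each token's hidden-state vector."""
    return [math.sqrt(sum(x * x for x in row)) for row in hidden]


def find_sinks(hidden, ratio=10.0):
    """Positions whose norm exceeds `ratio` times the median norm.
    The ratio is an assumed heuristic, not a threshold from the paper."""
    norms = l2_norms(hidden)
    cutoff = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


def top_dims(row, k=3):
    """Indices of the k largest-magnitude hidden dimensions of one token."""
    return sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)[:k]


# Toy hidden states: 6 tokens x 8 dimensions. Tokens 0 and 3 carry
# massive activations in dimensions 2 and 5; the rest are small.
hidden = [[0.1] * 8 for _ in range(6)]
for pos in (0, 3):
    hidden[pos][2] = 50.0
    hidden[pos][5] = -40.0

sinks = find_sinks(hidden)
print("sink positions:", sinks)
for pos in sinks:
    print("token", pos, "peak dimensions:", top_dims(hidden[pos], k=2))
```

Comparing the peak dimensions of a system-prompt sink and a vision sink in this way is one simple route to the kind of overlap the caption reports.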

Original abstract

Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper maps sequential vs parallel audio-visual routing in AVLLMs plus a pruning trick that holds across models, but the tracing looks correlational rather than causal.

The main point is that AVLLMs route audio and visual tokens sequentially along the established VLM path when the input is a video clip, with contributions scaling to task needs, but switch to separate parallel streams when the input consists of interleaved audio-visual items. Tokens of all types can then be dropped after their information reaches the LLM layers, with little or no drop in accuracy and sometimes a small gain. This pattern appears in both Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales.

What is actually new is the explicit contrast between the two input regimes and the claim that the pruning generalizes across tasks and datasets. The work builds directly on prior VLM and VideoLLM flow studies rather than inventing a new framework, which keeps the contribution focused.

The soft spot is the tracing procedure itself. If the measurements rely on attention weights or activation correlations without layer-wise interventions such as patching or token replacement, the pathways remain descriptive rather than proven causal drivers. The pruning result could then be sensitive to how the transfer point is chosen. The abstract does not supply enough method detail to settle this, so the stress-test concern applies.

The paper is for people working on multimodal interpretability and efficiency. It has enough cross-model empirical consistency to merit a serious referee, though the methods section will need close examination on the causal status of the flow claims.

Referee Report

1 major / 1 minor

Summary. The manuscript examines internal information flow in Audio-Visual Large Language Models (AVLLMs) for two configurations: audio-visual video inputs and multiple interleaved audio-visual items. It reports that AVLLMs follow the sequential pathway established for VLMs/VideoLLMs in the video case, with audio/visual contributions scaling proportionally to task modality reliance; routing shifts to parallel streams under interleaving. It further claims that audio-visual and other token types can be discarded after information transfer to the LLM with minimal or no performance degradation (sometimes improvement), generalizing across Qwen2.5-Omni and Video-SALMONN2 Plus at 3B/7B scales and multiple tasks/datasets.

Significance. If the tracing procedure is shown to be causal, the work supplies the first coherent empirical map of audio-visual integration inside AVLLMs and directly supports efficiency gains via post-transfer pruning. Cross-model and cross-scale consistency is a positive feature. The correlational risk highlighted in the stress-test note, however, keeps the immediate significance modest until the methods are clarified.

major comments (1)

[Methods (information flow tracing)] The central claims about sequential/parallel routing and the precise 'transfer points' at which tokens can be discarded rest on the tracing technique. The abstract and available description give no indication that the procedure employs causal interventions (activation patching, counterfactual replacement, or layer-wise ablation) rather than attention weights or activation correlations. Without such interventions, the reported pathways and discarding results remain correlational and the load-bearing claims about information flow are not yet secured.

minor comments (1)

[Abstract] The abstract states results 'hold across multiple models and scales' but does not list the exact tasks/datasets or report error bars; adding these details would strengthen readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the tracing methodology. We address the major comment below and will revise the manuscript to improve clarity and transparency.

Referee: The central claims about sequential/parallel routing and the precise 'transfer points' at which tokens can be discarded rest on the tracing technique. The abstract and available description give no indication that the procedure employs causal interventions (activation patching, counterfactual replacement, or layer-wise ablation) rather than attention weights or activation correlations. Without such interventions, the reported pathways and discarding results remain correlational and the load-bearing claims about information flow are not yet secured.

Authors: We agree that the tracing procedure described in the manuscript relies on attention weight analysis and activation correlations rather than causal interventions such as activation patching or layer-wise ablation. This makes the reported pathways and transfer points correlational in nature. In the revised version we will (1) provide a more detailed description of the exact tracing steps, (2) explicitly state the correlational character of the evidence, and (3) moderate the language of the claims about information flow to align with the strength of the supporting observations. These changes will not alter the empirical results but will make the evidential basis transparent.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on existing models

The paper reports observational findings from tracing information flow in AVLLMs (sequential pathways for video, parallel for interleaved inputs, and token discarding post-transfer). These are direct measurements on Qwen2.5-Omni and Video-SALMONN2 models at multiple scales, not quantities derived from equations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing step reduces by construction to its own inputs; the work is self-contained against external benchmarks via generalization across tasks and datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical interpretability study; it introduces no new mathematical axioms, free parameters fitted to data, or postulated entities. All claims rest on standard attention-based tracing and ablation methods applied to existing open models.

Reference graph

Works this paper leans on

60 extracted references · 27 linked inside Pith

[1]

X. An, Y . Xie, K. Yang, W. Zhang, X. Zhao, Z. Cheng, Y . Wang, S. Xu, C. Chen, D. Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

Pith/arXiv arXiv 2025
[2]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[3]

S. Basu, M. Grayson, C. Morrison, B. Nushi, S. Feizi, and D. Massiceti. Understanding information storage and transfer in multi-modal large language models.Advances in Neural Information Processing Systems, 37:7400–7426, 2024

2024
[4]

Cheng, S

Z. Cheng, S. Leng, H. Zhang, Y . Xin, X. Li, G. Chen, Y . Zhu, W. Zhang, Z. Luo, D. Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

Pith/arXiv arXiv 2024
[5]

Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

Pith/arXiv arXiv 2024
[6]

A. Das, A. Bulat, A. Baldrati, I. M. Metaxas, B. Schiele, G. Tzimiropoulos, and B. Martinez. More images, more problems? a controlled analysis of vlm failure modes.arXiv preprint arXiv:2601.07812, 2026

arXiv 2026
[7]

Y . Ding, Y . Ji, J. Li, X. Liu, X. Chen, J. Wu, B. Li, B. Zeng, Y . Shi, Y . Guan, et al. Omnisift: Modality-asymmetric token compression for efficient omni-modal large language models.arXiv preprint arXiv:2602.04804, 2026

Pith/arXiv arXiv 2026
[8]

Elhage, N

N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y . Bai, A. Chen, T. Conerly, et al. A mathematical framework for transformer circuits.Transformer Circuits Thread, 1(1):12, 2021

2021
[9]

C. Fu, H. Lin, X. Wang, Y .-F. Zhang, Y . Shen, X. Liu, H. Cao, Z. Long, H. Gao, K. Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025

Pith/arXiv arXiv 2025
[10]

M. Geva, J. Bastings, K. Filippova, and A. Globerson. Dissecting recall of factual associations in auto- regressive language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, 2023

2023
[11]

Ghosh, S

S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha. Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6288–6313, 2024

2024
[12]

A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models.arXiv preprint arXiv:2507.08128, 2025

Pith/arXiv arXiv 2025
[13]

C. Gong, D. Wang, Z. Wei, Y . Guo, H. Zhu, and J. Chen. Echoingpixels: Cross-modal adaptive token reduction for efficient audio-visual llms.arXiv preprint arXiv:2512.10324, 2025

Pith/arXiv arXiv 2025
[14]

K. Gong, K. Feng, B. Li, Y . Wang, M. Cheng, S. Yang, J. Han, B. Wang, Y . Bai, Z. Yang, et al. Av- odyssey bench: Can your multimodal llms really understand audio-visual information?arXiv preprint arXiv:2412.02611, 2024. 10

arXiv 2024
[15]

Y . Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. Glass. Listen, think, and understand.arXiv preprint arXiv:2305.10790, 2023

arXiv 2023
[16]

X. Gu, T. Pang, C. Du, Q. Liu, F. Zhang, C. Du, Y . Wang, and M. Lin. When attention sink emerges in language models: An empirical view.arXiv preprint arXiv:2410.10781, 2024

Pith/arXiv arXiv 2024
[17]

J. Hong, S. Yan, J. Cai, X. Jiang, Y . Hu, and W. Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms.arXiv preprint arXiv:2502.04326, 2025

Pith/arXiv arXiv 2025
[18]

Hurst, A

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024
[19]

Kaduri, S

O. Kaduri, S. Bagon, and T. Dekel. What’s in the image? a deep-dive into the vision of vision language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14549–14558, 2025

2025
[20]

S. Kang, J. Kim, J. Kim, and S. J. Hwang. See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321, 2025

arXiv 2025
[21]

M. Kim, T. Kim, and B. Han. Map the flow: Revealing hidden pathways of information in videollms. arXiv preprint arXiv:2510.13251, 2025

arXiv 2025
[22]

M. Lee, Y . Park, D. Hwang, Y . Kim, S. J. Oh, and J. Choe. Enhancing multi-image understanding through delimiter token scaling.arXiv preprint arXiv:2602.01984, 2026

arXiv 2026
[23]

Li and T

B. Li and T. Huang. Dash: Dynamic audio-driven semantic chunking for efficient omnimodal token compression.arXiv preprint arXiv:2603.15685, 2026

arXiv 2026
[24]

B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y . Li, Z. Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

Pith/arXiv arXiv 2024
[25]

C. Li, Y . Chen, Y . Ji, J. Xu, Z. Cui, S. Li, Y . Zhang, W. Wang, Z. Song, D. Zhang, et al. Omnivideobench: Towards audio-visual understanding evaluation for omni mllms.arXiv preprint arXiv:2510.10689, 2025

arXiv 2025
[26]

Y . Li, X. Chen, S. Jiang, H. Shi, Z. Liu, X. Zhang, N. Deng, Z. Xu, Y . Ma, M. Zhang, et al. Uni-moe- 2.0-omni: Scaling language-centric omnimodal large model with advanced moe, training and data.arXiv preprint arXiv:2511.12609, 2025

arXiv 2025
[27]

Y . Li, Y . Ma, G. Zhang, R. Yuan, K. Zhu, H. Guo, Y . Liang, J. Liu, Z. Wang, J. Yang, et al. Omnibench: Towards the future of universal omni-language models.arXiv preprint arXiv:2409.15272, 2024

arXiv 2024
[28]

H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

2024
[29]

Z. Liu, Y . Dong, J. Wang, Z. Liu, W. Hu, J. Lu, and Y . Rao. Ola: Pushing the frontiers of omni-modal language model.arXiv preprint arXiv:2502.04328, 2025

arXiv 2025
[30]

Luo, W.-C

J. Luo, W.-C. Fan, L. Wang, X. He, T. Rahman, P. Abolmaesumi, and L. Sigal. To sink or not to sink: Visual information pathways in large vision-language models.arXiv preprint arXiv:2510.08510, 2025

arXiv 2025
[31]

Nanda, L

N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability.arXiv preprint arXiv:2301.05217, 2023

Pith/arXiv arXiv 2023
[32]

C. Neo, L. Ong, P. Torr, M. Geva, D. Krueger, and F. Barez. Towards interpreting visual information processing in vision-language models.arXiv preprint arXiv:2410.07149, 2024

arXiv 2024
[33]

L. T. P. Nguyen, Z. Yu, S. L. Y . Hang, S. An, J. Lee, Y . Ban, S. Chung, T.-H. Nguyen, J. Maeng, S. Lee, et al. See, hear, and understand: Benchmarking audiovisual human speech understanding in multimodal large language models.arXiv preprint arXiv:2512.02231, 2025

Pith/arXiv arXiv 2025
[34]

Nikankin, D

Y . Nikankin, D. Arad, Y . Gandelsman, and Y . Belinkov. Same task, different circuits: Disentangling modality-specific mechanisms in vlms.arXiv preprint arXiv:2506.09047, 2025

arXiv 2025
[35]

Y . Park, M. Lee, S. Chun, and J. Choe. Mitigating cross-image information leakage in lvlms for multi-image tasks.arXiv preprint arXiv:2508.13744, 2025

Pith/arXiv arXiv 2025
[36]

D. Rai, Y . Zhou, S. Feng, A. Saparov, and Z. Yao. A practical review of mechanistic interpretability for transformer-based language models.arXiv preprint arXiv:2407.02646, 2024. 11

arXiv 2024
[37]

Selvakumar, K

R. Selvakumar, K. Jayakumar, S. Sakshi, S. Ghosh, R. Gao, and D. Manocha. Do audio-visual large language models really see and hear?arXiv preprint arXiv:2604.02605, 2026

Pith/arXiv arXiv 2026
[38]

Sharkey, B

L. Sharkey, B. Chughtai, J. Batson, J. Lindsey, J. Wu, L. Bushnaq, N. Goldowsky-Dill, S. Heimersheim, A. Ortega, J. Bloom, et al. Open problems in mechanistic interpretability.arXiv preprint arXiv:2501.16496, 2025

Pith/arXiv arXiv 2025
[39]

M. Sun, X. Chen, J. Z. Kolter, and Z. Liu. Massive activations in large language models.arXiv preprint arXiv:2402.17762, 2024

Pith/arXiv arXiv 2024
[40]

C. Tang, Y . Li, Y . Yang, J. Zhuang, G. Sun, W. Li, Z. Ma, and C. Zhang. video-salmonn 2: Caption- enhanced audio-visual large language models.arXiv preprint arXiv:2506.15220, 2025

arXiv 2025
[41]

C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang. Salmonn: Towards generic hearing abilities for large language models.arXiv preprint arXiv:2310.13289, 2023

Pith/arXiv arXiv 2023
[42]

K. Tao, K. Shao, B. Yu, W. Wang, H. Wang, et al. Omnizip: Audio-guided dynamic token compression for fast omnimodal large language models.arXiv preprint arXiv:2511.14582, 2025

Pith/arXiv arXiv 2025
[43]

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023
[44]

Q. Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

Pith/arXiv arXiv 2026
[45]

S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

2024
[46]

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

[47]

Y. Wei, Y. Miao, D. Zhou, and D. Hu. Moka: Multimodal low-rank adaptation for mllms. arXiv preprint arXiv:2506.05191, 2025

[48]

G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

[49]

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025

[50]

J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025

[51]

Y. Yang, J. Zhuang, G. Sun, C. Tang, Y. Li, P. Li, Y. Jiang, W. Li, Z. Ma, and C. Zhang. Audio-centric video understanding benchmark without text shortcut. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6580–6598, 2025

[52]

H. Ye, C.-H. H. Yang, A. Goel, W. Huang, L. Zhu, Y. Su, S. Lin, A.-C. Cheng, Z. Wan, J. Tian, et al. Omnivinci: Enhancing architecture and data for omni-modal understanding llm. arXiv preprint arXiv:2510.15870, 2025

[53]

Q. Ye, Z. Yu, R. Shao, X. Xie, P. Torr, and X. Cao. Cat: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios. In European Conference on Computer Vision, pages 146–164. Springer, 2024

[54]

B. Zhang, K. Li, Z. Cheng, Z. Hu, Y. Yuan, G. Chen, S. Leng, Y. Jiang, H. Zhang, X. Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025

[55]

Z. Zhang, S. Yadav, F. Han, and E. Shutova. Cross-modal information flow in multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19781–19791, 2025

[56]

Z. Zhou, R. Wang, Z. Wu, and Y.-G. Jiang. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities. arXiv preprint arXiv:2505.17862, 2025

A Related works

A.1 Audio-visual large language models (AVLLMs)

Audio-visual large language models (AVLLMs) extend the multimodal LLM paradigm to jointly process audio and visual inpu...

[57]

Parse media placeholders. We scan the prompt to locate each media placeholder (e.g., [img1], [audio1]) and identify the single reference media along with the candidate media in the opposite modality.

[58]

Split into ordered segments. Using the located media placeholders as boundaries, we split the prompt into an ordered sequence of segments alternating between media spans (candidates and reference) and text spans.

[59]

Identify the question text. Among the text segments, we assign the role of question text to the segment that describes the actual matching task. If the reference media is the final media in the prompt, we prefer the text segment immediately preceding it (which typically contains the matching prompt, e.g., “which best matches”); otherwise, we select the lo...

[60]

Construct the structure label. The final segment ordering, for example candidates, question, reference, is recorded as the sample’s structure label. We use the resulting structure labels to bucket samples for both the main-paper analysis (which uses the most common ordering, candidates, question, reference) and the per-ordering breakdowns reported in...
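
The four labelling steps above (parse placeholders, split into segments, identify the question text, construct the structure label) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The placeholder pattern, the `Segment` class, and all function names are hypothetical. The sketch assumes the reference is the one modality that appears exactly once, and that unlabelled text segments are left out of the structure label. The rule for the case where the reference is not the final media is truncated in the source, so it is left as a caller-supplied `fallback`; the default (first text segment) is only a stand-in.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical placeholder syntax, modelled on the "[img1]" / "[audio1]" examples.
PLACEHOLDER = re.compile(r"\[(img|audio)(\d+)\]")


@dataclass
class Segment:
    kind: str                       # "media" or "text"
    text: str
    modality: Optional[str] = None  # "img" or "audio" for media segments
    role: Optional[str] = None      # "reference", "candidates", "question", or None


def split_segments(prompt: str) -> List[Segment]:
    """Steps 1 and 2: locate placeholders, then split the prompt at them."""
    matches = list(PLACEHOLDER.finditer(prompt))
    counts = {}
    for m in matches:
        counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    # Assumption: the reference is the single item of the modality that occurs once.
    singles = [mod for mod, n in counts.items() if n == 1]
    if len(counts) != 2 or len(singles) != 1:
        raise ValueError("expected one reference plus candidates in the other modality")
    ref_modality = singles[0]

    segments: List[Segment] = []
    cursor = 0
    for m in matches:
        text = prompt[cursor:m.start()].strip()
        if text:
            segments.append(Segment("text", text))
        role = "reference" if m.group(1) == ref_modality else "candidates"
        segments.append(Segment("media", m.group(0), m.group(1), role))
        cursor = m.end()
    tail = prompt[cursor:].strip()
    if tail:
        segments.append(Segment("text", tail))
    return segments


def assign_question(
    segments: List[Segment],
    fallback: Optional[Callable[[List[Segment], List[int]], int]] = None,
) -> None:
    """Step 3: mark one text segment as the question text (in place)."""
    media_idx = [i for i, s in enumerate(segments) if s.kind == "media"]
    text_idx = [i for i, s in enumerate(segments) if s.kind == "text"]
    if not text_idx:
        return
    last = media_idx[-1]
    if segments[last].role == "reference" and last > 0 and segments[last - 1].kind == "text":
        chosen = last - 1  # text immediately preceding a final reference
    elif fallback is not None:
        chosen = fallback(segments, text_idx)
    else:
        chosen = text_idx[0]  # placeholder only: the source's fallback rule is truncated
    segments[chosen].role = "question"


def structure_label(segments: List[Segment]) -> str:
    """Step 4: collapse consecutive segments with the same role into an ordering."""
    label: List[str] = []
    for s in segments:
        if s.role and (not label or label[-1] != s.role):
            label.append(s.role)
    return ", ".join(label)


segs = split_segments("[img1] [img2] [img3] which best matches [audio1]? Answer with a number.")
assign_question(segs)
print(structure_label(segs))  # candidates, question, reference
```

Samples could then be bucketed by the returned string, for example with a dictionary keyed on the label, which mirrors how the structure labels are described as grouping samples for the per-ordering breakdowns.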
