arxiv: 2605.09906 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.SD

Recognition: no theorem link

Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

Chenrui Cui, Jianwu Dang, Longbiao Wang, Long Zhou, Tianrui Wang, Xuanchen Li, Yuheng Lu, Yu Jiang, Zikang Huang

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:27 UTC · model grok-4.3

classification 💻 cs.AI cs.SD

keywords audio-visual LLMscross-modal interferencemodality-specific chain-of-thoughthallucination reductionaudio-visual question answeringreinforcement learningmodality fusion

0 comments

The pith

Enforcing separate audio and visual chain-of-thought reasoning before evidence fusion mitigates cross-modal interference in audio-visual LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Audio-visual large language models can experience interference where details from one sense distort the other, producing incorrect answers or hallucinations. The authors propose Separate First, Fuse Later, which makes the model generate independent reasoning traces for the audio input and the visual input before combining the evidence. They build labels that indicate which modality is preferred for each question and use them as an extra reward signal in reinforcement learning to promote suitable modality use. Tests on standard benchmarks show steady gains in accuracy together with stronger resistance to cross-modal errors.

Core claim

The central claim is that cross-modal interference stems from uncontrolled mixing of audio and visual information during intermediate reasoning steps, and that requiring modality-specific chain-of-thought traces produced separately and then fused, supported by reinforcement learning on modality-preference labels, reduces hallucinations while retaining complementary information from both modalities.

What carries the argument

Modality-specific chain-of-thought reasoning that keeps audio and visual processing isolated in the reasoning stage before allowing full cross-modal access only at the final evidence fusion stage, guided by an auxiliary RL reward.

Load-bearing premise

Enforcing separate modality-specific reasoning traces will reduce interference without causing the loss of useful information that only appears when modalities interact early.

What would settle it

An experiment that applies the full SFFL pipeline to the cross-modal hallucination benchmark and finds no reduction or an increase in hallucination rate compared to a standard fused reasoning baseline.

Figures

Figures reproduced from arXiv: 2605.09906 by Chenrui Cui, Jianwu Dang, Longbiao Wang, Long Zhou, Tianrui Wang, Xuanchen Li, Yuheng Lu, Yu Jiang, Zikang Huang.

**Figure 2.** Figure 2: Overview of Separate First, Fuse Later (SFFL) reasoning framework. et al., 2025; Yuan et al., 2024). Recent works have extended this into two-stage frameworks, such as MultimodalCoT (Zhang et al., 2024), which generate a rationale before the final answer to improve robustness. Furthermore, refinement and verification strategies like Vision-SR1 (Li et al., 2025) and MoT (Zheng et al., 2025a) attempt to v… view at source ↗

**Figure 3.** Figure 3: Layer-wise attention allocation from <sum> tokens to audio vs. visual reasoning traces. We report the normalized attention mass across the last 16 layers, grouped by predicted PEM. 4.4.3. INTERPRETABILITY OF PREFERRED EVIDENCE MODALITY We further validate the interpretability of PEM and examine how it influences the model’s reasoning. Specifically, as shown in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4 [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: The data pipeline prompts. You are an expert multimodal reasoning assistant. You will be given: - A Question and Multi Choices - A Video with Audio Your task: 1. Analyze the question and identify which modality (Audio, Visual, or Audio-Visual) provides the perceptual evidence required to answer the question correctly. 2. Produce separate step-by-step reasoning for Visual and for Audio based only on what is… view at source ↗

**Figure 6.** Figure 6: The training/inference instruction prompts. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Cases between our method and Qwen3-Omni-Thinking. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

read the original abstract

Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and integrating evidence for answering. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage a instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16\% on general AVQA benchmarks and 11.17\% on a cross-modal hallucination benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SFFL gives a clean engineering fix for cross-modal interference in AV LLMs by separating reasoning traces first then fusing, with reported benchmark lifts that look plausible but rest on thin experimental detail.

read the letter

The main takeaway is that enforcing separate audio and visual chain-of-thought before any fusion cuts hallucinations in these models without obvious loss of complementary signals. The gains are modest but consistent: 5% relative on standard AVQA sets and 11% on a hallucination-specific test. That matches the practical problem the authors flag, where one modality can derail interpretation of the other during intermediate steps. The framework itself is straightforward: they generate modality-preference labels by running controlled input ablations, feed those into an RL auxiliary reward to bias the model toward the stronger cue per example, and add an isolation step that keeps reasoning traces apart until the final evidence merge. None of this is revolutionary on its own, but the specific combination of preference labeling plus relaxed fusion is not just a rehash of prior AVQA work. It does a solid job of making the pipeline explicit and tying the reward directly to observable modality strength. The soft spot is the lack of visible controls or variance numbers in the reported results. Without error bars, multiple random seeds, or fuller ablation tables, it's hard to know whether the lift holds under different model scales or data distributions. The core assumption that separation preserves useful cross-modal info seems to survive their tests, but a reader would want to see whether the gains shrink when the modalities are more entangled. This paper is aimed at people building or debugging multimodal LLMs for video or audio-visual tasks who need reliability tweaks rather than new architectures. It is worth a serious referee pass so the full experimental setup and any hidden sensitivities can be checked.

Referee Report

2 major / 1 minor

Summary. The paper claims that audio-visual LLMs suffer from cross-modal interference during intermediate reasoning, which induces hallucinations by allowing one modality to misguide another. It proposes the Separate First, Fuse Later (SFFL) framework that enforces modality-specific chain-of-thought reasoning to produce separate audio and visual reasoning traces before fusing evidence for the final answer. Modality-preference labels are constructed via a data pipeline based on controlled input ablations under different modality settings and used as an auxiliary reward signal in reinforcement learning to encourage instance-dependent modality cue preference. A modality-specific reasoning mechanism maintains isolation during the separated reasoning stage while permitting full cross-modal access at the evidence fusion stage. Experiments report consistent improvements, with an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.

Significance. If the reported gains prove robust, SFFL would offer a practical and targeted approach to controlling cross-modal interactions in multimodal LLMs, addressing a recognized source of hallucinations while retaining complementary information across modalities. The combination of ablation-derived preference labels as an RL auxiliary reward and staged isolation/fusion provides a concrete training and inference recipe that could generalize to other multimodal settings. This would be a useful contribution to the literature on reliable audio-visual reasoning.

major comments (2)

[Abstract] Abstract: The central empirical claims of 5.16% relative gain on AVQA benchmarks and 11.17% on the hallucination benchmark are stated without any mention of the number of runs, error bars, statistical significance tests, or baseline implementation details. These omissions make it impossible to determine whether the improvements exceed experimental variance and are load-bearing for the paper's main result.
[Data pipeline] Data pipeline section: The construction of modality-preference labels from controlled input ablations is described at a high level, but the manuscript provides no validation procedure (e.g., human agreement, consistency checks across ablations, or sensitivity analysis) to confirm that the labels accurately reflect genuine modality preferences rather than artifacts of the ablation protocol. This directly affects the reliability of the RL auxiliary reward and therefore the soundness of the training procedure.

minor comments (1)

[Abstract] The abstract would be clearer if it named the specific AVQA datasets and the size of the hallucination benchmark used to obtain the reported relative gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the presentation of our empirical results and strengthens the validation of our data construction process. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The central empirical claims of 5.16% relative gain on AVQA benchmarks and 11.17% on the hallucination benchmark are stated without any mention of the number of runs, error bars, statistical significance tests, or baseline implementation details. These omissions make it impossible to determine whether the improvements exceed experimental variance and are load-bearing for the paper's main result.

Authors: We agree that the abstract would benefit from explicit mention of experimental rigor to support the reported gains. Due to length constraints, we will revise the abstract to include a brief qualifier (e.g., 'averaged over 3 runs with full details, error bars, and significance tests in Section 4'). The Experiments section already reports results over multiple seeds with standard deviations; we will add explicit statements on the number of runs, baseline re-implementation details, and statistical tests (paired t-tests) to make this information immediately accessible and confirm the gains exceed variance. revision: partial
Referee: [Data pipeline] Data pipeline section: The construction of modality-preference labels from controlled input ablations is described at a high level, but the manuscript provides no validation procedure (e.g., human agreement, consistency checks across ablations, or sensitivity analysis) to confirm that the labels accurately reflect genuine modality preferences rather than artifacts of the ablation protocol. This directly affects the reliability of the RL auxiliary reward and therefore the soundness of the training procedure.

Authors: We acknowledge that the current high-level description lacks explicit validation, which is a valid concern for the reliability of the auxiliary reward. In the revised manuscript, we will add a new subsection under Data Pipeline that includes: consistency checks by re-running ablations with modality swaps, sensitivity analysis on ablation thresholds, and human agreement evaluation on a sampled subset of labels (reporting Cohen's kappa). These additions will demonstrate that the labels capture genuine preferences and are not artifacts, thereby supporting the soundness of the RL training procedure. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical framework

full rationale

The paper describes an empirical pipeline (modality-preference labels from input ablations, RL auxiliary reward, modality-isolation during CoT followed by fusion) whose performance is reported via measured accuracy gains on AVQA and hallucination benchmarks. No equations, uniqueness theorems, or derivations appear in the provided text; the reported 5.16% and 11.17% relative improvements are experimental outcomes rather than quantities forced by construction from fitted inputs or self-citations. The central claim therefore remains externally falsifiable and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available, so ledger is minimal; main assumption is that cross-modal interference arises from uncontrolled interactions in intermediate reasoning.

axioms (1)

domain assumption Modality-specific chain-of-thought reduces cross-modal interference while preserving complementary evidence
Core premise of the SFFL design stated in the abstract

pith-pipeline@v0.9.0 · 5530 in / 1109 out tokens · 35708 ms · 2026-05-12T04:27:35.143362+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

[1]

2004 , issn =

Merging the senses into a robust percept , journal =. 2004 , issn =. doi:https://doi.org/10.1016/j.tics.2004.02.002 , url =

work page doi:10.1016/j.tics.2004.02.002 2004
[2]

Nature Reviews Neuroscience , volume =

Multisensory integration: current issues from the perspective of the single neuron , author =. Nature Reviews Neuroscience , volume =. 2008 , publisher =

work page 2008
[3]

2018 , eprint=

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , author=. 2018 , eprint=

work page 2018
[4]

Quantifying uncertainty in answers from any language model and enhancing their trustworthiness,

Chen, Jiuhai and Mueller, Jonas. Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.283

work page doi:10.18653/v1/2024.acl-long.283 2024
[5]

Proceedings of the 30th ACM International Conference on Multimedia , pages=

AVQA: A Dataset for Audio-Visual Question Answering on Videos , author=. Proceedings of the 30th ACM International Conference on Multimedia , pages=

work page
[6]

2022 , eprint=

Learning to Answer Questions in Dynamic Audio-Visual Scenarios , author=. 2022 , eprint=

work page 2022
[7]

2025 , isbn =

Riahi, Ines and Radman, Abduljalil and Guo, Zixin and Hedjam, Rachid and Laaksonen, Jorma , title =. 2025 , isbn =. doi:10.1145/3746027.3758261 , booktitle =

work page doi:10.1145/3746027.3758261 2025
[8]

2025 , isbn =

Zhao, Xujian and Wang, Yixin and Jin, Peiquan , title =. 2025 , isbn =. doi:10.1609/aaai.v39i10.33138 , booktitle =

work page doi:10.1609/aaai.v39i10.33138 2025
[9]

2023 , eprint=

Dynamic Multimodal Fusion , author=. 2023 , eprint=

work page 2023
[10]

2025 , eprint=

AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering , author=. 2025 , eprint=

work page 2025
[11]

2025 , eprint=

Question-Aware Gaussian Experts for Audio-Visual Question Answering , author=. 2025 , eprint=

work page 2025
[12]

2024 , eprint=

Multimodal Chain-of-Thought Reasoning in Language Models , author=. 2024 , eprint=

work page 2024
[13]

2025 , eprint=

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models , author=. 2025 , eprint=

work page 2025
[14]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Event-specific audio-visual fusion layers: A simple and new perspective on video understanding , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

work page
[15]

2020 , eprint=

What Makes Training Multi-Modal Classification Networks Hard? , author=. 2020 , eprint=

work page 2020
[16]

2025 , eprint=

Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning , author=. 2025 , eprint=

work page 2025
[17]

2025 , eprint=

AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding , author=. 2025 , eprint=

work page 2025
[18]

A Survey on Multimodal Large Language Models,

Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong , year=. A survey on multimodal large language models , volume=. National Science Review , publisher=. doi:10.1093/nsr/nwae403 , number=

work page doi:10.1093/nsr/nwae403
[19]

2024 , eprint=

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model , author=. 2024 , eprint=

work page 2024
[20]

2025 , eprint=

Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization , author=. 2025 , eprint=

work page 2025
[21]

2025 , eprint=

Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation , author=. 2025 , eprint=

work page 2025
[22]

2024 , eprint=

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio , author=. 2024 , eprint=

work page 2024
[23]

Abhimanyu Dubey et al

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and others , year=. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , volume=. Nature , publisher=. doi:10.1038/s41586-025-09422-z , number=

work page doi:10.1038/s41586-025-09422-z
[24]

2024 , eprint=

Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning , author=. 2024 , eprint=

work page 2024
[25]

2023 , eprint=

Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. 2023 , eprint=

work page 2023
[26]

2025 , eprint=

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning , author=. 2025 , eprint=

work page 2025
[27]

2025 , eprint=

Self-Rewarding Vision-Language Model via Reasoning Decomposition , author=. 2025 , eprint=

work page 2025
[28]

2025 , eprint=

Learning to Reason via Mixture-of-Thought for Logical Reasoning , author=. 2025 , eprint=

work page 2025
[29]

2025 , eprint=

AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video , author=. 2025 , eprint=

work page 2025
[30]

2025 , eprint=

Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models , author=. 2025 , eprint=

work page 2025
[31]

2025 , eprint=

OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination , author=. 2025 , eprint=

work page 2025
[32]

2024 , eprint=

Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models , author=. 2024 , eprint=

work page 2024
[33]

Spurious Correlations and Beyond: Understanding and Mitigating Shortcut Learning in SDOH Extraction with Large Language Models

Sakib, Fardin Ahsan and Zhu, Ziwei and Grace, Karen Trister and Yetisgen, Meliha and Uzuner, Ozlem. Spurious Correlations and Beyond: Understanding and Mitigating Shortcut Learning in SDOH Extraction with Large Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025. doi:10.18...

work page doi:10.18653/v1/2025.acl-short.86 2025
[34]

2024 , month = aug, howpublished =

work page 2024
[35]

2025 , eprint=

Qwen3-Omni Technical Report , author=. 2025 , eprint=

work page 2025
[36]

2025 , eprint=

Qwen2.5-Omni Technical Report , author=. 2025 , eprint=

work page 2025
[37]

2023 , eprint=

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding , author=. 2023 , eprint=

work page 2023
[38]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs , author=. arXiv preprint arXiv:2406.07476 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

2024 , eprint=

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models , author=. 2024 , eprint=

work page 2024
[40]

2025 , eprint=

video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models , author=. 2025 , eprint=

work page 2025
[41]

2024 , eprint=

SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning , author=. 2024 , eprint=

work page 2024
[42]

2025 , eprint=

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. 2025 , eprint=

work page 2025
[43]

2025 , eprint=

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning , author=. 2025 , eprint=

work page 2025
[44]

2017 , eprint=

Towards A Rigorous Science of Interpretable Machine Learning , author=. 2017 , eprint=

work page 2017