Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Pith reviewed 2026-05-10 12:01 UTC · model grok-4.3
The pith
Chain of Modality dynamically switches among parallel, sequential, and interleaved input topologies to eliminate positional biases and alignment traps that degrade multimodal inference below unimodal baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Omni-MLLMs suffer degraded joint inference because static fusion topologies create positional bias in sequential streams and alignment traps in interleaved formats. Chain of Modality resolves this by adaptively orchestrating among parallel, sequential, and interleaved pathways and by bifurcating cognitive execution into Direct-Decide and Reason-Decide routes, yielding robust generalization in either training-free or data-efficient SFT regimes.
What carries the argument
Chain of Modality (CoM), an agentic framework that selects input topologies on the fly and routes execution through task-aligned Direct-Decide and Reason-Decide pathways.
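The abstract does not say how that selection is made (a gap the referee flags below). As a way to fix ideas only, here is a minimal Python sketch of what an agentic topology router could look like; `select_topology`, the `scorer` probe, and the `model.format`/`model.generate` interface are all illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch of CoM-style topology orchestration. The paper does not
# disclose its selection criteria; here a lightweight probe scores each
# candidate topology and the highest-scoring one is used for the real pass.
from dataclasses import dataclass
from typing import Any, Callable, Dict

TOPOLOGIES = ("parallel", "sequential", "interleaved")

@dataclass
class MultimodalQuery:
    text: str
    modalities: Dict[str, Any]  # e.g. {"audio": waveform, "video": frames}

def select_topology(query: MultimodalQuery,
                    scorer: Callable[[MultimodalQuery, str], float]) -> str:
    # Assumed mechanism: score each fusion topology with a cheap draft pass
    # (e.g. the model's own answer confidence) and keep the best.
    return max(TOPOLOGIES, key=lambda t: scorer(query, t))

def answer(query: MultimodalQuery, scorer, model) -> str:
    topology = select_topology(query, scorer)
    prompt = model.format(query, topology=topology)  # hypothetical model API
    return model.generate(prompt)
```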
If this is right
- Multimodal models can retain or exceed unimodal accuracy by choosing topology per task rather than defaulting to concatenation.
- The two-pathway split allows simple perception tasks to bypass unnecessary reasoning steps while complex tasks receive explicit auditing (see the sketch after this list).
- Training-free deployment becomes viable, lowering the data and compute cost of adapting existing Omni-MLLMs.
- Consistent cross-benchmark gains imply that structural bias, not modality count itself, is the dominant limiter on current fusion designs.
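The bypass in the second bullet can be made concrete under one added assumption: that routing between the two pathways is driven by an estimated task complexity. The `complexity_estimate` probe, the `threshold`, and the `chain_of_thought` flag below are hypothetical; the abstract only names the two pathways.

```python
# Hypothetical sketch of the Direct-Decide / Reason-Decide bifurcation.
# The routing signal (an estimated task complexity) is an assumption.
def route(query, complexity_estimate, model, threshold=0.5):
    if complexity_estimate(query) < threshold:
        # "Direct-Decide": answer from perception alone, skipping explicit
        # reasoning steps that would only add latency on simple tasks.
        return model.generate(query, chain_of_thought=False)
    # "Reason-Decide": produce intermediate reasoning, then audit it before
    # committing to a final answer.
    draft = model.generate(query, chain_of_thought=True)
    audit_prompt = f"Audit the reasoning below and give a final answer:\n{draft}"
    return model.generate(audit_prompt)
```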
Where Pith is reading between the lines
- The same orchestration logic could be tested on sensor-fusion pipelines outside language models, such as robotics or medical imaging, where input order and alignment are similarly critical.
- If the topology switch proves stable, it offers a route to reduce reliance on ever-larger training corpora for multimodal alignment.
- A natural next measurement is whether CoM alters the distribution of attention heads across modalities in a way that can be inspected directly (a minimal probe is sketched below).
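That probe can be as simple as the following sketch: given one layer's attention weights and the key-index span each modality's tokens occupy, compute the fraction of attention mass each head assigns to each modality, before and after CoM is applied. The span layout and tensor shapes are assumptions for illustration.

```python
import torch

def attention_mass_by_modality(attn: torch.Tensor, spans: dict) -> dict:
    """Fraction of attention each modality receives, per head.

    attn:  (heads, q_len, k_len) attention weights from one layer.
    spans: key-index ranges per modality, e.g. {"text": (0, 128),
           "audio": (128, 384), "video": (384, 1024)} (assumed layout).
    """
    total = attn.sum(dim=(-2, -1))  # total mass per head, shape (heads,)
    return {
        name: attn[:, :, lo:hi].sum(dim=(-2, -1)) / total
        for name, (lo, hi) in spans.items()
    }
```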
Load-bearing premise
The observed performance paradox arises primarily from positional bias and alignment traps in static fusion, and dynamic orchestration can neutralize those distortions without introducing comparable new biases or computational overhead.
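The positional-bias half of that premise is directly measurable without any of CoM's machinery: permute the order of modality segments in the sequential topology and count how often the answer changes, which it should not if attention were order-invariant. A minimal sketch, assuming a hypothetical `model.answer(example, order=...)` interface:

```python
import itertools

def positional_bias_rate(model, benchmark,
                         modality_order=("text", "audio", "video")):
    """Fraction of examples whose answer changes under some permutation of
    the sequential input order (hypothetical model.answer interface)."""
    flipped = 0
    for example in benchmark:
        answers = {
            model.answer(example, order=perm)
            for perm in itertools.permutations(modality_order)
        }
        flipped += len(answers) > 1  # any disagreement across orderings
    return flipped / len(benchmark)
```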
What would settle it
Apply CoM to a model that currently exhibits the paradox and check whether multimodal accuracy rises above the unimodal baseline on the same held-out benchmarks while latency and error patterns remain comparable or better.
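A sketch of that check, assuming a shared example set and a hypothetical `run(model, example, modalities)` callable that reports correctness; the modality names are placeholders:

```python
import time
import statistics

def paradox_check(model, benchmark, run):
    """Compare unimodal baselines against joint inference on the same items.

    run(model, example, modalities) -> bool (assumed interface): whether the
    prediction under the given modality subset is correct.
    """
    settings = {
        "text_only": ["text"],
        "audio_only": ["audio"],            # hypothetical modality names
        "joint": ["text", "audio", "video"],
    }
    results = {}
    for name, mods in settings.items():
        correct, latencies = 0, []
        for example in benchmark:
            t0 = time.perf_counter()
            correct += bool(run(model, example, mods))
            latencies.append(time.perf_counter() - t0)
        results[name] = {
            "accuracy": correct / len(benchmark),
            "median_latency_s": statistics.median(latencies),
        }
    # The paradox persists if joint accuracy <= the best unimodal accuracy;
    # CoM's claim is falsified if it cannot flip that inequality at
    # comparable median latency.
    return results
```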
Original abstract
Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined "Direct-Decide" path for direct perception and a structured "Reason-Decide" path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies a performance paradox in Omni-MLLMs where unimodal baselines outperform joint multimodal inference, tracing it to positional bias in sequential inputs and alignment traps in interleaved formats arising from static fusion topologies. It proposes Chain of Modality (CoM), an agentic framework that dynamically orchestrates among parallel, sequential, and interleaved pathways and bifurcates execution into Direct-Decide (direct perception) and Reason-Decide (analytical auditing) paths, claiming robust generalization in either training-free or data-efficient SFT regimes.
Significance. If the dynamic orchestration mechanism were shown to neutralize the identified biases without introducing new overhead or inconsistencies, the work could meaningfully advance multimodal model design by shifting from passive concatenation to adaptive topology selection. The bifurcation into task-aligned pathways is a conceptually distinct idea, but with no implementation details, derivations, or results available it is impossible to judge whether it constitutes a substantive advance.
major comments (2)
- [Abstract] The central claim that CoM 'achieves robust and consistent generalization across diverse benchmarks' is unsupported by any methods description, experimental protocol, benchmark list, ablation results, or quantitative metrics; this directly undermines the assertion that the framework resolves the performance paradox.
- [Abstract] No definition or operationalization is given for how the agentic framework decides among parallel/sequential/interleaved topologies or switches between Direct-Decide and Reason-Decide paths, leaving the core mechanism of 'dynamic orchestration' unspecified and untestable.
minor comments (1)
- [Abstract] The abstract introduces several new terms (Chain of Modality, Direct-Decide path, Reason-Decide path) without immediate clarification of their scope or relation to existing agentic or routing techniques.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for recognizing the potential of dynamic orchestration to address the performance paradox in Omni-MLLMs. We address the major comments on the abstract point by point below.
Point-by-point responses
- Referee: [Abstract] The central claim that CoM 'achieves robust and consistent generalization across diverse benchmarks' is unsupported by any methods description, experimental protocol, benchmark list, ablation results, or quantitative metrics; this directly undermines the assertion that the framework resolves the performance paradox.
  Authors: We agree that the abstract, being a concise summary, does not include the supporting experimental detail. The full manuscript presents the methods, protocols, benchmark evaluations, ablations, and quantitative results demonstrating generalization in both training-free and data-efficient SFT regimes. To address the concern directly, we will revise the abstract to reference these key findings at a high level, or to qualify the generalization claim, so that it is explicitly tied to the evidence in the body of the paper. revision: yes
- Referee: [Abstract] No definition or operationalization is given for how the agentic framework decides among parallel/sequential/interleaved topologies or switches between Direct-Decide and Reason-Decide paths, leaving the core mechanism of 'dynamic orchestration' unspecified and untestable.
  Authors: We acknowledge that the abstract does not provide an operational description of the decision logic for topology selection or of the bifurcation between the Direct-Decide and Reason-Decide pathways. The full manuscript details the agentic orchestration process, including the selection criteria and switching mechanisms. We will revise the abstract to incorporate a brief, concrete operationalization of these components so that the dynamic orchestration mechanism is clearly specified and testable. revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper's abstract and described framework introduce Chain of Modality (CoM) as a proposed agentic solution that dynamically orchestrates input topologies and bifurcates cognitive pathways to address claimed structural pathologies in static fusion. No equations, parameter fittings, self-citations, or derivations are visible that reduce the central claims (e.g., neutralization of positional bias or alignment traps) back to the inputs by construction. The approach is presented as an independent methodological contribution applicable in training-free or SFT regimes, with generalization claims resting on empirical benchmarks rather than tautological redefinitions or imported uniqueness theorems. The derivation chain remains self-contained without load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Static fusion topologies in Omni-MLLMs cause positional bias and alignment traps that distort attention regardless of task semantics.
invented entities (3)
- Chain of Modality (CoM): no independent evidence
- Direct-Decide path: no independent evidence
- Reason-Decide path: no independent evidence