pith. machine review for the scientific record.

arxiv: 2502.04326 · v3 · submitted 2025-02-06 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 05:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords WorldSense · omnimodal understanding · multimodal LLMs · audio-visual benchmark · video understanding · real-world scenarios · model evaluation · cross-modal synergy

The pith

The WorldSense benchmark shows that current multimodal models reach at most 65.1 percent accuracy on tasks requiring tight audio-visual synergy in real-world videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

WorldSense is presented as the first benchmark to evaluate multimodal large language models on omnimodal video understanding that combines visual, audio, and text inputs. It includes 1,662 audio-visual synchronized videos drawn from eight primary domains and 67 subcategories, together with 3,172 manually annotated multi-choice questions spread across 26 tasks. The design emphasizes strong coupling between audio and video so that successful answers depend on using information from both modalities together. When state-of-the-art models are tested, the highest accuracy obtained is 65.1 percent, revealing clear difficulties in handling coherent real-world contexts built from multiple sensory streams. The benchmark is offered as a platform to expose these limitations and to steer future model development toward better omnimodal perception.
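
To make the reported numbers concrete, here is a minimal sketch of how a WorldSense-style multi-choice evaluation could be scored overall and per task. The OmniQA fields and the predict callable are assumptions; the paper's text specifies only the video and question counts, the 26 tasks, and the overall-accuracy metric.

    # Minimal scoring sketch for a WorldSense-style multi-choice evaluation.
    # Field names and the predict() interface are assumptions; the paper
    # specifies only audio-visual clips, multi-choice QA, 26 tasks, and
    # overall accuracy (best model: 65.1%).
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class OmniQA:
        video_path: str      # audio-visual synchronized clip
        question: str
        options: list[str]   # multi-choice candidates
        answer: str          # gold option label, e.g. "B"
        task: str            # one of the 26 task names
        domain: str          # one of the 8 primary domains

    def evaluate(samples, predict):
        """predict(sample) -> option label; returns overall and per-task accuracy."""
        per_task = defaultdict(lambda: [0, 0])   # task -> [correct, total]
        for s in samples:
            per_task[s.task][0] += int(predict(s) == s.answer)
            per_task[s.task][1] += 1
        overall = sum(c for c, _ in per_task.values()) / sum(t for _, t in per_task.values())
        return {"overall": overall,
                "per_task": {k: c / t for k, (c, t) in per_task.items()}}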

Core claim

Existing multimodal large language models encounter substantial challenges when they must understand real-world scenarios that demand simultaneous, synergistic perception of audio and visual information, as shown by their performance ceiling of 65.1 percent on the WorldSense collection of 1,662 synchronized videos and 26 tasks.

What carries the argument

The WorldSense benchmark, a set of audio-visual synchronized videos and expert-annotated QA pairs that enforce collaborative use of omni-modality.

Load-bearing premise

The 26 selected tasks and the manually written question-answer pairs faithfully represent the demands of real-world omnimodal understanding without annotation or task-selection bias.

What would settle it

A new model that scores well above 65.1 percent specifically on the audio-video coupling tasks while performing at comparable levels on prior benchmarks would show that the reported challenges are overstated.

Original abstract

We introduce WorldSense, the first benchmark to assess the multi-modal video understanding, that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, our WorldSense has several features: (i) collaboration of omni-modality, we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively utilize the synergistic perception of omni-modality; (ii) diversity of videos and tasks, WorldSense encompasses a diverse collection of 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover the broad scenarios, and 3,172 multi-choice QA pairs across 26 distinct tasks to enable the comprehensive evaluation; (iii) high-quality annotations, all the QA pairs are manually labeled by 80 expert annotators with multiple rounds of correction to ensure quality. Based on our WorldSense, we extensively evaluate various state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (65.1% best accuracy). By analyzing the limitations of current models, we aim to provide valuable insight to guide development of real-world understanding. We hope our WorldSense can provide a platform for evaluating the ability in constructing and understanding coherent contexts from omni-modality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WorldSense, the first benchmark for real-world omnimodal understanding in multimodal LLMs that integrates visual, audio, and text inputs. It features 1,662 synchronized audio-visual videos across 8 primary domains and 67 subcategories, 3,172 multi-choice QA pairs spanning 26 tasks, and manual annotations by 80 experts with multiple correction rounds. The authors evaluate state-of-the-art models on this benchmark and report that the best accuracy is 65.1%, concluding that existing models face significant challenges in synergistic omni-modal perception of real-world scenarios.

Significance. If the tasks genuinely require audio-video-text synergy without single-modality shortcuts or annotation artifacts, WorldSense could fill an important gap in existing video and multimodal benchmarks by providing a diverse, high-quality platform for evaluating coherent context construction from omni-modal inputs. The broad domain coverage and expert annotation process are strengths that could help guide model development toward better real-world understanding.

major comments (3)
  1. [Abstract] The central claim that tasks are designed with 'strong coupling of audio and video' requiring 'synergistic perception of omni-modality' is not supported by any modality-ablation experiments, single-modality baselines, or analysis showing that performance drops substantially when one modality is removed.
  2. [Dataset and Annotation] The paper does not report inter-annotator agreement statistics or details on the multi-round correction process for the 3,172 QA pairs annotated by 80 experts, which is load-bearing for claims about high-quality annotations and the reliability of the reported 65.1% accuracy gap.
  3. [Experiments] No human performance baseline is provided on the 26 tasks, making it impossible to determine whether the 65.1% model accuracy reflects genuine omnimodal integration failures or simply the inherent difficulty of the chosen real-world scenarios.
minor comments (2)
  1. [Related Work] A more detailed comparison table with prior audio-visual and video QA benchmarks (e.g., in the related work section) would better highlight the claimed novelty in task coupling and diversity.
  2. [Task Design] Ensure all 26 task definitions include explicit examples of the required audio-video synergy to allow readers to assess shortcut risks; a sketch of the single-modality ablation check this implies follows below.
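
A minimal sketch of such a modality-ablation check, echoing major comment 1: each sample is assumed to be a dict with 'audio', 'video', and 'answer' keys, and predict() is a hypothetical model interface that accepts partially masked inputs; none of this is the paper's actual harness.

    # Hedged sketch of a modality-ablation check: score the same QA set with
    # full inputs, video only, and audio only, then report the drop relative
    # to the omnimodal condition. Dict keys and predict() are assumptions.
    def mask_modality(sample: dict, condition: str) -> dict:
        masked = dict(sample)
        if condition == "video":      # video-only: silence the audio track
            masked["audio"] = None
        elif condition == "audio":    # audio-only: blank the frames
            masked["video"] = None
        return masked                 # "both" leaves the sample untouched

    def ablation_accuracies(samples: list, predict) -> dict:
        acc = {}
        for condition in ("both", "video", "audio"):
            correct = sum(predict(mask_modality(s, condition)) == s["answer"] for s in samples)
            acc[condition] = correct / len(samples)
        acc["drop_without_audio"] = acc["both"] - acc["video"]
        acc["drop_without_video"] = acc["both"] - acc["audio"]
        return acc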

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. We agree that the suggested additions will strengthen the paper and plan to incorporate them in the revised version.

Point-by-point responses
  1. Referee: [Abstract] The central claim that tasks are designed with 'strong coupling of audio and video' requiring 'synergistic perception of omni-modality' is not supported by any modality-ablation experiments, single-modality baselines, or analysis showing that performance drops substantially when one modality is removed.

    Authors: We acknowledge that while the dataset construction section describes the intentional design of tasks with strong audio-video coupling (e.g., questions that require integrating audio cues like speech or sound events with visual context), we did not include explicit modality ablation studies in the current manuscript. To directly address this, we will add ablation experiments in the revised Experiments section, evaluating models on video-only, audio-only, and full omnimodal inputs across a subset of tasks to quantify performance drops and demonstrate the synergistic requirement. revision: yes

  2. Referee: [Dataset and Annotation] The paper does not report inter-annotator agreement statistics or details on the multi-round correction process for the 3,172 QA pairs annotated by 80 experts, which is load-bearing for claims about high-quality annotations and the reliability of the reported 65.1% accuracy gap.

    Authors: We agree that reporting inter-annotator agreement and more granular details on the annotation process is important for substantiating the quality claims. In the revised manuscript, we will expand the Dataset and Annotation section to include inter-annotator agreement metrics (such as Fleiss' kappa across the 80 experts) and a step-by-step description of the multi-round correction workflow, including how conflicts were resolved and the criteria for final approval (a computation sketch follows these responses). revision: yes

  3. Referee: [Experiments] No human performance baseline is provided on the 26 tasks, making it impossible to determine whether the 65.1% model accuracy reflects genuine omnimodal integration failures or simply the inherent difficulty of the chosen real-world scenarios.

    Authors: We concur that a human performance baseline is essential for contextualizing the model results and distinguishing between task difficulty and model limitations. We will add human evaluation results on the 26 tasks (conducted with a separate group of annotators following the same protocol) to the Experiments section in the revised manuscript, allowing direct comparison with the best model accuracy of 65.1% (a comparison sketch follows below). revision: yes
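
The second response above proposes Fleiss' kappa as the agreement statistic. For reference, a minimal sketch of that computation; the matrix shape and the toy example are illustrative, not WorldSense data.

    # Fleiss' kappa from a count matrix: counts[i][j] = number of annotators
    # who chose option j for item i, with every row summing to the same
    # number of raters. Shapes and the example below are illustrative only.
    def fleiss_kappa(counts):
        n_items = len(counts)
        n_raters = sum(counts[0])        # assumed constant across items
        n_options = len(counts[0])

        # Mean per-item observed agreement.
        p_bar = sum(
            (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
            for row in counts
        ) / n_items

        # Chance agreement from marginal option frequencies.
        totals = [sum(row[j] for row in counts) for j in range(n_options)]
        p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

        return (p_bar - p_e) / (1 - p_e)

    # Toy example: 3 questions, 4 raters each, 3 options -> kappa ~ 0.12
    print(round(fleiss_kappa([[4, 0, 0], [2, 2, 0], [1, 1, 2]]), 2))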
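
The third response promises a human baseline on the same question set. One way such a gap could be assessed once that figure exists is a two-proportion z-test over the 3,172 items; the human accuracy below is a placeholder variable, not a reported number, and only the 65.1% model accuracy and the question count come from the paper.

    # Hedged sketch: compare a future human-accuracy figure against the
    # reported best model accuracy (65.1%) on the 3,172 questions.
    from math import erf, sqrt

    def two_proportion_z(acc_a: float, acc_b: float, n: int):
        """z statistic and two-sided p-value for two accuracies measured on n items each."""
        pooled = (acc_a + acc_b) / 2
        se = sqrt(2 * pooled * (1 - pooled) / n)
        z = (acc_a - acc_b) / se
        p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p

    model_acc = 0.651       # best accuracy reported in the paper
    n_questions = 3172
    # human_acc = ...       # to be filled in from the promised human study
    # z, p = two_proportion_z(human_acc, model_acc, n_questions)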

Circularity Check

0 steps flagged

No circularity: empirical benchmark paper with self-contained dataset and evaluation results

Full rationale

This is an empirical benchmark paper whose central contribution is the introduction of the WorldSense dataset (1,662 videos, 3,172 QA pairs, 26 tasks, manual annotations by 80 experts) and the reporting of model accuracies on it (best 65.1%). There are no derivations, equations, fitted parameters, predictions, uniqueness theorems, or ansatzes in the provided text. The performance numbers are direct empirical measurements on the newly created data rather than results that reduce by construction to inputs or self-citations. The paper is self-contained against external benchmarks because its claims rest on the existence and difficulty of the dataset itself, with no load-bearing steps that collapse to prior author work or definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical derivations or new physical entities are introduced; the work rests on the domain assumption that expert manual annotation produces reliable ground truth for omnimodal understanding.

axioms (1)
  • domain assumption: Expert annotators with multiple correction rounds produce unbiased and comprehensive QA pairs that reflect real-world omnimodal requirements.
    Invoked in the description of high-quality annotations and the claim that the benchmark enables comprehensive evaluation.

pith-pipeline@v0.9.0 · 5548 in / 1264 out tokens · 27127 ms · 2026-05-17T05:48:39.250838+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  2. Do Audio-Visual Large Language Models Really See and Hear?

    cs.AI 2026-04 unverdicted novelty 8.0

    AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.

  3. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  4. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  5. Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.

  6. Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.

  7. PolyReal: A Benchmark for Real-World Polymer Science Workflows

    cs.CV 2026-04 unverdicted novelty 7.0

    PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.

  8. Motion-o: Trajectory-Grounded Video Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

  9. VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

    cs.CV 2025-12 unverdicted novelty 7.0

    VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.

  10. See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

    cs.CV 2025-12 unverdicted novelty 7.0

    AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.

  11. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  12. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

    cs.CL 2026-04 unverdicted novelty 6.0

    MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

  13. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

  14. A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Coordinated multi-modal typographic attacks on MLLMs achieve 83.43% success rate versus 34.93% for single-modality attacks.

  15. DeepEyesV2: Toward Agentic Multimodal Model

    cs.CV 2025-11 unverdicted novelty 6.0

    DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

  16. SmolVLM: Redefining small and efficient multimodal models

    cs.AI 2025-04 unverdicted novelty 6.0

    SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

  17. OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.

  18. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  19. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  20. Qwen3.5-Omni Technical Report

    cs.CL 2026-04 unverdicted novelty 5.0

    Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 17 Pith papers · 32 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

  2. [2]

    Introducing the next generation of Claude

    Anthropic. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family, 2024. Accessed: 2024-10-22

  3. [3]

    Hourvideo: 1-hour video-language understanding

    Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Li Fei-Fei. Hourvideo: 1-hour video-language understanding. arXiv preprint arXiv:2411.04998, 2024

  4. [4]

    Driving with llms: Fusing object-level vector modality for explainable autonomous driving

    Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 14093–14100. IEEE, 2024

  5. [5]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

  6. [6]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.arXiv preprint arXiv:2404.16821, 2024

  7. [7]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

  8. [8]

    Qwen2-Audio Technical Report

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

  9. [9]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023

  10. [10]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  11. [11]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, D Li, AMH Tiong, J Zhao, W Wang, B Li, P Fung, and S Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arxiv 2023.arXiv preprint arXiv:2305.06500, 2, 2023

  12. [12]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

  13. [13]

    Instructseq: Unifying vision tasks with instruction-conditioned multi-modal sequence generation.arXiv preprint arXiv:2311.18835, 2023

    Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, and Hongsheng Li. Instructseq: Unifying vision tasks with instruction-conditioned multi-modal sequence generation.arXiv preprint arXiv:2311.18835, 2023

  14. [14]

    MMBench-Video: A long-form multi-shot benchmark for holistic video understanding

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding.arXiv preprint arXiv:2406.14515, 2024

  15. [15]

    Vila2: Vila augmented vila.arXiv preprint arXiv:2407.17453, 2024

    Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jan Kautz, Jang Hyun Cho, Marco Pavone, Song Han, and Hongxu Yin. Vila2: Vila augmented vila.arXiv preprint arXiv:2407.17453, 2024

  16. [16]

    Finevideo. https://huggingface.co/datasets/HuggingFaceFV/finevideo, 2024

    Miquel Farré, Andi Marafioti, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Finevideo. https://huggingface.co/datasets/HuggingFaceFV/finevideo, 2024

  17. [17]

    Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024

  18. [18]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024

  19. [19]

    Vita-1.5: Towards gpt-4o level real-time vision and speech interaction

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957, 2025

  20. [20]

    Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos.arXiv preprint arXiv:2411.19772, 2024

    Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, and Feng Zheng. Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos.arXiv preprint arXiv:2411.19772, 2024

  21. [21]

    Av-odyssey bench: Can your multimodal llms really understand audio-visual information?arXiv preprint arXiv:2412.02611, 2024

    Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, et al. Av-odyssey bench: Can your multimodal llms really understand audio-visual information?arXiv preprint arXiv:2412.02611, 2024

  22. [22]

    The llama 3 herd of models, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, and et al. The llama 3 herd of models, 2024

  23. [23]

    Onellm: One framework to align all modalities with language

    Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. Onellm: One framework to align all modalities with language. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26584–26595, 2024

  24. [24]

    Multi-modal instruction tuned llms with fine-grained visual perception

    Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, and Xuansong Xie. Multi-modal instruction tuned llms with fine-grained visual perception. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13980–13990, 2024

  25. [25]

    Mmworld: Towards multi-discipline multi-faceted world model evaluation in videos.arXiv preprint arXiv:2406.08407, 2024

    Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, et al. Mmworld: Towards multi-discipline multi-faceted world model evaluation in videos.arXiv preprint arXiv:2406.08407, 2024

  26. [26]

    DeepEyesV2: Toward Agentic Multimodal Model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

  27. [27]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  28. [28]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024

  29. [29]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  30. [30]

    Seed-bench: Benchmarking multimodal large language models

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13299–13308, 2024

  31. [31]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023

  32. [32]

    Learning to answer questions in dynamic audio-visual scenarios

    Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, and Di Hu. Learning to answer questions in dynamic audio-visual scenarios. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19108–19118, 2022

  33. [33]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

  34. [34]

    Omnibench: Towards the future of universal omni-language models.arXiv preprint arXiv:2409.15272, 2024

    Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, et al. Omnibench: Towards the future of universal omni-language models.arXiv preprint arXiv:2409.15272, 2024

  35. [35]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023

  36. [36]

    Streamingbench: Assessing the gap for mllms to achieve streaming video understanding.arXiv preprint arXiv:2411.03628, 2024

    Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding.arXiv preprint arXiv:2411.03628, 2024

  37. [37]

    Llava-next: Improved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024

  38. [38]

    Visual instruction tuning.Advances in neural information processing systems, 36, 2024

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024

  39. [39]

    Revisiting mllms: An in-depth analysis of image classification abilities.arXiv preprint arXiv:2412.16418, 2024

    Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, and Jingdong Wang. Revisiting mllms: An in-depth analysis of image classification abilities.arXiv preprint arXiv:2412.16418, 2024

  40. [40]

    Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2025

  41. [41]

    TempCompass: Do Video LLMs Really Understand Videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024

  42. [42]

    NVILA: Efficient Frontier Visual Language Models

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models.arXiv preprint arXiv:2412.04468, 2024

  43. [43]

    MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for lvlms

    Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, et al. Mmdu: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for lvlms.arXiv preprint arXiv:2406.11833, 2024

  44. [44]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024

  45. [45]

    Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26439–26455, 2024

  46. [46]

    Visual perception by large language model’s weights

    Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, and Xiaoyan Sun. Visual perception by large language model’s weights. arXiv preprint arXiv:2405.20339, 2024

  47. [47]

    Ee-mllm: A data-efficient and compute-efficient multimodal large language model.arXiv preprint arXiv:2408.11795, 2024

    Feipeng Ma, Yizhou Zhou, Zheyu Zhang, Shilin Yan, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, and Xiaoyan Sun. Ee-mllm: A data-efficient and compute-efficient multimodal large language model.arXiv preprint arXiv:2408.11795, 2024

  48. [48]

    Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023

  49. [49]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

  50. [50]

    Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving

    Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. InEuropean Conference on Computer Vision, pages 292–308. Springer, 2025

  51. [51]

    Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

    Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models.arXiv preprint arXiv:2311.16103, 2023

  52. [52]

    Gpt-4v(ision) system card, 2023

    OpenAI. Gpt-4v(ision) system card, 2023

  53. [53]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  54. [54]

    X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning.arXiv preprint arXiv:2311.18799, 2023

    Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles. X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning.arXiv preprint arXiv:2311.18799, 2023

  55. [55]

    Detgpt: Detect what you need via reasoning.arXiv preprint arXiv:2305.14167, 2023

    Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, et al. Detgpt: Detect what you need via reasoning.arXiv preprint arXiv:2305.14167, 2023

  56. [56]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European Conference on Computer Vision, pages 256–274. Springer, 2025

  57. [57]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024

  58. [58]

    video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024

    Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024

  59. [59]

    video-salmonn 2: Captioning-enhanced audio-visual large language models.arXiv preprint arXiv:2506.15220, 2025

    Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-salmonn 2: Captioning-enhanced audio-visual large language models.arXiv preprint arXiv:2506.15220, 2025

  60. [60]

    Salmonn: Towards generic hearing abilities for large language models.arXiv preprint arXiv:2310.13289, 2023

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models.arXiv preprint arXiv:2310.13289, 2023

  61. [61]

    Mtvqa: Benchmarking multilingual text-centric visual question answering.arXiv preprint arXiv:2405.11985, 2024

    Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, et al. Mtvqa: Benchmarking multilingual text-centric visual question answering.arXiv preprint arXiv:2405.11985, 2024

  62. [62]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  63. [63]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  64. [64]

    Reka core, flash, and edge: A series of powerful multimodal language models.arXiv preprint arXiv:2404.12387, 2024

    Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, et al. Reka core, flash, and edge: A series of powerful multimodal language models.arXiv preprint arXiv:2404.12387, 2024

  65. [65]

    Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.arXiv preprint arXiv:2406.16860, 2024

  66. [67]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  67. [68]

    Lvbench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, et al. Lvbench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024

  68. [69]

    Visionllm: Large language model is also an open-ended decoder for vision-centric tasks.Advances in Neural Information Processing Systems, 36, 2024

    Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks.Advances in Neural Information Processing Systems, 36, 2024

  69. [70]

    Longllava: Scaling multi-modal llms to 1000 images efficiently via a hybrid architecture.arXiv preprint arXiv:2409.02889, 2024

    Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang. Longllava: Scaling multi-modal llms to 1000 images efficiently via a hybrid architecture.arXiv preprint arXiv:2409.02889, 2024

  70. [71]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

  71. [72]

    Gsva: Generalized segmentation via multimodal large language models

    Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3858–3869, 2024

  72. [73]

    Video question answering via gradually refined attention over appearance and motion

    Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. InProceedings of the 25th ACM international conference on Multimedia, pages 1645–1653, 2017

  73. [74]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report.arXiv preprint arXiv:2503.20215, 2025

  74. [75]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

  75. [76]

    Slowfast-llava: A strong training-free base- line for video large language models.arXiv preprint arXiv:2407.15841, 2024

    Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, and Afshin Dehghan. Slowfast-llava: A strong training-free baseline for video large language models.arXiv preprint arXiv:2407.15841, 2024

  76. [77]

    Crosslmm: Decoupling long video sequences from lmms via dual cross-attention mechanisms

    Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, and Ray Zhang. Crosslmm: Decoupling long video sequences from lmms via dual cross-attention mechanisms. arXiv preprint arXiv:2505.17020, 2025

  77. [78]

    Avqa: A dataset for audio-visual question answering on videos

    Pinci Yang, Xin Wang, Xuguang Duan, Hong Chen, Runze Hou, Cong Jin, and Wenwu Zhu. Avqa: A dataset for audio-visual question answering on videos. InProceedings of the 30th ACM international conference on multimedia, pages 3480–3491, 2022

  78. [79]

    mplug-owl3: Towards long image-sequence understanding in multi-modal large language models

    Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840, 2024

  79. [80]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. InProceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9127–9134, 2019

  80. [81]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

Showing first 80 references.