AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Cheng Liang; Guangyao Li; Hao Fei; Henghui Ding; Shaoxuan Xu; Weijun Wang; Wenjie Du; Wenming Tu; Yaoting Wang; Yuanchao Li

arXiv: 2606.07643 · v1 · pith:T5JGLSQFnew · submitted 2026-06-01 · 💻 cs.CV · cs.AI · cs.SD · eess.AS

Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu

Pith reviewed 2026-06-28 14:34 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.SD · eess.AS

keywords audio-visual intelligence · Omni-MLLMs · benchmark · cross-modal tasks · perception · understanding · reasoning · generalization

The pith

Omni-MLLMs exhibit substantial limitations in audio-visual intelligence when tested on a new three-stage cross-modal benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces AVI-Bench to evaluate Omni-MLLMs on their ability to integrate audio and visual information across three progressive stages of perception, understanding, and reasoning. The benchmark relies on tasks that demand joint interpretation of both modalities rather than separate processing. An added component, AVI-Bench-PriSe, uses unfamiliar low-semantic stimuli to probe whether models can generalize beyond patterns seen in training data. Experiments on multiple open-source and closed-source models show consistent shortcomings in handling these integrated tasks. The authors derive a four-level taxonomy of audio-visual intelligence from the observed failure modes to organize future evaluation.

Core claim

The paper claims that current Omni-MLLMs lack robust audio-visual intelligence. The evidence is their performance on AVI-Bench, which measures joint audio-visual interpretation through staged cross-modal tasks, and on AVI-Bench-PriSe, which tests primitive sensation with unfamiliar stimuli. From these results the authors define a four-level AVI taxonomy that classifies model capabilities and gaps.

What carries the argument

AVI-Bench, a cognitively inspired benchmark that structures evaluation into three stages of cross-modal tasks plus an extension for primitive unfamiliar stimuli.

If this is right

Models must improve joint audio-visual processing at the reasoning stage rather than relying on unimodal strengths.
Performance drops sharply on unfamiliar stimuli, indicating that current training leaves models vulnerable to distribution shifts.
The four-level taxonomy supplies a concrete scale for tracking progress toward more integrated audio-visual capabilities.
Fine-grained stage-wise results allow targeted diagnosis of whether failures occur at perception, understanding, or reasoning.
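
The stage-wise diagnosis in the last item can be sketched as a per-stage accuracy tally. This is a minimal illustration, not the benchmark's scoring code: the record layout (`stage`, `correct`), the function names `stage_accuracy` and `weakest_stage`, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict

# Stage names follow the three stages named in the text.
STAGES = ("perception", "understanding", "reasoning")


def stage_accuracy(records):
    """Return {stage: fraction correct} for each stage that has items."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["stage"]] += 1
        hits[rec["stage"]] += int(rec["correct"])
    return {s: hits[s] / totals[s] for s in STAGES if totals[s] > 0}


def weakest_stage(records):
    """Name the stage with the lowest accuracy (first one wins ties)."""
    acc = stage_accuracy(records)
    return min(acc, key=acc.get)


# Made-up results for one hypothetical model (not data from the paper).
sample = [
    {"stage": "perception", "correct": True},
    {"stage": "perception", "correct": True},
    {"stage": "understanding", "correct": True},
    {"stage": "understanding", "correct": False},
    {"stage": "reasoning", "correct": False},
]

print(stage_accuracy(sample))
print(weakest_stage(sample))
```

Reporting the per-stage dictionary rather than a single average is what lets a reader see whether a model breaks at perception, understanding, or reasoning.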

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be adapted to measure whether improvements in one stage transfer to others without retraining the entire model.
Using low-semantic stimuli to test primitive sensation might help separate memorized patterns from genuine cross-modal binding in other multimodal settings.
If future models close the gaps identified here, applications that depend on simultaneous sound and image understanding, such as scene analysis in video, would become more reliable.
The taxonomy offers a possible shared vocabulary for comparing audio-visual progress across different model families without relying solely on task accuracy numbers.

Load-bearing premise

The selected cross-modal tasks and unfamiliar stimuli accurately stand in for human-like audio-visual intelligence and generalization outside training data.

What would settle it

If a range of Omni-MLLMs achieve high accuracy across all three stages and the primitive-stimulus extension yet continue to fail on everyday audio-visual tasks that humans handle easily, or if humans score low on the same benchmark items, the claim that the benchmark diagnoses meaningful limitations would be undermined.

Figures

Figures reproduced from arXiv: 2606.07643 by Cheng Liang, Guangyao Li, Hao Fei, Henghui Ding, Shaoxuan Xu, Weijun Wang, Wenjie Du, Wenming Tu, Yaoting Wang, Yuanchao Li, Yuanchun Li, Yunxin Liu, Ziyi Zhang.

**Figure 1.** The AVI taxonomy and what it reveals. AVI-Bench arranges audio-visual intelligence into four nested levels: per-task performance (Task Adaptive, Section 5.1), cross-modal balance (Modal Adaptive, Section 5.2), cognitive-stage composition (Stage Adaptive, Section 5.3), and unfamiliar-domain adaptation (Domain Adaptive, Section 5.4). Each level isolates a distinct failure mode hidden by aggregate evaluation. …

**Figure 2.** Data samples spanning the three cognitively inspired stages of AVI-Bench: perception, understanding, and reasoning. Furthermore, we introduce AVI-Bench-PriSe, an extension aimed at evaluating whether Omni-MLLMs exhibit human-like audio-visual capabilities by adapting to unfamiliar and low-semantic data.

**Figure 3.** Heatmap showing the rankings of Omni-MLLMs across different stages. Darker red indicates higher rankings and stronger performance. The Gemini series consistently demonstrates strong performance throughout AVI-Bench. Among open-source models, the Qwen-2.5-Omni series also exhibits notable AVI.

Adjacent text from the paper (Section 5.2, Level-2: Modality-Adaptive Intelligence): As mentioned in Section 4.2, our Observation 3 reveals a pronounced d…

**Figure 4.** Task scores per model across different evaluation stages. Zoom in for better visualization.

Adjacent text from the paper: Shapiro-Wilk test statistic and corresponding p-values for the performance scores of the perception, understanding, reasoning, primitive sensation stages, and the average performance. The results indicate that, with the exception of Understanding (p = 0.041, marginally below the 0.05 threshold but well above the con…

**Figure 5.** Task ranks per model across different evaluation stages. Zoom in for better visualization.

Adjacent text from the paper: … ranging from 0.725 to 0.973) with highly significant p-values (all near zero), indicating that performance across individual stages is closely aligned with the overall model performance. These findings demonstrate the consistency and relevance of the model’s capabilities across different stages. E.3. Stability and Re…
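
The Figure 5 excerpt reports correlations between per-stage and overall performance without naming the coefficient in the visible text. As a rough illustration of how such a stage-versus-overall check could be run, the sketch below computes a Spearman rank correlation in plain Python. The choice of Spearman, the function names, and the score lists are assumptions for the example, not values or methods confirmed by the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five models (illustrative only).
stage_scores = [0.62, 0.48, 0.55, 0.30, 0.41]
overall_scores = [0.58, 0.50, 0.47, 0.33, 0.40]

print(spearman(stage_scores, overall_scores))
```

A rank-based coefficient is a natural fit when the figure itself compares model ranks, but a Pearson correlation on raw scores would be an equally plausible reading of the excerpt.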

**Figure 6.** Visualized comparison of absolute and relative modality imbalance metrics among example data points.

Adjacent text from the paper:
• For Model C, the difference (0.2) represents ∼22% of Audio (0.9) and ∼29% of Vision (0.7), which is relatively moderate.
• For Model D, the difference (0.2) represents ∼66% of Audio (0.3) and ∼200% of Vision (0.1), which indicates a substantially greater imbalance.
This shows that using ∆m provides …
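
The arithmetic in the Figure 6 excerpt can be reproduced with a few lines of Python. The sketch only restates the caption's worked numbers: the same absolute audio-vision gap is expressed as a fraction of each modality's score. The function names are hypothetical, and this is not the paper's definition of ∆m, which the excerpt does not show.

```python
def absolute_gap(audio, vision):
    """Absolute difference between the audio and vision scores."""
    return abs(audio - vision)


def relative_gaps(audio, vision):
    """The absolute gap as a fraction of each modality's own score."""
    gap = absolute_gap(audio, vision)
    return gap / audio, gap / vision


# Scores taken from the Figure 6 excerpt: (audio, vision).
models = {"Model C": (0.9, 0.7), "Model D": (0.3, 0.1)}

for name, (audio, vision) in models.items():
    rel_audio, rel_vision = relative_gaps(audio, vision)
    print(
        f"{name}: gap={absolute_gap(audio, vision):.1f}, "
        f"{rel_audio:.1%} of audio, {rel_vision:.1%} of vision"
    )
```

Both models share the same absolute gap of 0.2, yet the gap is a far larger share of Model D's scores, which is the point the excerpt makes about absolute versus relative imbalance.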

**Figure 7.** Visualized comparison of using harmonic mean and ∆-based penalty to calculate Level-4 score.

Adjacent text from the paper: Crucially, c_t is determined by the structure of task t itself, not by the cohort of evaluated models, so adding new models in future work does not retroactively change reported scores. G.2.3. Level-4: Domain-Adaptive. Definition: Assesses the model’s ability to adapt its familiar-domain capabilities to unfamiliar-do…
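
To make the contrast in Figure 7 concrete, the sketch below combines a familiar-domain score and an unfamiliar-domain score in two ways: a harmonic mean, and an arithmetic mean reduced by a gap penalty. The harmonic mean is the standard formula. The penalty form, its `weight` parameter, and the example scores are hypothetical stand-ins chosen for illustration, since the paper's ∆-based formula is not reproduced in this excerpt.

```python
def harmonic_mean(familiar, unfamiliar):
    """Standard harmonic mean of two non-negative scores (0 if either is 0)."""
    if familiar <= 0 or unfamiliar <= 0:
        return 0.0
    return 2 * familiar * unfamiliar / (familiar + unfamiliar)


def gap_penalised_mean(familiar, unfamiliar, weight=0.5):
    """HYPOTHETICAL penalty: arithmetic mean minus weight * |gap|, floored at 0.

    This is an illustrative stand-in, not the paper's Delta-based score.
    """
    mean = (familiar + unfamiliar) / 2
    return max(0.0, mean - weight * abs(familiar - unfamiliar))


# Illustrative (familiar, unfamiliar) score pairs, not results from the paper.
pairs = [(0.6, 0.6), (0.8, 0.4), (0.9, 0.1)]

for familiar, unfamiliar in pairs:
    print(
        familiar,
        unfamiliar,
        round(harmonic_mean(familiar, unfamiliar), 3),
        round(gap_penalised_mean(familiar, unfamiliar), 3),
    )
```

Both aggregates leave a balanced pair untouched and pull the score down as the familiar and unfamiliar results diverge; they differ in how steeply they do so, which is the kind of trade-off the figure compares.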

**Figure 8.** AVI-Bench construction pipeline. The media data collected online is assigned as familiar domain data with high semantics, while the manually constructed media data is considered unfamiliar domain data with low semantics. Both types will undergo manual verification, and for the online collected data, re-annotation and organization will be required as necessary.

Adjacent text from the paper: … used training domain, we construct the dataset…

Original abstract

Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AVI-Bench adds a three-stage cross-modal evaluation plus a low-semantic PriSe extension, but the abstract gives no task construction details or human baselines, so the claim of measuring human-like AVI stays unverified.

The new piece here is AVI-Bench itself: a three-stage setup that splits audio-visual tasks into perception, understanding, and reasoning, plus the PriSe add-on that uses unfamiliar low-semantic stimuli to test generalization. That structure is not just another video QA set; it tries to give a diagnostic taxonomy at the end.

What works is the intent to move past single-modality tests and look at joint interpretation failures across open and closed models. If the full paper shows concrete task examples and consistent scoring, this could be a practical tool for people iterating on Omni-MLLMs.

The soft spot is exactly the one the stress-test flags. The abstract says the tasks are cognitively inspired and that experiments show substantial limitations, yet it supplies no human performance numbers, no ablation on stimulus familiarity, and no description of how the cross-modal items were built or validated. Without those, it is impossible to tell whether model failures reflect missing AVI or just tasks that are hard for other reasons. The four-level taxonomy therefore rests on an assumption that has not been checked in the provided text.

This is the kind of paper that belongs in a reading group focused on evaluation methods. A serious editor should send it to referees so the methods section can be examined; the idea is worth testing even if the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The paper claims to introduce AVI-Bench, a cognitively inspired benchmark evaluating Omni-MLLMs on audio-visual intelligence through three stages of perception, understanding, and reasoning using cross-modal tasks. It further proposes AVI-Bench-PriSe to probe primitive sensation with unfamiliar stimuli, reports extensive experiments showing substantial limitations in current models, and derives a four-level AVI taxonomy to guide future development.

Significance. If the benchmark's tasks are confirmed as appropriate proxies for human-like audio-visual intelligence and generalization, the work offers a systematic framework for diagnosing model shortcomings and a taxonomy that could inform the design of more capable Omni-MLLMs. The inclusion of both open- and closed-source models strengthens the empirical scope.

major comments (2)

[Abstract] The assertion that 'extensive experiments ... reveal substantial limitations' is not supported by details on task construction, scoring metrics, model selection criteria, or statistical controls, which are necessary to verify the central claim of limitations in human-like AVI.

[Benchmark design] The validity of the three-stage tasks and AVI-Bench-PriSe's unfamiliar low-semantic stimuli as proxies for human-like AVI and out-of-distribution generalization is not substantiated by human baselines, cognitive validation studies, or ablations demonstrating that model failures reflect capability gaps rather than benchmark artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of clarity and validation. We address each major comment below.

Referee: [Abstract] The assertion that 'extensive experiments ... reveal substantial limitations' is not supported by details on task construction, scoring metrics, model selection criteria, or statistical controls, which are necessary to verify the central claim of limitations in human-like AVI.

Authors: The abstract is a concise summary; the full manuscript details task construction in Section 3, scoring metrics in Section 4.2, model selection criteria in Section 5.1, and statistical controls in Section 5.3 with results across models. These sections directly support the reported limitations. We will revise the abstract to include a brief reference to the evaluation framework for improved standalone readability. (Revision: partial)

Referee: [Benchmark design] The validity of the three-stage tasks and AVI-Bench-PriSe's unfamiliar low-semantic stimuli as proxies for human-like AVI and out-of-distribution generalization is not substantiated by human baselines, cognitive validation studies, or ablations demonstrating that model failures reflect capability gaps rather than benchmark artifacts.

Authors: The three-stage structure and AVI-Bench-PriSe draw directly from cognitive models of audio-visual processing, as described in Sections 2 and 3, with low-semantic stimuli chosen to probe generalization beyond training distributions. Consistent failure patterns across open- and closed-source models indicate capability gaps. We acknowledge the value of human baselines and will add further ablations (e.g., stimulus variation tests) in revision to strengthen artifact exclusion, while noting that full human validation studies fall outside the current model-focused scope. (Revision: partial)

Circularity Check

0 steps flagged

No significant circularity; external benchmark framework

The paper introduces AVI-Bench as an independent evaluation framework with three stages (perception/understanding/reasoning) and the PriSe extension using unfamiliar stimuli; no equations, fitted parameters, self-referential predictions, or derivation chains are present. Claims of model limitations rest on experimental application of this benchmark rather than any reduction to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are quoted or evident. The structure is a standard benchmark paper whose central evaluation is self-contained against external model testing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark construction paper; no free parameters, mathematical axioms, or invented entities are introduced or fitted.

pith-pipeline@v0.9.1-grok · 5779 in / 1054 out tokens · 30484 ms · 2026-06-28T14:34:53.733998+00:00 · methodology


2023

[26] [26]

2024 , eprint=

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? , author=. 2024 , eprint=

2024

[27] [27]

arXiv preprint arXiv:2501.15111 , year=

HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding , author=. arXiv preprint arXiv:2501.15111 , year=

arXiv

[28] [28]

arXiv preprint arXiv:2410.08565 , year=

baichuan-omni: To Understand the World with Omni-modality , author=. arXiv preprint arXiv:2410.08565 , year=

arXiv

[29] [29]

5-omni technical report , author=

Qwen2. 5-omni technical report , author=. arXiv preprint arXiv:2503.20215 , year=

Pith/arXiv arXiv

[30] [30]

2024 , eprint=

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models , author=. 2024 , eprint=

2024

[31] [31]

2025 , eprint=

OmniBench: Towards The Future of Universal Omni-Language Models , author=. 2025 , eprint=

2025

[32] [32]

arXiv preprint arXiv:2410.12219 , year=

Omnixr: Evaluating omni-modality language models on reasoning across modalities , author=. arXiv preprint arXiv:2410.12219 , year=

arXiv

[33] [33]

2022 , isbn =

Yang, Pinci and Wang, Xin and Duan, Xuguang and Chen, Hong and Hou, Runze and Jin, Cong and Zhu, Wenwu , title =. 2022 , isbn =. doi:10.1145/3503161.3548291 , booktitle =

work page doi:10.1145/3503161.3548291 2022

[34] [34]

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year =

Learning to Answer Questions in Dynamic Audio-Visual Scenarios , author =. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year =

[35] [35]

arXiv preprint arXiv:2405.03272 , year=

Worldqa: Multimodal world knowledge in videos through long-chain reasoning , author=. arXiv preprint arXiv:2405.03272 , year=

arXiv

[36] [36]

arXiv preprint arXiv:2503.12605 , year=

Multimodal chain-of-thought reasoning: A comprehensive survey , author=. arXiv preprint arXiv:2503.12605 , year=

Pith/arXiv arXiv

[37] [37]

Journal of Artificial General Intelligence , volume=

Artificial general intelligence: concept, state of the art, and future prospects , author=. Journal of Artificial General Intelligence , volume=. 2014 , publisher=

2014

[38] [38]

Nature Communications , volume=

Towards artificial general intelligence via a multimodal foundation model , author=. Nature Communications , volume=. 2022 , publisher=

2022

[39] [39]

2023 , publisher=

Sparks of artificial general intelligence: Early experiments with gpt-4 , author=. 2023 , publisher=

2023

[40] [40]

2007 , publisher=

Artificial general intelligence , author=. 2007 , publisher=

2007

[41] [41]

Advances in neural information processing systems , volume=

Superglue: A stickier benchmark for general-purpose language understanding systems , author=. Advances in neural information processing systems , volume=

[42] [42]

Forty-first International Conference on Machine Learning , year=

Chatbot arena: An open platform for evaluating llms by human preference , author=. Forty-first International Conference on Machine Learning , year=

[43] [43]

arXiv preprint arXiv:2110.14168 , year=

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

Pith/arXiv arXiv

[44] [44]

arXiv preprint arXiv:2107.03374 , year=

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

Pith/arXiv arXiv

[45] [45]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

[46] [46]

Proceedings of the European conference on computer vision (ECCV) , pages=

Audio-visual event localization in unconstrained videos , author=. Proceedings of the European conference on computer vision (ECCV) , pages=

[47] [47]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Localizing visual sounds the hard way , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[48] [48]

European Conference on Computer Vision , pages=

Audio--visual segmentation , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022

[49] [49]

European Conference on Computer Vision , pages=

Ref-avs: Refer and segment objects in audio-visual scenes , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024

[50] [50]

Nature reviews neuroscience , volume=

Multisensory integration: current issues from the perspective of the single neuron , author=. Nature reviews neuroscience , volume=. 2008 , publisher=

2008

[51] [51]

The Auditory Cortex - Neuroscience - NCBI Bookshelf , author=

[52] [52]

The Visual Cortex - Neuroscience - NCBI Bookshelf , author=

[53] [53]

Annual review of vision science , volume=

The organization and operation of inferior temporal cortex , author=. Annual review of vision science , volume=. 2018 , publisher=

2018

[54] [54]

Current Biology , volume=

Multimodal spatial representations engaged in human parietal cortex during both saccadic and manual spatial orienting , author=. Current Biology , volume=. 2003 , publisher=

2003

[55] [55]

Neuropsychopharmacology , volume=

The role of prefrontal cortex in cognitive control and executive function , author=. Neuropsychopharmacology , volume=. 2022 , publisher=

2022

[56] [56]

IEEE Open Journal of Signal Processing , year=

AVCaps: An Audio-visual Dataset with Modality-specific Captions , author=. IEEE Open Journal of Signal Processing , year=

[57] [57]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

Valor: Vision-audio-language omni-perception pretraining model and dataset , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

[58] [58]

arXiv preprint arXiv:2502.04328 , year=

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment , author=. arXiv preprint arXiv:2502.04328 , year=

arXiv

[59] [59]

arXiv preprint arXiv:2501.15368 , year=

Baichuan-Omni-1.5 Technical Report , author=. arXiv preprint arXiv:2501.15368 , year=

arXiv

[60] [60]

arXiv preprint arXiv:2503.05379 , year=

R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning , author=. arXiv preprint arXiv:2503.05379 , year=

arXiv

[61] [61]

arXiv preprint arXiv:2503.01743 , year=

Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras , author=. arXiv preprint arXiv:2503.01743 , year=

Pith/arXiv arXiv

[62] [62]

arXiv preprint arXiv:2505.04921 , year=

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models , author=. arXiv preprint arXiv:2505.04921 , year=

arXiv

[63] [63]

arXiv preprint arXiv:2505.04620 , year=

On Path to Multimodal Generalist: General-Level and General-Bench , author=. arXiv preprint arXiv:2505.04620 , year=

arXiv

[64] [64]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[65] [65]

arXiv preprint arXiv:2307.16125 , year=

Seed-bench: Benchmarking multimodal llms with generative comprehension , author=. arXiv preprint arXiv:2307.16125 , year=

Pith/arXiv arXiv

[66] [66]

arXiv preprint arXiv:2410.19168 , year=

Mmau: A massive multi-task audio understanding and reasoning benchmark , author=. arXiv preprint arXiv:2410.19168 , year=

Pith/arXiv arXiv

[67] [67]

European Conference on Computer Vision , pages=

Audio-visual mismatch-aware video retrieval via association and adjustment , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022

[68] [68]

European Conference on Computer Vision , pages=

Localizing visual sounds the easy way , author=. European Conference on Computer Vision , pages=. 2022 , organization=

2022

[69] [69]

IEEE Access , volume=

A survey of audio classification using deep learning , author=. IEEE Access , volume=. 2023 , publisher=

2023

[70] [70]

Advances in neural information processing systems , volume=

Unsupervised feature learning for audio classification using convolutional deep belief networks , author=. Advances in neural information processing systems , volume=

[71] [71]

International journal of Remote sensing , volume=

A survey of image classification methods and techniques for improving classification performance , author=. International journal of Remote sensing , volume=. 2007 , publisher=

2007

[72] [72]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

I2mvformer: Large language model generated multi-view document supervision for zero-shot image classification , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[73] [73]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

What does a platypus look like? generating customized prompts for zero-shot image classification , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[74] [74]

ACM Transactions on Multimedia Computing, Communications and Applications , volume=

Variational autoencoder with cca for audio--visual cross-modal retrieval , author=. ACM Transactions on Multimedia Computing, Communications and Applications , volume=. 2023 , publisher=

2023

[75] [75]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Pano-avqa: Grounded audio-visual question answering on 360deg videos , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[76] [76]

2024 , isbn =

Wang, Yaoting and Liu, Weisong and Li, Guangyao and Ding, Jian and Hu, Di and Li, Xi , title =. 2024 , isbn =. doi:10.1609/aaai.v38i6.28378 , booktitle =

work page doi:10.1609/aaai.v38i6.28378 2024

[77] [77]

European Conference on Computer Vision , pages=

Can Textual Semantics Mitigate Sounding Object Segmentation Preference? , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024

[78] [78]

arXiv preprint arXiv:2406.12793 , year=

Chatglm: A family of large language models from glm-130b to glm-4 all tools , author=. arXiv preprint arXiv:2406.12793 , year=

Pith/arXiv arXiv

[79] [79]

arXiv preprint arXiv:2502.00358 , year=

Do Audio-Visual Segmentation Models Truly Segment Sounding Objects? , author=. arXiv preprint arXiv:2502.00358 , year=

arXiv

[80] [80]

arXiv preprint arXiv:2407.00634 , year=

Tarsier: Recipes for training and evaluating large video description models , author=. arXiv preprint arXiv:2407.00634 , year=

arXiv