When Vision Speaks for Sound

Muhao Chen; Peng Qi; Rui Cai; Tinghui Zhu; Wendi Li; Wenjie Jacky Mo; Xiaofei Wen; Xingyu Fu; Yanan Xie

arxiv: 2605.16403 · v1 · pith:HN3GWO7Xnew · submitted 2026-05-13 · 💻 cs.CV · cs.SD

Xiaofei Wen, Wenjie Jacky Mo, Xingyu Fu, Rui Cai, Tinghui Zhu, Wendi Li, Yanan Xie, Muhao Chen, Peng Qi

Pith reviewed 2026-05-20 22:09 UTC · model grok-4.3

classification 💻 cs.CV cs.SD

keywords multimodal large language models · audio-visual understanding · Clever Hans effect · counterfactual interventions · video QA · model alignment · audio verification

The pith

Video models often infer or hallucinate sounds from visual cues rather than verifying the audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that video-capable multimodal models frequently base their understanding of sound on what they see in the visuals instead of checking the actual audio track. This creates an audio-visual Clever Hans effect where models appear to process sound but instead exploit typical correlations between images and expected noises. The authors introduce the Thud framework to test this through three audio interventions: shifting timing, muting the sound, and swapping with mismatched audio. They also present a two-stage training method using preference pairs from these edits plus general video preferences, which raises average scores on the three tests by 28 percentage points while maintaining or slightly improving results on standard benchmarks.

Core claim

Models rely on visual cues to infer or hallucinate acoustic information rather than verifying the audio stream. This audio-visual Clever Hans effect is diagnosed using the Thud framework based on Shift, Mute, and Swap counterfactual audio edits. A two-stage alignment recipe using intervention-derived preference pairs improves average performance across these dimensions by 28 percentage points.

What carries the argument

The audio-visual Clever Hans effect diagnosed through Thud, an intervention-driven probing framework that applies three counterfactual audio edits (Shift to test temporal synchronization, Mute to test sound existence, Swap to test audio-visual consistency), combined with a two-stage alignment recipe that teaches verification via preference pairs while regularizing with general video preferences.
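To make the three edits concrete, here is a minimal sketch of what Shift, Mute, and Swap could look like on a raw mono waveform held as a list of float samples. The function names, the zero-padding policy, and the sample-level offset are assumptions made for illustration; the page does not describe how the authors implement the edits.

```python
from typing import List


def shift_audio(samples: List[float], offset: int) -> List[float]:
    """Shift: delay (offset > 0) or advance (offset < 0) the track by
    `offset` samples, zero-padding so the length stays unchanged."""
    n = len(samples)
    if offset >= 0:
        k = min(offset, n)
        return [0.0] * k + samples[: n - k]
    k = min(-offset, n)
    return samples[k:] + [0.0] * k


def mute_audio(samples: List[float]) -> List[float]:
    """Mute: replace the track with digital silence of the same length."""
    return [0.0] * len(samples)


def swap_audio(samples: List[float], donor: List[float]) -> List[float]:
    """Swap: replace the track with audio from a different (mismatched) clip,
    trimmed or zero-padded to the original length."""
    n = len(samples)
    return donor[:n] + [0.0] * max(n - len(donor), 0)
```

In all three cases the video frames are left untouched, so any change in a model's answer has to come from the audio stream.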

If this is right

Systematic testing with temporal shifts, muting, and swaps reveals whether models truly ground responses in audio or default to visual correlations.
Training on preference pairs derived from these interventions teaches audio verification while avoiding over-specialization.
Gains on the three intervention tests translate to better handling of real audio-visual misalignment cases.
General video and audio-visual QA performance stays stable or improves slightly after the two-stage alignment.
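The preference-pair idea in the list above can be sketched as a small data structure. Everything here is hypothetical: the response templates, the `PreferencePair` type, and the choice of a fixed visually plausible answer as the rejected response are assumptions, since the page does not say how rejected responses are produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response that verifies the (edited) audio
    rejected: str  # response that follows the visual prior


# Hypothetical templates, one per intervention type.
VERIFIED_RESPONSES = {
    "shift": "The audio is not synchronized with the visual event.",
    "mute": "There is no audible sound in this clip.",
    "swap": "The audio does not match what is shown on screen.",
}

# Hypothetical visually driven answer that ignores the edit.
VISUAL_PRIOR_RESPONSE = "The sound matches the visual event and is in sync."


def build_pair(intervention: str, question: str) -> PreferencePair:
    """Build one chosen/rejected pair for an intervened clip."""
    if intervention not in VERIFIED_RESPONSES:
        raise ValueError(f"unknown intervention: {intervention!r}")
    return PreferencePair(
        prompt=question,
        chosen=VERIFIED_RESPONSES[intervention],
        rejected=VISUAL_PRIOR_RESPONSE,
    )
```

Pairs of this shape would then be mixed with general video preference data during preference optimization, which is the regularization role the recipe assigns to the second data source.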

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Strong visual-audio correlations in existing training data likely encourage models to shortcut true multimodal verification.
The same reliance on cross-modal correlations may appear in other multimodal setups beyond video and sound.
Extending the intervention method to more varied or layered mismatches could probe deeper levels of audio understanding.

Load-bearing premise

The three counterfactual audio edits isolate the model's ability to verify audio without introducing new biases or changing visual processing independently of the intended audio check.

What would settle it

A model that performs at similar levels on videos with shifted, muted, or swapped audio as on correctly aligned audio, by correctly identifying the mismatches, would show it is verifying the audio stream rather than relying on visuals.
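One way to operationalize this test is to compare accuracy on original clips with accuracy on intervened clips and report the difference. The sketch below assumes string labels such as "synced", "shifted", "muted", and "swapped"; these label names and the function names are illustrative, not taken from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that equal their label."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def intervention_gap(
    orig_preds: Sequence[str],
    orig_labels: Sequence[str],
    edit_preds: Sequence[str],
    edit_labels: Sequence[str],
) -> float:
    """Accuracy on original clips minus accuracy on intervened clips.
    A gap near zero is what an audio-verifying model would show."""
    return accuracy(orig_preds, orig_labels) - accuracy(edit_preds, edit_labels)
```

A model that always answers "synced" would score perfectly on original clips and fail on every intervened clip, giving a gap of 1.0.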

Figures

Figures reproduced from arXiv: 2605.16403 by Muhao Chen, Peng Qi, Rui Cai, Tinghui Zhu, Wendi Li, Wenjie Jacky Mo, Xiaofei Wen, Xingyu Fu, Yanan Xie.

**Figure 1.** When vision speaks for sound. Given the same visual event but different audio tracks, current video-capable models produce nearly identical captions, suggesting visual-prior shortcutting rather than audio-grounded understanding. We find that current video-capable MLLMs are often visually dominated when reasoning about audio-related information in sounded videos.

**Figure 2.** Representative failure cases under Shift, Mute, and Swap interventions. Gemini and Qwen3-Omni often rely on visual priors rather than verifying the audio stream, leading to missed temporal shifts, hallucinated sounds, and visually biased predictions.

**Figure 3.** Failure-mode heatmap. Red indicates higher failure; audio hallucination dominates, while temporal failures are model-specific. Overall, most models show large drops from Original to intervention settings, indicating that strong performance on naturally correlated videos is fragile. MiniCPM-o-4.5 and MiMo-V2.5 have the largest gaps, 80.7% and 78.4%. Qwen3-Omni is diagnostic: its perfect original temporal-…

**Figure 4.** Figure 4 decomposes each model’s predictions on the three intervention tasks. On Mute and Swap, almost all errors collapse onto Hallucinated synced, with five of six models fabricating matching audio on over 80% of muted clips and the mismatched class recovered at most 37% of the time. Hallucinated shift is negligible everywhere, indicating that models hold a strong synced prior and rarely entertain temporal altern…

**Figure 5.** Difficulty-band robustness. Smaller offsets are harder; our model remains robust while baselines collapse under desynchronization. We next ask whether targeted intervention training can improve temporal grounding without hurting general capabilities. Starting from Qwen3-Omni-30B, we compare alignment recipes using original synchronization preferences, self-sampled negatives, counterfactual temporal pr…

**Figure 6.** Complementary synchronization results. Left: model accuracy on binary synchronization, three-way temporal classification, and direction prediction. Right: the fraction of samples whose predicted offset is close to the ground-truth temporal displacement. … suggests that counterfactual temporal supervision supplies the grounding signal, while FineVideo and LLaVA-Video preferences regularize the model toward br…

**Figure 7.** Beyond temporal synchronization. Combined Mute and Swap accuracy over original and intervened conditions.

**Figure 8.** Intervention-control tradeoff. Top-left indicates strong intervention detection with few false alarms on original controls. Native Omni Models and Cross-Modal Shortcuts: Recent frontier multimodal models are shifting from frame-centric video-language pipelines toward native multimodal or omni-modal processing, where video, audio, images, and text are handled through a unified interface or architecture [26,…

**Figure 9.** Pipeline for intervention data construction. We create Shift, Mute, and Swap variants from source videos with salient acoustic events, annotate visual/audio events and timestamps via cross-model verification with human review, and construct chosen–rejected preference pairs for training. The bottom panel shows a representative Shift example.

**Figure 10.** Two-stage intervention-driven alignment pipeline. Counterfactual intervention data is first used for SFT warm-up, and intervention preference pairs are then mixed with general video data during preference optimization. This design encourages audio-verified responses while preserving general video understanding.

Original abstract

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Video MLLMs often rely on visuals to infer audio rather than verifying the sound, with a proposed fix that may need tighter controls on the probes.

The main point to take away is that many video multimodal large language models are not truly grounding their responses in the audio track. Instead, they seem to be using visual information to guess or hallucinate what the sound should be. The authors frame this as an audio-visual Clever Hans effect and demonstrate it in both open and closed models.

What the paper does well is to introduce a probing framework they call Thud. It relies on three targeted audio edits: shifting the audio to break synchronization, muting it entirely to test for sound existence, and swapping it to test for content consistency. These interventions help measure how much the models depend on visual cues versus actual audio verification. They also present a two-stage alignment approach. The first stage uses preference pairs derived from the interventions to encourage audio checking. The second stage adds event-level general video preferences to prevent the model from becoming too specialized on the probe tasks. With a relatively small set of 10,000 samples, this recipe lifts average performance on the three dimensions by 28 percentage points and maintains or slightly improves results on general video and audio-visual QA benchmarks. This is a useful contribution because it provides both a diagnostic tool and a mitigation strategy for a potential reliability issue in current systems.

On the softer side, there is a question about whether the counterfactual edits cleanly isolate the audio verification capability. The stress test raises a fair point that low-level signal changes from these edits could be detected by audio-only mechanisms without involving visual cross-checking. For example, a mute edit removes energy, a shift alters temporal patterns, and a swap changes the acoustic content in ways that might trigger quality or anomaly detectors independently. If the models are responding to these artifacts, the gains from the preference alignment could be explained by learning to disregard flawed audio rather than by developing genuine multimodal integration. The abstract does not include error bars or detailed ablation on this, so the robustness is not fully clear from the high-level description.

This paper would be of interest to researchers working on multimodal models, particularly those concerned with evaluation and alignment for audio-visual tasks. A reader focused on identifying and fixing shortcuts in large models would find practical value here. I recommend putting it through peer review. The core finding about the Clever Hans effect in video MLLMs is important enough to warrant expert feedback on the methods and results.

Referee Report

2 major / 2 minor

Summary. The paper claims that video-capable multimodal large language models (MLLMs) exhibit an 'audio-visual Clever Hans effect,' relying on visual cues to infer or hallucinate acoustic information instead of verifying the audio stream. This is diagnosed across open- and closed-source models using the Thud probing framework, which applies three counterfactual audio edits (Shift for temporal synchronization, Mute for sound existence, and Swap for audio-visual consistency). The authors further propose a two-stage alignment recipe—intervention-derived preference pairs to teach audio verification, regularized by event-level general video preferences—that yields a 28 percentage point average improvement on the three Thud dimensions while slightly improving general video and audio-visual QA benchmarks.

Significance. If the central claims hold after addressing the isolation of the edits, the work would be significant for highlighting a systematic limitation in current MLLMs' cross-modal grounding, with direct implications for reliable video understanding applications. The intervention-driven diagnosis and preference-based mitigation recipe are constructive strengths, offering a falsifiable test and a practical training method that avoids over-specialization. The systematic evaluation across multiple model classes adds value, though the overall impact hinges on confirming that gains reflect genuine audio verification rather than learning to ignore edit artifacts.

major comments (2)

[Thud framework] Thud framework (counterfactual edits section): The claim that Shift, Mute, and Swap isolate audio verification capability is load-bearing for the Clever Hans diagnosis, yet the manuscript provides no controls showing that models cannot detect these edits via audio-only heuristics. Mute removes all audio energy (detectable by simple RMS thresholds), Shift alters temporal statistics (potentially flagged by audio feature extractors), and Swap replaces content (possibly caught by quality or spectral heuristics). Without audio-only baseline results or attention analysis on the edited streams, poor baseline performance could reflect generic audio shortfalls rather than vision-driven inference.
[Alignment recipe and evaluation] Two-stage alignment recipe and results: The reported 28-point average gain on Shift/Mute/Swap is presented without error bars, number of runs, or ablation on the 10K preference pairs versus general preferences. It is therefore unclear whether the improvement teaches cross-modal verification or merely suppresses responses to the specific artifacts introduced by the edits. This directly affects the claim that the recipe acquires 'genuine audio verification.'
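The first major comment notes that a Mute edit is detectable by a simple RMS threshold. A minimal sketch of such an audio-only check follows; the function names and the `1e-4` threshold are illustrative assumptions, and the check uses no visual information at all.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square energy of a waveform; 0.0 for an empty track."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def looks_muted(samples: Sequence[float], threshold: float = 1e-4) -> bool:
    """Audio-only heuristic: flag the track as muted when its RMS energy
    falls below `threshold` (an assumed value, not one from the paper)."""
    return rms(samples) < threshold
```

Because this heuristic never looks at the video, a model that learned something similar could score well on Mute without any cross-modal verification, which is why the referee asks for audio-only baselines.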

minor comments (2)

[Abstract] Abstract: The statement that the recipe 'slightly improves' general benchmarks should specify the exact benchmarks and delta values to allow readers to assess any trade-offs.
[Methods] The manuscript would benefit from a clearer description of how the preference pairs are constructed from the interventions (e.g., which model generates the rejected responses and the exact prompt templates).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the robustness of the Thud framework and the alignment recipe. We address each major point below and commit to revisions that strengthen the evidence for genuine audio verification without overclaiming current results.

Referee: [Thud framework] Thud framework (counterfactual edits section): The claim that Shift, Mute, and Swap isolate audio verification capability is load-bearing for the Clever Hans diagnosis, yet the manuscript provides no controls showing that models cannot detect these edits via audio-only heuristics. Mute removes all audio energy (detectable by simple RMS thresholds), Shift alters temporal statistics (potentially flagged by audio feature extractors), and Swap replaces content (possibly caught by quality or spectral heuristics). Without audio-only baseline results or attention analysis on the edited streams, poor baseline performance could reflect generic audio shortfalls rather than vision-driven inference.

Authors: We agree that explicit controls are necessary to rule out audio-only detection of the edits. The Thud interventions target distinct failure modes (temporal misalignment, absence of sound, and cross-modal inconsistency), but without audio-only baselines it remains possible that models exploit low-level audio cues. In the revised manuscript we will add audio-only evaluations on the edited streams (prompting models with audio alone) to quantify how much of the performance drop is attributable to vision-driven inference versus generic audio heuristics. We will also report attention visualizations where available to illustrate cross-modal reliance. These additions directly address the isolation concern while preserving the core Clever Hans diagnosis. revision: yes
Referee: [Alignment recipe and evaluation] Two-stage alignment recipe and results: The reported 28-point average gain on Shift/Mute/Swap is presented without error bars, number of runs, or ablation on the 10K preference pairs versus general preferences. It is therefore unclear whether the improvement teaches cross-modal verification or merely suppresses responses to the specific artifacts introduced by the edits. This directly affects the claim that the recipe acquires 'genuine audio verification.'

Authors: We acknowledge that the current presentation lacks statistical rigor and ablations, which limits confidence that gains reflect true audio verification rather than artifact suppression. In the revised manuscript we will include error bars computed over multiple training runs with different random seeds, and add an ablation comparing (i) intervention-derived pairs alone, (ii) general video preferences alone, and (iii) the full two-stage recipe. These results will be reported on both the Thud dimensions and held-out general video/audio-visual QA benchmarks to demonstrate that the recipe improves verification without degrading broader capabilities. revision: yes
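The error bars the authors commit to can be computed from per-seed scores with the standard library. This is a sketch only: the function name is hypothetical, and it reports the sample standard deviation and standard error rather than whatever interval the revised manuscript may use.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error) of a metric
    measured once per training seed. Needs at least two runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    m = mean(scores)
    s = stdev(scores)
    return m, s, s / math.sqrt(len(scores))
```

Reporting the same summary for each ablation arm (intervention pairs alone, general preferences alone, full recipe) would let readers judge whether differences between arms exceed seed-to-seed variation.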

Circularity Check

0 steps flagged

No circularity: empirical diagnosis via external interventions and standard preference training

The paper's core contribution is an empirical diagnosis of vision-driven audio inference using three externally defined counterfactual edits (Shift for temporal sync, Mute for sound existence, Swap for consistency) collected into the Thud framework, followed by a two-stage training recipe that generates preference pairs from those same edits plus general video preferences. No equations, self-definitional loops, or fitted parameters are invoked that would make the reported 28-point gain equivalent to the inputs by construction; the interventions are defined independently of model outputs, and performance gains reflect observable changes after training rather than tautological renaming or self-citation chains. The work is self-contained against external benchmarks and does not rely on load-bearing self-citations for its central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard assumptions about multimodal model behavior and the validity of counterfactual edits; no major free parameters or invented entities are introduced beyond the named interventions and recipe.

axioms (1)

domain assumption Counterfactual audio edits isolate audio verification without confounding visual changes
Invoked when defining Shift, Mute, and Swap as tests of synchronization, existence, and consistency.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We introduce THUD, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 13 internal anchors

[1]

Analyzing the behavior of visual question answering models

Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering models. In Jian Su, Kevin Duh, and Xavier Carreras, editors,Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1955–1960, Austin, Texas, November 2016. Association for Computational Linguistics

work page 2016
[2]

Self- supervised learning by cross-modal audio-video clustering.ArXiv, abs/1911.12667, 2019

Humam Alwassel, Dhruv Kumar Mahajan, Lorenzo Torresani, Bernard Ghanem, and Du Tran. Self- supervised learning by cross-modal audio-video clustering.ArXiv, abs/1911.12667, 2019

work page arXiv 1911
[3]

Look, listen and learn.2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017

Relja Arandjelovi´c and Andrew Zisserman. Look, listen and learn.2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017

work page 2017
[4]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Thomas Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova Dassarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, John Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as ...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova Dassarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Thomas Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

Ami Baid, Zihui Xue, and Kristen Grauman. Don’t let the video speak: Audio-contrastive preference optimization for audio-visual language models.arXiv preprint arXiv:2604.14129, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Buch, Cristobal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles

S. Buch, Cristobal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the “video” in video-language understanding.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2907–2917, 2022

work page 2022
[8]

Diagnosing and mitigating modality interference in multimodal large language models.ArXiv, abs/2505.19616, 2025

Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, and Zhe Zhao. Diagnosing and mitigating modality interference in multimodal large language models.ArXiv, abs/2505.19616, 2025

work page arXiv 2025
[9]

Quo vadis, action recognition? a new model and the kinetics dataset

João Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017

work page 2017
[10]

Audio-visual synchronization in the wild

Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. Audio-visual synchronization in the wild. InBMVC, 2021

work page 2021
[11]

Vggsound: A large-scale audio-visual dataset.ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2020

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset.ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2020

work page 2020
[12]

Christiano, Jan Leike, Tom B

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Roman Garnett, editors,Advances in Neural Information Processing Systems 30: Annual Conference on Neural ...

work page 2017
[13]

Minicpm-o 4.5: Towards real-time full-duplex omni-modal interaction.CoRR, 2026

Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyu Sun, Yingjin Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuohang Lin, Hanyu Liu, Qi Gui, Qing-Yuan Han, Yuyang Wen, Huiping Liu, Rongkang Wang, Yaqi Zhang, Hong- Rui Wei, Chi Chen, You Li, Kechen Fang, Jie Zhou, Yuxuan Li, Guoyang...

work page 2026
[14]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Proce...

work page 2023
[15]

Oops! predicting unintentional action in video.2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 916–926, 2019

Dave Epstein, Boyuan Chen, and Carl V ondrick. Oops! predicting unintentional action in video.2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 916–926, 2019. 10

work page 2020
[16]

Finevideo.https: //huggingface.co/datasets/HuggingFaceFV/finevideo, 2024

Miquel Farré, Andi Marafioti, Lewis Tunstall, Leandro V on Werra, and Thomas Wolf. Finevideo.https: //huggingface.co/datasets/HuggingFaceFV/finevideo, 2024

work page 2024
[17]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InI...

work page 2025
[18]

BLINK: Multimodal Large Language Models Can See but Not Perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. ArXiv, abs/2404.12390, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Zemel, Wieland Brendel, Matthias Bethge, and Felix A

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard S. Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks.Nat. Mach. Intell., 2(11):665– 673, 2020

work page 2020
[20]

Gemmeke, Daniel P

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events.2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780, 2017

work page 2017
[21]

Imagebind one embedding space to bind them all.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, 2023

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind one embedding space to bind them all.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, 2023

work page 2023
[22]

Gemini 3.https://deepmind.google/models/gemini/, 2026

Google DeepMind. Gemini 3.https://deepmind.google/models/gemini/, 2026

work page 2026
[23]

Making the V in VQA matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325–6334. IEEE Computer Society, 2017

work page 2017
[24]

Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InIEEE/CVF Conference on Computer Vision and Pattern Recogniti...

work page 2024
[25]

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms.CoRR, abs/2502.04326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Chat-univi: Unified vi- sual representation empowers large language models with image and video understanding

Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding.arXiv preprint arXiv:2311.08046, 2023

work page arXiv 2023
[28]

Cooperative learning of audio and video models from self-supervised synchronization

Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Neural Information Processing Systems, 2018

[29]

Unmasking clever hans predictors and assessing what machines really learn

Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking clever hans predictors and assessing what machines really learn. Nature Communications, 10, 2019

[30]

Videochat: chat-centric video understanding

Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wen Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: chat-centric video understanding. Science China Information Sciences, 68, 2023

[31]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Lou, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, 2024

[32]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Lou, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 22195–22206. IEEE, 2024

[33]

Baichuan-omni-1.5 technical report

Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, et al. Baichuan-omni-1.5 technical report. arXiv preprint arXiv:2501.15368, 2025

[34]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 292–305. Association f...

[35]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, p...

[36]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023

[37]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, ...

[38]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems...

[39]

Audio-visual instance discrimination with cross-modal agreement

Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12470–12481, 2020

[40]

Robust audio-visual instance discrimination

Pedro Miguel Morgado, Ishan Misra, and Nuno Vasconcelos. Robust audio-visual instance discrimination. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12929–12940, 2021

[41]

OpenAI GPT-5 System Card

OpenAI. OpenAI GPT-5 system card. CoRR, abs/2601.03267, 2026

[42]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

[43]

Andrew Owens and Alexei A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In European Conference on Computer Vision, 2018

[44]

Perception test: A diagnostic benchmark for multimodal video models

Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens Continente, Larisa Markeeva, Dylan Sunil Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alexandre Fréchette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Di...

[45]

Clever Hans: (the horse of Mr. Von Osten.) a contribution to experimental animal and human psychology

Oskar Pfungst. Clever Hans: (the horse of Mr. Von Osten.) a contribution to experimental animal and human psychology. Holt, Rinehart and Winston, 1911

[46]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

[47]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14313–14323, 2024

[48]

Object hallucination in image captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium, October-November 2018. Assoc...

[49]

Do Audio-Visual Large Language Models Really See and Hear?

Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, and Dinesh Manocha. Do audio-visual large language models really see and hear? arXiv preprint arXiv:2604.02605, 2026

[50]

Learning to localize sound source in visual scenes

Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In-So Kweon. Learning to localize sound source in visual scenes. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4358–4366, 2018

[51]

Nikhil Singh, Chih-Wei Wu, Iroro Orife, and Mahdi M. Kalayeh. Looking similar, sounding different: Leveraging counterfactual cross-modal pairs for audiovisual representation learning. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26897–26908, 2023

[52]

AVHBench: A cross-modal hallucination benchmark for audio-visual large language models

Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, and Tae-Hyun Oh. AVHBench: A cross-modal hallucination benchmark for audio-visual large language models. In The Thirteenth International Conference on Learning Representations, 2025

[53]

Ming-omni: A unified multimodal model for perception and generation

Inclusion AI Team. Ming-omni: A unified multimodal model for perception and generation. CoRR, abs/2506.09344, 2025

[54]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL Team. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. CoRR, abs/2504.10479, 2025

[55]

Nemotron 3 nano omni: Efficient and open multimodal intelligence

Nemotron 3 Nano Omni Team. Nemotron 3 nano omni: Efficient and open multimodal intelligence. 2026

[56]

Qwen3-Omni Technical Report

Qwen Team. Qwen3-omni technical report. CoRR, abs/2509.17765, 2025

[57]

Qwen3-VL Technical Report

Qwen Team. Qwen3-vl technical report. CoRR, abs/2511.21631, 2025

[58]

Qwen3.5-omni technical report

Qwen Team. Qwen3.5-omni technical report, 2026

[59]

Winoground: Probing vision and language models for visio-linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 5228–5238. IEEE, 2022

[60]

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5228–5238, 2022

[61]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 9568–9578. IEEE, 2024

[62]

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark. CoRR, abs/2406.08035, 2024

[63]

Internvideo2: Scaling video foundation models for multimodal video understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, and Limin Wang. Internvideo2: Scaling video foundation models for multimodal video understanding. ArXiv, abs/2403.15377, 2024

[64]

Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models

Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models. ArXiv, abs/2406.16338, 2024

[65]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022

[66]

NExt-GPT: Any-to-any multimodal LLM

Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExt-GPT: Any-to-any multimodal LLM, 2024

[67]

Xiaomi mimo-v2.5: A leap in agency and multimodality

Xiaomi MiMo Team. Xiaomi mimo-v2.5: A leap in agency and multimodality. https://mimo.xiaomi.com/mimo-v2-5/, April 2026. Accessed: 2026-05-04

[68]

Mert Yüksekgönül, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

[69]

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Y. Zou. When and why vision-language models behave like bags-of-words, and what to do about it? ArXiv, abs/2210.01936, 2022

[70]

Anygpt: Unified multimodal llm with discrete sequence modeling

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9637–9662, 2024

[71]

Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Yansong Feng and Els Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, Singapore, December 2023. Association for Computational Linguistics

[72]

Direct preference optimization of video large multimodal models from language model reward

Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander G Hauptmann, Yonatan Bisk, and Yiming Yang. Direct preference optimization of video large multimodal models from language model reward. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Ame...

[73]

Video instruction tuning with synthetic data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data, 2024

[74]

Llava-video: Video instruction tuning with synthetic data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data. Trans. Mach. Learn. Res., 2025, 2025

[75]

Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities

Ziwei Zhou, Rui Wang, and Zuxuan Wu. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities. CoRR, abs/2505.17862, 2025

[76]

Omniguard: Unified omni-modal guardrails with deliberate reasoning

Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, and Muhao Chen. Omniguard: Unified omni-modal guardrails with deliberate reasoning. ArXiv, abs/2512.02306, 2025

[77]

Fine-Tuning Language Models from Human Preferences

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. CoRR, abs/1909.08593, 2019

[78]

Safety fine-tuning at (almost) no cost: A baseline for vision large language models

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy M. Hospedales. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. ArXiv, abs/2402.02207, 2024

[Figure residue; recoverable labels: Source Videos; Salient acoustic consequences; Shift; Mute; Swap; Temporal Displacement; early; delay; Physical interventions; Break natural audio-visual correlations ...]

1. Visual agreement: Gemini, GPT, and Claude localize the visual event within ϵ_v = 0.8 seconds, or select overlapping frame units.

2. Audio verification: the acoustic event is audible and its timestamp can be verified by human inspection within ϵ_a = 0.5 seconds of the Gemini prediction.

3. Event clarity: the visual event has a clear onset or peak moment, such as impact, fall, collision, breakage, or contact.


[23] [23]

Making the V in VQA matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325–6334. IEEE Computer Society, 2017

work page 2017

[24] [24]

Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InIEEE/CVF Conference on Computer Vision and Pattern Recogniti...

work page 2024

[25] [25]

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms.CoRR, abs/2502.04326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Chat-univi: Unified vi- sual representation empowers large language models with image and video understanding

Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding.arXiv preprint arXiv:2311.08046, 2023

work page arXiv 2023

[28] [28]

Cooperative learning of audio and video models from self-supervised synchronization

Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. InNeural Information Processing Systems, 2018

work page 2018

[29] [29]

Unmasking clever hans predictors and assessing what machines really learn.Nature Communications, 10, 2019

Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking clever hans predictors and assessing what machines really learn.Nature Communications, 10, 2019

work page 2019

[30] [30]

Videochat: chat-centric video understanding.Science China Information Sciences, 68, 2023

Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wen Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: chat-centric video understanding.Science China Information Sciences, 68, 2023

work page 2023

[31] [31]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Lou, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, 2024. 11

work page 2024

[32] [32]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Lou, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 22195–22206. IEEE, 2024

work page 2024

[33] [33]

Baichuan-omni-1.5 technical report

Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, et al. Baichuan-omni-1.5 technical report.arXiv preprint arXiv:2501.15368, 2025

work page arXiv 2025

[34] [34]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 292–305. Association f...

work page 2023

[35] [35]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, p...

work page 2024

[36] [36]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023

work page 2023

[37] [37]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, ...

work page 2024

[38] [38]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems...

work page 2023

[39] [39]

Audio-visual instance discrimination with cross- modal agreement.2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12470–12481, 2020

Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross- modal agreement.2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12470–12481, 2020

work page 2021

[40] [40]

Robust audio-visual instance discrimination

Pedro Miguel Morgado, Ishan Misra, and Nuno Vasconcelos. Robust audio-visual instance discrimination. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12929–12940, 2021

work page 2021

[41] [41]

OpenAI GPT-5 System Card

OpenAI. Openai GPT-5 system card.CoRR, abs/2601.03267, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[42] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

[43] Andrew Owens and Alexei A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In European Conference on Computer Vision, 2018

[44] Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens Continente, Larisa Markeeva, Dylan Sunil Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alexandre Fréchette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Di...

[45] Oskar Pfungst. Clever Hans: (the horse of Mr. Von Osten.) a contribution to experimental animal and human psychology. Holt, Rinehart and Winston, 1911

[46] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[47] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14313–14323, 2024

[48] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium, October-November 2018. Assoc...

[49] Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, and Dinesh Manocha. Do audio-visual large language models really see and hear? arXiv preprint arXiv:2604.02605, 2026

[50] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In-So Kweon. Learning to localize sound source in visual scenes. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4358–4366, 2018

[51] Nikhil Singh, Chih-Wei Wu, Iroro Orife, and Mahdi M. Kalayeh. Looking similar, sounding different: Leveraging counterfactual cross-modal pairs for audiovisual representation learning. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26897–26908, 2023

[52] Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, and Tae-Hyun Oh. AVHBench: A cross-modal hallucination benchmark for audio-visual large language models. In The Thirteenth International Conference on Learning Representations, 2025

[53] Inclusion AI Team. Ming-omni: A unified multimodal model for perception and generation. CoRR, abs/2506.09344, 2025

[54] InternVL Team. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. CoRR, abs/2504.10479, 2025

[55] Nemotron 3 Nano Omni Team. Nemotron 3 nano omni: Efficient and open multimodal intelligence. 2026

[56] Qwen Team. Qwen3-omni technical report. CoRR, abs/2509.17765, 2025

[57] Qwen Team. Qwen3-vl technical report. CoRR, abs/2511.21631, 2025

[58] Qwen Team. Qwen3.5-omni technical report, 2026

[59] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 5228–5238. IEEE, 2022

[60] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5228–5238, 2022

[61] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 9568–9578. IEEE, 2024

[62] Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark. CoRR, abs/2406.08035, 2024

[63] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, and Limin Wang. Internvideo2: Scaling video foundation models for multimodal video understanding. ArXiv, abs/2403.15377, 2024

[64] Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models. ArXiv, abs/2406.16338, 2024

[65] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022.

[66] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExt-GPT: Any-to-any multimodal LLM, 2024

[67] Xiaomi MiMo Team. Xiaomi mimo-v2.5: A leap in agency and multimodality. https://mimo.xiaomi.com/mimo-v2-5/, April 2026. Accessed: 2026-05-04

[68] Mert Yüksekgönül, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

[69] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Y. Zou. When and why vision-language models behave like bags-of-words, and what to do about it? ArXiv, abs/2210.01936, 2022

[70] Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9637–9662, 2024

[71] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Yansong Feng and Els Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, Singapore, December 2023. Association for Computational Linguistics

[72] Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander G Hauptmann, Yonatan Bisk, and Yiming Yang. Direct preference optimization of video large multimodal models from language model reward. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Ame...

[73] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data, 2024

[74] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data. Trans. Mach. Learn. Res., 2025, 2025

[75] Ziwei Zhou, Rui Wang, and Zuxuan Wu. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities. CoRR, abs/2505.17862, 2025

[76] Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, and Muhao Chen. Omniguard: Unified omni-modal guardrails with deliberate reasoning. ArXiv, abs/2512.02306, 2025

[77] Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. CoRR, abs/1909.08593, 2019

[78] Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy M. Hospedales. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. ArXiv, abs/2402.02207, 2024.

[Figure: Source Videos (salient acoustic consequences); interventions Shift, Mute, Swap; Temporal Displacement (early, delay); Physical interventions; Break natural audio-visual correlations ...]

1. Visual agreement: Gemini, GPT, and Claude localize the visual event within ϵ_v = 0.8 seconds, or select overlapping frame units.

2. Audio verification: the acoustic event is audible and its timestamp can be verified by human inspection within ϵ_a = 0.5 seconds of the Gemini prediction.

3. Event clarity: the visual event has a clear onset or peak moment, such as impact, fall, collision, breakage, or contact.
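The three criteria above can be read as a conjunctive filter over candidate clips. The following is a minimal sketch of such a filter, not the authors' implementation. The tolerances 0.8 s and 0.5 s come from the text; the `Candidate` record, its field names, and the reading of "within ϵ_v" as the spread between the earliest and latest model estimate are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

EPS_V = 0.8  # seconds; visual agreement tolerance stated in the text
EPS_A = 0.5  # seconds; audio verification tolerance stated in the text


@dataclass
class Candidate:
    """Hypothetical record for one candidate clip (field names are illustrative)."""
    visual_times: Sequence[float]      # visual-event timestamps, one per annotating model
    frames_overlap: bool               # True if the models selected overlapping frame units
    gemini_audio_time: float           # predicted acoustic-event timestamp
    human_audio_time: Optional[float]  # human-verified timestamp; None if not audible
    has_clear_onset: bool              # event clarity judgment


def visual_agreement(c: Candidate, eps: float = EPS_V) -> bool:
    # Assumption: "within eps" means earliest and latest estimates differ by at most eps.
    spread = max(c.visual_times) - min(c.visual_times)
    return spread <= eps or c.frames_overlap


def audio_verified(c: Candidate, eps: float = EPS_A) -> bool:
    # An inaudible event (no human timestamp) can never be verified.
    if c.human_audio_time is None:
        return False
    return abs(c.human_audio_time - c.gemini_audio_time) <= eps


def keep(c: Candidate) -> bool:
    # A clip is retained only if all three criteria hold.
    return visual_agreement(c) and audio_verified(c) and c.has_clear_onset
```

Under these assumptions, a clip whose three visual estimates span 0.7 s, whose human-verified audio timestamp lies 0.3 s from the prediction, and whose event has a clear onset would be kept; failing any one criterion discards it.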