arxiv: 2311.17005 · v4 · pith:JO2LRJNGnew · submitted 2023-11-28 · 💻 cs.CV

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao

Pith reviewed 2026-05-17 20:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-modal large language models · video understanding benchmark · temporal reasoning · MLLM evaluation · VideoChat2 · static-to-dynamic conversion · multiple-choice QA

The pith

Most multi-modal AI models struggle with temporal understanding in videos; a new benchmark exposes the gap, and a progressively trained baseline beats leading models on it by more than 15 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MVBench to test multi-modal large language models on twenty video tasks that require reasoning about change over time and cannot be solved from any single frame. It defines these tasks by converting established static image tasks into dynamic video versions, then automatically turns public video annotations into multiple-choice questions for evaluation. Results show that current leading models perform poorly on these temporal skills, while the authors' VideoChat2 baseline, built through progressive multi-modal training, outperforms them by a large margin. Because the benchmark reuses existing ground-truth labels instead of relying on LLM scoring, it can be built quickly and scored with less bias.

Core claim

MVBench covers twenty challenging video tasks that cannot be effectively solved with a single frame. These tasks are generated through a static-to-dynamic conversion that systematically produces examples requiring temporal skills from basic perception to higher cognition. Existing MLLMs remain far from satisfactory in temporal understanding, while VideoChat2 surpasses leading models by over fifteen percent on the benchmark.

What carries the argument

The static-to-dynamic method, which transforms static image tasks into dynamic video tasks that require a broad range of temporal skills from perception to cognition, paired with automatic conversion of public annotations into multiple-choice QA pairs.
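The annotation-to-question step can be pictured with a minimal sketch. This is not the authors' pipeline: the `MCQItem` type, the `annotation_to_mcq` helper, the flat label pool, and the distractor sampling rule are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class MCQItem:
    question: str
    options: list[str]
    answer_index: int  # position of the ground-truth label in options


def annotation_to_mcq(question, true_label, label_pool, num_options=4, seed=0):
    """Build one multiple-choice item from a ground-truth label.

    Assumption: distractors are drawn from the other labels in a
    hypothetical label pool, and the options are shuffled so the
    position of the correct answer carries no signal.
    """
    rng = random.Random(seed)
    candidates = sorted(set(label_pool) - {true_label})
    if len(candidates) < num_options - 1:
        raise ValueError("not enough distinct labels to build distractors")
    options = rng.sample(candidates, num_options - 1) + [true_label]
    rng.shuffle(options)
    return MCQItem(question, options, options.index(true_label))
```

Because the correct option comes straight from the existing annotation, scoring needs no LLM judge, only a comparison against `answer_index`.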

If this is right

Current MLLMs need explicit temporal training to handle real-world video content reliably.
Benchmarks built from reused annotations can scale evaluation of dynamic skills without heavy manual labeling.
VideoChat2's progressive training recipe provides a practical path to stronger temporal performance in video models.
Fairness in scoring improves when evaluation stays tied to original ground-truth labels rather than LLM judgments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use of MVBench could shift model development away from image-only pretraining toward sequence-aware architectures.
The same static-to-dynamic conversion idea might extend to other modalities such as audio or 3D scene understanding.
Longer video clips or open-ended questions could be added later to test whether the current gains hold for more complex narratives.

Load-bearing premise

Automatically turning public video annotations into multiple-choice questions accurately measures the intended temporal skills without creating annotation biases or letting models succeed via single-frame shortcuts.

What would settle it

A controlled test in which top models score nearly as high on MVBench after temporal order is randomly shuffled or timing cues are removed, showing the benchmark can be passed without genuine sequence understanding.
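A control of this kind could be sketched as follows. The `predict` callable, the item dictionary layout, and the helper names are hypothetical stand-ins, not part of the paper or its released code.

```python
import random


def shuffled(frames, seed=0):
    """Return a copy of the frame list in random order."""
    rng = random.Random(seed)
    out = list(frames)
    rng.shuffle(out)
    return out


def accuracy(predict, items, transform=lambda frames: frames):
    """Fraction of items where predict returns the ground-truth option index.

    Assumption: each item is a dict with "frames", "question",
    "options" and "answer" (an option index).
    """
    correct = sum(
        predict(transform(it["frames"]), it["question"], it["options"]) == it["answer"]
        for it in items
    )
    return correct / len(items)


def temporal_sensitivity(predict, items, seed=0):
    """Compare accuracy on ordered frames against randomly reordered frames."""
    ordered = accuracy(predict, items)
    scrambled = accuracy(predict, items, lambda f: shuffled(f, seed))
    return ordered, scrambled, ordered - scrambled
```

A gap near zero on a task would suggest that the model is not using frame order there; a large gap is consistent with, though not proof of, genuine sequence understanding.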

Original abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that, the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
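Scoring against ground-truth options, rather than with an LLM judge, reduces to parsing an option letter from each response. The sketch below is one naive way to do that; the parsing rule and the function names are assumptions, not the evaluation code from the authors' repository.

```python
import re


def extract_choice(response, num_options):
    """Return the 0-based option index named at the start of a response, or None.

    Assumption: a valid answer starts with an option letter, optionally
    in parentheses, followed by ')', '.', ':' or the end of the text.
    """
    letters = "ABCDEFGH"[:num_options]
    match = re.match(rf"\(?([{letters}])(?:[).:]|$)", response.strip())
    return letters.index(match.group(1)) if match else None


def mcq_accuracy(responses, answers, num_options=4):
    """Fraction of responses whose parsed option equals the ground-truth index."""
    hits = sum(extract_choice(r, num_options) == a for r, a in zip(responses, answers))
    return hits / len(answers)
```

Unparseable responses count as wrong here, which is a design choice; a stricter or more lenient parser would shift the reported accuracy.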

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MVBench gives a practical static-to-dynamic way to generate 20 temporal video tasks and shows VideoChat2 beating other MLLMs by over 15%, but the claim that these tasks truly require multi-frame reasoning rests on untested assumptions.

The core contribution here is the static-to-dynamic conversion that turns established image tasks into video ones focused on temporal skills. This produces a benchmark with 20 tasks spanning perception to cognition, built efficiently from public annotations without heavy manual work. They also release VideoChat2, trained progressively on diverse data, which pulls ahead of leading models by more than 15% on the new set. That gap is the main empirical result worth noting.

Referee Report

2 major / 2 minor

Summary. The paper introduces MVBench, a benchmark with 20 video tasks for assessing temporal understanding in multi-modal large language models (MLLMs). Tasks are created via a static-to-dynamic conversion method applied to public video annotations, which are then automatically transformed into multiple-choice QA pairs. The authors also propose VideoChat2, a video MLLM trained with progressive multi-modal instruction tuning, and report that existing MLLMs perform poorly on temporal tasks while VideoChat2 outperforms them by over 15% on the new benchmark.

Significance. If the tasks genuinely isolate temporal reasoning, MVBench would provide a valuable, scalable diagnostic for video MLLMs that current static-image benchmarks do not address. The automatic annotation-conversion pipeline and open release of models, data, and code at the GitHub repository are strengths that support reproducibility and community follow-up. The approach of deriving dynamic tasks from established static ones offers a systematic way to cover perception-to-cognition temporal skills.

major comments (2)

[§3] §3 (Task Definition and static-to-dynamic method): The central claim that the 20 tasks 'cannot be effectively solved with a single frame' is load-bearing for interpreting MVBench as a temporal-understanding benchmark, yet the manuscript provides no single-frame baselines, static-cue ablations, or human validation of shortcut resistance. Without these controls, performance differences could reflect exploitation of frame-level appearance or annotation patterns rather than dynamics, directly affecting the interpretation of VideoChat2's >15% gain.
[§5] §5 (Experiments and results): The reported superiority of VideoChat2 is shown only on MVBench; adding comparisons against the same models on established video benchmarks (e.g., those already testing temporal reasoning) would strengthen the claim that the improvement reflects genuine advances in temporal capability rather than benchmark-specific tuning.
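The single-frame control requested in the first major comment could be summarized per task along these lines. This is a sketch under stated assumptions: the record format, the `per_task_gap` helper, and the 0.05 margin are illustrative choices, not values or code from the paper.

```python
from collections import defaultdict


def per_task_gap(records, margin=0.05):
    """Summarize multi-frame versus single-frame accuracy for each task.

    records: iterable of (task, multi_frame_correct, single_frame_correct),
    one entry per question. A task is flagged when the single-frame run
    comes within `margin` of the multi-frame run (or beats it).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # questions, multi hits, single hits
    for task, multi_ok, single_ok in records:
        entry = totals[task]
        entry[0] += 1
        entry[1] += bool(multi_ok)
        entry[2] += bool(single_ok)

    report = {}
    for task, (n, multi, single) in totals.items():
        gap = multi / n - single / n
        report[task] = {
            "multi": multi / n,
            "single": single / n,
            "gap": gap,
            "shortcut_suspect": gap < margin,
        }
    return report
```

Tasks flagged as `shortcut_suspect` would be the ones where frame-level appearance may be enough, which is exactly the case the referee asks the authors to rule out.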

minor comments (2)

[Abstract] The abstract and §1 could preview the exact average score and per-task range for the 15% improvement to give readers an immediate sense of effect size.
[Figures] Figure captions and task examples would benefit from explicit indication of which visual cues are static versus dynamic to help readers quickly grasp the conversion procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of MVBench as a diagnostic for temporal understanding in video MLLMs, as well as the strengths in reproducibility. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Referee: [§3] §3 (Task Definition and static-to-dynamic method): The central claim that the 20 tasks 'cannot be effectively solved with a single frame' is load-bearing for interpreting MVBench as a temporal-understanding benchmark, yet the manuscript provides no single-frame baselines, static-cue ablations, or human validation of shortcut resistance. Without these controls, performance differences could reflect exploitation of frame-level appearance or annotation patterns rather than dynamics, directly affecting the interpretation of VideoChat2's >15% gain.

Authors: We agree that empirical validation is necessary to substantiate the claim that the tasks require temporal reasoning rather than static cues. The static-to-dynamic conversion is constructed so that each task explicitly depends on temporal information (e.g., ordering of events or changes across frames) that is absent from any individual frame. Nevertheless, to strengthen the manuscript, we will add single-frame baselines for all 20 tasks, which will quantify the performance drop when temporal context is removed. We will also include a brief analysis of potential annotation patterns and how the automatic multiple-choice QA generation, grounded in public video annotations, reduces the risk of exploitable shortcuts.
revision: yes

Referee: [§5] §5 (Experiments and results): The reported superiority of VideoChat2 is shown only on MVBench; adding comparisons against the same models on established video benchmarks (e.g., those already testing temporal reasoning) would strengthen the claim that the improvement reflects genuine advances in temporal capability rather than benchmark-specific tuning.

Authors: While the primary contribution is the introduction of MVBench to expose limitations in existing MLLMs on temporal tasks, we acknowledge that cross-benchmark evaluation would better contextualize VideoChat2's gains. In the revised manuscript we will report results for VideoChat2 and the compared models on additional established video benchmarks that emphasize temporal reasoning, thereby clarifying whether the observed improvements generalize beyond MVBench.
revision: yes

Circularity Check

0 steps flagged

Benchmark construction relies on external annotations and explicit transformation method with no self-referential reduction

The paper defines MVBench tasks via a static-to-dynamic conversion of public video annotations into MCQA pairs and reports empirical model scores including a >15% gain for VideoChat2. No equation, parameter fit, or derivation reduces to its own inputs by construction; the temporal-requirement claim follows directly from the stated transformation procedure rather than a loop, and results are obtained by running models on the generated benchmark. The methodology is self-contained against external data sources and does not invoke load-bearing self-citations or uniqueness theorems that collapse the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the assumption that public video annotations contain sufficient temporal information and that the transformation preserves task validity without new fitted parameters or invented entities.

axioms (1)

domain assumption: Public video annotations can be reliably converted to multiple-choice QA without loss of temporal information or introduction of bias.
Invoked in the automatic conversion step described in the abstract.
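One narrow slice of this assumption, whether the position of the correct option is itself a cue, is easy to probe. The sketch below assumes every question has the same number of options, which may not hold for the actual benchmark; the function name and return format are illustrative.

```python
from collections import Counter


def position_bias(answer_indices, num_options):
    """Report how often each option position holds the correct answer.

    "always_guess_best" is the accuracy of a blind strategy that always
    picks the most frequent position; it should stay close to "chance"
    if the generated questions carry no positional shortcut.
    """
    counts = Counter(answer_indices)
    n = len(answer_indices)
    shares = [counts.get(i, 0) / n for i in range(num_options)]
    return {
        "shares": shares,
        "always_guess_best": max(shares),
        "chance": 1 / num_options,
    }
```

This only covers positional bias; biases in distractor wording or label frequency would need separate checks.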

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FCMBench-Video: Benchmarking Document Video Intelligence
cs.CV 2026-04 unverdicted novelty 7.0
FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
cs.CV 2026-04 unverdicted novelty 7.0
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV 2026-03 unverdicted novelty 7.0
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
cs.CV 2025-04 unverdicted novelty 7.0
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.

MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

Co-Evolving Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0
MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0
MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

QoS-QoE Translation with Large Language Model
cs.MM 2026-04 unverdicted novelty 6.0
A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.

VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
cs.CV 2026-04 conditional novelty 6.0
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
cs.CV 2026-01 unverdicted novelty 6.0
HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.

TempCompass: Do Video LLMs Really Understand Videos?
cs.CV 2024-03 unverdicted novelty 6.0
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.

VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
cs.CV 2026-05 unverdicted novelty 5.0
Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
cs.CV 2024-07 conditional novelty 5.0
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
cs.CV 2024-04 conditional novelty 5.0
A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.

EasyVideoR1: Easier RL for Video Understanding
cs.CV 2026-04 unverdicted novelty 4.0
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 16 Pith papers · 24 internal anchors

[1]

Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andy Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj Binkow...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv, abs/2308.12966, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisser- man. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021. 6, 8

work page 2021
[4]

Ali Furkan Biten, Rub `en P ´erez Tito, Andr ´es Mafla, Llu ´ıs G´omez, Marc ¸al Rusi˜nol, Ernest Valveny, C. V . Jawahar, and Dimosthenis Karatzas. Scene text visual question answer- ing. In ICCV, 2019. 6

work page 2019
[5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020. 2

work page 2020
[6]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021. 6

work page 2021
[7]

Chen and William B

David L. Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011. 2

work page 2011
[8]

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elho- seiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. ArXiv, abs/2310.09478, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Ke Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. ArXiv, abs/2306.15195, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Shazeer, Vinodkumar Prab- hakaran, Emily Reif, Nan Du, Benton C

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebas- tian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prab- hakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope...

work page
[11]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pas- cale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023. 2, 6, 7, 8

work page 2023
[12]

Fu, Stefano Ermon, Atri Rudra, and Christopher R´e

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022. 9

work page 2022
[13]

Doell, and Jason J

Pradipto Das, Chenliang Xu, Richard F. Doell, and Jason J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR, 2013. 6

work page 2013
[14]

Imagenet: A large-scale hierarchical im- age database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical im- age database. In CVPR, 2009. 6

work page 2009
[15]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding. ArXiv, abs/1810.04805, 2018. 1, 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

Xia, Mehdi S

Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Florence. Palm-e: An e...

work page 2023
[17]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for mul- timodal large language models. ArXiv, abs/2306.13394,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Violet: End-to- end video-language transformers with masked visual-token modeling

Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to- end video-language transformers with masked visual-token modeling. ArXiv, abs/2111.12681, 2021. 10

work page arXiv 2021
[19]

Mist : Multi-modal iterative spatial-temporal transformer for long-form video question answering

Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yezhou Yang, and Mike Zheng Shou. Mist : Multi-modal iterative spatial-temporal transformer for long-form video question answering. In CVPR, 2022. 10

work page 2022
[20]

Gao, Chen Sun, Zhenheng Yang, and Ramakant Nevatia

J. Gao, Chen Sun, Zhenheng Yang, and Ramakant Nevatia. Tall: Temporal activity localization via language query. In ICCV, 2017. 3, 12

work page 2017
[21]

Multimodal-gpt: A vision and lan- guage model for dialogue with humans

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qianmengke Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vi- sion and language model for dialogue with humans. ArXiv, abs/2305.04790, 2023. 2

work page arXiv 2023
[22]

something something

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fr ¨und, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017. 2, 6, 9

work page 2017
[23]

Making the v in vqa matter: El- evating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: El- evating the role of image understanding in visual question answering. In CVPR, 2017. 2, 6

work page 2017
[24]

Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jack- son Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh K. Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z. Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, S...

work page 2022
[25]

Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen

J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR,

work page
[26]

Language Is Not All You Need: Aligning Perception with Language Models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language is not all you need: Aligning perception with language models. ArXiv, abs/2302.14045, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Hudson and Christopher D

Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 6

work page 2019
[28]

Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gun- hee Kim

Y . Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gun- hee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017. 6

work page 2017
[29]

Mistral 7B

Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, L’elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth ´ee Lacroix, and William El Sayed. Mistral 7b. ArXiv, abs/2...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Lawrence Zitnick, and Ross B

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Gir- shick. Clevr: A diagnostic dataset for compositional lan- guage and elementary visual reasoning. In CVPR, 2017. 4, 6

work page 2017
[31]

The Kinetics Human Action Video Dataset

Will Kay, Jo ˜ao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Apostol Natsev, Mustafa Suley- man, and Andrew Zisserman. The kinetics human action video dataset. ArXiv, abs/1705.06950, 2017. 2

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

Beyond the nav-graph: Vision-and- language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Ba- tra, and Stefan Lee. Beyond the nav-graph: Vision-and- language navigation in continuous environments. InECCV,

work page
[33]

A hierarchical approach for generating descriptive image paragraphs

Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. In CVPR, 2017. 6

work page 2017
[34]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 6

work page 2017
[35]

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. Tvqa: Localized, compositional video question answering. In EMNLP, 2018. 3, 7, 10, 12

work page 2018
[36]

Moreno, and Jes ´us Lov´on-Melgarejo

Paul Lerner, Olivier Ferret, Camille Guinaudeau, Herv ´e Le Borgne, Romaric Besanc ¸on, Jos´e G. Moreno, and Jes ´us Lov´on-Melgarejo. Viquae, a dataset for knowledge-based visual question answering about named entities. In SIGIR,

work page
[37]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. ArXiv, abs/2307.16125, 2023. 2, 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. ArXiv, abs/2305.03726,

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2022. 1, 6, 7

work page 2022
[40]

Inten- tqa: Context-aware video intent reasoning

Jiapeng Li, Ping Wei, Wenjuan Han, and Lifeng Fan. Inten- tqa: Context-aware video intent reasoning. 2023. 7, 10

work page 2023
[41]

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Y . Qiao. Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer.ArXiv, abs/2211.09552, 2022. 6, 10

work page arXiv 2022
[42]

VideoChat: Chat-Centric Video Understanding

Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wen Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. ArXiv, abs/2305.06355, 2023. 1, 2, 5, 6, 7, 8, 9, 10, 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Unmasked teacher: Towards training-efficient video foundation models

Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In ICCV, 2023. 2, 6, 8, 9, 10, 12

work page 2023
[44]

M3it: A large-scale dataset towards multi-modal multilingual instruction tun- ing

Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, and Qi Liu. M3it: A large-scale dataset towards multi-modal multilingual instruction tun- ing. ArXiv, abs/2306.04387, 2023. 5

work page arXiv 2023
[45]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji rong Wen. Evaluating object hallucination in large vision-language models. ArXiv, abs/2305.10355,

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 6, 8

[47]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 2, 6, 7, 10

[48]

Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding

Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C Kot. Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. TPAMI, 2020. 3, 12

[49]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuanzhan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? ArXiv, abs/2307.06281, 2023. 1, 2, 3, 5, 8, 9

[50]

Valley: Video assistant with large language model enhanced ability

Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Ming-Hui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. ArXiv, abs/2306.07207, 2023. 2

[51]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. ArXiv, abs/2306.05424, 2023. 2, 6, 7, 8, 10, 11

[52]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. ArXiv, abs/2308.09126, 2023. 7, 10

[53]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019. 2, 6

[54]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, R. Manmatha, and C. V. Jawahar. Docvqa: A dataset for vqa on document images. In WACV, 2021. 6

[55]

Ocr-vqa: Visual question answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 6

[56]

Spoken moments: Learning joint audio-visual representations from video descriptions

Mathew Monfort and SouYoung Jin. Spoken moments: Learning joint audio-visual representations from video descriptions. In CVPR, 2021. 7

[57]

Moments in time dataset: One million videos for event understanding

Mathew Monfort, Bolei Zhou, Sarah Adel Bargal, Alex Andonian, Tom Yan, Kandan Ramakrishnan, Lisa M. Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, and Aude Oliva. Moments in time dataset: One million videos for event understanding. TPAMI, 2020. 3, 12

[58]

OpenAI. Chatgpt. https://openai.com/blog/chatgpt/, 2023. 1, 4, 5, 8, 10

[59]

Gpt-4v(ision) system card

OpenAI. Gpt-4v(ision) system card. https://api.semanticscholar.org/CorpusID:263218031,

[60]

Im2text: Describing images using 1 million captioned photographs

Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. In NeurIPS, 2011. 6

[61]

Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Mateusz Malinowski, Yezhou Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Skanda Koppula, Alexander Fréchette, Hanna Klimczak, R. Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andr...

[62]

Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV,

[63]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020. 2

[64]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In ECCV, 2022. 6

[65]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL,

[66]

Textcaps: a dataset for image captioning with reading comprehension

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In ECCV, 2020. 6

[67]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR,

[68]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Yu Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. ArXiv, abs/2303.15389, 2023. 8

[69]

Visualmrc: Machine reading comprehension on document images

Ryota Tanaka, Kyosuke Nishida, and Sen Yoshida. Visualmrc: Machine reading comprehension on document images. In AAAI, 2021. 6

[70]

Internlm: A multilingual language model with progressively enhanced capabilities

InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023. 2

[71]

Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality

Vicuna Team. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality. https://vicuna.lmsys.org/, 2023. 1, 6, 8

[72]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023. 1, 7, 8

[73]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goy...

[74]

All in one: Exploring unified video-language pre-training

Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. In CVPR, 2023. 10

[75]

Temporal segment networks: Towards good practices for deep action recognition

Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016. 9

[76]

Videomae v2: Scaling video masked autoencoders with dual masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In CVPR, 2023. 9

[77]

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. ArXiv, abs/2212.03191, 2022. 10

[78]

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Jian Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Y. Qiao. Internvid: A large-scale video-text dataset for multimodal understanding and generation. ArXiv, 2023. 6

[79]

Paxion: Patching action knowledge in video-language foundation models

Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, and Heng Ji. Paxion: Patching action knowledge in video-language foundation models. In NeurIPS, 2023. 3, 9, 12

[80]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In ICLR, 2021. 2

