MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Pith reviewed 2026-05-17 20:18 UTC · model grok-4.3
The pith
Most multi-modal AI models handle temporal understanding in videos poorly. The paper introduces a benchmark that exposes this gap and a baseline model, VideoChat2, that beats leading models on that benchmark by more than 15 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MVBench covers twenty challenging video tasks that cannot be effectively solved with a single frame. These tasks are generated through a static-to-dynamic conversion that systematically produces examples requiring temporal skills from basic perception to higher cognition. Existing MLLMs remain far from satisfactory in temporal understanding, while VideoChat2 surpasses leading models by over fifteen percent on the benchmark.
What carries the argument
The static-to-dynamic method, which transforms static image tasks into dynamic video tasks so that the benchmark covers a broad range of temporal skills from perception to cognition, paired with automatic conversion of public annotations into multiple-choice QA pairs.
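As a rough illustration of the annotation-to-QA step, the sketch below builds one multiple-choice item from a ground-truth label and a pool of other labels. The function name `make_mcq`, its parameters, and the option-sampling strategy are assumptions made for this example; the paper's actual conversion pipeline may select questions and distractors differently.

```python
import random

LETTERS = "ABCDEFGH"

def make_mcq(answer, distractor_pool, question, num_options=4, seed=0):
    """Build one multiple-choice item from a ground-truth annotation label.

    Hypothetical helper: distractors are sampled uniformly from the other
    labels, which is an assumption and not the paper's documented procedure.
    """
    if not 2 <= num_options <= len(LETTERS):
        raise ValueError("num_options must be between 2 and 8")
    rng = random.Random(seed)
    # Deduplicate the pool (order preserved) and drop the correct answer.
    candidates = [c for c in dict.fromkeys(distractor_pool) if c != answer]
    if len(candidates) < num_options - 1:
        raise ValueError("not enough distinct distractors")
    options = rng.sample(candidates, num_options - 1) + [answer]
    rng.shuffle(options)  # avoid a fixed answer position
    letters = LETTERS[:num_options]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(answer)],
    }

item = make_mcq(
    answer="opening a door",
    distractor_pool=["closing a door", "opening a door", "holding a door",
                     "knocking on a door", "walking past a door"],
    question="What is the person doing in the video?",
)
```

Because the correct option letter comes directly from the annotation, scoring needs no LLM judge: a prediction is right only if it matches `item["answer"]`.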
If this is right
- Current MLLMs need explicit temporal training to handle real-world video content reliably.
- Benchmarks built from reused annotations can scale evaluation of dynamic skills without heavy manual labeling.
- VideoChat2's progressive training recipe provides a practical path to stronger temporal performance in video models.
- Fairness in scoring improves when evaluation stays tied to original ground-truth labels rather than LLM judgments.
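The last point, scoring against ground-truth labels, can be sketched as exact-match accuracy over option letters. The record layout and the unweighted average across tasks are assumptions for illustration; the task names in the usage lines are made up.

```python
from collections import defaultdict

def per_task_accuracy(records):
    """Compute exact-match accuracy per task and the unweighted task average.

    records: iterable of (task_name, predicted_letter, gold_letter) tuples.
    The tuple layout is a hypothetical format chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    if not total:
        raise ValueError("no records to score")
    per_task = {task: correct[task] / total[task] for task in total}
    # Unweighted mean over tasks, so large tasks do not dominate (an assumption).
    average = sum(per_task.values()) / len(per_task)
    return per_task, average

scores, avg = per_task_accuracy([
    ("action_sequence", "A", "A"),
    ("action_sequence", "B", "C"),
    ("moving_direction", "D", "D"),
])
```

No judge model appears anywhere in this loop, which is the property the bullet refers to.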
Where Pith is reading between the lines
- Widespread use of MVBench could shift model development away from image-only pretraining toward sequence-aware architectures.
- The same static-to-dynamic conversion idea might extend to other modalities such as audio or 3D scene understanding.
- Longer video clips or open-ended questions could be added later to test whether the current gains hold for more complex narratives.
Load-bearing premise
Automatically turning public video annotations into multiple-choice questions accurately measures the intended temporal skills without creating annotation biases or letting models succeed via single-frame shortcuts.
What would settle it
A controlled test in which top models score nearly as high on MVBench after temporal order is randomly shuffled or timing cues are removed, showing the benchmark can be passed without genuine sequence understanding.
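A minimal sketch of such a control follows, assuming each benchmark item is a dict with `frames`, `question`, `options`, and `answer` keys and that a model is exposed as a callable returning an option letter. Both the item layout and the `model_fn` interface are hypothetical.

```python
import random

def accuracy(model_fn, items):
    """Fraction of items where the model's letter matches the gold letter."""
    hits = sum(
        model_fn(it["frames"], it["question"], it["options"]) == it["answer"]
        for it in items
    )
    return hits / len(items)

def temporal_sensitivity(model_fn, items, seed=0):
    """Accuracy on ordered frames minus accuracy on randomly shuffled frames.

    A gap near zero suggests the items can be answered without using
    frame order; a large positive gap suggests order matters.
    """
    rng = random.Random(seed)
    shuffled_items = []
    for it in items:
        frames = list(it["frames"])
        rng.shuffle(frames)  # destroy temporal order, keep frame content
        shuffled_items.append({**it, "frames": frames})
    return accuracy(model_fn, items) - accuracy(model_fn, shuffled_items)

# Toy stand-ins for real models: frames are integers for illustration only.
def order_blind(frames, question, options):
    return "A" if 0 in frames else "B"

def order_aware(frames, question, options):
    return "A" if frames == sorted(frames) else "B"

toy_items = [
    {"frames": [0, 1, 2, 3, 4, 5], "question": "q", "options": {}, "answer": "A"}
    for _ in range(20)
]
blind_gap = temporal_sensitivity(order_blind, toy_items)
aware_gap = temporal_sensitivity(order_aware, toy_items)
```

The order-blind toy model shows a gap of exactly zero, since shuffling does not change which frames are present. A real run would replace the toy callables with actual model inference.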
Original abstract
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that, the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MVBench, a benchmark with 20 video tasks for assessing temporal understanding in multi-modal large language models (MLLMs). Tasks are created via a static-to-dynamic conversion method applied to public video annotations, which are then automatically transformed into multiple-choice QA pairs. The authors also propose VideoChat2, a video MLLM trained with progressive multi-modal instruction tuning, and report that existing MLLMs perform poorly on temporal tasks while VideoChat2 outperforms them by over 15% on the new benchmark.
Significance. If the tasks genuinely isolate temporal reasoning, MVBench would provide a valuable, scalable diagnostic for video MLLMs that current static-image benchmarks do not address. The automatic annotation-conversion pipeline and open release of models, data, and code at the GitHub repository are strengths that support reproducibility and community follow-up. The approach of deriving dynamic tasks from established static ones offers a systematic way to cover perception-to-cognition temporal skills.
major comments (2)
- [§3] §3 (Task Definition and static-to-dynamic method): The central claim that the 20 tasks 'cannot be effectively solved with a single frame' is load-bearing for interpreting MVBench as a temporal-understanding benchmark, yet the manuscript provides no single-frame baselines, static-cue ablations, or human validation of shortcut resistance. Without these controls, performance differences could reflect exploitation of frame-level appearance or annotation patterns rather than dynamics, directly affecting the interpretation of VideoChat2's >15% gain.
- [§5] §5 (Experiments and results): The reported superiority of VideoChat2 is shown only on MVBench; adding comparisons against the same models on established video benchmarks (e.g., those already testing temporal reasoning) would strengthen the claim that the improvement reflects genuine advances in temporal capability rather than benchmark-specific tuning.
minor comments (2)
- [Abstract] The abstract and §1 could preview the exact average score and per-task range for the 15% improvement to give readers an immediate sense of effect size.
- [Figures] Figure captions and task examples would benefit from explicit indication of which visual cues are static versus dynamic to help readers quickly grasp the conversion procedure.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential value of MVBench as a diagnostic for temporal understanding in video MLLMs, as well as the strengths in reproducibility. We address each major comment below and describe the revisions planned for the next version of the manuscript.
-
Referee: [§3] §3 (Task Definition and static-to-dynamic method): The central claim that the 20 tasks 'cannot be effectively solved with a single frame' is load-bearing for interpreting MVBench as a temporal-understanding benchmark, yet the manuscript provides no single-frame baselines, static-cue ablations, or human validation of shortcut resistance. Without these controls, performance differences could reflect exploitation of frame-level appearance or annotation patterns rather than dynamics, directly affecting the interpretation of VideoChat2's >15% gain.
Authors: We agree that empirical validation is necessary to substantiate the claim that the tasks require temporal reasoning rather than static cues. The static-to-dynamic conversion is constructed so that each task explicitly depends on temporal information (e.g., ordering of events or changes across frames) that is absent from any individual frame. Nevertheless, to strengthen the manuscript, we will add single-frame baselines for all 20 tasks, which will quantify the performance drop when temporal context is removed. We will also include a brief analysis of potential annotation patterns and how the automatic multiple-choice QA generation, grounded in public video annotations, reduces the risk of exploitable shortcuts. revision: yes
-
Referee: [§5] §5 (Experiments and results): The reported superiority of VideoChat2 is shown only on MVBench; adding comparisons against the same models on established video benchmarks (e.g., those already testing temporal reasoning) would strengthen the claim that the improvement reflects genuine advances in temporal capability rather than benchmark-specific tuning.
Authors: While the primary contribution is the introduction of MVBench to expose limitations in existing MLLMs on temporal tasks, we acknowledge that cross-benchmark evaluation would better contextualize VideoChat2's gains. In the revised manuscript we will report results for VideoChat2 and the compared models on additional established video benchmarks that emphasize temporal reasoning, thereby clarifying whether the observed improvements generalize beyond MVBench. revision: yes
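The single-frame baseline promised in the first response could be set up along the following lines. This is a hedged sketch: `uniform_frame_indices` and `performance_drop` are hypothetical helpers, and sampling the center of each equal-length segment is one common choice, not necessarily the sampling the authors use.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick the center frame of each of num_samples equal-length segments.

    With num_samples=1 this returns the middle frame, i.e. the
    single-frame baseline input.
    """
    if not 1 <= num_samples <= num_frames:
        raise ValueError("need 1 <= num_samples <= num_frames")
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

def performance_drop(multi_frame_acc, single_frame_acc):
    """Per-task accuracy lost when only one frame is shown to the model."""
    return {
        task: multi_frame_acc[task] - single_frame_acc[task]
        for task in multi_frame_acc
    }

multi = uniform_frame_indices(100, 4)   # [12, 37, 62, 87]
single = uniform_frame_indices(100, 1)  # [50]
```

Tasks where `performance_drop` stays near zero would be the ones most exposed to the referee's single-frame shortcut concern.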
Circularity Check
Benchmark construction relies on external annotations and explicit transformation method with no self-referential reduction
The paper defines MVBench tasks via a static-to-dynamic conversion of public video annotations into MCQA pairs and reports empirical model scores including a >15% gain for VideoChat2. No equation, parameter fit, or derivation reduces to its own inputs by construction; the temporal-requirement claim follows directly from the stated transformation procedure rather than a loop, and results are obtained by running models on the generated benchmark. The methodology is self-contained against external data sources and does not invoke load-bearing self-citations or uniqueness theorems that collapse the central claim.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Public video annotations can be reliably converted to multiple-choice QA without loss of temporal information or introduction of bias.
Forward citations
Cited by 17 Pith papers
-
FCMBench-Video: Benchmarking Document Video Intelligence
FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
-
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
-
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
-
QoS-QoE Translation with Large Language Model
A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while imp...
-
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.
-
TempCompass: Do Video LLMs Really Understand Videos?
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
-
VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
-
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
-
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.
-
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andy Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj Binkow...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv, abs/2308.12966, 2023. 11
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisser- man. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021. 6, 8
work page 2021
-
[4]
Ali Furkan Biten, Rub `en P ´erez Tito, Andr ´es Mafla, Llu ´ıs G´omez, Marc ¸al Rusi˜nol, Ernest Valveny, C. V . Jawahar, and Dimosthenis Karatzas. Scene text visual question answer- ing. In ICCV, 2019. 6
work page 2019
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020. 2
work page 2020
-
[6]
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021. 6
work page 2021
-
[7]
David L. Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011. 2
work page 2011
-
[8]
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elho- seiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. ArXiv, abs/2310.09478, 2023. 11
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Ke Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. ArXiv, abs/2306.15195, 2023. 11
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
Shazeer, Vinodkumar Prab- hakaran, Emily Reif, Nan Du, Benton C
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebas- tian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prab- hakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope...
-
[11]
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pas- cale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023. 2, 6, 7, 8
work page 2023
-
[12]
Fu, Stefano Ermon, Atri Rudra, and Christopher R´e
Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022. 9
work page 2022
-
[13]
Pradipto Das, Chenliang Xu, Richard F. Doell, and Jason J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR, 2013. 6
work page 2013
-
[14]
Imagenet: A large-scale hierarchical im- age database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical im- age database. In CVPR, 2009. 6
work page 2009
-
[15]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding. ArXiv, abs/1810.04805, 2018. 1, 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[16]
Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Florence. Palm-e: An e...
work page 2023
-
[17]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for mul- timodal large language models. ArXiv, abs/2306.13394,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Violet: End-to- end video-language transformers with masked visual-token modeling
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to- end video-language transformers with masked visual-token modeling. ArXiv, abs/2111.12681, 2021. 10
-
[19]
Mist : Multi-modal iterative spatial-temporal transformer for long-form video question answering
Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yezhou Yang, and Mike Zheng Shou. Mist : Multi-modal iterative spatial-temporal transformer for long-form video question answering. In CVPR, 2022. 10
work page 2022
-
[20]
Gao, Chen Sun, Zhenheng Yang, and Ramakant Nevatia
J. Gao, Chen Sun, Zhenheng Yang, and Ramakant Nevatia. Tall: Temporal activity localization via language query. In ICCV, 2017. 3, 12
work page 2017
-
[21]
Multimodal-gpt: A vision and lan- guage model for dialogue with humans
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qianmengke Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vi- sion and language model for dialogue with humans. ArXiv, abs/2305.04790, 2023. 2
-
[22]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fr ¨und, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017. 2, 6, 9
work page 2017
-
[23]
Making the v in vqa matter: El- evating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: El- evating the role of image understanding in visual question answering. In CVPR, 2017. 2, 6
work page 2017
-
[24]
Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jack- son Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh K. Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z. Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, S...
work page 2022
-
[25]
Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen
J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR,
-
[26]
Language Is Not All You Need: Aligning Perception with Language Models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language is not all you need: Aligning perception with language models. ArXiv, abs/2302.14045, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 6
work page 2019
-
[28]
Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gun- hee Kim
Y . Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gun- hee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017. 6
work page 2017
-
[29]
Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, L’elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth ´ee Lacroix, and William El Sayed. Mistral 7b. ArXiv, abs/2...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Gir- shick. Clevr: A diagnostic dataset for compositional lan- guage and elementary visual reasoning. In CVPR, 2017. 4, 6
work page 2017
-
[31]
The Kinetics Human Action Video Dataset
Will Kay, Jo ˜ao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Apostol Natsev, Mustafa Suley- man, and Andrew Zisserman. The kinetics human action video dataset. ArXiv, abs/1705.06950, 2017. 2
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[32]
Beyond the nav-graph: Vision-and- language navigation in continuous environments
Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Ba- tra, and Stefan Lee. Beyond the nav-graph: Vision-and- language navigation in continuous environments. InECCV,
-
[33]
A hierarchical approach for generating descriptive image paragraphs
Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. In CVPR, 2017. 6
work page 2017
-
[34]
Visual genome: Connecting language and vision using crowdsourced dense image annotations
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 6
work page 2017
-
[35]
Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. Tvqa: Localized, compositional video question answering. In EMNLP, 2018. 3, 7, 10, 12
work page 2018
-
[36]
Moreno, and Jes ´us Lov´on-Melgarejo
Paul Lerner, Olivier Ferret, Camille Guinaudeau, Herv ´e Le Borgne, Romaric Besanc ¸on, Jos´e G. Moreno, and Jes ´us Lov´on-Melgarejo. Viquae, a dataset for knowledge-based visual question answering about named entities. In SIGIR,
-
[37]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. ArXiv, abs/2307.16125, 2023. 2, 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. ArXiv, abs/2305.03726,
work page internal anchor Pith review Pith/arXiv arXiv
-
[39]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2022. 1, 6, 7
work page 2022
-
[40]
Inten- tqa: Context-aware video intent reasoning
Jiapeng Li, Ping Wei, Wenjuan Han, and Lifeng Fan. Inten- tqa: Context-aware video intent reasoning. 2023. 7, 10
work page 2023
- [41]
-
[42]
VideoChat: Chat-Centric Video Understanding
Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wen Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. ArXiv, abs/2305.06355, 2023. 1, 2, 5, 6, 7, 8, 9, 10, 11
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Unmasked teacher: Towards training-efficient video foundation models
Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In ICCV, 2023. 2, 6, 8, 9, 10, 12
work page 2023
-
[44]
M3it: A large-scale dataset towards multi-modal multilingual instruction tun- ing
Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, and Qi Liu. M3it: A large-scale dataset towards multi-modal multilingual instruction tun- ing. ArXiv, abs/2306.04387, 2023. 5
-
[45]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji rong Wen. Evaluating object hallucination in large vision-language models. ArXiv, abs/2305.10355,
work page internal anchor Pith review Pith/arXiv arXiv
-
[46]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 6, 8
work page 2014
-
[47]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 2, 6, 7, 10
work page 2023
-
[48]
Ntu rgb+d 120: A large-scale benchmark for 3d human activity understand- ing
Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C Kot. Ntu rgb+d 120: A large-scale benchmark for 3d human activity understand- ing. TPAMI, 2020. 3, 12
work page 2020
-
[49]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mm- bench: Is your multi-modal model an all-around player? ArXiv, abs/2307.06281, 2023. 1, 2, 3, 5, 8, 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
Val- ley: Video assistant with large language model enhanced ability
Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Ming- Hui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Val- ley: Video assistant with large language model enhanced ability. ArXiv, abs/2306.07207, 2023. 2
-
[51]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. ArXiv, abs/2306.05424, 2023. 2, 6, 7, 8, 10, 11
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Egoschema: A diagnostic benchmark for very long-form video language understanding
Karttikeya Mangalam, Raiymbek Akshulakov, and Jiten- dra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. ArXiv, abs/2308.09126, 2023. 7, 10
-
[53]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019. 2, 6
work page 2019
-
[54]
Minesh Mathew, Dimosthenis Karatzas, R. Manmatha, and C. V . Jawahar. Docvqa: A dataset for vqa on document images. In WACV, 2021. 6
work page 2021
-
[55]
Ocr-vqa: Visual question answering by reading text in images
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 6
work page 2019
-
[56]
Spoken moments: Learning joint audio-visual representations from video de- scriptions
Mathew Monfort and SouYoung Jin. Spoken moments: Learning joint audio-visual representations from video de- scriptions. In CVPR, 2021. 7
work page 2021
-
[57]
Brown, Quanfu Fan, Dan Gutfreund, Carl V ondrick, and Aude Oliva
Mathew Monfort, Bolei Zhou, Sarah Adel Bargal, Alex An- donian, Tom Yan, Kandan Ramakrishnan, Lisa M. Brown, Quanfu Fan, Dan Gutfreund, Carl V ondrick, and Aude Oliva. Moments in time dataset: One million videos for event understanding. TPAMI, 2020. 3, 12
work page 2020
-
[58]
OpenAI. Chatgpt. https://openai.com/blog/ chatgpt/, 2023. 1, 4, 5, 8, 10
work page 2023
-
[59]
OpenAI. Gpt-4v(ision) system card. https://api. semanticscholar . org / CorpusID : 263218031,
-
[60]
Im2text: Describing images using 1 million captioned pho- tographs
Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned pho- tographs. In NeurIPS, 2011. 6
work page 2011
-
[61]
Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adri`a Recasens Continente, Larisa Markeeva, Dylan, Ba- narse, Mateusz Malinowski, Yezhou Yang, Carl Doer- sch, Tatiana Matejovicova, Yury Sulsky, Antoine, Miech, Skanda Koppula, Alexander Fr´echette, Hanna Klimczak, R. Koster, Junlin Zhang, Stephanie, Winkler, Yusuf Aytar, Si- mon Osindero, Dima Damen, Andr...
work page 2023
-
[62]
Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazeb- nik. Flickr30k entities: Collecting region-to-phrase corre- spondences for richer image-to-sentence models. In ICCV,
-
[63]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020. 2
work page 2020
-
[64]
A-okvqa: A benchmark for visual question answering using world knowledge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In ECCV, 2022. 6
work page 2022
-
[65]
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL,
-
[66]
Textcaps: a dataset for image captioning with reading comprehension
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In ECCV, 2020. 6
work page 2020
-
[67]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR,
-
[68]
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Quan Sun, Yuxin Fang, Ledell Yu Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. ArXiv, abs/2303.15389, 2023. 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[69]
Visualmrc: Machine reading comprehension on document images
Ryota Tanaka, Kyosuke Nishida, and Sen Yoshida. Visualmrc: Machine reading comprehension on document images. In AAAI, 2021. 6
work page 2021
-
[70]
Internlm: A multilingual language model with progressively enhanced capabilities
InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023. 2
work page 2023
-
[71]
Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality
Vicuna Team. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality. https://vicuna.lmsys.org/, 2023. 1, 6, 8
work page 2023
-
[72]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023. 1, 7, 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[73]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goy...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[74]
All in one: Exploring unified video-language pre-training
Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. In CVPR, 2023. 10
work page 2023
-
[75]
Temporal segment networks: Towards good practices for deep action recognition
Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016. 9
work page 2016
-
[76]
Videomae v2: Scaling video masked autoencoders with dual masking
Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In CVPR, 2023. 9
work page 2023
-
[77]
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. ArXiv, abs/2212.03191, 2022. 10
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[78]
Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Jian Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Y. Qiao. Internvid: A large-scale video-text dataset for multimodal understanding and generation. ArXiv, 2023. 6
work page 2023
-
[79]
Paxion: Patching action knowledge in video-language foundation models
Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, and Heng Ji. Paxion: Patching action knowledge in video-language foundation models. In NeurIPS, 2023. 3, 9, 12
work page 2023
-
[80]
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In ICLR, 2021. 2
work page 2021