Benchmarking Visual State Tracking in Multimodal Video Understanding

Boyang Zheng; Ellis Brown; Hyunseok Lee; Jinwoo Shin; June Suk Choi; Nanye Ma; Oscar Michel; Pinzhi Huang; Saining Xie; Shusheng Yang

arxiv: 2606.03920 · v1 · pith:TSOOYSZPnew · submitted 2026-06-02 · 💻 cs.CV

Sihyun Yu, Nanye Ma, Pinzhi Huang, Hyunseok Lee, Shusheng Yang, June Suk Choi, Ellis Brown, Oscar Michel, Boyang Zheng, Jinwoo Shin, Saining Xie

Pith reviewed 2026-06-28 10:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual state tracking · multimodal large language models · video understanding · benchmark evaluation · visual perception · temporal integration · MLLM limitations

The pith

Multimodal LLMs reason about video events in text but fail to perceive the visual state changes needed for accurate tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Visual STAte Tracking benchmark (VSTAT) consisting of 834 video clips and 1500 questions that require integrating events across the full video stream rather than from any single frame or short segment. State-of-the-art MLLMs perform far below human levels on this benchmark and only modestly above answer-prior baselines. When their thinking traces are compared to the video content, the models track states correctly in text but fail to visually perceive the events required for that tracking. This reveals a core limitation in how current models handle continuous visual perception in videos despite strong results on other benchmarks. The finding matters because reliable video understanding depends on this state-tracking ability.

Core claim

The paper's central claim is that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on VSTAT. Analysis of their reasoning traces against the video stream shows they reason and track correctly in text but fail at visually perceiving the events they need to track. Preliminary tests indicate that recent agentic approaches including MLLM-based video agents and coding agents do not resolve these failures.

What carries the argument

VSTAT benchmark of 834 clips paired with 1500 questions that demand continuous perception and integration of events across the entire video stream.

If this is right

Existing video benchmarks may not fully diagnose limitations in continuous visual state tracking.
Agentic methods built on current MLLMs will likely inherit the same visual perception shortfalls on dynamic tasks.
Models must improve visual perception of state changes over time to close the gap with humans on video tasks.
New evaluation sets should prioritize questions that force integration across entire video streams.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world systems using MLLMs for video monitoring or action prediction may underperform when state changes occur gradually or across long clips.
Architectures that better couple visual encoders with temporal memory could be tested specifically against VSTAT-style questions.
The gap between text reasoning and visual perception suggests training regimes that penalize mismatches between generated text and actual frame content.

Load-bearing premise

The 1500 questions cannot be answered from any single frame or short segment and truly require continuous perception and integration of events across the full video.

What would settle it

Demonstration that many VSTAT questions can be answered correctly from one or two frames alone, or that a model reaches human-level accuracy on VSTAT without using temporal visual information from the full clip.
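
One way to run the check described above is to score every question under a restricted-frame condition and report how many remain answerable. The sketch below is illustrative only: the `QuestionResult` record, its field names, and the `leakage_rate` helper are hypothetical and do not come from the paper, and the correctness flags would have to be produced by an actual single-frame or key-frame evaluation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class QuestionResult:
    """Hypothetical per-question record; field names are assumptions."""
    question_id: str
    correct_full_video: bool    # answered correctly when given the whole clip
    correct_single_frame: bool  # answered correctly from one frame only


def leakage_rate(results: Sequence[QuestionResult]) -> float:
    """Return the fraction of questions answered correctly from a single frame."""
    if not results:
        raise ValueError("results must not be empty")
    leaked = sum(1 for r in results if r.correct_single_frame)
    return leaked / len(results)
```

A rate close to the answer-prior baseline would support the load-bearing premise, while a rate well above it would suggest that some questions leak through salient frames.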

Original abstract

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VSTAT introduces a benchmark targeting full-video state tracking in MLLMs and reports a clear gap versus humans plus text-reasoning dissociation, but the central claim depends on unshown validation that questions truly require continuous perception.

The paper's core contribution is VSTAT: 834 clips and 1500 questions built so that answers need integration across the whole video, not single frames or short clips. SOTA MLLMs sit far below human performance and only modestly above answer-prior baselines, while their thinking traces show they can track states correctly once the events are described in text but fail to perceive those events from the video itself. Agentic setups do not close the gap in the preliminary tests.

This is new in its explicit focus on state tracking as a distinct failure mode and in the trace-to-video comparison. The empirical pattern is straightforward and worth having on record.

The main soft spot is the assertion that none of the questions can be answered from any single frame or short segment. The abstract states this property, but the provided summary gives no concrete protocol (single-frame ablations, key-frame oracles, or restricted-access annotator checks) to confirm that it holds for the full set. If a non-trivial portion of the questions leaks information through salient moments, the performance gap and the perception-versus-reasoning diagnosis become harder to separate from ordinary long-video weaknesses. Details on question validation and baseline construction are also thin in what is visible.

The work is aimed at groups building or evaluating video MLLMs. A reader interested in temporal perception limits will get a concrete new target to test against. It deserves peer review because the benchmark idea is well-motivated and the reported dissociation is falsifiable with the right controls; a referee can check the missing validation steps and statistical reporting without the paper being incoherent on its own terms.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Visual STAte Tracking (VSTAT) benchmark with 834 clips (synthetic and real-world) and 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration across the full video. It reports that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. Analysis of thinking traces shows that models reason and track correctly in text but fail at visually perceiving the needed events. Preliminary tests indicate that recent agentic approaches, including MLLM-based video agents and coding agents, do not resolve the failures.

Significance. If the benchmark questions are rigorously shown to require full-video integration, VSTAT would offer a useful diagnostic for a specific visual perception limitation in MLLMs that standard video benchmarks miss. The explicit answer-prior baseline comparison and thinking-trace analysis are strengths that support a more precise diagnosis than accuracy numbers alone.

major comments (2)

[Abstract] The central claim that MLLMs fail specifically at visual state tracking (rather than general perception) rests on the assertion that the 1,500 questions 'cannot be answered from any single frame or short segment'; the manuscript provides no concrete validation protocol (e.g., single-frame ablation, key-frame oracle, or restricted-temporal-access annotator checks) for the 834 clips, leaving the performance-gap interpretation vulnerable.
The construction of the answer-prior baselines and any statistical significance tests for the reported gaps (MLLMs vs. humans vs. baselines) are not described, which directly affects the robustness of the 'modestly above' and 'far below' claims.
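
The significance testing requested in the second comment could take the form of a paired bootstrap over questions. The following is a minimal sketch under that assumption: the function name, its parameters, and the per-question 0/1 correctness lists are hypothetical, and the source does not say which test, if any, the authors used.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    model_correct: Sequence[int],
    baseline_correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the accuracy gap (model minus baseline).

    Both inputs hold one 0/1 correctness flag per question, in the same order,
    so each resample keeps the model and baseline results paired.
    """
    if len(model_correct) != len(baseline_correct) or not model_correct:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    n = len(model_correct)
    # Per-question gap: +1 if only the model is right, -1 if only the baseline is.
    gap = [int(m) - int(b) for m, b in zip(model_correct, baseline_correct)]
    resampled_gaps = []
    for _ in range(n_resamples):
        # Resample questions with replacement and average the paired gaps.
        total = sum(gap[rng.randrange(n)] for _ in range(n))
        resampled_gaps.append(total / n)
    resampled_gaps.sort()
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return resampled_gaps[lower_index], resampled_gaps[upper_index]
```

An interval that excludes zero would back a claim such as 'modestly above' the baseline; when several models are compared, a multiple-comparison correction would still be needed on top of this.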

minor comments (1)

A table or section summarizing clip sources, question categories, and inter-annotator agreement would improve clarity of the benchmark construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The two major comments highlight important areas where additional documentation will strengthen the manuscript. We address each point below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract] The central claim that MLLMs fail specifically at visual state tracking (rather than general perception) rests on the assertion that the 1,500 questions 'cannot be answered from any single frame or short segment'; the manuscript provides no concrete validation protocol (e.g., single-frame ablation, key-frame oracle, or restricted-temporal-access annotator checks) for the 834 clips, leaving the performance-gap interpretation vulnerable.

Authors: We agree that an explicit validation protocol is necessary to support the claim that questions require full-video integration. The current manuscript states the requirement but does not detail the verification steps performed during dataset construction. In the revised version we will add a dedicated subsection under 'Benchmark Construction' that describes: (1) single-frame and short-segment (≤5s) ablations run by multiple annotators on a random subset of questions, (2) a key-frame oracle condition, and (3) restricted-temporal-access checks in which annotators could view only non-contiguous frames. These results will be reported with inter-annotator agreement statistics. This addition will make the validation transparent and directly address the vulnerability noted.

Revision: yes

Referee: [—] The construction of the answer-prior baselines and any statistical significance tests for the reported gaps (MLLMs vs. humans vs. baselines) are not described, which directly affects the robustness of the 'modestly above' and 'far below' claims.

Authors: We acknowledge that the manuscript does not describe the precise construction of the answer-prior baselines nor any statistical tests. In the revision we will expand the 'Baselines' and 'Evaluation Metrics' sections to specify: (a) the exact procedures used to generate the answer-prior baselines (most-frequent answer per question type, random sampling from the answer distribution, and length-matched priors), (b) the number of samples drawn for each baseline, and (c) the statistical tests employed (paired t-tests or bootstrap confidence intervals with multiple-comparison correction) together with the resulting p-values for all reported gaps. These details will be added without altering the numerical results already presented.

Revision: yes
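
Two of the answer-prior baselines named in this response, the most-frequent answer per question type and random sampling from the answer distribution, can be sketched as follows. This is an illustration under stated assumptions: the function names and the `(question_type, gold_answer)` input format are hypothetical, the priors are estimated on the same questions they are scored on, and the length-matched prior is left out because the text does not define it.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple

Question = Tuple[Hashable, Hashable]  # (question_type, gold_answer), assumed format


def _answer_counts_by_type(questions: Sequence[Question]) -> Dict[Hashable, Counter]:
    """Count how often each gold answer occurs within each question type."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for question_type, gold_answer in questions:
        counts[question_type][gold_answer] += 1
    return counts


def majority_prior_accuracy(questions: Sequence[Question]) -> float:
    """Accuracy of always guessing the most frequent answer of each question type."""
    if not questions:
        raise ValueError("questions must not be empty")
    counts = _answer_counts_by_type(questions)
    # Within a type, the majority guess is right once per occurrence of the top answer.
    correct = sum(max(counter.values()) for counter in counts.values())
    return correct / len(questions)


def sampled_prior_expected_accuracy(questions: Sequence[Question]) -> float:
    """Expected accuracy of sampling a guess from each type's answer distribution."""
    if not questions:
        raise ValueError("questions must not be empty")
    counts = _answer_counts_by_type(questions)
    expected_correct = 0.0
    for counter in counts.values():
        n_type = sum(counter.values())
        # An answer with count k is gold for k questions and is guessed with
        # probability k / n_type, so it contributes k * k / n_type expected hits.
        expected_correct += sum(k * k for k in counter.values()) / n_type
    return expected_correct / len(questions)
```

Because both functions use the gold answers of the evaluated questions themselves, they give an optimistic estimate of what a video-blind guesser could reach; a held-out split would give a stricter baseline.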

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with external validation

The paper introduces VSTAT as an empirical benchmark consisting of 834 clips and 1500 questions, with performance measured against human annotators and answer-prior baselines. No derivations, equations, fitted parameters, or predictions are claimed. The assertion that questions require full-video integration is presented as a design property of the benchmark rather than a derived result from any self-referential chain. All comparisons are to independent external references (human performance, prior MLLM benchmarks), satisfying the criteria for non-circularity. No self-citation load-bearing steps or ansatz smuggling occur.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the domain assumption that the chosen questions demand full-video tracking; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: questions cannot be answered from any single frame or short segment.
Explicitly stated in the abstract as the defining property of the benchmark questions.

Reference graph

Works this paper leans on

126 extracted references · 27 linked inside Pith

[1]

Diffusion for World Modeling: Visual Details Matter in Atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for World Modeling: Visual Details Matter in Atari. InAdvances in Neural Information Processing Systems, 2024

2024
[2]

Introducing Claude Opus 4.7

Anthropic. Introducing Claude Opus 4.7. https://www.anthropic.com/news/claude-opu s-4-7, April 2026. Accessed: 2026-05-02

2026
[3]

Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023
[4]

Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023

Pith/arXiv arXiv 2023
[5]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[6]

Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

Pith/arXiv arXiv 2025
[7]

Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...

2025
[8]

Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025
[9]

𝜋0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 𝜋0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[10]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems, 2020

2020
[11]

Activitynet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InIEEE Conference on Computer Vision and Pattern Recognition, 2015

2015
[12]

Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024

arXiv 2024
[13]

Hourvideo: 1-hour video-language understanding

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Fei-Fei Li. Hourvideo: 1-hour video-language understanding. InAdvances in Neural Information Processing Systems, 2024

2024
[14]

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding.arXiv preprint arXiv:2601.10611, 2026

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Chris Dongjoo Kim, Yue Yang, Ali Farhadi, and Ranjay Krishna. Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding.arXiv preprint arXiv:2601.10611, 2026

Pith/arXiv arXiv 2026
[15]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

Pith/arXiv arXiv 2025
[16]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InIEEE Conference on Computer Vision and Pattern Recognition, 2025

2025
[17]

VITA-1.5: Towards GPT-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025

Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, and Ran He. VITA-1.5: Towards GPT-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025

Pith/arXiv arXiv 2025
[18]

Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026

Chaoyou Fu, Hao Yuan, Yuhao Dong, Yifan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yong Xie, Xiawu Zheng, Xuejiao Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, and Ran He. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026

Pith/arXiv arXiv 2026
[19]

Tall: Temporal activity localization via language query

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. InIEEE International Conference on Computer Vision, 2017. 13

2017
[20]

Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

Pith/arXiv arXiv 2026
[21]

GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

GLM-V Team. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

Pith/arXiv arXiv 2025
[22]

Gemini 3 flash

Google DeepMind. Gemini 3 flash. https://deepmind.google/models/gemini/flash/ , 2025

2025
[23]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/model-c ards/gemini-3-1-pro/, 2026

2026
[24]

Navigating the digital world as humans do: Universal visual grounding for GUI agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In International Conference on Learning Representations, 2025

2025
[25]

MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft.arXiv preprint arXiv:2504.08388, 2025

Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft.arXiv preprint arXiv:2504.08388, 2025

arXiv 2025
[26]

Recurrent World Models Facilitate Policy Evolution

David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. InAdvances in Neural Information Processing Systems, 2018

2018
[27]

Mastering Diverse Domains through World Models.Nature, 2025

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering Diverse Domains through World Models.Nature, 2025

2025
[28]

RELIC: Interactive Video World Model with Long-Horizon Memory.arXiv preprint arXiv:2512.04040, 2025

Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. RELIC: Interactive Video World Model with Long-Horizon Memory.arXiv preprint arXiv:2512.04040, 2025

arXiv 2025
[29]

Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025

Pith/arXiv arXiv 2025
[30]

World and Human Action Models Towards Gameplay Ideation.Nature, 2025

Anssi Kanervisto, Dave Bignell, Linda Yilin Wen, Martin Grayson, Raluca Georgescu, Sergio Valcar- cel Macua, Shan Zheng Tan, Tabish Rashid, Tim Pearce, Yuhan Cao, et al. World and Human Action Models Towards Gameplay Ideation.Nature, 2025

2025
[31]

3D and 4D World Modeling: A Survey.arXiv preprint arXiv:2509.07996, 2025

Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3D and 4D World Modeling: A Survey.arXiv preprint arXiv:2509.07996, 2025

arXiv 2025
[32]

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence.arXiv preprint arXiv:2605.25979, 2026

Glint Lab, AIM for Health Lab, and MVP Lab. LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence.arXiv preprint arXiv:2605.25979, 2026

Pith/arXiv arXiv 2026
[33]

A Path Towards Autonomous Machine Intelligence.Open Review, 2022

Yann LeCun. A Path Towards Autonomous Machine Intelligence.Open Review, 2022

2022
[34]

Llava-onevision: Easy visual task transfer.T ransactions on Machine Learning Research, 2025

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.T ransactions on Machine Learning Research, 2025

2025
[35]

MVbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVbench: A comprehensive multi-modal video understanding benchmark. InIEEE Conference on Computer Vision and Pattern Recognition, 2024

2024
[36]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InECCV, 2024

2024
[37]

Can vision-language models solve the shell game?arXiv preprint arXiv:2603.08436, 2026

Tiedong Liu and Wee Sun Lee. Can vision-language models solve the shell game?arXiv preprint arXiv:2603.08436, 2026. 14

arXiv 2026
[38]

Videoreasonbench: Can MLLMs perform vision-centric complex video reasoning? InInternational Conference on Learning Representations, 2026

Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y.Charles, Xinyu Zhou, and Xu Sun. Videoreasonbench: Can MLLMs perform vision-centric complex video reasoning? InInternational Conference on Learning Representations, 2026

2026
[39]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InIEEE Conference on Computer Vision and Pattern Recognition, 2024

2024
[40]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. InAdvances in Neural Information Processing Systems, 2023

2023
[41]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, 2021

2021
[42]

Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models

Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models. InInternational Conference on Learning Representations, 2025

2025
[43]

Oriane Sim’eoni, Huy V . Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

Pith/arXiv arXiv 2025
[44]

Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026

Pith/arXiv arXiv 2026
[45]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InIEEE Conference on Computer Vision and Pattern Recognition, 2024

2024
[46]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling.arXiv preprint arXiv:2512.14614, 2025

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling.arXiv preprint arXiv:2512.14614, 2025

Pith/arXiv arXiv 2025
[47]

Advancing Open-Source World Models.arXiv preprint arXiv:2601.20540, 2026

Team Robbyant, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing Open-Source World Models.arXiv preprint arXiv:2601.20540, 2026

Pith/arXiv arXiv 2026
[48]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. InAdvances in Neural Information Processing Systems, volume 37, 2024

2024
[49]

Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

Pith/arXiv arXiv 2023
[50]

Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdul- mohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Pith/arXiv arXiv 2025
[51]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 15

Pith/arXiv arXiv 2024
[52]

LVBench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, et al. LVBench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024

Pith/arXiv arXiv 2024
[53]

InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

Pith/arXiv arXiv 2025
[54]

Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos.arXiv preprint arXiv:2312.05269, 2023

Ying Wang, Yanlai Yang, and Mengye Ren. Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos.arXiv preprint arXiv:2312.05269, 2023

arXiv 2023
[55]

Continuous perception matters: Diagnosing temporal integration failures in multimodal models.arXiv preprint arXiv:2408.07867, 2024

Zeyu Wang, Zhenzhen Weng, and Serena Yeung-Levy. Continuous perception matters: Diagnosing temporal integration failures in multimodal models.arXiv preprint arXiv:2408.07867, 2024

arXiv 2024
[56]

Ryoo, and Juan Carlos Niebles

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, and Juan Carlos Niebles. Active video perception: Iterative evidence seeking for agentic long video understanding.arXiv preprint arXiv:2512.05774, 2025

Pith/arXiv arXiv 2025
[57]

Longvideobench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. In Advances in Neural Information Processing Systems, 2024

2024
[58]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In IEEE Conference on Computer Vision and Pattern Recognition, 2021

2021
[59]

Mimo-vl technical report. arXiv preprint arXiv:2506.03569, 2025

Xiaomi LLM-Core Team, Zihao Yue, Zhenrui Lin, Yi-Hao Song, Weikun Wang, Shu-Qin Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zi-Ang Jiang, Zhixian Zheng, Zhichao Song, Zhen Luo, Yue Yu, Yudong Wang, Yu Tian, Yu Tu, Yihan Yan, Yi ...

arXiv 2025
[60]

Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces

Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces. In IEEE Conference on Computer Vision and Pattern Recognition, 2024

2024
[61]

Cambrian-s: Towards spatial supersensing in video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis L Brown II, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-s: Towards spatial supersensing in video. In International Conference on Learning Representations, 2026

2026
[62]

Lmms-eval: Reality check on the evaluation of large multimodal models. arXiv preprint arXiv:2407.12772, 2024

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models. arXiv preprint arXiv:2407.12772, 2024

Pith/arXiv arXiv 2024
[63]

Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025

2025
[64]

Mmvu: Measuring expert-level multi-discipline video understanding

Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert-level multi-discipline video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8475–8489, 2025

2025
[65]

Mlvu: A comprehensive benchmark for multi-task long video understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024

Pith/arXiv arXiv 2024
[66]

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

A. Benchmark Breakdown

A.1. Detailed Information

Formal definition. In Table 6 and 7, we provide a...

Pith/arXiv arXiv 2025
[67]

When rolling up, the old Bottom face becomes the new Front face

Move 1 (roll up): The visible faces become Top = Green, Front = Pink, Right = Blue. When rolling up, the old Bottom face becomes the new Front face. Since the new Front is Pink, the initial Bottom face was Pink. Because the initial Top face was Red, we can conclude that Red and Pink are opposite faces
[68]

The visible faces are Top = Blue, Front = Green, Right = Pink

Move 3 (roll left): From Move 2 (Top = Red, Front = Green, Right = Blue), rolling left makes the old Right (Blue) the new Top, and the old Bottom (Pink) the new Right. The visible faces are Top = Blue, Front = Green, Right = Pink
[69]

The visible faces are Top = Green, Front = Yellow, Right = Pink

Move 4 (roll up): Rolling up from Move 3 makes the old Bottom the new Front. The visible faces are Top = Green, Front = Yellow, Right = Pink. Since the new Front is Yellow, the old Bottom was Yellow. Because the Top in Move 3 was Blue, Blue and Yellow are opposite faces
[70]

Since Red and Pink are opposite faces, the Pink face is on the bottom (touching the floor) exactly when the Red face is on the top

By elimination, the remaining two faces, Green and White, are opposite faces. Since Red and Pink are opposite faces, the Pink face is on the bottom (touching the floor) exactly when the Red face is on the top. We can track the Top face at the initial position and after each of the 19 moves by looking at the first color in each of the provided visible face ...
[71]

Moves 6, 7, and 8 (counted as 1 distinct time since they are consecutive)
[72]

Move 17

The Pink face touched the floor 4 distinct times, including the initial position. Answer: 4 ✓ GT: 4

Figure 9 | Text transcription and thinking trace summary for the rolling die task.

Q. At the end of the video, which position is the Cup B (that contains the ball) in? [Text description.] Three cups are placed in order (Left, Middle, Right) and the ba...
[73]

Answer: Center ✓ GT: Center. Figure 10 | Text transcription and thinking trace summary for the shell game task

10. (Left, Right) - The ball is not involved in this swap, so it stays at Middle. Answer: Center ✓ GT: Center

Figure 10 | Text transcription and thinking trace summary for the shell game task.

Q. Which corner will the ball end up at? [Text description.] Label the corners of a box as follows:

1 - 2
|   |
4 - 3

Place a ball at corner 2. Then tilt the box in the follow...
[74]

Red scores

At 00:08, the puck goes into the bottom goal. Red scores. (Red 1 - Blue 0)
[75]

Blue scores

At 00:18, the puck goes into the top goal. Blue scores. (Red 1 - Blue 1)
[76]

Blue scores

At 00:27, the puck goes into the top goal. Blue scores. (Red 1 - Blue 2)
[77]

Red scores

At 00:39, the puck goes into the bottom goal. Red scores. (Red 2 - Blue 2)
[78]

Red scores

At 00:49, the puck goes into the bottom goal. Red scores. (Red 3 - Blue 2)
[79]

Red scores

At 00:58, the puck goes into the bottom goal. Red scores. (Red 4 - Blue 2) The final score is Red 4, Blue 2. However, this option is not available. Let’s re-examine the video carefully. It appears there might be a misunderstanding of the video’s events or the provided options are incorrect. Given the strict instruction to only output a single letter from ...
[80]

The user distinctly taps the numbers 1, 2, 3 (frames 00:01 to 00:03)

Showing first 80 references.
