arxiv: 2606.03920 · v1 · pith:TSOOYSZPnew · submitted 2026-06-02 · 💻 cs.CV

Benchmarking Visual State Tracking in Multimodal Video Understanding

Pith reviewed 2026-06-28 10:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual state tracking · multimodal large language models · video understanding · benchmark evaluation · visual perception · temporal integration · MLLM limitations

The pith

Multimodal LLMs reason about video events in text but fail to perceive the visual state changes needed for accurate tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Visual STAte Tracking benchmark (VSTAT) consisting of 834 video clips and 1500 questions that require integrating events across the full video stream rather than from any single frame or short segment. State-of-the-art MLLMs perform far below human levels on this benchmark and only modestly above answer-prior baselines. When their thinking traces are compared to the video content, the models track states correctly in text but fail to visually perceive the events required for that tracking. This reveals a core limitation in how current models handle continuous visual perception in videos despite strong results on other benchmarks. The finding matters because reliable video understanding depends on this state-tracking ability.

Core claim

The paper's central claim is that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on VSTAT. Analysis of their reasoning traces against the video stream shows they reason and track correctly in text but fail at visually perceiving the events they need to track. Preliminary tests indicate that recent agentic approaches including MLLM-based video agents and coding agents do not resolve these failures.

What carries the argument

VSTAT benchmark of 834 clips paired with 1500 questions that demand continuous perception and integration of events across the entire video stream.

If this is right

  • Existing video benchmarks may not fully diagnose limitations in continuous visual state tracking.
  • Agentic methods built on current MLLMs will likely inherit the same visual perception shortfalls on dynamic tasks.
  • Models must improve visual perception of state changes over time to close the gap with humans on video tasks.
  • New evaluation sets should prioritize questions that force integration across entire video streams.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world systems using MLLMs for video monitoring or action prediction may underperform when state changes occur gradually or across long clips.
  • Architectures that better couple visual encoders with temporal memory could be tested specifically against VSTAT-style questions.
  • The gap between text reasoning and visual perception suggests exploring training regimes that penalize mismatches between generated text and actual frame content.

Load-bearing premise

The 1500 questions cannot be answered from any single frame or short segment and truly require continuous perception and integration of events across the full video.

What would settle it

Demonstration that many VSTAT questions can be answered correctly from one or two frames alone, or that a model reaches human-level accuracy on VSTAT without using temporal visual information from the full clip.
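
The first test above can be sketched as a small audit script. This is a minimal sketch, not the paper's protocol: the record format `(question_id, correct)`, the function name `shortcut_questions`, and the `min_correct_rate` threshold are all hypothetical, and the records are assumed to come from answerers (human or model) who saw only one or two frames. In practice the threshold should sit well above the chance rate for the question format.

```python
from collections import defaultdict


def shortcut_questions(records, min_correct_rate=0.5):
    """Flag questions that restricted-view answerers often get right.

    records: iterable of (question_id, correct) pairs, one per attempt,
             collected under a restricted condition (e.g. a single frame).
    Returns a sorted list of question ids whose correct rate is at least
    min_correct_rate. The threshold is an assumption of this sketch.
    """
    tally = defaultdict(lambda: [0, 0])  # question_id -> [n_correct, n_attempts]
    for question_id, correct in records:
        tally[question_id][0] += int(correct)
        tally[question_id][1] += 1
    return sorted(
        qid for qid, (n_correct, n_attempts) in tally.items()
        if n_correct / n_attempts >= min_correct_rate
    )


# Toy records (invented for illustration, not VSTAT data).
toy_records = [
    ("q1", True), ("q1", True),
    ("q2", False), ("q2", True),
    ("q3", False),
]
flagged = shortcut_questions(toy_records, min_correct_rate=0.75)
print(f"{len(flagged)} of 3 toy questions look answerable from restricted views: {flagged}")
```

A large flagged fraction on the real question set would undercut the load-bearing premise; a fraction near the chance-level expectation would support it.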

Original abstract

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Visual STAte Tracking (VSTAT) benchmark with 834 clips (synthetic and real-world) and 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration across the full video. It reports that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines; analysis of thinking traces shows models reason and track correctly in text but fail at visually perceiving the needed events; preliminary tests indicate agentic MLLM-based video and coding agents do not resolve the failures.

Significance. If the benchmark questions are rigorously shown to require full-video integration, VSTAT would offer a useful diagnostic for a specific visual perception limitation in MLLMs that standard video benchmarks miss. The explicit answer-prior baseline comparison and thinking-trace analysis are strengths that support a more precise diagnosis than accuracy numbers alone.

major comments (2)
  1. [Abstract] The central claim that MLLMs fail specifically at visual state tracking (rather than general perception) rests on the assertion that the 1,500 questions 'cannot be answered from any single frame or short segment'; the manuscript provides no concrete validation protocol (e.g., single-frame ablation, key-frame oracle, or restricted-temporal-access annotator checks) for the 834 clips, leaving the performance-gap interpretation vulnerable.
  2. The construction of the answer-prior baselines and any statistical significance tests for the reported gaps (MLLMs vs. humans vs. baselines) are not described, which directly affects the robustness of the 'modestly above' and 'far below' claims.
minor comments (1)
  1. A table or section summarizing clip sources, question categories, and inter-annotator agreement would improve clarity of the benchmark construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The two major comments highlight important areas where additional documentation will strengthen the manuscript. We address each point below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] The central claim that MLLMs fail specifically at visual state tracking (rather than general perception) rests on the assertion that the 1,500 questions 'cannot be answered from any single frame or short segment'; the manuscript provides no concrete validation protocol (e.g., single-frame ablation, key-frame oracle, or restricted-temporal-access annotator checks) for the 834 clips, leaving the performance-gap interpretation vulnerable.

    Authors: We agree that an explicit validation protocol is necessary to support the claim that questions require full-video integration. The current manuscript states the requirement but does not detail the verification steps performed during dataset construction. In the revised version we will add a dedicated subsection under 'Benchmark Construction' that describes: (1) single-frame and short-segment (≤5s) ablations run by multiple annotators on a random subset of questions, (2) a key-frame oracle condition, and (3) restricted-temporal-access checks in which annotators could view only non-contiguous frames. These results will be reported with inter-annotator agreement statistics. This addition will make the validation transparent and directly address the vulnerability noted.

    Revision: yes

  2. Referee: The construction of the answer-prior baselines and any statistical significance tests for the reported gaps (MLLMs vs. humans vs. baselines) are not described, which directly affects the robustness of the 'modestly above' and 'far below' claims.

    Authors: We acknowledge that the manuscript does not describe the precise construction of the answer-prior baselines nor any statistical tests. In the revision we will expand the 'Baselines' and 'Evaluation Metrics' sections to specify: (a) the exact procedures used to generate the answer-prior baselines (most-frequent answer per question type, random sampling from the answer distribution, and length-matched priors), (b) the number of samples drawn for each baseline, and (c) the statistical tests employed (paired t-tests or bootstrap confidence intervals with multiple-comparison correction) together with the resulting p-values for all reported gaps. These details will be added without altering the numerical results already presented.

    Revision: yes
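
To make the second response concrete, here is a minimal sketch of two of the ingredients it names: a most-frequent-answer-per-question-type prior and a paired bootstrap confidence interval on an accuracy gap. It is illustrative only and is not the authors' implementation. The function names, the `(question_type, answer)` data layout, the fallback to the overall most common answer, and the percentile-interval choice are all assumptions of this sketch, and the toy data is invented.

```python
import random
from collections import Counter, defaultdict


def most_frequent_prior(reference, questions):
    """Predict the most common reference answer for each question's type.

    reference, questions: lists of (question_type, answer) pairs.
    Unseen types fall back to the overall most common answer (an assumption).
    """
    by_type = defaultdict(Counter)
    for qtype, answer in reference:
        by_type[qtype][answer] += 1
    overall = Counter(answer for _, answer in reference)
    predictions = []
    for qtype, _ in questions:
        counts = by_type.get(qtype) or overall
        predictions.append(counts.most_common(1)[0][0])
    return predictions


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(a) - accuracy(b), resampling questions jointly.

    correct_a, correct_b: equal-length lists of 0/1 per-question outcomes.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[max(0, int((1 - alpha / 2) * n_resamples) - 1)]
    return lower, upper


# Toy data (invented): question types and gold answers.
toy_reference = [("count", "2"), ("count", "2"), ("count", "3"), ("order", "A")]
toy_questions = [("count", "3"), ("order", "B"), ("other", "x")]
prior_preds = most_frequent_prior(toy_reference, toy_questions)
prior_correct = [int(p == gold) for p, (_, gold) in zip(prior_preds, toy_questions)]
model_correct = [1, 0, 1]  # hypothetical per-question model outcomes
print(prior_preds, paired_bootstrap_ci(model_correct, prior_correct))
```

Resampling questions jointly for both systems keeps the pairing intact, which is why the interval is computed on per-question differences rather than on two independent accuracy estimates. A multiple-comparison correction, as the response mentions, would still need to be applied across the set of reported gaps.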

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with external validation

Full rationale

The paper introduces VSTAT as an empirical benchmark consisting of 834 clips and 1500 questions, with performance measured against human annotators and answer-prior baselines. No derivations, equations, fitted parameters, or predictions are claimed. The assertion that questions require full-video integration is presented as a design property of the benchmark rather than a derived result from any self-referential chain. All comparisons are to independent external references (human performance, prior MLLM benchmarks), satisfying the criteria for non-circularity. No self-citation load-bearing steps or ansatz smuggling occur.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the domain assumption that the chosen questions demand full-video tracking; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Questions cannot be answered from any single frame or short segment
    Explicitly stated in the abstract as the defining property of the benchmark questions.

pith-pipeline@v0.9.1-grok · 5788 in / 1101 out tokens · 29107 ms · 2026-06-28T10:41:03.901830+00:00 · methodology

Reference graph

Works this paper leans on

126 extracted references · 27 linked inside Pith

  1. [1]

    Diffusion for World Modeling: Visual Details Matter in Atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for World Modeling: Visual Details Matter in Atari. InAdvances in Neural Information Processing Systems, 2024

  2. [2]

    Introducing Claude Opus 4.7

    Anthropic. Introducing Claude Opus 4.7. https://www.anthropic.com/news/claude-opu s-4-7, April 2026. Accessed: 2026-05-02

  3. [3]

    Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  4. [4]

    Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023

  5. [5]

    Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  6. [6]

    Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  7. [7]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...

  8. [8]

    Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  9. [9]

    𝜋0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 𝜋0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  10. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems, 2020

  11. [11]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InIEEE Conference on Computer Vision and Pattern Recognition, 2015

  12. [12]

    Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024

    Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024

  13. [13]

    Hourvideo: 1-hour video-language understanding

    Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Fei-Fei Li. Hourvideo: 1-hour video-language understanding. InAdvances in Neural Information Processing Systems, 2024

  14. [14]

    Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding.arXiv preprint arXiv:2601.10611, 2026

    Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Chris Dongjoo Kim, Yue Yang, Ali Farhadi, and Ranjay Krishna. Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding.arXiv preprint arXiv:2601.10611, 2026

  15. [15]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  16. [16]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InIEEE Conference on Computer Vision and Pattern Recognition, 2025

  17. [17]

    VITA-1.5: Towards GPT-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, and Ran He. VITA-1.5: Towards GPT-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025

  18. [18]

    Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026

    Chaoyou Fu, Hao Yuan, Yuhao Dong, Yifan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yong Xie, Xiawu Zheng, Xuejiao Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, and Ran He. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026

  19. [19]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. InIEEE International Conference on Computer Vision, 2017. 13

  20. [20]

    Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

  21. [21]

    GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

    GLM-V Team. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

  22. [22]

    Gemini 3 flash

    Google DeepMind. Gemini 3 flash. https://deepmind.google/models/gemini/flash/ , 2025

  23. [23]

    Gemini 3.1 pro model card

    Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/model-c ards/gemini-3-1-pro/, 2026

  24. [24]

    Navigating the digital world as humans do: Universal visual grounding for GUI agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In International Conference on Learning Representations, 2025

  25. [25]

    MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft.arXiv preprint arXiv:2504.08388, 2025

    Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. MineWorld: A Real-Time and Open-Source Interactive World Model on Minecraft.arXiv preprint arXiv:2504.08388, 2025

  26. [26]

    Recurrent World Models Facilitate Policy Evolution

    David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. InAdvances in Neural Information Processing Systems, 2018

  27. [27]

    Mastering Diverse Domains through World Models.Nature, 2025

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering Diverse Domains through World Models.Nature, 2025

  28. [28]

    RELIC: Interactive Video World Model with Long-Horizon Memory.arXiv preprint arXiv:2512.04040, 2025

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. RELIC: Interactive Video World Model with Long-Horizon Memory.arXiv preprint arXiv:2512.04040, 2025

  29. [29]

    Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025

  30. [30]

    World and Human Action Models Towards Gameplay Ideation.Nature, 2025

    Anssi Kanervisto, Dave Bignell, Linda Yilin Wen, Martin Grayson, Raluca Georgescu, Sergio Valcar- cel Macua, Shan Zheng Tan, Tabish Rashid, Tim Pearce, Yuhan Cao, et al. World and Human Action Models Towards Gameplay Ideation.Nature, 2025

  31. [31]

    3D and 4D World Modeling: A Survey.arXiv preprint arXiv:2509.07996, 2025

    Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3D and 4D World Modeling: A Survey.arXiv preprint arXiv:2509.07996, 2025

  32. [32]

    LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence.arXiv preprint arXiv:2605.25979, 2026

    Glint Lab, AIM for Health Lab, and MVP Lab. LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence.arXiv preprint arXiv:2605.25979, 2026

  33. [33]

    A Path Towards Autonomous Machine Intelligence.Open Review, 2022

    Yann LeCun. A Path Towards Autonomous Machine Intelligence.Open Review, 2022

  34. [34]

    Llava-onevision: Easy visual task transfer.T ransactions on Machine Learning Research, 2025

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.T ransactions on Machine Learning Research, 2025

  35. [35]

    MVbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVbench: A comprehensive multi-modal video understanding benchmark. InIEEE Conference on Computer Vision and Pattern Recognition, 2024

  36. [36]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InECCV, 2024

  37. [37]

    Can vision-language models solve the shell game?arXiv preprint arXiv:2603.08436, 2026

    Tiedong Liu and Wee Sun Lee. Can vision-language models solve the shell game?arXiv preprint arXiv:2603.08436, 2026. 14

  38. [38]

    Videoreasonbench: Can MLLMs perform vision-centric complex video reasoning? InInternational Conference on Learning Representations, 2026

    Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y.Charles, Xinyu Zhou, and Xu Sun. Videoreasonbench: Can MLLMs perform vision-centric complex video reasoning? InInternational Conference on Learning Representations, 2026

  39. [39]

    Openeqa: Embodied question answering in the era of foundation models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. InIEEE Conference on Computer Vision and Pattern Recognition, 2024

  40. [40]

    Egoschema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. InAdvances in Neural Information Processing Systems, 2023

  41. [41]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, 2021

  42. [42]

    Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models

    Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models. InInternational Conference on Learning Representations, 2025

  43. [43]

    Oriane Sim’eoni, Huy V . Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  44. [44]

    Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026

  45. [45]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InIEEE Conference on Computer Vision and Pattern Recognition, 2024

  46. [46]

    WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling.arXiv preprint arXiv:2512.14614, 2025

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling.arXiv preprint arXiv:2512.14614, 2025

  47. [47]

    Advancing Open-Source World Models.arXiv preprint arXiv:2601.20540, 2026

    Team Robbyant, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing Open-Source World Models.arXiv preprint arXiv:2601.20540, 2026

  48. [48]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. InAdvances in Neural Information Processing Systems, volume 37, 2024

  49. [49]

    Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  50. [50]

    Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdul- mohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

  51. [51]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 15

  52. [52]

    LVBench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, et al. LVBench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024

  53. [53]

    InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  54. [54]

    Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos.arXiv preprint arXiv:2312.05269, 2023

    Ying Wang, Yanlai Yang, and Mengye Ren. Lifelongmemory: Leveraging llms for answering queries in long-form egocentric videos.arXiv preprint arXiv:2312.05269, 2023

  55. [55]

    Continuous perception matters: Diagnosing temporal integration failures in multimodal models.arXiv preprint arXiv:2408.07867, 2024

    Zeyu Wang, Zhenzhen Weng, and Serena Yeung-Levy. Continuous perception matters: Diagnosing temporal integration failures in multimodal models.arXiv preprint arXiv:2408.07867, 2024

  56. [56]

    Ryoo, and Juan Carlos Niebles

    Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, and Juan Carlos Niebles. Active video perception: Iterative evidence seeking for agentic long video understanding.arXiv preprint arXiv:2512.05774, 2025

  57. [57]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. InAdvances in Neural Information Processing Systems, 2024

  58. [58]

    Next-qa: Next phase of question- answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question- answering to explaining temporal actions. InIEEE Conference on Computer Vision and Pattern Recogni- tion, 2021

  59. [59]

    Mimo-vl technical report.arXiv preprint arXiv:2506.03569, 2025

    Xiaomi LLM-Core Team, Zihao Yue, Zhenrui Lin, Yi-Hao Song, Weikun Wang, Shu-Qin Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zi-Ang Jiang, Zhixian Zheng, Zhichao Song, Zhen Luo, Yue Yu, Yudong Wang, Yu Tian, Yu Tu, Yihan Yan, Yi ...

  60. [60]

    Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces

    Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces. InIEEE Conference on Computer Vision and Pattern Recognition, 2024

  61. [61]

    Cambrian-s: Towards spatial supersensing in video

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis L Brown II, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-s: Towards spatial supersensing in video. InInternational Conference on Learning Representations, 2026

  62. [62]

    Lmms-eval: Reality check on the evaluation of large multimodal models.arXiv preprint arXiv:2407.12772, 2024

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models.arXiv preprint arXiv:2407.12772, 2024

  63. [63]

    Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025

  64. [64]

    Mmvu: Measuring expert-level multi-discipline video understanding

    Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert-level multi-discipline video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8475–8489, 2025.

  65. [65]

    Mlvu: A comprehensive benchmark for multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024

  66. [66]

    Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

    A. Benchmark Breakdown

    A.1. Detailed Information

    Formal definition. In Table 6 and 7, we provide a...

  67. [67]

    When rolling up, the old Bottom face becomes the new Front face

    Move 1 (roll up): The visible faces become Top = Green, Front = Pink, Right = Blue. When rolling up, the old Bottom face becomes the new Front face. Since the new Front is Pink, the initial Bottom face was Pink. Because the initial Top face was Red, we can conclude that Red and Pink are opposite faces

  68. [68]

    The visible faces are Top = Blue, Front = Green, Right = Pink

    Move 3 (roll left): From Move 2 (Top = Red, Front = Green, Right = Blue), rolling left makes the old Right (Blue) the new Top, and the old Bottom (Pink) the new Right. The visible faces are Top = Blue, Front = Green, Right = Pink

  69. [69]

    The visible faces are Top = Green, Front = Yellow, Right = Pink

    Move 4 (roll up): Rolling up from Move 3 makes the old Bottom the new Front. The visible faces are Top = Green, Front = Yellow, Right = Pink. Since the new Front is Yellow, the old Bottom was Yellow. Because the Top in Move 3 was Blue, Blue and Yellow are opposite faces
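The roll rules used in Moves 3 and 4 of the trace can be checked with a small state simulator. This is a minimal sketch, not the paper's code: the `Die` class and its method names are hypothetical, and the trace states only part of each rule (old Bottom becomes new Front for a roll up; old Right becomes new Top and old Bottom becomes new Right for a roll left). The remaining face assignments are filled in here on the assumption that the die is a rigid cube, and the hidden faces are taken from the opposite pairs the trace derives (Red/Pink, Blue/Yellow, Green/White).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Die:
    """Hypothetical helper: colors on the six faces of a cube."""
    top: str
    bottom: str
    front: str
    back: str
    left: str
    right: str

    def roll_up(self) -> "Die":
        # Trace rule: old Bottom becomes new Front.
        # Assumed rigid-cube completion: Front -> Top, Top -> Back, Back -> Bottom.
        return Die(top=self.front, bottom=self.back, front=self.bottom,
                   back=self.top, left=self.left, right=self.right)

    def roll_left(self) -> "Die":
        # Trace rule: old Right becomes new Top, old Bottom becomes new Right.
        # Assumed rigid-cube completion: Top -> Left, Left -> Bottom.
        return Die(top=self.right, bottom=self.left, front=self.front,
                   back=self.back, left=self.top, right=self.bottom)

    def visible(self) -> tuple:
        # The three faces the trace reports after each move.
        return (self.top, self.front, self.right)


# State after Move 2 as given in the trace (Top = Red, Front = Green, Right = Blue);
# hidden faces come from the opposite pairs Red/Pink, Green/White, Blue/Yellow.
after_move_2 = Die(top="Red", bottom="Pink", front="Green",
                   back="White", left="Yellow", right="Blue")
after_move_3 = after_move_2.roll_left()
after_move_4 = after_move_3.roll_up()

print(after_move_3.visible())
print(after_move_4.visible())
```

Under these assumptions the simulated visible faces match the ones listed for Moves 3 and 4, which shows that the textual bookkeeping in the trace is internally consistent.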

  70. [70]

    Since Red and Pink are opposite faces, the Pink face is on the bottom (touching the floor) exactly when the Red face is on the top

    By elimination, the remaining two faces, Green and White, are opposite faces. Since Red and Pink are opposite faces, the Pink face is on the bottom (touching the floor) exactly when the Red face is on the top. We can track the Top face at the initial position and after each of the 19 moves by looking at the first color in each of the provided visible face ...

  71. [71]

    Moves 6, 7, and 8 (counted as 1 distinct time since they are consecutive)

  72. [72]

    Move 17

    The Pink face touched the floor 4 distinct times, including the initial position.

    Answer: 4 ✓ GT: 4

    Figure 9 | Text transcription and thinking trace summary for the rolling die task.

    Q. At the end of the video, which position is the Cup B (that contains the ball) in? [Text description.] Three cups are placed in order (Left, Middle, Right) and the ba...

  73. [73]

    Answer: Center ✓ GT: Center. Figure 10 | Text transcription and thinking trace summary for the shell game task

    10. (Left, Right) - The ball is not involved in this swap, so it stays at Middle.

    Answer: Center ✓ GT: Center

    Figure 10 | Text transcription and thinking trace summary for the shell game task.

    Q. Which corner will the ball end up at? [Text description.] Label the corners of a box as follows:

    1 -- 2
    |    |
    4 -- 3

    Place a ball at corner 2. Then tilt the box in the follow...

  74. [74]

    Red scores

    At 00:08, the puck goes into the bottom goal. Red scores. (Red 1 - Blue 0)

  75. [75]

    Blue scores

    At 00:18, the puck goes into the top goal. Blue scores. (Red 1 - Blue 1)

  76. [76]

    Blue scores

    At 00:27, the puck goes into the top goal. Blue scores. (Red 1 - Blue 2)

  77. [77]

    Red scores

    At 00:39, the puck goes into the bottom goal. Red scores. (Red 2 - Blue 2)

  78. [78]

    Red scores

    At 00:49, the puck goes into the bottom goal. Red scores. (Red 3 - Blue 2)

  79. [79]

    Red scores

    At 00:58, the puck goes into the bottom goal. Red scores. (Red 4 - Blue 2) The final score is Red 4, Blue 2. However, this option is not available. Let’s re-examine the video carefully. It appears there might be a misunderstanding of the video’s events or the provided options are incorrect. Given the strict instruction to only output a single letter from ...
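The goal-by-goal trace above is a running tally, which can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the `tally` function and `SCORER_BY_GOAL` mapping are hypothetical names, and the events are copied from the model's own trace, so the code checks only the arithmetic, not whether the model perceived the goals correctly.

```python
from collections import Counter

# Rule read off the trace: a puck in the bottom goal scores for Red,
# a puck in the top goal scores for Blue.
SCORER_BY_GOAL = {"bottom": "Red", "top": "Blue"}

# (timestamp, goal) events exactly as listed in the trace.
events = [
    ("00:08", "bottom"),
    ("00:18", "top"),
    ("00:27", "top"),
    ("00:39", "bottom"),
    ("00:49", "bottom"),
    ("00:58", "bottom"),
]


def tally(events):
    """Return the running (timestamp, red, blue) score after each event."""
    score = Counter()
    history = []
    for timestamp, goal in events:
        score[SCORER_BY_GOAL[goal]] += 1
        history.append((timestamp, score["Red"], score["Blue"]))
    return history


history = tally(events)
for timestamp, red, blue in history:
    print(f"{timestamp}: Red {red} - Blue {blue}")
```

The running totals agree with the parenthesized scores in the trace, ending at Red 4, Blue 2. The trace itself notes that this score is not among the answer options, so the discrepancy lies in the perceived events rather than in the counting.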

  80. [80]

    The user distinctly taps the numbers 1, 2, 3 (frames 00:01 to 00:03)

Showing first 80 references.