MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-17 05:52 UTC · model grok-4.3
The pith
Adding motion tracking and 3D depth signals to vision-language models lets them handle physics reasoning in videos nearly as well as closed-source leaders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MASS is a model-agnostic approach that injects spatiotemporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. Combined with the MASS-Bench dataset of real-world and AIGC videos with free-form physics question-answer pairs, and with reinforcement fine-tuning, this produces VLMs whose physics reasoning and comprehension matches or approaches closed-source state-of-the-art models.
What carries the argument
MASS, the model-agnostic method that converts physical context cues into aligned representations using depth-based 3D encoding, visual grounding, and motion tracking for object dynamics.
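The paper does not specify how these cues enter the language space; the sketch below is only a plausible illustration of the idea of serializing per-object depth and motion-track signals into text a VLM can consume. The names (`MotionTrack`, `format_physics_context`) and the field layout are hypothetical, not the authors' API.

```python
from dataclasses import dataclass

@dataclass
class MotionTrack:
    """Hypothetical per-object track: image-plane position plus depth, sampled over time."""
    label: str
    timestamps: list[float]                        # seconds into the clip
    positions: list[tuple[float, float, float]]    # (x, y, depth) per timestamp

def format_physics_context(tracks: list[MotionTrack]) -> str:
    """Serialize motion/depth cues into plain text for the VLM's language space.

    One plausible realisation of 'injecting spatiotemporal signals'; the actual
    MASS encoding (learned tokens vs. text) is not specified in this review.
    """
    lines = []
    for t in tracks:
        start, end = t.positions[0], t.positions[-1]
        dt = t.timestamps[-1] - t.timestamps[0]
        vel = tuple(round((e - s) / dt, 2) for s, e in zip(start, end)) if dt > 0 else (0.0, 0.0, 0.0)
        lines.append(f"{t.label}: start {start}, end {end}, mean velocity {vel} per second over {dt:.1f}s")
    return "Physical context:\n" + "\n".join(lines)

# Usage: prepend the serialized context to the physics question before querying the VLM.
ball = MotionTrack("red ball", [0.0, 1.0, 2.0], [(0.1, 2.0, 3.0), (0.5, 1.2, 3.1), (0.9, 0.1, 3.2)])
print(format_physics_context([ball]) + "\n\nQuestion: Is the ball in free fall?")
```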
If this is right
- Refined VLMs outperform comparable baselines, larger models, and prior state-of-the-art on physics reasoning and comprehension.
- Performance reaches levels comparable to closed-source VLMs, with only a 2% gap to Gemini-2.5-Flash.
- The approach strengthens cross-modal alignment for motion dynamics and spatial interactions in video inputs.
- The released benchmark supplies detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking.
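Those annotation types suggest a per-sample record roughly like the following sketch; the field names and JSON layout are guesses for illustration, not the released MASS-Bench schema.

```python
import json

# Hypothetical MASS-Bench record; field names are illustrative guesses, not the released format.
sample = {
    "video_id": "aigc_00042",
    "source": "aigc",                      # "real" or "aigc"
    "question": "Which object hits the ground first?",
    "answer": "the metal cube",
    "detections": [                        # per-frame 2D boxes for grounded entities
        {"frame": 12, "label": "metal cube", "bbox": [103, 55, 180, 130]},
    ],
    "subsegment_grounding": [              # temporal spans tied to question entities
        {"label": "metal cube", "start_s": 0.4, "end_s": 2.1},
    ],
    "motion_tracks_3d": [                  # full-sequence 3D trajectories (x, y, depth)
        {"label": "metal cube", "xyz": [[0.2, 1.8, 3.0], [0.2, 1.1, 3.0], [0.2, 0.0, 3.0]]},
    ],
}
print(json.dumps(sample, indent=2))
```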
Where Pith is reading between the lines
- The same injection of dynamic 3D signals could extend to other video reasoning domains that involve object interactions over time.
- Explicit motion tracking may help models reduce errors on long-sequence videos where implicit learning of dynamics falls short.
- The benchmark's mix of real and generated videos offers a way to test whether models generalize physics understanding across video sources.
Load-bearing premise
The MASS-Bench questions and annotations truly isolate physics reasoning and motion comprehension instead of testing general video understanding or annotation patterns.
What would settle it
An ablation test in which removing the motion tracker or depth-based 3D encoding leaves performance on the physics benchmark unchanged, or a control experiment showing similar scores on non-physics video questions.
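A minimal harness for that settling experiment could look like the sketch below, assuming a hypothetical `evaluate_on_massbench` callable that returns accuracy for a given configuration and split; none of these names come from the paper.

```python
# Hedged sketch of the settling experiment: compare full MASS against variants with the
# motion tracker or depth encoding removed, plus a non-physics control split.
# `evaluate_on_massbench` is a hypothetical callable returning accuracy in [0, 1].

def run_ablation(evaluate_on_massbench) -> dict:
    variants = {
        "full":       {"motion_tracker": True,  "depth_3d": True},
        "no_tracker": {"motion_tracker": False, "depth_3d": True},
        "no_depth":   {"motion_tracker": True,  "depth_3d": False},
    }
    scores = {name: evaluate_on_massbench(split="physics", **cfg) for name, cfg in variants.items()}
    scores["full_non_physics"] = evaluate_on_massbench(split="non_physics",
                                                       motion_tracker=True, depth_3d=True)
    return scores

def interpret(scores: dict, tol: float = 0.01) -> str:
    # If removing either signal barely moves the physics score, the benchmark may not actually
    # require those signals; a comparable non-physics score would hint the gains are generic.
    if scores["full"] - max(scores["no_tracker"], scores["no_depth"]) < tol:
        return "signals not load-bearing: benchmark may not isolate physics reasoning"
    return "signals load-bearing: ablations degrade physics performance"
```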
Original abstract
Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-related reasoning involving motion dynamics and spatial interactions. We present a novel approach to address this gap by translating physical-world context cues into interpretable representations aligned with VLM perception, comprehension, and reasoning. We introduce MASS, a model-agnostic approach that injects spatiotemporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. We also contribute a comprehensive benchmark, MASS-Bench, consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections and grounding over sub-segments, as well as full-sequence 3D motion tracking of entities. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning to MASS. Experiments and ablations show that our refined VLMs outperform comparable baselines, larger models, and prior state-of-the-art models, achieving performance comparable to closed-source state-of-the-art VLMs, with only a 2% gap to Gemini-2.5-Flash on physics reasoning and comprehension.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MASS, a model-agnostic method that injects motion-aware spatiotemporal signals into VLMs via depth-based 3D encoding, visual grounding, and a dedicated motion tracker for object dynamics. It contributes MASS-Bench, a dataset of 4,350 real-world and AIGC videos paired with 8,361 free-form QA annotations that include sub-segment grounding and full-sequence 3D motion tracks. Reinforcement fine-tuning is applied to align the VLM with these signals. Experiments claim that the resulting models outperform comparable open-source baselines and prior SOTA while closing to within 2% of Gemini-2.5-Flash on physics reasoning and comprehension tasks.
Significance. If the benchmark genuinely isolates physics and motion comprehension, the work would offer a practical route to improve VLM handling of dynamic spatial interactions, a persistent weakness in current video VLMs. The model-agnostic design and public benchmark constitute clear contributions that could be reused by the community. The reinforcement fine-tuning step is a reasonable alignment technique. However, the significance is tempered by the absence of controls that would confirm the benchmark measures the intended capabilities rather than general video understanding or annotation artifacts.
major comments (2)
- [Benchmark construction] Benchmark construction section: The manuscript provides no question examples, construction protocol, or controls (e.g., human performance on static frames only, or ablation removing motion tracks) to demonstrate that the 8,361 QA pairs require comprehension of dynamics and 3D spatial interactions rather than language priors or the supplied detections/tracks. This is load-bearing for the central claim that MASS plus fine-tuning yields gains specifically in physics reasoning.
- [Experiments and results] Experiments and results section: Reported performance numbers (including the 2% gap to Gemini-2.5-Flash) are presented without error bars, statistical significance tests, exact train/test splits, or details on prompt/video selection. Without these, it is impossible to determine whether the outperformance over baselines is robust or reproducible.
minor comments (2)
- [Abstract] Abstract: The claim of 'only a 2% gap' should specify the exact metric (accuracy, F1, etc.) and the precise baseline scores for transparency.
- [Method] Notation: The distinction between 'depth-based 3D encoding' and the 'motion tracker' outputs should be clarified with a diagram or explicit input/output definitions early in the method section.
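For the notation point, hypothetical input/output signatures along the following lines would already resolve the ambiguity; the shapes and names are illustrative assumptions, not the paper's definitions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DepthEncoding3D:
    """Dense per-frame depth lifted to 3D: depth maps plus camera intrinsics (assumed layout)."""
    depth_maps: np.ndarray   # shape (T, H, W), metres
    intrinsics: np.ndarray   # shape (3, 3)

@dataclass
class MotionTrackerOutput:
    """Sparse per-object trajectories produced by the motion tracker (assumed layout)."""
    labels: list              # one string label per tracked object
    trajectories: np.ndarray  # shape (N, T, 3): (x, y, depth) per object per frame
```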
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of benchmark validation and experimental reporting that we will address through targeted revisions to strengthen the paper's claims.
Point-by-point responses
-
Referee: [Benchmark construction] Benchmark construction section: The manuscript provides no question examples, construction protocol, or controls (e.g., human performance on static frames only, or ablation removing motion tracks) to demonstrate that the 8,361 QA pairs require comprehension of dynamics and 3D spatial interactions rather than language priors or the supplied detections/tracks. This is load-bearing for the central claim that MASS plus fine-tuning yields gains specifically in physics reasoning.
Authors: We agree that explicit examples, a detailed construction protocol, and targeted controls are necessary to substantiate that the QA pairs isolate physics reasoning and motion comprehension. In the revised manuscript, we will add representative question examples in the main text or appendix, along with a step-by-step description of the annotation protocol, including how questions were crafted to require dynamic and 3D understanding. We will also incorporate an ablation comparing model performance with and without motion tracks, and report human accuracy on static frames versus full video sequences to demonstrate the added value of spatiotemporal signals. These additions will directly support the central claim. revision: yes
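One way to run the promised static-frame control is sketched below; `answer_questions` and the frame-slicing convention are hypothetical placeholders, not the authors' evaluation code.

```python
# Hedged sketch of the proposed control: if accuracy on a single static frame approaches
# accuracy on the full video, the questions may be answerable without motion comprehension.
# `answer_questions(qa, frames)` is a hypothetical callable returning 1 if the model answers
# correctly from the given frames, else 0; `videos` are per-sample frame sequences.

def static_frame_control(answer_questions, qa_pairs, videos) -> dict:
    full_correct, static_correct = [], []
    for qa, video in zip(qa_pairs, videos):
        full_correct.append(answer_questions(qa, video))        # all frames
        static_correct.append(answer_questions(qa, video[:1]))  # first frame only
    full_acc = sum(full_correct) / len(full_correct)
    static_acc = sum(static_correct) / len(static_correct)
    return {"full_video_acc": full_acc,
            "static_frame_acc": static_acc,
            "motion_dependence": full_acc - static_acc}  # small gap = weak motion dependence
```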
-
Referee: [Experiments and results] Experiments and results section: Reported performance numbers (including the 2% gap to Gemini-2.5-Flash) are presented without error bars, statistical significance tests, exact train/test splits, or details on prompt/video selection. Without these, it is impossible to determine whether the outperformance over baselines is robust or reproducible.
Authors: We acknowledge the need for greater statistical rigor and reproducibility details. The experiments involved multiple runs, but these were not fully reported. In the revision, we will include error bars (standard deviation across 3–5 runs), results of statistical significance tests (e.g., paired t-tests against baselines), the exact train/test split ratios and video selection criteria, and full prompt templates. These details will be added to the Experiments section and supplementary material to allow readers to assess robustness. revision: yes
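The promised reporting could be produced with something like the snippet below; the per-run accuracies are placeholders, not values from the paper.

```python
import numpy as np
from scipy import stats

# Placeholder per-run accuracies across seeds; substitute the real 3-5 runs' results.
mass_runs     = np.array([0.712, 0.705, 0.718, 0.709])
baseline_runs = np.array([0.668, 0.661, 0.674, 0.659])

print(f"MASS:     {mass_runs.mean():.3f} +/- {mass_runs.std(ddof=1):.3f}")
print(f"baseline: {baseline_runs.mean():.3f} +/- {baseline_runs.std(ddof=1):.3f}")

# Paired t-test across matched seeds/splits, as proposed in the rebuttal.
t_stat, p_value = stats.ttest_rel(mass_runs, baseline_runs)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```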
Circularity Check
No significant circularity; empirical method and benchmark presented as independent
Full rationale
The paper introduces the MASS approach for injecting depth-based 3D encodings and motion tracking into VLMs, contributes the separate MASS-Bench dataset of 4,350 videos and 8,361 QA pairs with annotations, applies reinforcement fine-tuning, and reports ablation and comparison experiments. No equations, first-principles derivations, or predictions are described that reduce by construction to fitted parameters or self-defined quantities. Core claims rest on empirical outperformance rather than any self-citation chain or ansatz smuggled from prior author work. The benchmark and evaluation protocol are described as distinct from the model injection technique, with no load-bearing uniqueness theorems or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: depth-based 3D encoding and motion tracking can be aligned with the VLM language space to improve physics comprehension.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "MASS ... injects spatiotemporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.