MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-17 05:52 UTC · model grok-4.3
The pith
Adding motion tracking and 3D depth signals to vision-language models lets them handle physics reasoning in videos nearly as well as closed-source leaders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MASS is a model-agnostic approach that injects spatiotemporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. Combined with the MASS-Bench dataset of real-world and AIGC videos with free-form physics question-answer pairs, and with reinforcement fine-tuning, this produces VLMs whose physics reasoning and comprehension matches or approaches closed-source state-of-the-art models.
What carries the argument
MASS, the model-agnostic method that converts physical context cues into aligned representations using depth-based 3D encoding, visual grounding, and motion tracking for object dynamics.
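The paper does not specify how these cues enter the language space; the sketch below is only a plausible illustration of the idea of serializing per-object depth and motion-track signals into text a VLM can consume. The names (`MotionTrack`, `format_physics_context`) and the field layout are hypothetical, not the authors' API.

```python
from dataclasses import dataclass

@dataclass
class MotionTrack:
    """Hypothetical per-object track: image-plane position plus depth, sampled over time."""
    label: str
    timestamps: list[float]                        # seconds into the clip
    positions: list[tuple[float, float, float]]    # (x, y, depth) per timestamp

def format_physics_context(tracks: list[MotionTrack]) -> str:
    """Serialize motion/depth cues into plain text for the VLM's language space.

    One plausible realisation of 'injecting spatiotemporal signals'; the actual
    MASS encoding (learned tokens vs. text) is not specified in this review.
    """
    lines = []
    for t in tracks:
        start, end = t.positions[0], t.positions[-1]
        dt = t.timestamps[-1] - t.timestamps[0]
        vel = tuple(round((e - s) / dt, 2) for s, e in zip(start, end)) if dt > 0 else (0.0, 0.0, 0.0)
        lines.append(f"{t.label}: start {start}, end {end}, mean velocity {vel} per second over {dt:.1f}s")
    return "Physical context:\n" + "\n".join(lines)

# Usage: prepend the serialized context to the physics question before querying the VLM.
ball = MotionTrack("red ball", [0.0, 1.0, 2.0], [(0.1, 2.0, 3.0), (0.5, 1.2, 3.1), (0.9, 0.1, 3.2)])
print(format_physics_context([ball]) + "\n\nQuestion: Is the ball in free fall?")
```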
If this is right
- Refined VLMs outperform comparable baselines, larger models, and prior state-of-the-art on physics reasoning and comprehension.
- Performance reaches levels comparable to closed-source VLMs, with only a 2% gap to Gemini-2.5-Flash.
- The approach strengthens cross-modal alignment for motion dynamics and spatial interactions in video inputs.
- The released benchmark supplies detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking.
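Those annotation types suggest a per-sample record roughly like the following sketch; the field names and JSON layout are guesses for illustration, not the released MASS-Bench schema.

```python
import json

# Hypothetical MASS-Bench record; field names are illustrative guesses, not the released format.
sample = {
    "video_id": "aigc_00042",
    "source": "aigc",                      # "real" or "aigc"
    "question": "Which object hits the ground first?",
    "answer": "the metal cube",
    "detections": [                        # per-frame 2D boxes for grounded entities
        {"frame": 12, "label": "metal cube", "bbox": [103, 55, 180, 130]},
    ],
    "subsegment_grounding": [              # temporal spans tied to question entities
        {"label": "metal cube", "start_s": 0.4, "end_s": 2.1},
    ],
    "motion_tracks_3d": [                  # full-sequence 3D trajectories (x, y, depth)
        {"label": "metal cube", "xyz": [[0.2, 1.8, 3.0], [0.2, 1.1, 3.0], [0.2, 0.0, 3.0]]},
    ],
}
print(json.dumps(sample, indent=2))
```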
Where Pith is reading between the lines
- The same injection of dynamic 3D signals could extend to other video reasoning domains that involve object interactions over time.
- Explicit motion tracking may help models reduce errors on long-sequence videos where implicit learning of dynamics falls short.
- The benchmark's mix of real and generated videos offers a way to test whether models generalize physics understanding across video sources.
Load-bearing premise
The MASS-Bench questions and annotations truly isolate physics reasoning and motion comprehension instead of testing general video understanding or annotation patterns.
What would settle it
An ablation test in which removing the motion tracker or depth-based 3D encoding leaves performance on the physics benchmark unchanged, or a control experiment showing similar scores on non-physics video questions.
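A minimal harness for that settling experiment could look like the sketch below, assuming a hypothetical `evaluate_on_massbench` callable that returns accuracy for a given configuration and split; none of these names come from the paper.

```python
# Hedged sketch of the settling experiment: compare full MASS against variants with the
# motion tracker or depth encoding removed, plus a non-physics control split.
# `evaluate_on_massbench` is a hypothetical callable returning accuracy in [0, 1].

def run_ablation(evaluate_on_massbench) -> dict:
    variants = {
        "full":       {"motion_tracker": True,  "depth_3d": True},
        "no_tracker": {"motion_tracker": False, "depth_3d": True},
        "no_depth":   {"motion_tracker": True,  "depth_3d": False},
    }
    scores = {name: evaluate_on_massbench(split="physics", **cfg) for name, cfg in variants.items()}
    scores["full_non_physics"] = evaluate_on_massbench(split="non_physics",
                                                       motion_tracker=True, depth_3d=True)
    return scores

def interpret(scores: dict, tol: float = 0.01) -> str:
    # If removing either signal barely moves the physics score, the benchmark may not actually
    # require those signals; a comparable non-physics score would hint the gains are generic.
    if scores["full"] - max(scores["no_tracker"], scores["no_depth"]) < tol:
        return "signals not load-bearing: benchmark may not isolate physics reasoning"
    return "signals load-bearing: ablations degrade physics performance"
```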
Original abstract
Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-related reasoning involving motion dynamics and spatial interactions. We present a novel approach to address this gap by translating physical-world context cues into interpretable representations aligned with VLM perception, comprehension, and reasoning. We introduce MASS, a model-agnostic approach that injects spatiotemporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. We also contribute a comprehensive benchmark, MASS-Bench, consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections and grounding over sub-segments, as well as full-sequence 3D motion tracking of entities. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning to MASS. Experiments and ablations show that our refined VLMs outperform comparable baselines, larger models, and prior state-of-the-art models, achieving performance comparable to closed-source state-of-the-art VLMs, with only a 2% gap to Gemini-2.5-Flash on physics reasoning and comprehension.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MASS, a model-agnostic method that injects motion-aware spatiotemporal signals into VLMs via depth-based 3D encoding, visual grounding, and a dedicated motion tracker for object dynamics. It contributes MASS-Bench, a dataset of 4,350 real-world and AIGC videos paired with 8,361 free-form QA annotations that include sub-segment grounding and full-sequence 3D motion tracks. Reinforcement fine-tuning is applied to align the VLM with these signals. Experiments claim that the resulting models outperform comparable open-source baselines and prior SOTA while closing to within 2% of Gemini-2.5-Flash on physics reasoning and comprehension tasks.
Significance. If the benchmark genuinely isolates physics and motion comprehension, the work would offer a practical route to improve VLM handling of dynamic spatial interactions, a persistent weakness in current video VLMs. The model-agnostic design and public benchmark constitute clear contributions that could be reused by the community. The reinforcement fine-tuning step is a reasonable alignment technique. However, the significance is tempered by the absence of controls that would confirm the benchmark measures the intended capabilities rather than general video understanding or annotation artifacts.
major comments (2)
- [Benchmark construction] Benchmark construction section: The manuscript provides no question examples, construction protocol, or controls (e.g., human performance on static frames only, or ablation removing motion tracks) to demonstrate that the 8,361 QA pairs require comprehension of dynamics and 3D spatial interactions rather than language priors or the supplied detections/tracks. This is load-bearing for the central claim that MASS plus fine-tuning yields gains specifically in physics reasoning.
- [Experiments and results] Experiments and results section: Reported performance numbers (including the 2% gap to Gemini-2.5-Flash) are presented without error bars, statistical significance tests, exact train/test splits, or details on prompt/video selection. Without these, it is impossible to determine whether the outperformance over baselines is robust or reproducible.
minor comments (2)
- [Abstract] Abstract: The claim of 'only a 2% gap' should specify the exact metric (accuracy, F1, etc.) and the precise baseline scores for transparency.
- [Method] Notation: The distinction between 'depth-based 3D encoding' and the 'motion tracker' outputs should be clarified with a diagram or explicit input/output definitions early in the method section.
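For the notation point, hypothetical input/output signatures along the following lines would already resolve the ambiguity; the shapes and names are illustrative assumptions, not the paper's definitions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DepthEncoding3D:
    """Dense per-frame depth lifted to 3D: depth maps plus camera intrinsics (assumed layout)."""
    depth_maps: np.ndarray   # shape (T, H, W), metres
    intrinsics: np.ndarray   # shape (3, 3)

@dataclass
class MotionTrackerOutput:
    """Sparse per-object trajectories produced by the motion tracker (assumed layout)."""
    labels: list              # one string label per tracked object
    trajectories: np.ndarray  # shape (N, T, 3): (x, y, depth) per object per frame
```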
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of benchmark validation and experimental reporting that we will address through targeted revisions to strengthen the paper's claims.
Point-by-point responses
-
Referee: [Benchmark construction] Benchmark construction section: The manuscript provides no question examples, construction protocol, or controls (e.g., human performance on static frames only, or ablation removing motion tracks) to demonstrate that the 8,361 QA pairs require comprehension of dynamics and 3D spatial interactions rather than language priors or the supplied detections/tracks. This is load-bearing for the central claim that MASS plus fine-tuning yields gains specifically in physics reasoning.
Authors: We agree that explicit examples, a detailed construction protocol, and targeted controls are necessary to substantiate that the QA pairs isolate physics reasoning and motion comprehension. In the revised manuscript, we will add representative question examples in the main text or appendix, along with a step-by-step description of the annotation protocol, including how questions were crafted to require dynamic and 3D understanding. We will also incorporate an ablation comparing model performance with and without motion tracks, and report human accuracy on static frames versus full video sequences to demonstrate the added value of spatiotemporal signals. These additions will directly support the central claim. revision: yes
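One way to run the promised static-frame control is sketched below; `answer_questions` and the frame-slicing convention are hypothetical placeholders, not the authors' evaluation code.

```python
# Hedged sketch of the proposed control: if accuracy on a single static frame approaches
# accuracy on the full video, the questions may be answerable without motion comprehension.
# `answer_questions(qa, frames)` is a hypothetical callable returning 1 if the model answers
# correctly from the given frames, else 0; `videos` are per-sample frame sequences.

def static_frame_control(answer_questions, qa_pairs, videos) -> dict:
    full_correct, static_correct = [], []
    for qa, video in zip(qa_pairs, videos):
        full_correct.append(answer_questions(qa, video))        # all frames
        static_correct.append(answer_questions(qa, video[:1]))  # first frame only
    full_acc = sum(full_correct) / len(full_correct)
    static_acc = sum(static_correct) / len(static_correct)
    return {"full_video_acc": full_acc,
            "static_frame_acc": static_acc,
            "motion_dependence": full_acc - static_acc}  # small gap = weak motion dependence
```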
-
Referee: [Experiments and results] Experiments and results section: Reported performance numbers (including the 2% gap to Gemini-2.5-Flash) are presented without error bars, statistical significance tests, exact train/test splits, or details on prompt/video selection. Without these, it is impossible to determine whether the outperformance over baselines is robust or reproducible.
Authors: We acknowledge the need for greater statistical rigor and reproducibility details. The experiments involved multiple runs, but these were not fully reported. In the revision, we will include error bars (standard deviation across 3–5 runs), results of statistical significance tests (e.g., paired t-tests against baselines), the exact train/test split ratios and video selection criteria, and full prompt templates. These details will be added to the Experiments section and supplementary material to allow readers to assess robustness. revision: yes
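The promised reporting could be produced with something like the snippet below; the per-run accuracies are placeholders, not values from the paper.

```python
import numpy as np
from scipy import stats

# Placeholder per-run accuracies across seeds; substitute the real 3-5 runs' results.
mass_runs     = np.array([0.712, 0.705, 0.718, 0.709])
baseline_runs = np.array([0.668, 0.661, 0.674, 0.659])

print(f"MASS:     {mass_runs.mean():.3f} +/- {mass_runs.std(ddof=1):.3f}")
print(f"baseline: {baseline_runs.mean():.3f} +/- {baseline_runs.std(ddof=1):.3f}")

# Paired t-test across matched seeds/splits, as proposed in the rebuttal.
t_stat, p_value = stats.ttest_rel(mass_runs, baseline_runs)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```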
Circularity Check
No significant circularity; empirical method and benchmark presented as independent
Full rationale
The paper introduces the MASS approach for injecting depth-based 3D encodings and motion tracking into VLMs, contributes the separate MASS-Bench dataset of 4,350 videos and 8,361 QA pairs with annotations, applies reinforcement fine-tuning, and reports ablation and comparison experiments. No equations, first-principles derivations, or predictions are described that reduce by construction to fitted parameters or self-defined quantities. Core claims rest on empirical outperformance rather than any self-citation chain or ansatz smuggled from prior author work. The benchmark and evaluation protocol are described as distinct from the model injection technique, with no load-bearing uniqueness theorems or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: depth-based 3D encoding and motion tracking can be aligned with the VLM language space to improve physics comprehension.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "MASS ... injects spatiotemporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.