Recognition: no theorem link
Motion-o: Trajectory-Grounded Video Reasoning
Pith reviewed 2026-05-15 08:43 UTC · model grok-4.3
The pith
Motion-o adds explicit motion descriptions to video reasoning models by inserting a dedicated chain-of-thought tag for trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete <motion/> tag summarizing direction, speed, and scale change. To supervise MCoT, sparse spatio-temporal annotations are densified into object tracks and motion descriptors are derived from centroid displacement and box-area change. Training uses complementary rewards for trajectory consistency and visual grounding, including a perturbation-based signal that penalizes motion descriptions that remain unchanged when temporal evidence is removed. Across multiple video understanding benchmarks, Motion-o consistently improves trajectory-faithful reasoning.
What carries the argument
Motion Chain of Thought (MCoT) with its <motion/> tag, supervised by densified object tracks, centroid/box-derived descriptors, and a perturbation-based consistency reward.
If this is right
- Trajectory-dependent claims become directly supervisable and penalizable during training.
- Existing VLM pipelines can incorporate explicit motion without architectural redesign.
- Implicit dynamics in video evidence chains convert into verifiable components.
- Consistent gains appear on multiple video understanding benchmarks for faithfulness to object paths.
Where Pith is reading between the lines
- The same perturbation technique could be applied to other missing elements in reasoning chains, such as causal relations or audio events.
- Explicit motion tags might help downstream tasks like action prediction or video editing by providing machine-readable trajectory summaries.
- If the densification step relies on off-the-shelf trackers, errors in those trackers could propagate and limit gains on videos with heavy occlusion or fast motion.
Load-bearing premise
The assumption that densifying sparse annotations into tracks and deriving motion from centroid displacement and box-area change, together with the perturbation reward, supplies accurate and unbiased supervision without introducing fitting artifacts or data biases.
What would settle it
Train the same base model once with and once without the MCoT tag and perturbation reward, then measure whether trajectory-faithful reasoning scores on the benchmarks rise only in the version that includes them; if the scores stay the same or drop, the central claim does not hold.
Figures
read the original abstract
Recent video reasoning models increasingly produce spatio-temporal evidence chains that localize objects at specific timestamps. While these traces improve interpretability by grounding \emph{where} and \emph{when} evidence appears, they often leave the motion connecting observations, the \textit{how}, implicit. This makes dynamic and trajectory-dependent claims difficult to supervise, verify, or penalize when unsupported by the video. We formalize this missing component as Spatial-Temporal-Trajectory (STT) reasoning and introduce \textbf{Motion-o}, a motion-centric extension to vision-language models (VLMs) that makes trajectories explicit and verifiable. Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete \texttt{<motion/>} tag summarizing direction, speed, and scale change. To supervise MCoT, we densify sparse spatio-temporal annotations into object tracks and derive motion descriptors from centroid displacement and box-area change. We then train with complementary rewards for trajectory consistency and visual grounding, including a perturbation-based signal that penalizes motion descriptions that remain unchanged when temporal evidence is removed. Across multiple video understanding benchmarks, Motion-o consistently improves trajectory-faithful reasoning without architectural modifications. These results suggest that an explicit motion interface can complement existing VLM pipelines by converting implicit dynamics into verifiable evidence. Code is available at~\href{https://github.com/ostadabbas/Motion-o}{\faGithub\ \texttt{ostadabbas/Motion-o}}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Motion-o, a motion-centric extension to vision-language models that augments spatio-temporal evidence chains with Motion Chain of Thought (MCoT) via a discrete <motion/> tag. Motion descriptors are derived from densified object tracks using centroid displacement (for direction/speed) and box-area change (for scale), supervised via trajectory-consistency and perturbation-based rewards that penalize invariant motion tags under frame removal. The central claim is that this yields consistent improvements in trajectory-faithful reasoning across video benchmarks without architectural changes to the underlying VLM.
Significance. If the quantitative results and ablations hold, the work provides a lightweight, verifiable interface for explicit motion reasoning that complements existing VLM pipelines. The availability of code at the cited GitHub repository is a positive factor for reproducibility. However, the limited motion representation (translation and scale only) and absence of visible metrics in the provided text limit the assessed impact until supported by data.
major comments (2)
- [Motion Descriptors and Supervision] The motion descriptor construction (centroid displacement for direction/speed plus box-area change for scale) is invariant to rotation, shear, non-rigid deformation, and intra-object velocity variation. This is load-bearing for the central claim of trajectory-grounded reasoning, because benchmark videos frequently contain articulated or rotational motion; the perturbation reward only tests tag invariance under frame removal and does not validate correctness against actual trajectories.
- [Abstract and Experimental Results] No quantitative metrics, ablation studies, error bars, or reward-implementation details are visible in the abstract or manuscript summary, despite the claim of 'consistent improvements' across benchmarks. This undermines evaluation of the trajectory-faithful reasoning gains and the contribution of the perturbation signal.
minor comments (2)
- [Method] Clarify the exact discretization thresholds used to map continuous centroid displacement and area change into discrete <motion/> tags (e.g., speed bins, direction quantization).
- [Model Architecture] The abstract states 'without architectural modifications' but does not specify whether any additional projection layers or fine-tuning heads are introduced for the MCoT output.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve transparency and acknowledge limitations.
read point-by-point responses
-
Referee: [Motion Descriptors and Supervision] The motion descriptor construction (centroid displacement for direction/speed plus box-area change for scale) is invariant to rotation, shear, non-rigid deformation, and intra-object velocity variation. This is load-bearing for the central claim of trajectory-grounded reasoning, because benchmark videos frequently contain articulated or rotational motion; the perturbation reward only tests tag invariance under frame removal and does not validate correctness against actual trajectories.
Authors: We agree that the current descriptors, derived from centroid displacement and box-area change, capture only translation and scale and are invariant to rotation, shear, non-rigid deformation, and intra-object velocity variation. This is a genuine limitation of the motion representation. The perturbation reward enforces sensitivity to temporal evidence removal but does not provide direct validation against ground-truth trajectories. We have added an explicit limitations paragraph in the revised discussion section acknowledging these constraints and noting that the evaluated benchmarks primarily feature translational and scaling motions. Future extensions to include rotational and articulated components are outlined. revision: yes
-
Referee: [Abstract and Experimental Results] No quantitative metrics, ablation studies, error bars, or reward-implementation details are visible in the abstract or manuscript summary, despite the claim of 'consistent improvements' across benchmarks. This undermines evaluation of the trajectory-faithful reasoning gains and the contribution of the perturbation signal.
Authors: The full manuscript includes quantitative results with metrics, ablations, and error bars in Tables 1–3 and Section 4, along with reward implementation details (including the perturbation signal) in Section 3.2. To address visibility, we have revised the abstract to report specific gains (e.g., average +3.8% trajectory-faithful accuracy across benchmarks) and added a pointer to the experimental section in the introduction. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper defines MCoT supervision explicitly by densifying external spatio-temporal annotations into tracks and computing motion descriptors via centroid displacement and box-area change, then applies perturbation rewards for consistency. These are constructive geometric definitions from independent inputs rather than fitted parameters renamed as predictions or self-referential equations. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation chain. The central improvement claim rests on benchmark results outside the method definition itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Densifying sparse spatio-temporal annotations into object tracks yields reliable motion descriptors from centroid displacement and box-area change
invented entities (1)
-
Motion Chain of Thought (MCoT) with <motion/> tag
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Chen, R., Luo, T., Fan, Z., Zou, H., Feng, Z., Xie, G., Zhang, H., Wang, Z., Liu, Z., Huaijian, Z.: Datasets and recipes for video temporal grounding via reinforcement learning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. pp. 983–992 (2025)
work page 2025
-
[3]
In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
Cho, J.H., Madotto, A., Mavroudi, E., Afouras, T., Nagarajan, T., Maaz, M., Song, Y., Ma, T., Hu, S., Jain, S., et al.: Perceptionlm: Open-access data and models for detailed visual understanding. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
work page 2025
-
[4]
In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
Fan, Y., He, X., Yang, D., Zheng, K., Kuo, C.C., Zheng, Y., Guan, X., Wang, X.E.: Grit: Teaching mllms to think with images. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
work page 2025
-
[5]
Video-R1: Reinforcing Video Reasoning in MLLMs
Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., Wu, J., Zhang, X., Wang, B., Yue, X.: Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24108–24118 (2025)
work page 2025
-
[7]
arXiv preprint arXiv:2511.21375 (2025)
Gu, X., Zhang, H., Fan, Q., Niu, J., Zhang, Z., Zhang, L., Chen, G., Chen, F., Wen, L., Zhu, S.: Thinking with bounding boxes: Enhancing spatio-temporal video grounding via reinforcement fine-tuning. arXiv preprint arXiv:2511.21375 (2025)
-
[8]
arXiv preprint arXiv:2502.04326 (2025)
Hong, J., Yan, S., Cai, J., Jiang, X., Hu, Y., Xie, W.: Worldsense: Evaluat- ing real-world omnimodal understanding for multimodal llms. arXiv preprint arXiv:2502.04326 (2025)
-
[9]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Hong, W., Cheng, Y., Yang, Z., Wang, W., Wang, L., Gu, X., Huang, S., Dong, Y., Tang, J.: Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8450–8460 (2025)
work page 2025
-
[10]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)
work page 2024
-
[11]
arXiv preprint arXiv:2504.06958 (2025)
Li,X.,Yan,Z.,Meng,D.,Dong,L.,Zeng,X.,He,Y.,Wang,Y.,Qiao,Y.,Wang,Y., Wang, L.: Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958 (2025)
- [12]
-
[13]
arXiv preprint arXiv:2510.20579 (2025)
Meng, J., Li, X., Wang, H., Tan, Y., Zhang, T., Kong, L., Tong, Y., Wang, A., Teng, Z., Wang, Y., et al.: Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence. arXiv preprint arXiv:2510.20579 (2025)
-
[14]
OpenAI: Introducing o3 and o4-mini.https://openai.com/index/introducing- o3-and-o4-mini/(2025)
work page 2025
-
[15]
arXiv preprint arXiv:2504.01805 (2025)
Ouyang, K., Liu, Y., Wu, H., Liu, Y., Zhou, H., Zhou, J., Meng, F., Sun, X.: Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805 (2025) 22 B. Galoaa, S. Moezzi et al
-
[16]
Park, J., Na, J., Kim, J., Kim, H.J.: Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo. In: NeurIPS (2025)
work page 2025
-
[17]
In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
Varma, M., Delbrouck, J.B., Ostmeier, S., Chaudhari, A.S., Langlotz, C.: TRove: Discovering error-inducing static feature biases in temporal vision-language mod- els. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
work page 2025
-
[18]
In: The Fourteenth International Conference on Learning Representations (2026)
Wang,H.,Li,X.,Huang,Z.,Wang,A.,Wang,J.,Zhang,T.,Bai,S.,Kang,Z.,Feng, J., Zhuochen, W., et al.: Traceable evidence enhanced visual grounded reasoning: Evaluation and method. In: The Fourteenth International Conference on Learning Representations (2026)
work page 2026
-
[19]
VGR: Visual Grounded Reasoning
Wang, J., Kang, Z., Wang, H., Jiang, H., Li, J., Wu, B., Wang, Y., Ran, J., Liang, X., Feng, C., et al.: Vgr: Visual grounded reasoning. arXiv preprint arXiv:2506.11991 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
Wang, Q., Yu, Y., Yuan, Y., Mao, R., Zhou, T.: Videorft: Incentivizing video rea- soning capability in mllms via reinforced fine-tuning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
work page 2025
-
[21]
In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2026)
Wang, Y., Wang, Z., Xu, B., Du, Y., Lin, K., Xiao, Z., Yue, Z., Ju, J., Zhang, L., Yang, D., et al.: Time-r1: Post-training large vision language model for temporal video grounding. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2026)
work page 2026
-
[22]
Wang, Y., Zheng, Z., Zhao, X., Li, J., Wang, Y., Zhao, D.: Vstar: A video-grounded dialogue dataset for situated semantic understanding with scene and topic transi- tions. In: Proceedings of the 61st Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers). pp. 5036–5048 (2023)
work page 2023
-
[23]
In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Wang, Z., Yoon, J., Yu, S., Islam, M.M., Bertasius, G., Bansal, M.: Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 28114–28128 (2025)
work page 2025
-
[24]
In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision (ICCV) Workshops
Zhang, X.: The escalator problem: Identifying implicit motion blindness in ai for accessibility. In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision (ICCV) Workshops. pp. 6635–6643 (October 2025)
work page 2025
-
[25]
Group Sequence Policy Optimization
Zheng, C., Liu, S., Li, M., Chen, X.H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al.: Group sequence policy optimization. arXiv preprint arXiv:2507.18071 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing thinking with images via reinforcement learning. arXiv preprint arXiv:2505.14362 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.