arxiv: 2603.18856 · v2 · submitted 2026-03-19 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Motion-o: Trajectory-Grounded Video Reasoning

Bishoy Galoaa , Shayda Moezzi , Xiangyu Bai , Sarah Ostadabbas

Authors on Pith no claims yet

Pith reviewed 2026-05-15 08:43 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video reasoningmotion chain of thoughttrajectory groundingvision-language modelsspatio-temporal reasoningMCoTperturbation rewardobject tracking

0 comments

The pith

Motion-o adds explicit motion descriptions to video reasoning models by inserting a dedicated chain-of-thought tag for trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that video reasoning models produce evidence chains showing where and when objects appear but leave the connecting motion implicit, making trajectory-dependent claims hard to supervise or verify. Motion-o addresses this by extending vision-language models with Motion Chain of Thought, which inserts a structured tag that explicitly summarizes an object's direction, speed, and scale change. Supervision is created by turning sparse spatio-temporal annotations into full object tracks, computing motion descriptors from centroid shifts and box-area changes, and applying rewards that include a perturbation signal to ensure descriptions actually depend on the video evidence. The result is improved trajectory-faithful reasoning on multiple benchmarks with no changes to the underlying model architecture. Readers should care because it turns implicit dynamics into a component that can be directly trained, checked, and penalized when unsupported.

Core claim

Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete <motion/> tag summarizing direction, speed, and scale change. To supervise MCoT, sparse spatio-temporal annotations are densified into object tracks and motion descriptors are derived from centroid displacement and box-area change. Training uses complementary rewards for trajectory consistency and visual grounding, including a perturbation-based signal that penalizes motion descriptions that remain unchanged when temporal evidence is removed. Across multiple video understanding benchmarks, Motion-o consistently improves trajectory-faithful reasoning.

What carries the argument

Motion Chain of Thought (MCoT) with its <motion/> tag, supervised by densified object tracks, centroid/box-derived descriptors, and a perturbation-based consistency reward.

If this is right

Trajectory-dependent claims become directly supervisable and penalizable during training.
Existing VLM pipelines can incorporate explicit motion without architectural redesign.
Implicit dynamics in video evidence chains convert into verifiable components.
Consistent gains appear on multiple video understanding benchmarks for faithfulness to object paths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same perturbation technique could be applied to other missing elements in reasoning chains, such as causal relations or audio events.
Explicit motion tags might help downstream tasks like action prediction or video editing by providing machine-readable trajectory summaries.
If the densification step relies on off-the-shelf trackers, errors in those trackers could propagate and limit gains on videos with heavy occlusion or fast motion.

Load-bearing premise

The assumption that densifying sparse annotations into tracks and deriving motion from centroid displacement and box-area change, together with the perturbation reward, supplies accurate and unbiased supervision without introducing fitting artifacts or data biases.

What would settle it

Train the same base model once with and once without the MCoT tag and perturbation reward, then measure whether trajectory-faithful reasoning scores on the benchmarks rise only in the version that includes them; if the scores stay the same or drop, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2603.18856 by Bishoy Galoaa, Sarah Ostadabbas, Shayda Moezzi, Xiangyu Bai.

**Figure 1.** Figure 1: Motion-o: trajectory-grounded video reasoning. Although recent video models can produce fluent <think> traces, their reasoning is typically ungrounded, it does not explicitly tie claims back to where/when evidence occurs in the video, nor does it encode the motion that connects observations. Motion-o introduces a Spatial–Temporal–Trajectory (STT) evidence chain: timestamped <obj, box, t> observations pai… view at source ↗

**Figure 2.** Figure 2: CoT vs. Motion Chain-ofThought (MCoT). CoT in [13] yields sparse temporal bounding boxes, forcing implicit inter-frame interpolation. MCoT adds an object-conditioned <motion/> tag that explicitly parameterizes the dynamics between observations, making the evidence chain trajectory-consistent and reducing prior-driven extrapolation. We introduce Motion-o, a motioncentric extension to grounded video rea… view at source ↗

**Figure 3.** Figure 3: Motion-o end-to-end pipeline. Starting from the spatio-temporal grounding dataset from [13] with keyframe boxes plus the addition of our dense interpolated bounding boxes (purple box), we (i) compute track-derived motion primitives (direction/speed/scale) and inject them into the reasoning trace as an MCoT <motion/> tag (blue box), (ii) perform supervised fine-tuning (Stage 1) to teach the model the struc… view at source ↗

**Figure 4.** Figure 4: Qualitative examples of Motion-o reasoning. Top: The model tracks Sheldon across three timestamps with varying camera angles and emits two <motion/> tags, both correctly identifying the subject as stationary despite significant viewpoint changes. The dense multi-point grounding enables the model to distinguish true stationarity from apparent visual displacement caused by camera cuts. Bottom: The model gr… view at source ↗

read the original abstract

Recent video reasoning models increasingly produce spatio-temporal evidence chains that localize objects at specific timestamps. While these traces improve interpretability by grounding \emph{where} and \emph{when} evidence appears, they often leave the motion connecting observations, the \textit{how}, implicit. This makes dynamic and trajectory-dependent claims difficult to supervise, verify, or penalize when unsupported by the video. We formalize this missing component as Spatial-Temporal-Trajectory (STT) reasoning and introduce \textbf{Motion-o}, a motion-centric extension to vision-language models (VLMs) that makes trajectories explicit and verifiable. Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete \texttt{<motion/>} tag summarizing direction, speed, and scale change. To supervise MCoT, we densify sparse spatio-temporal annotations into object tracks and derive motion descriptors from centroid displacement and box-area change. We then train with complementary rewards for trajectory consistency and visual grounding, including a perturbation-based signal that penalizes motion descriptions that remain unchanged when temporal evidence is removed. Across multiple video understanding benchmarks, Motion-o consistently improves trajectory-faithful reasoning without architectural modifications. These results suggest that an explicit motion interface can complement existing VLM pipelines by converting implicit dynamics into verifiable evidence. Code is available at~\href{https://github.com/ostadabbas/Motion-o}{\faGithub\ \texttt{ostadabbas/Motion-o}}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Motion-o adds explicit <motion/> tags to video VLM chains via densified tracks and a perturbation reward, but the descriptors are limited to centroid and scale changes.

read the letter

The main thing to know is that this paper formalizes Spatial-Temporal-Trajectory reasoning and inserts a discrete Motion Chain of Thought with a tag that summarizes direction, speed, and scale. They densify sparse annotations into tracks, derive the tags from centroid displacement and box-area change, and add a perturbation reward that penalizes motion descriptions unchanged after frame removal. This keeps the model architecture untouched while making the implicit motion part of evidence chains more verifiable and supervisable. Code is released, which helps reproducibility. The approach is a clean, practical extension of existing spatio-temporal chains and targets a genuine gap in dynamic video understanding. The soft spot is the motion representation itself. Centroid and area scalars do not capture rotation, shear, non-rigid deformation, or varying intra-object velocities, and the perturbation check only tests stability rather than correctness against actual trajectory. If the benchmarks lean toward simple translational motion, the claimed consistent gains could be narrower than they look. Full ablations and error breakdowns would clarify this. The work is aimed at researchers building or extending VLMs for video reasoning, robotics, or surveillance. The thinking is straightforward and the gap is real, so the paper deserves a serious referee to examine the quantitative results and test the motion descriptors on more varied trajectories. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces Motion-o, a motion-centric extension to vision-language models that augments spatio-temporal evidence chains with Motion Chain of Thought (MCoT) via a discrete <motion/> tag. Motion descriptors are derived from densified object tracks using centroid displacement (for direction/speed) and box-area change (for scale), supervised via trajectory-consistency and perturbation-based rewards that penalize invariant motion tags under frame removal. The central claim is that this yields consistent improvements in trajectory-faithful reasoning across video benchmarks without architectural changes to the underlying VLM.

Significance. If the quantitative results and ablations hold, the work provides a lightweight, verifiable interface for explicit motion reasoning that complements existing VLM pipelines. The availability of code at the cited GitHub repository is a positive factor for reproducibility. However, the limited motion representation (translation and scale only) and absence of visible metrics in the provided text limit the assessed impact until supported by data.

major comments (2)

[Motion Descriptors and Supervision] The motion descriptor construction (centroid displacement for direction/speed plus box-area change for scale) is invariant to rotation, shear, non-rigid deformation, and intra-object velocity variation. This is load-bearing for the central claim of trajectory-grounded reasoning, because benchmark videos frequently contain articulated or rotational motion; the perturbation reward only tests tag invariance under frame removal and does not validate correctness against actual trajectories.
[Abstract and Experimental Results] No quantitative metrics, ablation studies, error bars, or reward-implementation details are visible in the abstract or manuscript summary, despite the claim of 'consistent improvements' across benchmarks. This undermines evaluation of the trajectory-faithful reasoning gains and the contribution of the perturbation signal.

minor comments (2)

[Method] Clarify the exact discretization thresholds used to map continuous centroid displacement and area change into discrete <motion/> tags (e.g., speed bins, direction quantization).
[Model Architecture] The abstract states 'without architectural modifications' but does not specify whether any additional projection layers or fine-tuning heads are introduced for the MCoT output.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve transparency and acknowledge limitations.

read point-by-point responses

Referee: [Motion Descriptors and Supervision] The motion descriptor construction (centroid displacement for direction/speed plus box-area change for scale) is invariant to rotation, shear, non-rigid deformation, and intra-object velocity variation. This is load-bearing for the central claim of trajectory-grounded reasoning, because benchmark videos frequently contain articulated or rotational motion; the perturbation reward only tests tag invariance under frame removal and does not validate correctness against actual trajectories.

Authors: We agree that the current descriptors, derived from centroid displacement and box-area change, capture only translation and scale and are invariant to rotation, shear, non-rigid deformation, and intra-object velocity variation. This is a genuine limitation of the motion representation. The perturbation reward enforces sensitivity to temporal evidence removal but does not provide direct validation against ground-truth trajectories. We have added an explicit limitations paragraph in the revised discussion section acknowledging these constraints and noting that the evaluated benchmarks primarily feature translational and scaling motions. Future extensions to include rotational and articulated components are outlined. revision: yes
Referee: [Abstract and Experimental Results] No quantitative metrics, ablation studies, error bars, or reward-implementation details are visible in the abstract or manuscript summary, despite the claim of 'consistent improvements' across benchmarks. This undermines evaluation of the trajectory-faithful reasoning gains and the contribution of the perturbation signal.

Authors: The full manuscript includes quantitative results with metrics, ablations, and error bars in Tables 1–3 and Section 4, along with reward implementation details (including the perturbation signal) in Section 3.2. To address visibility, we have revised the abstract to report specific gains (e.g., average +3.8% trajectory-faithful accuracy across benchmarks) and added a pointer to the experimental section in the introduction. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines MCoT supervision explicitly by densifying external spatio-temporal annotations into tracks and computing motion descriptors via centroid displacement and box-area change, then applies perturbation rewards for consistency. These are constructive geometric definitions from independent inputs rather than fitted parameters renamed as predictions or self-referential equations. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation chain. The central improvement claim rests on benchmark results outside the method definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that motion can be accurately captured by simple centroid and area changes and that the perturbation signal isolates true motion grounding; no free parameters or invented physical entities are described.

axioms (1)

domain assumption Densifying sparse spatio-temporal annotations into object tracks yields reliable motion descriptors from centroid displacement and box-area change
Invoked to create supervision for MCoT without additional labeled motion data.

invented entities (1)

Motion Chain of Thought (MCoT) with <motion/> tag no independent evidence
purpose: Structured representation of object motion in reasoning chains
New interface introduced to make trajectories explicit and verifiable.

pith-pipeline@v0.9.0 · 5577 in / 1278 out tokens · 47522 ms · 2026-05-15T08:43:41.800760+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 5 internal anchors

[1]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Chen, R., Luo, T., Fan, Z., Zou, H., Feng, Z., Xie, G., Zhang, H., Wang, Z., Liu, Z., Huaijian, Z.: Datasets and recipes for video temporal grounding via reinforcement learning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. pp. 983–992 (2025)

work page 2025
[3]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

Cho, J.H., Madotto, A., Mavroudi, E., Afouras, T., Nagarajan, T., Maaz, M., Song, Y., Ma, T., Hu, S., Jain, S., et al.: Perceptionlm: Open-access data and models for detailed visual understanding. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

work page 2025
[4]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

Fan, Y., He, X., Yang, D., Zheng, K., Kuo, C.C., Zheng, Y., Guan, X., Wang, X.E.: Grit: Teaching mllms to think with images. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

work page 2025
[5]

Video-R1: Reinforcing Video Reasoning in MLLMs

Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., Wu, J., Zhang, X., Wang, B., Yue, X.: Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24108–24118 (2025)

work page 2025
[7]

arXiv preprint arXiv:2511.21375 (2025)

Gu, X., Zhang, H., Fan, Q., Niu, J., Zhang, Z., Zhang, L., Chen, G., Chen, F., Wen, L., Zhu, S.: Thinking with bounding boxes: Enhancing spatio-temporal video grounding via reinforcement fine-tuning. arXiv preprint arXiv:2511.21375 (2025)

work page arXiv 2025
[8]

arXiv preprint arXiv:2502.04326 (2025)

Hong, J., Yan, S., Cai, J., Jiang, X., Hu, Y., Xie, W.: Worldsense: Evaluat- ing real-world omnimodal understanding for multimodal llms. arXiv preprint arXiv:2502.04326 (2025)

work page arXiv 2025
[9]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Hong, W., Cheng, Y., Yang, Z., Wang, W., Wang, L., Gu, X., Huang, S., Dong, Y., Tang, J.: Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8450–8460 (2025)

work page 2025
[10]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)

work page 2024
[11]

arXiv preprint arXiv:2504.06958 (2025)

Li,X.,Yan,Z.,Meng,D.,Dong,L.,Zeng,X.,He,Y.,Wang,Y.,Qiao,Y.,Wang,Y., Wang, L.: Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958 (2025)

work page arXiv 2025
[12]

Liu, Y., Li, S., Liu, Y., Wang, Y., Ren, S., Li, L., Chen, S., Sun, X., Hou, L.: Temp- compass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476 (2024)

work page arXiv 2024
[13]

arXiv preprint arXiv:2510.20579 (2025)

Meng, J., Li, X., Wang, H., Tan, Y., Zhang, T., Kong, L., Tong, Y., Wang, A., Teng, Z., Wang, Y., et al.: Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence. arXiv preprint arXiv:2510.20579 (2025)

work page arXiv 2025
[14]

OpenAI: Introducing o3 and o4-mini.https://openai.com/index/introducing- o3-and-o4-mini/(2025)

work page 2025
[15]

arXiv preprint arXiv:2504.01805 (2025)

Ouyang, K., Liu, Y., Wu, H., Liu, Y., Zhou, H., Zhou, J., Meng, F., Sun, X.: Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805 (2025) 22 B. Galoaa, S. Moezzi et al

work page arXiv 2025
[16]

In: NeurIPS (2025)

Park, J., Na, J., Kim, J., Kim, H.J.: Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo. In: NeurIPS (2025)

work page 2025
[17]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

Varma, M., Delbrouck, J.B., Ostmeier, S., Chaudhari, A.S., Langlotz, C.: TRove: Discovering error-inducing static feature biases in temporal vision-language mod- els. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

work page 2025
[18]

In: The Fourteenth International Conference on Learning Representations (2026)

Wang,H.,Li,X.,Huang,Z.,Wang,A.,Wang,J.,Zhang,T.,Bai,S.,Kang,Z.,Feng, J., Zhuochen, W., et al.: Traceable evidence enhanced visual grounded reasoning: Evaluation and method. In: The Fourteenth International Conference on Learning Representations (2026)

work page 2026
[19]

VGR: Visual Grounded Reasoning

Wang, J., Kang, Z., Wang, H., Jiang, H., Li, J., Wu, B., Wang, Y., Ran, J., Liang, X., Feng, C., et al.: Vgr: Visual grounded reasoning. arXiv preprint arXiv:2506.11991 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

Wang, Q., Yu, Y., Yuan, Y., Mao, R., Zhou, T.: Videorft: Incentivizing video rea- soning capability in mllms via reinforced fine-tuning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

work page 2025
[21]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2026)

Wang, Y., Wang, Z., Xu, B., Du, Y., Lin, K., Xiao, Z., Yue, Z., Ju, J., Zhang, L., Yang, D., et al.: Time-r1: Post-training large vision language model for temporal video grounding. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2026)

work page 2026
[22]

In: Proceedings of the 61st Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers)

Wang, Y., Zheng, Z., Zhao, X., Li, J., Wang, Y., Zhao, D.: Vstar: A video-grounded dialogue dataset for situated semantic understanding with scene and topic transi- tions. In: Proceedings of the 61st Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers). pp. 5036–5048 (2023)

work page 2023
[23]

In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Wang, Z., Yoon, J., Yu, S., Islam, M.M., Bertasius, G., Bansal, M.: Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 28114–28128 (2025)

work page 2025
[24]

In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision (ICCV) Workshops

Zhang, X.: The escalator problem: Identifying implicit motion blindness in ai for accessibility. In: Proceedings of the IEEE/CVF International Conference on Com- puter Vision (ICCV) Workshops. pp. 6635–6643 (October 2025)

work page 2025
[25]

Group Sequence Policy Optimization

Zheng, C., Liu, S., Li, M., Chen, X.H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al.: Group sequence policy optimization. arXiv preprint arXiv:2507.18071 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing thinking with images via reinforcement learning. arXiv preprint arXiv:2505.14362 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025