Motion-o: Trajectory-Grounded Video Reasoning
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
Recent video reasoning models increasingly produce spatio-temporal evidence chains that localize objects at specific timestamps. While these traces improve interpretability by grounding where and when evidence appears, they often leave the motion connecting observations, the how, implicit. This makes dynamic and trajectory-dependent claims difficult to supervise, verify, or penalize when unsupported by the video. We formalize this missing component as Spatial-Temporal-Trajectory (STT) reasoning and introduce Motion-o, a motion-centric extension to vision-language models (VLMs) that makes trajectories explicit and verifiable. Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete <motion/> tag summarizing direction, speed, and scale change. To supervise MCoT, we densify sparse spatio-temporal annotations into object tracks and derive motion descriptors from centroid displacement and box-area change. We then train with complementary rewards for trajectory consistency and visual grounding, including a perturbation-based signal that penalizes motion descriptions that remain unchanged when temporal evidence is removed. Across multiple video understanding benchmarks, Motion-o consistently improves trajectory-faithful reasoning without architectural modifications. These results suggest that an explicit motion interface can complement existing VLM pipelines by converting implicit dynamics into verifiable evidence. Code is available at https://github.com/ostadabbas/Motion-o (ostadabbas/Motion-o).
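The abstract says motion descriptors are derived from centroid displacement and box-area change. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the function name `motion_descriptor`, the box format `(x1, y1, x2, y2)` in normalized coordinates, the thresholds `still_eps` and `scale_eps`, and the label vocabulary are all assumptions made for illustration.

```python
import math


def motion_descriptor(box_a, box_b, dt, still_eps=1e-3, scale_eps=0.1):
    """Summarize motion between two boxes of the same object.

    Hypothetical helper: boxes are (x1, y1, x2, y2) with y growing downward,
    dt is the time gap in seconds, and both thresholds are assumed values.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")

    # Centroid displacement between the two observations.
    cxa, cya = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cxb, cyb = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = cxb - cxa, cyb - cya
    dist = math.hypot(dx, dy)

    # Direction: dominant axis of the displacement, or "static" if tiny.
    if dist < still_eps:
        direction = "static"
    elif abs(dx) >= abs(dy):
        direction = "right" if dx > 0 else "left"
    else:
        direction = "down" if dy > 0 else "up"

    # Scale change: ratio of box areas, bucketed with a tolerance band.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    if area_a <= 0:
        raise ValueError("first box must have positive area")
    ratio = area_b / area_a
    if ratio > 1 + scale_eps:
        scale = "growing"
    elif ratio < 1 - scale_eps:
        scale = "shrinking"
    else:
        scale = "stable"

    return {"direction": direction, "speed": dist / dt, "scale": scale}


# Example: an object sliding to the right over two seconds without resizing.
example = motion_descriptor((0.1, 0.4, 0.3, 0.6), (0.5, 0.4, 0.7, 0.6), dt=2.0)
```

In this sketch the speed is left as a raw number in normalized units per second; how Motion-o discretizes speed or serializes the result into its motion tag is not stated in the abstract, so that step is omitted here.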
fields
cs.RO 1
years
2026 1
verdicts
UNVERDICTED 1
representative citing papers
When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA
Closed-Loop Trace Distillation distills one-line natural-language prompts from labeled training traces to improve VLM accuracy on predicting minimal-success action chains in Exploratory Manipulation Trace QA by 0.38-0.47 across simulator and real-robot tasks.