Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?

Dingrui Wang, Zhihao Liang, Hongyuan Ye, Zhexiao Sun, Zhaowei Lu, Yuchen Zhang, Yuyu Zhao, Yuan Gao, Marvin Seegert, Finn Schäfer, Haotong Qin, Wei Li, Luigi Palmieri, Felix Jahncke, Mattia Piccinini, Johannes Betz

classification: cs.CV, cs.RO

keywords: planning, semantic, models, video, world, evaluation, reasoning, target-bench

While recent video world models can generate highly realistic videos, their ability to perform semantic reasoning and planning remains unclear and unquantified. We introduce Target-Bench, the first benchmark that enables comprehensive evaluation of video world models' semantic reasoning, spatial estimation, and planning capabilities. Target-Bench provides 450 robot-collected scenarios spanning 47 semantic categories, with SLAM-based trajectories serving as motion tendency references. The benchmark reconstructs motion from generated videos using a metric scale recovery mechanism, which allows planning performance to be evaluated with five complementary metrics focused on target-approaching capability and directional consistency. Our evaluation shows that the best off-the-shelf model achieves an overall score of only 0.341, revealing a significant gap between realistic visual generation and semantic reasoning in current video world models. Furthermore, we demonstrate that fine-tuning on a relatively small real-world robot dataset can significantly improve task-level planning performance.

Forward citations

Cited by 1 Pith paper

WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models
cs.AI 2026-04 unverdicted novelty 7.0

WorldMAP bootstraps reliable trajectory prediction in vision-language navigation by converting world-model-generated futures into structured supervision, cutting ADE by 18% and FDE by 42.1% on Target-Bench while makin...
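The WorldMAP summary reports ADE and FDE without defining them on this page. Assuming they carry their usual meaning in trajectory prediction (average displacement error over all waypoints and final displacement error at the last waypoint), a minimal sketch of how such numbers could be computed is shown below. The function name `displacement_errors` and the toy trajectories are hypothetical and are not taken from Target-Bench or WorldMAP.

```python
import math


def displacement_errors(pred, ref):
    """Return (ADE, FDE) for two equal-length 2D trajectories.

    ADE: mean Euclidean distance between matching waypoints.
    FDE: Euclidean distance between the final waypoints.
    Assumption: both trajectories are already in the same metric frame
    and sampled at the same time steps.
    """
    if not pred or len(pred) != len(ref):
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.dist(p, r) for p, r in zip(pred, ref)]
    return sum(dists) / len(dists), dists[-1]


# Toy example (made-up waypoints in metres, not benchmark data).
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ref = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, ref)
print(f"ADE={ade:.3f} m, FDE={fde:.3f} m")
```

Under these definitions, a relative reduction such as the one quoted above would compare the same quantity before and after a change in method; the benchmark's own five metrics are not reproduced here.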