pith. machine review for the scientific record.

arxiv: 2511.17792 · v2 · submitted 2025-11-21 · 💻 cs.CV · cs.RO

Recognition: no theorem link

Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 19:56 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords video world models · semantic path planning · benchmark evaluation · mapless navigation · robot scenarios · scale recovery · directional consistency

The pith

The best off-the-shelf video world model scores only 0.341 overall on semantic reasoning and planning in Target-Bench

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Target-Bench, a benchmark designed to measure video world models' capabilities in semantic reasoning, spatial estimation, and planning for mapless path planning using semantic targets. It consists of 450 robot-collected scenarios from 47 categories, using SLAM trajectories as references for motion. Generated videos are evaluated after metric scale recovery with five metrics emphasizing target-approaching and directional consistency. The top model reaches only 0.341 overall, highlighting a gap in semantic capabilities, though fine-tuning on small real datasets improves results.

Core claim

Target-Bench enables evaluation of video world models by reconstructing motion from their generated videos using a metric scale recovery mechanism and comparing against SLAM-based trajectories with metrics for target-approaching capability and directional consistency, demonstrating that current models have limited semantic reasoning for planning tasks despite high visual quality.
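
A minimal sketch of the evaluation flow the core claim describes, assuming each scenario is scored independently. The function names below, including `estimate_poses` as a stand-in for a monocular pose estimator of the VGGT kind, are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of the evaluation flow described in the core claim above.
# `estimate_poses` is a hypothetical stand-in for a monocular pose estimator
# (e.g., a VGGT-style model); the metric callables are supplied by the caller.
# None of these names come from the paper's code.

def estimate_poses(video_frames):
    """Placeholder: return an (N, 3) array of camera positions, up to an
    unknown global scale, reconstructed from the generated video."""
    raise NotImplementedError  # hypothetical stand-in, not a real API

def evaluate_scenario(video_frames, slam_positions, target_pos,
                      recover_scale, metrics):
    """Score one Target-Bench scenario against its SLAM reference.

    slam_positions: (N, 3) metric reference trajectory recorded by the robot.
    recover_scale:  callable (pred, ref) -> scalar scale factor lambda.
    metrics:        dict of name -> callable(pred, ref, target) -> score in [0, 1].
    """
    pred = estimate_poses(video_frames)          # up-to-scale positions
    lam = recover_scale(pred, slam_positions)    # segment-level scale factor
    pred_metric = pred * lam                     # metrically comparable trajectory
    return {name: fn(pred_metric, slam_positions, target_pos)
            for name, fn in metrics.items()}
```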

What carries the argument

Target-Bench benchmark with its SLAM-based motion references and metric scale recovery for assessing planning from video outputs

Load-bearing premise

The combination of five metrics and SLAM trajectories serves as an accurate stand-in for real mapless semantic path planning performance.

What would settle it

Observing whether models that perform well on Target-Bench can actually navigate to semantic targets in physical robot tests without maps, or if low-scoring models perform better in reality.

Figures

Figures reproduced from arXiv: 2511.17792 by Dingrui Wang, Felix Jahncke, Finn Schäfer, Haotong Qin, Hongyuan Ye, Johannes Betz, Luigi Palmieri, Marvin Seegert, Mattia Piccinini, Wei Li, Yuan Gao, Yuchen Zhang, Yuyu Zhao, Zhaowei Lu, Zhexiao Sun, Zhihao Liang.

Figure 1
Figure 1: Target-Bench provides a dataset collected with a quadruped robot, and a benchmark for evaluating world models in mapless path planning toward text-specified goals with implicit semantic meaning. In Target-Bench, world models receive a camera frame and a textual prompt describing the target state, and predict a future video depicting the trajectory toward the goal. A world decoder then extracts the planned … view at source ↗
Figure 2
Figure 2: Robot setup and SLAM pipeline. (a) All trajectories. (b) Word cloud of captions. (c) Semantic target categories. (d) Target-Bench evaluation results. view at source ↗
Figure 3
Figure 3: Visualization of dataset structure and semantics. view at source ↗
Figure 4
Figure 4: Data sample visualization. …mantic target and a point cloud map. Video frames are captured at 25 Hz, yielding approximately 10 seconds of continuous observation per sample. The semantic targets are annotated and pre-selected by human experts to ensure diversity and relevance for navigation tasks. Fig. 3a shows that our trajectories span diverse directions and movement patterns, providing balanced and real… view at source ↗
Figure 5
Figure 5: Target Benchmark Architecture. Scale Factor Recovery. Monocular methods such as VGGT and SpaTracker estimate camera motion only up to an unknown global scale. We restore metric consistency at the segment level by anchoring predictions to a single scalar scale factor λ derived from ground truth displacement. Let E_1, E_k ∈ R^{3×4} be the predicted extrinsic matrices for the first and the k-th frame. We extract … view at source ↗ · a hedged sketch of this step follows the figure list
Figure 6
Figure 6: World model performance comparison with VGGT as … view at source ↗
Figure 7
Figure 7: Overall score comparison between different spatio… view at source ↗
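
A hedged sketch of the segment-level scale recovery summarized in the Figure 5 caption above: a single scalar λ is taken as the ratio of ground-truth to predicted displacement between the first and k-th frames, then applied to the predicted translations. The function names, the (N, 3, 4) extrinsics layout, and the numerical guard are assumptions, not the paper's code.

```python
# Sketch of segment-level scale recovery, assuming predicted and ground-truth
# camera extrinsics are available as (N, 3, 4) arrays of the form [R | t].
import numpy as np

def recover_scale_factor(pred_extrinsics, gt_extrinsics, k=-1):
    """Return a single scalar lambda anchoring up-to-scale poses to metric scale."""
    # Translation columns of the first and k-th frames.
    t_pred_1, t_pred_k = pred_extrinsics[0][:, 3], pred_extrinsics[k][:, 3]
    t_gt_1, t_gt_k = gt_extrinsics[0][:, 3], gt_extrinsics[k][:, 3]

    pred_disp = np.linalg.norm(t_pred_k - t_pred_1)   # predicted displacement
    gt_disp = np.linalg.norm(t_gt_k - t_gt_1)         # ground-truth displacement

    # Ratio of metric to predicted displacement; guard against degenerate motion.
    return gt_disp / max(pred_disp, 1e-8)

def apply_scale(pred_extrinsics, lam):
    """Scale predicted translations so trajectories are metrically comparable."""
    scaled = pred_extrinsics.copy()
    scaled[:, :, 3] *= lam
    return scaled
```
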
read the original abstract

While recent video world models can generate highly realistic videos, their ability to perform semantic reasoning and planning remains unclear and unquantified. We introduce Target-Bench, the first benchmark that enables comprehensive evaluation of video world models' semantic reasoning, spatial estimation, and planning capabilities. Target-Bench provides 450 robot-collected scenarios spanning 47 semantic categories, with SLAM-based trajectories serving as motion tendency references. Our benchmark reconstructs motion from generated videos with a metric scale recovery mechanism, enabling the evaluation of planning performance with five complementary metrics that focus on target-approaching capability and directional consistency. Our evaluation result shows that the best off-the-shelf model achieves only a 0.341 overall score, revealing a significant gap between realistic visual generation and semantic reasoning in current video world models. Furthermore, we demonstrate that fine-tuning process on a relatively small real-world robot dataset can significantly improve task-level planning performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Target-Bench, the first benchmark for evaluating video world models on semantic reasoning, spatial estimation, and mapless path planning to semantic targets. It uses 450 robot-collected scenarios across 47 categories, SLAM-based trajectories as motion references, a metric scale recovery mechanism to reconstruct motion from generated videos, and five complementary metrics emphasizing target-approaching capability and directional consistency. The central result is that the best off-the-shelf model achieves an overall score of only 0.341, indicating a gap between visual generation and semantic planning; fine-tuning on a small real-world robot dataset is shown to improve task-level performance.

Significance. If the benchmark and metrics validly isolate mapless semantic planning, the low off-the-shelf scores and fine-tuning gains would usefully quantify limitations in current video world models for robotic applications and demonstrate a practical improvement path. Strengths include the scale of real-robot data collection and the multi-metric design; these elements would support reproducible follow-up work if the reference validity and scale-recovery details are clarified.

major comments (2)
  1. [§3] §3 (Benchmark and Reference Trajectories): The headline claim that the 0.341 score reveals a gap in semantic reasoning for mapless path planning depends on SLAM-based trajectories serving as a faithful proxy for mapless motion references. Because standard SLAM pipelines explicitly build and optimize metric maps to produce those trajectories, the evaluation risks measuring alignment with map-derived paths rather than pure semantic target reasoning without maps. A concrete test would be an ablation replacing SLAM references with non-metric or purely semantic references (e.g., optical-flow-only or human-annotated direction sequences) and re-computing the five metrics.
  2. [§4] §4 (Motion Reconstruction and Metrics): The abstract states that motion is reconstructed 'with a metric scale recovery mechanism' and evaluated with five complementary metrics, yet no equations, pseudocode, or validation against ground-truth scale are provided. Without reported error statistics on scale recovery (e.g., median scale factor error or correlation with SLAM ground truth) or definitions of the directional-consistency and target-approaching terms, the numerical claim of 0.341 cannot be interpreted as a robust measure of planning capability.
minor comments (2)
  1. [Results] Table 1 or equivalent results table: report per-metric breakdowns and standard deviations across the 450 scenarios so readers can judge whether the aggregate 0.341 is driven by a few hard categories.
  2. [Fine-tuning section] The fine-tuning experiment would benefit from an explicit statement of the dataset split (train/val/test) and whether any overlap exists with the Target-Bench scenarios.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to improve clarity and technical detail without altering the core claims of the work.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark and Reference Trajectories): The headline claim that the 0.341 score reveals a gap in semantic reasoning for mapless path planning depends on SLAM-based trajectories serving as a faithful proxy for mapless motion references. Because standard SLAM pipelines explicitly build and optimize metric maps to produce those trajectories, the evaluation risks measuring alignment with map-derived paths rather than pure semantic target reasoning without maps. A concrete test would be an ablation replacing SLAM references with non-metric or purely semantic references (e.g., optical-flow-only or human-annotated direction sequences) and re-computing the five metrics.

    Authors: We appreciate the referee's observation on the distinction between map-based references and purely semantic reasoning. In Target-Bench the video world models receive only the initial frame and a semantic target description; they have no access to maps or SLAM output during generation. The SLAM trajectories function solely as post-collection ground-truth references that record the actual motion executed by the robot when it approached the semantic target in the real world. This allows us to measure how closely the motion implied by a generated video matches real-world target-approaching behavior. We will revise Section 3 to explicitly state this usage and to clarify that the benchmark evaluates mapless generation against real executed paths rather than against map-derived planning. An ablation with purely non-metric references would be informative but would require new data collection and annotation; we therefore treat it as valuable future work rather than a change to the current benchmark design. revision: partial

  2. Referee: [§4] §4 (Motion Reconstruction and Metrics): The abstract states that motion is reconstructed 'with a metric scale recovery mechanism' and evaluated with five complementary metrics, yet no equations, pseudocode, or validation against ground-truth scale are provided. Without reported error statistics on scale recovery (e.g., median scale factor error or correlation with SLAM ground truth) or definitions of the directional-consistency and target-approaching terms, the numerical claim of 0.341 cannot be interpreted as a robust measure of planning capability.

    Authors: We agree that the manuscript would benefit from explicit technical detail on scale recovery and metric definitions. In the revised manuscript we will add the mathematical formulation of the metric scale recovery procedure, pseudocode for the full motion-reconstruction pipeline, and quantitative validation results (median scale-factor error and Pearson correlation with SLAM ground truth). We will also provide formal definitions and equations for all five metrics, with particular emphasis on the target-approaching and directional-consistency components. These additions will appear in Section 4 and the supplementary material. revision: yes
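
Since the formal metric definitions are deferred to the revision, the following is only a plausible sketch, under stated assumptions, of what the directional-consistency and target-approaching components could look like on scale-recovered trajectories; the paper's actual formulas may differ.

```python
# Illustrative formulas only: these are assumptions about what directional
# consistency and target approaching could look like on scale-recovered
# (N, 3) position trajectories, not the authors' definitions.
import numpy as np

def directional_consistency(pred_traj, ref_traj):
    """Mean cosine similarity between per-step displacement directions, mapped to [0, 1]."""
    dp = np.diff(pred_traj, axis=0)
    dr = np.diff(ref_traj, axis=0)
    cos = np.sum(dp * dr, axis=1) / (
        np.linalg.norm(dp, axis=1) * np.linalg.norm(dr, axis=1) + 1e-8
    )
    return float(np.mean((cos + 1.0) / 2.0))

def target_approaching(pred_traj, target_pos):
    """Fraction of the initial target distance closed by the end of the trajectory."""
    d0 = np.linalg.norm(pred_traj[0] - target_pos)
    dT = np.linalg.norm(pred_traj[-1] - target_pos)
    return float(np.clip((d0 - dT) / (d0 + 1e-8), 0.0, 1.0))
```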

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external SLAM references

full rationale

The paper introduces Target-Bench by collecting real robot scenarios, using independent SLAM-derived trajectories as motion references, and defining five metrics (target-approaching and directional consistency) to score generated videos after metric scale recovery. These scores are computed directly on off-the-shelf model outputs without parameter fitting to the test set or any self-referential definition of the target quantity. The central claim of a 0.341 performance gap follows from applying the externally grounded metrics, and the fine-tuning demonstration likewise uses separate real-world data. No load-bearing step reduces by construction to the paper's own inputs or prior self-citations; the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the paper introduces no new physical constants, fitted parameters, or postulated entities; it relies on standard computer-vision assumptions about SLAM accuracy and video-generation fidelity.

axioms (1)
  • domain assumption SLAM trajectories provide reliable ground-truth motion references for semantic targets
    Invoked when the benchmark uses SLAM paths to score generated videos

pith-pipeline@v0.9.0 · 5516 in / 1274 out tokens · 53515 ms · 2026-05-17T19:56:07.791008+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models

    cs.AI 2026-04 unverdicted novelty 7.0

    WorldMAP bootstraps reliable trajectory prediction in vision-language navigation by converting world-model-generated futures into structured supervision, cutting ADE by 18% and FDE by 42.1% on Target-Bench while makin...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

    Timur Akhtyamov, Mohamad Al Mdfaa, Javier Antonio Ramirez, Sergey Bakulin, German Devchich, Denis Fatykhov, Alexander Mazurov, Kristina Zipa, Malik Mohrat, Pavel Kolesnik, et al. Egowalk: A multimodal dataset for robot navigation in the wild. arXiv preprint arXiv:2505.21282, 2025.

  2. [2]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. In Advances in Neural Information Processing Systems, pages 58757–58791. Curran Associates, Inc., 2024.

  3. [3]

    V-jepa 2: Self-supervised video models enable understanding, prediction and planning, 2025

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning, 2025.

  4. [4]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung...

  5. [5]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators.

  6. [6]

    Genie: Generative interactive environments

    Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder S...

  7. [7]

    Veo 3: Advanced controllable video generation with physics-aware dynamics, 2025

    Google DeepMind. Veo 3: Advanced controllable video generation with physics-aware dynamics, 2025.

  8. [8]

    Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025.

  9. [9]

    Foundation models in autonomous driving: A survey on scenario generation and scenario analysis. arXiv preprint arXiv:2506.11526, 2025

    Yuan Gao, Mattia Piccinini, Yuchen Zhang, Dingrui Wang, Korbinian Moller, Roberto Brusnicki, Baha Zarrouki, Alessio Gambi, Jan Frederik Totz, Kai Storms, et al. Foundation models in autonomous driving: A survey on scenario generation and scenario analysis. arXiv preprint arXiv:2506.11526, 2025.

  10. [10]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc., 2018.

  11. [11]

    Dream to control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020.

  12. [12]

    Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 9(1):49–56,

    Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 9(1):49–56,

  13. [13]

    Lelan: Learning a language-conditioned navigation policy from in-the-wild video

    Noriaki Hirose, Catherine Glossop, Ajay Sridhar, Dhruv Shah, Oier Mees, and Sergey Levine. Lelan: Learning a language-conditioned navigation policy from in-the-wild video. In Proceedings of The 8th Conference on Robot Learning, pages 666–688. PMLR, 2025.

  14. [14]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, and Sanja Fidler. Vipe: Video pose engine for 3d geometric perception. In NVIDIA Research Whitepapers arXiv:2508.10934, 2025.

  15. [15]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.

  16. [16]

    Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022

    Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022.

  17. [17]

    G2o: A general framework for graph optimization

    Rainer Kümmerle, Giorgio Grisetti, Hauke Strasdat, Kurt Konolige, and Wolfram Burgard. G2o: A general framework for graph optimization. In 2011 IEEE International Conference on Robotics and Automation, pages 3607–3613, 2011.

  18. [18]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62,

  19. [19]

    Robotic world model: A neural network simulator for robust policy optimization in robotics. arXiv preprint arXiv:2501.10100,

    Chenhao Li, Andreas Krause, and Marco Hutter. Robotic world model: A neural network simulator for robust policy optimization in robotics. arXiv preprint arXiv:2501.10100,

  20. [20]

    Worldmodelbench: Judging video generation models as world models. CoRR, abs/2502.20694, 2025

    Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models. CoRR, abs/2502.20694, 2025.

  21. [21]

    Citywalker: Learning embodied urban navigation from web-scale videos

    Xinhao Liu, Jintong Li, Yicheng Jiang, Niranjan Sujay, Zhicheng Yang, Juexiao Zhang, John Abanes, Jing Zhang, and Chen Feng. Citywalker: Learning embodied urban navigation from web-scale videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6875–6885, 2025.

  22. [22]

    A survey: Learning embodied intelligence from physical simulators and world models. arXiv preprint arXiv:2507.00917, 2025

    Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, et al. A survey: Learning embodied intelligence from physical simulators and world models. arXiv preprint arXiv:2507.00917, 2025.

  23. [23]

    Diffsynth-studio: examples/wanvideo, 2025

    modelscope. Diffsynth-studio: examples/wanvideo, 2025. Accessed: 2025-11-14.

  24. [24]

    Toward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset

    Duc M Nguyen, Mohammad Nazeri, Amirreza Payandeh, Aniket Datar, and Xuesu Xiao. Toward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7442–

  25. [25]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, et al. Cosmos: World foundation model platform for physical ai. CoRR, abs/2501.03575, 2025.

  26. [26]

    Learning view-invariant world models for visual robotic manipulation

    Jing-Cheng Pang, Nan Tang, Kaiyuan Li, Yuting Tang, Xin-Qiang Cai, Zhen-Yu Zhang, Gang Niu, Masashi Sugiyama, and Yang Yu. Learning view-invariant world models for visual robotic manipulation. In The Thirteenth International Conference on Learning Representations, 2025.

  27. [27]

    Genie 2: A large-scale foundation world model

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Ann...

  28. [28]

    Generalized-ICP

    Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun. Generalized-ICP. In Robotics: Science and Systems, 2009.

  29. [29]

    Sora 2 is here: Our latest video generation model is more physically accurate, realistic, and more controllable than prior systems

    The OpenAI Sora Team. Sora 2 is here: Our latest video generation model is more physically accurate, realistic, and more controllable than prior systems. It also features synchronized dialogue and sound effects. Create with it in the new Sora app., 2025. Accessed: 2025-11-09.

  30. [30]

    Unifolm-wma-0: A world-model-action (wma) framework under unifolm family, 2025

    Unitree. Unifolm-wma-0: A world-model-action (wma) framework under unifolm family, 2025.

  31. [31]

    Sanpo: A scene understanding, accessibility and human navigation dataset

    Sagar M Waghmare, Kimberly Wilber, Dave Hawkey, Xuan Yang, Matthew Wilson, Stephanie Debats, Cattalyya Nuengsigkapian, Astuti Sharma, Lars Pandikow, Huisheng Wang, et al. Sanpo: A scene understanding, accessibility and human navigation dataset. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7866–7875. IEEE, 2025.

  32. [32]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. CoRR, abs/2503.20314, 2025.

  33. [33]

    Enhancing physical consistency in lightweight world models. arXiv preprint arXiv:2509.12437,

    Dingrui Wang, Zhexiao Sun, Zhouheng Li, Cheng Wang, Youlun Peng, Hongyuan Ye, Baha Zarrouki, Wei Li, Mattia Piccinini, Lei Xie, et al. Enhancing physical consistency in lightweight world models. arXiv preprint arXiv:2509.12437,

  34. [34]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5294–5306, 2025.

  35. [35]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025.

  36. [36]

    Spatialtrackerv2: 3d point tracking made easy

    Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Iurii Makarov, Bingyi Kang, Xin Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: 3d point tracking made easy. In ICCV, 2025.

  37. [37]

    Context as memory: Scene-consistent interactive long video generation with memory retrieval. arXiv preprint arXiv:2506.03141,

    Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. arXiv preprint arXiv:2506.03141,

  38. [38]

    World-in-world: World models in a closed-loop world. arXiv preprint arXiv:2510.18135, 2025

    Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M Patel, Paul Pu Liang, et al. World-in-world: World models in a closed-loop world. arXiv preprint arXiv:2510.18135, 2025.

  39. [39]

    Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792,