pith. machine review for the scientific record.

arxiv: 2511.17792 · v2 · submitted 2025-11-21 · 💻 cs.CV · cs.RO

Recognition: no theorem link

Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 19:56 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords video world models · semantic path planning · benchmark evaluation · mapless navigation · robot scenarios · scale recovery · directional consistency

The pith

The best off-the-shelf video world model scores only 0.341 overall on semantic reasoning and planning in Target-Bench

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Target-Bench, a benchmark designed to measure video world models' capabilities in semantic reasoning, spatial estimation, and planning for mapless path planning using semantic targets. It consists of 450 robot-collected scenarios from 47 categories, using SLAM trajectories as references for motion. Generated videos are evaluated after metric scale recovery with five metrics emphasizing target-approaching and directional consistency. The top model reaches only 0.341 overall, highlighting a gap in semantic capabilities, though fine-tuning on small real datasets improves results.

Core claim

Target-Bench enables evaluation of video world models by reconstructing motion from their generated videos using a metric scale recovery mechanism and comparing against SLAM-based trajectories with metrics for target-approaching capability and directional consistency, demonstrating that current models have limited semantic reasoning for planning tasks despite high visual quality.
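
A minimal sketch of the evaluation flow the core claim describes, assuming each scenario is scored independently. The function names below, including `estimate_poses` as a stand-in for a monocular pose estimator of the VGGT kind, are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of the evaluation flow described in the core claim above.
# `estimate_poses` is a hypothetical stand-in for a monocular pose estimator
# (e.g., a VGGT-style model); the metric callables are supplied by the caller.
# None of these names come from the paper's code.

def estimate_poses(video_frames):
    """Placeholder: return an (N, 3) array of camera positions, up to an
    unknown global scale, reconstructed from the generated video."""
    raise NotImplementedError  # hypothetical stand-in, not a real API

def evaluate_scenario(video_frames, slam_positions, target_pos,
                      recover_scale, metrics):
    """Score one Target-Bench scenario against its SLAM reference.

    slam_positions: (N, 3) metric reference trajectory recorded by the robot.
    recover_scale:  callable (pred, ref) -> scalar scale factor lambda.
    metrics:        dict of name -> callable(pred, ref, target) -> score in [0, 1].
    """
    pred = estimate_poses(video_frames)          # up-to-scale positions
    lam = recover_scale(pred, slam_positions)    # segment-level scale factor
    pred_metric = pred * lam                     # metrically comparable trajectory
    return {name: fn(pred_metric, slam_positions, target_pos)
            for name, fn in metrics.items()}
```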

What carries the argument

Target-Bench benchmark with its SLAM-based motion references and metric scale recovery for assessing planning from video outputs

Load-bearing premise

The combination of five metrics and SLAM trajectories serves as an accurate stand-in for real mapless semantic path planning performance.

What would settle it

Observing whether models that perform well on Target-Bench can actually navigate to semantic targets in physical robot tests without maps, or if low-scoring models perform better in reality.

Figures

Figures reproduced from arXiv: 2511.17792 by Dingrui Wang, Felix Jahncke, Finn Schäfer, Haotong Qin, Hongyuan Ye, Johannes Betz, Luigi Palmieri, Marvin Seegert, Mattia Piccinini, Wei Li, Yuan Gao, Yuchen Zhang, Yuyu Zhao, Zhaowei Lu, Zhexiao Sun, Zhihao Liang.

Figure 1
Figure 1: Target-Bench provides a dataset collected with a quadruped robot, and a benchmark for evaluating world models in mapless path planning toward text-specified goals with implicit semantic meaning. In Target-Bench, world models receive a camera frame and a textual prompt describing the target state, and predict a future video depicting the trajectory toward the goal. A world decoder then extracts the planned … view at source ↗
Figure 2
Figure 2: Robot setup and SLAM pipeline. (a) All trajectories. (b) Word cloud of captions. (c) Semantic target categories. (d) Target-Bench evaluation results. view at source ↗
Figure 3
Figure 3: Visualization of dataset structure and semantics. view at source ↗
Figure 4
Figure 4: Data sample visualization. …mantic target and a point cloud map. Video frames are captured at 25 Hz, yielding approximately 10 seconds of continuous observation per sample. The semantic targets are annotated and pre-selected by human experts to ensure diversity and relevance for navigation tasks. Fig. 3a shows that our trajectories span diverse directions and movement patterns, providing balanced and real… view at source ↗
Figure 5
Figure 5: Target Benchmark Architecture. Scale Factor Recovery. Monocular methods such as VGGT and SpaTracker estimate camera motion only up to an unknown global scale. We restore metric consistency at the segment level by anchoring predictions to a single scalar scale factor λ derived from ground truth displacement. Let E_1, E_k ∈ R^{3×4} be the predicted extrinsic matrices for the first and the k-th frame. We extract … view at source ↗ · a hedged sketch of this step follows the figure list
Figure 6
Figure 6: World model performance comparison with VGGT as … view at source ↗
Figure 7
Figure 7: Overall score comparison between different spatio… view at source ↗
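
A hedged sketch of the segment-level scale recovery summarized in the Figure 5 caption above: a single scalar λ is taken as the ratio of ground-truth to predicted displacement between the first and k-th frames, then applied to the predicted translations. The function names, the (N, 3, 4) extrinsics layout, and the numerical guard are assumptions, not the paper's code.

```python
# Sketch of segment-level scale recovery, assuming predicted and ground-truth
# camera extrinsics are available as (N, 3, 4) arrays of the form [R | t].
import numpy as np

def recover_scale_factor(pred_extrinsics, gt_extrinsics, k=-1):
    """Return a single scalar lambda anchoring up-to-scale poses to metric scale."""
    # Translation columns of the first and k-th frames.
    t_pred_1, t_pred_k = pred_extrinsics[0][:, 3], pred_extrinsics[k][:, 3]
    t_gt_1, t_gt_k = gt_extrinsics[0][:, 3], gt_extrinsics[k][:, 3]

    pred_disp = np.linalg.norm(t_pred_k - t_pred_1)   # predicted displacement
    gt_disp = np.linalg.norm(t_gt_k - t_gt_1)         # ground-truth displacement

    # Ratio of metric to predicted displacement; guard against degenerate motion.
    return gt_disp / max(pred_disp, 1e-8)

def apply_scale(pred_extrinsics, lam):
    """Scale predicted translations so trajectories are metrically comparable."""
    scaled = pred_extrinsics.copy()
    scaled[:, :, 3] *= lam
    return scaled
```
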
read the original abstract

While recent video world models can generate highly realistic videos, their ability to perform semantic reasoning and planning remains unclear and unquantified. We introduce Target-Bench, the first benchmark that enables comprehensive evaluation of video world models' semantic reasoning, spatial estimation, and planning capabilities. Target-Bench provides 450 robot-collected scenarios spanning 47 semantic categories, with SLAM-based trajectories serving as motion tendency references. Our benchmark reconstructs motion from generated videos with a metric scale recovery mechanism, enabling the evaluation of planning performance with five complementary metrics that focus on target-approaching capability and directional consistency. Our evaluation result shows that the best off-the-shelf model achieves only a 0.341 overall score, revealing a significant gap between realistic visual generation and semantic reasoning in current video world models. Furthermore, we demonstrate that fine-tuning process on a relatively small real-world robot dataset can significantly improve task-level planning performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Target-Bench, the first benchmark for evaluating video world models on semantic reasoning, spatial estimation, and mapless path planning to semantic targets. It uses 450 robot-collected scenarios across 47 categories, SLAM-based trajectories as motion references, a metric scale recovery mechanism to reconstruct motion from generated videos, and five complementary metrics emphasizing target-approaching capability and directional consistency. The central result is that the best off-the-shelf model achieves an overall score of only 0.341, indicating a gap between visual generation and semantic planning; fine-tuning on a small real-world robot dataset is shown to improve task-level performance.

Significance. If the benchmark and metrics validly isolate mapless semantic planning, the low off-the-shelf scores and fine-tuning gains would usefully quantify limitations in current video world models for robotic applications and demonstrate a practical improvement path. Strengths include the scale of real-robot data collection and the multi-metric design; these elements would support reproducible follow-up work if the reference validity and scale-recovery details are clarified.

major comments (2)
  1. [§3] §3 (Benchmark and Reference Trajectories): The headline claim that the 0.341 score reveals a gap in semantic reasoning for mapless path planning depends on SLAM-based trajectories serving as a faithful proxy for mapless motion references. Because standard SLAM pipelines explicitly build and optimize metric maps to produce those trajectories, the evaluation risks measuring alignment with map-derived paths rather than pure semantic target reasoning without maps. A concrete test would be an ablation replacing SLAM references with non-metric or purely semantic references (e.g., optical-flow-only or human-annotated direction sequences) and re-computing the five metrics.
  2. [§4] §4 (Motion Reconstruction and Metrics): The abstract states that motion is reconstructed 'with a metric scale recovery mechanism' and evaluated with five complementary metrics, yet no equations, pseudocode, or validation against ground-truth scale are provided. Without reported error statistics on scale recovery (e.g., median scale factor error or correlation with SLAM ground truth) or definitions of the directional-consistency and target-approaching terms, the numerical claim of 0.341 cannot be interpreted as a robust measure of planning capability.
minor comments (2)
  1. [Results] Table 1 or equivalent results table: report per-metric breakdowns and standard deviations across the 450 scenarios so readers can judge whether the aggregate 0.341 is driven by a few hard categories.
  2. [Fine-tuning section] The fine-tuning experiment would benefit from an explicit statement of the dataset split (train/val/test) and whether any overlap exists with the Target-Bench scenarios.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to improve clarity and technical detail without altering the core claims of the work.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark and Reference Trajectories): The headline claim that the 0.341 score reveals a gap in semantic reasoning for mapless path planning depends on SLAM-based trajectories serving as a faithful proxy for mapless motion references. Because standard SLAM pipelines explicitly build and optimize metric maps to produce those trajectories, the evaluation risks measuring alignment with map-derived paths rather than pure semantic target reasoning without maps. A concrete test would be an ablation replacing SLAM references with non-metric or purely semantic references (e.g., optical-flow-only or human-annotated direction sequences) and re-computing the five metrics.

    Authors: We appreciate the referee's observation on the distinction between map-based references and purely semantic reasoning. In Target-Bench the video world models receive only the initial frame and a semantic target description; they have no access to maps or SLAM output during generation. The SLAM trajectories function solely as post-collection ground-truth references that record the actual motion executed by the robot when it approached the semantic target in the real world. This allows us to measure how closely the motion implied by a generated video matches real-world target-approaching behavior. We will revise Section 3 to explicitly state this usage and to clarify that the benchmark evaluates mapless generation against real executed paths rather than against map-derived planning. An ablation with purely non-metric references would be informative but would require new data collection and annotation; we therefore treat it as valuable future work rather than a change to the current benchmark design. revision: partial

  2. Referee: [§4] §4 (Motion Reconstruction and Metrics): The abstract states that motion is reconstructed 'with a metric scale recovery mechanism' and evaluated with five complementary metrics, yet no equations, pseudocode, or validation against ground-truth scale are provided. Without reported error statistics on scale recovery (e.g., median scale factor error or correlation with SLAM ground truth) or definitions of the directional-consistency and target-approaching terms, the numerical claim of 0.341 cannot be interpreted as a robust measure of planning capability.

    Authors: We agree that the manuscript would benefit from explicit technical detail on scale recovery and metric definitions. In the revised manuscript we will add the mathematical formulation of the metric scale recovery procedure, pseudocode for the full motion-reconstruction pipeline, and quantitative validation results (median scale-factor error and Pearson correlation with SLAM ground truth). We will also provide formal definitions and equations for all five metrics, with particular emphasis on the target-approaching and directional-consistency components. These additions will appear in Section 4 and the supplementary material. revision: yes
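
Since the formal metric definitions are deferred to the revision, the following is only a plausible sketch, under stated assumptions, of what the directional-consistency and target-approaching components could look like on scale-recovered trajectories; the paper's actual formulas may differ.

```python
# Illustrative formulas only: these are assumptions about what directional
# consistency and target approaching could look like on scale-recovered
# (N, 3) position trajectories, not the authors' definitions.
import numpy as np

def directional_consistency(pred_traj, ref_traj):
    """Mean cosine similarity between per-step displacement directions, mapped to [0, 1]."""
    dp = np.diff(pred_traj, axis=0)
    dr = np.diff(ref_traj, axis=0)
    cos = np.sum(dp * dr, axis=1) / (
        np.linalg.norm(dp, axis=1) * np.linalg.norm(dr, axis=1) + 1e-8
    )
    return float(np.mean((cos + 1.0) / 2.0))

def target_approaching(pred_traj, target_pos):
    """Fraction of the initial target distance closed by the end of the trajectory."""
    d0 = np.linalg.norm(pred_traj[0] - target_pos)
    dT = np.linalg.norm(pred_traj[-1] - target_pos)
    return float(np.clip((d0 - dT) / (d0 + 1e-8), 0.0, 1.0))
```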

Circularity Check

0 steps flagged

No significant circularity; evaluation uses external SLAM references

full rationale

The paper introduces Target-Bench by collecting real robot scenarios, using independent SLAM-derived trajectories as motion references, and defining five metrics (target-approaching and directional consistency) to score generated videos after metric scale recovery. These scores are computed directly on off-the-shelf model outputs without parameter fitting to the test set or any self-referential definition of the target quantity. The central claim of a 0.341 performance gap follows from applying the externally grounded metrics, and the fine-tuning demonstration likewise uses separate real-world data. No load-bearing step reduces by construction to the paper's own inputs or prior self-citations; the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the paper introduces no new physical constants, fitted parameters, or postulated entities; it relies on standard computer-vision assumptions about SLAM accuracy and video-generation fidelity.

axioms (1)
  • domain assumption SLAM trajectories provide reliable ground-truth motion references for semantic targets
    Invoked when the benchmark uses SLAM paths to score generated videos

pith-pipeline@v0.9.0 · 5516 in / 1274 out tokens · 53515 ms · 2026-05-17T19:56:07.791008+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models

    cs.AI 2026-04 unverdicted novelty 7.0

    WorldMAP bootstraps reliable trajectory prediction in vision-language navigation by converting world-model-generated futures into structured supervision, cutting ADE by 18% and FDE by 42.1% on Target-Bench while makin...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

    Timur Akhtyamov, Mohamad Al Mdfaa, Javier Antonio Ramirez, Sergey Bakulin, German Devchich, Denis Fatykhov, Alexander Mazurov, Kristina Zipa, Malik Mohrat, Pavel Kolesnik, et al. Egowalk: A multimodal dataset for robot navigation in the wild. arXiv preprint arXiv:2505.21282, 2025.

  2. [2]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. In Advances in Neural Information Processing Systems, pages 58757–58791. Curran Associates, Inc., 2024.

  3. [3]

    V-jepa 2: Self-supervised video models enable understanding, prediction and planning, 2025

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning, 2025.

  4. [4]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung...

  5. [5]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators.

  6. [6]

    Genie: Generative interactive environments

    Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder S...

  7. [7]

    Veo 3: Advanced controllable video generation with physics-aware dynamics, 2025

    Google DeepMind. Veo 3: Advanced controllable video generation with physics-aware dynamics, 2025.

  8. [8]

    Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025.

  9. [9]

    Foundation models in autonomous driving: A survey on scenario generation and scenario analysis. arXiv preprint arXiv:2506.11526, 2025

    Yuan Gao, Mattia Piccinini, Yuchen Zhang, Dingrui Wang, Korbinian Moller, Roberto Brusnicki, Baha Zarrouki, Alessio Gambi, Jan Frederik Totz, Kai Storms, et al. Foundation models in autonomous driving: A survey on scenario generation and scenario analysis. arXiv preprint arXiv:2506.11526, 2025.

  10. [10]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc., 2018.

  11. [11]

    Dream to control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020.

  12. [12]

    Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 9(1):49–56,

    Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 9(1):49–56,

  13. [13]

    Lelan: Learning a language-conditioned navigation policy from in-the-wild video

    Noriaki Hirose, Catherine Glossop, Ajay Sridhar, Dhruv Shah, Oier Mees, and Sergey Levine. Lelan: Learning a language-conditioned navigation policy from in-the-wild video. In Proceedings of The 8th Conference on Robot Learning, pages 666–688. PMLR, 2025.

  14. [14]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, and Sanja Fidler. Vipe: Video pose engine for 3d geometric perception. In NVIDIA Research Whitepapers arXiv:2508.10934, 2025.

  15. [15]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.

  16. [16]

    Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022

    Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022.

  17. [17]

    G2o: A general framework for graph optimization

    Rainer Kümmerle, Giorgio Grisetti, Hauke Strasdat, Kurt Konolige, and Wolfram Burgard. G2o: A general framework for graph optimization. In 2011 IEEE International Conference on Robotics and Automation, pages 3607–3613, 2011.

  18. [18]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62,

  19. [19]

    Robotic world model: A neural network simulator for robust policy optimization in robotics. arXiv preprint arXiv:2501.10100,

    Chenhao Li, Andreas Krause, and Marco Hutter. Robotic world model: A neural network simulator for robust policy optimization in robotics. arXiv preprint arXiv:2501.10100,

  20. [20]

    Worldmodelbench: Judging video generation models as world models. CoRR, abs/2502.20694, 2025

    Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models. CoRR, abs/2502.20694, 2025.

  21. [21]

    Citywalker: Learning embodied urban navigation from web-scale videos

    Xinhao Liu, Jintong Li, Yicheng Jiang, Niranjan Sujay, Zhicheng Yang, Juexiao Zhang, John Abanes, Jing Zhang, and Chen Feng. Citywalker: Learning embodied urban navigation from web-scale videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6875–6885, 2025.

  22. [22]

    A survey: Learning embodied intelligence from physical simulators and world models. arXiv preprint arXiv:2507.00917, 2025

    Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, et al. A survey: Learning embodied intelligence from physical simulators and world models. arXiv preprint arXiv:2507.00917, 2025.

  23. [23]

    Diffsynth-studio: examples/wanvideo, 2025

    modelscope. Diffsynth-studio: examples/wanvideo, 2025. Accessed: 2025-11-14.

  24. [24]

    Toward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset

    Duc M Nguyen, Mohammad Nazeri, Amirreza Payandeh, Aniket Datar, and Xuesu Xiao. Toward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7442–

  25. [25]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, et al. Cosmos: World foundation model platform for physical ai. CoRR, abs/2501.03575, 2025.

  26. [26]

    Learning view-invariant world models for visual robotic manipulation

    Jing-Cheng Pang, Nan Tang, Kaiyuan Li, Yuting Tang, Xin-Qiang Cai, Zhen-Yu Zhang, Gang Niu, Masashi Sugiyama, and Yang Yu. Learning view-invariant world models for visual robotic manipulation. In The Thirteenth International Conference on Learning Representations, 2025.

  27. [27]

    Genie 2: A large-scale foundation world model

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Ann...

  28. [28]

    Generalized-ICP

    Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun. Generalized-ICP. In Robotics: Science and Systems, 2009.

  29. [29]

    Sora 2 is here: Our latest video generation model is more physically accurate, realistic, and more controllable than prior systems

    The OpenAI Sora Team. Sora 2 is here: Our latest video generation model is more physically accurate, realistic, and more controllable than prior systems. It also features synchronized dialogue and sound effects. Create with it in the new Sora app., 2025. Accessed: 2025-11-09.

  30. [30]

    Unifolm-wma-0: A world-model-action (wma) framework under unifolm family, 2025

    Unitree. Unifolm-wma-0: A world-model-action (wma) framework under unifolm family, 2025.

  31. [31]

    Sanpo: A scene understanding, accessibility and human navigation dataset

    Sagar M Waghmare, Kimberly Wilber, Dave Hawkey, Xuan Yang, Matthew Wilson, Stephanie Debats, Cattalyya Nuengsigkapian, Astuti Sharma, Lars Pandikow, Huisheng Wang, et al. Sanpo: A scene understanding, accessibility and human navigation dataset. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7866–7875. IEEE, 2025.

  32. [32]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. CoRR, abs/2503.20314, 2025.

  33. [33]

    Enhancing physical consistency in lightweight world models. arXiv preprint arXiv:2509.12437,

    Dingrui Wang, Zhexiao Sun, Zhouheng Li, Cheng Wang, Youlun Peng, Hongyuan Ye, Baha Zarrouki, Wei Li, Mattia Piccinini, Lei Xie, et al. Enhancing physical consistency in lightweight world models. arXiv preprint arXiv:2509.12437,

  34. [34]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5294–5306, 2025.

  35. [35]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025.

  36. [36]

    Spatialtrackerv2: 3d point tracking made easy

    Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Iurii Makarov, Bingyi Kang, Xin Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: 3d point tracking made easy. In ICCV, 2025.

  37. [37]

    Context as memory: Scene-consistent interactive long video generation with memory retrieval. arXiv preprint arXiv:2506.03141,

    Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. arXiv preprint arXiv:2506.03141,

  38. [38]

    World-in-world: World models in a closed-loop world. arXiv preprint arXiv:2510.18135, 2025

    Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M Patel, Paul Pu Liang, et al. World-in-world: World models in a closed-loop world. arXiv preprint arXiv:2510.18135, 2025.

  39. [39]

    Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792,