pith. machine review for the scientific record.

arxiv: 2604.11751 · v1 · submitted 2026-04-13 · 💻 cs.RO · cs.AI

Recognition: unknown

Grounded World Model for Semantically Generalizable Planning

Alexandre Alahi, Haonan Zhang, Harold Soh, Lan Feng, Letian Wang, Quanyi Li, Wuyang Li

Pith reviewed 2026-05-10 15:00 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: world model · model predictive control · vision-language alignment · semantic generalization · visuomotor planning · grounded planning · VLA · MPC

The pith

Grounding a world model in vision-language space lets MPC follow novel language instructions with 87% success on unseen tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a world model trained to predict outcomes inside a vision-language-aligned latent space can score candidate action sequences by how closely their predicted results match a natural-language task description. This removes the need for a pre-specified goal image in visuomotor model predictive control and turns the planner into a language-conditioned system. On the WISER benchmark the resulting GWM-MPC controller reaches 87 percent success across 288 test tasks whose visual appearance and referring expressions never appeared in training. Standard vision-language-action (VLA) models, by contrast, average only 22 percent success on the same held-out tasks despite overfitting the training distribution. The core advance is therefore a planning method that keeps the sample efficiency of model-based control while gaining the semantic flexibility of language grounding.

Core claim

A world model whose predictions live in a pretrained vision-language latent space (DINO or JEPA embeddings combined with language) allows future states to be scored directly by embedding similarity to a task instruction. This converts classical image-goal MPC into a language-instruction follower that generalizes to novel visuals and referring expressions while still relying only on motions demonstrated during training.

What carries the argument

The Grounded World Model (GWM), which predicts future image embeddings inside a vision-language-aligned space and scores each action proposal by cosine similarity between the predicted embedding and the embedding of the language instruction.
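To make that mechanism concrete, here is a minimal sketch of a language-conditioned MPC step under this scoring scheme. It assumes a simple random-shooting proposal generator and hypothetical interfaces (encode_obs, encode_text, gwm.rollout) for projecting observations, instructions, and predicted futures into the shared latent space; the paper's actual planner, proposal count, and encoders may differ.

```python
import torch
import torch.nn.functional as F

def score_proposals(gwm, z_obs, proposals, z_instruction):
    # Score each candidate action sequence by how close its predicted future
    # embedding lands to the instruction embedding in the aligned latent space.
    scores = []
    for actions in proposals:                       # actions: (horizon, action_dim)
        z_future = gwm.rollout(z_obs, actions)      # hypothetical world-model call
        scores.append(F.cosine_similarity(z_future, z_instruction, dim=-1))
    return torch.stack(scores)                      # (num_proposals,)

def plan_step(gwm, encode_obs, encode_text, obs, instruction,
              num_proposals=12, horizon=8, action_dim=7):
    # Language-conditioned MPC step: sample proposals, score them against the
    # instruction, execute the first action of the best proposal, then replan.
    z_obs = encode_obs(obs)                         # current observation embedding
    z_txt = encode_text(instruction)                # instruction embedding, same space
    proposals = torch.randn(num_proposals, horizon, action_dim)  # random shooting
    scores = score_proposals(gwm, z_obs, proposals, z_txt)
    best = proposals[scores.argmax()]
    return best[0]
```

Only the scoring rule (cosine similarity to the instruction embedding) reflects the paper's claim; the rest is generic MPC scaffolding added for illustration.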

If this is right

  • MPC planners no longer require an advance goal image and can accept open-ended natural language instructions instead.
  • Visuomotor policies retain the ability to plan with demonstrated motions yet avoid the overfitting that limits direct VLA approaches.
  • The same latent-space scoring mechanism can be applied to any MPC controller whose state representation can be projected into the aligned embedding space.
  • Generalization holds for tasks whose required motions were seen in training but whose visual context and phrasing were not.
  • The approach separates motion learning from semantic interpretation, so new language goals can be added without collecting new demonstrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grounding trick could be tried in non-visual domains such as tactile or audio-based planning if suitable cross-modal encoders exist.
  • If the embedding similarity truly captures task progress, the method should also support online replanning when the language goal changes mid-execution.
  • Replacing the fixed pretrained encoder with a jointly trained one might further tighten the link between predicted futures and language, but would require checking whether generalization is preserved.
  • The benchmark result suggests that world-model-based planning may be a more robust path to semantic generalization than end-to-end imitation of language-to-action mappings.

Load-bearing premise

Embedding similarity in a pretrained vision-language space reliably indicates whether a predicted future state will fulfill the intended task, even when both the visual scene and the wording are new.

What would settle it

If GWM-MPC is re-run on the WISER test set using a different vision-language encoder whose alignment between images and language is known to be weaker, success rate should fall sharply below the reported 87 percent.
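A minimal sketch of how such an encoder-swap test could be organized, assuming a generic evaluation harness; build_planner, evaluate, and the encoder registry are illustrative names, not the paper's API.

```python
def encoder_ablation(build_planner, encoders, test_tasks, evaluate):
    # `encoders`: name -> (encode_obs, encode_text) pair assumed to share a
    # latent space; `evaluate` returns the fraction of test tasks solved.
    # All four callables are hypothetical harness pieces.
    results = {}
    for name, (encode_obs, encode_text) in encoders.items():
        planner = build_planner(encode_obs, encode_text)
        results[name] = evaluate(planner, test_tasks)
    return results  # map from encoder name to WISER test success rate
```

If the load-bearing premise holds, the entry for a weakly aligned encoder should sit well below the rate reported for the aligned one.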

Figures

Figures reproduced from arXiv: 2604.11751 by Alexandre Alahi, Haonan Zhang, Harold Soh, Lan Feng, Letian Wang, Quanyi Li, Wuyang Li.

Figure 1. Compared to existing World Models like DINO-WM and JEPA-WM, Grounded World …
Figure 2. Experimental results on WISER for VLAs. The success rate gap on training and test tasks …
Figure 3. The training and inference workflow of GWM-MPC. All proposed trajectories are tokenized …
Figure 4. Overview of the WISER Benchmark. Observations include the instruction …
Figure 5. Ablation results on the GT-MPC for choosing planning-related hyperparameters.
Figure 6. For all methods, inference efficiency is measured as rollout FPS, i.e. how many times env.step is called in one second. VLA baselines have better inference efficiency than GWM-MPC on the test tasks because the GWM must be forwarded N = 12 times to obtain future embeddings for all proposals, and those embeddings are generated sequentially rather than in parallel …
Figure 7. Libero-goal environment.
Figure 8. Difference between GWM and its raw-action-conditioned version. Captured images are just …
read the original abstract

In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder like DINO and JEPA. However, it is challenging to obtain the goal image in advance of the task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, reflected by the similarity of embeddings. This approach transforms the visuomotor MPC to a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves an 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a Grounded World Model (GWM) that operates in a vision-language-aligned latent space for use in Model Predictive Control (MPC) for visuomotor planning. By scoring action proposals based on the embedding similarity between predicted future states and the language instruction, it aims to enable semantic generalization without requiring a pre-specified goal image. The key result is a large performance improvement on the WISER benchmark, with 87% success on test tasks featuring unseen visuals and referring expressions versus 22% for standard VLAs.

Significance. This work has potential significance in bridging model-based planning with vision-language models for robotics. The use of pretrained embeddings for goal specification via language is a natural extension of visuomotor MPC. The empirical demonstration of generalization to novel language and visuals while using demonstrated motions is noteworthy, as is the introduction of the WISER benchmark for testing such capabilities. If the embedding similarity reliably proxies task success, it could lead to more interactive and generalizable planning systems.

major comments (2)
  1. [Abstract] The central claim of 87% success rate for GWM-MPC versus 22% for traditional VLAs on the 288-task test set is presented without accompanying details on the world-model training procedure, the specific embedding alignment loss, statistical significance testing of the performance gap, or controls for potential biases in benchmark task construction. These omissions make the empirical result difficult to interpret and reproduce.
  2. [Abstract] The method scores MPC proposals using similarity in the latent space formed by DINO/JEPA vision encoders and a language model. No analysis or ablation is provided to confirm that this similarity metric remains predictive of task completion for inputs with novel visual signals and referring expressions, rather than reflecting spurious correlations learned by the pretrained encoders.
minor comments (1)
  1. [Abstract] The term 'traditional VLAs' is used without specifying the exact models or architectures compared against in the experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address each major comment below and outline the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim of 87% success rate for GWM-MPC versus 22% for traditional VLAs on the 288-task test set is presented without accompanying details on the world-model training procedure, the specific embedding alignment loss, statistical significance testing of the performance gap, or controls for potential biases in benchmark task construction. These omissions make the empirical result difficult to interpret and reproduce.

    Authors: We agree that the abstract, due to its length constraints, does not include all methodological details. The world-model training procedure and the specific embedding alignment loss are described in detail in Sections 3.2 and 3.3 of the manuscript. We will revise the abstract to briefly reference these sections and mention that the alignment is achieved via contrastive learning on paired vision-language data. For statistical significance, we will include error bars or p-values from multiple runs in the results section and note them in the abstract. Regarding potential biases in the benchmark, the WISER benchmark construction ensures that test tasks use novel combinations of visuals and referring expressions while using demonstrated motions, as explained in Section 5; we will add a sentence to the abstract clarifying the benchmark design to aid reproducibility. revision: yes

  2. Referee: [Abstract] The method scores MPC proposals using similarity in the latent space formed by DINO/JEPA vision encoders and a language model. No analysis or ablation is provided to confirm that this similarity metric remains predictive of task completion for inputs with novel visual signals and referring expressions, rather than reflecting spurious correlations learned by the pretrained encoders.

    Authors: The empirical results on the WISER benchmark demonstrate that GWM-MPC achieves high success rates on tasks with unseen visuals and referring expressions, while standard VLAs fail to generalize despite overfitting the training set. This suggests the similarity metric is effective for novel inputs. However, to directly address concerns about spurious correlations, we will add an ablation study in the revised manuscript. This will include measuring the correlation between the embedding similarity scores and actual task success on a held-out validation set with novel elements, as well as comparing against random or alternative metrics. We believe this will strengthen the claim that the metric proxies task completion reliably. revision: yes
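A minimal sketch of the kind of correlation check proposed in this response, assuming per-rollout similarity scores and binary success labels are logged during evaluation; the function name and data layout are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import pointbiserialr

def similarity_success_correlation(similarities, successes):
    # `similarities`: final embedding-similarity score per evaluation rollout.
    # `successes`: binary task-completion label per rollout (same order).
    # Point-biserial correlation between a binary and a continuous variable.
    r, p_value = pointbiserialr(np.asarray(successes, dtype=float),
                                np.asarray(similarities, dtype=float))
    return r, p_value
```

A strong positive correlation on held-out tasks with novel visuals and phrasing would support the claim that the similarity metric proxies task completion rather than spurious features.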

Circularity Check

0 steps flagged

No circularity: empirical success rates on held-out tasks with novel inputs

full rationale

The paper trains a grounded world model to predict future embeddings in a pretrained vision-language space and selects actions in MPC by cosine similarity to the language instruction embedding. Reported performance consists of measured success rates (87% GWM-MPC vs 22% baseline VLAs) on a fixed WISER test set of 288 tasks whose visuals and referring expressions are unseen at test time. These are direct experimental outcomes on held-out data, not quantities derived by construction from fitted parameters, self-citations, or ansatzes internal to the paper. No equation or claim reduces the generalization result to a tautology or to the training distribution alone.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of off-the-shelf vision-language embeddings for semantic scoring and on the world model being able to produce useful predictions in that space; no new physical entities are introduced.

axioms (1)
  • domain assumption Pretrained encoders such as DINO and JEPA produce latent spaces that can be meaningfully aligned with language embeddings for task similarity.
    Invoked when replacing image-goal distance with embedding similarity; no new alignment training is described.

pith-pipeline@v0.9.0 · 5543 in / 1471 out tokens · 58797 ms · 2026-05-10T15:00:16.302161+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Latent State Design for World Models under Sufficiency Constraints

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

Reference graph

Works this paper leans on

71 extracted references · 54 canonical work pages · cited by 1 Pith paper · 22 internal anchors
