pith. sign in

arxiv: 2606.12403 · v1 · pith:V67IAPT5new · submitted 2026-06-10 · 💻 cs.RO

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Pith reviewed 2026-06-27 09:38 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-actionworld modelsrobot manipulationzero-shot generalizationpolicy steeringscene dynamicsout-of-distribution robustness
0
0 comments X

The pith

World Pilot augments vision-language-action models with world-action priors via latent and action steering to reach 84.7 percent success on zero-shot out-of-distribution manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models draw semantic grounding from static image-text pretraining yet struggle with the continuous dynamics of contact-rich manipulation. World Pilot injects two complementary priors from a World-Action Model: Latent Steering conditions the perception layer on a scene-evolution latent while Action Steering supplies an anticipated trajectory as a motion prior. These additions equip the policy with an expected scene view and trajectory-level hint alongside its semantic conditioning. The scene-evolution prior remains effective even when drawn from a video-pretrained world model that received no action post-training. The resulting framework records an 84.7 percent total success rate on the LIBERO-Plus zero-shot benchmark and leads every real-robot setting across four tasks, with the biggest gains under shifts in viewpoint, geometry, deformable state, and pose.

Core claim

World Pilot augments a VLA policy by routing priors from a World-Action Model through Latent Steering, which conditions perception on a scene-evolution latent, and Action Steering, which supplies an anticipated trajectory to the action generator, thereby furnishing an anticipated scene view and motion hint that improve handling of continuous manipulation dynamics.

What carries the argument

Latent Steering and Action Steering pathways that route a scene-evolution latent and an anticipated trajectory from the World-Action Model into the VLA decision chain.

If this is right

  • The scene-evolution prior remains effective when supplied by a video-pretrained world model that has not been action-post-trained.
  • World Pilot records the highest success rate on every real-robot setting across four manipulation tasks.
  • Performance margins are largest under shifts in viewpoint, geometry, deformable state, and pose.
  • Total success rate reaches 84.7 percent on the LIBERO-Plus zero-shot OOD benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same steering approach could be tested on non-VLA policy architectures that also lack explicit dynamics modeling.
  • Video-only world models may prove sufficient for many robotics domains where full action-labeled data are scarce.
  • Combining priors from multiple world models pretrained on different data distributions could further reduce sensitivity to specific shift types.

Load-bearing premise

The world model supplies scene-evolution and trajectory predictions that remain useful for the target manipulation environments even without action-specific fine-tuning.

What would settle it

Replace the world-model priors with a mismatched or random predictor and measure whether success rates on LIBERO-Plus and the real-robot tasks fall back to the unaugmented VLA baseline.

Figures

Figures reproduced from arXiv: 2606.12403 by Junjia Xu, Lue Fan, Rongxu Cui, Wenling Li, Xiaojuan Jin, Zefu Lin, Zhaoxiang Zhang.

Figure 1
Figure 1. Figure 1: World Pilot steers a VLA with priors from a World-Action Model. VLA methods generate actions from a VLM’s encoding of the scene. World Pilot adds two priors from a WAM into the decision chain, with Latent Steering routing a scene-evolution latent into VLM hidden states and Action Steering feeding a trajectory-level motion prior to the action generator. This gives the VLA an anticipated view of the scene an… view at source ↗
Figure 2
Figure 2. Figure 2: World Pilot architecture. A semantic pathway encodes images and language with a VLM into hidden states. Two prior pathways from a World-Action Model enter the same decision chain, with Latent Steering routing a scene-evolution latent into the VLM hidden states and Action Steering compressing the anticipated trajectory into a prior token for the flow-matching action generator. 2 Related Work Vision-Language… view at source ↗
Figure 3
Figure 3. Figure 3: Real-robot evaluation setup and task scenes. The robot platform (left), in-distribution scenes matching the training conditions (middle), and out-of-distribution scenes (right) under changes in appearance, geometry, deformable state, or pose. on Camera, Light, Background, and Noise while placing close behind the strongest baselines on Language, Robot, and Layout. On the appearance axes (Light, Background, … view at source ↗
read the original abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces World Pilot, a VLA framework that augments base Vision-Language-Action models with priors from a World-Action Model (WAM) via two pathways: Latent Steering (conditioning perception on a scene-evolution latent) and Action Steering (supplying an anticipated trajectory as motion prior). It claims that the scene-evolution prior remains effective even from a video-pretrained WAM without action post-training, and reports SOTA total success rate of 84.7% on LIBERO-Plus zero-shot OOD benchmark plus highest success rates on all real-robot settings across four manipulation tasks, with largest margins under viewpoint, geometry, deformable state, and pose shifts.

Significance. If the empirical claims hold with proper verification, the work would be significant for addressing the static-image limitation of VLA pretraining by incorporating dynamics priors from world models. The explicit demonstration that video-pretrained (action-untrained) WAMs can supply effective scene-evolution latents would be a notable strength, as it lowers the barrier to using such priors and could improve OOD generalization in contact-rich manipulation without requiring full action-post-training of the world model.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained' is load-bearing, because the two steering pathways are presented as the only proposed additions to the base VLA. Without an isolated ablation (e.g., comparing Latent Steering performance when the WAM is video-pretrained only versus action-post-trained), the attribution of the 84.7% LIBERO-Plus rate and real-robot margins under viewpoint/geometry/deformable/pose shifts to this specific mechanism cannot be confirmed from the provided text.
  2. [Abstract] Abstract: The SOTA and 'highest success rate on every real-robot setting' claims rest on the effectiveness of the Latent Steering pathway using the scene-evolution latent. The manuscript provides no visible methods details, training regime confirmation for the WAM, or results tables with error bars/statistical tests to support that the prior works under the stated video-only condition, making the performance margins difficult to evaluate as evidence for the proposed architecture.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thoughtful review and recognition of the potential impact of incorporating dynamics priors into VLA models. We address each major comment below and commit to revising the manuscript accordingly to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained' is load-bearing, because the two steering pathways are presented as the only proposed additions to the base VLA. Without an isolated ablation (e.g., comparing Latent Steering performance when the WAM is video-pretrained only versus action-post-trained), the attribution of the 84.7% LIBERO-Plus rate and real-robot margins under viewpoint/geometry/deformable/pose shifts to this specific mechanism cannot be confirmed from the provided text.

    Authors: We agree that an isolated ablation would provide clearer evidence for the contribution of the video-pretrained WAM in the Latent Steering pathway. The current results demonstrate the overall performance of World Pilot using the video-pretrained WAM as described. In the revised manuscript, we will include a dedicated ablation study isolating the effect of action post-training on the WAM for Latent Steering, reporting success rates on LIBERO-Plus under both conditions to support the claim. revision: yes

  2. Referee: [Abstract] Abstract: The SOTA and 'highest success rate on every real-robot setting' claims rest on the effectiveness of the Latent Steering pathway using the scene-evolution latent. The manuscript provides no visible methods details, training regime confirmation for the WAM, or results tables with error bars/statistical tests to support that the prior works under the stated video-only condition, making the performance margins difficult to evaluate as evidence for the proposed architecture.

    Authors: We acknowledge the need for more detailed presentation of the methods and results. The full manuscript includes a methods section describing the WAM, but we will expand it in the revision to explicitly confirm the training regime (video pretraining without action post-training) and provide additional implementation details. Furthermore, we will update the results tables to include error bars across multiple runs and appropriate statistical tests to facilitate evaluation of the performance margins. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical framework with benchmark results

full rationale

The paper introduces World Pilot as an empirical augmentation of VLA models via two steering pathways from a WAM (Latent Steering and Action Steering), reporting success rates on LIBERO-Plus and real-robot tasks. No equations, derivations, fitted parameters, or first-principles predictions appear in the provided text. The central claim (effectiveness of scene-evolution prior from video-pretrained WAM) is presented as an empirical observation rather than a mathematical reduction to inputs. No self-citation chains, ansatzes, or renamings are invoked as load-bearing steps. The derivation chain is therefore self-contained as experimental validation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only; ledger limited to explicitly stated premises. No free parameters or invented physical entities are described. The two steering pathways and the WAM are presented as the novel contributions rather than external entities.

axioms (2)
  • domain assumption VLA models inherit semantic grounding from large-scale pretraining on static image-text pairs
    Stated as the starting point whose limitation motivates the work.
  • domain assumption Manipulation is a continuous, contact-rich process whose dynamics static pretraining cannot capture
    Core premise used to justify adding world-action priors.
invented entities (2)
  • Latent Steering pathway no independent evidence
    purpose: Condition the perception layer on a scene-evolution latent from the WAM
    Introduced as one of the two complementary pathways; no independent evidence supplied in abstract.
  • Action Steering pathway no independent evidence
    purpose: Supply an anticipated trajectory as a motion prior to the action generator
    Introduced as the second pathway; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5769 in / 1516 out tokens · 18319 ms · 2026-06-27T09:38:11.612618+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    X. Chen, Y . Chen, Y . Fu, N. Gao, J. Jia, W. Jin, H. Li, Y . Mu, J. Pang, Y . Qiao, Y . Tian, B. Wang, B. Wang, F. Wang, H. Wang, T. Wang, Z. Wang, X. Wei, C. Wu, S. Yang, J. Ye, J. Yu, J. Zeng, J. Zhang, J. Zhang, S. Zhang, F. Zheng, B. Zhou, and Y . Zhu. Internvla-m1: A spatially guided vision-language-action framework for generalist robot policy, 2025...

  2. [2]

    H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan. 3d-vla: A 3d vision-language-action generative world model, 2024. URLhttps://arxiv.org/abs/ 2403.09631

  3. [3]

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation, 2024. URLhttps://arxiv.org/abs/2410. 07864

  4. [4]

    Brohan, N

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. Pe...

  5. [5]

    Intelligence, K

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  6. [6]

    Y . Yang, S. Zeng, T. Lin, X. Chang, D. Qi, J. Xiao, H. Liu, R. Chen, Y . Chen, D. Huo, et al. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning.arXiv preprint arXiv:2602.11236, 2026

  7. [7]

    Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

  8. [8]

    S. Gu, Y . Cai, T. Wang, S. Wu, and Y . Fu. Say, dream, and act: Learning video world models for instruction-driven robot manipulation.arXiv preprint arXiv:2602.10717, 2026. URLhttps: //arxiv.org/abs/2602.10717

  9. [9]

    Wang et al

    J. Wang et al. MVISTA-4D: View-consistent 4d world model with test-time action inference for robotic manipulation.arXiv preprint arXiv:2602.09878, 2026. URLhttps://arxiv. org/abs/2602.09878

  10. [10]

    L. Fan, Z. Xu, C. Cao, W. Zhang, M. Yuan, and J. Chen. Aim: Intent-aware unified world action modeling with spatial value maps, 2026. URLhttps://arxiv.org/abs/2604.11135

  11. [11]

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016. 10

  12. [12]

    Black, N

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.pi 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  13. [13]

    D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

  14. [14]

    Zhang, M

    R. Zhang, M. Dong, Y . Zhang, L. Heng, X. Chi, G. Dai, L. Du, D. Wang, Y . Du, and S. Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for ef- ficient robot manipulation.arXiv preprint arXiv:2503.20384, 2025

  15. [15]

    Zhang, Y

    J. Zhang, Y . Guo, Y . Hu, X. Chen, X. Zhu, and J. Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

  16. [16]

    L. Fan, K. Chen, Z. Xu, M. Yuan, P. Huang, and W. Huang. Language reasoning in vision- language-action model for robotic grasping. In2024 China Automation Congress (CAC), pages 6656–6661. IEEE, 2024

  17. [17]

    J. Wen, Y . Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

  18. [18]

    C. Li, J. Wen, Y . Peng, Y . Peng, F. Feng, and Y . Zhu. Pointvla: Injecting the 3d world into vision-language-action models.arXiv preprint arXiv:2503.07511, 2025

  19. [19]

    J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu, et al. Hy- bridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

  20. [20]

    J. Guo, Q. Li, P. Li, Z. Chen, N. Sun, Y . Su, H. Wang, Y . Zhang, X. Li, and H. Liu. Unified 4d world action modeling from video priors with asynchronous denoising, 2026. URLhttps: //arxiv.org/abs/2604.26694

  21. [21]

    Zheng, J

    R. Zheng, J. Wang, S. Reed, J. Bjorck, Y . Fang, F. Hu, J. Jang, K. Kundalia, Z. Lin, L. Magne, A. Narayan, Y . L. Tan, G. Wang, Q. Wang, J. Xiang, Y . Xu, S. Ye, J. Kautz, F. Huang, Y . Zhu, and L. Fan. Flare: Robot learning with implicit world modeling, 2025. URLhttps://arxiv. org/abs/2505.15659

  22. [22]

    L. Li, Q. Zhang, Y . Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, Y . Shen, and Y . Xu. Causal world modeling for robot control, 2026. URLhttps://arxiv.org/abs/ 2601.21998

  23. [23]

    M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

  24. [24]

    J. Pai, L. Achenbach, V . Montesinos, B. Forrai, O. Mees, and E. Nava. mimic-video: Video- action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

  25. [25]

    S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xi- ang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y . Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y . Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y . Du, Y . Chebotar, S. Reed, J. Kautz, Y . Zhu, L. J. Fan, and J. Jang. World action m...

  26. [26]

    A. Ye, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, H. Li, J. Li, J. Lv, J. Liu, M. Cao, P. Li, Q. Deng, W. Mei, X. Wang, X. Chen, X. Zhou, Y . Wang, Y . Chang, Y . Li, Y . Zhou, Y . Ye, Z. Liu, and Z. Zhu. Gigaworld-policy: An efficient action-centered world–action model, 2026. URLhttps://arxiv.org/abs/2603.17240

  27. [27]

    Ha and J

    D. Ha and J. Schmidhuber. World models. 2018. doi:10.5281/ZENODO.1207631. URL https://zenodo.org/record/1207631

  28. [28]

    Assran, A

    M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, Mojtaba, Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V . Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y . Li, X. Ma, S. Chandar, F. Meier, Y . LeCun, M. Rabbat, and N. Ballas. V-jepa 2: Self-super...

  29. [29]

    C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, G. Shi, and H. Fan. Emerging properties in unified multimodal pretraining, 2025. URLhttps: //arxiv.org/abs/2505.14683

  30. [30]

    H. Zhen, Q. Sun, H. Zhang, J. Li, S. Zhou, Y . Du, and C. Gan. Tesseract: Learning 4d embodied world models, 2025. URLhttps://arxiv.org/abs/2504.20995

  31. [31]

    Bruce, M

    J. Bruce, M. Dennis, A. Edwards, J. Parker-Holder, Y . Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y . Aytar, S. Bechtle, F. Behbahani, S. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rockt¨aschel. Genie: Generative interactive environments, 2024. URLhttps://...

  32. [32]

    S. Zhou, Y . Du, J. Chen, Y . Li, D.-Y . Yeung, and C. Gan. Robodreamer: Learning compo- sitional world models for robot imagination, 2024. URLhttps://arxiv.org/abs/2404. 12377

  33. [33]

    J. Yang, K. Lin, J. Li, W. Zhang, T. Lin, L. Wu, Z. Su, H. Zhao, Y .-Q. Zhang, L. Chen, P. Luo, X. Yue, and H. Li. Rise: Self-improving robot policy with compositional world model, 2026. URLhttps://arxiv.org/abs/2602.11075

  34. [34]

    Cosmos-predict2: World simulation model for physical ai, 2025

    NVIDIA. Cosmos-predict2: World simulation model for physical ai, 2025. URLhttps: //github.com/nvidia-cosmos/cosmos-predict2

  35. [35]

    R. Li, H. Zhang, J. Jin, Q. Zeng, Z. Zhuang, Y . Tang, S. Lyu, and D. Wang. World-value-action model: Implicit planning for vision-language-action systems, 2026. URLhttps://arxiv. org/abs/2604.14732

  36. [36]

    Quevedo, A

    J. Quevedo, A. K. Sharma, Y . Sun, V . Suryavanshi, P. Liang, and S. Yang. Worldgym: World model as an environment for policy evaluation, 2025. URLhttps://arxiv.org/abs/2506. 00613

  37. [37]

    S. Gao, S. Zhou, Y . Du, J. Zhang, and C. Gan. Adaworld: Learning adaptable world models with latent actions, 2025. URLhttps://arxiv.org/abs/2503.18938

  38. [38]

    Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  39. [39]

    M. Liu, J. Shu, H. Chen, Z. Li, C. Zhao, J. Yang, S. Gao, H. Chen, and C. Shen. Stamo: Unsupervised learning of generalizable robot motion from compact state representation, 2025

  40. [40]

    Zhang, T

    C. Zhang, T. Pearce, P. Zhang, K. Wang, X. Chen, W. Shen, L. Zhao, and J. Bian. What do latent action models actually learn?, 2025. 12

  41. [41]

    Y . Su, S. Chen, H. Shi, M. Liu, Z. Zhang, N. Huang, W. Zhong, Z. Zhu, Y . Liu, and X. Liu. World guidance: World modeling in condition space for action generation.arXiv preprint arXiv:2602.22010, 2026

  42. [42]

    S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, J. Fu, J. Gong, and X. Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025

  43. [43]

    Nasiriany, A

    S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. InRobotics: Science and Systems (RSS), 2024

  44. [44]

    M. Zhu, Y . Zhu, J. Li, Z. Zhou, J. Wen, X. Liu, C. Shen, Y . Peng, and F. Feng. Ob- jectvla: End-to-end open-world object manipulation without demonstration.arXiv preprint arXiv:2502.19250, 2025

  45. [45]

    Dey, J.-N

    S. Dey, J.-N. Zaech, N. Nikolov, L. Van Gool, and D. P. Paudel. Revla: Reverting visual domain limitation of robotic foundation models.arXiv preprint arXiv:2409.15250, 2024

  46. [46]

    Zhang, Y

    B. Zhang, Y . Zhang, J. Ji, Y . Lei, J. Dai, Y . Chen, and Y . Yang. Safevla: Towards safety alignment of vision-language-action model via safe reinforcement learning.arXiv preprint arXiv:2503.03480, 2025

  47. [47]

    Y . Fan, P. Ding, S. Bai, X. Tong, Y . Zhu, H. Lu, F. Dai, W. Zhao, Y . Liu, S. Huang, et al. Long-vla: Unleashing long-horizon capability of vision language action model for robot ma- nipulation.arXiv preprint arXiv:2508.19958, 2025

  48. [48]

    Brohan, N

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. Joshi, R. Ju- lian, D. Kalashnikov, Y . Kuang, I. Leal, K.-H. Lee, S. Levine, Y . Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, ...

  49. [49]

    H. Li, P. Ding, R. Suo, Y . Wang, Z. Ge, D. Zang, K. Yu, M. Sun, H. Zhang, D. Wang, and W. Su. Vla-rft: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators, 2025. URLhttps://arxiv.org/abs/2510.00406

  50. [50]

    Y . Shen, F. Wei, Z. Du, Y . Liang, Y . Lu, J. Yang, N. Zheng, and B. Guo. Videovla: Video generators can be generalizable robot manipulators, 2025. URLhttps://arxiv.org/abs/ 2512.06963

  51. [51]

    Cheang, G

    C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, H. Zhang, and M. Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 2024. URLhttps://arxiv.org/abs/2410.06158

  52. [52]

    Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models.arXiv preprint arXiv:2503.22020, 2025

  53. [53]

    X. Liu, Z. Bai, H. Ci, K. Y . Ma, and M. Z. Shou. World-vla-loop: Closed-loop learning of video world model and vla policy, 2026. URLhttps://arxiv.org/abs/2602.06508

  54. [54]

    J. Xiao, Y . Yang, X. Chang, R. Chen, F. Xiong, M. Xu, W.-S. Zheng, and Q. Zhang. World-env: Leveraging world model as a virtual environment for vla post-training, 2026. URLhttps: //arxiv.org/abs/2509.24948. 13

  55. [55]

    Y . Li, Y . Zhu, J. Wen, C. Shen, and Y . Xu. Worldeval: World model as real-world robot policies evaluator, 2025. URLhttps://arxiv.org/abs/2505.19017

  56. [56]

    J. Won, K. Lee, H. Jang, D. Kim, and J. Shin. Dual-stream diffusion for world-model aug- mented vision-language-action model, 2025. URLhttps://arxiv.org/abs/2510.27607

  57. [57]

    T. Ma, J. Zheng, Z. Wang, C. Jiang, A. Cui, J. Liang, and S. Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control, 2026. URLhttps://arxiv. org/abs/2603.10448

  58. [58]

    H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y . Feng, C. Xiang, Y . Rong, H. Zhao, H. Liu, Z. Su, L. Ma, H. Su, and J. Zhu. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

  59. [59]

    Zhang, H

    W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, F. Lu, H. Wang, Z. Zhang, L. Yi, W. Zeng, and X. Jin. Dreamvla: A vision-language-action model dreamed with compre- hensive world knowledge.arXiv preprint arXiv:2507.04447, 2025

  60. [60]

    Physical Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, et al. π0.7: a steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

  61. [61]

    Q. Long, Y . Wang, J. Song, J. Zhang, P. Li, W. Wang, Y . Wang, H. Li, S. Xie, G. Yao, et al. Scaling world model for hierarchical manipulation policies.arXiv preprint arXiv:2602.10983, 2026

  62. [62]

    S. Li, Y . Gao, D. Sadigh, and S. Song. Unified video action model, 2025. URLhttps: //arxiv.org/abs/2503.00200

  63. [63]

    X. Xu, H. Li, J. Ye, Y . Chen, J. Zeng, X. Chen, L. Xu, D. Lin, W. Li, and J. Pang. Futurevla: Joint visuomotor prediction for vision-language-action model, 2026. URLhttps://arxiv. org/abs/2603.10712

  64. [64]

    S. Miao, N. Feng, J. Wu, Y . Lin, X. He, D. Li, and M. Long. Jepa-vla: Video predictive embedding is needed for vla models, 2026. URLhttps://arxiv.org/abs/2602.11832

  65. [65]

    Q. Lv, W. Kong, H. Li, J. Zeng, Z. Qiu, D. Qu, H. Song, Q. Chen, X. Deng, and J. Pang. F1: A vision-language-action model bridging understanding and generation to actions.CoRR, abs/2509.06951, 2025. doi:10.48550/ARXIV .2509.06951. URLhttps://doi.org/10. 48550/arXiv.2509.06951

  66. [66]

    Being-h0.7: A latent world-action model from egocentric videos

    BeingBeyond Team. Being-h0.7: A latent world-action model from egocentric videos. Techni- cal report / project page, 2026. URLhttps://research.beingbeyond.com/being-h07

  67. [67]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  68. [68]

    B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023

  69. [69]

    M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  70. [70]

    J. Cen, S. Huang, Y . Yuan, K. Li, H. Yuan, C. Yu, Y . Jiang, J. Guo, X. Li, H. Luo, F. Wang, F. Wang, and D. Zhao. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025. 14

  71. [71]

    NVIDIA, A. Ali, J. Bai, M. Bala, Y . Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y .-W. Chao, P. Chattopadhyay, M. Chen, Y . Chen, Y . Chen, S. Cheng, Y . Cui, J. Diamond, Y . Ding, J. Fan, L. Fan, L. Feng, F. Ferroni, S. Fidler, X. Fu, R. Gao, Y . Ge, J. Gu, A. Gupta, S. Gururani, I. El Hanafi, A. Hassani, Z. Hao, J. Huffman, J. Jang, P. Jannaty...

  72. [72]

    T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025. 15