pith. machine review for the scientific record.

arxiv: 2507.23682 · v3 · submitted 2025-07-31 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 3 theorem links · Lean Theorem

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 21:48 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords Vision-Language-Action · Latent actions · Robot manipulation · Zero-shot generalization · Pre-training · Manipulation policies · Embodiment transfer

The pith

villa-X improves latent action modeling in VLA models to enable zero-shot generation of latent action plans for unseen robot embodiments and open-vocabulary instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces villa-X, a Vision-Language-Latent-Action framework that refines both how latent actions, abstract motion representations between frames, are learned and how they are integrated into VLA pre-training. These changes let the model produce latent action plans zero-shot, both for robot embodiments it has never encountered and for open-vocabulary instructions. A sympathetic reader would care because the method points toward robot policies that transfer across hardware without retraining from scratch. Experiments show higher success rates on varied simulation benchmarks and on physical robots using both parallel grippers and dexterous hands.

Core claim

villa-X is a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies by improving both how latent actions are learned and how they are incorporated into VLA pre-training. This enables villa-X to generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding, resulting in superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving both gripper and dexterous hand manipulation.

What carries the argument

The villa-X ViLLA framework that refines latent action learning as abstract motion representations and integrates them into VLA pre-training to support zero-shot planning.
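
A minimal sketch, assuming the generic latent-action-model recipe this line of work builds on rather than villa-X's actual architecture: an inverse-dynamics encoder maps a frame pair to a vector-quantized latent action, a forward-dynamics decoder must reconstruct the next frame from it, and the resulting discrete indices become pre-training targets for the VLA. All module names, sizes, and loss weights below are illustrative assumptions.

```python
# Toy latent action model (LAM): frame pair -> discrete latent action token.
# Illustrative only; villa-X's real encoder, losses, and scales may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    def __init__(self, frame_dim=512, latent_dim=64, codebook_size=256):
        super().__init__()
        # Inverse dynamics: infer the abstract motion between two frames.
        self.idm = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        # Discrete codebook: latent actions become tokens a VLA can predict.
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # Forward dynamics: predict the next frame from frame_t plus the action.
        self.fdm = nn.Sequential(
            nn.Linear(frame_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim))

    def forward(self, f_t, f_next):
        z = self.idm(torch.cat([f_t, f_next], dim=-1))
        # Nearest-codebook quantization with a straight-through gradient.
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)
        z_q = z + (self.codebook(idx) - z).detach()
        f_pred = self.fdm(torch.cat([f_t, z_q], dim=-1))
        loss = F.mse_loss(f_pred, f_next) + 0.25 * F.mse_loss(z, z_q.detach())
        return loss, idx  # idx doubles as a discrete pre-training target

model = LatentActionModel()
loss, tokens = model(torch.randn(8, 512), torch.randn(8, 512))
```

Because the tokens are inferred from pixels alone, the same codebook can describe motion for any embodiment that appears on camera, which is the property the zero-shot claims lean on.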

If this is right

  • villa-X produces superior performance across diverse simulation tasks in SIMPLER.
  • The model succeeds on two real-world robotic setups using both gripper and dexterous hand manipulation.
  • Latent action plans can be generated zero-shot for previously unseen robot embodiments.
  • The framework supports open-vocabulary symbolic understanding without task-specific fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-action improvements could lower data requirements when adapting policies to entirely new robot hardware.
  • If the learned latent space captures general motion principles, the approach might scale to multi-step or multi-robot coordination tasks.
  • Testing the framework on additional sensor modalities such as tactile feedback would reveal whether the zero-shot benefit generalizes beyond vision and language.

Load-bearing premise

The specific changes to latent action learning and VLA pre-training are the direct cause of the observed zero-shot generalization and performance improvements.

What would settle it

Train an otherwise identical VLA model without the proposed latent action improvements and measure whether zero-shot success on new embodiments falls to the level of prior baselines on the same SIMPLER and real-robot tasks.
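
To make that bar concrete, here is a self-contained sketch of how such a settling experiment could be scored: compare zero-shot success counts of the full model, the ablated model, and a prior baseline with a two-proportion z-test. The rollout counts below are invented placeholders, not results from the paper.

```python
# Hedged scoring sketch for the ablation: does the ablated model's zero-shot
# success fall back to the prior baseline's level? Counts are hypothetical.
from math import sqrt
from statistics import NormalDist

def two_prop_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test p-value for a difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 2 * (1 - NormalDist().cdf(abs((p_a - p_b) / se)))

# Hypothetical counts over 50 unseen-embodiment rollouts each:
print(two_prop_z(41, 50, 24, 50))  # full vs. ablated: small p => real gap
print(two_prop_z(24, 50, 22, 50))  # ablated vs. baseline: large p => same level
```

A small p-value in the first comparison together with a large one in the second would support the load-bearing premise; any other pattern would undercut it.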

read the original abstract

Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent works have begun to explore the incorporation of latent actions, abstract representations of motion between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. We demonstrate that villa-X can generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding. This capability enables villa-X to achieve superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving both gripper and dexterous hand manipulation. These results establish villa-X as a principled and scalable paradigm for learning generalizable robot manipulation policies. We believe it provides a strong foundation for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces villa-X, a Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling and its incorporation into VLA pre-training. It claims that villa-X enables zero-shot generation of latent action plans for unseen embodiments and open-vocabulary symbolic understanding, leading to superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving gripper and dexterous hand manipulation.

Significance. If the empirical claims hold after proper controls, villa-X would offer a meaningful step toward more generalizable robot manipulation policies by strengthening latent action representations within VLA models, potentially serving as a scalable foundation for future work in cross-embodiment robotics.

major comments (1)
  1. [Section 4] Section 4 (Experiments) and associated tables: the manuscript reports performance gains on SIMPLER and real-world tasks but provides no ablation studies that hold model scale, pre-training dataset, and optimization fixed while varying only the latent-action encoder, loss terms, or incorporation strategy. This omission leaves the central attribution of the zero-shot generalization and performance gains to the proposed modeling changes untested, despite being load-bearing for the abstract's claims.
minor comments (1)
  1. [Abstract] Abstract: quantitative improvements (e.g., success rates, baselines) are stated only qualitatively; adding specific numbers and comparison methods would improve clarity without altering the core contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment on Section 4 below and agree that additional controlled ablations will strengthen attribution of the reported gains.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (Experiments) and associated tables: the manuscript reports performance gains on SIMPLER and real-world tasks but provides no ablation studies that hold model scale, pre-training dataset, and optimization fixed while varying only the latent-action encoder, loss terms, or incorporation strategy. This omission leaves the central attribution of the zero-shot generalization and performance gains to the proposed modeling changes untested, despite being load-bearing for the abstract's claims.

    Authors: We agree that isolating the contributions of the proposed latent-action encoder, loss terms, and incorporation strategy through controlled ablations is important for rigorously attributing the zero-shot generalization improvements. The current experiments compare villa-X against prior VLA baselines that differ in multiple dimensions, including scale and data. To address this directly, we will add new ablation studies to the revised Section 4. These will fix model scale, pre-training dataset, and optimization hyperparameters while varying only the latent-action components, thereby quantifying their specific impact on the reported performance and generalization results. revision: yes
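
For concreteness, a sketch of what the promised ablation grid could look like, holding the shared recipe fixed while sweeping only the three latent-action axes the rebuttal names. The component and dataset names are illustrative assumptions, not identifiers from the paper's codebase.

```python
# Illustrative ablation grid: everything in FIXED is shared across runs;
# only the latent-action components vary. All names are placeholders.
from itertools import product

FIXED = {"model_scale": "base", "dataset": "pretrain_mix_v1",
         "optimizer": "adamw", "lr": 1e-4, "train_steps": 200_000}

AXES = {
    "latent_encoder": ["proposed", "prior_lam", "none"],
    "latent_losses":  ["full", "no_aux_terms"],
    "incorporation":  ["joint_pretrain", "post_hoc"],
}

def ablation_configs():
    """Yield one run config per grid cell; only latent-action components vary."""
    for combo in product(*AXES.values()):
        yield {**FIXED, **dict(zip(AXES, combo))}

for cfg in ablation_configs():
    print(cfg)
```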

Circularity Check

0 steps flagged

No circularity; empirical claims rest on experimental results

full rationale

The paper presents villa-X as an empirical ViLLA framework advancing latent action modeling within VLA pre-training. Claims of zero-shot latent action plan generation for unseen embodiments and performance gains on SIMPLER plus real-world tasks are supported by reported evaluations rather than any mathematical derivation chain. No equations, self-definitions, or fitted-input predictions appear in the abstract or description that would reduce outputs to inputs by construction. Any self-citations are incidental and non-load-bearing for the central empirical results, which remain independently testable via benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper introduces no new free parameters, axioms, or invented entities beyond standard components already present in VLA and latent-action literature; contributions are framed as framework-level improvements.

pith-pipeline@v0.9.0 · 5518 in / 1030 out tokens · 34624 ms · 2026-05-15T21:48:34.649407+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  2. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  3. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  4. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  5. Learning Additively Compositional Latent Actions for Embodied AI

    cs.CV 2026-04 unverdicted novelty 7.0

    AC-LAM enforces additive composition on latent actions from visual transitions, yielding more structured and calibrated motion latents that improve downstream embodied policy learning over prior LAMs.

  6. UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

    cs.RO 2026-02 unverdicted novelty 7.0

    UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

  7. Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

  8. CUBic: Coordinated Unified Bimanual Perception and Control Framework

    cs.RO 2026-05 unverdicted novelty 6.0

    CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordinatio...

  9. Reinforcing VLAs in Task-Agnostic World Models

    cs.AI 2026-05 unverdicted novelty 6.0

    RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.

  10. Unified Noise Steering for Efficient Human-Guided VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

  11. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  12. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  13. From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

  14. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  15. UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

    cs.RO 2026-04 unverdicted novelty 6.0

    UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

  16. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  17. Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

    cs.RO 2026-04 unverdicted novelty 6.0

    Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

  18. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  19. Towards Robotic Dexterous Hand Intelligence: A Survey

    cs.RO 2026-05 unverdicted novelty 4.0

    A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.

  20. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  21. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

  22. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
