villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
Recognition: 3 theorem links · Lean Theorem
Pith reviewed 2026-05-15 21:48 UTC · model grok-4.3
The pith
villa-X improves latent action modeling in VLA models to enable zero-shot generation of action plans for unseen robot embodiments and open-vocabulary instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
villa-X is a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies by improving both how latent actions are learned and how they are incorporated into VLA pre-training. This enables villa-X to generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding, resulting in superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving both gripper and dexterous hand manipulation.
What carries the argument
The villa-X ViLLA framework, which refines latent action learning as abstract representations of motion between frames and integrates those latents into VLA pre-training to support zero-shot planning.
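The review describes latent actions only in prose (abstract representations of motion between two frames, later incorporated into VLA pre-training). As rough orientation, the sketch below shows one common latent-action-model recipe: an inverse-dynamics-style encoder paired with a forward model that must reconstruct the later frame from the earlier one plus the latent. The class names, layer sizes, and loss are illustrative assumptions, not villa-X's actual design.

```python
# Minimal latent-action-model sketch (assumed design, not reproduced from the paper).
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """Inverse-dynamics-style encoder: (frame_t, frame_{t+k}) -> latent action z."""
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame_t, frame_tk):
        # z summarizes the motion between the two frames, not the pixels themselves.
        return self.net(torch.cat([frame_t, frame_tk], dim=1))

class ForwardModel(nn.Module):
    """Forward model: (frame_t, z) -> predicted frame_{t+k}; keeps z action-like."""
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.cond = nn.Linear(latent_dim, 32)
        self.conv_in = nn.Conv2d(3, 32, 3, padding=1)
        self.conv_out = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, frame_t, z):
        h = torch.relu(self.conv_in(frame_t) + self.cond(z)[:, :, None, None])
        return self.conv_out(h)

# Self-supervised training signal: the future frame must be reconstructable from
# the current frame plus z alone, so z carries embodiment-agnostic motion.
encoder, forward_model = LatentActionEncoder(), ForwardModel()
f_t, f_tk = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
z = encoder(f_t, f_tk)
loss = nn.functional.mse_loss(forward_model(f_t, z), f_tk)
loss.backward()
# A VLA would then be pre-trained to predict z from (image, instruction), and a
# small decoder would map predicted latents to robot-specific low-level actions.
```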
If this is right
- villa-X produces superior performance across diverse simulation tasks in SIMPLER.
- The model succeeds on two real-world robotic setups using both gripper and dexterous hand manipulation.
- Latent action plans can be generated zero-shot for previously unseen robot embodiments.
- The framework supports open-vocabulary symbolic understanding without task-specific fine-tuning.
Where Pith is reading between the lines
- The same latent-action improvements could lower data requirements when adapting policies to entirely new robot hardware.
- If the learned latent space captures general motion principles, the approach might scale to multi-step or multi-robot coordination tasks.
- Testing the framework on additional sensor modalities such as tactile feedback would reveal whether the zero-shot benefit generalizes beyond vision and language.
Load-bearing premise
The specific changes to latent action learning and VLA pre-training are the direct cause of the observed zero-shot generalization and performance improvements.
What would settle it
Train an otherwise identical VLA model without the proposed latent action improvements and measure whether zero-shot success on new embodiments falls to the level of prior baselines on the same SIMPLER and real-robot tasks.
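That decisive experiment is described in prose only; the sketch below shows one way the matched comparison could be scored, assuming the full and ablated policies expose the same rollout interface. `EvalTask`, `success_rate`, and the lambda policies are placeholders, not an actual SIMPLER integration.

```python
# Hypothetical harness for the matched comparison: the same evaluation loop scores
# the full model and an ablated twin that lacks the latent-action improvements.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class EvalTask:
    name: str
    embodiment: str            # e.g. a robot arm unseen during pre-training
    n_episodes: int = 50

def success_rate(policy: Callable[[EvalTask], bool], task: EvalTask) -> float:
    # policy(task) -> bool stands in for one full simulator rollout ending in success/failure.
    return sum(policy(task) for _ in range(task.n_episodes)) / task.n_episodes

def compare(policies: Dict[str, Callable[[EvalTask], bool]],
            tasks: List[EvalTask]) -> Dict[str, Dict[str, float]]:
    """Score every policy on every task under identical conditions."""
    return {name: {t.name: success_rate(p, t) for t in tasks}
            for name, p in policies.items()}

tasks = [EvalTask("pick_coke_can", embodiment="unseen_arm")]
results = compare({"villa-X": lambda t: True, "ablated": lambda t: False}, tasks)
print(results)  # the claim holds if only the ablated model collapses toward prior baselines
```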
Original abstract
Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent works have begun to explore the incorporation of latent actions, abstract representations of motion between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. We demonstrate that villa-X can generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding. This capability enables villa-X to achieve superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving both gripper and dexterous hand manipulation. These results establish villa-X as a principled and scalable paradigm for learning generalizable robot manipulation policies. We believe it provides a strong foundation for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces villa-X, a Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling and its incorporation into VLA pre-training. It claims that villa-X enables zero-shot generation of latent action plans for unseen embodiments and open-vocabulary symbolic understanding, leading to superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving gripper and dexterous hand manipulation.
Significance. If the empirical claims hold after proper controls, villa-X would offer a meaningful step toward more generalizable robot manipulation policies by strengthening latent action representations within VLA models, potentially serving as a scalable foundation for future work in cross-embodiment robotics.
major comments (1)
- [Section 4] Section 4 (Experiments) and associated tables: the manuscript reports performance gains on SIMPLER and real-world tasks but provides no ablation studies that hold model scale, pre-training dataset, and optimization fixed while varying only the latent-action encoder, loss terms, or incorporation strategy. This omission leaves the central attribution of the zero-shot generalization and performance gains to the proposed modeling changes untested, even though that attribution is load-bearing for the abstract's claims.
minor comments (1)
- [Abstract] Abstract: improvements are stated only qualitatively; adding specific numbers (e.g., success rates) and named comparison baselines would improve clarity without altering the core contribution.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comment on Section 4 below and agree that additional controlled ablations will strengthen attribution of the reported gains.
Point-by-point responses
- Referee: [Section 4] Section 4 (Experiments) and associated tables: the manuscript reports performance gains on SIMPLER and real-world tasks but provides no ablation studies that hold model scale, pre-training dataset, and optimization fixed while varying only the latent-action encoder, loss terms, or incorporation strategy. This omission leaves the central attribution of the zero-shot generalization and performance gains to the proposed modeling changes untested, even though that attribution is load-bearing for the abstract's claims.
  Authors: We agree that isolating the contributions of the proposed latent-action encoder, loss terms, and incorporation strategy through controlled ablations is important for rigorously attributing the zero-shot generalization improvements. The current experiments compare villa-X against prior VLA baselines that differ in multiple dimensions, including scale and data. To address this directly, we will add new ablation studies to the revised Section 4. These will fix model scale, pre-training dataset, and optimization hyperparameters while varying only the latent-action components, thereby quantifying their specific impact on the reported performance and generalization results.
  Revision: yes
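The rebuttal commits to ablations that pin model scale, data, and optimization while sweeping only the latent-action design. A minimal sketch of how such a grid might be enumerated is below; the axis names and values ("proposed", "prior-LAM", "plan-then-act", etc.) are hypothetical labels, not the paper's actual configuration keys.

```python
# Hypothetical ablation grid matching the rebuttal's plan: everything outside the
# latent-action design is held fixed, and only the three latent-action axes vary.
from itertools import product

FIXED = {
    "model_params": "base",                      # same backbone size for every run
    "pretrain_data": "same-mixture",             # identical pre-training corpus
    "optimizer": {"name": "adamw", "lr": 1e-4, "steps": 100_000},
}

LATENT_ACTION_AXES = {
    "latent_encoder": ["proposed", "prior-LAM", "none"],
    "aux_losses": ["full", "reconstruction-only", "none"],
    "incorporation": ["plan-then-act", "concat-token", "none"],
}

def ablation_runs():
    """Yield one training config per combination of latent-action choices."""
    keys = list(LATENT_ACTION_AXES)
    for combo in product(*(LATENT_ACTION_AXES[k] for k in keys)):
        yield {**FIXED, **dict(zip(keys, combo))}

for i, cfg in enumerate(ablation_runs()):
    # 27 runs that differ only in the latent-action design.
    print(i, {k: cfg[k] for k in LATENT_ACTION_AXES})
```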
Circularity Check
No circularity; empirical claims rest on experimental results
Full rationale
The paper presents villa-X as an empirical ViLLA framework advancing latent action modeling within VLA pre-training. Claims of zero-shot latent action plan generation for unseen embodiments and performance gains on SIMPLER plus real-world tasks are supported by reported evaluations rather than any mathematical derivation chain. No equations, self-definitions, or fitted-input predictions appear in the abstract or description that would reduce outputs to inputs by construction. Any self-citations are incidental and non-load-bearing for the central empirical results, which remain independently testable via benchmarks.
Forward citations
Cited by 22 Pith papers
- RotVLA: Rotational Latent Action for Vision-Language-Action Model
  RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
- ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
  ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
- Being-H0.7: A Latent World-Action Model from Egocentric Videos
  Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
- Learning Additively Compositional Latent Actions for Embodied AI
  AC-LAM enforces additive composition on latent actions from visual transitions, yielding more structured and calibrated motion latents that improve downstream embodied policy learning over prior LAMs.
- UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
  UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
- Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation
  PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
- CUBic: Coordinated Unified Bimanual Perception and Control Framework
  CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordinatio...
- Reinforcing VLAs in Task-Agnostic World Models
  RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
- Unified Noise Steering for Efficient Human-Guided VLA Adaptation
  UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
- ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
  ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
- Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
  Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
- From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
  A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
- GazeVLA: Learning Human Intention for Robotic Manipulation
  GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
- UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
  UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
- Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
  Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
- Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
  Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
- Towards Robotic Dexterous Hand Intelligence: A Survey
  A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- RLDX-1 Technical Report
  RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
- RLDX-1 Technical Report
  RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.