DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

Chong Ma; Hanyang Wang; Jianjun Zhang; Jian Zhu; Munan Yin; Taiyi Su; Tianjian Wang; Tianjiao Zhang; Weihao Ding; Yi Xu

arxiv: 2605.31286 · v2 · pith:KK7YXV6Jnew · submitted 2026-05-29 · 💻 cs.RO · cs.AI

Taiyi Su, Jian Zhu, Tianjian Wang, Youzhang He, Zitai Huang, Jianjun Zhang, Chong Ma, Hanyang Wang, Tianjiao Zhang, Munan Yin, Weihao Ding, Yi Xu

Pith reviewed 2026-06-28 21:56 UTC · model grok-4.3

keywords: vision-language-action, deformable manipulation, robot folding, foundation model, DAgger, flow matching, dual-arm robot

The pith

DeMaVLA shows that one VLA model can acquire generalizable folding skills for varied household objects by mixing demonstrations with corrective trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to replace category-specific policies with a single foundation model that handles deformable manipulation such as folding across different clothing types, geometries, materials, and home scenes. Existing VLA systems either train separate models per object category or suffer interference when tasks are mixed, limiting real-world household use. DeMaVLA pre-trains a VLM backbone plus action expert on thousands of hours of dual-arm data, then post-trains on aggregated multi-task folding data collected through human-in-the-loop corrections of real-robot failures. This combination of efficient action generation and corrective data aggregation is presented as the route to reusable skills that work from random initial states without per-task retraining. If the approach holds, robots could perform a wider range of folding jobs in unstructured homes using one policy.

Core claim

DeMaVLA adopts a VLM backbone with an action expert that uses flow matching for continuous actions; the expert is built by pruning every other transformer layer while keeping alignment with the backbone. The model is first pre-trained on approximately 5,000 hours of real-world dual-arm demonstrations to learn general manipulation priors, then post-trained on mixed folding data that combines self-collected demonstrations and corrective trajectories gathered from multiple folding tasks via a human-in-the-loop DAgger pipeline. On this basis the model reaches competitive performance on RoboTwin 2.0 and strong results on a household folding benchmark involving diverse items and scenes.
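
To make the flow-matching part of the claim concrete, here is a minimal sketch in plain Python. It assumes a straight-line path from Gaussian noise (t=0) to the demonstrated action (t=1), which is a common choice but is not confirmed by the abstract; the function names and the `predict_velocity` callable are hypothetical stand-ins for the action expert, not the paper's code.

```python
import random

def interpolate(noise, action, t):
    """Point on the straight-line path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Velocity of the straight-line path; it is constant in t."""
    return [a - n for n, a in zip(noise, action)]

def flow_matching_loss(predict_velocity, action, rng):
    """Single-sample mean squared error between predicted and target velocity."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1); an assumed time distribution
    x_t = interpolate(noise, action, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(action)

def sample_action(predict_velocity, noise, steps=10):
    """Generate an action by Euler-integrating the velocity field from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real model `predict_velocity` would also be conditioned on images, language, and robot state, and the loss would be averaged over batches of action chunks; the sketch only shows the training target and the sampling loop.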

What carries the argument

Pruned action expert aligned with the VLM backbone that generates continuous actions via flow matching, trained with a DAgger pipeline that aggregates demonstrations and corrective trajectories across multiple folding tasks.
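
The "every other layer" pruning can be pictured as an index selection plus a lookup that records which backbone layer each surviving expert layer stays aligned with. The sketch below is illustrative only: the abstract does not say which parity of layers is kept or how alignment is implemented, so `keep_offset` and all function names are assumptions.

```python
def pruned_layer_indices(num_backbone_layers, keep_offset=0):
    """Backbone layer indices kept when every other layer is dropped.

    keep_offset=0 keeps layers 0, 2, 4, ...; keep_offset=1 keeps 1, 3, 5, ...
    Which of the two the paper uses is not stated; this is an assumption.
    """
    if keep_offset not in (0, 1):
        raise ValueError("keep_offset must be 0 or 1")
    return list(range(keep_offset, num_backbone_layers, 2))

def alignment_map(num_backbone_layers, keep_offset=0):
    """Map each expert layer index to the backbone layer it is aligned with."""
    kept = pruned_layer_indices(num_backbone_layers, keep_offset)
    return {expert_idx: backbone_idx for expert_idx, backbone_idx in enumerate(kept)}

def relative_depth(num_backbone_layers, keep_offset=0):
    """Fraction of the backbone depth retained by the pruned expert."""
    kept = pruned_layer_indices(num_backbone_layers, keep_offset)
    return len(kept) / num_backbone_layers
```

For an 8-layer backbone this keeps four expert layers, so the expert runs at half the backbone depth, which is where the claimed training and inference savings would come from.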

If this is right

A single policy replaces separate category-specific folding policies.
Layer pruning reduces training and inference cost while preserving performance.
Corrective trajectories from real failures improve robustness across varied initial states and materials.
Pre-training on broad dual-arm data supplies reusable priors that transfer to deformable tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same aggregation of corrective data could be applied to other multi-task robotic domains such as unfolding or packing.
Efficiency from the pruned expert might allow the model to run on lower-power home robots.
Further scaling of the pre-training corpus could extend the same generalizability pattern to non-folding manipulation skills.

Load-bearing premise

Aggregating demonstrations and corrective trajectories from multiple folding tasks will avoid task interference or overfitting to the collected failure modes and scenes.

What would settle it

If the model is evaluated on a new clothing category or household scene outside the collected data and its success rate drops to the level of naive mixed training without DAgger corrections, the generalizability claim would be refuted.
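
One way to operationalize "drops to the level of naive mixed training" is a two-proportion test on held-out success counts. The sketch below uses the pooled normal approximation; the function name is hypothetical, and with the small trial counts typical of real-robot evaluation an exact test may be more appropriate.

```python
import math

def two_proportion_z(success_a, trials_a, success_b, trials_b):
    """z statistic and two-sided p-value for H0: both policies share one success rate."""
    p_a = success_a / trials_a
    p_b = success_b / trials_b
    pooled = (success_a + success_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if se == 0.0:
        # Both policies always succeed or always fail: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

A large p-value when comparing the DAgger-trained policy against the naive mixed baseline on unseen categories would count against the generalizability claim; a small one would support it.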

Original abstract

Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation (DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DeMaVLA gives a specific recipe for mixing real folding data with DAgger to move past category-specific policies, but the abstract leaves the interference fix untested.

The paper's core idea is a VLA model for generalizable deformable manipulation that avoids category-specific policies by using mixed data and corrective DAgger training. It claims competitive results on benchmarks and strong real-world performance, but the abstract supplies no numbers or tests to back that up.

DeMaVLA takes a VLM backbone, adds a flow-matching action expert, and prunes every other transformer layer in the expert to cut costs while maintaining alignment. The training starts with pre-training on about 5,000 hours of real dual-arm demonstrations, then moves to post-training on aggregated folding data from self-collected demos and human-corrected failures across multiple tasks.
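
The post-training stage described here can be sketched as a human-gated data-aggregation loop. The toy below uses a one-dimensional system and a linear policy purely for illustration; the intervention rule, the stand-in expert, and every name in it are assumptions, since the abstract does not describe the actual pipeline.

```python
def expert_action(state):
    """Stand-in for the human teleoperator: pull the state back toward 0."""
    return -0.5 * state

def fit_gain(dataset):
    """Least-squares fit of a linear policy action = gain * state."""
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, _ in dataset)
    return num / den if den > 0 else 0.0

def rollout_with_corrections(gain, start, horizon=20, fail_threshold=2.0):
    """Run the policy; whenever the state leaves the safe band, the expert
    acts for that step and the corrective (state, action) pair is recorded."""
    corrections, state = [], start
    for _ in range(horizon):
        if abs(state) > fail_threshold:
            action = expert_action(state)
            corrections.append((state, action))
        else:
            action = gain * state
        state = state + action
    return corrections

def dagger(demos, starts, rounds=3):
    """Aggregate demonstrations with corrective pairs and refit each round."""
    dataset = list(demos)
    gain = fit_gain(dataset)
    for _ in range(rounds):
        for start in starts:
            dataset.extend(rollout_with_corrections(gain, start))
        gain = fit_gain(dataset)
    return gain, dataset
```

The point of the design is that corrections are gathered on states the current policy actually visits, so the aggregated dataset covers failure regions that the original demonstrations never reached.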

This setup is new in its particular combination of pruning and the human-in-the-loop DAgger for folding tasks. The paper does well in spotting how naive multi-task mixing leads to interference and in trying to counter it with corrective trajectories.

The main soft spot is the missing evidence that the DAgger pipeline actually resolves the interference without introducing new problems such as overfitting to specific failure modes. The stress-test note points this out correctly: there is no description of mitigations like task conditioning or loss balancing, and there are no ablations or metrics on interference. Without those, the jump to a generalizable model is an assumption rather than a demonstrated result. The soundness score from the reader matches what the abstract shows.

This paper would interest people building foundation models for household robotics, especially deformable manipulation. A reader working on VLA training pipelines could pick up the pruning trick and the DAgger approach as ideas to try.

The work shows clear thinking on the practical challenges. I recommend sending it for peer review so the full experiments can be examined.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DeMaVLA, a VLA foundation model for generalizable deformable manipulation. It uses a VLM backbone with a pruned action expert and flow matching for action generation, pre-trains on ~5,000 hours of real-world dual-arm data, and post-trains on mixed folding demonstrations plus corrective DAgger trajectories collected via human-in-the-loop across multiple tasks to overcome category-specific policies and task interference. The central empirical claim is competitive performance on RoboTwin 2.0 and strong real-world results on a household folding benchmark.

Significance. If the results hold with proper validation, the combination of large-scale real-world pre-training, efficient architecture, and corrective DAgger for multi-task deformable manipulation would represent a meaningful step toward reusable VLA policies that generalize across object categories, geometries, and scenes without per-category retraining.

major comments (2)

[Abstract] Abstract: performance claims (competitive on RoboTwin 2.0, strong real-world results) are stated without any quantitative numbers, error bars, data splits, ablation studies, or statistical tests, making it impossible to evaluate support for the generalization claim.
[Post-training and Experiments] Post-training description: the text explicitly states that naive mixed multi-task training causes task interference and degradation, yet supplies no ablations, metrics, or mitigation details (task conditioning, loss balancing, selective replay) showing that the human-in-the-loop DAgger pipeline on aggregated folding data avoids interference or overfitting to collected failure modes.
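
As an illustration of the kind of error bar the first major comment asks for, a Wilson score interval can be attached to each reported success rate. This is a generic statistical sketch, not the paper's evaluation code; the function name and the default z value are choices made here.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return max(0.0, center - half), min(1.0, center + half)
```

The Wilson interval behaves better than the plain normal interval when the number of trials per task is small or the success rate is near 0 or 1, which is the usual situation in real-robot benchmarks.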

minor comments (1)

[Abstract] The phrase 'approximately 5,000 hours' should be accompanied by exact collection statistics or ranges when the full experimental section is presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of results and experimental details.

Referee: [Abstract] Abstract: performance claims (competitive on RoboTwin 2.0, strong real-world results) are stated without any quantitative numbers, error bars, data splits, ablation studies, or statistical tests, making it impossible to evaluate support for the generalization claim.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript, we will update the abstract to report specific success rates on RoboTwin 2.0 (with baseline comparisons) and the household folding benchmark. Error bars, data splits, and statistical details remain in the main experiments section, but we will add a brief reference to them in the abstract for context.
revision: yes

Referee: [Post-training and Experiments] Post-training description: the text explicitly states that naive mixed multi-task training causes task interference and degradation, yet supplies no ablations, metrics, or mitigation details (task conditioning, loss balancing, selective replay) showing that the human-in-the-loop DAgger pipeline on aggregated folding data avoids interference or overfitting to collected failure modes.

Authors: The statement on task interference in naive multi-task training reflects observations from our preliminary development runs. The DAgger pipeline mitigates this through targeted corrective data collection on real-robot failures. We acknowledge that the current version lacks explicit ablations or quantitative metrics isolating the interference reduction. In revision, we will add an ablation comparing naive mixed training against the DAgger approach, reporting multi-task success rates and interference indicators.
revision: yes
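
The rebuttal does not define its "interference indicators". One plausible reading, sketched below, is the per-task drop in success rate when moving from a task-specific policy to a single mixed policy. The function names and task labels are hypothetical, and the rates must come from matched evaluation protocols for the gap to mean anything.

```python
def interference_gaps(single_task, multi_task):
    """Per-task drop in success rate from a task-specific policy to one mixed
    policy; positive values indicate interference on that task."""
    mismatched = set(single_task) ^ set(multi_task)
    if mismatched:
        raise ValueError(f"tasks not present in both runs: {sorted(mismatched)}")
    return {task: single_task[task] - multi_task[task] for task in single_task}

def summarize(gaps):
    """Mean gap and worst-case gap across tasks."""
    values = list(gaps.values())
    return sum(values) / len(values), max(values)
```

Computing these gaps once for naive mixed training and once for the DAgger-trained model would give the side-by-side comparison the referee requests.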

Circularity Check

0 steps flagged

No circularity: empirical benchmark claims rest on external data evaluation

The paper describes a data-driven VLA model pre-trained on real-world demonstrations and post-trained via DAgger on aggregated folding trajectories, with performance claims evaluated on RoboTwin 2.0 and a household folding benchmark. No derivation chain, equations, or predictions are presented that reduce by construction to fitted parameters or self-citations defined inside the paper; the central results are empirical and externally falsifiable via benchmark metrics rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard deep-learning practices such as transformer layers and flow matching.

[1] [1]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025

[2] [2]

Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Pith/arXiv arXiv 2024

[3] [3]

Gr00t n1: An open foundation model for generalist humanoid robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025

[4] [4]

arXiv preprint arXiv:2410.24164, 2024

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[5] [5]

Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

Kevin Black, Allen Z Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking.arXiv preprint arXiv:2512.05964, 2025

arXiv 2025

[6] [6]

Real-time execution of action chunking flow policies.Advances in Neural Information Processing Systems, 38:33383–33407, 2026

Kevin Black, Manuel Galliker, and Sergey Levine. Real-time execution of action chunking flow policies.Advances in Neural Information Processing Systems, 38:33383–33407, 2026

2026

[7] [7]

Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution.arXiv preprint arXiv:2602.12684, 2026

Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, et al. Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution.arXiv preprint arXiv:2602.12684, 2026

arXiv 2026

[8] [8]

Interactive imitation learning in robotics: A survey.Foundations and Trends®in Robotics, 10(1-2):1–197, 2022

Carlos Celemin, Rodrigo Pérez-Dattari, Eugenio Chisari, Giovanni Franzese, Leandro de Souza Rosa, Ravi Prakash, Zlatan Ajanović, Marta Ferraz, Abhinav Valada, and Jens Kober. Interactive imitation learning in robotics: A survey.Foundations and Trends®in Robotics, 10(1-2):1–197, 2022

2022

[9] [9]

Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

Pith/arXiv arXiv 2025

[10] [10]

Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

Pith/arXiv arXiv 2025

[11] [11]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025

[12] [12]

Starvla: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014, 2026

StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014, 2026

Pith/arXiv arXiv 2026

[13] [13]

Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation.arXiv preprint arXiv:2401.02117, 2024

Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation.arXiv preprint arXiv:2401.02117, 2024

Pith/arXiv arXiv 2024

[14] [14]

Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning.arXiv preprint arXiv:2109.08273, 2021

Ryan Hoque, Ashwin Balakrishna, Ellen Novoseller, Albert Wilcox, Daniel S Brown, and Ken Goldberg. Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning.arXiv preprint arXiv:2109.08273, 2021

arXiv 2021

[15] [15]

Rac: Robot learning for long-horizon tasks by scaling recovery and correction.arXiv preprint arXiv:2509.07953, 2025

Zheyuan Hu, Robyn Wu, Naveen Enock, Jasmine Li, Riya Kadakia, Zackory Erickson, and Aviral Kumar. Rac: Robot learning for long-horizon tasks by scaling recovery and correction.arXiv preprint arXiv:2509.07953, 2025

arXiv 2025

[16] [16]

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al.π∗ 0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025

Pith/arXiv arXiv 2025

[17] [17]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[18]

Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576, 2025

[19]

Michael Kelly, Chelsea Sidrane, Katherine Driggs-Campbell, and Mykel J Kochenderfer. Hg-dagger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019

[20]

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

[21]

Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884–19895, 2020

[22]

Quanyi Li, Zhenghao Peng, and Bolei Zhou. Efficient learning of safe driving policy via human-ai copilot optimization. arXiv preprint arXiv:2202.10341, 2022

[23]

Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, et al. Gr-rl: Going dexterous and precise for long-horizon robotic manipulation. arXiv preprint arXiv:2512.01801, 2025

[24]

Xuewu Lin, Tianwei Lin, Yun Du, Hongyu Xie, Yiwei Jin, Jiawei Li, Shijie Wu, Qingze Wang, Mengdi Li, Mengao Zhao, et al. Holobrain-0 technical report. arXiv preprint arXiv:2602.12062, 2026

[25]

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

[26]

Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, et al. Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2026

[27]

Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Human-in-the-loop imitation learning using remote teleoperation. arXiv preprint arXiv:2012.06733, 2020

[28]

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

[29]

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011

[30]

GigaBrain Team, Boyuan Wang, Bohan Li, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning. arXiv preprint arXiv:2602.12099, 2026

[31]

Philipp Wu, Yide Shentu, Qiayuan Liao, Ding Jin, Menglong Guo, Koushil Sreenath, Xingyu Lin, and Pieter Abbeel. Robocopilot: Human-in-the-loop interactive imitation learning for robot manipulation. arXiv preprint arXiv:2503.07771, 2025

[32]

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026

[33]

Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, et al. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning. arXiv preprint arXiv:2602.11236, 2026

[34]

Checheng Yu, Chonghao Sima, Gangcheng Jiang, Hai Zhang, Haoguang Mai, Hongyang Li, Huijie Wang, Jin Chen, Kaiyang Wu, Li Chen, et al. χ0: Resource-aware robust manipulation via taming distributional inconsistencies. arXiv preprint arXiv:2602.09021, 2026

[35]

Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766, 2025

[36]

Tianle Zhang, Zhihao Yuan, Dafeng Chi, Peidong Liu, Dongwei Li, Kejun Hu, Likui Zhang, Junnan Nie, Ziming Wei, Zengjue Chen, et al. Joyai-ra 0.1: A foundation model for robotic autonomy. arXiv preprint arXiv:2604.20100, 2026

[37]

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

[38]

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

Appendix

Table A.1 Evaluation on RoboTwin Simulation Benchmark. Simulation Task π0 ...