
arxiv: 2605.31286 · v2 · pith:KK7YXV6Jnew · submitted 2026-05-29 · 💻 cs.RO · cs.AI

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

Pith reviewed 2026-06-28 21:56 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords vision-language-action · deformable manipulation · robot folding · foundation model · DAgger · flow matching · dual-arm robot

The pith

DeMaVLA shows that one VLA model can acquire generalizable folding skills for varied household objects by mixing demonstrations with corrective trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to replace category-specific policies with a single foundation model that handles deformable manipulation such as folding across different clothing types, geometries, materials, and home scenes. Existing VLA systems either train separate models per object category or suffer interference when tasks are mixed, limiting real-world household use. DeMaVLA pre-trains a VLM backbone plus action expert on thousands of hours of dual-arm data, then post-trains on aggregated multi-task folding data collected through human-in-the-loop corrections of real-robot failures. This combination of efficient action generation and corrective data aggregation is presented as the route to reusable skills that work from random initial states without per-task retraining. If the approach holds, robots could perform a wider range of folding jobs in unstructured homes using one policy.

Core claim

DeMaVLA adopts a VLM backbone with an action expert that uses flow matching for continuous actions; the expert is built by pruning every other transformer layer while keeping alignment with the backbone. The model is first pre-trained on approximately 5,000 hours of real-world dual-arm demonstrations to learn general manipulation priors, then post-trained on mixed folding data that combines self-collected demonstrations and corrective trajectories gathered from multiple folding tasks via a human-in-the-loop DAgger pipeline. On this basis the model reaches competitive performance on RoboTwin 2.0 and strong results on a household folding benchmark involving diverse items and scenes.
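The layer-pruning step can be pictured in a few lines of Python. This is a minimal sketch, not the authors' code. The abstract does not say which layers are kept or how alignment is wired, so keeping the even-indexed layers and pairing each expert layer with the backbone layer at its original index are assumptions. The names `prune_every_other` and `alignment`, and the 12-layer depth, are hypothetical.

```python
def prune_every_other(num_backbone_layers, keep_offset=0):
    """Return the backbone layer indices retained in the action expert.

    keep_offset=0 keeps layers 0, 2, 4, ...; keep_offset=1 keeps 1, 3, 5, ...
    Which offset the paper uses is not stated; 0 is an assumption.
    """
    if keep_offset not in (0, 1):
        raise ValueError("keep_offset must be 0 or 1")
    return list(range(keep_offset, num_backbone_layers, 2))


def alignment(kept_layers):
    """Map each expert layer index to the backbone layer it stays paired with."""
    return {expert_idx: backbone_idx
            for expert_idx, backbone_idx in enumerate(kept_layers)}


# Hypothetical 12-layer backbone, chosen only for illustration.
kept = prune_every_other(12)
layer_map = alignment(kept)
print(kept)       # [0, 2, 4, 6, 8, 10]
print(layer_map)  # {0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10}
```

Under these assumptions the expert has half as many layers as the backbone, which is where the claimed saving in training and inference cost would come from.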

What carries the argument

Pruned action expert aligned with the VLM backbone that generates continuous actions via flow matching, trained with a DAgger pipeline that aggregates demonstrations and corrective trajectories across multiple folding tasks.
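For readers unfamiliar with flow matching, the sketch below shows one common formulation: a straight-line path from noise to the target action, a constant velocity target, and Euler integration at sampling time. The paper's exact parameterization (time direction, noise schedule, number of integration steps) is not given in the abstract, so every choice here is an assumption. All function names and the 3-dimensional toy action are hypothetical.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to the action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight path; it does not depend on t."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predicted_velocity, noise, action):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target)) / len(target)


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with a single target action and an oracle velocity field that
# always points at it; a trained network would replace `oracle`.
random.seed(0)
action = [0.3, -0.5, 0.8]  # illustrative values, not from the paper
noise = [random.gauss(0.0, 1.0) for _ in action]
oracle = lambda x, t: [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]
sample = euler_sample(oracle, noise, steps=10)
```

With the oracle field, the integration ends at the target action up to floating-point error. In a real policy the velocity network would be conditioned on images, language, and robot state.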

If this is right

  • A single policy replaces separate category-specific folding policies.
  • Layer pruning reduces training and inference cost while preserving performance.
  • Corrective trajectories from real failures improve robustness across varied initial states and materials.
  • Pre-training on broad dual-arm data supplies reusable priors that transfer to deformable tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same aggregation of corrective data could be applied to other multi-task robotic domains such as unfolding or packing.
  • Efficiency from the pruned expert might allow the model to run on lower-power home robots.
  • Further scaling of the pre-training corpus could extend the same generalizability pattern to non-folding manipulation skills.

Load-bearing premise

Aggregating demonstrations and corrective trajectories from multiple folding tasks will avoid task interference or overfitting to the collected failure modes and scenes.

What would settle it

Evaluate the model on a clothing category or household scene that is absent from the collected data. If its success rate falls to the level of naive mixed training without DAgger corrections, the generalizability claim is refuted.
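One way to make this check concrete is to compare the two success rates on the held-out items with a pooled two-proportion z-test. The sketch below is a suggestion, not a procedure from the paper. The function name is hypothetical and the trial counts in the usage lines are placeholders, not reported results.

```python
import math


def two_proportion_z(success_a, trials_a, success_b, trials_b):
    """Pooled two-proportion z statistic and two-sided p-value."""
    p_a = success_a / trials_a
    p_b = success_b / trials_b
    pooled = (success_a + success_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if se == 0.0:
        return 0.0, 1.0  # both arms all-success or all-failure
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


# Placeholder counts: 40 held-out trials per arm, invented for illustration.
z, p_value = two_proportion_z(30, 40, 18, 40)   # DAgger arm vs naive mixed arm
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A large p-value on held-out items would mean the DAgger-trained model is statistically indistinguishable from naive mixed training, which is the outcome that would undercut the claim.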

Original abstract

Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation (DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.
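The human-in-the-loop data aggregation described in the abstract can be sketched as a human-gated rollout: the policy acts until a person flags a failure, the expert then takes over, and only the expert's corrective steps are added to the dataset alongside the demonstrations. The abstract does not describe the intervention criterion or data format, so the structure below is an assumption. The class names, the `fold_towel` task label, and the 1-D toy environment are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task: str
    source: str   # "demo" or "correction"
    steps: list   # list of (observation, action) pairs


@dataclass
class AggregatedDataset:
    trajectories: list = field(default_factory=list)

    def add(self, trajectory):
        self.trajectories.append(trajectory)

    def count_steps(self, source):
        return sum(len(t.steps) for t in self.trajectories if t.source == source)


def collect_corrections(policy, expert, is_failing, step, obs, task, horizon):
    """Roll out the policy; whenever a failure is flagged, the expert acts
    instead and the (observation, expert action) pair is recorded."""
    corrections = []
    for _ in range(horizon):
        if is_failing(obs):
            action = expert(obs)
            corrections.append((obs, action))
        else:
            action = policy(obs)
        obs = step(obs, action)
    return Trajectory(task=task, source="correction", steps=corrections)


# Toy 1-D stand-in for a folding rollout: the goal state is 0, the learned
# policy always moves right, and the human takes over once it overshoots.
policy = lambda s: 1
expert = lambda s: -1 if s > 0 else (1 if s < 0 else 0)
is_failing = lambda s: s > 0
step = lambda s, a: s + a

dataset = AggregatedDataset()
dataset.add(Trajectory("fold_towel", "demo", [(-2, 1), (-1, 1)]))
dataset.add(collect_corrections(policy, expert, is_failing, step,
                                obs=-2, task="fold_towel", horizon=6))
```

Post-training would then sample from `dataset.trajectories` across all tasks, so corrective steps from one folding task sit in the same pool as demonstrations from the others.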

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DeMaVLA, a VLA foundation model for generalizable deformable manipulation. It uses a VLM backbone with a pruned action expert and flow matching for action generation, pre-trains on ~5,000 hours of real-world dual-arm data, and post-trains on mixed folding demonstrations plus corrective DAgger trajectories collected via human-in-the-loop across multiple tasks to overcome category-specific policies and task interference. The central empirical claim is competitive performance on RoboTwin 2.0 and strong real-world results on a household folding benchmark.

Significance. If the results hold with proper validation, the combination of large-scale real-world pre-training, efficient architecture, and corrective DAgger for multi-task deformable manipulation would represent a meaningful step toward reusable VLA policies that generalize across object categories, geometries, and scenes without per-category retraining.

major comments (2)
  1. [Abstract] Abstract: performance claims (competitive on RoboTwin 2.0, strong real-world results) are stated without any quantitative numbers, error bars, data splits, ablation studies, or statistical tests, making it impossible to evaluate support for the generalization claim.
  2. [Post-training and Experiments] Post-training description: the text explicitly states that naive mixed multi-task training causes task interference and degradation, yet supplies no ablations, metrics, or mitigation details (task conditioning, loss balancing, selective replay) showing that the human-in-the-loop DAgger pipeline on aggregated folding data avoids interference or overfitting to collected failure modes.
minor comments (1)
  1. [Abstract] The phrase 'approximately 5,000 hours' should be accompanied by exact collection statistics or ranges when the full experimental section is presented.
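The error bars requested in the first major comment could be reported as Wilson score intervals, which behave well for the small trial counts typical of real-robot evaluation. This is a suggested reporting method, not something the paper states it uses. The function name and the trial counts in the usage line are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return max(0.0, center - half), min(1.0, center + half)


# Placeholder counts, invented for illustration: 18 successes in 20 trials.
low, high = wilson_interval(18, 20)
print(f"90% success, 95% interval [{low:.2f}, {high:.2f}]")
```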

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of results and experimental details.

Point-by-point responses
  1. Referee: [Abstract] Abstract: performance claims (competitive on RoboTwin 2.0, strong real-world results) are stated without any quantitative numbers, error bars, data splits, ablation studies, or statistical tests, making it impossible to evaluate support for the generalization claim.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript, we will update the abstract to report specific success rates on RoboTwin 2.0 (with baseline comparisons) and the household folding benchmark. Error bars, data splits, and statistical details remain in the main experiments section, but we will add a brief reference to them in the abstract for context. revision: yes

  2. Referee: [Post-training and Experiments] Post-training description: the text explicitly states that naive mixed multi-task training causes task interference and degradation, yet supplies no ablations, metrics, or mitigation details (task conditioning, loss balancing, selective replay) showing that the human-in-the-loop DAgger pipeline on aggregated folding data avoids interference or overfitting to collected failure modes.

    Authors: The statement on task interference in naive multi-task training reflects observations from our preliminary development runs. The DAgger pipeline mitigates this through targeted corrective data collection on real-robot failures. We acknowledge that the current version lacks explicit ablations or quantitative metrics isolating the interference reduction. In revision, we will add an ablation comparing naive mixed training against the DAgger approach, reporting multi-task success rates and interference indicators. revision: yes
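The promised ablation needs some definition of an "interference indicator", which the rebuttal does not give. One simple candidate, assumed here, is the per-task drop in success rate when moving from a single-task policy to a multi-task one. The function names, task labels, and trial outcomes below are all invented for illustration and are not results from the paper.

```python
def success_rate(outcomes):
    """Fraction of successful trials; outcomes is a list of 0/1 values."""
    return sum(outcomes) / len(outcomes)


def interference_gap(single_task, multi_task):
    """Per-task drop in success from a single-task policy to a multi-task
    one. Positive values suggest interference on that task."""
    return {
        task: success_rate(single_task[task]) - success_rate(multi_task[task])
        for task in single_task
    }


# Placeholder trial outcomes (1 = success), invented for illustration only.
single = {"shirt": [1, 1, 1, 0], "towel": [1, 1, 1, 1]}
naive_mixed = {"shirt": [1, 0, 0, 0], "towel": [1, 1, 0, 1]}
mixed_dagger = {"shirt": [1, 1, 1, 0], "towel": [1, 1, 1, 0]}

gap_naive = interference_gap(single, naive_mixed)
gap_dagger = interference_gap(single, mixed_dagger)
print(gap_naive)   # {'shirt': 0.5, 'towel': 0.25}
print(gap_dagger)  # {'shirt': 0.0, 'towel': 0.25}
```

Reporting this gap per task, rather than a single averaged success rate, would show whether the DAgger data helps every category or only some of them.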

Circularity Check

0 steps flagged

No circularity: empirical benchmark claims rest on external data evaluation

full rationale

The paper describes a data-driven VLA model pre-trained on real-world demonstrations and post-trained via DAgger on aggregated folding trajectories, with performance claims evaluated on RoboTwin 2.0 and a household folding benchmark. No derivation chain, equations, or predictions are presented that reduce by construction to fitted parameters or self-citations defined inside the paper; the central results are empirical and externally falsifiable via benchmark metrics rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard deep-learning practices such as transformer layers and flow matching.


