arxiv: 2607.01586 · v1 · pith:VL3QBDP6new · submitted 2026-07-02 · 💻 cs.CV · cs.AI · cs.RO

VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment

Pith reviewed 2026-07-03 17:06 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords vision-language-action models · flow matching · co-training · future latent alignment · robotic manipulation · heterogeneous data · transfer performance · pre-training paradigms

The pith

Combining language-supervised co-training and future latent alignment produces the most stable transfer performance for vision-language-action models on heterogeneous robot data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up a single flow-matching framework called VLAFlow to compare four training objectives while holding architecture, backbone, action space, and a 5,000-hour mixed robot dataset fixed. It shows that pure action modeling struggles with data variety, language co-training keeps vision-language skills intact, future latent alignment strengthens prediction of state changes and outcomes, and the two signals together yield the steadiest results on LIBERO, LIBERO-Plus, and SimplerEnv. A sympathetic reader would care because most robot data in practice comes from many sources and robots, so any method that makes pre-training more robust could reduce the need for heavy per-task retraining. The work frames language and future latent representations as complementary intermediate constraints that smooth heterogeneous action supervision.

Core claim

VLAFlow fixes the pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space while training four variants on the OXEMix corpus: action-only modeling, language-supervised co-training, future latent alignment, and their combination. Action-only pre-training is sensitive to data heterogeneity; language supervision preserves generalization; future latent alignment improves state-transition and action-outcome modeling; and the combined model achieves the most stable overall transfer performance across the three benchmarks. The results support a meta-action space view in which language and future latent representations supply complementary intermediate constraints.

What carries the argument

VLAFlow, a unified flow-matching framework that isolates the effects of four training objectives by using identical architecture, backbone, action space, and the OXEMix heterogeneous robot corpus.
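
As a rough illustration of how one framework can host all four objectives, the sketch below pairs a generic conditional flow-matching loss (straight-line path from noise to the action) with per-paradigm loss weights. The paradigm names and the 14-dimensional action space come from the abstract; the linear path, the weighted-sum form, the weight values, and the function names are assumptions made for illustration, not details confirmed by the paper.

```python
import random

# Paradigm names are from the abstract. The weights are hypothetical
# placeholders, not values reported by the paper.
PARADIGMS = {
    "MindPI":   {"language": 0.0, "latent": 0.0},  # action-only modeling
    "MindLPI":  {"language": 1.0, "latent": 0.0},  # + language co-training
    "MindWPI":  {"language": 0.0, "latent": 1.0},  # + future latent alignment
    "MindLWPI": {"language": 1.0, "latent": 1.0},  # both auxiliary signals
}

ACTION_DIM = 14  # action dimensionality stated in the abstract


def flow_matching_loss(velocity_fn, action, rng):
    """Conditional flow-matching loss for one action vector.

    Assumes a straight-line path between Gaussian noise (t=0) and the
    action (t=1); velocity_fn(x_t, t) stands in for the action expert.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    # Point on the line from noise to action at time t.
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    # On a straight-line path the target velocity is constant.
    target = [a - n for n, a in zip(noise, action)]
    pred = velocity_fn(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(action)


def total_loss(paradigm, action_loss, language_loss, latent_loss):
    """Action loss plus whichever auxiliary losses the paradigm enables."""
    w = PARADIGMS[paradigm]
    return action_loss + w["language"] * language_loss + w["latent"] * latent_loss


if __name__ == "__main__":
    rng = random.Random(0)
    action = [0.1] * ACTION_DIM
    # A trivial stand-in model that always predicts zero velocity.
    zero_velocity = lambda x_t, t: [0.0] * len(x_t)
    a_loss = flow_matching_loss(zero_velocity, action, rng)
    for name in PARADIGMS:
        print(name, total_loss(name, a_loss, language_loss=0.5, latent_loss=0.5))
```

Under this reading, the controlled comparison amounts to changing only the entries of the weight table while everything else stays fixed.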

If this is right

  • Language-supervised co-training helps preserve vision-language generalization.
  • Future latent alignment improves state-transition and action-outcome modeling.
  • The combination of both signals produces the most stable transfer performance across benchmarks.
  • Language and future latent representations act as complementary intermediate constraints that make heterogeneous action supervision smoother and more transferable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same controlled-comparison approach could be used to test whether the stability gains persist when the action space or backbone changes.
  • Real-world robot fleets that draw data from multiple manufacturers might adopt the combined paradigm to lower the cost of adapting to new tasks.
  • The meta-action space perspective could be explored in other sequential prediction settings such as video generation or autonomous driving.
  • Scaling the OXEMix corpus size while keeping the same four-objective comparison would test whether the stability advantage grows with data volume.

Load-bearing premise

Performance differences across the four training paradigms can be attributed primarily to the objectives themselves rather than to unmeasured interactions with the specific composition of the OXEMix data or the fixed architecture.

What would settle it

Re-running the four paradigms on LIBERO, LIBERO-Plus, and SimplerEnv and finding that the combined model no longer shows the most stable transfer performance would falsify the central claim.
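
The claim of "most stable" transfer is only testable once stability is given a concrete definition, which the abstract does not supply. The sketch below shows one possible operationalization, assumed here purely for illustration: the spread (max minus min) of a model's scores across benchmarks, with the mean as a tie-breaker. The model names and scores are placeholders, not results from the paper.

```python
from statistics import mean


def stability_summary(scores_by_benchmark):
    """Mean, worst case, and spread of one model's scores across benchmarks."""
    values = list(scores_by_benchmark.values())
    return {
        "mean": mean(values),
        "worst": min(values),
        "range": max(values) - min(values),
    }


def most_stable(results):
    """Model with the smallest cross-benchmark range; ties go to the higher mean."""
    def key(model):
        summary = stability_summary(results[model])
        return (summary["range"], -summary["mean"])
    return min(results, key=key)


if __name__ == "__main__":
    # Placeholder scores for two hypothetical models on the three benchmarks.
    placeholder = {
        "model_a": {"LIBERO": 0.9, "LIBERO-Plus": 0.5, "SimplerEnv": 0.6},
        "model_b": {"LIBERO": 0.8, "LIBERO-Plus": 0.7, "SimplerEnv": 0.7},
    }
    print(most_stable(placeholder))
```

Other definitions (worst-case score, rank variance across seeds) are equally defensible; the point is that a re-run can only confirm or falsify the claim against a definition fixed in advance.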

Original abstract

Vision-language-action models (VLAs) have recently advanced robotic manipulation, yet the effects of different robot-data pre-training paradigms remain difficult to compare because existing models often differ in architecture, data, action space, and evaluation protocol. We present VLAFlow (Vision-Language-Action Flow), a unified flow-matching framework for controlled comparison of VLA training objectives. Using a heterogeneous robot corpus, OXEMix, containing approximately 5,000 hours of data from DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN, we evaluate four paradigms under the same pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space: action-only modeling (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Experiments on LIBERO, LIBERO-Plus, and SimplerEnv show that action-only pre-training is sensitive to heterogeneous data. In contrast, language supervision helps preserve vision-language generalization, while future latent alignment improves state-transition and action-outcome modeling. By combining both signals, MindLWPI achieves the most stable overall transfer performance across benchmarks. These results suggest a meta-action space view: language and future latent representations provide complementary intermediate constraints that make heterogeneous action supervision smoother and more transferable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces VLAFlow, a unified flow-matching framework for controlled comparison of VLA training paradigms. Using the OXEMix corpus (~5k hours from DROID, OpenX-Embodiment, OpenX-Augmented, RoboCOIN), it evaluates four paradigms under identical pi0-style architecture, shared VLM backbone, action expert, and 14-dim action space: action-only (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Experiments on LIBERO, LIBERO-Plus, and SimplerEnv show action-only pre-training is sensitive to heterogeneous data, language supervision preserves vision-language generalization, future latent alignment improves state-transition modeling, and MindLWPI yields the most stable transfer, supporting a meta-action space view with complementary intermediate constraints.

Significance. If the results hold under rigorous validation, the work provides a valuable standardized framework for isolating effects of VLA pre-training objectives on heterogeneous robot data, addressing a key limitation where prior models confound architecture, data, and objectives. The controlled setup and large corpus strengthen the ability to attribute benefits to language co-training and latent alignment, with potential to guide more transferable robotic policies.

major comments (2)
  1. [Abstract] The reported benchmark results favoring MindLWPI provide no error bars, statistical tests, or details on data splits and exclusion rules, preventing verification of the claimed stability and performance differences across paradigms.
  2. [Abstract] The attribution of performance differences primarily to the four training objectives (rather than unmeasured interactions with heterogeneous data sources) is not isolated, as no per-source breakdowns, data-mix ablations, or sampling controls are described despite distinct distributions across DROID/OpenX/etc. sources.
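
To make the second comment concrete, a data-mix ablation of the kind requested could hold the total hour budget fixed and vary only the per-source sampling ratios. The sketch below is a minimal version of that bookkeeping. The source names and the approximate 5,000-hour total come from the abstract; the ratios are hypothetical, and the abstract gives no per-source hour counts, so a real ablation would also need to cap each allocation at the hours actually available.

```python
SOURCES = ["DROID", "OpenX-Embodiment", "OpenX-Augmented", "RoboCOIN"]
TOTAL_HOURS = 5000.0  # approximate corpus size stated in the abstract


def hours_per_source(ratios, total_hours):
    """Split a fixed hour budget across sources in proportion to the ratios."""
    norm = sum(ratios.values())
    if norm <= 0:
        raise ValueError("ratios must sum to a positive value")
    return {name: total_hours * r / norm for name, r in ratios.items()}


if __name__ == "__main__":
    # Hypothetical mixes; neither is taken from the paper.
    uniform = {s: 1.0 for s in SOURCES}
    droid_heavy = {**uniform, "DROID": 3.0}
    for label, mix in [("uniform", uniform), ("droid_heavy", droid_heavy)]:
        print(label, hours_per_source(mix, TOTAL_HOURS))
```

Training each of the four paradigms under several such mixes would show whether the ranking of objectives survives a change in corpus composition.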

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on statistical reporting and experimental controls. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The reported benchmark results favoring MindLWPI provide no error bars, statistical tests, or details on data splits and exclusion rules, preventing verification of the claimed stability and performance differences across paradigms.

    Authors: We agree that error bars, statistical tests, and explicit details on data splits and exclusion rules are necessary for verification. The full experimental sections report results over multiple random seeds with standard deviations in the tables, but these were not referenced in the abstract and statistical significance tests were omitted. In the revision we will (i) add error bars and significance tests (e.g., paired t-tests) to all benchmark tables, (ii) document the exact train/validation splits, seed counts, and any episode exclusion criteria in the experimental setup, and (iii) update the abstract to note that all comparisons use multiple seeds. revision: yes

  2. Referee: [Abstract] The attribution of performance differences primarily to the four training objectives (rather than unmeasured interactions with heterogeneous data sources) is not isolated, as no per-source breakdowns, data-mix ablations, or sampling controls are described despite distinct distributions across DROID/OpenX/etc. sources.

    Authors: The controlled experimental design (identical pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space across all four paradigms) already isolates the training objectives from architectural and action-space confounds. Nevertheless, we acknowledge that source-specific interactions could still contribute. In the revision we will add (i) per-source performance breakdowns on LIBERO and SimplerEnv, (ii) a controlled data-mix ablation that varies sampling ratios while keeping total hours fixed, and (iii) explicit sampling controls. These additions will strengthen the claim that the observed stability gains are attributable to the complementary language and future-latent constraints. revision: yes
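
The paired t-test mentioned in the first response can be sketched as follows, assuming each paradigm is evaluated on the same set of seeds so that scores pair up seed by seed. The function name is hypothetical and the example scores are placeholders, not results from the paper. Only the t-statistic and degrees of freedom are computed; turning them into a p-value requires the Student t distribution, for example from a statistics library.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic and degrees of freedom for matched per-seed scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need one score per shared seed")
    if len(scores_a) < 2:
        raise ValueError("at least two paired scores are required")
    # Work on per-seed differences, which removes seed-level variation.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Placeholder success rates for two models over three shared seeds.
    combined = [0.80, 0.82, 0.78]
    action_only = [0.70, 0.75, 0.71]
    t, dof = paired_t_statistic(combined, action_only)
    print(f"t = {t:.3f}, degrees of freedom = {dof}")
```

With only a handful of seeds the test has little power, so reporting the per-seed differences alongside the statistic is usually more informative than the p-value alone.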

Circularity Check

0 steps flagged

No circularity: empirical comparison is self-contained

The paper conducts a controlled empirical evaluation of four VLA training paradigms (action-only, language co-training, future latent alignment, and their combination) on shared pi0-style architecture, VLM backbone, action expert, and 14-dim action space using the OXEMix corpus. Results are reported as observed transfer performance on external benchmarks (LIBERO, LIBERO-Plus, SimplerEnv). No equations, fitted parameters renamed as predictions, self-citation chains, or ansatzes are present that reduce any claimed result to its inputs by construction. The meta-action-space interpretation is presented as a post-hoc suggestion from the empirical outcomes rather than a deductive step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that flow-matching is a suitable generative model for robot actions and that the shared architecture isolates the effect of training objectives. No free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: flow matching provides a valid generative model for 14-dimensional robot actions under a shared VLM backbone.
    The framework is built on flow matching for all four paradigms.
