VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment
Pith reviewed 2026-07-03 17:06 UTC · model grok-4.3
The pith
Combining language-supervised co-training and future latent alignment produces the most stable transfer performance for vision-language-action models on heterogeneous robot data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLAFlow holds the pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space fixed while training four variants on the OXEMix corpus: action-only modeling, language-supervised co-training, future latent alignment, and their combination. The paper reports that action-only pre-training is sensitive to data heterogeneity, that language supervision helps preserve vision-language generalization, that future latent alignment improves state-transition and action-outcome modeling, and that the combined model achieves the most stable overall transfer performance across the three benchmarks (LIBERO, LIBERO-Plus, and SimplerEnv). The results support a meta-action space view in which language and future latent representations supply complementary intermediate constraints.
What carries the argument
VLAFlow, a unified flow-matching framework that isolates the effects of four training objectives by using identical architecture, backbone, action space, and the OXEMix heterogeneous robot corpus.
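To make the four-objective comparison concrete, here is a minimal sketch of how a flow-matching action loss might be combined with optional language and future-latent terms. It assumes the generic linear-interpolation path from the flow-matching literature; the paper's exact time convention, loss forms, and weights are not given in this review, so `flow_matching_pair`, `total_loss`, and the 0/1 weights in `VARIANTS` are hypothetical illustrations rather than the authors' implementation.

```python
ACTION_DIM = 14  # action dimensionality stated in the abstract


def flow_matching_pair(action, noise, t):
    """Return (x_t, v_target) for a linear path from noise (t=0) to action (t=1).

    Assumption: generic linear-path flow matching, not necessarily the
    paper's exact convention.
    """
    if len(action) != len(noise):
        raise ValueError("action and noise must have the same length")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    x_t = [(1.0 - t) * n + t * a for a, n in zip(action, noise)]
    v_target = [a - n for a, n in zip(action, noise)]  # constant velocity along the path
    return x_t, v_target


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def total_loss(action_loss, lang_loss, latent_loss, w_lang, w_latent):
    """Weighted sum of the three training signals (weights are hypothetical)."""
    return action_loss + w_lang * lang_loss + w_latent * latent_loss


# Hypothetical 0/1 weights that switch each auxiliary objective on or off.
VARIANTS = {
    "MindPI": {"w_lang": 0.0, "w_latent": 0.0},    # action-only
    "MindLPI": {"w_lang": 1.0, "w_latent": 0.0},   # + language co-training
    "MindWPI": {"w_lang": 0.0, "w_latent": 1.0},   # + future latent alignment
    "MindLWPI": {"w_lang": 1.0, "w_latent": 1.0},  # both signals
}
```

In this framing the four variants differ only in which auxiliary terms are active, which is the property the controlled comparison relies on.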
If this is right
- Language-supervised co-training helps preserve vision-language generalization.
- Future latent alignment improves state-transition and action-outcome modeling.
- The combination of both signals produces the most stable transfer performance across benchmarks.
- Language and future latent representations act as complementary intermediate constraints that make heterogeneous action supervision smoother and more transferable.
Where Pith is reading between the lines
- The same controlled-comparison approach could be used to test whether the stability gains persist when the action space or backbone changes.
- Real-world robot fleets that draw data from multiple manufacturers might adopt the combined paradigm to lower the cost of adapting to new tasks.
- The meta-action space perspective could be explored in other sequential prediction settings such as video generation or autonomous driving.
- Scaling the OXEMix corpus size while keeping the same four-objective comparison would test whether the stability advantage grows with data volume.
Load-bearing premise
Performance differences across the four training paradigms can be attributed primarily to the objectives themselves rather than to unmeasured interactions with the specific composition of the OXEMix data or the fixed architecture.
What would settle it
Re-running the four paradigms on LIBERO, LIBERO-Plus, and SimplerEnv and finding that the combined model no longer shows the most stable transfer performance would falsify the central claim.
Original abstract
Vision-language-action models (VLAs) have recently advanced robotic manipulation, yet the effects of different robot-data pre-training paradigms remain difficult to compare because existing models often differ in architecture, data, action space, and evaluation protocol. We present VLAFlow (Vision-Language-Action Flow), a unified flow-matching framework for controlled comparison of VLA training objectives. Using a heterogeneous robot corpus, OXEMix, containing approximately 5,000 hours of data from DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN, we evaluate four paradigms under the same pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space: action-only modeling (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Experiments on LIBERO, LIBERO-Plus, and SimplerEnv show that action-only pre-training is sensitive to heterogeneous data. In contrast, language supervision helps preserve vision-language generalization, while future latent alignment improves state-transition and action-outcome modeling. By combining both signals, MindLWPI achieves the most stable overall transfer performance across benchmarks. These results suggest a meta-action space view: language and future latent representations provide complementary intermediate constraints that make heterogeneous action supervision smoother and more transferable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VLAFlow, a unified flow-matching framework for controlled comparison of VLA training paradigms. Using the OXEMix corpus (~5k hours from DROID, OpenX-Embodiment, OpenX-Augmented, RoboCOIN), it evaluates four paradigms under identical pi0-style architecture, shared VLM backbone, action expert, and 14-dim action space: action-only (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Experiments on LIBERO, LIBERO-Plus, and SimplerEnv show action-only pre-training is sensitive to heterogeneous data, language supervision preserves vision-language generalization, future latent alignment improves state-transition modeling, and MindLWPI yields the most stable transfer, supporting a meta-action space view with complementary intermediate constraints.
Significance. If the results hold under rigorous validation, the work provides a valuable standardized framework for isolating effects of VLA pre-training objectives on heterogeneous robot data, addressing a key limitation where prior models confound architecture, data, and objectives. The controlled setup and large corpus strengthen the ability to attribute benefits to language co-training and latent alignment, with potential to guide more transferable robotic policies.
major comments (2)
- [Abstract] The reported benchmark results favoring MindLWPI provide no error bars, statistical tests, or details on data splits and exclusion rules, preventing verification of the claimed stability and performance differences across paradigms.
- [Abstract] The attribution of performance differences primarily to the four training objectives (rather than unmeasured interactions with heterogeneous data sources) is not isolated: no per-source breakdowns, data-mix ablations, or sampling controls are described, despite the distinct distributions of the DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN sources.
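A data-mix ablation of the kind the second comment asks for could be set up roughly as follows. This sketch converts unnormalized sampling ratios into per-source hours under a fixed total budget. The source names and the roughly 5,000-hour total come from the abstract; the function names, the uniform baseline, and the boost factors are hypothetical.

```python
# Source names are taken from the abstract; everything else is illustrative.
SOURCES = ("DROID", "OpenX-Embodiment", "OpenX-Augmented", "RoboCOIN")


def allocate_hours(ratios, total_hours=5000.0):
    """Map unnormalized sampling ratios to per-source hours summing to total_hours."""
    missing = set(SOURCES) - set(ratios)
    if missing:
        raise ValueError(f"missing ratios for: {sorted(missing)}")
    if any(ratios[s] < 0 for s in SOURCES):
        raise ValueError("ratios must be non-negative")
    weight_sum = sum(ratios[s] for s in SOURCES)
    if weight_sum <= 0:
        raise ValueError("at least one ratio must be positive")
    return {s: total_hours * ratios[s] / weight_sum for s in SOURCES}


def ablation_grid(boosted_source, boost_factors, total_hours=5000.0):
    """Build one mix per factor by scaling a single source's weight.

    All other sources keep weight 1.0, so the total budget stays fixed
    while the composition changes.
    """
    grid = []
    for factor in boost_factors:
        ratios = {s: 1.0 for s in SOURCES}
        ratios[boosted_source] = factor
        grid.append(allocate_hours(ratios, total_hours))
    return grid
```

Training each of the four paradigms on every mix in such a grid would show whether the ranking of objectives is stable under changes in data composition.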
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on statistical reporting and experimental controls. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The reported benchmark results favoring MindLWPI provide no error bars, statistical tests, or details on data splits and exclusion rules, preventing verification of the claimed stability and performance differences across paradigms.
  Authors: We agree that error bars, statistical tests, and explicit details on data splits and exclusion rules are necessary for verification. The full experimental sections report results over multiple random seeds with standard deviations in the tables, but these were not referenced in the abstract and statistical significance tests were omitted. In the revision we will (i) add error bars and significance tests (e.g., paired t-tests) to all benchmark tables, (ii) document the exact train/validation splits, seed counts, and any episode exclusion criteria in the experimental setup, and (iii) update the abstract to note that all comparisons use multiple seeds.
  Revision: yes
- Referee: [Abstract] The attribution of performance differences primarily to the four training objectives (rather than unmeasured interactions with heterogeneous data sources) is not isolated: no per-source breakdowns, data-mix ablations, or sampling controls are described, despite the distinct distributions of the DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN sources.
  Authors: The controlled experimental design (identical pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space across all four paradigms) already isolates the training objectives from architectural and action-space confounds. Nevertheless, we acknowledge that source-specific interactions could still contribute. In the revision we will add (i) per-source performance breakdowns on LIBERO and SimplerEnv, (ii) a controlled data-mix ablation that varies sampling ratios while keeping total hours fixed, and (iii) explicit sampling controls. These additions will strengthen the claim that the observed stability gains are attributable to the complementary language and future-latent constraints.
  Revision: yes
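The paired t-test mentioned in the first response can be sketched as below, assuming each paradigm is evaluated with the same set of seeds so that per-seed scores can be paired. The function name and input layout are hypothetical, and no values from the paper are used.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-seed scores.

    scores_a[i] and scores_b[i] are assumed to come from the same seed,
    e.g. success rates of two training paradigms on one benchmark.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both paradigms need the same number of seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired seeds")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

Turning the statistic into a p-value requires the Student t distribution with `n - 1` degrees of freedom; `scipy.stats.ttest_rel` is one existing routine that computes both. With only a handful of seeds the test has little power, so reporting per-seed differences alongside it would be informative.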
Circularity Check
No circularity: empirical comparison is self-contained
Full rationale
The paper conducts a controlled empirical evaluation of four VLA training paradigms (action-only, language co-training, future latent alignment, and their combination) on shared pi0-style architecture, VLM backbone, action expert, and 14-dim action space using the OXEMix corpus. Results are reported as observed transfer performance on external benchmarks (LIBERO, LIBERO-Plus, SimplerEnv). No equations, fitted parameters renamed as predictions, self-citation chains, or ansatzes are present that reduce any claimed result to its inputs by construction. The meta-action-space interpretation is presented as a post-hoc suggestion from the empirical outcomes rather than a deductive step.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: flow matching provides a valid generative model for 14-dimensional robot actions under a shared VLM backbone.