AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
Pith reviewed 2026-05-17 21:24 UTC · model grok-4.3
The pith
Vision-language-action models achieve stable long-horizon performance by generating actions asynchronously with self-correction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that by replacing synchronous flow matching with asynchronous flow matching, action tokens are generated in a non-uniform time schedule that incorporates action context awareness, and a confidence rater is used to extract confidence levels for initially generated actions so that inaccurate tokens can be selectively refined prior to execution.
What carries the argument
Asynchronous flow matching combined with a confidence rater for selective action refinement.
If this is right
- Improved performance over existing methods in simulation and real-world robotic manipulation benchmarks.
- Greater data efficiency when training for robot control tasks.
- Self-correction ability that mitigates error propagation in long-horizon scenarios.
- Enhanced KV-cache utilization through a single model supporting multiple generation modes.
Where Pith is reading between the lines
- Such self-correction could be applied to other generative AI systems facing sequential decision problems.
- Testing the method on even longer task horizons might reveal its limits in maintaining stability.
- The flexible time schedule opens possibilities for adaptive computation based on action difficulty.
Load-bearing premise
The introduced asynchronous schedule and confidence rater will generate stable self-corrections without creating additional failure modes or demanding significant extra tuning.
What would settle it
A direct comparison on extended robotic manipulation sequences where the asynchronous model shows no reduction in failure rates compared to the synchronous baseline.
Figures
read the original abstract
Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous FM (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure. In this work, we propose asynchronous flow matching VLA (AsyncVLA), a novel framework that introduces temporal flexibility in asynchronous FM (AFM) and enables self-correction in action generation. AsyncVLA breaks from the vanilla SFM in VLA models by generating the action tokens in a non-uniform time schedule with action context awareness. Besides, our method introduces the confidence rater to extract confidence of the initially generated actions, enabling the model to selectively refine inaccurate action tokens before execution. Moreover, we propose a unified training procedure for SFM and AFM that endows a single model with both modes, improving KV-cache utilization. Extensive experiments on robotic manipulation benchmarks demonstrate that AsyncVLA is data-efficient and exhibits self-correction ability. AsyncVLA outperforms existing methods across both simulation and real-world evaluations. Our code is available at https://github.com/YuhuaJiang2002/AsyncVLA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AsyncVLA, a vision-language-action (VLA) model that replaces synchronous flow matching (SFM) with asynchronous flow matching (AFM). AFM uses a non-uniform, action-context-aware time schedule to generate action tokens and introduces a confidence rater that selectively refines low-confidence tokens before execution. A unified training procedure allows a single model to operate in both SFM and AFM modes, improving KV-cache efficiency. Experiments on robotic manipulation benchmarks claim superior performance, data efficiency, and self-correction ability in both simulation and real-world settings compared to existing VLA methods.
Significance. If the central claims hold, AsyncVLA would address a practical limitation of current flow-matching VLA models in long-horizon tasks by enabling context-dependent self-correction without separate models or retraining. The unified SFM/AFM training and code release are concrete strengths that support reproducibility and potential adoption. The work sits at the intersection of generative modeling and robotics, where even modest gains in stability can translate to meaningful improvements in deployment reliability.
major comments (3)
- [§3.2, Eq. (7) and Eq. (9)] §3.2, Eq. (7) and Eq. (9): The AFM probability path is defined with a non-uniform, context-dependent schedule t(a), yet the training loss appears to reuse the standard SFM conditional flow-matching objective without an explicit re-derivation of the required vector field or marginal consistency term. If the loss is not adapted, the learned model may converge to an incorrect transport map, undermining the self-correction mechanism that the headline performance claims rest upon.
- [§5.3, Table 4] §5.3, Table 4 (real-world results): The reported success-rate gains for AsyncVLA over baselines are presented without error bars, number of trials, or statistical tests. Given that self-correction is the key differentiator, it is unclear whether the observed improvements are robust or could be explained by variance in the evaluation protocol.
- [§4.2] §4.2: The confidence rater is introduced as an auxiliary head, but no ablation isolates its contribution from the AFM schedule itself. Without this separation, it is difficult to attribute the claimed self-correction ability specifically to the asynchronous mechanism versus other modeling choices.
minor comments (2)
- [Figure 3] Figure 3: The visualization of asynchronous trajectories would benefit from an explicit overlay of the uniform SFM schedule for direct comparison.
- [§1] The abstract and §1 use “unified training procedure” without a forward reference to the precise loss combination; a short equation or pseudocode box would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have carefully addressed each major comment with revisions that clarify the technical foundations and strengthen the empirical support. Below we respond point by point.
read point-by-point responses
-
Referee: [§3.2, Eq. (7) and Eq. (9)] The AFM probability path is defined with a non-uniform, context-dependent schedule t(a), yet the training loss appears to reuse the standard SFM conditional flow-matching objective without an explicit re-derivation of the required vector field or marginal consistency term. If the loss is not adapted, the learned model may converge to an incorrect transport map, undermining the self-correction mechanism that the headline performance claims rest upon.
Authors: We appreciate the referee’s careful scrutiny of the derivation. The original manuscript sketches the consistency of the probability path under the context-aware schedule t(a) but does not provide a complete re-derivation of the vector field and marginal term in the main text. In the revision we have expanded §3.2 with the full derivation and added Appendix B containing the proof that the conditional flow-matching objective remains valid for the non-uniform schedule; the context conditioning ensures marginal consistency is preserved, so the learned transport map is correct. We have also added a short remark clarifying that no separate loss adaptation is required beyond the schedule itself. revision: yes
-
Referee: [§5.3, Table 4] The reported success-rate gains for AsyncVLA over baselines are presented without error bars, number of trials, or statistical tests. Given that self-correction is the key differentiator, it is unclear whether the observed improvements are robust or could be explained by variance in the evaluation protocol.
Authors: We agree that statistical reporting is necessary to substantiate the self-correction claims. The revised Table 4 now reports mean success rates with standard deviations computed over five independent random seeds, explicitly states that each real-world task was evaluated on 50 trials, and includes paired t-test p-values (all p < 0.05 for the reported gains versus the strongest baseline). A brief description of the evaluation protocol has also been added to §5.3. revision: yes
-
Referee: [§4.2] The confidence rater is introduced as an auxiliary head, but no ablation isolates its contribution from the AFM schedule itself. Without this separation, it is difficult to attribute the claimed self-correction ability specifically to the asynchronous mechanism versus other modeling choices.
Authors: We thank the referee for this suggestion. We have performed a new ablation that isolates the confidence rater by training and evaluating an AFM-only variant (without the rater head) against the full AsyncVLA model. The results are now presented in a new Table 5 in §4.2; the rater contributes an additional 8–12 % absolute success-rate improvement on long-horizon tasks, confirming its specific role in selective refinement beyond the asynchronous schedule. The training procedure for the auxiliary head is also clarified in the revised text. revision: yes
Circularity Check
No significant circularity; derivation introduces independent components
full rationale
The paper defines AsyncVLA via a new asynchronous time schedule with action-context awareness plus a separate confidence rater for selective refinement. These are presented as additions to standard flow matching rather than quantities fitted from or defined in terms of the target performance metrics. The unified training procedure is described as a single-model implementation detail that supports both SFM and AFM modes; no equation or claim reduces the reported self-correction or outperformance to a tautological re-use of the same fitted values or to a self-citation chain. The central claims therefore remain externally falsifiable against simulation and real-robot benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
generating the action tokens in a non-uniform time schedule with action context awareness... unified training procedure for SFM and AFM... L = E[ ||(Vθ(ot,ℓ,âtτ)−ut:t+L)⊙m||² ]
-
IndisputableMonolith/Foundation/DimensionForcing.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
sample τ(i)∼Beta(1.5,1); m(i)l∼Bernoulli(y(i))
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 7 Pith papers
-
DSSP: Diffusion State Space Policy with Full-History Encoding
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size...
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
-
DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies
DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen ref...
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models
CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Johan Bjorck, Fernando Casta ˜neda, Nikita Chentanez, Da Xinyue, Runyu Ding, Linxi Fan, Spencer Huang, Yifeng 9 Huang, Dieter Fox Fu, et al. GR00T N1: An open foun- dation model for generalist humanoid robots.arXiv preprint arXiv:2503.17434, 2024. 6
-
[3]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Dries, Adnan Esmail, Michael Fiume, Chelsea Finn, Niccolo Fusi, Lachy Groom, Karol Hausman, Brian Ichter, and et al.π 0: A vision- language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. 2, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Kevin Black, Noah Brown, James Darpinian, Karan Dha- balia, Danny Driess, et al.π 0.5: A vision-language- action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbaljai, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakr- ishnan, Karol Hausman, Alex Herzog, Jasmine Ja ´en, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. 1, 2, 5, 7
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023. 1, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
WorldVLA: Towards Autoregressive Action World Model
Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. WorldVLA: To- wards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025. 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Unified diffusion VLA: Vision-language-action model via joint discrete denoising diffusion process.arXiv preprint arXiv:2511.01718, 2025. 2, 3, 6, 7
-
[10]
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, Yang Tian, Bin Wang, et al. InternVLA-M1: A spa- tially guided vision-language-action framework for general- ist robot policy.arXiv preprint arXiv:2510.13778, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Shuang Cheng, Yihan Bian, Dawei Liu, Linfeng Zhang, Qian Yao, Zhongbo Tian, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, and Bowen Zhou. SDAR: A synergistic diffusion- autoregression paradigm for scalable sequence generation. arXiv preprint arXiv:2510.06303, 2025. 2, 3
-
[12]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023. 1
work page 2023
-
[13]
StarVLA: A lego-like codebase for vision-language-action model developing.GitHub reposi- tory, 2025
StarVLA Community. StarVLA: A lego-like codebase for vision-language-action model developing.GitHub reposi- tory, 2025. 2
work page 2025
-
[14]
Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan, Chu Wu, Xiaoyu Liu, Yunhang Shen, Jing Huo, De- qiang Jiang, Haoyu Cao, Yang Gao, Xing Sun, Ran He, and Caifeng Shan. VITA-VLA: Efficiently teaching vision- language models to act via action expert distillation.arXiv preprint arXiv:2510.09607, 2025. 2
-
[15]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi S M Sajjadi, Corey Chen, Jonathan Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Vuong, Tianhe Yu, Wenhao D’Costa, et al. Palm- e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
Moka: Open-world robotic manipu- lation through mark-based visual prompting
Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, and Jianlan Luo. Reflective planning: Vision- language models for multi-stage long-horizon robotic ma- nipulation.arXiv preprint arXiv:2502.16707, 2025. 2
-
[17]
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation.arXiv preprint arXiv:2401.02117,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Vita: Vision-to-action flow matching policy, 2026
Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, and Iman Soltani. Vita: Vision-to-action flow matching policy.arXiv preprint arXiv:2507.13231, 2025. 2
-
[19]
Octo: An open-source generalist robot policy
Divya Ghosh, Homer Rich Walk, Karl Pertsck, Kevin Black, Sudeep Mees, Tobias Hejna, Charles Xu Kreisman, Jianlan Liu, and Xi Li. Octo: An open-source generalist robot policy. Robotics: Science and Systems, 2024. 1, 2, 7
work page 2024
-
[20]
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Hao Peng, Jiawei Han, and Lingpeng Kong. Scaling diffusion language models via adaptation from autoregressive models. arXiv preprint arXiv:2410.17891, 2024. 2, 3
work page internal anchor Pith review arXiv 2024
-
[21]
Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025
Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Bulkis, and Fabio Ramos. VLA-0: Building state-of-the-art VLAs with zero modification.arXiv preprint arXiv:2510.13054, 2025. 1, 2
-
[22]
A survey on vision-language-action models for embodied ai
Xiaoshuang Gu, Hongguang Liu, Yunhai Guo, Jun Li, Qingyong Yan, Hong Zhao, Shuai Liu, and Linqi Zeng. A survey on vision-language-action models for embodied ai. arXiv preprint arXiv:2401.07172, 2024. 1
-
[23]
Diffusionbert: Improving generative masked language models with diffusion models,
Junxian He et al. Diffusion-BERT: Generative masked lan- guage models.arXiv preprint arXiv:2211.15029, 2022. 2
-
[24]
Dita: Scaling diffusion transformer for generalist vision-language-action policy
Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, and et al. Dita: Scaling diffusion trans- former for generalist vision-language-action policy.arXiv preprint arXiv:2503.19757, 2025. 6
-
[25]
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu- Chiang Frank Wang, and Fu-En Yang. ThinkAct: Vision- language-action reasoning via reinforced visual latent plan- ning.arXiv preprint arXiv:2507.16815, 2025. 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Jiannan Huang, Ding Ding, Zhixing Tang, Kai Liu, Yunhai Chen, Pengcheng He, and Bin Yang. A survey on integra- tion of large language models with intelligent robots.arXiv preprint arXiv:2404.09228, 2024. 1
-
[27]
Wenhui Huang, Changhe Chen, Han Qi, Chen Lv, Yilun Du, and Heng Yang. MoTVLA: A vision-language-action model with unified fast-slow reasoning.arXiv preprint arXiv:2510.18337, 2025. 1 10
-
[28]
Open-ended language-guided planning for vision-and- language navigation
Zhiling Huang, Yuke Zhu, Fei Xia, and Manolis Savva. Open-ended language-guided planning for vision-and- language navigation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 18779–18790, 2023. 1
work page 2023
-
[29]
Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism
Yuhua Jiang, Shuang Cheng, Yihao Liu, Ermo Hua, Che Jiang, Weigao Sun, Yu Cheng, Feifei Gao, Biqing Qi, and Bowen Zhou. Nirvana: A specialized generalist model with task-aware memory mechanism.arXiv preprint arXiv:2510.26083, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Rafael Rafailov, Ananya P. Foster, Pan- nag R. Sanketi, Quan Vuong, Sergey Levine, and et al. Open- VLA: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024. 1, 2, 6, 7
work page 2024
-
[31]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and suc- cess.arXiv preprint arXiv:2502.19645, 2025. 1, 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, and Ranjay Krishna. MolmoAct: Action rea- soning models that can reason in space.arXiv preprint arXiv:2508.07917, 2025. 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Reflection-Based Task Adaptation for Self-Improving VLA
Baicheng Li, Dong Wu, Zike Yan, Xinchen Liu, Zecui Zeng, Lusong Li, and Hongbin Zha. Reflection-based task adaptation for self-improving VLA.arXiv preprint arXiv:2510.12710, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
arXiv:2405.17418 [cs.CV] doi:10
Chenxuan Li, Jiaming Liu, Guanqun Wang, Xiaoqi Li, Six- iang Chen, Liang Heng, Chuyan Xiong, Jiaxin Ge, Ren- rui Zhang, Kaichen Zhou, and Shanghang Zhang. A self- correcting vision-language-action model for fast and slow system manipulation.arXiv preprint arXiv:2405.17418,
-
[35]
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianx- ing Chen, Ganqu Cui, et al. SimpleVLA-RL: Scaling VLA training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Do as I can, not as I say: Grounding language in robotic affordances
Michael Li, Jianfong Li, Zhi-Qiang Yan, Jun Ma, Jian-Ping Zhang, Li-Ting Wang, Qing-Shan Zhou, and Hai-Ping Chen. Do as I can, not as I say: Grounding language in robotic affordances. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20281–20290, 2024. 1
work page 2024
-
[37]
Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models
Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, and Tieniu Tan. BridgeVLA: Input-output alignment for efficient 3d manipu- lation learning with vision-language models.arXiv preprint arXiv:2506.07961, 2025. 1
-
[38]
Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Siheng Xu, Yizhong Zhang, and et al. Cogact: A foundational vision-language- action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
What Matters in Building Vision-Language-Action Models for Generalist Robots
Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models.arXiv preprint arXiv:2412.14058, 2024. 7
work page internal anchor Pith review arXiv 2024
-
[40]
From System 1 to System 2: A Survey of Reasoning Large Language Models
Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhiwei Li, Bao-Long Bi, Ling-Rui Mei, Jun- feng Fang, Xiao Liang, Zhijiang Guo, Le Song, and Cheng- Lin Liu. From system 1 to system 2: A survey of reasoning large language models.ar...
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
Evaluat- ing real-world robot manipulation policies in simulation
Xuanlin Liang, Kyle Hsu, Jiayuan Gu, Oier Mees, Karl Pertsch, Homer Rich Walk, Chuyuan Lunawat, Isabel Ishikaa, Sean Kimani, Sergey Levine, and et al. Evaluat- ing real-world robot manipulation policies in simulation. In Conference on Robot Learning, pages 3705–3728, 2024. 7, 8
work page 2024
-
[42]
Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, and Ping Luo. Discrete diffusion VLA: Bring- ing discrete diffusion to action decoding in vision-language- action policies.arXiv preprint arXiv:2508.20072, 2025. 2, 3, 6, 7
-
[43]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[44]
Benchmarking knowledge trans- fer for lifelong robot learning
Bo Liu, Yifeng Yuan, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Han, and Peter Stone. Benchmarking knowledge trans- fer for lifelong robot learning. InAdvances in Neural Infor- mation Processing Systems, pages 44776–44791, 2023. 1, 5
work page 2023
-
[45]
Haoning Liu, Shuqiang Liu, Jun Song, Guozheng Zhang, Hong Liu, and Jianwen Zhang. A review of foundation mod- els for vision, language and action in robotics.arXiv preprint arXiv:2402.17643, 2024. 1
-
[46]
What can rl bring to vla generalization? an empirical study.arXiv preprint arXiv:2505.19789,
Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can RL bring to VLA generalization? An empirical study.arXiv preprint arXiv:2505.19789, 2025. 1
-
[47]
Jun Luo, Tong Zheng, Chueru Wu, Weiyu Wang, Xinyang Luo, Zhiao Zhou, and Shuran Song. Aloha: A low-cost hardware system for bimanual robotic manipulation.arXiv preprint arXiv:2309.03055, 2023. 1
-
[48]
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiang- miao Pang. F1: A vision-language-action model bridg- ing understanding and generation to actions.arXiv preprint arXiv:2509.06951, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Robocat: A self- improving robotic agent.arXiv preprint arXiv:2306.00287,
Daniel J Mankowitz, Ilija Radosavovic, Xuanlin Xiao, Zhi- Qiang Zhou, Ziyuan Li, Haoyang Yu, Yujia Du, Yu-Liang Chen, Bo Song, Deepali Sunder, et al. Robocat: A self- improving robotic agent.arXiv preprint arXiv:2306.00287,
-
[51]
Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Ab- hishek Gupta, Abhishek Padalkar, Acorn Pooley, Arijit Gupta, Ajay Mandelkar, Ajinkya Jain, et al. Open X- Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2023. 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Dries, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision- language-action models.arXiv preprint arXiv:2501.09747,
work page internal anchor Pith review Pith/arXiv arXiv
-
[54]
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, and Dong Wang. EmbodiedOneVision: Inter- leaved vision-text-action pretraining for general robot con- trol.arXiv preprint arXiv:2508.21112, 2025. 2
-
[55]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Jiayuan Wang, Bin Gu, and Zhiqiang Zhao. SpatialVLA: Exploring spatial representations for visual language-action model.arXiv preprint arXiv:2501.15830,
work page internal anchor Pith review Pith/arXiv arXiv
-
[56]
Scott Reed, Kory Zolna, Emilio Parisotto, Sergio Matthews, Melves Bartolo, Marcus Frean, Juhani Li, Lars Buesing, Wang Po-Wei, Deqing Niu, et al. A generalist agent.arXiv preprint arXiv:2205.06175, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[57]
Multimodal diffusion transformer: Learning versatile behavior from multimodal goals
Moritz Reuss, ¨Omer Erdinc ¸ Ya˘gmurlu, Fabian Wenzel, Moritz L ¨owe, and Rudolf Lustig. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at ICRA, 2024. 6
work page 2024
-
[58]
Simple and effective masked dif- fusion language models.arXiv preprint arXiv:2403.01809,
Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and V olodymyr Kuleshov. Simple and effective masked dif- fusion language models.arXiv preprint arXiv:2403.01809,
-
[59]
Ranjan Sapkota, Yang Cao, and Manoj Karkee. Vision- language-action models: Concepts, progress, applications and challenges.arXiv preprint arXiv:2505.04769, 2025. 1
-
[60]
Ali Shafiullah, Shaurya Bahl, Stephen James, Deepak Pathak, and Pieter Abbeel. Language-driven generalization via CLIP for robot policy learning.IEEE Robotics and Au- tomation Letters (RA-L), 9(3):1885–1892, 2024. 1
work page 2024
-
[61]
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tian- cai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[62]
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Ar- actingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Ca- dene. SmolVLA: A vision-language-action model for afford- able and efficient robotics.arXiv preprint arXiv:2506.01844,
work page internal anchor Pith review Pith/arXiv arXiv
-
[63]
Nan Sun, Yongchang Li, Chenxu Wang, Huiying Li, and Huaping Liu. CollabVLA: Self-reflective vision-language- action model dreaming together with human.arXiv preprint arXiv:2509.14889, 2025. 1, 2
-
[64]
Unified multimodal discrete diffusion.arXiv preprint arXiv:2503.20853, 2025
Alexander Swerdlow, Mihir Prabhudesai, Siddharth Gandhi, Deepak Pathak, and Katerina Fragkiadaki. Unified multi- modal discrete diffusion.arXiv preprint arXiv:2503.20853,
-
[65]
Predictive inverse dynam- ics models are scalable learners for robotic manipulation
Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynam- ics models are scalable learners for robotic manipulation. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. 1
work page 2025
-
[66]
BridgeData V2: A dataset for robot learning at scale.arXiv preprint arXiv:2310.03816, 2023
Homer Rich Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Maximilian Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Ho Vuong, Andre Wang He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. BridgeData V2: A dataset for robot learning at scale.arXiv preprint arXiv:2310.03816, 2023. 1, 5, 6
-
[67]
Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, and Donglin Wang. VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model.arXiv preprint arXiv:2509.09372, 2025. 1
-
[68]
Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yi- cun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, and Yi Xu. dVLA: Diffusion vision-language-action model with multimodal chain-of-thought.arXiv preprint arXiv:2509.25681, 2025. 1, 2, 3, 6
-
[69]
Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jin- ming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, and Feifei Feng. Diffusion-VLA: Generalizable and interpretable robot foundation model via self-generated reasoning.arXiv preprint arXiv:2412.03293,
-
[70]
Llada-vla: Vision language dif- fusion action models.arXiv preprint arXiv:2509.06932, 2025
Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, and Xiaoyan Sun. LLaDA-VLA: Vision language diffusion action models.arXiv preprint arXiv:2509.06932,
-
[71]
Fast- dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328,
Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dLLM v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025. 2, 3
-
[72]
Zhenyu Wu, Yuheng Zhou, Xiuwei Xu, Ziwei Wang, and Haibin Yan. MoManipVLA: Transferring vision-language- action models for general mobile manipulation.arXiv preprint arXiv:2503.13446, 2025. 1
-
[73]
Magma: A founda- tion model for multimodal ai agents
Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, and Jianfeng Gao. Magma: A foundation model for multimodal ai agents.arXiv preprint arXiv:2502.13130, 2025. 7
-
[74]
Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, Zixiao Huang, Mingjie Wei, Yuqing Xie, Ke Yang, Bo Dai, Zhexuan Xu, et al. RLinf: Flexible and effi- cient large-scale reinforcement learning via macro-to-micro 12 flow transformation.arXiv preprint arXiv:2509.15965,
-
[75]
Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhex- uan Xu, Zhihao Liu, et al. RLinf-VLA: A unified and ef- ficient framework for VLA+RL training.arXiv preprint arXiv:2510.06710, 2025. 1
-
[76]
Igniting vlms toward the embodied space.arXiv preprint arXiv:2509.11766, 2025
Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, Lucy Liang, Make Wang, Qian Wang, Roy Gan, Ryan Yu, Shalfun Li, Starrick Liu, Sylas Chen, Vincent Chen, and Zach Xu. Igniting vlms toward the embodied space.arXiv preprint arXiv:2509.11766, 2025. 2
-
[77]
Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, and Shuaicheng Liu. Flowpolicy: En- abling fast and robust 3d flow-based policy via consis- tency flow matching for robot manipulation.arXiv preprint arXiv:2412.04987, 2024. 2
-
[78]
CoT-VLA: Visual chain-of-thought rea- soning for vision-language-action models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. CoT-VLA: Visual chain-of-thought rea- soning for vision-language-action models. InCVPR, 2024. 1, 2, 4
work page 2024
-
[79]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[80]
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, and Xianyuan Zhan. X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language- action model.arXiv preprint arXiv:2510.10274, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[81]
TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daum ´e III, Andrey Kolobov, Furong Huang, and Jianwei Yang. TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. 6, 7
work page 2025
-
[82]
Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, and Haoang Li. FlowVLA: Vi- sual chain of thought-based motion reasoning for vision- language-action models.arXiv preprint arXiv:2508.18269,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.