pith. machine review for the scientific record.

arxiv: 2604.24622 · v2 · submitted 2026-04-27 · 💻 cs.CV · cs.AI

Recognition: unknown

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

Bin Qian, Fan Du, Fei Wang, Feng Yan, Heng Yang, Jianxiong Wu, Weinong Wang, Weiye Zhang, Xinrun Xu, Yu Guo, Zhihai He

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language-action policies · flow-based models · coarse-to-fine generation · efficient inference · robot action generation · endpoint velocity · low NFE

The pith

Restructuring flow-based action generation into coarse initialization from endpoint velocity and single-step refinement enables efficient high-performance VLA policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper is trying to establish that the inefficiency of multi-step sampling in flow-based vision-language-action policies can be addressed by changing the starting point rather than just shortening the trajectory. Specifically, a coarse stage learns to predict the posterior over endpoint velocity and uses it to turn Gaussian noise into a good action initialization, and a fine stage then refines that initialization in one step. This two-stage approach, trained in a stepwise manner, is claimed to deliver performance better than or equal to that of existing methods while using far fewer function evaluations. Experiments back this up with improved benchmark scores, much lower latency, and high success rates on real robots. This matters to anyone who wants to run expressive robot policies in real time without heavy compute.
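
To make the mechanics concrete, here is a minimal sketch of what NFE=2 sampling of this kind could look like. It assumes a linear flow-matching path, so the endpoint velocity is the straight-line displacement from the noise sample to the action; `coarse_net`, `refine_net`, the fixed refinement time, and the exact way the initialization is assembled are illustrative assumptions, not the paper's implementation.

```python
import torch

# Hypothetical modules standing in for the paper's two stages:
# coarse_net(obs, noise) -> (mu, logvar): Gaussian posterior over the endpoint velocity.
# refine_net(x, t, obs)  -> velocity field used for the single refinement step.

@torch.no_grad()
def cf_vla_sample(obs, coarse_net, refine_net, action_dim=7, horizon=16, t_refine=0.8):
    """Sketch of NFE=2 coarse-to-fine sampling: one coarse call, one refinement."""
    noise = torch.randn(1, horizon, action_dim)          # x_0 ~ N(0, I)

    # Coarse stage (1st function evaluation): sample an endpoint velocity.
    mu, logvar = coarse_net(obs, noise)
    v_end = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # Under a linear path x_t = (1 - t) x_0 + t x_1, the endpoint velocity is
    # v = x_1 - x_0, so an action-aware point at time t_refine can be assembled
    # directly from the predicted velocity (the "AP-guided" initialization).
    x_init = noise + t_refine * v_end

    # Fine stage (2nd function evaluation): one fixed-time corrective step.
    v_fix = refine_net(x_init, t_refine, obs)
    actions = x_init + (1.0 - t_refine) * v_fix
    return actions
```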

Core claim

The central claim is that a coarse-to-fine two-stage formulation restructures action generation: the coarse stage learns a conditional posterior over endpoint velocity to construct an action-aware starting point from Gaussian noise, while the fine stage performs a single fixed-time refinement to correct residual errors, yielding strong efficiency-performance trade-offs under low-NFE conditions.
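
Read as a training recipe, the claim implies two losses and a schedule (the warm-up and joint phases shown in Figure 2). The sketch below is a hedged reconstruction under standard conditional-flow-matching assumptions; the Gaussian-NLL form of the "KL-supervised" endpoint-velocity term, the proxy-input noise level, and the loss weighting are guesses, since the page gives none of these details.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_losses(obs, actions, coarse_net, refine_net, t_refine=0.8, phase_one=True):
    """One hypothetical training step for a coarse-to-fine flow policy.

    phase_one=True: warm-up, where the refinement branch sees a controlled proxy
    input built from the true endpoint velocity (never unreliable early coarse
    outputs). phase_one=False: joint optimization on sampled coarse outputs.
    """
    noise = torch.randn_like(actions)                      # x_0 ~ N(0, I)
    v_true = actions - noise                               # endpoint velocity for a linear path

    # Coarse stage: conditional Gaussian over the endpoint velocity, supervised
    # here with a Gaussian negative log-likelihood as a stand-in for the
    # paper's KL-supervised posterior.
    mu, logvar = coarse_net(obs, noise)
    loss_coarse = 0.5 * (logvar + (v_true - mu) ** 2 / logvar.exp()).mean()

    # Assemble the refinement input at the fixed time t_refine.
    if phase_one:
        v_init = v_true + 0.05 * torch.randn_like(v_true)  # controlled proxy input
    else:
        v_init = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    x_init = noise + t_refine * v_init

    # Fine stage: regress the velocity that carries x_init to the true action
    # over the remaining (1 - t_refine) of the trajectory.
    v_target = (actions - x_init) / (1.0 - t_refine)
    loss_fine = F.mse_loss(refine_net(x_init, t_refine, obs), v_target)

    return loss_coarse + loss_fine
```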

What carries the argument

Coarse-to-fine two-stage action generation with conditional posterior over endpoint velocity for initialization.

If this is right

  • Consistently outperforms existing NFE=2 methods on CALVIN and LIBERO.
  • Matches or surpasses NFE=10 π0.5 baseline on several metrics.
  • Reduces action sampling latency by 75.4%.
  • Achieves best average real-robot success rate of 83.0%.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach highlights the importance of good initialization in generative models, which could inspire similar strategies in other domains like image or video generation.
  • Stepwise training from coarse to joint optimization might be useful for stabilizing other complex generative training processes.
  • If the coarse predictor can be made deterministic or faster, it may allow even lower latency in edge robotics.

Load-bearing premise

The initialization produced by the coarse-stage posterior over endpoint velocity is close enough to the target action distribution that one refinement step suffices to correct residual errors.
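
This premise is checkable before any benchmark is run: collect coarse initializations and ground-truth actions on a held-out set and inspect the residual distribution, which is essentially the evidence the referee report below asks for. The sketch assumes access to paired arrays of coarse initializations and target actions; the metric choices (per-dimension histograms and 1-D Wasserstein distances) are illustrative and would not by themselves capture multimodal coverage.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def coarse_init_diagnostics(coarse_inits, target_actions):
    """Sketch of initialization-quality diagnostics.

    coarse_inits, target_actions: arrays of shape (N, action_dim) collected by
    running a (hypothetical) coarse stage over a held-out set. Returns
    per-dimension residual statistics plus a marginal Wasserstein distance.
    """
    residuals = coarse_inits - target_actions
    report = {}
    for d in range(residuals.shape[1]):
        report[d] = {
            "residual_mean": float(residuals[:, d].mean()),
            "residual_std": float(residuals[:, d].std()),
            # Per-dimension residual histogram (counts, bin edges).
            "residual_hist": np.histogram(residuals[:, d], bins=50),
            # Marginal distributional gap between initializations and targets.
            "wasserstein_1d": wasserstein_distance(coarse_inits[:, d], target_actions[:, d]),
        }
    return report
```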

What would settle it

If, on the CALVIN or LIBERO benchmarks, the method at NFE=2 fails to outperform other NFE=2 methods or to match the NFE=10 baseline, the central claim would be falsified.
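
Whether "does not outperform" is meaningful also depends on how many rollouts back each number, a point the referee raises below. As a minimal sketch, a two-proportion z-test shows how wide the uncertainty is at plausible trial counts; the 100-rollout figure is hypothetical, since the page does not report trial counts.

```python
import numpy as np
from scipy.stats import norm

def success_rate_comparison(successes_a, trials_a, successes_b, trials_b):
    """Two-proportion z-test sketch for comparing per-task success rates.

    Given rollout counts for an NFE=2 method (a) and a baseline (b), report the
    success-rate gap, the z statistic, and a one-sided p-value for "a is better".
    """
    p_a, p_b = successes_a / trials_a, successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    z = (p_a - p_b) / se
    return {"gap": p_a - p_b, "z": z, "p_one_sided": float(1 - norm.cdf(z))}

# With a hypothetical 100 rollouts per method, an 83% vs 79% gap (the reported
# real-robot numbers) is not statistically distinguishable at conventional thresholds.
print(success_rate_comparison(83, 100, 79, 100))
```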

Figures

Figures reproduced from arXiv: 2604.24622 by Bin Qian, Fan Du, Fei Wang, Feng Yan, Heng Yang, Jianxiong Wu, Weinong Wang, Weiye Zhang, Xinrun Xu, Yu Guo, Zhihai He.

Figure 1
Figure 1: Teaser of CF-VLA. Standard flow matching requires multiple iterative steps to recover action structure from uninformative Gaussian noise. CF-VLA instead adopts a coarse-to-fine two-step process: a coarse stage constructs an action-prior-guided (AP-guided) noise initialization, followed by a single-step refinement. This design achieves a stronger efficiency–performance frontier across CALVIN, LIBERO, and re… view at source ↗
Figure 2
Figure 2: Overview of CF-VLA. CF-VLA adopts a two-phase training strategy. Phase I is a stability-oriented warm-up stage that shapes endpoint velocity and variance prediction and trains the refinement branch on a controlled proxy input, avoiding unreliable coarse outputs at the beginning of optimization. Phase II then performs full joint optimization of the final coarse-to-fine mechanism: KL-supervised endpoint post… view at source ↗
Figure 3
Figure 3: Geometric view of CF-VLA. Standard flow matching starts from pure Gaussian noise, forcing early steps to spend computation on global transport toward the task-conditioned action manifold. CF-VLA instead first builds an AP-guided initialization distribution with a KL-supervised coarse stage, then applies a single refinement stage to recover the ground-truth action. iterative transport at every rollout. To a… view at source ↗
Figure 4
Figure 4: Latency–performance trade-off on LIBERO. We compare average success rate and action sampling latency across methods with different numbers of function evaluations (NFEs). CF-VLA attains a stronger low-NFE operating point, achieving 96.5 average success at 7.81 ms with two function evaluations, compared with 95.7 at 29.17 ms for the reproduced NFE=10 𝜋0.5 baseline. This trend supports our core hypothesis: s… view at source ↗
Figure 5
Figure 5: Real-robot results on five representative manipulation tasks. The top panel shows representative task snapshots, and the bottom panel compares success rates of MIP, 𝜋0.5, and CF-VLA. CF-VLA achieves the best average success rate of 83.0% across five tasks, outperforming MIP (63.5%) by 19.5 points and 𝜋0.5 (79.0%) by 4.0 points. … view at source ↗
read the original abstract

Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $\pi_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and $\pi_{0.5}$ by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.
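
For contrast with the coarse-to-fine sketch above, this is what the multi-step baseline the abstract describes typically looks like: plain Euler integration of a learned velocity field from Gaussian noise, where every step costs one network evaluation. The velocity network and step schedule are placeholders, not the 𝜋0.5 baseline's actual implementation.

```python
import torch

@torch.no_grad()
def flow_matching_sample(obs, velocity_net, action_dim=7, horizon=16, nfe=10):
    """Standard flow-matching sampling by Euler integration from pure noise.

    Every Euler step is one evaluation of the velocity network, so latency
    scales linearly with `nfe` (the regime the NFE=2 scheme targets).
    """
    x = torch.randn(1, horizon, action_dim)   # start from uninformative noise
    dt = 1.0 / nfe
    t = 0.0
    for _ in range(nfe):
        v = velocity_net(x, t, obs)           # one function evaluation
        x = x + dt * v                        # Euler step along the flow
        t += dt
    return x                                  # predicted action chunk at t = 1
```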

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CF-VLA, a coarse-to-fine two-stage formulation for flow-based vision-language-action policies. The coarse stage learns a conditional posterior over endpoint velocity to produce a structured action-aware initialization from Gaussian noise; the fine stage then applies a single fixed-time refinement step. Training uses a stepwise strategy (first coarse predictor, then joint optimization). Experiments on CALVIN and LIBERO benchmarks and real-robot tasks claim consistent outperformance of existing NFE=2 methods, parity or better with NFE=10 π_{0.5} baselines on several metrics, 75.4% reduction in action sampling latency, and the highest average real-robot success rate of 83.0%.

Significance. If the core assumption holds, the work could meaningfully advance real-time deployment of expressive flow-based VLA models by restructuring the sampling trajectory rather than simply truncating it. The stepwise training procedure offers a practical stabilization technique, and the public code release supports reproducibility. The reported gains on standard benchmarks and real robots would be notable if accompanied by stronger validation of the initialization quality.

major comments (3)
  1. [Section 3.2] Section 3.2 (Coarse Stage): The efficiency claims rest on the unverified premise that the learned conditional posterior over endpoint velocity produces an initialization sufficiently close to the target action distribution for a single fixed-time refinement step to correct residuals reliably. No supporting analysis (e.g., Wasserstein distances, per-dimension residual histograms, or mode-coverage metrics) is provided for high-dimensional multimodal robot action spaces; this is load-bearing for the NFE=2 performance and 75.4% latency reduction assertions.
  2. [Section 4] Section 4 (Experiments): Reported quantitative improvements on CALVIN and LIBERO (outperformance of NFE=2 methods and matching NFE=10 baselines) lack error bars, ablation studies isolating the coarse initialization and stepwise training, and statistical significance tests. This weakens confidence in the consistency of the gains and the claim that the method establishes a strong efficiency-performance frontier.
  3. [Section 4.3] Section 4.3 (Real-Robot Evaluation): The 83.0% average success rate (outperforming MIP by 19.5 points and π_{0.5} by 4.0 points) is presented without trial counts, variance estimates, or detailed task-variation protocols, which is necessary to substantiate the real-world applicability claim.
minor comments (2)
  1. [Abstract] Abstract: The baseline notation 'NFE=10 π_{0.5}' should be defined on first use and rendered consistently in math mode.
  2. [Section 3] The manuscript would benefit from an additional figure in Section 3 showing example coarse initializations versus target actions to illustrate the refinement step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Section 3.2] Section 3.2 (Coarse Stage): The efficiency claims rest on the unverified premise that the learned conditional posterior over endpoint velocity produces an initialization sufficiently close to the target action distribution for a single fixed-time refinement step to correct residuals reliably. No supporting analysis (e.g., Wasserstein distances, per-dimension residual histograms, or mode-coverage metrics) is provided for high-dimensional multimodal robot action spaces; this is load-bearing for the NFE=2 performance and 75.4% latency reduction assertions.

    Authors: We agree that direct quantitative analysis of initialization quality would provide stronger support. The reported benchmark gains offer indirect evidence of effective initialization, but we will add supporting material in the revision, including per-dimension residual histograms after the coarse stage and visualizations of coarse-stage action trajectories, to better substantiate the premise. revision: partial

  2. Referee: [Section 4] Section 4 (Experiments): Reported quantitative improvements on CALVIN and LIBERO (outperformance of NFE=2 methods and matching NFE=10 baselines) lack error bars, ablation studies isolating the coarse initialization and stepwise training, and statistical significance tests. This weakens confidence in the consistency of the gains and the claim that the method establishes a strong efficiency-performance frontier.

    Authors: We acknowledge these omissions reduce robustness. In the revised manuscript we will include error bars on all metrics, add ablations that isolate the coarse initialization and stepwise training components, and report statistical significance tests for the key comparisons. revision: yes

  3. Referee: [Section 4.3] Section 4.3 (Real-Robot Evaluation): The 83.0% average success rate (outperforming MIP by 19.5 points and π_{0.5} by 4.0 points) is presented without trial counts, variance estimates, or detailed task-variation protocols, which is necessary to substantiate the real-world applicability claim.

    Authors: We will expand Section 4.3 to report the exact trial counts per task, include variance or standard deviation across trials, and provide a clearer description of the task-variation protocols and evaluation setup. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

The paper introduces CF-VLA as a new two-stage coarse-to-fine architecture for flow-based VLA policies: a coarse stage that learns a conditional posterior over endpoint velocity to produce a structured initialization from noise, followed by a single fixed-time refinement step. Training stabilization via stepwise optimization is a standard technique and does not equate any claimed performance metric (e.g., latency reduction or success rate) to a fitted parameter or input by construction. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would reduce the efficiency-quality frontier or benchmark results to tautological redefinitions of the inputs. Results are presented as empirical outcomes on external datasets (CALVIN, LIBERO) and real-robot tasks, with no load-bearing step that collapses to self-referential fitting or renaming of known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the approach relies on standard assumptions of conditional generative modeling and supervised training on robot datasets.

pith-pipeline@v0.9.0 · 5635 in / 963 out tokens · 18447 ms · 2026-05-08T04:30:50.895633+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

58 extracted references · 49 canonical work pages · 15 internal anchors

  1. [1]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky

  2. [2]

    𝜋0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164 [cs.LG] https://arxiv.org/abs/2410.24164

  3. [3]

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov...

  4. [4]

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. 2025. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. In Robotics: Science and Systems (RSS). https://arxiv.org/abs/2505.06111

  5. [5]

    Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud

  6. [6]

    Neural Ordinary Differential Equations. arXiv:1806.07366 [cs.LG] https://arxiv.org/abs/1806.07366

  7. [7]

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. 2024. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv:2303.04137 [cs.RO] https://arxiv.org/abs/2303.04137

  8. [8]

    Samarth Chopra, Alex McMoil, Ben Carnovale, Evan Sokolson, Rajkumar Kubendran, and Samuel Dickerson. 2025. EveryDayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation. arXiv:2511.05397 [cs.RO] https://arxiv.org/abs/2511.05397

  9. [9]

    Shichao Fan, Quantao Yang, Yajie Liu, Kun Wu, Zhengping Che, Qingjie Liu, and Min Wan. 2025. Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation. arXiv:2502.10040 [cs.RO] doi:10.1109/LRA.2025.3619794

  10. [10]

    Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, and Yuntao Chen. 2025. Diffusion Transformer Policy. arXiv:2410.15959 [cs.RO] https://arxiv.org/abs/2410.15959

  11. [11]

    Yiyang Huang, Yuhui Hao, Bo Yu, Feng Yan, Yuxin Yang, Feng Min, Yinhe Han, Lin Ma, Shaoshan Liu, Qiang Liu, and Yiming Gan. 2025. Dadu-Corki: Algorithm-Architecture Co-Design for Embodied AI-powered Robotic Manipulation. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (SIGARCH ’25). ACM, 327–343. doi:10.1145/3695053.3731099

  12. [12]

    Yuhang Huang, Jiazhao Zhang, Shilong Zou, Xinwang Liu, Ruizhen Hu, and Kai Xu. 2025. LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation. arXiv:2505.11528 [cs.RO] https://arxiv.org/abs/2505.11528

  13. [13]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Per...

  14. [14]

    𝜋0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv:2504.16054 [cs.LG] https://arxiv.org/abs/2504.16054

  15. [15]

    Yina Jian, Di Tian, Xuan-Jing Chen, Zhen-Yuan Wei, Chen-Wei Liang, and Mu-Jiang-Shan Wang. 2026. PI-VLA: Adaptive Symmetry-Aware Decision-Making for Long-Horizon Vision–Language–Action Manipulation. Symmetry 18, 3 (2026). doi:10.3390/sym18030394

  16. [16]

    Xuhui Kang and Yen-Ling Kuo. 2024. Incorporating Task Progress Knowledge for Subgoal Generation in Robotic Manipulation through Image Edits. arXiv:2410.11013 [cs.RO] https://arxiv.org/abs/2410.11013

  17. [17]

    Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu

  18. [18]

    Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications. IEEE Access 13 (2025), 162467–162504. doi:10.1109/ACCESS.2025.3609980

  19. [19]

    Moo Jin Kim, Chelsea Finn, and Percy Liang. 2025. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. arXiv:2502.19645 [cs.RO] https://arxiv.org/abs/2502.19645

  20. [20]

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. 2026. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning. arXiv:2601.16163 [cs.AI] https://arxiv.org/abs/2601.16163

  21. [21]

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. 2024. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246 [cs.RO] h...

  22. [22]

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, and Ranjay Krishna. 2025. MolmoAct: Action Reasoning Models that can Reason in Space. arXiv:2508.07917 [cs.RO] https://arxiv....

  23. [23]

    Yonghyeon Lee, Byeongho Lee, Seungyeon Kim, and Frank C. Park. 2024. Motion Manifold Flow Primitives for Task-Conditioned Trajectory Generation under Complex Task-Motion Dependencies. arXiv:2407.19681 [cs.RO] https://arxiv.org/abs/2407.19681

  24. [24]

    Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, and Jiangmiao Pang. 2025. CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling. arXiv:2506.19816 [cs.RO] https://arxiv.org/abs/2506.19816

  25. [25]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. 2023. Flow Matching for Generative Modeling. arXiv:2210.02747 [cs.LG] https://arxiv.org/abs/2210.02747

  26. [26]

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. 2023. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. arXiv:2306.03310 [cs.AI] https://arxiv.org/abs/2306.03310

  27. [27]

    Fanfan Liu, Feng Yan, Liming Zheng, Chengjian Feng, Yiyang Huang, and Lin Ma

  28. [28]

    RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation. arXiv:2406.18977 [cs.RO] https://arxiv.org/abs/2406.18977

  29. [29]

    Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji, Yuanhe Guo, Shenji Wan, and Hongyi Wen. 2025. From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity. arXiv:2512.02826 [cs.LG] https://arxiv.org/abs/2512.02826

  30. [30]

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. 2022. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. arXiv:2206.00927 [cs.LG] https://arxiv.org/abs/2206.00927

  31. [31]

    Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. 2023. Grounding Language with Visual Affordances over Unstructured Data. arXiv:2210.01911 [cs.RO] https://arxiv.org/abs/2210.01911

  32. [32]

    Oier Mees, Lukas Hermann, and Wolfram Burgard. 2022. What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data. arXiv:2204.06252 [cs.RO] https://arxiv.org/abs/2204.06252

  33. [33]

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. 2022. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks. arXiv:2112.03227 [cs.RO] https://arxiv.org/abs/2112.03227

  34. [34]

    Chaoyi Pan, Giri Anantharaman, Nai-Chieh Huang, Claire Jin, Daniel Pfrommer, Chenyang Yuan, Frank Permenter, Guannan Qu, Nicholas Boffi, Guanya Shi, and Max Simchowitz. 2026. Much Ado About Noising: Dispelling the Myths of Generative Robotic Control. arXiv:2512.01809 [cs.RO] https://arxiv.org/abs/2512.01809

  35. [35]

    Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, and Rudolf Lioutikov. 2025. FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies. arXiv:2509.04996 [cs.RO] https://arxiv.org/abs/2509.04996

  36. [36]

    Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov

  37. [37]

    Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals. arXiv:2407.05996 [cs.RO] https://arxiv.org/abs/2407.05996

  38. [38]

    Tim Salimans and Jonathan Ho. 2022. Progressive Distillation for Fast Sampling of Diffusion Models. arXiv:2202.00512 [cs.LG] https://arxiv.org/abs/2202.00512

  39. [39]

    Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, and Liqiang Nie. 2025. Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey. arXiv:2508.13073 [cs.RO] https://arxiv.org/abs/2508.13073

  40. [40]

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. 2026. MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation. arXiv:2508.19236 [cs.RO] https://arxiv.org/abs/2508.19236

  41. [41]

    Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Nikita Lyubaykin, Andrei Polubarov, Alexander Derevyagin, and Vladislav Kurenkov

  42. [42]

    NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows. arXiv:2508.16845 [cs.CV] https://arxiv.org/abs/2508.16845

  43. [43]

    Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. 2025. Unified Vision-Language-Action Model. arXiv:2506.19850 [cs.CV] https://arxiv.org/abs/2506.19850

  44. [44]

    Rosa Wolf, Yitian Shi, Sheng Liu, and Rania Rayyes. 2025. Diffusion Models for Robotic Manipulation: A Survey. arXiv:2504.08438 [cs.RO] https://arxiv.org/abs/2504.08438

  45. [45]

    Yiming Wu, Huan Wang, Zhenghao Chen, Jianxin Pang, and Dong Xu. 2025. On-Device Diffusion Transformer Policy for Efficient Robot Manipulation. arXiv:2508.00697 [cs.RO] https://arxiv.org/abs/2508.00697

  46. [46]

    Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, and Chang Xu

  47. [47]

    VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching. arXiv:2502.02175 [cs.RO] https://arxiv.org/abs/2502.02175

  48. [48]

    Feng Yan, Fanfan Liu, Liming Zheng, Yufeng Zhong, Yiyang Huang, Zechao Guan, Chengjian Feng, and Lin Ma. 2025. RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation. arXiv:2412.07215 [cs.RO] https://arxiv.org/abs/2412.07215

  49. [49]

    Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, and Jiangmiao Pang. 2026. InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation. arXiv:2507.17520 [cs.RO] https://arxiv.org/abs/2507.17520

  50. [50]

    Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, Feng Xiong, Xing Wei, Zhiheng Ma, and Mu Xu. 2026. ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning. arXiv:2602.11236 [cs.CV] https://arxiv.org/abs/2602.11236

  51. [51]

    Zebin Yang, Yijiahao Qi, Tong Xie, Bo Yu, Shaoshan Liu, and Meng Li. 2026. DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation. arXiv:2602.22896 [cs.RO] https://arxiv.org/abs/2602.22896

  52. [52]

    Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Zheng Wang, Lianli Gao, Jingkuan Song, Nicu Sebe, and Heng Tao Shen. 2026. A Survey on Efficient Vision-Language-Action Models. arXiv:2510.24795 [cs.CV] https://arxiv.org/abs/2510.24795

  53. [53]

    Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, and Gao Huang. 2024. DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution. arXiv:2411.02359 [cs.RO] https://arxiv.org/abs/2411.02359

  54. [54]

    Edwin Zhang, Yujie Lu, Shinda Huang, William Wang, and Amy Zhang. 2024. Language Control Diffusion: Efficiently Scaling through Space, Time, and Tasks. arXiv:2210.15629 [cs.LG] https://arxiv.org/abs/2210.15629

  55. [55]

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. 2025. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge. arXiv:2507.04447 [cs.CV] https://arxiv.org/abs/2507.04447

  56. [56]

    Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan

  57. [57]

    Universal Actions for Enhanced Embodied Foundation Models. arXiv:2501.10105 [cs.RO] https://arxiv.org/abs/2501.10105

  58. [58]

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, and Xianyuan Zhan. 2025. X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model. arXiv:2510.10274 [cs.RO] https://arxiv.org/abs/2510.10274 ...