pith. machine review for the scientific record.

arxiv: 2604.24622 · v2 · submitted 2026-04-27 · 💻 cs.CV · cs.AI

Recognition: unknown

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

Bin Qian, Fan Du, Fei Wang, Feng Yan, Heng Yang, Jianxiong Wu, Weinong Wang, Weiye Zhang, Xinrun Xu, Yu Guo, Zhihai He

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language-action policies · flow-based models · coarse-to-fine generation · efficient inference · robot action generation · endpoint velocity · low NFE

The pith

Restructuring flow-based action generation into coarse initialization from endpoint velocity and single-step refinement enables efficient high-performance VLA policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper is trying to establish that the inefficiency of multi-step sampling in flow-based vision-language-action policies can be addressed by changing the starting point rather than just shortening the trajectory. Specifically, a coarse stage learns to predict the posterior over endpoint velocity and uses it to turn Gaussian noise into a good action initialization, and a fine stage then refines that initialization in one step. This two-stage approach, trained in a stepwise manner, is claimed to deliver performance better than or equal to that of existing methods while using far fewer function evaluations. Experiments back this up with improved benchmark scores, much lower latency, and high success rates on real robots. This matters to anyone who wants to run expressive robot policies in real time without heavy compute.
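
To make the mechanics concrete, here is a minimal sketch of what NFE=2 sampling of this kind could look like. It assumes a linear flow-matching path, so the endpoint velocity is the straight-line displacement from the noise sample to the action; `coarse_net`, `refine_net`, the fixed refinement time, and the exact way the initialization is assembled are illustrative assumptions, not the paper's implementation.

```python
import torch

# Hypothetical modules standing in for the paper's two stages:
# coarse_net(obs, noise) -> (mu, logvar): Gaussian posterior over the endpoint velocity.
# refine_net(x, t, obs)  -> velocity field used for the single refinement step.

@torch.no_grad()
def cf_vla_sample(obs, coarse_net, refine_net, action_dim=7, horizon=16, t_refine=0.8):
    """Sketch of NFE=2 coarse-to-fine sampling: one coarse call, one refinement."""
    noise = torch.randn(1, horizon, action_dim)          # x_0 ~ N(0, I)

    # Coarse stage (1st function evaluation): sample an endpoint velocity.
    mu, logvar = coarse_net(obs, noise)
    v_end = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # Under a linear path x_t = (1 - t) x_0 + t x_1, the endpoint velocity is
    # v = x_1 - x_0, so an action-aware point at time t_refine can be assembled
    # directly from the predicted velocity (the "AP-guided" initialization).
    x_init = noise + t_refine * v_end

    # Fine stage (2nd function evaluation): one fixed-time corrective step.
    v_fix = refine_net(x_init, t_refine, obs)
    actions = x_init + (1.0 - t_refine) * v_fix
    return actions
```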

Core claim

The central claim is that a coarse-to-fine two-stage formulation restructures action generation: the coarse stage learns a conditional posterior over endpoint velocity to construct an action-aware starting point from Gaussian noise, while the fine stage performs a single fixed-time refinement to correct residual errors, yielding strong efficiency-performance trade-offs under low-NFE conditions.
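
Read as a training recipe, the claim implies two losses and a schedule (the warm-up and joint phases shown in Figure 2). The sketch below is a hedged reconstruction under standard conditional-flow-matching assumptions; the Gaussian-NLL form of the "KL-supervised" endpoint-velocity term, the proxy-input noise level, and the loss weighting are guesses, since the page gives none of these details.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_losses(obs, actions, coarse_net, refine_net, t_refine=0.8, phase_one=True):
    """One hypothetical training step for a coarse-to-fine flow policy.

    phase_one=True: warm-up, where the refinement branch sees a controlled proxy
    input built from the true endpoint velocity (never unreliable early coarse
    outputs). phase_one=False: joint optimization on sampled coarse outputs.
    """
    noise = torch.randn_like(actions)                      # x_0 ~ N(0, I)
    v_true = actions - noise                               # endpoint velocity for a linear path

    # Coarse stage: conditional Gaussian over the endpoint velocity, supervised
    # here with a Gaussian negative log-likelihood as a stand-in for the
    # paper's KL-supervised posterior.
    mu, logvar = coarse_net(obs, noise)
    loss_coarse = 0.5 * (logvar + (v_true - mu) ** 2 / logvar.exp()).mean()

    # Assemble the refinement input at the fixed time t_refine.
    if phase_one:
        v_init = v_true + 0.05 * torch.randn_like(v_true)  # controlled proxy input
    else:
        v_init = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    x_init = noise + t_refine * v_init

    # Fine stage: regress the velocity that carries x_init to the true action
    # over the remaining (1 - t_refine) of the trajectory.
    v_target = (actions - x_init) / (1.0 - t_refine)
    loss_fine = F.mse_loss(refine_net(x_init, t_refine, obs), v_target)

    return loss_coarse + loss_fine
```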

What carries the argument

Coarse-to-fine two-stage action generation with conditional posterior over endpoint velocity for initialization.

If this is right

  • Consistently outperforms existing NFE=2 methods on CALVIN and LIBERO.
  • Matches or surpasses NFE=10 π0.5 baseline on several metrics.
  • Reduces action sampling latency by 75.4%.
  • Achieves best average real-robot success rate of 83.0%.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach highlights the importance of good initialization in generative models, which could inspire similar strategies in other domains like image or video generation.
  • Stepwise training from coarse to joint optimization might be useful for stabilizing other complex generative training processes.
  • If the coarse predictor can be made deterministic or faster, it may allow even lower latency in edge robotics.

Load-bearing premise

The initialization produced by the coarse-stage posterior over endpoint velocity is close enough to the target action distribution that one refinement step suffices to correct residual errors.
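
This premise is checkable before any benchmark is run: collect coarse initializations and ground-truth actions on a held-out set and inspect the residual distribution, which is essentially the evidence the referee report below asks for. The sketch assumes access to paired arrays of coarse initializations and target actions; the metric choices (per-dimension histograms and 1-D Wasserstein distances) are illustrative and would not by themselves capture multimodal coverage.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def coarse_init_diagnostics(coarse_inits, target_actions):
    """Sketch of initialization-quality diagnostics.

    coarse_inits, target_actions: arrays of shape (N, action_dim) collected by
    running a (hypothetical) coarse stage over a held-out set. Returns
    per-dimension residual statistics plus a marginal Wasserstein distance.
    """
    residuals = coarse_inits - target_actions
    report = {}
    for d in range(residuals.shape[1]):
        report[d] = {
            "residual_mean": float(residuals[:, d].mean()),
            "residual_std": float(residuals[:, d].std()),
            # Per-dimension residual histogram (counts, bin edges).
            "residual_hist": np.histogram(residuals[:, d], bins=50),
            # Marginal distributional gap between initializations and targets.
            "wasserstein_1d": wasserstein_distance(coarse_inits[:, d], target_actions[:, d]),
        }
    return report
```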

What would settle it

If, on the CALVIN or LIBERO benchmarks, the method at NFE=2 fails to outperform other NFE=2 methods or to match the NFE=10 baseline, the central claim would be falsified.
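
Whether "does not outperform" is meaningful also depends on how many rollouts back each number, a point the referee raises below. As a minimal sketch, a two-proportion z-test shows how wide the uncertainty is at plausible trial counts; the 100-rollout figure is hypothetical, since the page does not report trial counts.

```python
import numpy as np
from scipy.stats import norm

def success_rate_comparison(successes_a, trials_a, successes_b, trials_b):
    """Two-proportion z-test sketch for comparing per-task success rates.

    Given rollout counts for an NFE=2 method (a) and a baseline (b), report the
    success-rate gap, the z statistic, and a one-sided p-value for "a is better".
    """
    p_a, p_b = successes_a / trials_a, successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    z = (p_a - p_b) / se
    return {"gap": p_a - p_b, "z": z, "p_one_sided": float(1 - norm.cdf(z))}

# With a hypothetical 100 rollouts per method, an 83% vs 79% gap (the reported
# real-robot numbers) is not statistically distinguishable at conventional thresholds.
print(success_rate_comparison(83, 100, 79, 100))
```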

Figures

Figures reproduced from arXiv: 2604.24622 by Bin Qian, Fan Du, Fei Wang, Feng Yan, Heng Yang, Jianxiong Wu, Weinong Wang, Weiye Zhang, Xinrun Xu, Yu Guo, Zhihai He.

Figure 1
Figure 1: Teaser of CF-VLA. Standard flow matching requires multiple iterative steps to recover action structure from uninformative Gaussian noise. CF-VLA instead adopts a coarse-to-fine two-step process: a coarse stage constructs an action-prior-guided (AP-guided) noise initialization, followed by a single-step refinement. This design achieves a stronger efficiency–performance frontier across CALVIN, LIBERO, and re… view at source ↗
Figure 2
Figure 2: Overview of CF-VLA. CF-VLA adopts a two-phase training strategy. Phase I is a stability-oriented warm-up stage that shapes endpoint velocity and variance prediction and trains the refinement branch on a controlled proxy input, avoiding unreliable coarse outputs at the beginning of optimization. Phase II then performs full joint optimization of the final coarse-to-fine mechanism: KL-supervised endpoint post… view at source ↗
Figure 3
Figure 3: Geometric view of CF-VLA. Standard flow matching starts from pure Gaussian noise, forcing early steps to spend computation on global transport toward the task-conditioned action manifold. CF-VLA instead first builds an AP-guided initialization distribution with a KL-supervised coarse stage, then applies a single refinement stage to recover the ground-truth action. iterative transport at every rollout. To a… view at source ↗
Figure 4
Figure 4: Latency–performance trade-off on LIBERO. We compare average success rate and action sampling latency across methods with different numbers of function evaluations (NFEs). CF-VLA attains a stronger low-NFE operating point, achieving 96.5 average success at 7.81 ms with two function evaluations, compared with 95.7 at 29.17 ms for the reproduced NFE=10 𝜋0.5 baseline. This trend supports our core hypothesis: s… view at source ↗
Figure 5
Figure 5: Real-robot results on five representative manipulation tasks. The top panel shows representative task snapshots, and the bottom panel compares success rates of MIP, 𝜋0.5, and CF-VLA. CF-VLA achieves the best average success rate of 83.0% across five tasks, outperforming MIP (63.5%) by 19.5 points and 𝜋0.5 (79.0%) by 4.0 points. … view at source ↗
read the original abstract

Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $\pi_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and $\pi_{0.5}$ by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.
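
For contrast with the coarse-to-fine sketch above, this is what the multi-step baseline the abstract describes typically looks like: plain Euler integration of a learned velocity field from Gaussian noise, where every step costs one network evaluation. The velocity network and step schedule are placeholders, not the 𝜋0.5 baseline's actual implementation.

```python
import torch

@torch.no_grad()
def flow_matching_sample(obs, velocity_net, action_dim=7, horizon=16, nfe=10):
    """Standard flow-matching sampling by Euler integration from pure noise.

    Every Euler step is one evaluation of the velocity network, so latency
    scales linearly with `nfe` (the regime the NFE=2 scheme targets).
    """
    x = torch.randn(1, horizon, action_dim)   # start from uninformative noise
    dt = 1.0 / nfe
    t = 0.0
    for _ in range(nfe):
        v = velocity_net(x, t, obs)           # one function evaluation
        x = x + dt * v                        # Euler step along the flow
        t += dt
    return x                                  # predicted action chunk at t = 1
```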

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CF-VLA, a coarse-to-fine two-stage formulation for flow-based vision-language-action policies. The coarse stage learns a conditional posterior over endpoint velocity to produce a structured action-aware initialization from Gaussian noise; the fine stage then applies a single fixed-time refinement step. Training uses a stepwise strategy (first coarse predictor, then joint optimization). Experiments on CALVIN and LIBERO benchmarks and real-robot tasks claim consistent outperformance of existing NFE=2 methods, parity or better with NFE=10 π_{0.5} baselines on several metrics, 75.4% reduction in action sampling latency, and the highest average real-robot success rate of 83.0%.

Significance. If the core assumption holds, the work could meaningfully advance real-time deployment of expressive flow-based VLA models by restructuring the sampling trajectory rather than simply truncating it. The stepwise training procedure offers a practical stabilization technique, and the public code release supports reproducibility. The reported gains on standard benchmarks and real robots would be notable if accompanied by stronger validation of the initialization quality.

major comments (3)
  1. [Section 3.2] Section 3.2 (Coarse Stage): The efficiency claims rest on the unverified premise that the learned conditional posterior over endpoint velocity produces an initialization sufficiently close to the target action distribution for a single fixed-time refinement step to correct residuals reliably. No supporting analysis (e.g., Wasserstein distances, per-dimension residual histograms, or mode-coverage metrics) is provided for high-dimensional multimodal robot action spaces; this is load-bearing for the NFE=2 performance and 75.4% latency reduction assertions.
  2. [Section 4] Section 4 (Experiments): Reported quantitative improvements on CALVIN and LIBERO (outperformance of NFE=2 methods and matching NFE=10 baselines) lack error bars, ablation studies isolating the coarse initialization and stepwise training, and statistical significance tests. This weakens confidence in the consistency of the gains and the claim that the method establishes a strong efficiency-performance frontier.
  3. [Section 4.3] Section 4.3 (Real-Robot Evaluation): The 83.0% average success rate (outperforming MIP by 19.5 points and π_{0.5} by 4.0 points) is presented without trial counts, variance estimates, or detailed task-variation protocols, which is necessary to substantiate the real-world applicability claim.
minor comments (2)
  1. [Abstract] Abstract: The baseline notation 'NFE=10 π_{0.5}' should be defined on first use and rendered consistently in math mode.
  2. [Section 3] The manuscript would benefit from an additional figure in Section 3 showing example coarse initializations versus target actions to illustrate the refinement step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Section 3.2] Section 3.2 (Coarse Stage): The efficiency claims rest on the unverified premise that the learned conditional posterior over endpoint velocity produces an initialization sufficiently close to the target action distribution for a single fixed-time refinement step to correct residuals reliably. No supporting analysis (e.g., Wasserstein distances, per-dimension residual histograms, or mode-coverage metrics) is provided for high-dimensional multimodal robot action spaces; this is load-bearing for the NFE=2 performance and 75.4% latency reduction assertions.

    Authors: We agree that direct quantitative analysis of initialization quality would provide stronger support. The reported benchmark gains offer indirect evidence of effective initialization, but we will add supporting material in the revision, including per-dimension residual histograms after the coarse stage and visualizations of coarse-stage action trajectories, to better substantiate the premise. revision: partial

  2. Referee: [Section 4] Section 4 (Experiments): Reported quantitative improvements on CALVIN and LIBERO (outperformance of NFE=2 methods and matching NFE=10 baselines) lack error bars, ablation studies isolating the coarse initialization and stepwise training, and statistical significance tests. This weakens confidence in the consistency of the gains and the claim that the method establishes a strong efficiency-performance frontier.

    Authors: We acknowledge these omissions reduce robustness. In the revised manuscript we will include error bars on all metrics, add ablations that isolate the coarse initialization and stepwise training components, and report statistical significance tests for the key comparisons. revision: yes

  3. Referee: [Section 4.3] Section 4.3 (Real-Robot Evaluation): The 83.0% average success rate (outperforming MIP by 19.5 points and π_{0.5} by 4.0 points) is presented without trial counts, variance estimates, or detailed task-variation protocols, which is necessary to substantiate the real-world applicability claim.

    Authors: We will expand Section 4.3 to report the exact trial counts per task, include variance or standard deviation across trials, and provide a clearer description of the task-variation protocols and evaluation setup. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

The paper introduces CF-VLA as a new two-stage coarse-to-fine architecture for flow-based VLA policies: a coarse stage that learns a conditional posterior over endpoint velocity to produce a structured initialization from noise, followed by a single fixed-time refinement step. Training stabilization via stepwise optimization is a standard technique and does not equate any claimed performance metric (e.g., latency reduction or success rate) to a fitted parameter or input by construction. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would reduce the efficiency-quality frontier or benchmark results to tautological redefinitions of the inputs. Results are presented as empirical outcomes on external datasets (CALVIN, LIBERO) and real-robot tasks, with no load-bearing step that collapses to self-referential fitting or renaming of known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the approach relies on standard assumptions of conditional generative modeling and supervised training on robot datasets.

pith-pipeline@v0.9.0 · 5635 in / 963 out tokens · 18447 ms · 2026-05-08T04:30:50.895633+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

58 extracted references · 49 canonical work pages · 15 internal anchors

  1. [1]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky

  2. [2]

    𝜋0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164 [cs.LG] https://arxiv.org/abs/2410.24164

  3. [3]

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov...

  4. [4]

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. 2025. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. In Robotics: Science and Systems (RSS). https://arxiv.org/abs/2505.06111

  5. [5]

    Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud

  6. [6]

    Neural Ordinary Differential Equations. arXiv:1806.07366 [cs.LG] https://arxiv.org/abs/1806.07366

  7. [7]

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. 2024. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv:2303.04137 [cs.RO] https://arxiv.org/abs/2303.04137

  8. [8]

    Samarth Chopra, Alex McMoil, Ben Carnovale, Evan Sokolson, Rajkumar Kubendran, and Samuel Dickerson. 2025. EveryDayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation. arXiv:2511.05397 [cs.RO] https://arxiv.org/abs/2511.05397

  9. [9]

    Shichao Fan, Quantao Yang, Yajie Liu, Kun Wu, Zhengping Che, Qingjie Liu, and Min Wan. 2025. Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation. arXiv:2502.10040 [cs.RO] doi:10.1109/LRA.2025.3619794

  10. [10]

    Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, and Yuntao Chen. 2025. Diffusion Transformer Policy. arXiv:2410.15959 [cs.RO] https://arxiv.org/abs/2410.15959

  11. [11]

    Yiyang Huang, Yuhui Hao, Bo Yu, Feng Yan, Yuxin Yang, Feng Min, Yinhe Han, Lin Ma, Shaoshan Liu, Qiang Liu, and Yiming Gan. 2025. Dadu-Corki: Algorithm-Architecture Co-Design for Embodied AI-powered Robotic Manipulation. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (SIGARCH ’25). ACM, 327–343. doi:10.1145/3695053.3731099

  12. [12]

    Yuhang Huang, Jiazhao Zhang, Shilong Zou, Xinwang Liu, Ruizhen Hu, and Kai Xu. 2025. LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation. arXiv:2505.11528 [cs.RO] https://arxiv.org/abs/2505.11528

  13. [13]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Per...

  14. [14]

    𝜋0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv:2504.16054 [cs.LG] https://arxiv.org/abs/2504.16054

  15. [15]

    Yina Jian, Di Tian, Xuan-Jing Chen, Zhen-Yuan Wei, Chen-Wei Liang, and Mu-Jiang-Shan Wang. 2026. PI-VLA: Adaptive Symmetry-Aware Decision-Making for Long-Horizon Vision–Language–Action Manipulation. Symmetry 18, 3 (2026). doi:10.3390/sym18030394

  16. [16]

    Xuhui Kang and Yen-Ling Kuo. 2024. Incorporating Task Progress Knowledge for Subgoal Generation in Robotic Manipulation through Image Edits. arXiv:2410.11013 [cs.RO] https://arxiv.org/abs/2410.11013

  17. [17]

    Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu

  18. [18]

    Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications. IEEE Access 13 (2025), 162467–162504. doi:10.1109/ACCESS.2025.3609980

  19. [19]

    Moo Jin Kim, Chelsea Finn, and Percy Liang. 2025. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. arXiv:2502.19645 [cs.RO] https://arxiv.org/abs/2502.19645

  20. [20]

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. 2026. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning. arXiv:2601.16163 [cs.AI] https://arxiv.org/abs/2601.16163

  21. [21]

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. 2024. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246 [cs.RO] h...

  22. [22]

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, and Ranjay Krishna. 2025. MolmoAct: Action Reasoning Models that can Reason in Space. arXiv:2508.07917 [cs.RO] https://arxiv....

  23. [23]

    Yonghyeon Lee, Byeongho Lee, Seungyeon Kim, and Frank C. Park. 2024. Motion Manifold Flow Primitives for Task-Conditioned Trajectory Generation under Complex Task-Motion Dependencies. arXiv:2407.19681 [cs.RO] https://arxiv.org/abs/2407.19681

  24. [24]

    Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, and Jiangmiao Pang. 2025. CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling. arXiv:2506.19816 [cs.RO] https://arxiv.org/abs/2506.19816

  25. [25]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. 2023. Flow Matching for Generative Modeling. arXiv:2210.02747 [cs.LG] https://arxiv.org/abs/2210.02747

  26. [26]

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. 2023. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. arXiv:2306.03310 [cs.AI] https://arxiv.org/abs/2306.03310

  27. [27]

    Fanfan Liu, Feng Yan, Liming Zheng, Chengjian Feng, Yiyang Huang, and Lin Ma

  28. [28]

    RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation. arXiv:2406.18977 [cs.RO] https://arxiv.org/abs/2406.18977

  29. [29]

    Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji, Yuanhe Guo, Shenji Wan, and Hongyi Wen. 2025. From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity. arXiv:2512.02826 [cs.LG] https://arxiv.org/abs/2512.02826

  30. [30]

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. 2022. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. arXiv:2206.00927 [cs.LG] https://arxiv.org/abs/2206.00927

  31. [31]

    Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. 2023. Grounding Language with Visual Affordances over Unstructured Data. arXiv:2210.01911 [cs.RO] https://arxiv.org/abs/2210.01911

  32. [32]

    Oier Mees, Lukas Hermann, and Wolfram Burgard. 2022. What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data. arXiv:2204.06252 [cs.RO] https://arxiv.org/abs/2204.06252

  33. [33]

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. 2022. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks. arXiv:2112.03227 [cs.RO] https://arxiv.org/abs/2112.03227

  34. [34]

    Chaoyi Pan, Giri Anantharaman, Nai-Chieh Huang, Claire Jin, Daniel Pfrommer, Chenyang Yuan, Frank Permenter, Guannan Qu, Nicholas Boffi, Guanya Shi, and Max Simchowitz. 2026. Much Ado About Noising: Dispelling the Myths of Generative Robotic Control. arXiv:2512.01809 [cs.RO] https://arxiv.org/abs/2512.01809

  35. [35]

    Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, and Rudolf Lioutikov. 2025. FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies. arXiv:2509.04996 [cs.RO] https://arxiv.org/abs/2509.04996

  36. [36]

    Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov

  37. [37]

    Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals. arXiv:2407.05996 [cs.RO] https://arxiv.org/abs/2407.05996

  38. [38]

    Tim Salimans and Jonathan Ho. 2022. Progressive Distillation for Fast Sampling of Diffusion Models. arXiv:2202.00512 [cs.LG] https://arxiv.org/abs/2202.00512

  39. [39]

    Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, and Liqiang Nie. 2025. Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey. arXiv:2508.13073 [cs.RO] https://arxiv.org/abs/2508.13073

  40. [40]

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. 2026. MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation. arXiv:2508.19236 [cs.RO] https://arxiv.org/abs/2508.19236

  41. [41]

    Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Nikita Lyubaykin, Andrei Polubarov, Alexander Derevyagin, and Vladislav Kurenkov

  42. [42]

    NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows. arXiv:2508.16845 [cs.CV] https://arxiv.org/abs/2508.16845

  43. [43]

    Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. 2025. Unified Vision-Language-Action Model. arXiv:2506.19850 [cs.CV] https://arxiv.org/abs/2506.19850

  44. [44]

    Rosa Wolf, Yitian Shi, Sheng Liu, and Rania Rayyes. 2025. Diffusion Models for Robotic Manipulation: A Survey. arXiv:2504.08438 [cs.RO] https://arxiv.org/abs/2504.08438

  45. [45]

    Yiming Wu, Huan Wang, Zhenghao Chen, Jianxin Pang, and Dong Xu. 2025. On-Device Diffusion Transformer Policy for Efficient Robot Manipulation. arXiv:2508.00697 [cs.RO] https://arxiv.org/abs/2508.00697

  46. [46]

    Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, and Chang Xu

  47. [47]

    VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching. arXiv:2502.02175 [cs.RO] https://arxiv.org/abs/2502.02175

  48. [48]

    Feng Yan, Fanfan Liu, Liming Zheng, Yufeng Zhong, Yiyang Huang, Zechao Guan, Chengjian Feng, and Lin Ma. 2025. RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation. arXiv:2412.07215 [cs.RO] https://arxiv.org/abs/2412.07215

  49. [49]

    Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, and Jiangmiao Pang. 2026. InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation. arXiv:2507.17520 [cs.RO] https://arxiv.org/abs/2507.17520

  50. [50]

    Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, Feng Xiong, Xing Wei, Zhiheng Ma, and Mu Xu. 2026. ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning. arXiv:2602.11236 [cs.CV] https://arxiv.org/abs/2602.11236

  51. [51]

    Zebin Yang, Yijiahao Qi, Tong Xie, Bo Yu, Shaoshan Liu, and Meng Li. 2026. DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation. arXiv:2602.22896 [cs.RO] https://arxiv.org/abs/2602.22896

  52. [52]

    Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Zheng Wang, Lianli Gao, Jingkuan Song, Nicu Sebe, and Heng Tao Shen. 2026. A Survey on Efficient Vision-Language-Action Models. arXiv:2510.24795 [cs.CV] https://arxiv.org/abs/2510.24795

  53. [53]

    Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, and Gao Huang. 2024. DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution. arXiv:2411.02359 [cs.RO] https://arxiv.org/abs/2411.02359

  54. [54]

    Edwin Zhang, Yujie Lu, Shinda Huang, William Wang, and Amy Zhang. 2024. Language Control Diffusion: Efficiently Scaling through Space, Time, and Tasks. arXiv:2210.15629 [cs.LG] https://arxiv.org/abs/2210.15629

  55. [55]

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. 2025. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge. arXiv:2507.04447 [cs.CV] https://arxiv.org/abs/2507.04447

  56. [56]

    Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan

  57. [57]

    Universal Actions for Enhanced Embodied Foundation Models. arXiv:2501.10105 [cs.RO] https://arxiv.org/abs/2501.10105

  58. [58]

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, and Xianyuan Zhan. 2025. X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model. arXiv:2510.10274 [cs.RO] https://arxiv.org/abs/2510.10274 ...