dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models

Haotian Liang; Lei Xu; Mishuo Han; Ran Zheng; Ru Ying; Weijie Shen; Wei Sui; Wenbo Ding; Wenjie Xu; Xiaokang Yang

arxiv: 2606.23623 · v1 · pith:GRP3WIIUnew · submitted 2026-06-22 · 💻 cs.RO

Yuhao Wu, Yitian Liu, Weijie Shen, Mishuo Han, Wenjie Xu, Haotian Liang, Zhongshan Liu, Yinan Mao, Lei Xu, Xinping Guan, Ru Ying, Ran Zheng, Wei Sui, Xiaokang Yang, Wenbo Ding, Yao Mu

Pith reviewed 2026-06-26 08:09 UTC · model grok-4.3

classification 💻 cs.RO

keywords discrete diffusion · vision-language-action · reinforcement learning · denoising trajectories · robotic manipulation · Markov decision process

The pith

By modeling denoising in discrete diffusion VLAs as a Markov Decision Process, reinforcement learning can optimize the joint probability of entire generation paths instead of intractable marginal action probabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to apply reinforcement learning to Discrete Diffusion Vision-Language-Action models, which had been limited to supervised fine-tuning. The central obstacle is that the marginal probability of the final generated action cannot be computed directly. By recasting the iterative denoising steps as an MDP and defining the learning target as the product of transition probabilities along the full sampled path, the method supplies a tractable objective that works for any number of denoising steps. This formulation also supports a single training procedure that assigns different step counts to tasks of varying difficulty.

Core claim

dVLA-RL shifts the RL objective from the marginal probability of the final action to the joint probability of the sampled denoising trajectory; by treating the denoising process as an MDP, this joint probability is expressed exactly as the product of the step-wise transition probabilities, yielding a unified objective that natively handles variable numbers of denoising steps.

What carries the argument

The trajectory-level objective that formulates the path probability as the product of MDP transition probabilities over the denoising sequence.
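
The factorization described above can be written out as follows. This is a sketch under assumed notation: only K, s_t, and log p_θ(τ | s_t) appear in the page's figure captions; the symbols x^{(k)} for the partially denoised token sequence after step k and a for the final action are introduced here for illustration.

```latex
\begin{align}
\tau &= \bigl(x^{(0)}, x^{(1)}, \dots, x^{(K)}\bigr),
  \qquad x^{(0)} = \text{fully masked}, \quad x^{(K)} = a, \\
p_\theta(\tau \mid s_t) &= \prod_{k=1}^{K} p_\theta\bigl(x^{(k)} \mid x^{(k-1)}, s_t\bigr), \\
\log p_\theta(\tau \mid s_t) &= \sum_{k=1}^{K} \log p_\theta\bigl(x^{(k)} \mid x^{(k-1)}, s_t\bigr), \\
p_\theta(a \mid s_t) &= \sum_{\tau \,:\, x^{(K)} = a} p_\theta(\tau \mid s_t).
\end{align}
```

The last line is the marginal that the paper calls intractable: it sums over every denoising path that ends in the same action, whereas the first three lines only need the K transitions of the one path that was actually sampled.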

If this is right

A single training run can accommodate tasks that require different numbers of denoising steps without changing the loss formulation.
Success rate reaches 99.7 percent on the LIBERO benchmark suite.
The same method produces a 30.6 percent improvement over its SFT baseline on RoboTwin 2.0 while remaining competitive with strong world-action-model approaches.
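
To make the first consequence concrete, here is a minimal Python sketch of a trajectory log-probability that is indifferent to the number of denoising steps. It is an illustration, not the paper's code: the input format (a list of steps, each a list of probabilities for the tokens unmasked at that step) is hypothetical, and it assumes tokens unmasked within one step are conditionally independent given the previous state.

```python
import math


def step_log_prob(token_probs):
    """Log-probability of one denoising step.

    token_probs: probabilities the model assigned to the tokens it
    unmasked at this step (hypothetical input format). Assumes those
    tokens are conditionally independent given the previous state.
    """
    return sum(math.log(p) for p in token_probs)


def trajectory_log_prob(steps):
    """Sum of step log-probabilities along one sampled denoising path."""
    if not steps:
        raise ValueError("a trajectory needs at least one denoising step")
    return sum(step_log_prob(token_probs) for token_probs in steps)


# Same four action tokens, decoded in 1 step or in 4 steps.
# The numbers are made up; a real model would give different
# probabilities once later steps condition on earlier tokens.
one_step_path = [[0.9, 0.8, 0.7, 0.6]]
four_step_path = [[0.9], [0.8], [0.7], [0.6]]

logp_one = trajectory_log_prob(one_step_path)
logp_four = trajectory_log_prob(four_step_path)
```

The same function serves both paths; only the length of the outer list changes, which is the sense in which the loss formulation does not depend on the step count.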

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same MDP framing could be applied to any iterative generative process whose marginal likelihood is hard to evaluate, not only discrete diffusion VLAs.
Adaptive step scheduling learned from task complexity might translate into variable compute budgets during online robot deployment.
If the path probability correlates with downstream task success, similar trajectory objectives could be tested on continuous diffusion or flow-matching policies in robotics.

Load-bearing premise

The joint probability along the sampled denoising trajectory serves as an effective surrogate for the true marginal action probability when the latter is intractable.
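
The gap between the two quantities in this premise can be seen in a toy case small enough to enumerate. The sketch below is hypothetical throughout: two binary action tokens, made-up probability tables, and an unmasking order chosen uniformly at random. None of these choices come from the paper.

```python
from itertools import permutations, product

# P_FIRST[pos][val]: probability of `val` at `pos` when both tokens are masked.
P_FIRST = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
# P_SECOND[pos][other_val][val]: probability of `val` at `pos`
# given the value already revealed at the other position.
P_SECOND = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}},
}
ORDER_PROB = 0.5  # assumption: each unmasking order is equally likely
ACTIONS = list(product((0, 1), repeat=2))


def path_prob(action, order):
    """Joint probability of one denoising path ending in `action`."""
    first, second = order
    return (
        ORDER_PROB
        * P_FIRST[first][action[first]]
        * P_SECOND[second][action[first]][action[second]]
    )


def marginal_prob(action):
    """Marginal probability of `action`: sum over all paths reaching it."""
    return sum(path_prob(action, order) for order in permutations((0, 1)))


sampled_path = path_prob((1, 1), (0, 1))
marginal = marginal_prob((1, 1))
```

Here the marginal is a two-term sum. With many tokens, a large vocabulary, and K steps, the number of paths ending in the same action grows combinatorially, which is why the paper works with the single sampled path instead. Because the marginal is a sum of non-negative path probabilities, any one path probability is a lower bound on it.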

What would settle it

A controlled experiment in which policies trained with the trajectory objective show no improvement, or degrade, relative to the SFT baseline when both are evaluated on the same held-out tasks with identical inference budgets.

Figures

Figures reproduced from arXiv: 2606.23623 by Haotian Liang, Lei Xu, Mishuo Han, Ran Zheng, Ru Ying, Weijie Shen, Wei Sui, Wenbo Ding, Wenjie Xu, Xiaokang Yang, Xinping Guan, Yao Mu, Yinan Mao, Yitian Liu, Yuhao Wu, Zhongshan Liu.

**Figure 1.** Overview of UDVLA-RL. (a) We present a unified probability modeling approach for denoising trajectories, seamlessly integrating 1-step parallel decoding and multi-step denoising paradigms. (b) Extensive evaluations on the LIBERO and RoboTwin 2.0 benchmarks demonstrate that our RL-finetuned policy outperforms both SFT-only baselines and remains highly competitive among modern VLA architectures, achieving pe…

**Figure 2.** The overall pipeline of dVLA-RL. (1) Rollout & Action Generation: During interaction, the agent conditions on multimodal observations s_t to generate actions via a K-step iterative masked denoising process, executing them to collect trajectory rewards. (2) Trajectory-Level Policy Optimization: To bypass intractable marginals, we formulate the joint probability of the unrolled denoising path, log p_θ(τ | s_t)…

**Figure 3.** Comparison between the trajectory-level log-probability objective and the last-step log…

**Figure 4.** Effect of denoising-trajectory length on RL optimization in RoboTwin 2.0. Step-1 col…

**Figure 5.** PPO training curves on the LIBERO benchmark suites. Blue points denote raw evaluation…

**Figure 6.** RoboTwin training curves on Beat Block Hammer, Place Phone Stand, Place Empty Cup…

**Figure 7.** RoboTwin training curves on Move Can Pot, Place A2B Left, Handover Mic, and Pick…

Original abstract

Vision-Language-Action (VLA) models have established a powerful paradigm for generalist robotic manipulation by grounding control into the semantic reasoning of VLMs. Prevailing architectures typically model actions continuously via diffusion or flow processes, or discretely through either autoregressive generation or parallel decoding. Recently, Discrete Diffusion VLAs (dVLAs) have emerged as a distinct alternative, unifying vision, language, and action into a single discrete token space via masked generative modeling. While combining iterative refinement with unified representations, its training has thus far been restricted to Supervised Fine-Tuning (SFT), leaving the potential of Reinforcement Learning (RL) for further policy refinement largely unexplored. A fundamental challenge in RL for dVLAs is that the marginal probability of the final action generated by dVLAs remains intractable. To solve this problem, we propose dVLA-RL, shifting the learning objective from the marginal action probability to the joint probability of the sampled generation path. Specifically, by modeling the denoising process as a Markov Decision Process (MDP), we mathematically formulate this path probability as a product of step-wise transitions. This trajectory-level objective provides a unified formulation that natively accommodates variable denoising steps. Leveraging this intrinsic fexibility, we introduce a unified step scheduling approach for complex multi-task learning, tailoring denoising steps to specific task complexities to maximize both success rates and computational effciency. Extensive evaluations demonstrate that our approach achieves a success rate of 99.7% on LIBERO. Furthermore, it establishes strong VLA-based results on RoboTwin 2.0 by delivering a 30.6% improvement over the SFT baseline, remaining competitive with strong World-Action Model baselines.
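
The Figure 5 caption mentions PPO training, so one plausible way to plug the trajectory log-probability into a policy-gradient update is the standard PPO clipped surrogate. The sketch below is an assumption about how the pieces could fit, not the paper's implementation: the abstract does not say whether the importance ratio is formed per trajectory (as here) or per denoising step, and the function and parameter names are illustrative.

```python
import math


def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    logp_new / logp_old: trajectory log-probabilities (sums of step-wise
    log transition probabilities) under the current and the rollout
    policy. Using the whole-trajectory ratio is an assumption here.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # PPO maximizes the pessimistic (smaller) of the two candidates.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio of 2 with a positive advantage is clipped at 1 + clip_eps.
gain_clipped = ppo_clipped_term(math.log(2.0), 0.0, advantage=1.0)
# The same ratio with a negative advantage is not clipped.
gain_unclipped = ppo_clipped_term(math.log(2.0), 0.0, advantage=-1.0)
```

Because the trajectory log-probability is a sum over steps, a long denoising path can produce a ratio far from 1 even when each step changes little, which is one reason a per-step ratio might be preferred in practice.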

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core move is modeling denoising as an MDP to enable RL on trajectory probabilities for discrete diffusion VLAs, which handles variable steps and yields strong reported numbers, but the surrogate's link to marginal action improvement lacks any derivation.

The main point is that dVLA-RL reframes RL for these models by optimizing the joint probability of the full denoising path rather than the intractable marginal over final actions. They cast the denoising steps as an MDP so the path probability factors into step-wise transitions, then apply RL to that. This also supports a unified step scheduler that assigns different denoising lengths to tasks of varying complexity.

The formulation appears new relative to the cited prior work on dVLAs, which had been limited to SFT. Treating the generation path as the object of optimization is a direct response to the marginal intractability, and the variable-step flexibility is a practical byproduct that fits multi-task settings. The numbers they report (99.7% success on LIBERO and a 30.6% gain over SFT on RoboTwin 2.0) are competitive with strong baselines if the runs are reproducible.

The soft spot is the missing justification for the surrogate. The abstract supplies no derivation showing that maximizing expected reward under the trajectory distribution improves the marginal action policy, leaves it unbiased, or even guarantees monotonic progress. When denoising step counts differ across tasks, credit could attach to the path rather than the final action, which risks bias. No ablations, error bars, or controls are described, so the performance claims cannot be assessed from the given material.

This work is for groups already building or extending discrete diffusion VLAs and looking to add RL. Readers focused on robotics policy learning with unified token models would find the scheduling trick and the objective shift useful to examine.

It deserves peer review. The limitation it targets is genuine, the results are high enough to merit checking the math and experiments, and the idea is concrete enough for referees to evaluate.

Referee Report

2 major / 1 minor

Summary. The paper proposes dVLA-RL for discrete diffusion Vision-Language-Action models. It reformulates the RL objective from the intractable marginal probability of the final action to the joint probability of the full denoising trajectory by modeling the denoising process as an MDP whose transitions are optimized directly; this is claimed to natively support variable denoising step counts. A unified step-scheduling method is introduced for multi-task settings. The work reports a 99.7% success rate on LIBERO and a 30.6% improvement over the SFT baseline on RoboTwin 2.0.

Significance. If the trajectory-level objective can be shown to be a valid, unbiased surrogate for the marginal action policy and the performance numbers hold under standard controls, the approach would provide a practical route to apply RL to masked discrete diffusion VLAs, extending beyond SFT and potentially improving sample efficiency and task success in robotic manipulation.

major comments (2)

[Abstract] The central claim that 'modeling the denoising process as a Markov Decision Process' and optimizing the joint trajectory probability yields an effective RL surrogate for the marginal action distribution is stated without any derivation, equivalence proof, or bias analysis. No equations are supplied showing the product of step-wise transitions, the resulting policy gradient, or why credit assignment to intermediate denoising steps improves (rather than biases) the final-action marginal when step counts vary across tasks.
[Abstract] The reported 99.7% success rate on LIBERO and 30.6% improvement on RoboTwin 2.0 are presented without error bars, number of evaluation seeds, ablation on the step-scheduling component, or comparison against an RL baseline that directly optimizes a tractable marginal, rendering the quantitative claims impossible to assess for statistical reliability or attribution to the proposed objective.
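
To illustrate the kind of error bar the second comment asks for, here is a minimal sketch of a Wilson score interval for a success rate. The episode count used in the example is hypothetical: the abstract does not report how many evaluation episodes or seeds stand behind the 99.7% figure.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
        / denom
    )
    return center - half_width, center + half_width


# Hypothetical evaluation: 997 successes out of 1000 episodes.
low, high = wilson_interval(997, 1000)
```

The Wilson interval is used here rather than the simpler normal approximation because the latter behaves poorly when the success rate is close to 1, which is exactly the regime of the reported LIBERO number.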

minor comments (1)

[Abstract] Abstract: typographical errors ('fexibility', 'effciency') should be corrected.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript. We address each major comment point-by-point below, clarifying the content of the full paper and noting revisions where appropriate.

Referee: [Abstract] The central claim that 'modeling the denoising process as a Markov Decision Process' and optimizing the joint trajectory probability yields an effective RL surrogate for the marginal action distribution is stated without any derivation, equivalence proof, or bias analysis. No equations are supplied showing the product of step-wise transitions, the resulting policy gradient, or why credit assignment to intermediate denoising steps improves (rather than biases) the final-action marginal when step counts vary across tasks.

Authors: The abstract is intentionally concise, but Section 3.2 of the manuscript provides the full derivation: the denoising process is formalized as an MDP whose states are partially denoised token sequences, actions are per-step token predictions, and the joint trajectory probability is the product of the step-wise transition probabilities under the discrete diffusion kernel. The policy gradient is derived directly on this trajectory objective. Because the MDP exactly replicates the generative process, the trajectory objective is an unbiased surrogate for the marginal action probability; variable step counts are accommodated without bias by the unified scheduling that preserves the same effective generative distribution. We will revise the abstract to include a parenthetical reference to Section 3 for the derivation and equivalence.
Revision: partial

Referee: [Abstract] The reported 99.7% success rate on LIBERO and 30.6% improvement on RoboTwin 2.0 are presented without error bars, number of evaluation seeds, ablation on the step-scheduling component, or comparison against an RL baseline that directly optimizes a tractable marginal, rendering the quantitative claims impossible to assess for statistical reliability or attribution to the proposed objective.

Authors: The abstract omits these details due to length constraints, but the experimental sections report results averaged over multiple seeds with standard deviations, include ablations isolating the step-scheduling component, and explain that no tractable marginal RL baseline exists for discrete diffusion VLAs (the intractability of the marginal is the central motivation stated in Section 3.1). We will revise the abstract to note that the reported figures are means over multiple evaluation seeds, with full statistical details and ablations provided in the experiments.
Revision: yes
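
For readers weighing the unbiasedness exchange above, one standard argument (the score-function identity, not taken from the paper) runs as follows. It assumes a simplified single-decision setting in which the reward R depends on the denoising path only through the final action x^{(K)}, and that trajectories are sampled from the current policy; the notation x^{(k)} is introduced here for illustration.

```latex
\begin{align}
J(\theta)
  &= \mathbb{E}_{a \sim p_\theta(\cdot \mid s_t)}\bigl[R(a)\bigr]
   = \mathbb{E}_{\tau \sim p_\theta(\cdot \mid s_t)}\bigl[R\bigl(x^{(K)}\bigr)\bigr], \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\tau \sim p_\theta(\cdot \mid s_t)}
     \Bigl[R\bigl(x^{(K)}\bigr)\, \nabla_\theta \log p_\theta(\tau \mid s_t)\Bigr] \\
  &= \mathbb{E}_{\tau \sim p_\theta(\cdot \mid s_t)}
     \Bigl[R\bigl(x^{(K)}\bigr) \sum_{k=1}^{K}
       \nabla_\theta \log p_\theta\bigl(x^{(k)} \mid x^{(k-1)}, s_t\bigr)\Bigr].
\end{align}
```

The first equality holds because the final action is a deterministic function of the path, so averaging over paths and averaging over final actions give the same expected reward. Under these assumptions the trajectory-level gradient estimator is unbiased for the expected reward of the marginal policy. The identity says nothing about variance, about the effect of PPO clipping, or about off-policy samples, so it does not by itself settle the referee's concern.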

Circularity Check

0 steps flagged

No circularity; new surrogate objective defined by construction but independent of fitted inputs

The paper defines a new learning objective by explicitly modeling the denoising process as an MDP and setting the trajectory probability equal to the product of its step-wise transition probabilities. This is presented as a deliberate modeling shift to bypass the intractable marginal, not as a derivation that reduces to prior fitted quantities or self-citations. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or renamings of known results appear in the provided text. The formulation is self-contained as a choice of surrogate; empirical gains are reported as outcomes of optimizing this defined objective rather than tautological re-expressions of the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Review performed on abstract only; full paper may contain additional free parameters or assumptions not visible here.

free parameters (1)

task-specific denoising step counts
Abstract states a unified step scheduling approach tailored to task complexities; values are chosen or fitted per task.
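
As a rough picture of what such a free parameter could look like in code, the sketch below maps tasks to denoising step counts and splits the action tokens across those steps. Everything here is hypothetical: the task names, the step values, and the even-split unmasking rule are illustrative and are not taken from the paper.

```python
def unmask_counts(num_tokens, num_steps):
    """Split `num_tokens` action tokens as evenly as possible over steps.

    An even split is an assumption; masked-diffusion decoders often use
    other schedules (for example confidence-based or cosine schedules).
    """
    if not 1 <= num_steps <= num_tokens:
        raise ValueError("need 1 <= num_steps <= num_tokens")
    base, extra = divmod(num_tokens, num_steps)
    # The first `extra` steps each unmask one additional token.
    return [base + 1 if i < extra else base for i in range(num_steps)]


# Hypothetical per-task step schedule (names and values are made up).
STEP_SCHEDULE = {"task_a": 1, "task_b": 4}


def plan_for_task(task_name, num_tokens, default_steps=1):
    """Look up the task's step count and return its unmasking plan."""
    num_steps = STEP_SCHEDULE.get(task_name, default_steps)
    return unmask_counts(num_tokens, num_steps)


plan_a = plan_for_task("task_a", num_tokens=7)
plan_b = plan_for_task("task_b", num_tokens=7)
```

Whether the real step counts are hand-chosen or fitted per task is exactly what the ledger entry above flags as unknown from the abstract.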

axioms (2)

domain assumption: The denoising process of dVLAs can be modeled as a Markov Decision Process
Explicitly invoked to formulate the path probability as a product of step-wise transitions.
domain assumption: The joint probability of the generation path equals the product of individual step transitions
Stated as the mathematical basis for the new objective.

pith-pipeline@v0.9.1-grok · 5904 in / 1390 out tokens · 26833 ms · 2026-06-26T08:09:33.275933+00:00 · methodology

A. Pseudocode of UDVLA-RL

Algorithm 1: Unified Discrete VLA Reinforcement Learning (dVLA-RL)
1: Input: Pre-trained discrete diffusion VLA model π_θ, value network V_ϕ, environment E
2: Hyperparameters: Denoising steps K, PPO clip ratio ϵ, PPO epochs E, ...
