Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

Jiayu Sun; Minqian Wang; Xingjian Mao; Yao Mu; Ziqian Wang

arxiv: 2606.08015 · v1 · pith:6ESMROJMnew · submitted 2026-06-06 · 💻 cs.RO

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

Ziqian Wang , Jiayu Sun , Xingjian Mao , Minqian Wang , Yao Mu This is my paper

Pith reviewed 2026-06-27 19:55 UTC · model grok-4.3

classification 💻 cs.RO

keywords Q-VGMVGG-Flowflow-matchingVLA policiesvalue-gradient matchingoff-policy RLrobot learningdenoising

0 comments

The pith

Q-VGM updates flow-matching VLA policies by matching value gradients to the velocity field at each denoising step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Q-VGM as an off-policy RL method to improve flow-matching vision-language-action policies using signals from a learned Q-function. It converts the critic gradient into a per-denoising-step field called VGG-Flow so that the velocity field can be supervised without backpropagating through the full denoising chain or needing action likelihoods. This operates on a fixed replay buffer after a few-shot supervised initialization. A sympathetic reader would care because the approach produces large gains in task success on both simulated and real robot benchmarks while avoiding numerical instability. The reported results show average success rising from 75.0 percent to 92.5 percent on LIBERO, 76.4 percent to 87.2 percent on RoboTwin 2.0, and 40.0 percent to 67.5 percent on real-robot tasks.

Core claim

Q-VGM introduces VGG-Flow to transform the value gradient of an action-sensitive Cal-QL ensemble into a denoising-time value-gradient field that directly supervises the flow policy velocity; this requires no backpropagation through the multi-step process and no action likelihoods, enabling stable off-policy improvement from a fixed replay buffer after few-shot SFT initialization.

What carries the argument

VGG-Flow, the transformation of the critic value gradient into a per-denoising-step supervision field for the flow velocity.

Load-bearing premise

That converting the value gradient into a per-denoising-step field supplies stable and effective supervision for the velocity without full-chain backpropagation or action likelihoods.

What would settle it

Training the same flow policy with direct backpropagation of the value through the entire denoising chain at VLA scale and checking whether the resulting success rates match or exceed those obtained with Q-VGM.

Figures

Figures reproduced from arXiv: 2606.08015 by Jiayu Sun, Minqian Wang, Xingjian Mao, Yao Mu, Ziqian Wang.

**Figure 1.** Figure 1: Q-Guided Value-Gradient Matching (Q-VGM). (a) An action-sensitive Cal-QL critic is trained offline on rollout data using frozen VLM and RLT features. (b) Critic action gradients ∇AQ are converted into denoising-time velocity corrections via value-gradient matching. (c) Starting from a few-shot-SFT π0.5 VLA, Q-VGM learns from self-generated experience to improve task success across LIBERO, RoboTwin 2.0, and… view at source ↗

**Figure 2.** Figure 2: Overview of Q-Guided Value-Gradient Matching (Q-VGM). [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Real-robot tabletop platform: 7-DoF arm observed by two RGB-D cameras both facing [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tackles a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the first-order (gradient) information of the critic, but this is difficult for flow policies, because directly back-propagating the value through their multi-step denoising process is numerically unstable at VLA scale, while the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods either backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM sidesteps these issues by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling that transforms value gradient into a denoising-time value-gradient field rather than an unstable end-to-end objective. This requires no action likelihoods and no backpropagation through the denoising chain, and operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Q-VGM enables a practical few-shot initialization then learn-from-experience paradigm: starting from a few-shot-SFT pi0.5 VLA, the method leverages self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Q-VGM gives a workable route to supervise flow-matching VLA policies with value gradients without chain backprop or likelihoods, but the transformation's explicit form and stability checks are missing from the abstract.

read the letter

Q-VGM is worth knowing about because it gives a concrete way to use a learned Q-function to improve a flow-matching VLA policy without the usual headaches of backprop through denoising or needing tractable likelihoods.

The new piece is VGG-Flow. It converts the value gradient into a field that supervises the flow velocity at each denoising time step. This lets the method run off-policy on a fixed replay buffer after a few-shot initialization. The critic uses a Cal-QL ensemble on compact features with action injection. The results look solid on paper: success rates jump from 75 to 92.5 percent on LIBERO, 76.4 to 87.2 on RoboTwin 2.0, and 40 to 67.5 on the real-robot tasks, beating same-backbone baselines.

What the paper does well is frame the problem clearly and deliver a practical pipeline that starts from pi0.5 and improves with rollout data. That matches what many groups are trying to do in robot learning.

The soft spots sit in the verification. The abstract claims the transformation avoids instability but does not show the explicit mapping from Q-gradient to the per-step target or any check on numerical properties. Experiments report point estimates without error bars or tests, and there is no ablation isolating the matching loss from the critic design. If the gains trace back to the ensemble or data curation instead of the gradient matching, the central claim weakens. The assumption that this field stably carries first-order return information at VLA scale still needs direct evidence.

This paper is for people working on fine-tuning generative policies with RL in robotics. A reader focused on engineering fixes for scaling will find the approach useful even if the theory side stays light. It deserves a serious referee because the engineering problem is genuine and the reported improvements are large enough to warrant checking the details.

I would send it out for review, asking for the full derivation of VGG-Flow and more rigorous stats on the results.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Q-Guided Value-Gradient Matching (Q-VGM), an off-policy RL approach for fine-tuning flow-matching VLA policies. It introduces VGG-Flow to convert critic value gradients into a per-denoising-step supervision field for the flow velocity, avoiding direct backpropagation through the iterative denoising chain and eliminating the need for action likelihoods. The critic is implemented as an action-sensitive Cal-QL ensemble over RLT features. The method starts from a few-shot SFT initialization and improves via self-generated rollouts on a fixed replay buffer. Reported results include average success rate increases from 75.0% to 92.5% on LIBERO, 76.4% to 87.2% on RoboTwin 2.0, and 40.0% to 67.5% on two real-robot tabletop tasks, with outperformance over same-backbone, same-critic baselines.

Significance. If the VGG-Flow construction is shown to preserve first-order return information and remain stable at VLA scale, the approach could enable practical RL fine-tuning of expressive generative policies where full-chain differentiation is intractable. The fixed-buffer, few-shot-to-experience paradigm and use of Cal-QL ensembles are pragmatic strengths for robotics deployment. The empirical gains across simulation and real-robot settings, if robustly attributed to the proposed mechanism, would represent a meaningful advance in scaling value-based improvement for flow-matching VLAs.

major comments (2)

[Method section (VGG-Flow definition)] The VGG-Flow construction (method section): the paper states that the transformation converts the Q-gradient into a denoising-time value-gradient field that directly supervises the velocity without end-to-end differentiation, but supplies neither the explicit mapping equation from critic gradient to velocity target nor any analysis of its bias, Lipschitz constant, or behavior under iterative denoising. This is load-bearing for the central claim that the method sidesteps numerical instability while preserving expected-return information.
[Experimental results section] Experimental results (LIBERO, RoboTwin, and real-robot tables): the success-rate improvements (75.0% o 92.5%, 76.4% o 87.2%, 40.0% o 67.5%) are reported as single point estimates with no error bars, standard deviations across seeds, or statistical significance tests against baselines. No ablation isolates the contribution of the value-gradient matching loss from the Cal-QL ensemble or replay-buffer curation, which is required to attribute gains to the claimed supervision mechanism.

minor comments (2)

[Abstract] The abstract claims outperformance over 'all same-backbone, same-critic baselines' but does not list the specific baseline methods or their numerical results; adding a compact comparison table would improve clarity.
[Critic architecture subsection] Notation for the per-layer action injection in the Cal-QL critic and the RLT feature extractor could be defined more explicitly with a small diagram or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate clarifications and additional results in the revised version.

read point-by-point responses

Referee: [Method section (VGG-Flow definition)] The VGG-Flow construction (method section): the paper states that the transformation converts the Q-gradient into a denoising-time value-gradient field that directly supervises the velocity without end-to-end differentiation, but supplies neither the explicit mapping equation from critic gradient to velocity target nor any analysis of its bias, Lipschitz constant, or behavior under iterative denoising. This is load-bearing for the central claim that the method sidesteps numerical instability while preserving expected-return information.

Authors: We agree the explicit mapping and supporting analysis are necessary for the central claim. The current manuscript describes VGG-Flow conceptually but omits the precise equation. In revision we will insert the closed-form mapping that converts the critic gradient abla_a Q(s, a) into the per-denoising-step velocity target via the flow-matching alignment objective. We will also add a short derivation showing that the mapping is unbiased with respect to expected return under the Cal-QL ensemble and provide a Lipschitz-constant bound that holds under the compact RLT feature representation, thereby confirming stability without end-to-end differentiation. revision: yes
Referee: [Experimental results section] Experimental results (LIBERO, RoboTwin, and real-robot tables): the success-rate improvements (75.0% to 92.5%, 76.4% to 87.2%, 40.0% to 67.5%) are reported as single point estimates with no error bars, standard deviations across seeds, or statistical significance tests against baselines. No ablation isolates the contribution of the value-gradient matching loss from the Cal-QL ensemble or replay-buffer curation, which is required to attribute gains to the claimed supervision mechanism.

Authors: We concur that single-point estimates and missing ablations limit attribution. The reported numbers reflect single training runs performed under tight compute budgets typical for VLA-scale models. In the revision we will rerun the primary LIBERO and RoboTwin experiments over three independent seeds, reporting means and standard deviations, and will add a targeted ablation that disables only the value-gradient matching term while retaining the identical Cal-QL ensemble and replay buffer. This will allow direct isolation of VGG-Flow’s contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent supervision mechanism

full rationale

The paper presents Q-VGM as a new off-policy RL method that uses VGG-Flow to transform critic gradients into a per-denoising-step field for supervising flow velocity, explicitly avoiding backpropagation through the chain and action likelihoods. No equations, self-citations, or fitted parameters are shown that reduce the claimed performance gains (on LIBERO, RoboTwin, real-robot tasks) to quantities already present in the inputs by construction. The central transformation is framed as addressing a distinct numerical instability problem rather than renaming or refitting prior results. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be extracted beyond the high-level description of the critic ensemble and VGG-Flow transformation.

pith-pipeline@v0.9.1-grok · 5896 in / 1119 out tokens · 22527 ms · 2026-06-27T19:55:42.540561+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 12 linked inside Pith

[1]

Xiao, Carles Domingo-Enrich, Weiyang Liu, and Dinghuai Zhang

Zhen Liu, Tim Z. Xiao, Carles Domingo-Enrich, Weiyang Liu, and Dinghuai Zhang. Value gradient guidance for flow matching alignment. InAdvances in Neural Information Processing Systems, 2025

2025
[2]

RT-2: Vision- language-action models transfer web knowledge to robotic control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision- language-action models transfer web knowledge to robotic control. InConference on Robot Learning, 2023

2023
[3]

Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

Octo Model Team. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

Pith/arXiv arXiv 2024
[4]

OpenVLA: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[5]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[6]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π 0.5: a vision- language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025
[7]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Matt Le, and Maximilian Nickel. Flow matching for generative modeling. InInternational Conference on Learning Representations, 2023

2023
[8]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations, 2023

2023
[9]

πRL: Online RL fine-tuning for flow-based vision-language-action models.arXiv preprint arXiv:2510.25889, 2025

Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Xi- ang Li, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, and Chao Yu. πRL: Online RL fine-tuning for flow-based vision-language-action models.arXiv preprint arXiv:2510.25889, 2025

arXiv 2025
[10]

Ren, Justin Lidard, Lars L

Allen Z. Ren, Justin Lidard, Lars L. Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. InInternational Conference on Learning Representations, 2025

2025
[11]

ReinFlow: Fine-tuning flow matching policy with online reinforcement learning.arXiv preprint arXiv:2505.22094, 2025

Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. ReinFlow: Fine-tuning flow matching policy with online reinforcement learning.arXiv preprint arXiv:2505.22094, 2025

arXiv 2025
[12]

Reinforcement fine-tuning of flow-matching policies for vision-language-action models.arXiv preprint arXiv:2510.09976, 2025

Mingyang Lyu, Yinqian Sun, Erliang Lin, Huangrui Li, Ruolin Chen, Feifei Zhao, and Yi Zeng. Reinforcement fine-tuning of flow-matching policies for vision-language-action models.arXiv preprint arXiv:2510.09976, 2025

Pith/arXiv arXiv 2025
[13]

Flow-GRPO: Training flow matching models via online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025

Pith/arXiv arXiv 2025
[14]

Reward score matching: Unify- ing reward-based fine-tuning for flow and diffusion models.arXiv preprint arXiv:2604.17415, 2026

Jeongjae Lee, Jinho Chang, Jeongsol Kim, and Jong Chul Ye. Reward score matching: Unify- ing reward-based fine-tuning for flow and diffusion models.arXiv preprint arXiv:2604.17415, 2026

Pith/arXiv arXiv 2026
[15]

Diffusion policies as an expressive policy class for offline reinforcement learning

Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. InInternational Conference on Learning Rep- resentations, 2023. 10

2023
[16]

Hume: Introducing system-2 thinking in visual-language-action model.arXiv preprint arXiv:2505.21432, 2025

Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, Dong Wang, and Xuelong Li. Hume: Introducing system-2 thinking in visual-language-action model.arXiv preprint arXiv:2505.21432, 2025

arXiv 2025
[17]

RoboMonkey: Scaling test-time sampling and verification for vision- language-action models.arXiv preprint arXiv:2506.17811, 2025

Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirho- seini, and Marco Pavone. RoboMonkey: Scaling test-time sampling and verification for vision- language-action models.arXiv preprint arXiv:2506.17811, 2025

arXiv 2025
[18]

Policy agnostic RL: Offline RL and online RL fine- tuning of any class and backbone

Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy agnostic RL: Offline RL and online RL fine- tuning of any class and backbone. InInternational Conference on Learning Representations, 2025

2025
[19]

Efficient diffusion policies for offline reinforcement learning

Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning. InAdvances in Neural Information Processing Systems, 2023

2023
[20]

Diffusion policies creating a trust region for offline reinforcement learning

Tianyu Chen, Zhendong Wang, and Mingyuan Zhou. Diffusion policies creating a trust region for offline reinforcement learning. InAdvances in Neural Information Processing Systems, 2024

2024
[21]

Learning a diffusion model policy from rewards via Q-score matching

Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via Q-score matching. InInternational Conference on Machine Learning, 2024

2024
[22]

Verifier- free test-time sampling for vision language action models

Suhyeok Jang, Dongyoung Kim, Changyeon Kim, Youngsuk Kim, and Jinwoo Shin. Verifier- free test-time sampling for vision language action models. InInternational Conference on Learning Representations, 2026

2026
[23]

Fine- tuning of continuous-time diffusion models as entropy-regularized control.arXiv preprint arXiv:2402.15194, 2024

Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani, and Sergey Levine. Fine- tuning of continuous-time diffusion models as entropy-regularized control.arXiv preprint arXiv:2402.15194, 2024

arXiv 2024
[24]

Fine-tuning of diffusion models via stochastic control: en- tropy regularization and beyond.arXiv preprint arXiv:2403.06279, 2024

Wenpin Tang and Fuzhong Zhou. Fine-tuning of diffusion models via stochastic control: en- tropy regularization and beyond.arXiv preprint arXiv:2403.06279, 2024

arXiv 2024
[25]

Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky T. Q. Chen. Adjoint match- ing: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. InInternational Conference on Learning Representations, 2025

2025
[26]

Q-learning with adjoint matching.arXiv preprint arXiv:2601.14234, 2026

Qiyang Li and Sergey Levine. Q-learning with adjoint matching.arXiv preprint arXiv:2601.14234, 2026

Pith/arXiv arXiv 2026
[27]

Reinforcement learning and control as probabilistic inference: Tutorial and review.arXiv preprint arXiv:1805.00909, 2018

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review.arXiv preprint arXiv:1805.00909, 2018

Pith/arXiv arXiv 2018
[28]

RL token: Bootstrapping online RL with vision-language-action mod- els

Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, and Liyiming Ke. RL token: Bootstrapping online RL with vision-language-action mod- els. Physical Intelligence whitepaper; arXiv:2604.23073, 2026. URLhttps://www.pi. website/download/rlt.pdf

Pith/arXiv arXiv 2026
[29]

Reinforcement learning with action chunking

Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement learning with action chunking. InAdvances in Neural Information Processing Systems, 2025

2025
[30]

Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning

Mitsuhiko Nakamoto, Yuexiang Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. InAdvances in Neural Information Processing Systems, 2023

2023
[31]

LIBERO: Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. InAdvances in Neural Information Processing Systems, 2023. 11

2023
[32]

RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025
[33]

Steering your generalists: Improving robotic foundation models via value guidance

Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. InConference on Robot Learning, 2024

2024
[34]

Policy representation via diffusion probability model for reinforcement learning.arXiv preprint arXiv:2305.13122, 2023

Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy representation via diffusion probability model for reinforcement learning.arXiv preprint arXiv:2305.13122, 2023

arXiv 2023
[35]

Trust region q adjoint matching.arXiv preprint arXiv:2605.27079, 2026

Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, and Jinwoo Shin. Trust region q adjoint matching.arXiv preprint arXiv:2605.27079, 2026

Pith/arXiv arXiv 2026
[36]

Horizon reduction makes RL scalable

Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, and Sergey Levine. Horizon reduction makes RL scalable. InAdvances in Neural Information Processing Systems, 2025. 12 A Critic Architecture and Training Details Critic architecture.The critic state is¯s= LayerNorm([z rl ∥W p p])∈R 2304, wherez rl ∈R 2048 is the cached RLT encodin...

2025

[1] [1]

Xiao, Carles Domingo-Enrich, Weiyang Liu, and Dinghuai Zhang

Zhen Liu, Tim Z. Xiao, Carles Domingo-Enrich, Weiyang Liu, and Dinghuai Zhang. Value gradient guidance for flow matching alignment. InAdvances in Neural Information Processing Systems, 2025

2025

[2] [2]

RT-2: Vision- language-action models transfer web knowledge to robotic control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision- language-action models transfer web knowledge to robotic control. InConference on Robot Learning, 2023

2023

[3] [3]

Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

Octo Model Team. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

Pith/arXiv arXiv 2024

[4] [4]

OpenVLA: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[5] [5]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[6] [6]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π 0.5: a vision- language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[7] [7]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Matt Le, and Maximilian Nickel. Flow matching for generative modeling. InInternational Conference on Learning Representations, 2023

2023

[8] [8]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations, 2023

2023

[9] [9]

πRL: Online RL fine-tuning for flow-based vision-language-action models.arXiv preprint arXiv:2510.25889, 2025

Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Xi- ang Li, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, and Chao Yu. πRL: Online RL fine-tuning for flow-based vision-language-action models.arXiv preprint arXiv:2510.25889, 2025

arXiv 2025

[10] [10]

Ren, Justin Lidard, Lars L

Allen Z. Ren, Justin Lidard, Lars L. Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. InInternational Conference on Learning Representations, 2025

2025

[11] [11]

ReinFlow: Fine-tuning flow matching policy with online reinforcement learning.arXiv preprint arXiv:2505.22094, 2025

Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. ReinFlow: Fine-tuning flow matching policy with online reinforcement learning.arXiv preprint arXiv:2505.22094, 2025

arXiv 2025

[12] [12]

Reinforcement fine-tuning of flow-matching policies for vision-language-action models.arXiv preprint arXiv:2510.09976, 2025

Mingyang Lyu, Yinqian Sun, Erliang Lin, Huangrui Li, Ruolin Chen, Feifei Zhao, and Yi Zeng. Reinforcement fine-tuning of flow-matching policies for vision-language-action models.arXiv preprint arXiv:2510.09976, 2025

Pith/arXiv arXiv 2025

[13] [13]

Flow-GRPO: Training flow matching models via online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025

Pith/arXiv arXiv 2025

[14] [14]

Reward score matching: Unify- ing reward-based fine-tuning for flow and diffusion models.arXiv preprint arXiv:2604.17415, 2026

Jeongjae Lee, Jinho Chang, Jeongsol Kim, and Jong Chul Ye. Reward score matching: Unify- ing reward-based fine-tuning for flow and diffusion models.arXiv preprint arXiv:2604.17415, 2026

Pith/arXiv arXiv 2026

[15] [15]

Diffusion policies as an expressive policy class for offline reinforcement learning

Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. InInternational Conference on Learning Rep- resentations, 2023. 10

2023

[16] [16]

Hume: Introducing system-2 thinking in visual-language-action model.arXiv preprint arXiv:2505.21432, 2025

Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, Dong Wang, and Xuelong Li. Hume: Introducing system-2 thinking in visual-language-action model.arXiv preprint arXiv:2505.21432, 2025

arXiv 2025

[17] [17]

RoboMonkey: Scaling test-time sampling and verification for vision- language-action models.arXiv preprint arXiv:2506.17811, 2025

Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirho- seini, and Marco Pavone. RoboMonkey: Scaling test-time sampling and verification for vision- language-action models.arXiv preprint arXiv:2506.17811, 2025

arXiv 2025

[18] [18]

Policy agnostic RL: Offline RL and online RL fine- tuning of any class and backbone

Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy agnostic RL: Offline RL and online RL fine- tuning of any class and backbone. InInternational Conference on Learning Representations, 2025

2025

[19] [19]

Efficient diffusion policies for offline reinforcement learning

Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning. InAdvances in Neural Information Processing Systems, 2023

2023

[20] [20]

Diffusion policies creating a trust region for offline reinforcement learning

Tianyu Chen, Zhendong Wang, and Mingyuan Zhou. Diffusion policies creating a trust region for offline reinforcement learning. InAdvances in Neural Information Processing Systems, 2024

2024

[21] [21]

Learning a diffusion model policy from rewards via Q-score matching

Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via Q-score matching. InInternational Conference on Machine Learning, 2024

2024

[22] [22]

Verifier- free test-time sampling for vision language action models

Suhyeok Jang, Dongyoung Kim, Changyeon Kim, Youngsuk Kim, and Jinwoo Shin. Verifier- free test-time sampling for vision language action models. InInternational Conference on Learning Representations, 2026

2026

[23] [23]

Fine- tuning of continuous-time diffusion models as entropy-regularized control.arXiv preprint arXiv:2402.15194, 2024

Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani, and Sergey Levine. Fine- tuning of continuous-time diffusion models as entropy-regularized control.arXiv preprint arXiv:2402.15194, 2024

arXiv 2024

[24] [24]

Fine-tuning of diffusion models via stochastic control: en- tropy regularization and beyond.arXiv preprint arXiv:2403.06279, 2024

Wenpin Tang and Fuzhong Zhou. Fine-tuning of diffusion models via stochastic control: en- tropy regularization and beyond.arXiv preprint arXiv:2403.06279, 2024

arXiv 2024

[25] [25]

Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky T. Q. Chen. Adjoint match- ing: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. InInternational Conference on Learning Representations, 2025

2025

[26] [26]

Q-learning with adjoint matching.arXiv preprint arXiv:2601.14234, 2026

Qiyang Li and Sergey Levine. Q-learning with adjoint matching.arXiv preprint arXiv:2601.14234, 2026

Pith/arXiv arXiv 2026

[27] [27]

Reinforcement learning and control as probabilistic inference: Tutorial and review.arXiv preprint arXiv:1805.00909, 2018

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review.arXiv preprint arXiv:1805.00909, 2018

Pith/arXiv arXiv 2018

[28] [28]

RL token: Bootstrapping online RL with vision-language-action mod- els

Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, and Liyiming Ke. RL token: Bootstrapping online RL with vision-language-action mod- els. Physical Intelligence whitepaper; arXiv:2604.23073, 2026. URLhttps://www.pi. website/download/rlt.pdf

Pith/arXiv arXiv 2026

[29] [29]

Reinforcement learning with action chunking

Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement learning with action chunking. InAdvances in Neural Information Processing Systems, 2025

2025

[30] [30]

Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning

Mitsuhiko Nakamoto, Yuexiang Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. InAdvances in Neural Information Processing Systems, 2023

2023

[31] [31]

LIBERO: Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. InAdvances in Neural Information Processing Systems, 2023. 11

2023

[32] [32]

RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025

[33] [33]

Steering your generalists: Improving robotic foundation models via value guidance

Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. InConference on Robot Learning, 2024

2024

[34] [34]

Policy representation via diffusion probability model for reinforcement learning.arXiv preprint arXiv:2305.13122, 2023

Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy representation via diffusion probability model for reinforcement learning.arXiv preprint arXiv:2305.13122, 2023

arXiv 2023

[35] [35]

Trust region q adjoint matching.arXiv preprint arXiv:2605.27079, 2026

Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, and Jinwoo Shin. Trust region q adjoint matching.arXiv preprint arXiv:2605.27079, 2026

Pith/arXiv arXiv 2026

[36] [36]

Horizon reduction makes RL scalable

Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, and Sergey Levine. Horizon reduction makes RL scalable. InAdvances in Neural Information Processing Systems, 2025. 12 A Critic Architecture and Training Details Critic architecture.The critic state is¯s= LayerNorm([z rl ∥W p p])∈R 2304, wherez rl ∈R 2048 is the cached RLT encodin...

2025