
arxiv: 2606.05468 · v1 · pith:4ECJ7NSInew · submitted 2026-06-03 · 💻 cs.RO

FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization

Pith reviewed 2026-06-28 05:36 UTC · model grok-4.3

classification 💻 cs.RO
keywords flow-matching · vision-language-action · preference optimization · reward-free fine-tuning · bimanual manipulation · proximal regularization · robot learning · offline RL

The pith

FlowPRO fine-tunes flow-matching VLAs on real robots from preference pairs collected via human interventions alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reward-free reinforced fine-tuning is possible for flow-matching vision-language-action (VLA) models by replacing reward design with a tailored preference-optimization objective. A new loss, RPRO, adds a proximal regularizer to a contrastive optimizer so that the implicit reward stays anchored in magnitude and reward hacking is avoided. On the data side, a single operator produces paired successful and failed trajectories through intervention and rollback; a Smooth Interpolation step then spreads those sparse signals across states, while batch mixing keeps the base policy's capabilities intact. Experiments on four long-horizon bimanual tasks show FlowPRO reaching the highest success rates and outperforming four baselines, with ablations confirming that each loss component contributes.

Core claim

FlowPRO is a reward-free offline framework that applies RPRO (Robotic Flow-matching Proximalized Preference Optimization) to the flow-matching action head of VLA models. RPRO combines a contrastive preference term with an explicit proximal regularizer that anchors the absolute magnitude of the implicit reward. Paired trajectories are gathered on a physical robot by a single teleoperator who intervenes and rolls back on failure; Smooth Interpolation plus batch mixing converts the sparse pairs into dense per-state supervision without degrading base capabilities. On four long-horizon bimanual tasks the resulting policy records a higher success rate than four representative baselines.

What carries the argument

RPRO objective: a contrastive optimizer paired with an explicit proximal regularizer that anchors the absolute magnitude of the implicit reward inside the flow-matching action head.
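
As a rough illustration of how such an objective could be assembled, the sketch below combines a Flow-DPO-style implicit reward with a logistic contrastive term and a penalty on reward magnitude. The quadratic penalty, the function names `implicit_reward` and `anchored_preference_loss`, and the coefficients `beta` and `lam` are assumptions made for illustration; the abstract does not state the exact form of RPRO's regularizer, so this should not be read as the paper's loss.

```python
import math


def implicit_reward(loss_policy: float, loss_ref: float, beta: float) -> float:
    """Flow-DPO-style implicit reward for one action.

    Positive when the policy's flow-matching loss on the action is lower
    than the frozen reference model's loss on the same action.
    """
    return -beta * (loss_policy - loss_ref)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def anchored_preference_loss(r_w: float, r_l: float, lam: float) -> float:
    """Contrastive term plus a hypothetical magnitude anchor.

    r_w, r_l: implicit rewards of the preferred and rejected actions.
    lam:      weight of the anchor (assumed hyper-parameter).
    """
    # -log sigmoid(r_w - r_l): only the reward *difference* matters here.
    contrastive = softplus(-(r_w - r_l))
    # Assumed stand-in for the proximal regularizer: penalize the absolute
    # reward scale so both rewards cannot drift without bound.
    proximal = lam * (r_w ** 2 + r_l ** 2)
    return contrastive + proximal
```

The contrastive term alone is unchanged when both rewards shift by the same constant, which is why a purely contrastive loss cannot control reward scale; in this sketch the added penalty is what makes a pair of strongly negative rewards more costly than a pair near zero with the same margin.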

If this is right

  • Flow-matching VLAs can receive reinforced updates from offline preference data without any hand-crafted reward function.
  • A single operator's intervention-and-rollback actions suffice to generate training pairs that improve long-horizon bimanual performance.
  • Each component of the RPRO loss, including the contrastive term and the proximal regularizer, contributes measurably to final success rate.
  • The same pipeline outperforms standard SFT, DAgger, and plain Flow-DPO on the reported tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The intervention paradigm could be reused with other generative action heads that admit an implicit reward formulation.
  • Because the method stays fully offline, it may scale to larger robot fleets where online reward signals remain impractical.
  • The proximal anchoring step might generalize to other preference-optimization settings that suffer from reward-scale drift.

Load-bearing premise

The Smooth Interpolation procedure, combined with batch mixing, converts the sparse intervention corrections into dense per-state supervision while preserving the base policy's capabilities.
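
To make the premise concrete, the Figure 5 caption describes a synthetic positive chunk as a cubic Bézier bridge from a point M on the failed rollout to a point J on the successful teleoperation, followed by direct tracking of the successful trajectory. The sketch below is one possible reading of that description. The function names, the way the two inner control points `c1` and `c2` are supplied, and the split between bridge steps and tracking steps are all assumptions; the source does not specify them.

```python
from typing import List, Sequence

Point = Sequence[float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, u: float) -> List[float]:
    """Evaluate a cubic Bezier curve at parameter u in [0, 1]."""
    a = (1.0 - u) ** 3
    b = 3.0 * (1.0 - u) ** 2 * u
    c = 3.0 * (1.0 - u) * u ** 2
    d = u ** 3
    return [a * x0 + b * x1 + c * x2 + d * x3
            for x0, x1, x2, x3 in zip(p0, p1, p2, p3)]


def synthetic_positive_chunk(
    m: Point,                 # point M on the negative rollout
    c1: Point,                # inner control point near M (assumed input)
    c2: Point,                # inner control point near J (assumed input)
    tau_w: Sequence[Point],   # successful teleoperated trajectory
    j_idx: int,               # index of J within tau_w
    bridge_steps: int,        # number of chunk steps spent on the bridge
    horizon: int,             # total chunk length H
) -> List[List[float]]:
    """Bridge from M to J, then follow tau_w for the rest of the chunk."""
    j = tau_w[j_idx]
    # The bridge excludes M itself (u = 0) and ends exactly at J (u = 1).
    bridge = [cubic_bezier(m, c1, c2, j, k / bridge_steps)
              for k in range(1, bridge_steps + 1)]
    # Remaining steps track the successful trajectory after J.
    remaining = horizon - bridge_steps
    tail = [list(p) for p in tau_w[j_idx + 1: j_idx + 1 + remaining]]
    return bridge + tail
```

If `tau_w` ends before the chunk is full, this sketch simply returns a shorter chunk; how the paper handles that boundary case is not stated in the supplied text.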

What would settle it

Running the same four long-horizon bimanual tasks and finding that FlowPRO does not achieve the highest success rate, or that removing the proximal regularizer produces no measurable difference in performance.

Figures

Figures reproduced from arXiv: 2606.05468 by He Zhang, Junbo Tan, Xueqian Wang, Yihao Wu, Zhengyou Zhang.

Figure 1
Figure 1: Overview of the FlowPRO framework. (a) SFT Base Model: Stage 1 trains $\pi_\theta$ on $D_{\mathrm{SFT}}$. (b) Data Collection: operator-triggered rollback and operator teleoperation yield paired positive and negative trajectories $\tau^w$, $\tau^l$. (c) Preference Dataset: pairs are aggregated into $D^k_{\mathrm{pref}}$. (d) Smooth Interpolation: Bézier interpolation synthesizes the missing counterpart at action-chunk granularity (e.g., $M \to J$, $J \to N'$)…
Figure 2
Figure 2: Experimental setup for the four real-robot tasks: (a) cosmetic packaging (PACK), (b) pen-cap assembly (CAP), (c) USB insertion (USB), and (d) pencil-case packing (CASE). Due to space constraints, this figure shows only a simplified overview; the complete per-stage pipeline for each task is provided in …
Figure 3
Figure 3: Per-iteration success rates (SR) on the four real-robot tasks, with PI0.5 as the base policy. Iteration 0 corresponds to the shared SFT checkpoint. The corresponding curves with PI0 as the base policy are reported in …
Figure 4
Figure 4: Loss-component ablations of RPRO. (a) Implicit-reward dynamics during training: per-step rewards on positive actions $r^w$ (left) and negative actions $r^l$ (right) for RPRO, PRO, DPO+SFT, and DPO. (b) Task success rates under in-distribution (left) and out-of-distribution (right) initial conditions across loss-component variants. Here, out-of-distribution (OOD) refers to object initial positions sampled from …
Figure 5
Figure 5: Why Smooth Interpolation stays physically plausible, illustrated on the USB task. In both panels: red dashed = executed negative rollout $\tau^l$ (length $L$, ending at the socket rim); green solid = successful teleoperation $\tau^w$ (ending inside the socket); blue arc = one synthetic positive chunk of length $H$, a cubic Bézier bridge from $M \in \tau^l$ to $J \in \tau^w$ followed by direct tracking of $\tau^w$ up to $N'$. (a) Generic ca…
Figure 6
Figure 6: Hardware platform used for all real-robot experiments. The Dobot XTrainer bimanual setup includes two 6-DoF arms with parallel-jaw grippers, a top-down global RGB camera, and two wrist-mounted RGB cameras (one per arm).
G.2 Training Setup. This subsection details the compute resources, data scale, and all training hyper-parameters used to produce every model evaluated in the main paper. Both the SFT base policy…
Figure 7
Figure 7: Complete per-stage pipeline for the four real-robot tasks. Extended version of …
Figure 8
Figure 8: Per-iteration success rates (SR) on the four real-robot tasks, with PI0 as the base policy. Iteration 0 corresponds to the shared SFT checkpoint. Companion to …
Figure 9
Figure 9: User study on the teleoperated data-collection system. Stacked-bar distribution of 5-point Likert responses across the five evaluation dimensions. Larger agree/strongly agree (top, green/teal) regions indicate more favorable ratings.
SFT. Pure flow-matching regression on preferred actions $a^w$ only: $L_{\mathrm{SFT}}(\theta) = \mathbb{E}_{(s, a^w) \sim D}\, \ell_\theta(s, a^w)$
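
The SFT baseline quoted with Figure 9 is an expectation of a per-sample flow-matching loss $\ell_\theta(s, a^w)$, whose definition is not included in the supplied text. The sketch below shows one single-sample estimate of such a loss under the common linear-interpolation convention from the flow-matching literature, where the regression target is the difference between the data point and the noise sample. The time convention, the noise distribution, and the names `flow_matching_loss` and `velocity_fn` are assumptions; the paper's base models may use a different parameterization.

```python
import random
from typing import Callable, List

Vector = List[float]
# velocity_fn(x_t, t, state) -> predicted velocity, same length as x_t.
VelocityFn = Callable[[Vector, float, Vector], Vector]


def flow_matching_loss(
    velocity_fn: VelocityFn,
    state: Vector,
    action: Vector,
    rng: random.Random,
) -> float:
    """One Monte Carlo sample of a flow-matching regression loss."""
    t = rng.random()                                  # time in [0, 1)
    noise = [rng.gauss(0.0, 1.0) for _ in action]     # x_0 ~ N(0, I)
    # Straight-line path from noise (t = 0) to the action (t = 1).
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    # Target velocity along that path is constant: action - noise.
    target = [a - n for n, a in zip(noise, action)]
    pred = velocity_fn(x_t, t, state)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(action)
```

Averaging this quantity over preferred state-action pairs gives a sample estimate of the SFT objective above; a preference method would evaluate the same per-sample loss on both preferred and rejected actions.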
Original abstract

Post-training Vision-Language-Action (VLA) models into policies that can be reliably deployed on real robots remains a major bottleneck. SFT and DAgger exploit failure signals only indirectly, and reward-based RL is bottlenecked by the difficulty of real-world reward design and of training reliable critics. We present FlowPRO, a reward-free offline reinforced fine-tuning framework for flow-matching VLAs. Algorithmically, we propose RPRO (Robotic Flow-matching Proximalized Preference Optimization), a preference-optimization objective tailored to the flow-matching action head of VLA models. RPRO pairs a contrastive optimizer with an explicit proximal regularizer that anchors the absolute magnitude of the implicit reward, thereby eliminating the reward-hacking failure mode of plain Flow-DPO. On the data side, a teleoperated intervention-and-rollback paradigm produces naturally paired positive and negative trajectories $(\tau^w, \tau^l)$ on a real robot from a single operator action; a Smooth Interpolation procedure, combined with batch mixing, then converts these sparse corrections into dense per-state supervision while preserving the base policy's capabilities. On four long-horizon bimanual tasks, FlowPRO attains the highest success rate, outperforming four representative baselines, and ablations confirm the contribution of each loss component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces FlowPRO, a reward-free offline reinforced fine-tuning framework for flow-matching VLAs. It proposes the RPRO objective, which pairs a contrastive optimizer with an explicit proximal regularizer to anchor implicit reward magnitudes and avoid reward-hacking. Data collection uses a teleoperated intervention-and-rollback paradigm to generate paired positive/negative trajectories, converted to dense per-state supervision via Smooth Interpolation and batch mixing. The central empirical claim is that FlowPRO achieves the highest success rates on four long-horizon bimanual tasks, outperforming four baselines, with ablations confirming each loss component.

Significance. If the quantitative results hold with proper statistical support and baseline details, the work could meaningfully advance post-training of VLAs by offering a practical alternative to reward design and indirect failure exploitation in real-robot settings. The explicit proximal regularizer and intervention-based data pipeline are potentially useful contributions to preference optimization for flow-matching policies.

major comments (1)
  1. [Abstract] The claim that FlowPRO 'attains the highest success rate' on four tasks and that 'ablations confirm the contribution of each loss component' is presented without any numerical success rates, number of evaluation trials, error bars, baseline descriptions, or dataset sizes. This absence makes it impossible to judge whether the data support the central empirical claim.
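
To illustrate the kind of statistical support this comment asks for, a success rate from a fixed number of real-robot trials can be reported with a binomial confidence interval. The sketch below computes a Wilson score interval; the function name and the example counts in the usage note are hypothetical placeholders, not figures from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half
```

For example, `wilson_interval(18, 20)` describes a hypothetical policy that succeeds in 18 of 20 trials. With trial counts this small the interval is wide, which is why the number of evaluation trials matters when two methods are ranked by success rate.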

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We agree that the current abstract would be strengthened by including concrete numerical results and evaluation details to better support the central claims.

Point-by-point responses
  1. Referee: [Abstract] The claim that FlowPRO 'attains the highest success rate' on four tasks and that 'ablations confirm the contribution of each loss component' is presented without any numerical success rates, number of evaluation trials, error bars, baseline descriptions, or dataset sizes. This absence makes it impossible to judge whether the data support the central empirical claim.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we will update the abstract to report the success rates achieved by FlowPRO and the four baselines on each of the four tasks, the number of evaluation trials per task, and a brief reference to the error bars and ablation outcomes. These quantitative details already appear with full statistical support in the experimental section; adding a concise version to the abstract will make the empirical contribution immediately verifiable.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript presents FlowPRO/RPRO as an explicit algorithmic construction: a contrastive optimizer paired with a stated proximal regularizer term, plus an intervention-and-rollback data pipeline followed by Smooth Interpolation and batch mixing. No equations appear in the supplied text that define a quantity in terms of itself, rename a fitted parameter as a prediction, or reduce the central result to a self-citation chain. The empirical claims rest on reported success rates and ablations rather than any definitional equivalence or imported uniqueness theorem. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5772 in / 1016 out tokens · 27462 ms · 2026-06-28T05:36:24.522492+00:00 · methodology

