pith. machine review for the scientific record.

arxiv: 2605.12416 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: 2 theorem links


Aligning Flow Map Policies with Optimal Q-Guidance

Alessandra Russo, Avishek Joey Bose, Christos Ziakas

Pith reviewed 2026-05-13 04:59 UTC · model grok-4.3

classification 💻 cs.LG
keywords flow map policies · offline-to-online RL · Q-guidance · generative policies · trust-region optimization · flow matching · robotic manipulation · one-step inference

The pith

Flow map policies adapt to online RL via a derived closed-form optimal Q-guidance target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces flow map policies that learn arbitrary-size jumps across the dynamics of existing flow-based generative models to enable fast, often one-step, action generation. It frames online adaptation of an offline policy as a trust-region problem that improves the critic's Q-value while staying close to the original policy. Solving this problem exactly produces FLOW MAP Q-GUIDANCE (FMQ), a closed-form learning target the authors show is optimal under the stated constraint. The work also adds Q-GUIDED BEAM SEARCH for iterative refinement at inference time. Readers would care because the approach cuts the repeated simulation steps that make diffusion-style policies slow in sequential decision tasks while delivering measurable gains on robotic manipulation and locomotion benchmarks.

Core claim

Flow map policies are defined as a class of generative policies that learn to take arbitrary-size jumps, including single-step jumps, across the generative dynamics of flow-based policies. Offline-to-online adaptation is cast as a trust-region optimization problem whose objective is to increase the critic's Q-value subject to a proximity constraint on the policy. Solving this problem yields the closed-form FLOW MAP Q-GUIDANCE (FMQ) target, which the authors prove is optimal for the given constraint. The same framework introduces Q-GUIDED BEAM SEARCH, a stochastic sampler that combines renoising with beam search to refine actions iteratively at test time.
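As a reading aid, here is a minimal sketch of how such a Q-guided beam search could operate. It assumes a one-step flow map `flow_map(s, a_r, r)` that maps a partially noised action at level r back to an action, a critic `q_value(s, a)`, a trust-region displacement `trust_step(s, a)` (the FMQ projection), and a linear renoising path; all of these names and interfaces are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def q_guided_beam_search(s, act_dim, flow_map, q_value, trust_step,
                         n_beam=2, n_rounds=1, r=0.5, rng=None):
    """Sketch of QGBS: start from a one-step action, repeatedly renoise it
    into several intermediate states, denoise each with the one-step flow
    map, apply the trust-region Q-guidance displacement, and keep the
    candidate with the highest critic value."""
    rng = rng or np.random.default_rng()
    a0 = rng.standard_normal(act_dim)                   # a_0 ~ N(0, I)
    best = trust_step(s, flow_map(s, a0, 0.0))          # initial one-step candidate
    for _ in range(n_rounds):
        candidates = [best]
        for _ in range(n_beam):
            eps = rng.standard_normal(act_dim)
            a_r = (1.0 - r) * eps + r * best            # renoise along an assumed linear path
            candidates.append(trust_step(s, flow_map(s, a_r, r)))
        best = max(candidates, key=lambda a: q_value(s, a))
    return best
```

With n_rounds=0 this collapses to plain one-step generation plus the trust-region displacement, consistent with the reading that QGBS is an optional inference-time refinement on top of FMQ.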

What carries the argument

FLOW MAP Q-GUIDANCE (FMQ), the closed-form optimal learning target obtained by exactly solving the critic-guided trust-region problem that maximizes Q-value improvement while keeping the adapted flow map policy close to its offline predecessor.

Load-bearing premise

The flow dynamics must allow accurate learning of arbitrary-size jumps, and the trust-region problem with Q-value improvement must admit an exact closed-form solution without hidden approximations or data-dependent fitting.
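Spelled out in the notation used by the rebuttal and by the appendix fragment quoted under reference item [53] below (a reconstruction, so conditioning and signs may differ from the paper): the adaptation step solves max over u_{r,1} of Q_φ(s, a_1) subject to ||u_{r,1}(a_r|s) − u^off_{r,1}(a_r|s)||_2 ≤ η. The quoted stationarity condition gives λ* = ||∇_a Q_φ(s, a_1)||_2 / η, so the constraint is active and the optimal FMQ target is u*_{r,1}(a_r|s) = u^off_{r,1}(a_r|s) + (η / ||∇_a Q_φ(s, a_1)||_2) ∇_a Q_φ(s, a_1): a displacement of length η along the critic gradient, with no expectation or fitted quantity beyond the critic itself.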

What would settle it

An experiment in which the FMQ target produces no measurable Q-value increase or in which the closed-form expression deviates from the numerically solved optimum under the same trust-region constraint would falsify the central optimality claim.
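A concrete way to run the second half of that check, sketched under the assumption that the trust region is a Euclidean ball of radius η around the offline output (as in the reconstruction above); the toy linear critic, step sizes, and function names are illustrative, not taken from the paper.

```python
import numpy as np

def closed_form_step(u_ref, grad_q, eta):
    """Closed-form FMQ-style displacement: a length-eta step along the critic gradient."""
    return u_ref + eta * grad_q / (np.linalg.norm(grad_q) + 1e-12)

def numerical_step(u_ref, q_fn, eta, lr=1e-3, iters=5000, h=1e-5):
    """Projected gradient ascent on Q over the ball ||u - u_ref||_2 <= eta."""
    u = u_ref.copy()
    for _ in range(iters):
        grad = np.array([(q_fn(u + h * e) - q_fn(u - h * e)) / (2 * h)
                         for e in np.eye(u.size)])       # finite-difference gradient of Q
        u = u + lr * grad
        d = u - u_ref
        if np.linalg.norm(d) > eta:                      # project back onto the trust region
            u = u_ref + eta * d / np.linalg.norm(d)
    return u

rng = np.random.default_rng(0)
u_ref, w = rng.standard_normal(4), rng.standard_normal(4)
q_fn = lambda a: float(w @ a)                            # toy linear critic Q(a) = w . a
gap = np.linalg.norm(closed_form_step(u_ref, w, 0.3) - numerical_step(u_ref, q_fn, 0.3))
print("closed-form vs numerically solved optimum, gap:", gap)
```

For a linear critic the two coincide by construction; swapping in a trained, nonlinear critic is the interesting case, since any persistent gap would indicate the closed form relies on a local linearization of Q rather than holding exactly.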

Figures

Figures reproduced from arXiv: 2605.12416 by Alessandra Russo, Avishek Joey Bose, Christos Ziakas.

Figure 1: (Left) FMQ: the one-step flow map policy transports noise a_0 to action a_1; a trust-region projection then displaces a_1 to a_1* to maximize the Q-value. (Right) QGBS (M=1, B=2): renoising corrupts a_1* into B intermediate states a_{t′}, which the flow map policy then denoises into B candidate actions; candidates are updated via the optimal trust-region displacement to maximize Q_φ, and the highest-… view at source ↗
Figure 2: Training curves on 6 environments. Average success rate every 100K steps during 1M offline followed by 1M online steps, over 5 seeds. Shaded regions indicate 95% CIs. view at source ↗
Figure 3: Convergence speedup of FMQ compared to MVP at success targets (ξ), with 95% CIs. view at source ↗
Figure 4: Convergence during online training on 3 environments. Distance between the online and offline flow map policies (blue) converges to the adaptive radius η_eff (red) as the policy incorporates the Q-guidance. view at source ↗
Figure 5: can (Robomimic). Pick up a can from the table and place it into the bin. view at source ↗
Figure 6: square (Robomimic). Pick up a square nut and fit it onto a peg. view at source ↗
Figure 7: cube-double-task3 (OGBench). Rearrange 2 cubes to target positions. view at source ↗
Figure 8: cube-double-task4 (OGBench). Swap the positions of 2 cubes. view at source ↗
Figure 9: cube-triple-task3 (OGBench). Unstack 3 cubes and place them at separate target positions. view at source ↗
Figure 10: cube-triple-task4 (OGBench). Cyclically permute 3 cubes to new positions. view at source ↗
Figure 11: scene-task4 (OGBench). Unlock the drawer button, open the drawer, and place the cube inside. view at source ↗
Figure 12: scene-task5 (OGBench). Place the cube in the drawer and open the window. view at source ↗
Figure 13: humanoidmaze-medium-task3 (OGBench). Navigate a humanoid to a goal in a medium maze. view at source ↗
Figure 14: humanoidmaze-medium-task4 (OGBench). Navigate a humanoid to a distant goal in a medium maze. view at source ↗
Figure 15: antmaze-giant-task4 (OGBench). Navigate an ant to a goal across a giant maze. view at source ↗
Figure 16: antmaze-giant-task5 (OGBench). Navigate an ant to a nearby goal in a giant maze. view at source ↗
Figure 17: Offline-to-online learning curves for QC, MVP, and … view at source ↗
Figure 18: Trust-region convergence for FMQ (β=0.3) across all 12 environments. Blue: action displacement ||u^on_{r,1} − u^off_{r,1}||_2. Red dashed: adaptive trust-region radius η_eff. Orange dotted (right axis): implied Q-uncertainty σ̃_Q = (1/η_eff − 1)/β. view at source ↗
Figure 19: Convergence speedup of FMQ over MVP (T_MVP/T_FMQ) during the online phase (1M–2M steps). Each dot represents one environment; black diamonds show the mean with 95% CI. The dashed line marks equal convergence speed (1×). view at source ↗
read the original abstract

Generative policies based on expressive model classes, such as diffusion and flow matching, are well-suited to complex control problems with highly multimodal action distributions. Their expressivity, however, comes at a significant inference cost: generating each action typically requires simulating many steps of the generative process, compounding latency across sequential decision-making rollouts. We introduce flow map policies, a novel class of generative policies designed for fast action generation by learning to take arbitrary-size jumps, including one-step jumps, across the generative dynamics of existing flow-based policies. We instantiate flow map policies for offline-to-online reinforcement learning (RL) and formulate online adaptation as a trust-region optimization problem that improves the critic's Q-value while remaining close to the offline policy. We theoretically derive FLOW MAP Q-GUIDANCE (FMQ), a principled closed-form learning target that is optimal for adapting offline flow map policies under a critic-guided trust-region constraint. We further introduce Q-GUIDED BEAM SEARCH (QGBS), a stochastic flow-map sampler that combines renoising with beam search to enable iterative inference-time refinement. Across 12 challenging robotic manipulation and locomotion tasks from OGBench and RoboMimic, FMQ achieves state-of-the-art performance in offline-to-online RL, outperforming the previous one-step policy MVP by a relative improvement of 21.3% on the average success rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces flow map policies as a novel class of generative policies that learn to perform arbitrary-size jumps (including one-step) across the dynamics of existing flow-based models, enabling fast inference for complex control tasks. For offline-to-online RL, adaptation is cast as a trust-region problem that maximizes critic Q-value improvement while staying close to the offline policy; the authors derive FLOW MAP Q-GUIDANCE (FMQ) as a closed-form optimal learning target for this problem and introduce Q-GUIDED BEAM SEARCH (QGBS) for stochastic inference-time refinement. Experiments across 12 robotic manipulation and locomotion tasks from OGBench and RoboMimic report state-of-the-art results, with a 21.3% relative gain in average success rate over the prior one-step baseline MVP.

Significance. If the closed-form optimality of FMQ holds without hidden approximations, the work offers a principled way to achieve both expressivity and low-latency inference in generative RL policies, directly addressing a key practical bottleneck. The empirical gains on challenging multimodal tasks suggest immediate applicability in robotics. Credit is due for the new policy class and the attempt at an exact trust-region solution, which could influence future work on guided generative models if the derivation is fully substantiated.

major comments (2)
  1. [§3] §3 (theoretical derivation of FMQ): The central claim that the critic-guided trust-region problem admits an exact closed-form optimal solution for flow-map policies must be supported by explicit equations showing the optimization objective, the trust-region constraint, and the resulting target. It is unclear whether this solution remains exact once the flow map is learned via regression on jump sizes or when expectations over the generative process are involved; any data-dependent fitting or implicit approximation would contradict the 'closed-form' and 'optimal' assertions and undermine the theoretical contribution.
  2. [§4] §4 (flow map parameterization and dynamics): The assumption that learned flow-map dynamics support accurate arbitrary-size jumps while preserving the exact structure needed for the Q-guidance derivation is load-bearing. If the flow is parameterized such that one-step jumps or the Q-improvement objective require numerical solves or Monte-Carlo estimates, the optimality guarantee does not follow directly from the trust-region formulation.
minor comments (2)
  1. [Introduction] The abstract and introduction introduce several new acronyms (FMQ, QGBS) and entities without immediate comparison to prior flow-matching or diffusion-policy literature; a short related-work paragraph clarifying distinctions would improve readability.
  2. [§5] Experimental results report a 21.3% relative improvement but omit variance, number of seeds, or statistical tests; adding these in §5 would strengthen the empirical claims without altering the core contribution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the need for greater clarity on the theoretical foundations of FMQ. We address each major comment below with direct references to the manuscript's derivations and parameterization. We commit to revisions that make the equations and assumptions fully explicit while preserving the exactness of the closed-form result under the stated conditions.

read point-by-point responses
  1. Referee: [§3] §3 (theoretical derivation of FMQ): The central claim that the critic-guided trust-region problem admits an exact closed-form optimal solution for flow-map policies must be supported by explicit equations showing the optimization objective, the trust-region constraint, and the resulting target. It is unclear whether this solution remains exact once the flow map is learned via regression on jump sizes or when expectations over the generative process are involved; any data-dependent fitting or implicit approximation would contradict the 'closed-form' and 'optimal' assertions and undermine the theoretical contribution.

    Authors: Section 3 formulates the problem as the constrained optimization max_π E_{a ∼ π_θ} [Q(s, a)] subject to D_KL(π_θ || π_off) ≤ ε, where π_θ is the flow-map policy. We solve this exactly by substituting the flow-map jump parameterization into the objective and applying the method of Lagrange multipliers, yielding the closed-form FMQ target that shifts the regression target by a term proportional to ∇_a Q. The derivation is exact with respect to any fixed flow map; the regression step learns an approximation to the map but does not alter the optimality of the target once the map is given. Expectations over the generative process are eliminated analytically because the flow-map structure permits direct evaluation of the jump without sampling. We will revise §3 to include a boxed statement of the objective, constraint, and FMQ solution together with the full derivation steps. revision: partial

  2. Referee: [§4] §4 (flow map parameterization and dynamics): The assumption that learned flow-map dynamics support accurate arbitrary-size jumps while preserving the exact structure needed for the Q-guidance derivation is load-bearing. If the flow is parameterized such that one-step jumps or the Q-improvement objective require numerical solves or Monte-Carlo estimates, the optimality guarantee does not follow directly from the trust-region formulation.

    Authors: The flow map is parameterized as a direct regressor from (state, jump size, noise) to the jumped state, trained on integrated trajectories from the base flow model; this permits exact one-step evaluation by setting the jump size to the full horizon without any numerical integration or Monte-Carlo sampling at inference or guidance time. The Q-guidance modifies only the regression target in closed form and does not invoke additional solves. The optimality therefore holds exactly for the learned parameterization. We will add a clarifying paragraph and an appendix figure in the revision that explicitly shows the absence of numerical solves in both the one-step jump and the FMQ target computation. revision: partial
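To make the two responses concrete, here is a minimal PyTorch-style sketch of (a) the one-step action generation described above and (b) a regression loss onto a displaced FMQ-style target. The network signature `flow_map_net(s, a_r, r, t)`, the adaptive radius `eta_eff`, and the full-horizon jump convention are assumptions of this sketch, not the paper's exact interfaces.

```python
import torch

@torch.no_grad()
def one_step_action(s, flow_map_net, act_dim):
    """a_1 = a_0 + u_theta_{0,1}(a_0 | s) with a_0 ~ N(0, I): a single full-horizon
    jump, with no ODE integration or Monte-Carlo sampling at inference time."""
    a0 = torch.randn(act_dim)
    return a0 + flow_map_net(s, a0, r=torch.zeros(()), t=torch.ones(()))

def fmq_loss(s, a_r, r, online_net, offline_net, critic, eta_eff):
    """Regress the online flow map onto the offline output displaced by a
    length-eta_eff step along the critic gradient (the closed form reconstructed
    earlier on this page)."""
    with torch.no_grad():
        u_off = offline_net(s, a_r, r=r, t=torch.ones(()))
    a1 = (a_r + u_off).detach().requires_grad_(True)     # candidate action a_1
    q = critic(s, a1).sum()
    grad_q, = torch.autograd.grad(q, a1)                 # grad_a Q_phi(s, a_1)
    direction = grad_q / (grad_q.norm(dim=-1, keepdim=True) + 1e-8)
    target = (u_off + eta_eff * direction).detach()      # displaced, fixed target
    u_on = online_net(s, a_r, r=r, t=torch.ones(()))
    return ((u_on - target) ** 2).sum(-1).mean()
```

Nothing in this sketch requires solving an inner optimization per update, which is the property the authors lean on when arguing the target stays closed form.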

Circularity Check

0 steps flagged

No significant circularity detected in FMQ derivation

full rationale

The paper presents a theoretical derivation of FLOW MAP Q-GUIDANCE (FMQ) as the closed-form solution to a critic-guided trust-region optimization problem over flow-map jumps. This is a standard RL-style derivation in which the optimal target is obtained by solving the stated objective (maximize Q improvement subject to policy closeness) under the given flow dynamics. No equations or steps reduce the claimed optimality to a fitted parameter, self-citation, or input by construction; the result follows from the math of the trust-region problem rather than from re-labeling data-dependent fits. The derivation is self-contained under the paper's own assumption of exact closed-form solvability, warranting a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

Abstract-only view reveals several new entities and assumptions introduced without independent evidence or full justification; free parameters are not mentioned but are likely implicit in the flow-model training.

axioms (2)
  • domain assumption Existing flow-based policies admit extension to arbitrary-size jumps that can be learned directly
    Foundational for defining flow map policies
  • ad hoc to paper The critic-guided trust-region problem for policy adaptation possesses a closed-form optimal solution
    Directly supports the FMQ derivation claim
invented entities (3)
  • Flow map policies no independent evidence
    purpose: Generative policies that learn large jumps across flow dynamics for fast inference
    New policy class introduced to address inference cost
  • FLOW MAP Q-GUIDANCE (FMQ) no independent evidence
    purpose: Closed-form optimal learning target for offline-to-online adaptation
    Derived target claimed to be principled and optimal
  • Q-GUIDED BEAM SEARCH (QGBS) no independent evidence
    purpose: Stochastic sampler combining renoising and beam search for inference refinement
    New inference-time technique for iterative improvement

pith-pipeline@v0.9.0 · 5541 in / 1564 out tokens · 80170 ms · 2026-05-13T04:59:09.433662+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 15 internal anchors

  1. [1]

    Agarwal, M

    R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare. Deep reinforcement learning at the edge of the statistical precipice.Advances in neural information processing systems, 34:29304–29320, 2021.(Cited on page

  2. [2]

    URL https://arxiv.org/abs/2303.08797.(Cited on pages 1 and

  3. [3]

    URL https://arxiv.org/abs/2505.18825.(Cited on page

  4. [4]

    T. Chen, Z. Wang, and M. Zhou. Diffusion policies creating a trust region for offline reinforcement learning.Advances in Neural Information Processing Systems, 37:50098–50125, 2024.(Cited on page

  5. [5]

    C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems, 2023.(Cited on page

  6. [6]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.(Cited on page

  7. [7]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    O.-E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, et al. Open x-embodiment: Robotic learning datasets and rt-x models.arXiv preprint arXiv:2310.08864, 1(2), 2023.(Cited on page

  8. [8]

    Ding and C

    Z. Ding and C. Jin. Consistency models as a rich and efficient policy class for reinforcement learning. arXiv preprint arXiv:2309.16984, 2023.(Cited on page

  9. [9]

    J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4rl: Datasets for deep data-driven reinforcement learning.arXiv preprint arXiv:2004.07219, 2020.(Cited on page

  10. [10]

    Fujimoto and S

    S. Fujimoto and S. S. Gu. A minimalist approach to offline reinforcement learning. InAdvances in Neural Information Processing Systems, volume 34, pages 20132–20145, 2021.(Cited on page

  11. [11]

    Fujimoto, D

    S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration. InInternational Conference on Machine Learning, pages 2052–2062. PMLR, 2019.(Cited on page

  12. [12]

    D. Garg, J. Hejna, M. Geist, and S. Ermon. Extreme q-learning: Maxent rl without entropy. In International Conference on Learning Representations, 2023.(Cited on page

  13. [13]

    URL https://arxiv.org/abs/2505.13447.(Cited on page

  14. [14]

    IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies.arXiv preprint arXiv:2304.10573, 2023.(Cited on page

  15. [15]

    L. He, L. Shen, and X. Wang. Aligniql: Policy alignment in implicit q-learning through constrained optimization. arXiv preprint arXiv:2405.18187, 2024. (Cited on page

  16. [16]

    Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

    P. Holderrieth, D. Chen, L. Eyring, I. Shah, G. Anantharaman, Y . He, Z. Akata, T. Jaakkola, N. M. Boffi, and M. Simchowitz. Diamond maps: Efficient reward alignment via stochastic flow maps. arXiv preprint arXiv:2602.05993, 2026.(Cited on page

  17. [17]

    Way off-policy batch deep reinforcement learning of implicit human preferences in dialog.arXiv preprint arXiv:1907.00456,

    N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog.arXiv preprint arXiv:1907.00456, 2019.(Cited on page

  18. [18]

    G. Kahn, A. Villaflor, P. Abbeel, and S. Levine. Composable action-conditioned predictors: Flexible off-policy learning for robot navigation. InConference on robot learning, pages 806–816. PMLR, 2018.(Cited on page

  19. [19]

    B. Kang, X. Ma, C. Du, T. Pang, and S. Yan. Efficient diffusion policies for offline reinforcement learning.Advances in Neural Information Processing Systems, 36:67195–67212, 2023.(Cited on page

  20. [20]

    Kostrikov, A

    I. Kostrikov, A. Nair, and S. Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022.(Cited on page

  21. [21]

    A. Kumar. Data-driven deep reinforcement learning.Berkeley Artificial Intelligence Research (BAIR), Tech. Rep, 2019.(Cited on page

  22. [22]

    Kumar, A

    A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative q-learning for offline reinforcement learning. InAdvances in Neural Information Processing Systems, volume 33, pages 1179–1191, 2020.(Cited on page

  23. [23]

    J. Lee, C. Paduraru, D. J. Mankowitz, N. Heess, D. Precup, K.-E. Kim, and A. Guez. Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. InConference on Robot Learning, pages 1602–1612. PMLR, 2022.(Cited on page

  24. [24]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020.(Cited on pages 1 and

  25. [25]

    Q. Li, Z. Zhou, and S. Levine. Reinforcement learning with action chunking.arXiv preprint arXiv:2507.07969, 2025.(Cited on pages 6, 9, and

  26. [26]

    Flow Matching for Generative Modeling

    Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.(Cited on pages 1 and

  27. [27]

    X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022.(Cited on pages 1 and

  28. [28]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation.arXiv preprint arXiv:2108.03298, 2021.(Cited on page

  29. [29]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    A. Nair, M. Dalal, et al. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020. (Cited on page

  30. [30]

    Nakamoto, Y

    M. Nakamoto, Y . Zhai, A. Singh, M. Radin, A. Kumar, C. Finn, and S. Levine. Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning. InAdvances in Neural Information Processing Systems, volume 36, 2023.(Cited on page

  31. [31]

    T. X. Nguyen and C. D. Yoo. One-step flow q-learning: Addressing the diffusion policy bottleneck in offline reinforcement learning. InThe Fourteenth International Conference on Learning Representations, 2026.(Cited on page

  32. [32]

    S. Park, K. Frans, B. Eysenbach, and S. Levine. Ogbench: Benchmarking offline goal-conditioned rl. arXiv preprint arXiv:2410.20092, 2024. (Cited on pages 6 and

  33. [33]

    S. Park, Q. Li, and S. Levine. Flow q-learning. InInternational Conference on Machine Learning (ICML), 2025.(Cited on pages 2 and

  34. [34]

    Pearce et al.

    T. Pearce, T. Rashid, A. Kanervisto, D. Bignell, M. Sun, R. Georgescu, S. V. Macua, S. Z. Tan, I. Momennejad, K. Hofmann, and S. Devlin. Imitating human behaviour with diffusion models. ArXiv, abs/2301.10677,

  35. [35]

    Peluchetti

    S. Peluchetti. Non-denoising forward-time diffusions.arXiv preprint arXiv:2312.14589, 2023.(Cited on page

  36. [36]

    Potaptchik, A

    P. Potaptchik, A. Saravanan, A. Mammadov, A. Prat, M. S. Albergo, and Y . W. Teh. Meta flow maps enable scalable reward alignment.arXiv preprint arXiv:2601.14430, 2026.(Cited on page

  37. [37]

    Sohl-Dickstein, E

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015.(Cited on page

  38. [38]

    Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020.(Cited on page

  39. [39]

    Y . Song, Y . Zhou, A. Sekhari, J. A. Bagnell, A. Krishnamurthy, and W. Sun. Hybrid rl: Using both offline and online data can make rl efficient. InInternational Conference on Learning Representations, 2023.(Cited on page

  40. [40]

    MIT press Cambridge, 1998.(Cited on page

  41. [41]

    Tarasov, V

    D. Tarasov, V . Kurenkov, A. Nikulin, and S. Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning.Advances in Neural Information Processing Systems, 36:11592–11620, 2023.(Cited on page

  42. [42]

    A. Tong, K. Fatras, N. Malkin, G. Huguet, Y . Zhang, J. Rector-Brooks, G. Wolf, and Y . Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport.arXiv preprint arXiv:2302.00482, 2023.(Cited on page

  43. [43]

    Z. Wang, J. J. Hunt, and M. Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning.arXiv preprint arXiv:2208.06193, 2022.(Cited on page

  44. [44]

    Z. Wang, J. J. Hunt, and M. Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. InInternational Conference on Learning Representations, 2023.(Cited on page

  45. [45]

    Y . Wu, G. Tucker, and O. Nachum. Behavior regularized offline reinforcement learning.arXiv preprint arXiv:1911.11361, 2019.(Cited on page

  46. [46]

    L. Yang, Z. Huang, F. Lei, Y . Zhong, Y . Yang, C. Fang, S. Wen, B. Zhou, and Z. Lin. Policy represen- tation via diffusion probability model for reinforcement learning.arXiv preprint arXiv:2305.13122, 2023.(Cited on page

  47. [47]

    T. Yu, G. Thomas, L. Yu, S. Ermon, J. Zou, S. Levine, C. Finn, and T. Ma. Mopo: Model-based offline policy optimization. InAdvances in Neural Information Processing Systems, volume 33, pages 14129–14142, 2020.(Cited on page

  48. [48]

    G. Zhan, Y . Jiang, S. E. Li, Y . Lyu, X. Zhang, and Y . Yin. A transformation-aggregation frame- work for state representation of autonomous driving systems.IEEE Transactions on Intelligent Transportation Systems, 25(7):7311–7322, 2024.(Cited on page

  49. [49]

    G. Zhan, X. An, Y . Jiang, J. Duan, H. Zhao, and S. E. Li. Physics informed neural pose estimation for real-time shape reconstruction of soft continuum robots.IEEE Robotics and Automation Letters, 2025.(Cited on page

  50. [50]

    G. Zhan, L. Tao, P. Wang, Y. Wang, Y. Li, Y. Chen, H. Li, M. Tomizuka, and S. E. Li. Mean flow policy with instantaneous velocity constraint for one-step action generation. arXiv preprint arXiv:2602.13810, 2026. (Cited on pages 2, 4, 5, 7, 9, 15, and

  51. [51]

    Zhang, Z

    R. Zhang, Z. Luo, J. Sjölund, T. B. Schön, and P. Mattsson. Entropy-regularized diffusion policy with q-ensembles for offline reinforcement learning.Advances in neural information processing systems, 37:98871–98897, 2024.(Cited on page

  52. [52]

    Z. Zhu, H. Zhao, H. He, Y . Zhong, S. Zhang, H. Guo, T. Chen, and W. Zhang. Diffusion models for reinforcement learning: A survey.arXiv preprint arXiv:2311.01223, 2023.(Cited on page

  53. [53]

    + λ* (u*_{r,1}(a_r|s) − u^ref_{r,1}(a_r|s)) = 0 (Stationarity) (23). Assuming ∇_a Q_φ(s, a_1) ≠ 0, stationarity equation 23 requires λ* > 0. Complementary slackness equation 22 then forces the constraint to be active: ||u*_{r,1}(a_r|s) − u^ref_{r,1}(a_r|s)||_2 = η (24). Taking the norm of the stationarity condition gives λ*η = ||∇_a Q_φ(s, a_1)||_2, so λ* = ||∇_a Q_φ(s, a_1)||_2/η. Subst...

  54. [54]

    We now leverage and rewrite eq. (30) succinctly: ∇_θ L(θ) = 2 E[ ∇_θ u^θ_{r,t}(a_r|s)^T ( u^θ_{r,t}(a_r|s) − sg[ u^θ_{r,t}(a_r|s) + (t−r) d/dr u_{r,t}(a_r|s) ] ) ]. This loss gradient matches the Mean Flow Policies' loss gradient, with the main distinction being the use of the ground-truth velocity v*_r as opposed to the network's prediction u^θ_{r,r}. Furthermore, the instantaneous v...

  55. [55]

    NFE = M(1 + KB), where M is the number of initial candidates, K the number of renoising steps, and B the number of completions per candidate; K = 0 reduces to standard best-of-M. The per-environment results confirm that the gains from renoising (K = 1) are consistent across task domains—manipulation, multi-object rearrangement, and locomotion—with the most n...

  56. [56]

    QC [Li et al., 2025] trains a standard CFM velocity field v_θ(a_t, t|s) with the straight-line interpolation objective. At inference, the ODE is integrated from t=0 to t=1 with 10 Euler steps. Table 7: Inference procedures (NFE = network forward evaluations per action). QC: best-of-N (Euler), 10 steps, N=32, NFE=320; MVP: best-of...
    QC [Li et al., 2025] trains a standard CFM velocity field vθ(at, t|s) with the straight-line interpo- lation objective. At inference, the ODE is integrated from t=0 to t=1 with 10 Euler steps, producing 22 Table 7: Inference procedures. NFE = network forward evaluations per action. Method Action selection StepsNNFE QC Best-of-N(Euler)10 32 320 MVP Best-of...