Mean Flow Policy Optimization
Pith reviewed 2026-05-10 11:11 UTC · model grok-4.3
The pith
Mean Flow Policy Optimization uses few-step flow models to represent RL policies, matching diffusion performance while cutting training and inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Representing policies as MeanFlow models and optimizing them via soft policy iteration under the maximum entropy RL framework produces policies whose performance on standard continuous-control benchmarks equals or exceeds that of diffusion-based methods while substantially lowering both training and inference cost.
What carries the argument
MeanFlow models, a class of few-step flow-based generative models serving as the policy class, combined with maximum-entropy soft policy iteration adapted for action-likelihood evaluation and soft improvement.
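The soft policy iteration machinery can be anchored with a small sketch. The entropy-augmented Bellman target below follows the standard soft actor-critic form; `alpha`, `gamma`, and the toy values are illustrative assumptions, not the paper's settings.

```python
# A minimal sketch of the soft Bellman target used in maximum-entropy RL
# (soft actor-critic form). All constants here are illustrative, not the
# paper's hyperparameters.

def soft_bellman_target(reward, next_q, next_log_prob,
                        alpha=0.2, gamma=0.99, done=False):
    """y = r + gamma * (Q(s', a') - alpha * log pi(a'|s')) for non-terminal s'."""
    soft_value = next_q - alpha * next_log_prob  # entropy-regularized value
    return reward + (0.0 if done else gamma * soft_value)

# A more stochastic policy (lower log-probability) yields a higher target,
# which is the exploration bonus the maximum-entropy framework provides.
y_low_entropy = soft_bellman_target(1.0, 5.0, next_log_prob=-0.1)
y_high_entropy = soft_bellman_target(1.0, 5.0, next_log_prob=-2.0)
```

This is why tractable action log-likelihoods are load-bearing: the `next_log_prob` term must be evaluated under the generative policy itself.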
If this is right
- Expressive policy classes in RL need not incur the full iterative cost of diffusion if few-step flow alternatives exist.
- The maximum-entropy framework can be applied to generative-model families other than diffusion without losing its theoretical guarantees.
- Reducing the number of sampling steps in the policy directly translates into faster online RL training loops.
- Once action likelihoods and soft improvement are tractable, any few-step generative model becomes a candidate for entropy-regularized policy optimization.
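The third bullet can be made concrete: a MeanFlow-style model predicts an average velocity over an interval, so one step can stand in for many Euler steps. A toy sketch under a linear probability path, where the instantaneous velocity is constant (the function names and the path are illustrative assumptions, not the paper's construction):

```python
# Toy comparison: many-step Euler sampling with an instantaneous velocity
# vs. one step with the average velocity (the MeanFlow idea). The linear
# path z_t = (1 - t) * x0 + t * x1 is an illustrative assumption.

def instantaneous_velocity(z, t, x0, x1):
    # For the linear path, the velocity dz/dt = x1 - x0 is constant.
    return x1 - x0

def euler_sample(x1, x0, steps):
    """Integrate from t = 1 (noise) back to t = 0 (action) in `steps` steps."""
    z, t, dt = x1, 1.0, 1.0 / steps
    for _ in range(steps):
        z -= dt * instantaneous_velocity(z, t, x0, x1)
        t -= dt
    return z

def meanflow_sample(x1, x0):
    """One step with the average velocity over [0, 1]."""
    u = x1 - x0  # the average of a constant velocity is that velocity
    return x1 - 1.0 * u
```

For curved paths the two no longer coincide after one Euler step; closing that gap with a learned average-velocity model is what makes few-step sampling viable.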
Where Pith is reading between the lines
- The same few-step flow construction could be tried in offline RL or model-based settings where repeated policy evaluation is the dominant cost.
- If the efficiency advantage persists at scale, complex policies could be deployed on hardware with tighter latency budgets than current diffusion methods allow.
- Hybrid approaches that combine MeanFlow with existing acceleration tricks such as distillation or consistency models remain unexplored in the paper.
Load-bearing premise
The two MeanFlow-specific obstacles of action likelihood evaluation and soft policy improvement can be solved without introducing instabilities or bias that would undermine the maximum-entropy guarantees.
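One standard route to the likelihood half of this premise is the instantaneous change-of-variables formula for flows, with the divergence of the velocity field estimated by Hutchinson's trace trick (the route suggested by entries [32]-[34] in the reference graph below). Whether MFPO uses exactly this estimator is an assumption; the linear field in this sketch is a toy.

```python
import numpy as np

# Hutchinson trace estimation of div v = trace(J) for a toy linear
# velocity field v(z) = A z. In a flow policy, d/dt log p(z_t) = -div v,
# so an unbiased divergence estimate yields an unbiased log-likelihood
# integrand. The matrix A is an illustrative assumption.

rng = np.random.default_rng(0)
A = np.array([[0.5, 0.1],
              [0.0, 0.3]])  # Jacobian of v(z) = A z

def hutchinson_divergence(jacobian, n_probes=2000):
    """E[eps^T J eps] over Rademacher probes eps is an unbiased trace estimate."""
    d = jacobian.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_probes, d))
    return float(np.mean(np.einsum("ni,ij,nj->n", eps, jacobian, eps)))

estimate = hutchinson_divergence(A)
exact = float(np.trace(A))
```

The estimator is unbiased in expectation, but its Monte Carlo variance is exactly the kind of noise that could destabilize the entropy term, which is why the premise above is load-bearing.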
What would settle it
A set of runs on MuJoCo or DeepMind Control Suite in which MFPO either underperforms the diffusion baselines by a clear margin or shows no substantial reduction in training and inference wall-clock time would falsify the central claim.
Original abstract
Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/MFPolicy/MFPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mean Flow Policy Optimization (MFPO), which represents RL policies via MeanFlow (few-step flow-based) generative models and optimizes them under the maximum-entropy objective using soft policy iteration. It claims to resolve two MeanFlow-specific challenges—action likelihood evaluation and soft policy improvement—thereby achieving performance on MuJoCo and DeepMind Control Suite benchmarks that is comparable to or better than diffusion-based baselines while substantially lowering training and inference time. Code is released.
Significance. If the MeanFlow-specific implementations of likelihood evaluation and policy improvement are shown to be unbiased and to preserve the fixed-point guarantees of soft policy iteration, the approach would provide a practical efficiency improvement over diffusion policies without sacrificing the theoretical benefits of maximum-entropy RL. The public code release is a clear strength for reproducibility.
Major comments (2)
- [Method section (action likelihood evaluation)] The manuscript does not supply the explicit estimator or derivation for the action log-likelihood under the MeanFlow policy (referenced in the abstract and the method section). Without this, it is impossible to confirm that the entropy term remains unbiased, which is load-bearing for the claim that soft policy iteration converges to the true soft-optimal policy.
- [Method section (soft policy improvement)] No analysis or fixed-point argument is given for the soft policy improvement operator when applied to the few-step MeanFlow parameterization (abstract and method section). Any approximation in the probability path or velocity field could introduce bias into the KL penalty, undermining the theoretical justification for the reported benchmark gains.
Minor comments (1)
- [Experiments] The number of flow steps and the precise form of the velocity field used in the MeanFlow policy should be stated explicitly in the experimental setup for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the careful reading and valuable comments on the theoretical underpinnings of Mean Flow Policy Optimization. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and derivations.
Point-by-point responses
- Referee: [Method section (action likelihood evaluation)] The manuscript does not supply the explicit estimator or derivation for the action log-likelihood under the MeanFlow policy (referenced in the abstract and the method section). Without this, it is impossible to confirm that the entropy term remains unbiased, which is load-bearing for the claim that soft policy iteration converges to the true soft-optimal policy.
  Authors: We agree that an explicit derivation of the action log-likelihood estimator is required to rigorously establish unbiasedness of the entropy term. Section 3.2 describes the Monte Carlo estimation procedure based on the MeanFlow probability path, but the full mathematical steps were not expanded for brevity. In the revised manuscript we will add a dedicated appendix containing the complete derivation, showing that the estimator is unbiased for the few-step MeanFlow parameterization and therefore preserves the fixed-point properties of soft policy iteration under the maximum-entropy objective. Revision: yes.
- Referee: [Method section (soft policy improvement)] No analysis or fixed-point argument is given for the soft policy improvement operator when applied to the few-step MeanFlow parameterization (abstract and method section). Any approximation in the probability path or velocity field could introduce bias into the KL penalty, undermining the theoretical justification for the reported benchmark gains.
  Authors: We acknowledge that a formal fixed-point analysis of the soft policy improvement operator under the approximate MeanFlow parameterization is absent from the current manuscript. Section 3.3 outlines the practical adaptation that uses few-step sampling and an approximated KL divergence, but does not supply a contraction-mapping argument. In the revision we will include a new subsection providing a theoretical discussion: we will show that, under the assumption that the trained MeanFlow model converges to the target distribution (as enforced by the training loss), the bias in the KL penalty vanishes asymptotically and the operator retains the essential contraction property of standard soft policy iteration. Empirical support from the MuJoCo and DeepMind Control Suite results will be referenced to illustrate that any residual approximation error does not prevent convergence to high-performing policies. Revision: yes.
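The "approximated KL divergence" in this response can be made concrete with a plain Monte Carlo estimator, KL(p || q) ≈ mean over a ~ p of log p(a) - log q(a). A toy sketch with 1-D Gaussian policies, checked against the closed form (none of this is the paper's actual parameterization):

```python
import math
import random

# Monte Carlo KL estimate between two toy 1-D Gaussian "policies".
# KL(N(mu_p, s^2) || N(mu_q, s^2)) = (mu_p - mu_q)^2 / (2 s^2) in closed
# form, which lets us check the estimator on this toy case.

random.seed(0)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def mc_kl(mu_p, mu_q, sigma=1.0, n=20000):
    """Sample from p and average the log-density ratio log p(a) - log q(a)."""
    total = 0.0
    for _ in range(n):
        a = random.gauss(mu_p, sigma)
        total += log_normal_pdf(a, mu_p, sigma) - log_normal_pdf(a, mu_q, sigma)
    return total / n

closed_form = 0.5  # KL(N(0,1) || N(1,1)) = (0 - 1)^2 / 2
estimate = mc_kl(0.0, 1.0)
```

The estimator is unbiased when sampling from the exact policy; the referee's concern is precisely that sampling from an approximate few-step path can reintroduce bias into this average.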
Circularity Check
No significant circularity; derivation builds on external frameworks
Full rationale
The paper adapts the standard soft policy iteration algorithm from maximum-entropy RL to MeanFlow policies and states that it solves the two MeanFlow-specific challenges of likelihood evaluation and policy improvement. No equations or claims are presented that reduce the performance claims, the soft Q-function fixed point, or the reported benchmark results to quantities defined only by the authors' own fitted constants, self-referential definitions, or a chain of their prior unverified results. The experimental comparisons to diffusion baselines on MuJoCo and DeepMind Control Suite therefore constitute independent evidence rather than a tautology.
Reference graph
Works this paper leans on
- [1] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
- [2] Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR
- [3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
- [4] Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, pp. 1861–1870 (2018). PMLR
- [5] Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR
- [6] Yang, L., Huang, Z., Lei, F., Zhong, Y., Yang, Y., Fang, C., Wen, S., Zhou, B., Lin, Z.: Policy representation via diffusion probability model for reinforcement learning. arXiv preprint arXiv:2305.13122 (2023)
- [7] Wang, Y., Wang, L., Jiang, Y., Zou, W., Liu, T., Song, X., Wang, W., Xiao, L., Wu, J., Duan, J., et al.: Diffusion actor-critic with entropy regulator. Advances in Neural Information Processing Systems 37, 54183–54204 (2024)
- [8] Ding, S., Hu, K., Zhang, Z., Ren, K., Zhang, W., Yu, J., Wang, J., Shi, Y.: Diffusion-based reinforcement learning via Q-weighted variational policy optimization. Advances in Neural Information Processing Systems 37, 53945–53968 (2024)
- [9] Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447 (2025)
- [10] Celik, O., Li, Z., Blessing, D., Li, G., Palenicek, D., Peters, J., Chalvatzaki, G., Neumann, G.: DIME: Diffusion-based maximum entropy reinforcement learning. In: International Conference on Machine Learning (2025)
- [11] Dong, X., Cheng, J., Zhang, X.S.: Maximum entropy reinforcement learning with diffusion policy. In: International Conference on Machine Learning (2025)
- [12] Ma, H., Chen, T., Wang, K., Li, N., Dai, B.: Efficient online reinforcement learning for diffusion policy. In: International Conference on Machine Learning (2025)
- [13] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning, pp. 2256–2265 (2015). PMLR
- [14] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)
- [15] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
- [16] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog 1(8), 1 (2024)
- [17] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
- [18] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32 (2019)
- [19] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (2021)
- [20] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
- [21] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: International Conference on Learning Representations (2023)
- [22] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: International Conference on Learning Representations (2022)
- [23] Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. In: European Conference on Computer Vision, pp. 87–103 (2024). Springer
- [24] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems 35, 5775–5787 (2022)
- [25] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 1–22 (2025)
- [26] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: International Conference on Machine Learning, pp. 32211–32252 (2023). PMLR
- [27] Song, Y., Dhariwal, P.: Improved techniques for training consistency models. In: International Conference on Learning Representations (2024)
- [28] Frans, K., Hafner, D., Levine, S., Abbeel, P.: One step diffusion via shortcut models. In: International Conference on Learning Representations (2025)
- [29] Ding, S., Hu, K., Zhong, S., Luo, H., Zhang, W., Wang, J., Wang, J., Shi, Y.: GenPO: Generative diffusion models meet on-policy reinforcement learning. Advances in Neural Information Processing Systems (2025)
- [30] Lv, L., Li, Y., Luo, Y., Sun, F., Kong, T., Xu, J., Ma, X.: Flow-based policy for online reinforcement learning. Advances in Neural Information Processing Systems (2025)
- [31] Psenka, M., Escontrela, A., Abbeel, P., Ma, Y.: Learning a diffusion model policy from rewards via Q-score matching. In: International Conference on Machine Learning, pp. 41163–41182 (2024). PMLR
- [32] Chen, R.T., Rubanova, Y., Bettencourt, J., Duvenaud, D.K.: Neural ordinary differential equations. Advances in Neural Information Processing Systems 31 (2018)
- [33] Skilling, J.: The eigenvalues of mega-dimensional matrices. In: Maximum Entropy and Bayesian Methods: Cambridge, England, 1988, pp. 455–466. Springer, Dordrecht (1989)
- [34] Hutchinson, M.F.: A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation 18(3), 1059–1076 (1989)
- [35] Kong, A.: A note on importance sampling using standardized weights. University of Chicago, Dept. of Statistics, Tech. Rep. 348, 14 (1992)
- [36] Metelli, A.M., Papini, M., Montali, N., Restelli, M.: Importance sampling techniques for policy optimization. Journal of Machine Learning Research 21(141), 1–75 (2020)
- [37] Bellemare, M.G., Dabney, W., Munos, R.: A distributional perspective on reinforcement learning. In: International Conference on Machine Learning, pp. 449–458 (2017). PMLR
- [38] Duan, J., Guan, Y., Li, S.E., Ren, Y., Sun, Q., Cheng, B.: Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors. IEEE Transactions on Neural Networks and Learning Systems 33(11), 6584–6598 (2021)
- [39] Mao, L., Xu, H., Zhan, X., Zhang, W., Zhang, A.: Diffusion-DICE: In-sample diffusion guidance for offline reinforcement learning. Advances in Neural Information Processing Systems 37, 98806–98834 (2024)
- [40] Espinosa-Dice, N., Zhang, Y., Chen, Y., Guo, B., Oertell, O., Swamy, G., Brantley, K., Sun, W.: Scaling offline RL via efficient and expressive shortcut models. Advances in Neural Information Processing Systems (2025)
- [41] Todorov, E., Erez, T., Tassa, Y.: MuJoCo: A physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033 (2012). IEEE
- [42] Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D.d.L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al.: DeepMind Control Suite. arXiv preprint arXiv:1801.00690 (2018)
- [43] Pearce, T., Rashid, T., Kanervisto, A., Bignell, D., Sun, M., Georgescu, R., Macua, S.V., Tan, S.Z., Momennejad, I., Hofmann, K., et al.: Imitating human behaviour with diffusion models. In: International Conference on Learning Representations (2023)
- [44] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025)
- [45] Chen, H., Lu, C., Ying, C., Su, H., Zhu, J.: Offline reinforcement learning via high-fidelity generative behavior modeling. In: International Conference on Learning Representations (2023)
- [46] Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J.G., Levine, S.: IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573 (2023)
- [47] Kang, B., Ma, X., Du, C., Pang, T., Yan, S.: Efficient diffusion policies for offline reinforcement learning. Advances in Neural Information Processing Systems 36, 67195–67212 (2023)
- [48] Zhang, S., Zhang, W., Gu, Q.: Energy-weighted flow matching for offline reinforcement learning. In: International Conference on Learning Representations (2025)
- [49] Wang, Z., Hunt, J.J., Zhou, M.: Diffusion policies as an expressive policy class for offline reinforcement learning. In: International Conference on Learning Representations (2023)
- [50] Chen, Y., Li, H., Zhao, D.: Boosting continuous control with consistency policy. In: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 335–344 (2024)
- [51] Ding, Z., Jin, C.: Consistency models as a rich and efficient policy class for reinforcement learning. In: International Conference on Learning Representations (2024)
- [52] Park, S., Li, Q., Levine, S.: Flow Q-learning. In: International Conference on Machine Learning (2025)
- [53] Lu, C., Chen, H., Chen, J., Su, H., Li, C., Zhu, J.: Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In: International Conference on Machine Learning, pp. 22825–22855 (2023). PMLR
- [54] Fang, L., Liu, R., Zhang, J., Wang, W., Jing, B.: Diffusion actor-critic: Formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning. In: International Conference on Learning Representations (2025)
- [55] Ren, A.Z., Lidard, J., Ankile, L.L., Simeonov, A., Agrawal, P., Majumdar, A., Burchfiel, B., Dai, H., Simchowitz, M.: Diffusion policy policy optimization. In: International Conference on Learning Representations (2025)
- [56] Zhang, T., Yu, C., Su, S., Wang, Y.: ReinFlow: Fine-tuning flow matching policy with online reinforcement learning. Advances in Neural Information Processing Systems (2025)
- [57] Owen, A.B.: Monte Carlo Theory, Methods and Examples. Online book, https://artowen.su.domains/mc/ (2013)
- [58] Frostig, R., Johnson, M.J., Leary, C.: Compiling machine learning programs via high-level tracing. In: SysML Conference 2018 (2019)
- [59] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)