Mean Flow Policy Optimization
Pith reviewed 2026-05-10 11:11 UTC · model grok-4.3
The pith
Mean Flow Policy Optimization uses few-step flow models to represent RL policies, matching diffusion performance while cutting training and inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Representing policies as MeanFlow models and optimizing them via soft policy iteration under the maximum entropy RL framework produces policies whose performance on standard continuous-control benchmarks equals or exceeds that of diffusion-based methods while substantially lowering both training and inference cost.
What carries the argument
MeanFlow models, a class of few-step flow-based generative models serving as the policy class, combined with maximum-entropy soft policy iteration adapted for action-likelihood evaluation and soft improvement.
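The soft policy iteration machinery can be anchored with a small sketch. The entropy-augmented Bellman target below follows the standard soft actor-critic form; `alpha`, `gamma`, and the toy values are illustrative assumptions, not the paper's settings.

```python
# A minimal sketch of the soft Bellman target used in maximum-entropy RL
# (soft actor-critic form). All constants here are illustrative, not the
# paper's hyperparameters.

def soft_bellman_target(reward, next_q, next_log_prob,
                        alpha=0.2, gamma=0.99, done=False):
    """y = r + gamma * (Q(s', a') - alpha * log pi(a'|s')) for non-terminal s'."""
    soft_value = next_q - alpha * next_log_prob  # entropy-regularized value
    return reward + (0.0 if done else gamma * soft_value)

# A more stochastic policy (lower log-probability) yields a higher target,
# which is the exploration bonus the maximum-entropy framework provides.
y_low_entropy = soft_bellman_target(1.0, 5.0, next_log_prob=-0.1)
y_high_entropy = soft_bellman_target(1.0, 5.0, next_log_prob=-2.0)
```

This is why tractable action log-likelihoods are load-bearing: the `next_log_prob` term must be evaluated under the generative policy itself.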
If this is right
- Expressive policy classes in RL need not incur the full iterative cost of diffusion if few-step flow alternatives exist.
- The maximum-entropy framework can be applied to generative-model families other than diffusion without losing its theoretical guarantees.
- Reducing the number of sampling steps in the policy directly translates into faster online RL training loops.
- Once action likelihoods and soft improvement are tractable, any few-step generative model becomes a candidate for entropy-regularized policy optimization.
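The third bullet can be made concrete: a MeanFlow-style model predicts an average velocity over an interval, so one step can stand in for many Euler steps. A toy sketch under a linear probability path, where the instantaneous velocity is constant (the function names and the path are illustrative assumptions, not the paper's construction):

```python
# Toy comparison: many-step Euler sampling with an instantaneous velocity
# vs. one step with the average velocity (the MeanFlow idea). The linear
# path z_t = (1 - t) * x0 + t * x1 is an illustrative assumption.

def instantaneous_velocity(z, t, x0, x1):
    # For the linear path, the velocity dz/dt = x1 - x0 is constant.
    return x1 - x0

def euler_sample(x1, x0, steps):
    """Integrate from t = 1 (noise) back to t = 0 (action) in `steps` steps."""
    z, t, dt = x1, 1.0, 1.0 / steps
    for _ in range(steps):
        z -= dt * instantaneous_velocity(z, t, x0, x1)
        t -= dt
    return z

def meanflow_sample(x1, x0):
    """One step with the average velocity over [0, 1]."""
    u = x1 - x0  # the average of a constant velocity is that velocity
    return x1 - 1.0 * u
```

For curved paths the two no longer coincide after one Euler step; closing that gap with a learned average-velocity model is what makes few-step sampling viable.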
Where Pith is reading between the lines
- The same few-step flow construction could be tried in offline RL or model-based settings where repeated policy evaluation is the dominant cost.
- If the efficiency advantage persists at scale, complex policies could be deployed on hardware with tighter latency budgets than current diffusion methods allow.
- Hybrid approaches that combine MeanFlow with existing acceleration tricks such as distillation or consistency models remain unexplored in the paper.
Load-bearing premise
The two MeanFlow-specific obstacles of action likelihood evaluation and soft policy improvement can be solved without introducing instabilities or bias that would undermine the maximum-entropy guarantees.
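One standard route to the likelihood half of this premise is the instantaneous change-of-variables formula for flows, with the divergence of the velocity field estimated by Hutchinson's trace trick (the route suggested by entries [32]-[34] in the reference graph below). Whether MFPO uses exactly this estimator is an assumption; the linear field in this sketch is a toy.

```python
import numpy as np

# Hutchinson trace estimation of div v = trace(J) for a toy linear
# velocity field v(z) = A z. In a flow policy, d/dt log p(z_t) = -div v,
# so an unbiased divergence estimate yields an unbiased log-likelihood
# integrand. The matrix A is an illustrative assumption.

rng = np.random.default_rng(0)
A = np.array([[0.5, 0.1],
              [0.0, 0.3]])  # Jacobian of v(z) = A z

def hutchinson_divergence(jacobian, n_probes=2000):
    """E[eps^T J eps] over Rademacher probes eps is an unbiased trace estimate."""
    d = jacobian.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_probes, d))
    return float(np.mean(np.einsum("ni,ij,nj->n", eps, jacobian, eps)))

estimate = hutchinson_divergence(A)
exact = float(np.trace(A))
```

The estimator is unbiased in expectation, but its Monte Carlo variance is exactly the kind of noise that could destabilize the entropy term, which is why the premise above is load-bearing.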
What would settle it
A set of runs on MuJoCo or DeepMind Control Suite in which MFPO either underperforms the diffusion baselines by a clear margin or shows no substantial reduction in training and inference wall-clock time would falsify the central claim.
Original abstract
Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/MFPolicy/MFPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mean Flow Policy Optimization (MFPO), which represents RL policies via MeanFlow (few-step flow-based) generative models and optimizes them under the maximum-entropy objective using soft policy iteration. It claims to resolve two MeanFlow-specific challenges—action likelihood evaluation and soft policy improvement—thereby achieving performance on MuJoCo and DeepMind Control Suite benchmarks that is comparable to or better than diffusion-based baselines while substantially lowering training and inference time. Code is released.
Significance. If the MeanFlow-specific implementations of likelihood evaluation and policy improvement are shown to be unbiased and to preserve the fixed-point guarantees of soft policy iteration, the approach would provide a practical efficiency improvement over diffusion policies without sacrificing the theoretical benefits of maximum-entropy RL. The public code release is a clear strength for reproducibility.
Major comments (2)
- [Method section (action likelihood evaluation)] The manuscript does not supply the explicit estimator or derivation for the action log-likelihood under the MeanFlow policy (referenced in the abstract and the method section). Without this, it is impossible to confirm that the entropy term remains unbiased, which is load-bearing for the claim that soft policy iteration converges to the true soft-optimal policy.
- [Method section (soft policy improvement)] No analysis or fixed-point argument is given for the soft policy improvement operator when applied to the few-step MeanFlow parameterization (abstract and method section). Any approximation in the probability path or velocity field could introduce bias into the KL penalty, undermining the theoretical justification for the reported benchmark gains.
Minor comments (1)
- [Experiments] The number of flow steps and the precise form of the velocity field used in the MeanFlow policy should be stated explicitly in the experimental setup for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the careful reading and valuable comments on the theoretical underpinnings of Mean Flow Policy Optimization. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and derivations.
Point-by-point responses
- Referee: [Method section (action likelihood evaluation)] The manuscript does not supply the explicit estimator or derivation for the action log-likelihood under the MeanFlow policy (referenced in the abstract and the method section). Without this, it is impossible to confirm that the entropy term remains unbiased, which is load-bearing for the claim that soft policy iteration converges to the true soft-optimal policy.
  Authors: We agree that an explicit derivation of the action log-likelihood estimator is required to rigorously establish unbiasedness of the entropy term. Section 3.2 describes the Monte Carlo estimation procedure based on the MeanFlow probability path, but the full mathematical steps were not expanded for brevity. In the revised manuscript we will add a dedicated appendix containing the complete derivation, showing that the estimator is unbiased for the few-step MeanFlow parameterization and therefore preserves the fixed-point properties of soft policy iteration under the maximum-entropy objective. Revision: yes.
- Referee: [Method section (soft policy improvement)] No analysis or fixed-point argument is given for the soft policy improvement operator when applied to the few-step MeanFlow parameterization (abstract and method section). Any approximation in the probability path or velocity field could introduce bias into the KL penalty, undermining the theoretical justification for the reported benchmark gains.
  Authors: We acknowledge that a formal fixed-point analysis of the soft policy improvement operator under the approximate MeanFlow parameterization is absent from the current manuscript. Section 3.3 outlines the practical adaptation that uses few-step sampling and an approximated KL divergence, but does not supply a contraction-mapping argument. In the revision we will include a new subsection providing a theoretical discussion: we will show that, under the assumption that the trained MeanFlow model converges to the target distribution (as enforced by the training loss), the bias in the KL penalty vanishes asymptotically and the operator retains the essential contraction property of standard soft policy iteration. Empirical support from the MuJoCo and DeepMind Control Suite results will be referenced to illustrate that any residual approximation error does not prevent convergence to high-performing policies. Revision: yes.
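The "approximated KL divergence" in this response can be made concrete with a plain Monte Carlo estimator, KL(p || q) ≈ mean over a ~ p of log p(a) - log q(a). A toy sketch with 1-D Gaussian policies, checked against the closed form (none of this is the paper's actual parameterization):

```python
import math
import random

# Monte Carlo KL estimate between two toy 1-D Gaussian "policies".
# KL(N(mu_p, s^2) || N(mu_q, s^2)) = (mu_p - mu_q)^2 / (2 s^2) in closed
# form, which lets us check the estimator on this toy case.

random.seed(0)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def mc_kl(mu_p, mu_q, sigma=1.0, n=20000):
    """Sample from p and average the log-density ratio log p(a) - log q(a)."""
    total = 0.0
    for _ in range(n):
        a = random.gauss(mu_p, sigma)
        total += log_normal_pdf(a, mu_p, sigma) - log_normal_pdf(a, mu_q, sigma)
    return total / n

closed_form = 0.5  # KL(N(0,1) || N(1,1)) = (0 - 1)^2 / 2
estimate = mc_kl(0.0, 1.0)
```

The estimator is unbiased when sampling from the exact policy; the referee's concern is precisely that sampling from an approximate few-step path can reintroduce bias into this average.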
Circularity Check
No significant circularity; derivation builds on external frameworks
Full rationale
The paper adapts the standard soft policy iteration algorithm from maximum-entropy RL to MeanFlow policies and states that it solves the two MeanFlow-specific challenges of likelihood evaluation and policy improvement. No equations or claims are presented that reduce the performance claims, the soft Q-function fixed point, or the reported benchmark results to quantities defined only by the authors' own fitted constants, self-referential definitions, or a chain of their prior unverified results. The experimental comparisons to diffusion baselines on MuJoCo and DeepMind Control Suite therefore constitute independent evidence rather than a tautology.
Reference graph
Works this paper leans on
- [1] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
- [2] Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015). PMLR
- [3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
- [4] Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, pp. 1861–1870 (2018). PMLR
- [5] Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: International Conference on Machine Learning, pp. 1587–1596 (2018). PMLR
- [6] Yang, L., Huang, Z., Lei, F., Zhong, Y., Yang, Y., Fang, C., Wen, S., Zhou, B., Lin, Z.: Policy representation via diffusion probability model for reinforcement learning. arXiv preprint arXiv:2305.13122 (2023)
- [7] Wang, Y., Wang, L., Jiang, Y., Zou, W., Liu, T., Song, X., Wang, W., Xiao, L., Wu, J., Duan, J., et al.: Diffusion actor-critic with entropy regulator. Advances in Neural Information Processing Systems 37, 54183–54204 (2024)
- [8] Ding, S., Hu, K., Zhang, Z., Ren, K., Zhang, W., Yu, J., Wang, J., Shi, Y.: Diffusion-based reinforcement learning via Q-weighted variational policy optimization. Advances in Neural Information Processing Systems 37, 53945–53968 (2024)
- [9] Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447 (2025)
- [10] Celik, O., Li, Z., Blessing, D., Li, G., Palenicek, D., Peters, J., Chalvatzaki, G., Neumann, G.: DIME: Diffusion-based maximum entropy reinforcement learning. In: International Conference on Machine Learning (2025)
- [11] Dong, X., Cheng, J., Zhang, X.S.: Maximum entropy reinforcement learning with diffusion policy. In: International Conference on Machine Learning (2025)
- [12] Ma, H., Chen, T., Wang, K., Li, N., Dai, B.: Efficient online reinforcement learning for diffusion policy. In: International Conference on Machine Learning (2025)
- [13] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning, pp. 2256–2265 (2015). PMLR
- [14] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)
- [15] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
- [16] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog 1(8), 1 (2024)
- [17] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
- [18] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32 (2019)
- [19] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (2021)
- [20] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
- [21] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: International Conference on Learning Representations (2023)
- [22] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: International Conference on Learning Representations (2022)
- [23] Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. In: European Conference on Computer Vision, pp. 87–103 (2024). Springer
- [24] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems 35, 5775–5787 (2022)
- [25] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 1–22 (2025)
- [26] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: International Conference on Machine Learning, pp. 32211–32252 (2023). PMLR
- [27] Song, Y., Dhariwal, P.: Improved techniques for training consistency models. In: International Conference on Learning Representations (2024)
- [28] Frans, K., Hafner, D., Levine, S., Abbeel, P.: One step diffusion via shortcut models. In: International Conference on Learning Representations (2025)
- [29] Ding, S., Hu, K., Zhong, S., Luo, H., Zhang, W., Wang, J., Wang, J., Shi, Y.: GenPO: Generative diffusion models meet on-policy reinforcement learning. Advances in Neural Information Processing Systems (2025)
- [30] Lv, L., Li, Y., Luo, Y., Sun, F., Kong, T., Xu, J., Ma, X.: Flow-based policy for online reinforcement learning. Advances in Neural Information Processing Systems (2025)
- [31] Psenka, M., Escontrela, A., Abbeel, P., Ma, Y.: Learning a diffusion model policy from rewards via Q-score matching. In: International Conference on Machine Learning, pp. 41163–41182 (2024). PMLR
- [32] Chen, R.T., Rubanova, Y., Bettencourt, J., Duvenaud, D.K.: Neural ordinary differential equations. Advances in Neural Information Processing Systems 31 (2018)
- [33] Skilling, J.: The eigenvalues of mega-dimensional matrices. In: Maximum Entropy and Bayesian Methods: Cambridge, England, 1988, pp. 455–466. Springer, Dordrecht (1989)
- [34] Hutchinson, M.F.: A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation 18(3), 1059–1076 (1989)
- [35] Kong, A.: A note on importance sampling using standardized weights. University of Chicago, Dept. of Statistics, Tech. Rep. 348, 14 (1992)
- [36] Metelli, A.M., Papini, M., Montali, N., Restelli, M.: Importance sampling techniques for policy optimization. Journal of Machine Learning Research 21(141), 1–75 (2020)
- [37] Bellemare, M.G., Dabney, W., Munos, R.: A distributional perspective on reinforcement learning. In: International Conference on Machine Learning, pp. 449–458 (2017). PMLR
- [38] Duan, J., Guan, Y., Li, S.E., Ren, Y., Sun, Q., Cheng, B.: Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors. IEEE Transactions on Neural Networks and Learning Systems 33(11), 6584–6598 (2021)
- [39] Mao, L., Xu, H., Zhan, X., Zhang, W., Zhang, A.: Diffusion-DICE: In-sample diffusion guidance for offline reinforcement learning. Advances in Neural Information Processing Systems 37, 98806–98834 (2024)
- [40] Espinosa-Dice, N., Zhang, Y., Chen, Y., Guo, B., Oertell, O., Swamy, G., Brantley, K., Sun, W.: Scaling offline RL via efficient and expressive shortcut models. Advances in Neural Information Processing Systems (2025)
- [41] Todorov, E., Erez, T., Tassa, Y.: MuJoCo: A physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033 (2012). IEEE
- [42] Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D.d.L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al.: DeepMind Control Suite. arXiv preprint arXiv:1801.00690 (2018)
- [43] Pearce, T., Rashid, T., Kanervisto, A., Bignell, D., Sun, M., Georgescu, R., Macua, S.V., Tan, S.Z., Momennejad, I., Hofmann, K., et al.: Imitating human behaviour with diffusion models. In: International Conference on Learning Representations (2023)
- [44] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025)
- [45] Chen, H., Lu, C., Ying, C., Su, H., Zhu, J.: Offline reinforcement learning via high-fidelity generative behavior modeling. In: International Conference on Learning Representations (2023)
- [46] Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J.G., Levine, S.: IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573 (2023)
- [47] Kang, B., Ma, X., Du, C., Pang, T., Yan, S.: Efficient diffusion policies for offline reinforcement learning. Advances in Neural Information Processing Systems 36, 67195–67212 (2023)
- [48] Zhang, S., Zhang, W., Gu, Q.: Energy-weighted flow matching for offline reinforcement learning. In: International Conference on Learning Representations (2025)
- [49] Wang, Z., Hunt, J.J., Zhou, M.: Diffusion policies as an expressive policy class for offline reinforcement learning. In: International Conference on Learning Representations (2023)
- [50] Chen, Y., Li, H., Zhao, D.: Boosting continuous control with consistency policy. In: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 335–344 (2024)
- [51] Ding, Z., Jin, C.: Consistency models as a rich and efficient policy class for reinforcement learning. In: International Conference on Learning Representations (2024)
- [52] Park, S., Li, Q., Levine, S.: Flow Q-learning. In: International Conference on Machine Learning (2025)
- [53] Lu, C., Chen, H., Chen, J., Su, H., Li, C., Zhu, J.: Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In: International Conference on Machine Learning, pp. 22825–22855 (2023). PMLR
- [54] Fang, L., Liu, R., Zhang, J., Wang, W., Jing, B.: Diffusion actor-critic: Formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning. In: International Conference on Learning Representations (2025)
- [55] Ren, A.Z., Lidard, J., Ankile, L.L., Simeonov, A., Agrawal, P., Majumdar, A., Burchfiel, B., Dai, H., Simchowitz, M.: Diffusion policy policy optimization. In: International Conference on Learning Representations (2025)
- [56] Zhang, T., Yu, C., Su, S., Wang, Y.: ReinFlow: Fine-tuning flow matching policy with online reinforcement learning. Advances in Neural Information Processing Systems (2025)
- [57] Owen, A.B.: Monte Carlo Theory, Methods and Examples. Online book, https://artowen.su.domains/mc/ (2013)
- [58] Frostig, R., Johnson, M.J., Leary, C.: Compiling machine learning programs via high-level tracing. In: SysML Conference 2018 (2019)
- [59] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)