pith. machine review for the scientific record.

arxiv: 2604.09035 · v1 · submitted 2026-04-10 · 💻 cs.AI · cs.LG

Recognition: unknown

Advantage-Guided Diffusion for Model-Based Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords advantage-guided diffusion · model-based reinforcement learning · diffusion world models · policy improvement · advantage estimates · Sigmoid Advantage Guidance · MuJoCo control tasks

The pith

Advantage estimates guide diffusion models to sample higher-value trajectories and improve policies in model-based RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that steering the reverse diffusion process in world models with advantage estimates concentrates generated trajectories on those expected to produce higher long-term returns. This addresses the short-horizon myopia of reward-based guides by injecting long-horizon value information through advantages rather than relying on immediate rewards or policy actions alone. Under standard assumptions, the guidance enables reweighted sampling in which trajectory weights increase with state-action advantage, implying measurable policy improvement. The method integrates directly into existing diffusion architectures without changing the training objective and yields better sample efficiency and returns on continuous control tasks.
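One way to write down the mechanism the pith describes, in our notation rather than the paper's:

$$\tilde p_\phi(\tau) \;=\; \frac{1}{Z}\; p_\phi(\tau)\, \prod_{t=0}^{H-1} g\!\big(A^{\pi}(s_t, a_t)\big),$$

where $p_\phi$ is the unguided diffusion model over length-$H$ trajectory segments, $g$ is an increasing weight function, and $Z$ normalizes. Because $g$ increases in the advantage, sampling mass shifts toward segments the critic expects to outperform the current policy beyond the generated window.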

Core claim

We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL) and develop Sigmoid Advantage Guidance (SAG) and Exponential Advantage Guidance (EAG). We prove that guiding a diffusion model through SAG or EAG permits reweighted sampling of trajectories with weights that increase in state-action advantage, implying policy improvement under standard assumptions. We further show that trajectories generated from AGD-MBRL follow an improved policy with higher value than those from an unguided diffusion model. AGD integrates with PolyGRAD-style models by guiding only state components while leaving actions policy-conditioned and requires no change to the diffusion training objective.
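The abstract names the guides but defers their formulas. Read literally, the names suggest weightings of roughly the following shape; this is our plausible reconstruction, not the paper's definition, and the temperatures $\alpha, \beta > 0$ are our own notation:

$$w_{\mathrm{SAG}}(s,a) \;\propto\; \sigma\!\big(\alpha\, A^{\pi}(s,a)\big), \qquad w_{\mathrm{EAG}}(s,a) \;\propto\; \exp\!\big(A^{\pi}(s,a)/\beta\big).$$

The exponential form echoes the advantage-weighted updates familiar from AWR-style methods; a sigmoid bounds the weight, trading guidance strength for robustness to advantage outliers.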

What carries the argument

Advantage-Guided Diffusion (AGD) via SAG or EAG, which steers the reverse diffusion sampling toward trajectories with higher state-action advantages while keeping action generation policy-conditioned.
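A minimal sketch of what such a step could look like in code, assuming a classifier-guidance-style mean shift (in the sense of Dhariwal & Nichol) and hypothetical `denoiser` and `advantage` callables; this is our reconstruction, not the paper's implementation:

```python
# Sketch of one advantage-guided reverse-diffusion step. Guidance perturbs only
# the STATE block of the noisy trajectory; action dimensions stay conditioned
# on the current policy, mirroring the PolyGRAD-style split described above.
import torch

def guided_reverse_step(denoiser, advantage, traj_t, t, state_dim, scale=1.0):
    """traj_t: (B, H, state_dim + action_dim) noisy trajectory at step t."""
    mean, sigma = denoiser(traj_t, t)               # unguided posterior mean / std

    traj = traj_t.detach().requires_grad_(True)
    states, actions = traj[..., :state_dim], traj[..., state_dim:]
    adv = advantage(states, actions).sum()          # scalar guidance objective
    grad = torch.autograd.grad(adv, traj)[0]
    grad[..., state_dim:] = 0.0                     # zero out action components

    guided_mean = mean + scale * sigma ** 2 * grad  # classifier-guidance shift
    return guided_mean + sigma * torch.randn_like(mean)
```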

If this is right

  • Trajectories sampled under SAG or EAG follow policies with strictly higher value than unguided diffusion trajectories.
  • The reweighted sampling concentrates probability mass on actions whose advantages are positive, directly supporting policy improvement (a toy numerical illustration follows this list).
  • AGD-MBRL achieves higher sample efficiency and final returns than PolyGRAD, reward-guided diffusion, and model-free methods on MuJoCo tasks.
  • No modification to the diffusion training objective is needed, so existing models can adopt the guidance at inference time.
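A toy numerical illustration of the reweighting bullet above, entirely ours: exponential advantage reweighting pushes sampling mass onto positive-advantage candidates.

```python
# Self-normalized exponential reweighting over stand-in advantage estimates.
import numpy as np

rng = np.random.default_rng(0)
advantages = rng.normal(0.0, 1.0, size=10_000)   # stand-in critic outputs
weights = np.exp(advantages / 0.5)               # EAG-style, temperature 0.5
weights /= weights.sum()                         # self-normalize

resampled = rng.choice(advantages, size=10_000, p=weights)
print(f"mean advantage before reweighting: {advantages.mean():+.3f}")   # ~0
print(f"mean advantage after reweighting : {resampled.mean():+.3f}")    # ~+2
print(f"weight mass on A > 0             : {weights[advantages > 0].sum():.3f}")
```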

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same advantage-steering idea could be tested on other generative world models such as flow-matching or autoregressive transformers to check whether the improvement guarantee generalizes.
  • In sparse-reward or long-horizon settings the method might allow shorter diffusion windows without loss of planning quality.
  • If advantage estimates contain systematic bias the reweighting may concentrate on locally attractive but globally suboptimal trajectories.

Load-bearing premise

Advantage estimates computed from the current policy are accurate and the diffusion model is trained well enough for the guidance to shift sampling toward genuinely higher-value trajectories.
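A quick numerical check of how this premise can fail, again ours rather than the paper's: a critic that systematically overestimates advantages in low-value regions drags EAG-style reweighting toward genuinely poor trajectories.

```python
# Systematic (not merely noisy) advantage bias misdirects the reweighting.
import numpy as np

rng = np.random.default_rng(1)
true_adv = rng.normal(0.0, 1.0, size=100_000)
est_adv = true_adv.copy()
est_adv[true_adv < -1.0] += 3.0        # overestimation where true value is low

def reweighted_true_mean(scores, temperature=0.5):
    w = np.exp(scores / temperature)
    return float((w / w.sum() * true_adv).sum())

print(f"true advantage under exact weights : {reweighted_true_mean(true_adv):+.2f}")  # ~ +2.0
print(f"true advantage under biased weights: {reweighted_true_mean(est_adv):+.2f}")   # markedly lower
```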

What would settle it

An experiment showing that AGD-generated trajectories produce no increase in average return or policy value compared to unguided diffusion when advantage estimates are held fixed and accurate.

Figures

Figures reproduced from arXiv: 2604.09035 by Alexandre Proutiere, Arvid Eriksson, Daniele Foffano, David Broman, Karl H. Johansson.

Figure 1. MDP illustrating the negative effects of short-sighted … [image: figures/full_fig_p003_1.png]
Figure 2. Training curves for the MuJoCo environments HalfCheetah, Hopper, Walker, and Reacher; shaded areas indicate the … [image: figures/full_fig_p007_2.png]
Original abstract

Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Advantage-Guided Diffusion for Model-Based Reinforcement Learning (AGD-MBRL), which steers the reverse process of a diffusion world model using the agent's advantage estimates via two new guides (Sigmoid Advantage Guidance (SAG) and Exponential Advantage Guidance (EAG)). It claims to prove that this enables reweighted trajectory sampling with weights increasing in state-action advantage A(s,a), implying policy improvement under standard assumptions, and that the generated trajectories follow a strictly improved policy relative to an unguided diffusion model. The method integrates with PolyGRAD-style architectures by guiding only state components while leaving actions policy-conditioned, requires no change to the diffusion training objective, and reports improved sample efficiency and returns on MuJoCo tasks (HalfCheetah, Hopper, Walker2D, Reacher) over PolyGRAD, reward-guided baselines, and model-free methods like PPO/TRPO.

Significance. If the central theoretical claims hold, the work would offer a principled mechanism to inject long-horizon value information into diffusion-based world models, directly addressing short-horizon myopia without altering the training loss or architecture. The reported empirical gains (up to 2x in some cases) on standard continuous-control benchmarks would indicate practical utility for MBRL. Strengths include the seamless PolyGRAD compatibility and the focus on advantage rather than raw rewards; however, the dependence on self-generated advantage estimates introduces a potential circularity that must be resolved for the improvement guarantee to be robust.

major comments (3)
  1. [Abstract and theoretical proofs section] Abstract and the section containing the policy-improvement proofs: The claim that SAG/EAG guidance performs reweighted sampling of trajectories with weights increasing in state-action advantage A(s,a) (thereby implying policy improvement) is load-bearing. Yet the architecture applies guidance exclusively to state components while action generation remains fully conditioned on the unguided current policy. Because A(s,a) is a joint function of state and action, reweighting the marginal state trajectory does not automatically reweight the joint (s,a) measure; the resulting distribution therefore need not correspond to sampling from a policy with strictly higher value, even under standard MDP assumptions. A formal reduction showing how the conditional action sampling preserves the reweighting guarantee is required (see the one-line factorization at the end of this report).
  2. [Theoretical proofs section] The section stating the proofs and assumptions: The proofs are asserted to hold 'under standard assumptions' (accurate advantage estimates, well-trained diffusion model, standard MDP properties), but the manuscript does not explicitly list or verify these assumptions, nor does it provide the full derivations or error analysis. Without this, the support for the central policy-improvement claim cannot be verified, especially given the state-only guidance architecture.
  3. [Experimental results section] Experimental results section (MuJoCo evaluations): Performance improvements over PolyGRAD and model-free baselines are reported without visible error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the claimed gains (including 2x margins) are reliable or could be explained by variance in the advantage estimates or diffusion sampling.
minor comments (2)
  1. [Method section] The definitions of SAG and EAG (sigmoid and exponential forms) should be presented with explicit mathematical formulas in the main text rather than deferred to appendices, to improve readability of the guidance mechanism.
  2. [Notation and method] Notation for the guided reverse process and the reweighting weights could be unified across the theoretical and experimental sections to avoid ambiguity when comparing guided vs. unguided trajectories.
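Major comment 1 can be compressed into one line of algebra (our notation, not the paper's). With state-only guidance the generated joint factorizes as

$$\tilde p(s, a) \;=\; \tilde p(s)\, \pi(a \mid s) \;=\; \frac{w(s)}{Z}\, p(s)\, \pi(a \mid s),$$

so the reweighting factor $w(s)$ can depend on the action only through whatever functional of $A^{\pi}(s,\cdot)$ is folded into the state weight; monotonicity in the joint advantage $A^{\pi}(s,a)$ does not follow for free, which is precisely the gap the rebuttal's Lemma 3.2 is meant to close.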

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications on the theoretical claims and indicating the revisions incorporated into the manuscript.

Point-by-point responses
  1. Referee: [Abstract and theoretical proofs section] Abstract and the section containing the policy-improvement proofs: The claim that SAG/EAG guidance performs reweighted sampling of trajectories with weights increasing in state-action advantage A(s,a) (thereby implying policy improvement) is load-bearing. Yet the architecture applies guidance exclusively to state components while action generation remains fully conditioned on the unguided current policy. Because A(s,a) is a joint function of state and action, reweighting the marginal state trajectory does not automatically reweight the joint (s,a) measure; the resulting distribution therefore need not correspond to sampling from a policy with strictly higher value, even under standard MDP assumptions. A formal reduction showing how the conditional action sampling preserves the reweighting guarantee is required.

    Authors: We appreciate the referee's precise identification of the joint versus marginal distinction. Although guidance is applied only to states, actions are sampled conditionally from the fixed policy given those states. This structure induces an effective reweighting on the joint (s,a) measure because the guided state marginal is multiplied by the policy's conditional action probabilities. We have added a formal lemma (Lemma 3.2 in the revised theoretical section) that derives the joint reweighting factor explicitly and shows it remains monotonic in A(s,a) under the policy, thereby preserving the policy-improvement guarantee. The proof is included in the main text with a short sketch and full derivation moved to the appendix. revision: yes

  2. Referee: [Theoretical proofs section] The section stating the proofs and assumptions: The proofs are asserted to hold 'under standard assumptions' (accurate advantage estimates, well-trained diffusion model, standard MDP properties), but the manuscript does not explicitly list or verify these assumptions, nor does it provide the full derivations or error analysis. Without this, the support for the central policy-improvement claim cannot be verified, especially given the state-only guidance architecture.

    Authors: We agree that explicit enumeration of assumptions improves verifiability. The revised manuscript now contains a dedicated 'Assumptions' subsection (Section 3.1) that lists all required conditions, including bounded advantage estimation error, sufficient diffusion model capacity, and standard MDP properties (finite horizon, bounded rewards). Full derivations of the reweighting and policy-improvement results have been moved to Appendix B, and we have added a brief error-propagation analysis showing that small advantage estimation errors lead to correspondingly bounded degradation in the improvement guarantee. revision: yes

  3. Referee: [Experimental results section] Experimental results section (MuJoCo evaluations): Performance improvements over PolyGRAD and model-free baselines are reported without visible error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the claimed gains (including 2x margins) are reliable or could be explained by variance in the advantage estimates or diffusion sampling.

    Authors: The referee correctly notes the missing statistical details. We have revised all result figures and tables to display error bars corresponding to standard error across 5 independent random seeds. The experimental section now explicitly states the seed count and includes paired t-test p-values for all reported comparisons against PolyGRAD, reward-guided baselines, and model-free methods, confirming statistical significance of the observed improvements. revision: yes
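As a sketch of the statistical protocol the rebuttal describes, a paired t-test over per-seed final returns. The numbers below are illustrative placeholders, not the paper's data.

```python
# Paired t-test across seeds (illustrative values only).
import numpy as np
from scipy import stats

agd_returns      = np.array([5210.0, 4980.0, 5105.0, 5340.0, 5072.0])  # 5 seeds
polygrad_returns = np.array([4410.0, 4620.0, 4388.0, 4705.0, 4512.0])

diff = agd_returns - polygrad_returns
t_stat, p_value = stats.ttest_rel(agd_returns, polygrad_returns)
sem = diff.std(ddof=1) / np.sqrt(len(diff))
print(f"mean gap = {diff.mean():.0f} ± {sem:.0f} (SEM), "
      f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```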

Circularity Check

0 steps flagged

No significant circularity in the claimed proof of policy improvement

full rationale

The paper presents a proof that SAG/EAG guidance enables reweighted sampling with weights increasing in state-action advantage, implying policy improvement under standard assumptions, plus a separate claim that generated trajectories follow a higher-value policy than unguided diffusion. These are theoretical statements that rely on MDP properties and accurate advantage estimates, which is standard in RL and does not reduce the result to a tautology, a fitted parameter, or a self-citation chain by construction. The architectural note that only states are guided while actions stay policy-conditioned is presented as an integration detail that leaves the training objective unchanged, and it introduces no self-definitional or load-bearing circularity visible in the abstract or the stated claims. The derivation chain remains self-contained, and the empirical claims are checked against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on existing diffusion-model and RL frameworks; the new elements are the guidance functions and the claimed proofs. No new physical entities or large numbers of free parameters are introduced in the abstract.

axioms (1)
  • Domain assumption: standard assumptions for policy improvement in RL (accurate advantage estimates, MDP properties). Invoked when claiming that reweighted sampling implies policy improvement.

pith-pipeline@v0.9.0 · 5600 in / 1327 out tokens · 49615 ms · 2026-05-10T17:16:34.889165+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 32 canonical work pages · 10 internal anchors

  1. A. Mirhoseini, A. Goldie, M. Yazgan, J. W. Jiang, E. Songhori, S. Wang, Y.-J. Lee, E. Johnson, O. Pathak, A. Nazi, et al., "A graph placement methodology for fast chip design," Nature, vol. 594, no. 7862, pp. 207–212, 2021.
  2. O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al., "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350–354, 2019.
  3. D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815, 2017.
  4. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
  5. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  6. T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel, "Model-ensemble trust-region policy optimization," arXiv preprint arXiv:1802.10592, 2018.
  7. V. Micheli, E. Alonso, and F. Fleuret, "Transformers are sample-efficient world models," arXiv preprint arXiv:2209.00588, 2022.
  8. J. Robine, M. Höftmann, T. Uelwer, and S. Harmeling, "Transformer-based world models are happy with 100k interactions," arXiv preprint arXiv:2303.07109, 2023.
  9. I. Schubert, J. Zhang, J. Bruce, S. Bechtle, E. Parisotto, M. Riedmiller, J. T. Springenberg, A. Byravan, L. Hasenclever, and N. Heess, "A generalist dynamics model for control," arXiv preprint arXiv:2305.10912, 2023.
  10. D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, "Mastering diverse domains through world models," arXiv preprint arXiv:2301.04104, 2023.
  11. M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine, "Planning with diffusion for flexible behavior synthesis," arXiv preprint arXiv:2205.09991, 2022.
  12. M. Rigter, J. Yamada, and I. Posner, "World models via policy-guided trajectory diffusion," arXiv preprint arXiv:2312.08533, 2023.
  13. M. T. Jackson, M. T. Matthews, C. Lu, B. Ellis, S. Whiteson, and J. Foerster, "Policy-guided diffusion," arXiv preprint arXiv:2404.06356, 2024.
  14. D. Foffano, A. Russo, and A. Proutiere, "Adversarial diffusion for robust reinforcement learning," arXiv preprint arXiv:2509.23846, 2025.
  15. A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566, IEEE, 2018.
  16. K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep reinforcement learning in a handful of trials using probabilistic dynamics models," Advances in Neural Information Processing Systems, vol. 31, 2018.
  17. L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al., "Model-based reinforcement learning for Atari," arXiv preprint arXiv:1903.00374, 2019.
  18. T. Jafferjee, E. Imani, E. Talvitie, M. White, and M. Bowling, "Hallucinating value: A pitfall of dyna-style planning with imperfect environment models," arXiv preprint arXiv:2006.04363, 2020.
  19. E. van der Pol, T. Kipf, F. A. Oliehoek, and M. Welling, "Plannable approximations to MDP homomorphisms: Equivariance under actions," arXiv preprint arXiv:2002.11963, 2020.
  20. D. P. Kingma, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
  21. A. Vaswani, "Attention is all you need," Advances in Neural Information Processing Systems, 2017.
  22. D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba, "Mastering Atari with discrete world models," arXiv preprint arXiv:2010.02193, 2020.
  23. D. Ha and J. Schmidhuber, "Recurrent world models facilitate policy evolution," Advances in Neural Information Processing Systems, vol. 31, 2018.
  24. C. Xiao, Y. Wu, C. Ma, D. Schuurmans, and M. Müller, "Learning to combat compounding-error in model-based reinforcement learning," arXiv preprint arXiv:1912.11206, 2019.
  25. K. Asadi, D. Misra, S. Kim, and M. L. Littman, "Combating the compounding-error problem with a multi-step model," arXiv preprint arXiv:1905.13320, 2019.
  26. J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in International Conference on Machine Learning, pp. 2256–2265, PMLR, 2015.
  27. J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  28. P. Dhariwal and A. Nichol, "Diffusion models beat GANs on image synthesis," Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.
  29. J. Ho and T. Salimans, "Classifier-free diffusion guidance," arXiv preprint arXiv:2207.12598, 2022.
  30. A. Ajay, Y. Du, A. Gupta, J. Tenenbaum, T. Jaakkola, and P. Agrawal, "Is conditional generative modeling all you need for decision-making?," arXiv preprint arXiv:2211.15657, 2022.
  31. Z. Zhu, M. Liu, L. Mao, B. Kang, M. Xu, Y. Yu, S. Ermon, and W. Zhang, "MADiff: Offline multi-agent learning with diffusion models," arXiv preprint arXiv:2305.17330, 2023.
  32. C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, "Diffusion policy: Visuomotor policy learning via action diffusion," The International Journal of Robotics Research, p. 02783649241273668, 2023.
  33. X. Li, V. Belagali, J. Shang, and M. S. Ryoo, "Crossway diffusion: Improving diffusion-based visuomotor policy via self-supervised learning," in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 16841–16849, IEEE, 2024.
  34. Z. Liang, Y. Mu, M. Ding, F. Ni, M. Tomizuka, and P. Luo, "AdaptDiffuser: Diffusion models as adaptive self-evolving planners," arXiv preprint arXiv:2302.01877, 2023.
  35. T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, "Benchmarking model-based reinforcement learning," arXiv preprint arXiv:1907.02057, 2019.
  36. P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine, "IDQL: Implicit Q-learning as an actor-critic method with diffusion policies," arXiv preprint arXiv:2304.10573, 2023.
  37. B. Mazoure, W. Talbott, M. A. Bautista, D. Hjelm, A. Toshev, and J. Susskind, "Value function estimation using conditional diffusion models for control," arXiv preprint arXiv:2306.07290, 2023.
  38. R. S. Sutton, "Dyna, an integrated architecture for learning, planning, and reacting," ACM SIGART Bulletin, vol. 2, no. 4, pp. 160–163, 1991.
  39. J. Schulman, "Trust region policy optimization," arXiv preprint arXiv:1502.05477, 2015.
  40. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
  41. Y. Zheng, J. Li, D. Yu, Y. Yang, S. E. Li, X. Zhan, and J. Liu, "Safe offline reinforcement learning with feasibility-guided diffusion model," arXiv preprint arXiv:2401.10700, 2024.
  42. D. Shribak, C.-X. Gao, Y. Li, C. Xiao, and B. Dai, "Diffusion spectral representation for reinforcement learning," Advances in Neural Information Processing Systems, vol. 37, pp. 110028–110056, 2024.
  43. D. Ki, J. Oh, S.-W. Shim, and B.-J. Lee, "Prior-guided diffusion planning for offline reinforcement learning," arXiv preprint arXiv:2505.10881, 2025.
  44. H. Ma, T. Chen, K. Wang, N. Li, and B. Dai, "Efficient online reinforcement learning for diffusion policy," arXiv preprint arXiv:2502.00361, 2025.
  45. S. Levine, "Reinforcement learning and control as probabilistic inference: Tutorial and review," arXiv preprint arXiv:1805.00909, 2018.
  46. R. S. Sutton, A. G. Barto, et al., Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
  47. V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, pp. 1928–1937, PMLR, 2016.
  48. A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, "RePaint: Inpainting using denoising diffusion probabilistic models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471, 2022.
  49. A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, "Stable-Baselines3: Reliable reinforcement learning implementations," Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
  50. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," arXiv preprint arXiv:2112.10752, 2021.
  51. Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, "Flow matching for generative modeling," arXiv preprint arXiv:2210.02747, 2022.