Diffusion-Augmented Markov Decision Processes for Maximum Entropy Reinforcement Learning

Kaustubh Patil; Sebastian Sanokowski

arxiv: 2512.02019 · v3 · pith:F75ZDXRVnew · submitted 2025-12-01 · 💻 cs.LG · cs.AI · stat.ML


classification 💻 cs.LG · cs.AI · stat.ML

keywords policy · diffusion · da-mdp · distributions · entropy · optimization · processes · benchmarks



Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning (ME-RL) to diffusion processes, enabling sampling from the optimal policy trajectory distribution. By minimizing a tractable upper bound on the reverse KL divergence between the diffusion policy and the optimal policy trajectory distributions, we derive a modified surrogate objective and introduce Diffusion-Augmented Markov Decision Processes (DA-MDPs). DA-MDPs allow for seamless integration of diffusion policies into any ME-RL method with minimal modifications. We demonstrate its effectiveness by adapting Proximal Policy Optimization (PPO), Wasserstein Policy Optimization (WPO), and Relative Entropy Pathwise Policy Optimization (REPPO) into their diffusion-based variants: DA-MDP: PPO, DA-MDP: WPO, and DA-MDP: REPPO. Empirical results on standard continuous-control benchmarks show that our approach matches or outperforms baseline methods, while experiments on multimodal benchmarks confirm its ability to model multimodal action distributions.
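
The abstract's central step, minimizing a bound on a reverse KL divergence between trajectory distributions, can be illustrated with the standard maximum entropy RL identities. The sketch below is a generic textbook formulation and not the paper's derivation. The finite undiscounted horizon, the temperature \alpha, the denoising index k = K, ..., 0, and the reference process q are all assumptions introduced here for illustration.

```latex
% Standard ME-RL objective (assumed finite horizon T, temperature \alpha > 0):
J(\pi) = \mathbb{E}_{\tau \sim p_\pi}\left[ \sum_{t=0}^{T}
    r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

% Policy and (assumed) optimal trajectory distributions:
p_\pi(\tau) = p(s_0) \prod_{t=0}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t),
\qquad
p^{*}(\tau) = \frac{1}{Z}\, p(s_0) \prod_{t=0}^{T} p(s_{t+1} \mid s_t, a_t)\,
    \exp\!\big( r(s_t, a_t) / \alpha \big)

% The dynamics cancel in the log-ratio, so the reverse KL is the negated
% objective up to a constant:
D_{\mathrm{KL}}\big( p_\pi \,\|\, p^{*} \big) = -\frac{1}{\alpha} J(\pi) + \log Z

% For a diffusion policy, the action a = a^{0} is the end of a denoising chain
% a^{K}, \dots, a^{0}, and the marginal \pi(a^{0} \mid s) is intractable.
% Marginalization never increases KL, which gives a tractable upper bound
% for any reference process q(a^{0:K} \mid s) with marginal q(a^{0} \mid s):
D_{\mathrm{KL}}\big( \pi(a^{0} \mid s) \,\|\, q(a^{0} \mid s) \big)
  \le D_{\mathrm{KL}}\big( \pi(a^{0:K} \mid s) \,\|\, q(a^{0:K} \mid s) \big)
```

Under these assumptions, minimizing the reverse KL is equivalent to maximizing the entropy-regularized return. The last inequality shows why a bound over the full denoising chain can stand in for the intractable marginal entropy of the diffusion policy. How the paper turns such a bound into its surrogate objective and the DA-MDP construction is not stated in the abstract.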



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Guided Discovery of New Behaviors using Diffusion Policies
cs.RO · 2026-06 · unverdicted · novelty 6.0

A framework combining Feynman-Kac correctors with a guiding potential mines and repairs novel trajectories to enable diffusion policies to discover diverse executable behaviors in robotic manipulation.