arxiv: 2512.02581 · v3 · pith:33YXYH6Vnew · submitted 2025-12-02 · 💻 cs.LG

Training Diffusion Policies via Prior-Mapping Co-Evolution

classification 💻 cs.LG
keywords generative · policies · expressive · gorl · optimization · action · baseline · co-evolution
Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize (e.g., Gaussians) are often too simple to represent the multimodal action distributions required for complex control. Conversely, expressive generative policies -- such as diffusion and flow matching -- can be difficult to optimize in online RL due to intractable likelihoods and gradients propagating through long sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this, we introduce GoRL (Generative Online Reinforcement Learning), an algorithm-agnostic framework that trains expressive policies from scratch by confining policy optimization to a tractable latent space while delegating action synthesis to a conditional generative decoder. Viewed as prior-mapping co-evolution, each stage first improves a tractable latent prior through RL and then consolidates the resulting behavior into a more expressive prior-to-action mapping. This two-timescale schedule, anchored by fixed-prior decoder refinement, enables stable optimization while continuously expanding expressiveness. Empirically, GoRL consistently outperforms unimodal and generative baselines across diverse continuous-control tasks. Notably, GoRL achieves returns exceeding 870 on HopperStand, more than 3× the strongest baseline; on high-dimensional humanoid tasks, it further outperforms the strongest non-GoRL baseline by over an order of magnitude.
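
The alternating schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the one-dimensional reward, the linear decoder standing in for a diffusion or flow-matching model, the REINFORCE update, the least-squares refit, and all names (`TARGET`, `improve_latent`, `refine_decoder`, `train_sketch`) are assumptions made for illustration only. Each stage updates a Gaussian latent policy while the decoder is frozen, then refits the decoder so that samples from a fixed standard-normal prior reproduce the improved behavior.

```python
import numpy as np

TARGET = 3.0  # hypothetical optimum of the toy reward (not from the paper)


def reward(action):
    # Toy reward peaked at TARGET; stands in for an environment return.
    return -(action - TARGET) ** 2


def decode(z, w, b):
    # Linear stand-in for the conditional generative decoder (assumption).
    return w * z + b


def improve_latent(w, b, rng, steps=20, lr=0.05, batch=256):
    # Stage 1: score-function (REINFORCE) updates on the mean of a
    # unit-variance Gaussian latent policy. The decoder (w, b) is frozen.
    mu = 0.0
    for _ in range(steps):
        z = mu + rng.standard_normal(batch)
        r = reward(decode(z, w, b))
        advantage = r - r.mean()  # batch-mean baseline to reduce variance
        mu += lr * np.mean(advantage * (z - mu))
    return mu


def refine_decoder(mu, w, b, rng, n=1024):
    # Stage 2: refit the decoder so that fixed-prior samples eps ~ N(0, 1)
    # map to the actions the improved latent policy N(mu, 1) produces.
    eps = rng.standard_normal(n)
    actions = decode(mu + eps, w, b)
    design = np.stack([eps, np.ones(n)], axis=1)
    w_new, b_new = np.linalg.lstsq(design, actions, rcond=None)[0]
    return float(w_new), float(b_new)


def train_sketch(stages=5, seed=0):
    rng = np.random.default_rng(seed)
    w, b = 1.0, 0.0  # initial decoder
    for _ in range(stages):
        mu = improve_latent(w, b, rng)          # latent policy restarts at the prior
        w, b = refine_decoder(mu, w, b, rng)    # consolidate into the decoder
    return w, b


w, b = train_sketch()
print(f"decoder slope w={w:.3f}, offset b={b:.3f}")
```

With a linear decoder the refit amounts to folding the learned latent shift into the decoder offset, so `b` should drift toward `TARGET` over the stages while `w` stays fixed. A genuinely expressive decoder would replace the least-squares step with generative-model training, which this sketch does not attempt.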