Recovering Hidden Reward in Diffusion-Based Policies

EnergyFlow shows that denoising score matching on diffusion policies recovers the gradient of the expert's soft Q-function under maximum-entropy optimality, enabling non-adversarial reward extraction and improved policy generalization.
Proof. From Theorem 3.3, the learned energy satisfies
$$E_\phi(a, s) = -\frac{Q^*(s, a)}{\alpha} + c(s), \tag{28}$$
where $c(s)$ is a state-dependent constant arising from integration.
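To make the step behind Eq. (28) explicit (a standard maximum-entropy identity; the soft value $V^*(s)$ is assumed notation for the log-normalizer and is not defined in this excerpt): under max-ent optimality the expert policy is Boltzmann in the soft Q-function,
$$\pi^*(a \mid s) = \exp\!\Big(\frac{Q^*(s,a) - V^*(s)}{\alpha}\Big), \qquad \nabla_a \log \pi^*(a \mid s) = \frac{1}{\alpha}\,\nabla_a Q^*(s,a).$$
Denoising score matching fits exactly this action score, and writing $\pi^* \propto \exp(-E_\phi)$ gives Eq. (28); since $\nabla_a c(s) = 0$, the learned energy determines $Q^*$ only up to the state-dependent shift $\alpha\, c(s)$.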
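The abstract's "non-adversarial reward extraction" follows directly from Eq. (28): within a fixed state, $c(s)$ cancels in differences, so Q-value comparisons require no discriminator. Below is a minimal sketch of that use, assuming only Eq. (28) and a trained action-score model; ScoreNet, q_gradient, q_difference, the shapes, and the value of alpha are illustrative placeholders, not EnergyFlow's actual interface.

import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Stand-in for a trained model of the action score grad_a log pi*(a|s)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def q_gradient(score_net, state, action, alpha):
    # Eq. (28) implies grad_a Q*(s, a) = alpha * grad_a log pi*(a|s).
    return alpha * score_net(state, action)

def q_difference(score_net, state, a_from, a_to, alpha, n_steps=64):
    # Q*(s, a_to) - Q*(s, a_from) as a line integral of the Q-gradient;
    # the state-dependent constant c(s) cancels, so no adversarial
    # training is needed to compare actions within a state.
    ts = torch.linspace(0.0, 1.0, n_steps + 1).view(-1, 1)
    path = a_from + ts * (a_to - a_from)           # straight-line path in action space
    states = state.expand(n_steps + 1, -1)
    grads = q_gradient(score_net, states, path, alpha)
    integrand = (grads * (a_to - a_from)).sum(-1)  # <grad_a Q, da/dt>
    return torch.trapezoid(integrand, ts.squeeze(-1))

if __name__ == "__main__":
    torch.manual_seed(0)
    net = ScoreNet(state_dim=3, action_dim=2)      # untrained placeholder weights
    s = torch.randn(1, 3)
    a0, a1 = torch.zeros(1, 2), torch.ones(1, 2)
    print(q_difference(net, s, a0, a1, alpha=0.1).item())

A straight-line integration path suffices here because $\nabla_a Q^*(s, \cdot)$ is a gradient field, so the integral is path-independent.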