Guided Discovery of New Behaviors using Diffusion Policies

Dian Yu; Majid Khadiv; Sebastian Sanokowski

arxiv: 2606.08743 · v1 · pith:BCBJPQJAnew · submitted 2026-06-07 · 💻 cs.RO

Pith reviewed 2026-06-27 18:11 UTC · model grok-4.3

classification 💻 cs.RO

keywords diffusion policies · behavior discovery · robot manipulation · guiding potential · Feynman-Kac correctors · trajectory optimization · multimodal action distributions

The pith

A novel guiding potential with Feynman-Kac correctors steers diffusion policies toward rare but valid trajectories that can be refined and learned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When only a few demonstrations are available, diffusion policies for robot control mostly reproduce the dominant behaviors and miss other valid but uncommon action sequences. This paper establishes that a guiding potential can direct the sampling process to those underrepresented modes while the trajectories stay feasible. The selected samples are then repaired through sampling-based optimization and added to the training set so the policy can generate them. A sympathetic reader would care because the result is a repeatable process for expanding the range of executable behaviors instead of depending on luck or external reinforcement learning loops.

Core claim

The framework combines Feynman-Kac correctors with a novel guiding potential that systematically guides diffusion policy samples towards promising yet underrepresented samples. These trajectories are refined using sampling-based trajectory optimization and reincorporated into the training set to retrain the diffusion policy. Our method effectively mines and repairs novel trajectories, enabling the systematic discovery of diverse and executable behaviors across a range of manipulation environments.
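The train, mine, repair, retrain cycle described in this claim can be sketched as a small higher-order function. This is a structural sketch only: `train`, `mine`, and `repair` are hypothetical callables standing in for diffusion-policy training, guided sampling of rare trajectories, and sampling-based trajectory optimization. None of their internals are specified in this summary, and the convention that `repair` returns `None` on failure is an assumption made for illustration.

```python
def discovery_loop(dataset, train, mine, repair, rounds=1):
    """Run `rounds` of train -> mine -> repair -> augment.

    All three callables are hypothetical placeholders:
      train(data)   -> policy
      mine(policy)  -> list of candidate trajectories
      repair(traj)  -> repaired trajectory, or None if repair fails
    """
    data = list(dataset)          # copy so the caller's dataset is untouched
    policy = train(data)
    for _ in range(rounds):
        candidates = mine(policy)                      # guided sampling of rare candidates
        repaired = [repair(c) for c in candidates]     # make candidates executable
        accepted = [r for r in repaired if r is not None]
        data.extend(accepted)                          # fold repaired samples back in
        policy = train(data)                           # retrain on the augmented set
    return policy, data
```

With stub callables the loop can be exercised without any robotics stack, which makes it easy to check that rejected candidates never reach the training set.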

What carries the argument

The novel guiding potential, which identifies and steers diffusion samples toward underrepresented yet feasible trajectories for later refinement.
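Feynman-Kac correctors are typically used as a weighted simulation scheme combined with sequential Monte Carlo resampling. The sketch below shows only that generic reweight-and-resample step, assuming each particle carries a log-weight accumulated from some guiding potential. It is not the paper's corrector or its potential, and every function name here is illustrative.

```python
import math
import random


def normalize(log_weights):
    """Convert log-weights to normalized probabilities (numerically stable softmax)."""
    m = max(log_weights)
    unnorm = [math.exp(lw - m) for lw in log_weights]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def effective_sample_size(log_weights):
    """ESS = 1 / sum(w_i^2): N for uniform weights, about 1 when one particle dominates."""
    probs = normalize(log_weights)
    return 1.0 / sum(p * p for p in probs)


def systematic_resample(particles, log_weights, rng):
    """Draw len(particles) particles with probability proportional to their weights."""
    n = len(particles)
    probs = normalize(log_weights)
    u0 = rng.random() / n            # single random offset shared by all strata
    out, cum, i = [], probs[0], 0
    for k in range(n):
        u = u0 + k / n               # evenly spaced pointers into the CDF
        while u > cum and i < n - 1:
            i += 1
            cum += probs[i]
        out.append(particles[i])
    return out
```

A common design choice is to resample only when the effective sample size drops below a fraction of the particle count, so that well-balanced populations are not disturbed unnecessarily.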

If this is right

Diffusion policies retrained on the refined trajectories generate a wider set of behaviors than the original training data alone.
The guided samples avoid the infeasible regions that standard guidance methods produce.
Refined trajectories remain executable in the target environments after optimization.
The process yields new behaviors consistently across multiple manipulation tasks.
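The refinement step in these consequences relies on sampling-based trajectory optimization. This summary does not say which optimizer the authors use, so the sketch below swaps in the cross-entropy method as a familiar example on a toy one-dimensional trajectory. The cost function, its smoothness weight, and all hyperparameters are assumptions chosen for illustration.

```python
import random


def trajectory_cost(traj, goal=1.0):
    """Toy cost: squared goal error at the last waypoint plus a smoothness penalty."""
    smooth = sum((b - a) ** 2 for a, b in zip(traj, traj[1:]))
    return (traj[-1] - goal) ** 2 + 0.1 * smooth


def cem_refine(init_traj, cost_fn, iters=30, pop=64, n_elite=8, sigma0=0.3, seed=0):
    """Cross-entropy method: sample around a mean trajectory, then refit to the elites."""
    rng = random.Random(seed)
    dim = len(init_traj)
    mean = list(init_traj)
    sigma = [sigma0] * dim
    best, best_cost = list(mean), cost_fn(mean)
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, sigma)] for _ in range(pop)]
        for s in samples:
            s[0] = init_traj[0]                  # keep the start state fixed
        samples.sort(key=cost_fn)                # lowest cost first
        elites = samples[:n_elite]
        mean = [sum(e[i] for e in elites) / n_elite for i in range(dim)]
        sigma = [
            max(1e-3, (sum((e[i] - mean[i]) ** 2 for e in elites) / n_elite) ** 0.5)
            for i in range(dim)
        ]
        c = cost_fn(samples[0])
        if c < best_cost:                        # remember the best sample seen so far
            best, best_cost = list(samples[0]), c
    return best, best_cost
```

Starting from an all-zero trajectory of six waypoints, the initial cost under this toy function is 1.0, and `cem_refine([0.0] * 6, trajectory_cost)` returns a trajectory with lower cost whose first waypoint is unchanged. In a real pipeline the cost would come from simulator rollouts rather than a closed-form expression.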

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same steering-and-repair loop could be applied to other generative models used for robot behavior generation.
Iterating the mining and retraining steps multiple times might produce an expanding library of distinct behaviors over successive rounds.
The design suggests that feasibility-preserving guidance can be constructed without relying on reinforcement learning to escape local modes.

Load-bearing premise

The novel guiding potential can systematically identify promising yet underrepresented samples that remain feasible and executable after refinement.

What would settle it

Apply the method to a manipulation task with a known small demonstration set and observe whether any refined trajectories produce new executable behaviors absent from the original demonstrations after retraining.
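One way to make this test concrete is to score each refined trajectory by its distance to the nearest original demonstration and keep only those that also executed successfully. The sketch below assumes equal-length one-dimensional trajectories, a mean-squared distance, and a hand-picked threshold. These are illustrative choices, not the paper's novelty criterion.

```python
def trajectory_distance(a, b):
    """Mean squared difference between two equal-length 1-D trajectories."""
    if len(a) != len(b):
        raise ValueError("trajectories must have equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def novelty_score(candidate, demonstrations):
    """Distance from the candidate to its nearest demonstration."""
    return min(trajectory_distance(candidate, d) for d in demonstrations)


def find_new_behaviors(candidates, demonstrations, succeeded, threshold):
    """Keep candidates that executed successfully and lie far from every demonstration."""
    return [
        c
        for c, ok in zip(candidates, succeeded)
        if ok and novelty_score(c, demonstrations) > threshold
    ]
```

A candidate that nearly copies a demonstration is rejected for lack of novelty, and a distant candidate that failed to execute is rejected for infeasibility, so an empty result on a known small demonstration set would count against the claim.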

Figures

Figures reproduced from arXiv: 2606.08743 by Dian Yu, Majid Khadiv, Sebastian Sanokowski.

**Figure 2.** Representative GDNB-discovered behaviors and real-world replay. Blue/cyan denotes common behaviors from the base policy; orange/red denotes rare behaviors discovered by GDNB. Translucent robot copies indicate temporal rollout snapshots, and colored markers indicate accumulated contact locations. This figure shows key frames and contact/pose summaries. Kitchen (A1–A2): GDNB discovers rare subtask sequence…

**Figure 3.** …demonstrates that by leveraging the rare-event sampler introduced in Section 3.1 combined with SBTO correction, GDNB successfully recovers the missing modes on this task.

**Figure 4.** Lennard-Jones-like rare-shell potential. We visualize the normalized score-energy coordinate $x = d/d^*$. The shell cost $\Phi(x) = x^{-12} - 2x^{-6} + 1$ is minimized at $x = 1$, corresponding to the target rare shell $d = d^*$. The reward bias $B(x) = -\Phi(x)$ therefore attracts samples toward this shell instead of monotonically pushing them toward increasingly large score norms. Green curves show the corresponding deriva…
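The shell cost in this caption can be checked numerically. Since x^-12 - 2x^-6 + 1 = (x^-6 - 1)^2, the cost is non-negative, vanishes only at x = 1, rises steeply for x < 1, and saturates at 1 for large x. A minimal sketch, assuming only the formula given in the caption; the function names and the argument `d_star` are illustrative.

```python
def shell_cost(x):
    """Phi(x) = x^-12 - 2 x^-6 + 1, defined for x > 0."""
    return x ** -12 - 2.0 * x ** -6 + 1.0


def shell_cost_grad(x):
    """dPhi/dx = -12 x^-13 + 12 x^-7, which is zero at x = 1."""
    return -12.0 * x ** -13 + 12.0 * x ** -7


def reward_bias(d, d_star):
    """B = -Phi(d / d_star): maximal (zero) on the target shell d = d_star."""
    return -shell_cost(d / d_star)
```

Because the gradient is negative below the shell and positive above it, following the reward bias pulls samples toward x = 1 from both sides. The bounded penalty on the far side is what distinguishes this shape from a bias that grows without limit as the score norm increases.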

**Figure 5.** Condition-wise action densities on Multimodal Agent. Each colored curve is the KDE of sampled actions for one heading condition. The gray curve shows the bimodal reward landscape, and the shaded bands mark the two reward-equivalent optima at a = −0.5 and a = +0.5. The seed-trained baseline is concentrated on the right mode. DPPO preserves the same collapse. REPPO broadens the density around the right well …
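The kind of diagnostic shown in this figure can be reproduced on toy data with a hand-written Gaussian kernel density estimate and a simple count of samples near each optimum. The mode centers at -0.5 and +0.5 come from the caption; the bandwidth, the radius, and the sample values are assumptions made for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function a -> estimated density built from Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(a):
        return norm * sum(math.exp(-0.5 * ((a - s) / bandwidth) ** 2) for s in samples)

    return density


def mode_fractions(samples, modes=(-0.5, 0.5), radius=0.2):
    """Fraction of samples lying within `radius` of each mode center."""
    return [sum(abs(s - m) <= radius for s in samples) / len(samples) for m in modes]


# Toy action samples (hypothetical): one collapsed policy, one covering both modes.
collapsed = [0.45, 0.5, 0.55, 0.5, 0.48]
balanced = [-0.5, -0.45, 0.5, 0.55]
```

For the collapsed samples all mass sits near the right optimum, while the balanced samples split evenly between the two, which is the contrast a mode-coverage metric is meant to expose.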

**Figure 6.** Qualitative rare-case diagnostics for remaining benchmarks. (a) End-effector-center trajectories for two phases of Transport, with common rollouts in the left column and rare rollouts in the right column. (b) Base and rare pusher-center trajectories around the T-shaped block in Push-T. (c) Box-frame angle diagnostics for Block Pushing: first-contact angle on the left and post-contact push direction on the …

**Figure 7.** Task-feature distribution-shift diagnostics across manipulation benchmarks. Each panel shows a one-dimensional Task Feature KDE for a task-related event axis. Blue curves denote rollout samples from the base diffusion policy, orange curves denote rare-sampled rollout candidates, and green curves denote rollout samples from the fine-tuned policy after adaptation. The horizontal coordinate is the base-calibr…

Original abstract

Diffusion models have become a powerful tool for generative modeling in robotics, with diffusion policies excelling at modeling multimodal action-trajectory distributions. However, when demonstrations are limited, standard sampling often reproduces dominant behaviors while neglecting valid but rare modes, limiting the discovery of novel solutions. Existing approaches, such as guidance methods or combining reinforcement learning with diffusion, either push samples into infeasible regions or struggle to escape local minima, failing to systematically uncover diverse behaviors. To address these challenges, we propose a framework that combines Feynman-Kac correctors with a novel guiding potential that systematically guides diffusion policy samples towards promising yet underrepresented samples. These trajectories are refined using sampling-based trajectory optimization and reincorporated into the training set to retrain the diffusion policy. Our method effectively mines and repairs novel trajectories, enabling the systematic discovery of diverse and executable behaviors. We demonstrate the effectiveness of our framework across a range of manipulation environments, consistently discovering new behaviors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a pipeline using Feynman-Kac correctors plus a guiding potential to surface and repair rare trajectories in diffusion policies, but the abstract supplies no numbers or implementation details to back the claims.

The main point is that the authors combine Feynman-Kac correctors with a new guiding potential to steer diffusion sampling toward underrepresented but feasible trajectories in manipulation tasks, then refine those with sampling-based optimization and retrain the policy on them. This targets the known tendency of diffusion policies to stick to dominant modes when demonstration data is scarce.

The approach is laid out clearly enough in the abstract. It identifies why standard guidance tends to push samples into bad regions and why RL hybrids often fail to escape local minima, then proposes a sequence that mines, repairs, and folds the new trajectories back into training. That specific pairing looks new relative to the methods they cite.

The obvious gap is the total lack of results. The abstract asserts that the method "consistently discovers new behaviors" across environments, yet gives no success rates, diversity metrics, comparisons, or even a sketch of how the guiding potential is built or tuned. Without those, there is no way to tell whether the pipeline actually avoids the problems it claims to solve.

This is aimed at people already working on diffusion policies for robotics who want better coverage of the behavior space. A reader who needs a concrete method with measured gains will not get much from the abstract alone. If the full paper contains solid experiments that show measurable improvement in diversity without loss of executability, then the idea is worth referee time. Based on what is here, the evidence is too thin to judge, so I would not cite it yet and would only bring it to a reading group if the experiments turn out to be present and careful.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a framework for discovering novel behaviors with diffusion policies in robotics. It combines Feynman-Kac correctors with a novel guiding potential to steer sampling toward underrepresented yet feasible trajectories, refines those trajectories via sampling-based trajectory optimization, and retrains the diffusion policy on the augmented dataset. The central claim is that this pipeline systematically mines and repairs novel trajectories, yielding diverse and executable behaviors across manipulation environments without the infeasibility or local-minima issues of prior guidance or RL-augmented methods.

Significance. If the empirical claims hold, the approach would offer a principled way to expand the support of learned diffusion policies beyond dominant demonstration modes while preserving executability, which is a recurring limitation in robotic generative modeling with limited data. The explicit use of Feynman-Kac correctors and the guiding potential constitute a potentially reusable technical contribution.

major comments (1)

Abstract: the claim that the method 'consistently discover[s] new behaviors' across manipulation environments is presented without any quantitative results, success rates, diversity metrics, error analysis, or validation of the guiding potential, so the central empirical claim cannot be assessed from the provided description.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and the opportunity to clarify our work. We address the single major comment below.

Referee: Abstract: the claim that the method 'consistently discover[s] new behaviors' across manipulation environments is presented without any quantitative results, success rates, diversity metrics, error analysis, or validation of the guiding potential, so the central empirical claim cannot be assessed from the provided description.

Authors: We agree that the abstract, as written, states the empirical claim without supporting quantitative details. The body of the manuscript reports success rates, diversity metrics (e.g., mode coverage and trajectory variance), error analyses, and ablation studies validating the guiding potential across the manipulation tasks. To make the central claim directly assessable from the abstract, we will revise it in the next version to include concise quantitative highlights (e.g., average success-rate improvements and diversity gains) while preserving the overall length and readability. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical methodological pipeline (Feynman-Kac correctors + novel guiding potential + sampling-based refinement + retraining) whose central claims are supported by experimental results across manipulation tasks rather than by any derivation that reduces to fitted inputs or self-citations by construction. No equations, self-definitional steps, or load-bearing self-citations appear in the provided abstract or description; the work is self-contained against external benchmarks and does not rename known results or smuggle ansatzes via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the guiding potential is described as novel but without details on its form or any fitted values.

pith-pipeline@v0.9.1-grok · 5683 in / 1097 out tokens · 18182 ms · 2026-06-27T18:11:02.450159+00:00 · methodology

Reference graph

Works this paper leans on

48 extracted references · 5 canonical work pages

[1] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.

[2] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[3] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS

[4] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

[5] J. Ho and T. Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/abs/2207.12598

[6] K. Saha, V. Mandadi, J. Reddy, A. Srikanth, A. Agarwal, B. Sen, A. Singh, and M. Krishna. Edmp: Ensemble-of-costs-guided diffusion for motion planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 10351–10358. IEEE, 2024.

[7] S. Fan, Q. Yang, Y. Liu, K. Wu, Z. Che, Q. Liu, and M. Wan. Diffusion trajectory-guided policy for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 10(12):12788–12795, 2025. doi:10.1109/LRA.2025.3619794. URL https://doi.org/10.1109/LRA.2025.3619794

[8] X. Dai, Z. Yang, D. Yu, F. Liu, H. Sadeghian, S. Haddadin, and S. Hirche. Safeflow: Safe robot motion planning with flow matching via control barrier functions. arXiv preprint arXiv:2504.08661, 2025.

[9] R. S. Sutton and A. G. Barto. Reinforcement learning - an introduction. Adaptive computation and machine learning. MIT Press, 1998. ISBN 978-0-262-19398-6. URL http://www.incompleteideas.net/book/first/the-book.html

[10] Q. Zhang and Y. Chen. Path integral sampler: A stochastic control approach for sampling. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=_uCb2ynRu7Y

[11] J. Berner, L. Richter, and K. Ullrich. An optimal control perspective on diffusion-based generative modeling. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=oYIjw37pTP

[12] S. Sanokowski, L. Gruber, C. Bartmann, S. Hochreiter, and S. Lehner. Rethinking losses for diffusion bridge samplers. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=O58KDUfB4x

[13] A. Ren, J. Lidard, L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. In International Conference on Learning Representations, 2025.

[14] O. Celik, Z. Li, D. Blessing, G. Li, D. Palenicek, J. Peters, G. Chalvatzaki, and G. Neumann. DIME: Diffusion-based maximum entropy reinforcement learning. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proce...

[15] S. Sanokowski and K. Patil. Diffusion-augmented markov decision processes for maximum entropy reinforcement learning. arXiv preprint arXiv:2512.02019, 2025.

[16] J. He, Y. Du, F. Vargas, D. Zhang, S. Padhy, R. OuYang, C. Gomes, and J. M. Hernández-Lobato. No trick, no treat: Pursuits and challenges towards simulation-free training of neural samplers. arXiv preprint arXiv:2502.06685, 2025.

[17] A. Longhini, D. Emukpere, J.-M. Renders, and S. Kim. Behavioral mode discovery for fine-tuning multimodal generative policies, 2026. URL https://arxiv.org/abs/2605.11387

[18] X. Gu, C. Du, T. Pang, C. Li, M. Lin, and Y. Wang. On memorization in diffusion models. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=D3DBqvSDbj. Accepted by TMLR.

[19] C. He, X. Liu, G. M. S. Camps, J. Bruno, G. A. Sartoretti, and M. Schwager. Demystifying robot diffusion policies: Action memorization and a simple lookup table alternative. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=PL0tJOfm7I

[20] Z. Kadkhodaie, F. Guth, E. P. Simoncelli, and S. Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representations. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ANvmVS2Yr0

[21] M. Skreta, T. Akhound-Sadegh, V. Ohanesian, R. Bondesan, A. Aspuru-Guzik, A. Doucet, R. Brekelmans, A. Tong, and K. Neklyudov. Feynman-kac correctors in diffusion: Annealing, guidance, and product of experts. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors, Proceedings of the 42nd Internatio...

[22] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[23] A. Wagenmaker, Y. Zhang, M. Nakamoto, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning. In J. Lim, S. Song, and H.-W. Park, editors, Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 258–282. PMLR, 27–30 Sep

[24] URL https://proceedings.mlr.press/v305/wagenmaker25a.html

[25] Y. Jin, J. Lv, H. Xue, W. Chen, C. Wen, and C. Lu. Soe: Sample-efficient robot policy self-improvement via on-manifold exploration. arXiv preprint arXiv:2509.19292, 2025.

[26] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

[27] R. Devidze, P. Kamalaruban, and A. Singla. Exploration-guided reward shaping for reinforcement learning under sparse rewards. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 5829–5842. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper...

[28] J. Lehman and K. O. Stanley. Novelty Search and the Problem with Objectives, pages 37–56. Springer New York, New York, NY, 2011. ISBN 978-1-4614-1770-5. doi:10.1007/978-1-4614-1770-5_3. URL https://doi.org/10.1007/978-1-4614-1770-5_3

[29] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In J. Tan, M. Toussaint, and K. Darvish, editors, Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 1820–1864. PML...

[30] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In G. Gordon, D. Dunson, and M. Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 627–635, Fort Lauderdale...

[31] V. Dhedin, I. Taouil, S. Omar, D. Yu, K. Tao, A. Dai, and M. Khadiv. Dynaretarget: Dynamically-feasible retargeting using sampling-based trajectory optimization. arXiv preprint arXiv:2602.06827, 2026.

[32] M. Kobilarov. Cross-entropy motion planning. The International Journal of Robotics Research, 31(7):855–871, 2012.

[33] C. Pinneri, S. Sawant, S. Blaes, J. Achterhold, J. Stueckler, M. Rolinek, and G. Martius. Sample-efficient cross-entropy method for real-time planning. In J. Kober, F. Ramos, and C. Tomlin, editors, Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 1049–1065. PMLR, 16–18 Nov 2021. URL https:...

[34] G. Williams, A. Aldrich, and E. A. Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.

[35] Y. Xie, L. Winkler, L. Sun, S. Lewis, A. E. Foster, J. J. Luna, T. Hempel, M. Gastegger, Y. Chen, I. Zaporozhets, et al. Enhanced diffusion sampling: Efficient rare event sampling and free energy calculation with diffusion models. arXiv preprint arXiv:2602.16634, 2026.

[36] V. Kurtz. Hydrax: Sampling-based model predictive control on gpu with jax and mujoco mjx,

[37] https://github.com/vincekurtz/hydrax

[38] K. Rana, R. Lee, D. Pershouse, and N. Suenderhauf. IMLE Policy: Fast and Sample Efficient Visuomotor Policy Learning via Implicit Maximum Likelihood Estimation. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.158

[39] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.

[40] G. Mie. Zur kinetischen Theorie der einatomigen Körper. Annalen der Physik, 316(8):657–697,

[41] doi:10.1002/andp.19033160802. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/andp.19033160802

[42] R. J. Sadus. Second virial coefficient properties of the n-m Lennard-Jones/Mie potential. The Journal of Chemical Physics, 149(7), 2018. doi:10.1063/1.5041320

[43] F. Chazal, D. Cohen-Steiner, and Q. Mérigot. Geometric inference for probability measures. Foundations of Computational Mathematics, 11(6):733–751, 2011.

[44] M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832–837, 1956.

[45] Y. Jin, J. Lv, W. Yu, H. Fang, Y.-L. Li, and C. Lu. Sime: Enhancing policy self-improvement with modal-level exploration. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9792–9799. IEEE, 2025.

[46] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[47] X. Gu, L. Akoglu, and A. Rinaldo. Statistical analysis of nearest neighbor methods for anomaly detection. Advances in Neural Information Processing Systems, 32, 2019.

[48] J. T. Barron. A general and adaptive robust loss function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

A Sampler Details. This appendix gives sampler-side derivations and implementation details omitted from the main method. In this sampler appendix only, t denotes diffusion/reverse time; environ...

[1] [1]

Sohl-Dickstein, E

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015

2015

[2] [2]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020

[3] [3]

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. InInternational Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRH S

2021

[4] [4]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[5] [5]

Ho and T

J. Ho and T. Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/ abs/2207.12598

Pith/arXiv arXiv 2022

[6] [6]

K. Saha, V . Mandadi, J. Reddy, A. Srikanth, A. Agarwal, B. Sen, A. Singh, and M. Krishna. Edmp: Ensemble-of-costs-guided diffusion for motion planning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 10351–10358. IEEE, 2024

2024

[7] [7]

S. Fan, Q. Yang, Y . Liu, K. Wu, Z. Che, Q. Liu, and M. Wan. Diffusion trajectory-guided policy for long-horizon robot manipulation.IEEE Robotics and Automation Letters, 10(12): 12788–12795, 2025. doi:10.1109/LRA.2025.3619794. URL https://doi.org/10.1109/ LRA.2025.3619794

work page doi:10.1109/lra.2025.3619794 2025

[8] [8]

X. Dai, Z. Yang, D. Yu, F. Liu, H. Sadeghian, S. Haddadin, and S. Hirche. Safeflow: Safe robot motion planning with flow matching via control barrier functions.arXiv preprint arXiv:2504.08661, 2025

arXiv 2025

[9] [9]

R. S. Sutton and A. G. Barto.Reinforcement learning - an introduction. Adaptive computation and machine learning. MIT Press, 1998. ISBN 978-0-262-19398-6. URL http://www.inco mpleteideas.net/book/first/the-book.html

1998

[10] [10]

Zhang and Y

Q. Zhang and Y . Chen. Path integral sampler: A stochastic control approach for sampling. In International Conference on Learning Representations, 2022. URL https://openreview.n et/forum?id=_uCb2ynRu7Y

2022

[11] [11]

Berner, L

J. Berner, L. Richter, and K. Ullrich. An optimal control perspective on diffusion-based generative modeling.Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URLhttps://openreview.net/forum?id=oYIjw37pTP

2024

[12] [12]

Sanokowski, L

S. Sanokowski, L. Gruber, C. Bartmann, S. Hochreiter, and S. Lehner. Rethinking losses for diffusion bridge samplers. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=O58KDUfB4x

2026

[13] [13]

A. Ren, J. Lidard, L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. InInternational Conference on Learning Representations, 2025

2025

[14] [14]

Celik, Z

O. Celik, Z. Li, D. Blessing, G. Li, D. Palenicek, J. Peters, G. Chalvatzaki, and G. Neumann. DIME: Diffusion-based maximum entropy reinforcement learning. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofPro- ce...

2025

[15] [15]

Sanokowski and K

S. Sanokowski and K. Patil. Diffusion-augmented markov decision processes for maximum entropy reinforcement learning.arXiv preprint arXiv:2512.02019, 2025

Pith/arXiv arXiv 2025

[16] [16]

J. He, Y . Du, F. Vargas, D. Zhang, S. Padhy, R. OuYang, C. Gomes, and J. M. Hern´andez-Lobato. No trick, no treat: Pursuits and challenges towards simulation-free training of neural samplers. arXiv preprint arXiv:2502.06685, 2025

arXiv 2025

[17] [17]

Longhini, D

A. Longhini, D. Emukpere, J.-M. Renders, and S. Kim. Behavioral mode discovery for fine- tuning multimodal generative policies, 2026. URLhttps://arxiv.org/abs/2605.11387

Pith/arXiv arXiv 2026

[18] [18]

X. Gu, C. Du, T. Pang, C. Li, M. Lin, and Y . Wang. On memorization in diffusion models. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openre view.net/forum?id=D3DBqvSDbj. Accepted by TMLR

2025

[19] [19]

C. He, X. Liu, G. M. S. Camps, J. Bruno, G. A. Sartoretti, and M. Schwager. Demystifying robot diffusion policies: Action memorization and a simple lookup table alternative. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/forum?id=PL0tJOfm7I

2026

[20] [20]

Kadkhodaie, F

Z. Kadkhodaie, F. Guth, E. P. Simoncelli, and S. Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representations. InThe Twelfth International Confer- ence on Learning Representations, 2024. URL https://openreview.net/forum?id=ANvm VS2Yr0

2024

[21] [21]

Skreta, T

M. Skreta, T. Akhound-Sadegh, V . Ohanesian, R. Bondesan, A. Aspuru-Guzik, A. Doucet, R. Brekelmans, A. Tong, and K. Neklyudov. Feynman-kac correctors in diffusion: An- nealing, guidance, and product of experts. In A. Singh, M. Fazel, D. Hsu, S. Lacoste- Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Proceedings of the 42nd Internatio...

2025

[22] [22]

Schulman, F

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017

[23] [23]

Wagenmaker, Y

A. Wagenmaker, Y . Zhang, M. Nakamoto, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning. In J. Lim, S. Song, and H.-W. Park, editors,Proceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Research, pages 258–282. PMLR, 27–30 Sep

[24] [24]

URLhttps://proceedings.mlr.press/v305/wagenmaker25a.html

[25] [25]

Y . Jin, J. Lv, H. Xue, W. Chen, C. Wen, and C. Lu. Soe: Sample-efficient robot policy self-improvement via on-manifold exploration.arXiv preprint arXiv:2509.19292, 2025

arXiv 2025

[26] [26]

Eysenbach, A

B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function.arXiv preprint arXiv:1802.06070, 2018

Pith/arXiv arXiv 2018

[27] [27]

Devidze, P

R. Devidze, P. Kamalaruban, and A. Singla. Exploration-guided reward shaping for reinforce- ment learning under sparse rewards. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 5829–5842. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc /paper...

2022

[28]

J. Lehman and K. O. Stanley. Novelty Search and the Problem with Objectives, pages 37–56. Springer New York, New York, NY, 2011. ISBN 978-1-4614-1770-5. doi:10.1007/978-1-4614-1770-5_3. URL https://doi.org/10.1007/978-1-4614-1770-5_3.

[29]

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. In J. Tan, M. Toussaint, and K. Darvish, editors, Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 1820–1864. PML...

2023

[30]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In G. Gordon, D. Dunson, and M. Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 627–635, Fort Lauderdale...

2011

[31]

V. Dhedin, I. Taouil, S. Omar, D. Yu, K. Tao, A. Dai, and M. Khadiv. Dynaretarget: Dynamically-feasible retargeting using sampling-based trajectory optimization. arXiv preprint arXiv:2602.06827, 2026.

[32]

M. Kobilarov. Cross-entropy motion planning. The International Journal of Robotics Research, 31(7):855–871, 2012.

[33]

C. Pinneri, S. Sawant, S. Blaes, J. Achterhold, J. Stueckler, M. Rolinek, and G. Martius. Sample-efficient cross-entropy method for real-time planning. In J. Kober, F. Ramos, and C. Tomlin, editors, Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 1049–1065. PMLR, 16–18 Nov 2021. URL https:...

[34]

G. Williams, A. Aldrich, and E. A. Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.

[35]

Y. Xie, L. Winkler, L. Sun, S. Lewis, A. E. Foster, J. J. Luna, T. Hempel, M. Gastegger, Y. Chen, I. Zaporozhets, et al. Enhanced diffusion sampling: Efficient rare event sampling and free energy calculation with diffusion models. arXiv preprint arXiv:2602.16634, 2026.

[36]

V. Kurtz. Hydrax: Sampling-based model predictive control on GPU with JAX and MuJoCo MJX. URL https://github.com/vincekurtz/hydrax.

[38]

K. Rana, R. Lee, D. Pershouse, and N. Suenderhauf. IMLE Policy: Fast and Sample Efficient Visuomotor Policy Learning via Implicit Maximum Likelihood Estimation. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.158.

[39]

Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.

[40]

G. Mie. Zur kinetischen Theorie der einatomigen Körper. Annalen der Physik, 316(8):657–697, 1903. doi:10.1002/andp.19033160802. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/andp.19033160802.

[42]

R. J. Sadus. Second virial coefficient properties of the n-m Lennard-Jones/Mie potential. The Journal of Chemical Physics, 149(7), 2018. doi:10.1063/1.5041320.

[43]

F. Chazal, D. Cohen-Steiner, and Q. Mérigot. Geometric inference for probability measures. Foundations of Computational Mathematics, 11(6):733–751, 2011.

[44]

M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832–837, 1956.

[45]

Y. Jin, J. Lv, W. Yu, H. Fang, Y.-L. Li, and C. Lu. SIME: Enhancing policy self-improvement with modal-level exploration. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9792–9799. IEEE, 2025.

[46]

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[47]

X. Gu, L. Akoglu, and A. Rinaldo. Statistical analysis of nearest neighbor methods for anomaly detection. Advances in Neural Information Processing Systems, 32, 2019.

[48]

J. T. Barron. A general and adaptive robust loss function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

A Sampler Details

This appendix gives sampler-side derivations and implementation details omitted from the main method. In this sampler appendix only, t denotes diffusion/reverse time; environ...