ReGuide: From Test-Time Guidance to Self-Improving Diffusion Policies

Dileep Kalathil; P. R. Kumar; Srinivas Shakkottai; Tzu-Hsiang Lin

arxiv: 2606.28939 · v1 · pith:XJ5F3PDSnew · submitted 2026-06-27 · 💻 cs.LG · cs.RO


Pith reviewed 2026-06-30 09:43 UTC · model grok-4.3

classification 💻 cs.LG cs.RO

keywords diffusion policies, covariate shift, self-improving policies, test-time guidance, behavior cloning, robotic manipulation, fine-tuning

The pith

ReGuide turns one-time test-time guidance into reusable recovery data that lets diffusion policies improve themselves through fine-tuning or retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Behavior-cloned diffusion policies fail when small state deviations compound into task failure. ReGuide counters this by generating corrective rollouts with phase-conditioned guidance that steers only in recoverable regimes and then absorbs the successful trajectories back into the training set. The framework offers two absorption routes: fine-tuning the current checkpoint or retraining from scratch on the augmented data, and these can be iterated. Experiments on Robomimic manipulation tasks show success rates rising by factors of 1.3 to 7.7 over base policies, with matched-data controls attributing the gains to the guided recoveries rather than extra rollouts alone.

Core claim

ReGuide is a self-improving loop that first applies phase-conditioned guidance to produce corrective rollouts by constructing phase-specific latent targets and guiding through estimated clean actions, then absorbs successful guided trajectories back into the policy via ReGuide-FT fine-tuning or ReGuide-FS retraining from scratch; the two absorption methods can be composed and repeated.
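The loop in this claim can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `rollout_guided`, `update`, and the `Trajectory` type are hypothetical placeholders, and `update` stands in for either ReGuide-FT (fine-tune the current checkpoint) or ReGuide-FS (retrain from scratch on the augmented data).

```python
from typing import Callable, List, Tuple, TypeVar

Policy = TypeVar("Policy")
Trajectory = List[float]  # hypothetical stand-in for a recorded rollout


def reguide_iteration(
    policy: Policy,
    dataset: List[Trajectory],
    rollout_guided: Callable[[Policy], Tuple[Trajectory, bool]],
    update: Callable[[Policy, List[Trajectory]], Policy],
    n_rollouts: int,
) -> Tuple[Policy, List[Trajectory]]:
    """One hypothetical self-improvement step: guide, filter, absorb."""
    # 1. Roll out the current policy with guidance switched on.
    guided = [rollout_guided(policy) for _ in range(n_rollouts)]
    # 2. Keep only rollouts flagged as successful.
    successes = [traj for traj, ok in guided if ok]
    # 3. Merge them into the training set (the input list is not mutated).
    new_dataset = dataset + successes
    # 4. Absorb: `update` may fine-tune `policy` or ignore it and retrain.
    new_policy = update(policy, new_dataset)
    return new_policy, new_dataset
```

Calling `reguide_iteration` repeatedly, feeding each returned policy and dataset into the next call, corresponds to the iterated variant described above.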

What carries the argument

Phase-Conditioned Guidance (PCG), which builds phase-specific latent targets and restricts guidance to the drifted-but-recoverable regime to keep generated actions inside the dynamics model's training distribution.
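One way to picture the "drifted-but-recoverable" gate is as a distance band around the phase targets. The sketch below is an assumption-laden illustration: the Euclidean metric, the percentile thresholds (`lo_pct`, `hi_pct`), and all function names are hypothetical choices, since the page does not state how the paper defines the regime beyond using a per-phase distance distribution.

```python
import math
from statistics import quantiles
from typing import Sequence, Tuple

Vector = Sequence[float]


def distance_to_nearest(latent: Vector, targets: Sequence[Vector]) -> float:
    """Euclidean distance from a latent state to its closest phase target."""
    return min(math.dist(latent, t) for t in targets)


def recoverable_band(demo_distances: Sequence[float],
                     lo_pct: int = 90, hi_pct: int = 99) -> Tuple[float, float]:
    """Assumed band: between two percentiles of demonstration distances."""
    cuts = quantiles(demo_distances, n=100, method="inclusive")  # 99 cut points
    return cuts[lo_pct - 1], cuts[hi_pct - 1]


def guidance_active(latent: Vector, targets: Sequence[Vector],
                    band: Tuple[float, float]) -> bool:
    """True only when the state has drifted but is assumed still recoverable."""
    lo, hi = band
    d = distance_to_nearest(latent, targets)
    # d <= lo: close to the demonstrations, leave the policy alone.
    # d > hi: assumed unrecoverable, guidance is not applied.
    return lo < d <= hi
```

In this sketch the band would be computed once per phase from demonstration data and then queried at every control step.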

If this is right

ReGuide raises base diffusion policy success rates by 1.3–7.7× on the Robomimic Can, Square, Transport, and Tool Hang tasks.
ReGuide outperforms LPB when both are restricted to the test-time-only setting.
Matched-data ablations show performance gains arise from the guided recovery trajectories rather than from simply collecting more rollouts.
The framework supports repeated improvement by composing fine-tuning and from-scratch retraining steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Iterating the absorption loop multiple times could reduce dependence on the size of the original demonstration set.
The recoverable-regime restriction in guidance may apply to other policy classes that exhibit partial recoverability.
Success identification could be replaced by an automated verifier to close the loop without human labeling.

Load-bearing premise

Guided rollouts can be reliably labeled as successful and folded back into training without creating new distribution shift or compounding errors.

What would settle it

A controlled run in which success rates after absorbing the guided data are no higher than after absorbing an equal number of unguided rollouts would falsify the claim that the guidance step supplies the critical recovery signal.
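A comparison of this kind could be scored with a two-proportion z-test on evaluation success counts. The sketch below is one possible analysis, not something the paper is known to report; the function name is hypothetical, and any counts passed in would come from the experimenter's own evaluation runs.

```python
import math
from typing import Tuple


def two_proportion_z(succ_guided: int, n_guided: int,
                     succ_unguided: int, n_unguided: int) -> Tuple[float, float]:
    """Return (z, one-sided p) for H1: guided success rate > unguided."""
    p_g = succ_guided / n_guided
    p_u = succ_unguided / n_unguided
    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (succ_guided + succ_unguided) / (n_guided + n_unguided)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_guided + 1.0 / n_unguided))
    if se == 0.0:
        raise ValueError("all trials succeeded or all failed; z is undefined")
    z = (p_g - p_u) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z > z), standard normal
    return z, p_value
```

A z near zero (p near 0.5) on matched data budgets would be the outcome that undercuts the guidance claim; a large positive z would support it.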

Figures

Figures reproduced from arXiv: 2606.28939 by Dileep Kalathil, P. R. Kumar, Srinivas Shakkottai, Tzu-Hsiang Lin.

**Figure 1.** ReGuide overview. At iteration $i$: starting with $\pi_i$ and $D_i$, construct phase targets $P$ by clustering the latent states, roll out $\pi_i$ with phase-conditioned guidance (active only in the drifted-but-recoverable regime per the per-phase distance distribution), and merge successful guided rollouts $D_i^g$ into $D_{i+1}$ for the next iteration. Update the policy to get $\pi_{i+1}$ from $D_{i+1}$ and $\pi_i$.

**Figure 2.** Phase-aware target construction. Latent states are clustered using temporally augmented features, ordered by trajectory time, and grouped into macro-phases. Representative centroids from each phase define the target sets used for phase-conditioned guidance. Body text following the caption: "We cluster these features after dimensionality reduction using a standard Principal Component Analysis (PCA), sort clusters by their mean timest…"

**Figure 3.** Main results. ReGuide improves over the base diffusion policy across all tasks. ReGuide-FT and ReGuide-FS are complementary variants; their composition ReGuide-FS→FT gives the best result on Can, Square, and Transport, while iterated ReGuide-FT remains slightly stronger on Tool Hang.

**Figure 4.** Iterative self-improvement. The second iteration of ReGuide-FT improves over the first on all tasks, showing that updated policies can generate useful new guided rollouts. The x-axis shows the cumulative number of guided rollouts collected to update the policy. Body text following the caption: "Ablation studies. The appendix isolates the main algorithmic choices in ReGuide: 1. Number of phase targets. Table 2a sweeps the number of targets …"
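The ordering and grouping steps described for Figure 2 can be illustrated as follows. This sketch assumes cluster labels have already been produced by some standard clustering method; the function names and the rule of splitting the ordered clusters into contiguous, near-equal chunks are hypothetical choices, not details taken from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Hashable, List, Sequence


def order_clusters_by_time(labels: Sequence[Hashable],
                           timesteps: Sequence[float]) -> List[Hashable]:
    """Sort cluster ids by the mean timestep of their member states."""
    by_cluster: Dict[Hashable, List[float]] = defaultdict(list)
    for label, t in zip(labels, timesteps):
        by_cluster[label].append(t)
    return sorted(by_cluster, key=lambda label: mean(by_cluster[label]))


def group_into_macro_phases(ordered: Sequence[Hashable],
                            n_phases: int) -> List[List[Hashable]]:
    """Assumed rule: split the time-ordered clusters into contiguous chunks."""
    size, extra = divmod(len(ordered), n_phases)
    phases, start = [], 0
    for p in range(n_phases):
        # The first `extra` phases take one additional cluster each.
        end = start + size + (1 if p < extra else 0)
        phases.append(list(ordered[start:end]))
        start = end
    return phases
```

The centroids of the clusters in each macro-phase would then serve as that phase's target set.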

Original abstract

Behavior-cloned diffusion policies are expressive but remain vulnerable to covariate shift: small deviations from demonstrated states can compound into task failure. Existing methods address this either by expanding the training distribution through expert corrections or synthetic augmentation, or by steering a frozen policy at test time with guidance from a learned model. The former can be expensive or assumption-dependent, while the latter discards the corrected trajectories after execution. We introduce ReGuide, a self-improving framework that treats guided rollouts as reusable on-policy recovery data. ReGuide first uses Phase-Conditioned Guidance (PCG) to generate corrective rollouts: it constructs phase-specific latent targets, applies guidance only in the drifted-but-recoverable regime, and guides through the estimated clean action to match the dynamics model's training distribution. Successful guided rollouts are then absorbed back into the policy through ReGuide-FT, which fine-tunes the current checkpoint, or ReGuide-FS, which retrains from scratch on the augmented dataset; the two can also be composed and iterated. On Robomimic Can, Square, Transport, and Tool Hang, ReGuide improves base-policy success by $1.3$--$7.7\times$, outperforms LPB in the test-time-only setting, and matched-data ablations show that the gains come from guided recovery data rather than additional rollouts alone.
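For readers unfamiliar with "guiding through the estimated clean action", the following is a hedged sketch of what such a step commonly looks like. It assumes a standard DDPM noise-prediction parameterization; the symbols ($a_t$ for the noisy action, $o$ for the observation, $f_\phi$ for a latent dynamics model, $z^{\star}_{p}$ for a phase target, $d$ for a distance) are illustrative and are not taken from the paper.

```latex
% Assumption: standard DDPM forward process
%   a_t = \sqrt{\bar{\alpha}_t}\, a_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon .
% Solving for a_0 with the network's noise estimate gives the clean-action estimate:
\hat{a}_0 = \frac{a_t - \sqrt{1 - \bar{\alpha}_t}\;\epsilon_\theta(a_t, t, o)}{\sqrt{\bar{\alpha}_t}}

% Hypothetical guidance direction: move a_t so that the dynamics model's
% prediction from the clean estimate gets closer to the phase target.
g_t = -\nabla_{a_t}\, d\bigl(f_\phi(z, \hat{a}_0),\; z^{\star}_{p}\bigr)
```

The point of routing the gradient through $\hat{a}_0$ rather than $a_t$ is that a dynamics model trained on clean actions is queried with an input that resembles its training data.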

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ReGuide's closed loop of phase-conditioned guidance plus data reuse shows empirical lifts on Robomimic, but the safety of absorbing guided rollouts rests on assumptions that the abstract leaves unverified.

The main takeaway is that ReGuide converts test-time guidance into on-policy recovery data for diffusion policies, then folds that data back in via fine-tuning or retraining from scratch. The specific loop—PCG to steer only in the drifted-but-recoverable regime, followed by ReGuide-FT or ReGuide-FS—appears new relative to the prior methods summarized in the abstract.

The paper does a reasonable job on the empirical side. It reports 1.3–7.7× success gains over the base policy on Can, Square, Transport, and Tool Hang, beats LPB in the test-time-only case, and includes matched-data ablations that try to show the improvement comes from the guided recoveries rather than extra rollouts alone.

The soft spot is the one flagged in the stress test. The whole self-improvement claim depends on reliably labeling guided rollouts as successful and confirming that the guided action distribution stays inside the original behavior-cloned support. The abstract mentions matching the dynamics model’s training distribution and restricting guidance to the recoverable regime, but supplies no concrete success criteria, no description of how distributional fidelity is checked, and no error bars or dataset statistics. Without those pieces, it is hard to tell whether the loop actually reduces covariate shift or risks compounding it.

This is for people working on behavior-cloned diffusion policies in robotics who want a practical route to self-improvement without new expert data. A reader in that niche would get value from the PCG construction and the ablation design.

It deserves a serious referee because the idea is concrete, the benchmarks are standard, and the matched-data controls are a step in the right direction, even if the methods section will need to address the verification steps directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ReGuide, a self-improving framework for diffusion policies that first applies Phase-Conditioned Guidance (PCG) to produce corrective rollouts only in the drifted-but-recoverable regime while matching the dynamics model's training distribution, then absorbs successful guided trajectories back into the policy via ReGuide-FT (fine-tuning the current checkpoint) or ReGuide-FS (retraining from scratch on the augmented dataset). On Robomimic Can, Square, Transport, and Tool Hang, it reports 1.3–7.7× gains in base-policy success rate, outperforms LPB under test-time-only guidance, and uses matched-data ablations to attribute gains to the guided recovery data rather than additional rollouts.

Significance. If the empirical claims hold, the work offers a concrete bridge between test-time guidance and iterative self-improvement for behavior-cloned diffusion policies, potentially lowering the cost of expert data collection in robotic manipulation. The matched-data ablations constitute a methodological strength by isolating the contribution of guided recoveries from mere increases in rollout volume.

major comments (2)

[Abstract] Abstract (paragraph describing ReGuide-FT/FS): the central claim that guided rollouts can be reliably identified as successful and absorbed as on-policy recovery data without introducing new distribution shift or compounding errors lacks any description of the success criterion, verification procedure, or support check against the original BC distribution; this assumption is load-bearing for the self-improvement loop.
[Abstract] Abstract (PCG description): the statement that guidance is applied 'only in the drifted-but-recoverable regime' and 'through the estimated clean action to match the dynamics model's training distribution' is presented without the corresponding equations, phase-conditioning mechanism, or dynamics-model details needed to evaluate whether the generated actions remain inside the original training support.

minor comments (2)

[Abstract] The abstract reports quantitative gains (1.3–7.7×) but supplies no error bars, number of seeds, or explicit success criteria; these should be added to the results summary even if full tables appear later.
Dataset statistics (number of demonstrations, task horizons, observation/action dimensions) are referenced only implicitly; a brief table or paragraph would improve reproducibility.
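As an illustration of the kind of error bar the first minor comment asks for, a binomial success rate can be reported with a Wilson score interval. This is a generic statistical sketch, not a procedure attributed to the paper; the function name and the default `z = 1.96` (roughly a 95% interval) are choices made here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half
```

Unlike the plain normal approximation, this interval stays inside [0, 1] even when the observed success rate is close to 0 or 1, which matters for hard tasks with few successes.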

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications from the full manuscript and indicate where revisions will be made.

Referee: [Abstract] Abstract (paragraph describing ReGuide-FT/FS): the central claim that guided rollouts can be reliably identified as successful and absorbed as on-policy recovery data without introducing new distribution shift or compounding errors lacks any description of the success criterion, verification procedure, or support check against the original BC distribution; this assumption is load-bearing for the self-improvement loop.

Authors: We agree the abstract omits these details. The full manuscript defines success via the standard Robomimic environment success flags (task completion within episode horizon) in Sections 4.1 and 4.3; verification consists of executing the rollout and recording the binary success indicator. Distribution-shift concerns are addressed via the matched-data ablations in Section 5.3, which isolate guided-recovery contributions. We will revise the abstract to include a brief clause on the success criterion and verification. revision: yes

Referee: [Abstract] Abstract (PCG description): the statement that guidance is applied 'only in the drifted-but-recoverable regime' and 'through the estimated clean action to match the dynamics model's training distribution' is presented without the corresponding equations, phase-conditioning mechanism, or dynamics-model details needed to evaluate whether the generated actions remain inside the original training support.

Authors: The abstract is intentionally high-level; the phase-conditioning mechanism, equations for constructing phase-specific latent targets, and clean-action guidance are fully specified in Section 3.2, together with the dynamics model (identical to the one used for the base diffusion policy). We will add a short clarifying phrase to the abstract referencing these elements while keeping the abstract equation-free. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent experimental validation

The paper describes an empirical self-improving framework (ReGuide with PCG, ReGuide-FT/FS) for diffusion policies, supported by success-rate improvements on Robomimic tasks and matched-data ablations that isolate the contribution of guided recovery data. No equations, derivations, or first-principles predictions appear; claims do not reduce to fitted parameters renamed as outputs or to self-citation chains. The method is self-contained against external benchmarks (task success rates), with no load-bearing uniqueness theorems or ansatzes imported from prior author work. This is the normal case of an applied ML paper whose central results are falsifiable via replication rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be extracted. The central claim rests on the unstated premise that guided recoveries remain distributionally compatible with the original training data.

Reference graph

Works this paper leans on

27 extracted references · 11 canonical work pages

[1] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15 of Proceedings of Machine Learning Research, pages 627–635, 2011.

[2] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023. doi:10.15607/RSS.2023.XIX.026

[3] Z. Hu, R. Wu, N. Enock, J. Li, R. Kadakia, Z. Erickson, and A. Kumar. Rac: Robot learning for long-horizon tasks by scaling recovery and correction, 2025. URL https://arxiv.org/abs/2509.07953

[4] X. Xu, Y. Hou, C. Xin, Z. Liu, and S. Song. Compliant residual dagger: Improving real-world contact-rich manipulation with human corrections, 2025. URL https://arxiv.org/abs/2506.16685

[5] L. Ke, Y. Zhang, A. Deshpande, S. Srinivasa, and A. Gupta. CCIL: Continuity-based data augmentation for corrective imitation learning. In International Conference on Learning Representations (ICLR), 2024.

[6] L. Ankile, A. Simeonov, I. Shenfeld, and P. Agrawal. Juicer: Data-efficient imitation learning for robotic assembly. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5096–5103, 2024. doi:10.1109/IROS58592.2024.10802498

[7] M. Jia, D. Wang, G. Su, D. Klee, X. Zhu, R. Walters, and R. Platt. SEIL: Simulation-augmented equivariant imitation learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1845–1851, 2023.

[8] M. Du and S. Song. Dynaguide: Steering diffusion polices with active dynamic guidance, 2025. URL https://arxiv.org/abs/2506.13922

[9] Z. Sun and S. Song. Latent policy barrier: Learning robust visuomotor policies by staying in-distribution, 2025. URL https://arxiv.org/abs/2508.05941

[10] Y. He, N. Murata, C.-H. Lai, Y. Takida, T. Uesaka, D. Kim, W.-H. Liao, Y. Mitsufuji, J. Z. Kolter, R. Salakhutdinov, and S. Ermon. Manifold preserving guided diffusion. In International Conference on Learning Representations (ICLR), 2024.

[11] A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), volume 164 of Proceedings of Machine Learning Research, pages 1678–1690, 2022.

[12] Z. Wang, D. K. Jha, A. H. Qureshi, and D. Romeres. Ppguide: Steering diffusion policies with performance predictive guidance, 2026. URL https://arxiv.org/abs/2603.10980

[13] H. Qi, H. Yin, A. Zhu, Y. Du, and H. Yang. Inference-time enhancement of generative robot policies via predictive world modeling, 2026. URL https://arxiv.org/abs/2502.00622

[14] S. A. Mehta, Y. U. Ciftci, B. Ramachandran, S. Bansal, and D. P. Losey. Stable-bc: Controlling covariate shift with stable behavior cloning. IEEE Robotics and Automation Letters, 10(2):1952–1959, 2025. doi:10.1109/LRA.2025.3526439

[15] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 8780–8794, 2021.

[16] J. Ho and T. Salimans. Classifier-free diffusion guidance. In NeurIPS Workshop on Deep Generative Models and Downstream Applications, 2021.

[17] K. Frans, S. Park, P. Abbeel, and S. Levine. Diffusion guidance is a controllable policy improvement operator, 2025. URL https://arxiv.org/abs/2505.23458

[18] C. Lu, H. Chen, J. Chen, H. Su, C. Li, and J. Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research, pages 22825–22855, 2023.

[19] P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies, 2023. URL https://arxiv.org/abs/2304.10573

[20] G. Zhou, H. Pan, Y. LeCun, and L. Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. In Proceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 of Proceedings of Machine Learning Research, pages 79115–79135, 2025.

[21] A. Mandlekar, F. Ramos, B. Boots, S. Savarese, L. Fei-Fei, A. Garg, and D. Fox. IRIS: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 4414–4420, 2020.

[22] A. Gupta, V. Kumar, C. Lynch, S. Levine, and K. Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. In Proceedings of the Conference on Robot Learning (CoRL), volume 100 of Proceedings of Machine Learning Research, pages 1025–1037, 2020.

[23] M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the Conference on Robot Learning (CoRL), volume 205 of Proceedings of Machine Learning Research, pages 785–799, 2022.

[24] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. In Proceedings of the Conference on Robot Learning (CoRL), volume 229 of Proceedings of Machine Learning Research, pages 1820–1864, 2023.

[25] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[26] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[27] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara. Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
