pith. sign in

arxiv: 2606.28939 · v1 · pith:XJ5F3PDSnew · submitted 2026-06-27 · 💻 cs.LG · cs.RO

ReGuide: From Test-Time Guidance to Self-Improving Diffusion Policies

Pith reviewed 2026-06-30 09:43 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords diffusion policiescovariate shiftself-improving policiestest-time guidancebehavior cloningrobotic manipulationfine-tuning
0
0 comments X

The pith

ReGuide turns one-time test-time guidance into reusable recovery data that lets diffusion policies improve themselves through fine-tuning or retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Behavior-cloned diffusion policies fail when small state deviations compound into task failure. ReGuide counters this by generating corrective rollouts with phase-conditioned guidance that steers only in recoverable regimes and then absorbs the successful trajectories back into the training set. The framework offers two absorption routes: fine-tuning the current checkpoint or retraining from scratch on the augmented data, and these can be iterated. Experiments on Robomimic manipulation tasks show success rates rising by factors of 1.3 to 7.7 over base policies, with matched-data controls attributing the gains to the guided recoveries rather than extra rollouts alone.

Core claim

ReGuide is a self-improving loop that first applies phase-conditioned guidance to produce corrective rollouts by constructing phase-specific latent targets and guiding through estimated clean actions, then absorbs successful guided trajectories back into the policy via ReGuide-FT fine-tuning or ReGuide-FS retraining from scratch; the two absorption methods can be composed and repeated.

What carries the argument

Phase-Conditioned Guidance (PCG), which builds phase-specific latent targets and restricts guidance to the drifted-but-recoverable regime to keep generated actions inside the dynamics model's training distribution.

If this is right

  • Base diffusion policies achieve 1.3–7.7× higher success on Robomimic Can, Square, Transport, and Tool Hang tasks.
  • ReGuide outperforms LPB when both are restricted to the test-time-only setting.
  • Matched-data ablations show performance gains arise from the guided recovery trajectories rather than from simply collecting more rollouts.
  • The framework supports repeated improvement by composing fine-tuning and from-scratch retraining steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Iterating the absorption loop multiple times could reduce dependence on the size of the original demonstration set.
  • The recoverable-regime restriction in guidance may apply to other policy classes that exhibit partial recoverability.
  • Success identification could be replaced by an automated verifier to close the loop without human labeling.

Load-bearing premise

Guided rollouts can be reliably labeled as successful and folded back into training without creating new distribution shift or compounding errors.

What would settle it

A controlled run in which success rates after absorbing the guided data are no higher than after absorbing an equal number of unguided rollouts would falsify the claim that the guidance step supplies the critical recovery signal.

Figures

Figures reproduced from arXiv: 2606.28939 by Dileep Kalathil, P. R. Kumar, Srinivas Shakkottai, Tzu-Hsiang Lin.

Figure 1
Figure 1. Figure 1: ReGuide overview. At iteration i: starting with πi and Di , construct phase targets P by clustering the latent states, roll out πi with phase-conditioned guidance (active only in the drifted-but￾recoverable regime per the per-phase distance distribution), and merge successful guided rollouts D g i into Di+1 for the next iteration. Update the policy to get πi+1 from Di+1 and πi 4 Our Approach: ReGuide Algor… view at source ↗
Figure 2
Figure 2. Figure 2: Phase-aware target construction. Latents states are clustered using temporally augmented features, ordered by trajectory time, and grouped into macro-phases. Representative centroids from each phase define the target sets used for phase-conditioned guidance. task. We cluster these features after dimensionality reduction using a standard Principal Component Analysis (PCA), sort clusters by their mean timest… view at source ↗
Figure 3
Figure 3. Figure 3: Main results. ReGuide improves over the base diffusion policy across all tasks. ReGuide￾FT and ReGuide-FS are complementary variants; their composition ReGuide-FS→FT gives the best result on Can, Square, and Transport, while iterated ReGuide-FT remains slightly stronger on Tool Hang [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Iterative self-improvement. The second iteration of ReGuide-FT improves over the first on all tasks, showing that updated policies can generate useful new guided rollouts. The x-axis shows the cumulative number of guided rollouts collected to update the policy. Ablation studies. The appendix isolates the main algorithmic choices in ReGuide: 1. Number of phase targets. Table 2a sweeps the number of targets … view at source ↗
read the original abstract

Behavior-cloned diffusion policies are expressive but remain vulnerable to covariate shift: small deviations from demonstrated states can compound into task failure. Existing methods address this either by expanding the training distribution through expert corrections or synthetic augmentation, or by steering a frozen policy at test time with guidance from a learned model. The former can be expensive or assumption-dependent, while the latter discards the corrected trajectories after execution. We introduce ReGuide, a self-improving framework that treats guided rollouts as reusable on-policy recovery data. ReGuide first uses Phase-Conditioned Guidance (PCG) to generate corrective rollouts: it constructs phase-specific latent targets, applies guidance only in the drifted-but-recoverable regime, and guides through the estimated clean action to match the dynamics model's training distribution. Successful guided rollouts are then absorbed back into the policy through ReGuide-FT, which fine-tunes the current checkpoint, or ReGuide-FS, which retrains from scratch on the augmented dataset; the two can also be composed and iterated. On Robomimic Can, Square, Transport, and Tool Hang, ReGuide improves base-policy success by $1.3$--$7.7\times$, outperforms LPB in the test-time-only setting, and matched-data ablations show that the gains come from guided recovery data rather than additional rollouts alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ReGuide, a self-improving framework for diffusion policies that first applies Phase-Conditioned Guidance (PCG) to produce corrective rollouts only in the drifted-but-recoverable regime while matching the dynamics model's training distribution, then absorbs successful guided trajectories back into the policy via ReGuide-FT (fine-tuning the current checkpoint) or ReGuide-FS (retraining from scratch on the augmented dataset). On Robomimic Can, Square, Transport, and Tool Hang, it reports 1.3--7.7× gains in base-policy success rate, outperforms LPB under test-time-only guidance, and uses matched-data ablations to attribute gains to the guided recovery data rather than additional rollouts.

Significance. If the empirical claims hold, the work offers a concrete bridge between test-time guidance and iterative self-improvement for behavior-cloned diffusion policies, potentially lowering the cost of expert data collection in robotic manipulation. The matched-data ablations constitute a methodological strength by isolating the contribution of guided recoveries from mere increases in rollout volume.

major comments (2)
  1. [Abstract] Abstract (paragraph describing ReGuide-FT/FS): the central claim that guided rollouts can be reliably identified as successful and absorbed as on-policy recovery data without introducing new distribution shift or compounding errors lacks any description of the success criterion, verification procedure, or support check against the original BC distribution; this assumption is load-bearing for the self-improvement loop.
  2. [Abstract] Abstract (PCG description): the statement that guidance is applied 'only in the drifted-but-recoverable regime' and 'through the estimated clean action to match the dynamics model's training distribution' is presented without the corresponding equations, phase-conditioning mechanism, or dynamics-model details needed to evaluate whether the generated actions remain inside the original training support.
minor comments (2)
  1. [Abstract] The abstract reports quantitative gains (1.3--7.7×) but supplies no error bars, number of seeds, or explicit success criteria; these should be added to the results summary even if full tables appear later.
  2. Dataset statistics (number of demonstrations, task horizons, observation/action dimensions) are referenced only implicitly; a brief table or paragraph would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications from the full manuscript and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph describing ReGuide-FT/FS): the central claim that guided rollouts can be reliably identified as successful and absorbed as on-policy recovery data without introducing new distribution shift or compounding errors lacks any description of the success criterion, verification procedure, or support check against the original BC distribution; this assumption is load-bearing for the self-improvement loop.

    Authors: We agree the abstract omits these details. The full manuscript defines success via the standard Robomimic environment success flags (task completion within episode horizon) in Sections 4.1 and 4.3; verification consists of executing the rollout and recording the binary success indicator. Distribution-shift concerns are addressed via the matched-data ablations in Section 5.3, which isolate guided-recovery contributions. We will revise the abstract to include a brief clause on the success criterion and verification. revision: yes

  2. Referee: [Abstract] Abstract (PCG description): the statement that guidance is applied 'only in the drifted-but-recoverable regime' and 'through the estimated clean action to match the dynamics model's training distribution' is presented without the corresponding equations, phase-conditioning mechanism, or dynamics-model details needed to evaluate whether the generated actions remain inside the original training support.

    Authors: The abstract is intentionally high-level; the phase-conditioning mechanism, equations for constructing phase-specific latent targets, and clean-action guidance are fully specified in Section 3.2, together with the dynamics model (identical to the one used for the base diffusion policy). We will add a short clarifying phrase to the abstract referencing these elements while keeping the abstract equation-free. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent experimental validation

full rationale

The paper describes an empirical self-improving framework (ReGuide with PCG, ReGuide-FT/FS) for diffusion policies, supported by success-rate improvements on Robomimic tasks and matched-data ablations that isolate the contribution of guided recovery data. No equations, derivations, or first-principles predictions appear; claims do not reduce to fitted parameters renamed as outputs or to self-citation chains. The method is self-contained against external benchmarks (task success rates), with no load-bearing uniqueness theorems or ansatzes imported from prior author work. This is the normal case of an applied ML paper whose central results are falsifiable via replication rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be extracted. The central claim rests on the unstated premise that guided recoveries remain distributionally compatible with the original training data.

pith-pipeline@v0.9.1-grok · 5784 in / 1065 out tokens · 38298 ms · 2026-06-30T09:43:53.692989+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

27 extracted references · 11 canonical work pages

  1. [1]

    S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15 ofProceedings of Machine Learning Research, pages 627–635, 2011

  2. [2]

    C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems (RSS), 2023. doi:10.15607/RSS.2023.XIX.026

  3. [3]

    Z. Hu, R. Wu, N. Enock, J. Li, R. Kadakia, Z. Erickson, and A. Kumar. Rac: Robot learning for long-horizon tasks by scaling recovery and correction, 2025. URLhttps://arxiv.org/ abs/2509.07953

  4. [4]

    X. Xu, Y . Hou, C. Xin, Z. Liu, and S. Song. Compliant residual dagger: Improving real-world contact-rich manipulation with human corrections, 2025. URL https://arxiv.org/abs/ 2506.16685

  5. [5]

    L. Ke, Y . Zhang, A. Deshpande, S. Srinivasa, and A. Gupta. CCIL: Continuity-based data augmentation for corrective imitation learning. InInternational Conference on Learning Representations (ICLR), 2024

  6. [6]

    In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp

    L. Ankile, A. Simeonov, I. Shenfeld, and P. Agrawal. Juicer: Data-efficient imitation learning for robotic assembly. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5096–5103, 2024. doi:10.1109/IROS58592.2024.10802498

  7. [7]

    M. Jia, D. Wang, G. Su, D. Klee, X. Zhu, R. Walters, and R. Platt. SEIL: Simulation-augmented equivariant imitation learning. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1845–1851, 2023

  8. [8]

    Du and S

    M. Du and S. Song. Dynaguide: Steering diffusion polices with active dynamic guidance, 2025. URLhttps://arxiv.org/abs/2506.13922

  9. [9]

    Sun and S

    Z. Sun and S. Song. Latent policy barrier: Learning robust visuomotor policies by staying in-distribution, 2025. URLhttps://arxiv.org/abs/2508.05941

  10. [10]

    Y . He, N. Murata, C.-H. Lai, Y . Takida, T. Uesaka, D. Kim, W.-H. Liao, Y . Mitsufuji, J. Z. Kolter, R. Salakhutdinov, and S. Ermon. Manifold preserving guided diffusion. InInternational Conference on Learning Representations (ICLR), 2024

  11. [11]

    Mandlekar, D

    A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. InProceedings of the 5th Conference on Robot Learning (CoRL), volume 164 ofProceedings of Machine Learning Research, pages 1678–1690, 2022

  12. [12]

    Z. Wang, D. K. Jha, A. H. Qureshi, and D. Romeres. Ppguide: Steering diffusion policies with performance predictive guidance, 2026. URLhttps://arxiv.org/abs/2603.10980

  13. [13]

    H. Qi, H. Yin, A. Zhu, Y . Du, and H. Yang. Inference-time enhancement of generative robot policies via predictive world modeling, 2026. URL https://arxiv.org/abs/2502.00622

  14. [14]

    S. A. Mehta, Y . U. Ciftci, B. Ramachandran, S. Bansal, and D. P. Losey. Stable-bc: Controlling covariate shift with stable behavior cloning.IEEE Robotics and Automation Letters, 10(2): 1952–1959, 2025. doi:10.1109/LRA.2025.3526439

  15. [15]

    Dhariwal and A

    P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. InAdvances in Neural Information Processing Systems (NeurIPS), volume 34, pages 8780–8794, 2021. 10

  16. [16]

    Ho and T

    J. Ho and T. Salimans. Classifier-free diffusion guidance. InNeurIPS Workshop on Deep Generative Models and Downstream Applications, 2021

  17. [17]

    Diffusion guidance is a controllable policy improvement operator

    K. Frans, S. Park, P. Abbeel, and S. Levine. Diffusion guidance is a controllable policy improvement operator, 2025. URLhttps://arxiv.org/abs/2505.23458

  18. [18]

    C. Lu, H. Chen, J. Chen, H. Su, C. Li, and J. Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. InProceedings of the 40th International Conference on Machine Learning (ICML), volume 202 ofProceedings of Machine Learning Research, pages 22825–22855, 2023

  19. [19]

    Hansen-Estruch, I

    P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies, 2023. URLhttps://arxiv.org/abs/2304. 10573

  20. [20]

    G. Zhou, H. Pan, Y . LeCun, and L. Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. InProceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 ofProceedings of Machine Learning Research, pages 79115–79135, 2025

  21. [21]

    Mandlekar, F

    A. Mandlekar, F. Ramos, B. Boots, S. Savarese, L. Fei-Fei, A. Garg, and D. Fox. IRIS: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 4414–4420, 2020

  22. [22]

    Gupta, V

    A. Gupta, V . Kumar, C. Lynch, S. Levine, and K. Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. InProceedings of the Conference on Robot Learning (CoRL), volume 100 ofProceedings of Machine Learning Research, pages 1025–1037, 2020

  23. [23]

    Shridhar, L

    M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InProceedings of the Conference on Robot Learning (CoRL), volume 205 of Proceedings of Machine Learning Research, pages 785–799, 2022

  24. [24]

    Mandlekar, S

    A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. InProceedings of the Conference on Robot Learning (CoRL), volume 229 ofProceedings of Machine Learning Research, pages 1820–1864, 2023

  25. [25]

    Xie, M.-T

    Q. Xie, M.-T. Luong, E. Hovy, and Q. V . Le. Self-training with noisy student improves imagenet classification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  26. [26]

    Rebuffi, A

    S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  27. [27]

    what counts as in-distribution

    P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara. Dark experience for general continual learning: a strong, simple baseline. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. A Extended Related Work This section expands on the related work discussion in Section 2, providing additional detail on individual methods and cov...