OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
Pith reviewed 2026-05-08 18:41 UTC · model grok-4.3
The pith
OGPO fine-tunes generative control policies to near full task success from poor initializations with no expert data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OGPO maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using the critics as the terminal reward. It achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. It is the only method that can fine-tune poorly initialized behavior-cloning policies to near-full task success with no expert data in the online replay buffer, and does so with little task-specific hyperparameter tuning. The work also identifies practical stabilizers, including success-buffer regularization, conservative advantages, χ² regularization, and Q-variance reduction.
What carries the argument
The modified PPO objective that uses off-policy critics as terminal rewards to back-propagate gradients through the complete generative sampling steps of the policy while enabling data reuse.
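The claim that the terminal-reward gradient reaches the parameters of every sampling step can be made concrete with a toy example. The sketch below is hypothetical, not the paper's implementation: a scalar "policy" whose K generative steps are a_{k-1} = (1 + θ)·a_k, with a quadratic critic as terminal reward; the hand-derived chain-rule gradient is checked against a finite difference.

```python
# Hypothetical toy, not the paper's implementation: differentiate a critic
# "terminal reward" through an unrolled K-step generative chain.
# One generative step: a_{k-1} = (1 + theta) * a_k, so a_0 = (1+theta)^K * a_K.

K = 4            # number of generative (denoising-style) steps
theta = 0.1      # the single policy parameter in this sketch
a_K = 2.0        # initial "noise" sample the chain starts from
target = 1.0     # the critic peaks when the final action hits this value

def rollout(theta):
    a = a_K
    for _ in range(K):
        a = (1.0 + theta) * a          # one differentiable generative step
    return a

def critic(a0):
    return -(a0 - target) ** 2          # terminal reward Q(a_0)

# Chain rule through the full chain:
# dQ/dtheta = Q'(a_0) * da_0/dtheta = -2(a_0 - target) * K(1+theta)^(K-1) * a_K
a0 = rollout(theta)
grad = -2.0 * (a0 - target) * K * (1.0 + theta) ** (K - 1) * a_K

# Sanity check against a central finite difference
eps = 1e-6
fd = (critic(rollout(theta + eps)) - critic(rollout(theta - eps))) / (2 * eps)
assert abs(grad - fd) < 1e-4
```

OGPO's actual objective wraps this gradient path in a PPO-style clipped update with an off-policy critic; the sketch only shows the chain-rule mechanics that make full-process backpropagation possible.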
If this is right
- The method outperforms alternatives on policy steering and residual correction tasks.
- It delivers strong results in multi-task, high-precision insertion, and dexterous control without expert data.
- Few task-specific hyperparameter changes are needed across different observation types.
- Stabilizers such as success-buffer regularization and Q-variance reduction prevent critic over-exploitation.
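Two of the listed stabilizers lend themselves to a compact sketch. The snippet below is a hedged reading, not the paper's formulation: "conservative advantages" are rendered as a pessimistic minimum over a critic ensemble, and "Q-variance reduction" as averaging the critic over several action samples per state.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Conservative advantages (assumed form): take the element-wise minimum
#     over a small critic ensemble so a single over-optimistic critic cannot
#     be exploited, then subtract a batch baseline.
q_ensemble = rng.normal(size=(2, 8))       # two critics scoring 8 actions
q_pess = q_ensemble.min(axis=0)            # pessimistic value per action
adv = q_pess - q_pess.mean()               # advantages vs. batch baseline

# (2) Q-variance reduction (assumed form): average the critic over several
#     sampled actions per state to shrink terminal-reward noise.
q_samples = rng.normal(loc=1.0, scale=2.0, size=(8, 16))  # 16 samples/state
q_low_var = q_samples.mean(axis=1)         # variance drops by roughly 1/16

assert abs(adv.mean()) < 1e-12
assert q_samples.var() > q_low_var.var()
```

Both operations are cheap and differentiable, which is presumably why they can sit inside the policy update without changing the overall objective.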
Where Pith is reading between the lines
- The same off-policy critic structure could be tested on fine-tuning other generative models outside robotics.
- Starting from cheap behavior cloning initializations may cut the cost of deploying learned policies in real environments.
- The stabilizers might generalize to larger-scale policies or additional generative architectures.
- Similar gradient propagation through sampling processes could apply to non-manipulation control problems like navigation.
Load-bearing premise
Off-policy critics stay stable and provide useful signals when gradients are sent through the entire generative policy process via the modified PPO objective, without heavy per-task tuning or collapse in state- and pixel-based settings.
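One stabilizer that guards this premise, χ² regularization, can be sketched as a penalty on how far the fine-tuned policy drifts from its behavior-cloning reference. This is one plausible reading, not the paper's exact regularizer:

```python
import numpy as np

# Hedged sketch: chi-squared divergence between the fine-tuned policy pi and
# its behavior-cloning reference pi_ref, estimated from importance ratios on
# actions drawn from pi_ref:
#   chi2(pi || pi_ref) = E_{a ~ pi_ref}[ (pi(a)/pi_ref(a) - 1)^2 ]

def chi2_penalty(ratios):
    """ratios: importance weights pi(a)/pi_ref(a) for actions from pi_ref."""
    ratios = np.asarray(ratios, dtype=float)
    return float(np.mean((ratios - 1.0) ** 2))

same = chi2_penalty(np.ones(4))                         # identical policies
drift = chi2_penalty(np.array([0.2, 0.5, 2.0, 3.0]))    # drifted policy
assert same == 0.0 and drift > 0.0
```

A penalty of this shape grows quadratically in the importance ratio, so it discourages exactly the large-ratio actions where an over-exploited critic would otherwise dominate the update.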
What would settle it
Apply OGPO to a dexterous manipulation task starting from a poorly initialized behavior cloning policy and an empty expert data buffer; if the policy does not reach near full task success or exhibits instability, the central claim fails.
Figures
Original abstract
Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagate policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can fine-tune poorly-initialized behavior cloning policies to near full task-success with no expert data in the online replay buffer, and does so with few task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms alternative methods on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilizers, including success-buffer regularization, conservative advantages, $\chi^2$ regularization, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Off-policy Generative Policy Optimization (OGPO), an algorithm for sample-efficient full-finetuning of generative control policies (GCPs) such as diffusion- and flow-based policies. OGPO maintains off-policy critic networks for data reuse and employs a modified PPO objective in which the critics act as terminal rewards, allowing policy gradients to propagate through the entire generative sampling process. It reports state-of-the-art results on robot manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. The work claims to be the only method able to fine-tune poorly-initialized behavior-cloning policies to near-full task success with no expert data in the online replay buffer and with minimal task-specific hyperparameter tuning. Four practical stabilizers (success-buffer regularization, conservative advantages, χ² regularization, and Q-variance reduction) are introduced to address critic over-exploitation, and a systematic empirical study of GCP finetuning is presented.
Significance. If the empirical results and stability claims hold under rigorous verification, OGPO would constitute a meaningful advance in sample-efficient robot learning by providing a practical route to optimize expressive generative policies that are otherwise difficult to train. The identification of concrete stabilizing mechanisms and failure modes for off-policy full-policy improvement could inform subsequent work on generative policies in both state- and pixel-based regimes.
major comments (2)
- [Abstract (stabilizers and empirical claims)] The load-bearing claim that off-policy critics remain stable and non-exploitative when used as terminal rewards inside the modified PPO objective (with gradients back-propagated through the full generative chain) is supported only by the introduction of four stabilizers. It is not shown whether these stabilizers generalize without non-trivial per-task tuning precisely in the regime of very poor initial BC policies and zero expert data in the replay buffer—the setting highlighted as novel.
- [Abstract and empirical investigations section] The assertion of state-of-the-art performance and unique capability to reach near-full task success from poorly-initialized BC policies rests on extensive empirical investigations whose full details (baselines, statistical reporting, ablation studies on the stabilizers, and exact experimental protocols) are not verifiable from the provided material. This directly limits assessment of whether the reported gains are robust.
minor comments (2)
- Notation for the modified PPO objective and the precise form of the terminal reward derived from the critic could be clarified with an explicit equation reference to aid reproducibility.
- The abstract states that the method works 'with few task-specific hyperparameter tuning'; a concise table summarizing the hyperparameter ranges actually used across tasks would strengthen this claim.
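In the spirit of the first minor comment: one plausible shape such an equation could take, reconstructed from the abstract's description rather than taken from the paper, is a clipped PPO surrogate whose advantage is derived from the off-policy critic acting as the terminal reward:

```latex
% Hedged reconstruction, not the paper's actual equation.
% r_theta is the likelihood ratio of the generative action chain a_K -> a_0
% under the current vs. behavior policy; Q_phi is the off-policy critic;
% b(s) is a baseline.
\mathcal{L}(\theta) =
  \mathbb{E}_{s,\; a_{K:0}\sim\pi_{\theta_{\mathrm{old}}}}\!\Big[
    \min\big( r_\theta\,\hat{A}(s,a_0),\;
              \operatorname{clip}(r_\theta,\,1-\epsilon,\,1+\epsilon)\,\hat{A}(s,a_0) \big)
  \Big],
\qquad
\hat{A}(s,a_0) = Q_\phi(s,a_0) - b(s)
```

Pinning the paper's actual objective down to a numbered equation of this form would directly address the reproducibility concern raised above.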
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review of our manuscript on OGPO. We value the emphasis on rigorous verification of stability claims and empirical robustness, and we address each major comment below with specific plans for revision where appropriate.
Point-by-point responses
Referee: [Abstract (stabilizers and empirical claims)] The load-bearing claim that off-policy critics remain stable and non-exploitative when used as terminal rewards inside the modified PPO objective (with gradients back-propagated through the full generative chain) is supported only by the introduction of four stabilizers. It is not shown whether these stabilizers generalize without non-trivial per-task tuning precisely in the regime of very poor initial BC policies and zero expert data in the replay buffer—the setting highlighted as novel.
Authors: We thank the referee for this precise observation on the load-bearing nature of the stability claim. The manuscript evaluates OGPO specifically in the regime of poorly initialized BC policies with no expert data in the replay buffer, demonstrating stable critic behavior and high task success across multi-task, insertion, and dexterous manipulation settings. The four stabilizers are introduced with motivations tied to observed failure modes (critic over-exploitation, advantage bias, distribution shift, and variance), and ablations show their individual contributions. While the paper reports results with limited task-specific tuning, we acknowledge that a more explicit demonstration of fixed-hyperparameter generalization across this exact regime would strengthen the claim. In the revision we will add a dedicated subsection analyzing hyperparameter sensitivity and include new experiments using a single shared hyperparameter configuration for the poor-initialization setting. revision: partial
Referee: [Abstract and empirical investigations section] The assertion of state-of-the-art performance and unique capability to reach near-full task success from poorly-initialized BC policies rests on extensive empirical investigations whose full details (baselines, statistical reporting, ablation studies on the stabilizers, and exact experimental protocols) are not verifiable from the provided material. This directly limits assessment of whether the reported gains are robust.
Authors: We apologize that the experimental details were not sufficiently clear or accessible in the material reviewed. The full manuscript includes: (i) explicit baseline descriptions and implementation details for all compared methods, (ii) statistical reporting with means, standard deviations, and multiple random seeds for all main results, (iii) comprehensive ablations isolating each stabilizer, and (iv) exact protocols (network architectures, replay buffer sizes, sampling steps, and environment parameters) in the appendix. To improve verifiability we will revise the main text to include a concise summary table of key experimental settings and ensure every performance claim is cross-referenced to the corresponding figure, table, or appendix section. These changes will make the robustness of the reported gains easier to assess without altering the underlying results. revision: yes
Circularity Check
No significant circularity: empirical algorithmic proposal with independent experimental validation
full rationale
The paper presents OGPO as a practical algorithm extending PPO for generative policies, using off-policy critics as terminal rewards and introducing four stabilizers (success-buffer regularization, conservative advantages, χ² regularization, Q-variance reduction). All central claims concern empirical performance on manipulation tasks, including fine-tuning poorly-initialized BC policies without expert data. No derivation chain, uniqueness theorem, or first-principles prediction is offered that reduces by the paper's own equations to fitted parameters, self-citations, or ansatzes; the work is self-contained as an empirical contribution whose results are externally falsifiable via replication on the reported tasks and baselines.
Axiom & Free-Parameter Ledger
free parameters (1)
- PPO and stabilizer hyperparameters
axioms (1)
- Domain assumption: off-policy critics can provide reliable terminal rewards for policy gradient updates through generative processes