Residual off- policy rl for finetuning behavior cloning policies

Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, Anusha Nagabandi · 2025 · arXiv 2509.19301

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

cs.RO · 2026-04-11 · unverdicted · novelty 6.0

MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

cs.LG · 2025-11-18 · unverdicted · novelty 6.0

RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

citing papers explorer

Showing 3 of 3 citing papers.

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies cs.LG · 2026-05-04 · unverdicted · none · ref 97
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks cs.RO · 2026-04-11 · unverdicted · none · ref 21
MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.
$\pi^{*}_{0.6}$: a VLA That Learns From Experience cs.LG · 2025-11-18 · unverdicted · none · ref 21
RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

Residual off- policy rl for finetuning behavior cloning policies

fields

years

verdicts

representative citing papers

citing papers explorer