SARM2 presents RM, a multi-task stage-aware reward model achieving 80% lower value-estimation MSE, which when used in SPIRAL boosts manipulation task success from ~50% to near-perfect on several benchmarks.
Residual off-policy rl for finetuning behavior cloning policies
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 10roles
background 2polarities
background 2representative citing papers
GLOVES learns flow models from limited expert demonstrations to selectively correct actions from non-expert policies or operators toward expert distributions using reverse-flow OOD detection as an intervention gate.
ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.
OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning policies without expert data.
MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.
RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.
A simulation framework generates diverse reach-to-grasp demonstrations to train imitation learning policies for biosignals-free prosthetic grasping, achieving over 90% success in sim-to-real transfer.
HandelBot refines simulation policies via physical rollouts and residual RL to achieve precise bimanual piano playing, outperforming direct sim transfer by 1.8x with only 30 minutes of real data across five songs.
Coherent IRL learns dense rewards from demos to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse robotic manipulation tasks.
EXPO-FT enables pretrained VLA policies to reach 30/30 success on complex manipulation tasks using an average of 19.1 minutes of online robot data while outperforming prior RL approaches.
citing papers explorer
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning policies without expert data.
-
$\pi^{*}_{0.6}$: a VLA That Learns From Experience
RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.
-
Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
Coherent IRL learns dense rewards from demos to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse robotic manipulation tasks.