Residual off-policy RL for finetuning behavior cloning policies

[Online] · 2025 · arXiv 2509.19301

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation

cs.RO · 2026-06-09 · unverdicted · novelty 6.0

SARM2 presents RM, a multi-task stage-aware reward model achieving 80% lower value-estimation MSE, which when used in SPIRAL boosts manipulation task success from ~50% to near-perfect on several benchmarks.

Flow-based Policy Adaptation without Policy Updates

cs.RO · 2026-06-04 · unverdicted · novelty 6.0

GLOVES learns flow models from limited expert demonstrations to selectively correct actions from non-expert policies or operators toward expert distributions using reverse-flow OOD detection as an intervention gate.

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

cs.RO · 2026-05-19 · unverdicted · novelty 6.0

ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning policies without expert data.

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

cs.RO · 2026-04-11 · unverdicted · novelty 6.0

MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

cs.LG · 2025-11-18 · unverdicted · novelty 6.0

RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

HiL-ResRL: A Model-Agnostic Finetuning Adapter via Human-in-the-loop Residual Reinforcement Learning

cs.RO · 2026-06-22 · unverdicted · novelty 5.0

HiL-ResRL trains a model-agnostic residual policy on VLA actions using human-guided online RL, achieving over 95% success rate after 1.5 hours of real-robot training.

DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning

cs.RO · 2026-06-17 · unverdicted · novelty 5.0

DF-ExpEnse improves sample efficiency in finetuning diffusion-based robotic policies by filtering diffusion-generated actions with critic ensembles and enabling fleet-level collaboration.

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

cs.RO · 2026-06-17 · unverdicted · novelty 5.0

Object-centric residual RL trained in simulation with pose noise and dropout raises real Franka robot VLA success from 42% to 76% zero-shot across five tasks, with improved data reusable for base model retraining.

Simulation-Driven Imitation Learning for Biosignals-Free Shared-Autonomy Prosthetic Grasping

cs.RO · 2026-06-05 · unverdicted · novelty 5.0

A simulation framework generates diverse reach-to-grasp demonstrations to train imitation learning policies for biosignals-free prosthetic grasping, achieving over 90% success in sim-to-real transfer.

HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies

cs.RO · 2026-03-12 · unverdicted · novelty 5.0 · 2 refs

HandelBot refines simulation policies via physical rollouts and residual RL to achieve precise bimanual piano playing, outperforming direct sim transfer by 1.8x with only 30 minutes of real data across five songs.

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

cs.LG · 2026-06-01 · unverdicted · novelty 4.0

Coherent IRL learns dense rewards from demos to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse robotic manipulation tasks.

EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

cs.RO · 2026-05-25 · unverdicted · novelty 4.0

EXPO-FT enables pretrained VLA policies to reach 30/30 success on complex manipulation tasks using an average of 19.1 minutes of online robot data while outperforming prior RL approaches.

citing papers explorer

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

cs.LG · 2026-05-04 · unverdicted · none · ref 97

OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning policies without expert data.

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

cs.LG · 2025-11-18 · unverdicted · none · ref 21

RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

cs.LG · 2026-06-01 · unverdicted · none · ref 15

Coherent IRL learns dense rewards from demos to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse robotic manipulation tasks.
