
Residual Policy Learning

Tom Silver, Kelsey Allen, Josh Tenenbaum, Leslie Kaelbling · 2018 · cs.RO · arXiv 1812.06298

32 Pith papers cite this work. Polarity classification is still indexing.


abstract

We present Residual Policy Learning (RPL): a simple method for improving nondifferentiable policies using model-free deep reinforcement learning. RPL thrives in complex robotic manipulation tasks where good but imperfect controllers are available. In these tasks, reinforcement learning from scratch remains data-inefficient or intractable, but learning a residual on top of the initial controller can yield substantial improvements. We study RPL in six challenging MuJoCo tasks involving partial observability, sensor noise, model misspecification, and controller miscalibration. For initial controllers, we consider both hand-designed policies and model-predictive controllers with known or learned transition models. By combining learning with control algorithms, RPL can perform long-horizon, sparse-reward tasks for which reinforcement learning alone fails. Moreover, we find that RPL consistently and substantially improves on the initial controllers. We argue that RPL is a promising approach for combining the complementary strengths of deep reinforcement learning and robotic control, pushing the boundaries of what either can achieve independently. Video and code at https://k-r-allen.github.io/residual-policy-learning/.
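The core idea in the abstract, learning a residual on top of a fixed initial controller, can be sketched as an action of the form a = pi_init(s) + f_theta(s). The snippet below is only an illustrative sketch in NumPy, not the authors' code: the proportional controller, the `ResidualPolicy` class, the state layout, and the network sizes are all hypothetical, and zero-initialising the output layer is an assumption of this sketch so that the combined policy starts out identical to the initial controller. Training `f_theta` with a model-free RL algorithm is left out.

```python
import numpy as np


def initial_controller(state):
    """Hypothetical hand-designed controller: a proportional step toward the goal.

    It is treated as a black box; it does not need to be differentiable.
    """
    return 0.5 * (state["goal"] - state["pos"])


class ResidualPolicy:
    """Combined policy: action = initial_controller(state) + f_theta(obs)."""

    def __init__(self, base, obs_dim, act_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.base = base
        self.W1 = rng.normal(scale=0.1, size=(obs_dim, hidden))
        self.b1 = np.zeros(hidden)
        # Assumption of this sketch: zero output layer, so the residual is
        # exactly zero before any learning has happened.
        self.W2 = np.zeros((hidden, act_dim))
        self.b2 = np.zeros(act_dim)

    def residual(self, obs):
        """Small MLP f_theta; its parameters are what RL would update."""
        h = np.tanh(obs @ self.W1 + self.b1)
        return h @ self.W2 + self.b2

    def act(self, state):
        obs = np.concatenate([state["pos"], state["goal"]])
        return self.base(state) + self.residual(obs)


# Usage: before training, the combined policy reproduces the initial controller.
state = {"pos": np.array([0.0, 0.0]), "goal": np.array([1.0, -1.0])}
policy = ResidualPolicy(initial_controller, obs_dim=4, act_dim=2)
action = policy.act(state)
```

Because only the residual network carries learnable parameters, the initial controller can be any callable, including a hand-designed policy or a model-predictive controller as described in the abstract.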


citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

DexCompose: Reusing Dexterous Policies for Multi-Task Manipulation with a Single Hand

cs.RO · 2026-06-26 · unverdicted · novelty 7.0

DexCompose achieves 77.4% average success on 16 composite dexterous tasks by using role-aware residual composition with explicit finger ownership to combine pretrained policies without destructive interference.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.

When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

cs.RO · 2026-02-25 · unverdicted · novelty 7.0

UPS framework uses conformal prediction to calibrate VLM verifiers for choosing between high-confidence action execution, natural language task queries, or policy interventions, then applies residual learning from interventions to continually improve the base policy with minimal feedback.

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher success rates and reduced training time.

AnyBody: Free-Form Whole-Body Humanoid Control from Arbitrary Keypoint Guidance

cs.RO · 2026-06-28 · unverdicted · novelty 6.0

AnyBody distills a privileged teacher tracker into a latent unit-sphere representation and uses a masked transformer to drive humanoid control from arbitrary keypoint subsets.

Enforcing Human-like Kinematics in Dexterous Piano Playing via Adversarial Posture Regularization

cs.RO · 2026-06-22 · unverdicted · novelty 6.0

Adversarial Posture Regularization matches RL policy posture distributions to casual human piano-playing data to enforce human-like kinematics in dexterous hands, outperforming baselines on cPSI, BSE, and FAC metrics.

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems

cs.RO · 2026-06-18 · unverdicted · novelty 6.0

Co-VLA replaces the monolithic action head in VLA models with a coordination-aware Structured Action Expert and Latent-Aware Controller, reporting 27% gains on tight bimanual tasks and doubled OOD performance.

FlexPath: Learned Semantic Path Priors for Image-Based Planning

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

FlexPath decouples learning of task-independent feasible path priors from task-specific adaptation via imitation learning and differentiable Path Shape Objectives for image-based planning.

Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain

cs.RO · 2026-06-06 · unverdicted · novelty 6.0

Perceptive BFM grounds human motion priors in robot terrain perception via terrain-conformal reference synthesis and teacher-student transfer from adapted to raw-reference tracking.

SPAR: Support-Preserving Action Rectification

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

SPAR anchors policy learning to a frozen BC policy for residual rectification and introduces latent self-imitation to eliminate manifold drift, achieving SOTA on D4RL.

CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation

cs.RO · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

CoRMA modifies RMA by replacing raw parameter adaptation with inference of a 6D semantic contact context via a causal Transformer trained with semantic regression and force-regime contrastive loss, yielding higher real-world success than FORGE baselines on PegInsert, GearMesh, and NutThread under ta

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

cs.RO · 2026-05-19 · unverdicted · novelty 6.0

ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

cs.RO · 2026-05-06 · unverdicted · novelty 6.0 · 3 refs

Q2RL extracts Q-values from a BC policy and applies Q-gating to enable efficient offline-to-online RL, outperforming baselines on D4RL/robomimic tasks and achieving up to 100% success on real-robot manipulation in 1-2 hours.

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

cs.RO · 2026-03-16 · conditional · novelty 6.0

ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.

Fisher Decorator: Refining Flow Policy via a Local Transport Map

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots across LIBERO, CALVIN, and physical tasks.

Behavior Uncloning: Distilling Mode Redirection into Policy Weights without Inference-Time Steering

cs.RO · 2026-06-28 · unverdicted · novelty 5.0

MoRE improves robot policy success rates by 44 percentage points by distilling mode redirection into weights, matching filtered retraining performance without inference overhead.

Bridging Handheld and Teleoperated Supervision for Contact-Rich Manipulation via State-Gated Experts

cs.RO · 2026-06-25 · unverdicted · novelty 5.0

BRIDGE routes between handheld and teleoperated diffusion policy experts via robot state to achieve up to 36.7% higher success rates than handheld-only baselines on three contact-rich tasks.

DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning

cs.RO · 2026-06-17 · unverdicted · novelty 5.0

DF-ExpEnse improves sample efficiency in finetuning diffusion-based robotic policies by filtering diffusion-generated actions with critic ensembles and enabling fleet-level collaboration.

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

cs.RO · 2026-06-17 · unverdicted · novelty 5.0

Object-centric residual RL trained in simulation with pose noise and dropout raises real Franka robot VLA success from 42% to 76% zero-shot across five tasks, with improved data reusable for base model retraining.

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

cs.LG · 2026-06-16 · unverdicted · novelty 5.0

TRIDENT is a MARL framework using Richardson-Romberg gradient correction, Lyapunov-constrained trust-region updates, and a physics-informed residual critic that claims O(1/sqrt(K)) convergence to constrained Nash equilibrium with O(sqrt(K)) violation bounds and large reductions in training violation

An Agency-Transferring Model-Free Policy Enhancement Technique

cs.LG · 2026-06-08 · unverdicted · novelty 5.0

A model-free RL method arbitrates between a functional baseline policy and a learning policy, transferring agency over time to yield a standalone policy with high goal-reaching rates and competitive returns on continuous-control tasks.

Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking

cs.RO · 2026-06-06 · unverdicted · novelty 5.0

A lightweight RL framework trains terrain-agnostic 3D foothold-tracking policies for humanoids that transfer directly to real-world use as standalone low-level controllers.

Source Side Mitigation of AI Datacenter Power Fluctuations with a Hybrid Energy Storage System and Residual Differentiable Predictive Control

eess.SY · 2026-06-03 · unverdicted · novelty 5.0 · 2 refs

A hybrid energy storage system with residual differentiable predictive control reduces AI datacenter-induced grid frequency deviations by over 80 percent in NPCC 140-bus simulations.

citing papers explorer

Showing 32 of 32 citing papers after filters.

DexCompose: Reusing Dexterous Policies for Multi-Task Manipulation with a Single Hand cs.RO · 2026-06-26 · unverdicted · none · ref 31 · internal anchor
DexCompose achieves 77.4% average success on 16 composite dexterous tasks by using role-aware residual composition with explicit finger ownership to combine pretrained policies without destructive interference.
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies cs.RO · 2026-06-08 · unverdicted · none · ref 27 · internal anchor
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering cs.RO · 2026-02-25 · unverdicted · none · ref 24 · internal anchor
UPS framework uses conformal prediction to calibrate VLM verifiers for choosing between high-confidence action execution, natural language task queries, or policy interventions, then applies residual learning from interventions to continually improve the base policy with minimal feedback.
Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers cs.CV · 2026-05-12 · unverdicted · none · ref 47
A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher success rates and reduced training time.
AnyBody: Free-Form Whole-Body Humanoid Control from Arbitrary Keypoint Guidance cs.RO · 2026-06-28 · unverdicted · none · ref 40 · internal anchor
AnyBody distills a privileged teacher tracker into a latent unit-sphere representation and uses a masked transformer to drive humanoid control from arbitrary keypoint subsets.
Enforcing Human-like Kinematics in Dexterous Piano Playing via Adversarial Posture Regularization cs.RO · 2026-06-22 · unverdicted · none · ref 13 · internal anchor
Adversarial Posture Regularization matches RL policy posture distributions to casual human piano-playing data to enforce human-like kinematics in dexterous hands, outperforming baselines on cPSI, BSE, and FAC metrics.
Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems cs.RO · 2026-06-18 · unverdicted · none · ref 24 · internal anchor
Co-VLA replaces the monolithic action head in VLA models with a coordination-aware Structured Action Expert and Latent-Aware Controller, reporting 27% gains on tight bimanual tasks and doubled OOD performance.
FlexPath: Learned Semantic Path Priors for Image-Based Planning cs.CV · 2026-06-08 · unverdicted · none · ref 37 · internal anchor
FlexPath decouples learning of task-independent feasible path priors from task-specific adaptation via imitation learning and differentiable Path Shape Objectives for image-based planning.
Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain cs.RO · 2026-06-06 · unverdicted · none · ref 46 · internal anchor
Perceptive BFM grounds human motion priors in robot terrain perception via terrain-conformal reference synthesis and teacher-student transfer from adapted to raw-reference tracking.
SPAR: Support-Preserving Action Rectification cs.LG · 2026-05-27 · unverdicted · none · ref 14 · internal anchor
SPAR anchors policy learning to a frozen BC policy for residual rectification and introduces latent self-imitation to eliminate manifold drift, achieving SOTA on D4RL.
CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation cs.RO · 2026-05-21 · unverdicted · none · ref 13 · 2 links · internal anchor
CoRMA modifies RMA by replacing raw parameter adaptation with inference of a 6D semantic contact context via a causal Transformer trained with semantic regression and force-regime contrastive loss, yielding higher real-world success than FORGE baselines on PegInsert, GearMesh, and NutThread under ta
Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning cs.RO · 2026-05-19 · unverdicted · none · ref 49 · internal anchor
ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning cs.RO · 2026-05-06 · unverdicted · none · ref 33 · 3 links · internal anchor
Q2RL extracts Q-values from a BC policy and applies Q-gating to enable efficient offline-to-online RL, outperforming baselines on D4RL/robomimic tasks and achieving up to 100% success on real-robot manipulation in 1-2 hours.
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors cs.RO · 2026-03-16 · conditional · none · ref 29 · internal anchor
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
Fisher Decorator: Refining Flow Policy via a Local Transport Map cs.LG · 2026-04-20 · unverdicted · none · ref 68
Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models cs.RO · 2026-04-20 · unverdicted · none · ref 40
AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots across LIBERO, CALVIN, and physical tasks.
Behavior Uncloning: Distilling Mode Redirection into Policy Weights without Inference-Time Steering cs.RO · 2026-06-28 · unverdicted · none · ref 26 · internal anchor
MoRE improves robot policy success rates by 44 percentage points by distilling mode redirection into weights, matching filtered retraining performance without inference overhead.
Bridging Handheld and Teleoperated Supervision for Contact-Rich Manipulation via State-Gated Experts cs.RO · 2026-06-25 · unverdicted · none · ref 22 · internal anchor
BRIDGE routes between handheld and teleoperated diffusion policy experts via robot state to achieve up to 36.7% higher success rates than handheld-only baselines on three contact-rich tasks.
DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning cs.RO · 2026-06-17 · unverdicted · none · ref 14 · internal anchor
DF-ExpEnse improves sample efficiency in finetuning diffusion-based robotic policies by filtering diffusion-generated actions with critic ensembles and enabling fleet-level collaboration.
Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement cs.RO · 2026-06-17 · unverdicted · none · ref 14 · internal anchor
Object-centric residual RL trained in simulation with pose noise and dropout raises real Franka robot VLA success from 42% to 76% zero-shot across five tasks, with improved data reusable for base model retraining.
TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning cs.LG · 2026-06-16 · unverdicted · none · ref 83 · internal anchor
TRIDENT is a MARL framework using Richardson-Romberg gradient correction, Lyapunov-constrained trust-region updates, and a physics-informed residual critic that claims O(1/sqrt(K)) convergence to constrained Nash equilibrium with O(sqrt(K)) violation bounds and large reductions in training violation
An Agency-Transferring Model-Free Policy Enhancement Technique cs.LG · 2026-06-08 · unverdicted · none · ref 14 · internal anchor
A model-free RL method arbitrates between a functional baseline policy and a learning policy, transferring agency over time to yield a standalone policy with high goal-reaching rates and competitive returns on continuous-control tasks.
Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking cs.RO · 2026-06-06 · unverdicted · none · ref 38 · internal anchor
A lightweight RL framework trains terrain-agnostic 3D foothold-tracking policies for humanoids that transfer directly to real-world use as standalone low-level controllers.
Source Side Mitigation of AI Datacenter Power Fluctuations with a Hybrid Energy Storage System and Residual Differentiable Predictive Control eess.SY · 2026-06-03 · unverdicted · none · ref 27 · 2 links · internal anchor
A hybrid energy storage system with residual differentiable predictive control reduces AI datacenter-induced grid frequency deviations by over 80 percent in NPCC 140-bus simulations.
CoPark: Learning Reactive Parking via Self-Play cs.RO · 2026-06-02 · unverdicted · none · ref 21 · internal anchor
CoPark uses multi-agent self-play RL with a residual policy and threat-modulated asymmetric prior release to achieve 70-85% success and 3-6% collision rates in reactive parking benchmarks.
Extreme Motion Generation via Hybrid Null-Space Control for Straight-Line Path Following cs.RO · 2026-06-02 · unverdicted · none · ref 16 · internal anchor
A hybrid RL and model-based null-space controller with conditional diffusion sampling for initial poses extends average straight-line path length by 27% over baseline on 10,000 tasks with a 7-DoF Franka arm.
Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions cs.LG · 2026-06-02 · unverdicted · none · ref 70 · internal anchor
GTR introduces a bounded non-monotonic Gaussian trust region and Mixture Gaussian Anchor to enable effective behavior transitions in non-stationary RL where standard PPO fails.
Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift cs.LG · 2026-04-30 · unverdicted · none · ref 9
JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.
Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards cs.LG · 2026-06-01 · unverdicted · none · ref 14 · internal anchor
Coherent IRL learns dense rewards from demos to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse robotic manipulation tasks.
Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation cs.RO · 2026-06-04 · unverdicted · none · ref 74 · internal anchor
A structured literature survey of safety mechanisms in long-horizon robotic manipulation organized by intervention timing and strength of supporting evidence.
You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector cs.RO · 2026-03-16 · unreviewed · ref 34 · internal anchor
Incremental Residual Reinforcement Learning Toward Real-World Learning for Social Navigation cs.RO · 2026-04-09 · unreviewed · ref 12
