
Residual Policy Learning

Tom Silver, Kelsey Allen, Josh Tenenbaum, Leslie Kaelbling · 2018 · cs.RO · arXiv 1812.06298

32 Pith papers cite this work.

abstract

We present Residual Policy Learning (RPL): a simple method for improving nondifferentiable policies using model-free deep reinforcement learning. RPL thrives in complex robotic manipulation tasks where good but imperfect controllers are available. In these tasks, reinforcement learning from scratch remains data-inefficient or intractable, but learning a residual on top of the initial controller can yield substantial improvements. We study RPL in six challenging MuJoCo tasks involving partial observability, sensor noise, model misspecification, and controller miscalibration. For initial controllers, we consider both hand-designed policies and model-predictive controllers with known or learned transition models. By combining learning with control algorithms, RPL can perform long-horizon, sparse-reward tasks for which reinforcement learning alone fails. Moreover, we find that RPL consistently and substantially improves on the initial controllers. We argue that RPL is a promising approach for combining the complementary strengths of deep reinforcement learning and robotic control, pushing the boundaries of what either can achieve independently. Video and code at https://k-r-allen.github.io/residual-policy-learning/.
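The core idea in the abstract is that the executed action is the initial controller's action plus a learned correction, π(s) = π_0(s) + f_θ(s), and only f_θ is trained with model-free RL. The sketch below illustrates that composition under stated assumptions. The deadband controller, the linear residual, the function names, and the [-1, 1] clipping range are all hypothetical stand-ins. Zero-initializing the residual, so that learning starts from the controller's behavior, is an assumed design choice here and not a quote of the authors' code. The RL update that would train f_θ is omitted.

```python
from typing import Callable, List

State = List[float]
Action = List[float]


def initial_controller(state: State) -> Action:
    """Hypothetical hand-designed controller pi_0: proportional control with a
    deadband. The deadband makes it nondifferentiable, which is acceptable
    because the residual scheme never backpropagates through pi_0."""
    gain, deadband = 0.5, 0.05
    return [0.0 if abs(x) < deadband else -gain * x for x in state]


class LinearResidual:
    """Hypothetical residual f_theta(s) = W s + b. A linear stand-in for the
    network that model-free RL would train; training itself is omitted."""

    def __init__(self, state_dim: int, action_dim: int) -> None:
        # Zero-initialize so f_theta(s) == 0 for every s: before any learning,
        # the combined policy behaves exactly like the initial controller.
        self.W = [[0.0] * state_dim for _ in range(action_dim)]
        self.b = [0.0] * action_dim

    def __call__(self, state: State) -> Action:
        return [sum(w * x for w, x in zip(row, state)) + bias
                for row, bias in zip(self.W, self.b)]


def residual_policy(controller: Callable[[State], Action],
                    residual: Callable[[State], Action],
                    state: State, low: float = -1.0, high: float = 1.0) -> Action:
    """pi(s) = clip(pi_0(s) + f_theta(s), low, high).

    Only the residual has trainable parameters; the controller is a black box."""
    base = controller(state)
    correction = residual(state)
    return [min(high, max(low, a0 + r)) for a0, r in zip(base, correction)]


if __name__ == "__main__":
    residual = LinearResidual(state_dim=2, action_dim=2)
    s = [0.4, -1.0]
    print(residual_policy(initial_controller, residual, s))  # [-0.2, 0.5]
    residual.b = [0.1, 0.0]  # pretend training has learned a small correction
    print(residual_policy(initial_controller, residual, s))
```

Because updates only ever touch the residual's parameters, π_0 can be any black box, such as the deadband rule above or a model-predictive control loop. One way to view this is that the RL algorithm treats π_0 as part of the environment and learns only the correction on top of it.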


citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

DexCompose: Reusing Dexterous Policies for Multi-Task Manipulation with a Single Hand

cs.RO · 2026-06-26 · unverdicted · novelty 7.0

DexCompose achieves 77.4% average success on 16 composite dexterous tasks by using role-aware residual composition with explicit finger ownership to combine pretrained policies without destructive interference.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.

When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

cs.RO · 2026-02-25 · unverdicted · novelty 7.0

The UPS framework uses conformal prediction to calibrate VLM verifiers that choose among high-confidence action execution, natural-language task queries, and policy interventions, then applies residual learning from interventions to continually improve the base policy with minimal feedback.

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

A two-stage framework augments human-object interaction (HOI) data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network, enabling long-term dynamic human-object interactions with higher success rates and reduced training time.

AnyBody: Free-Form Whole-Body Humanoid Control from Arbitrary Keypoint Guidance

cs.RO · 2026-06-28 · unverdicted · novelty 6.0

AnyBody distills a privileged teacher tracker into a latent unit-sphere representation and uses a masked transformer to drive humanoid control from arbitrary keypoint subsets.

Enforcing Human-like Kinematics in Dexterous Piano Playing via Adversarial Posture Regularization

cs.RO · 2026-06-22 · unverdicted · novelty 6.0

Adversarial Posture Regularization matches RL policy posture distributions to casual human piano-playing data to enforce human-like kinematics in dexterous hands, outperforming baselines on cPSI, BSE, and FAC metrics.

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems

cs.RO · 2026-06-18 · unverdicted · novelty 6.0

Co-VLA replaces the monolithic action head in VLA models with a coordination-aware Structured Action Expert and Latent-Aware Controller, reporting 27% gains on tight bimanual tasks and doubled OOD performance.

FlexPath: Learned Semantic Path Priors for Image-Based Planning

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

FlexPath decouples learning of task-independent feasible path priors from task-specific adaptation via imitation learning and differentiable Path Shape Objectives for image-based planning.

Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain

cs.RO · 2026-06-06 · unverdicted · novelty 6.0

Perceptive BFM grounds human motion priors in robot terrain perception via terrain-conformal reference synthesis and teacher-student transfer from adapted to raw-reference tracking.

SPAR: Support-Preserving Action Rectification

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

SPAR anchors policy learning to a frozen BC policy for residual rectification and introduces latent self-imitation to eliminate manifold drift, achieving SOTA on D4RL.

CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation

cs.RO · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

CoRMA modifies RMA by replacing raw parameter adaptation with inference of a 6D semantic contact context via a causal Transformer trained with semantic regression and force-regime contrastive loss, yielding higher real-world success than FORGE baselines on PegInsert, GearMesh, and NutThread under ta

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

cs.RO · 2026-05-19 · unverdicted · novelty 6.0

ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

cs.RO · 2026-05-06 · unverdicted · novelty 6.0 · 3 refs

Q2RL extracts Q-values from a BC policy and applies Q-gating to enable efficient offline-to-online RL, outperforming baselines on D4RL/robomimic tasks and achieving up to 100% success on real-robot manipulation in 1-2 hours.

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

cs.RO · 2026-03-16 · conditional · novelty 6.0

ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.

Fisher Decorator: Refining Flow Policy via a Local Transport Map

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots across LIBERO, CALVIN, and physical tasks.

Behavior Uncloning: Distilling Mode Redirection into Policy Weights without Inference-Time Steering

cs.RO · 2026-06-28 · unverdicted · novelty 5.0

MoRE improves robot policy success rates by 44 percentage points by distilling mode redirection into weights, matching filtered retraining performance without inference overhead.

Bridging Handheld and Teleoperated Supervision for Contact-Rich Manipulation via State-Gated Experts

cs.RO · 2026-06-25 · unverdicted · novelty 5.0

BRIDGE routes between handheld and teleoperated diffusion policy experts via robot state to achieve up to 36.7% higher success rates than handheld-only baselines on three contact-rich tasks.

DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning

cs.RO · 2026-06-17 · unverdicted · novelty 5.0

DF-ExpEnse improves sample efficiency in finetuning diffusion-based robotic policies by filtering diffusion-generated actions with critic ensembles and enabling fleet-level collaboration.

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

cs.RO · 2026-06-17 · unverdicted · novelty 5.0

Object-centric residual RL trained in simulation with pose noise and dropout raises real Franka robot VLA success from 42% to 76% zero-shot across five tasks, with improved data reusable for base model retraining.

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

cs.LG · 2026-06-16 · unverdicted · novelty 5.0

TRIDENT is a MARL framework using Richardson-Romberg gradient correction, Lyapunov-constrained trust-region updates, and a physics-informed residual critic that claims O(1/sqrt(K)) convergence to constrained Nash equilibrium with O(sqrt(K)) violation bounds and large reductions in training violation

An Agency-Transferring Model-Free Policy Enhancement Technique

cs.LG · 2026-06-08 · unverdicted · novelty 5.0

A model-free RL method arbitrates between a functional baseline policy and a learning policy, transferring agency over time to yield a standalone policy with high goal-reaching rates and competitive returns on continuous-control tasks.

Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking

cs.RO · 2026-06-06 · unverdicted · novelty 5.0

A lightweight RL framework trains terrain-agnostic 3D foothold-tracking policies for humanoids that transfer directly to real-world use as standalone low-level controllers.

Source Side Mitigation of AI Datacenter Power Fluctuations with a Hybrid Energy Storage System and Residual Differentiable Predictive Control

eess.SY · 2026-06-03 · unverdicted · novelty 5.0 · 2 refs

A hybrid energy storage system with residual differentiable predictive control reduces AI datacenter-induced grid frequency deviations by over 80 percent in NPCC 140-bus simulations.
