hub
Residual Policy Learning
32 Pith papers cite this work.
abstract
We present Residual Policy Learning (RPL): a simple method for improving nondifferentiable policies using model-free deep reinforcement learning. RPL thrives in complex robotic manipulation tasks where good but imperfect controllers are available. In these tasks, reinforcement learning from scratch remains data-inefficient or intractable, but learning a residual on top of the initial controller can yield substantial improvements. We study RPL in six challenging MuJoCo tasks involving partial observability, sensor noise, model misspecification, and controller miscalibration. For initial controllers, we consider both hand-designed policies and model-predictive controllers with known or learned transition models. By combining learning with control algorithms, RPL can perform long-horizon, sparse-reward tasks for which reinforcement learning alone fails. Moreover, we find that RPL consistently and substantially improves on the initial controllers. We argue that RPL is a promising approach for combining the complementary strengths of deep reinforcement learning and robotic control, pushing the boundaries of what either can achieve independently. Video and code at https://k-r-allen.github.io/residual-policy-learning/.
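The core idea in the abstract, a learned residual added on top of an existing controller, can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen for illustration only: `base_controller` stands in for a hand-designed policy, `OffsetResidual` stands in for a learned network, and the action bounds are assumed. Starting the residual at zero is an assumption of this sketch, so that the combined policy initially reproduces the base controller.

```python
from typing import Callable, List

Vector = List[float]
Policy = Callable[[Vector], Vector]


def base_controller(obs: Vector) -> Vector:
    """Hypothetical hand-designed controller: proportional push toward the origin."""
    return [-0.5 * x for x in obs]


class OffsetResidual:
    """Hypothetical stand-in for a learned residual network.

    It outputs a per-dimension offset that starts at zero (an assumption of
    this sketch), so the combined policy initially equals the base controller.
    """

    def __init__(self, act_dim: int) -> None:
        self.offset = [0.0] * act_dim

    def __call__(self, obs: Vector) -> Vector:
        return list(self.offset)


def make_residual_policy(base: Policy, residual: Policy,
                         low: float = -1.0, high: float = 1.0) -> Policy:
    """Combine the two: action = clip(base(obs) + residual(obs), low, high)."""
    def policy(obs: Vector) -> Vector:
        return [min(high, max(low, b + r))
                for b, r in zip(base(obs), residual(obs))]
    return policy


obs = [0.4, -0.2]
residual = OffsetResidual(act_dim=2)
policy = make_residual_policy(base_controller, residual)

# Before any learning, the residual is zero and the base controller acts alone.
initial_action = policy(obs)

# Stand-in for an RL update: in practice the residual's parameters would be
# trained with a model-free algorithm while the base controller stays fixed.
residual.offset = [0.1, 0.0]
corrected_action = policy(obs)
```

Only the residual is trained in this arrangement, so the base controller can be any black box, including a nondifferentiable one.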
citing papers explorer
- DexCompose: Reusing Dexterous Policies for Multi-Task Manipulation with a Single Hand
  DexCompose achieves 77.4% average success on 16 composite dexterous tasks by using role-aware residual composition with explicit finger ownership to combine pretrained policies without destructive interference.
- ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies
  ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
- When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering
  UPS framework uses conformal prediction to calibrate VLM verifiers for choosing between high-confidence action execution, natural language task queries, or policy interventions, then applies residual learning from interventions to continually improve the base policy with minimal feedback.
- Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers
  A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher success rates and reduced training time.
- AnyBody: Free-Form Whole-Body Humanoid Control from Arbitrary Keypoint Guidance
  AnyBody distills a privileged teacher tracker into a latent unit-sphere representation and uses a masked transformer to drive humanoid control from arbitrary keypoint subsets.
- Enforcing Human-like Kinematics in Dexterous Piano Playing via Adversarial Posture Regularization
  Adversarial Posture Regularization matches RL policy posture distributions to casual human piano-playing data to enforce human-like kinematics in dexterous hands, outperforming baselines on cPSI, BSE, and FAC metrics.
- Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems
  Co-VLA replaces the monolithic action head in VLA models with a coordination-aware Structured Action Expert and Latent-Aware Controller, reporting 27% gains on tight bimanual tasks and doubled OOD performance.
- FlexPath: Learned Semantic Path Priors for Image-Based Planning
  FlexPath decouples learning of task-independent feasible path priors from task-specific adaptation via imitation learning and differentiable Path Shape Objectives for image-based planning.
- Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain
  Perceptive BFM grounds human motion priors in robot terrain perception via terrain-conformal reference synthesis and teacher-student transfer from adapted to raw-reference tracking.
- SPAR: Support-Preserving Action Rectification
  SPAR anchors policy learning to a frozen BC policy for residual rectification and introduces latent self-imitation to eliminate manifold drift, achieving SOTA on D4RL.
- CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation
  CoRMA modifies RMA by replacing raw parameter adaptation with inference of a 6D semantic contact context via a causal Transformer trained with semantic regression and force-regime contrastive loss, yielding higher real-world success than FORGE baselines on PegInsert, GearMesh, and NutThread under ta
- Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning
  ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.
- When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
  Q2RL extracts Q-values from a BC policy and applies Q-gating to enable efficient offline-to-online RL, outperforming baselines on D4RL/robomimic tasks and achieving up to 100% success on real-robot manipulation in 1-2 hours.
- ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
  ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
- Fisher Decorator: Refining Flow Policy via a Local Transport Map
  Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
- AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models
  AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots across LIBERO, CALVIN, and physical tasks.
- Behavior Uncloning: Distilling Mode Redirection into Policy Weights without Inference-Time Steering
  MoRE improves robot policy success rates by 44 percentage points by distilling mode redirection into weights, matching filtered retraining performance without inference overhead.
- Bridging Handheld and Teleoperated Supervision for Contact-Rich Manipulation via State-Gated Experts
  BRIDGE routes between handheld and teleoperated diffusion policy experts via robot state to achieve up to 36.7% higher success rates than handheld-only baselines on three contact-rich tasks.
- DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning
  DF-ExpEnse improves sample efficiency in finetuning diffusion-based robotic policies by filtering diffusion-generated actions with critic ensembles and enabling fleet-level collaboration.
- Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement
  Object-centric residual RL trained in simulation with pose noise and dropout raises real Franka robot VLA success from 42% to 76% zero-shot across five tasks, with improved data reusable for base model retraining.
- TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning
  TRIDENT is a MARL framework using Richardson-Romberg gradient correction, Lyapunov-constrained trust-region updates, and a physics-informed residual critic that claims O(1/sqrt(K)) convergence to constrained Nash equilibrium with O(sqrt(K)) violation bounds and large reductions in training violation
- An Agency-Transferring Model-Free Policy Enhancement Technique
  A model-free RL method arbitrates between a functional baseline policy and a learning policy, transferring agency over time to yield a standalone policy with high goal-reaching rates and competitive returns on continuous-control tasks.
- Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking
  A lightweight RL framework trains terrain-agnostic 3D foothold-tracking policies for humanoids that transfer directly to real-world use as standalone low-level controllers.
- Source Side Mitigation of AI Datacenter Power Fluctuations with a Hybrid Energy Storage System and Residual Differentiable Predictive Control
  A hybrid energy storage system with residual differentiable predictive control reduces AI datacenter-induced grid frequency deviations by over 80 percent in NPCC 140-bus simulations.
- CoPark: Learning Reactive Parking via Self-Play
  CoPark uses multi-agent self-play RL with a residual policy and threat-modulated asymmetric prior release to achieve 70-85% success and 3-6% collision rates in reactive parking benchmarks.
- Extreme Motion Generation via Hybrid Null-Space Control for Straight-Line Path Following
  A hybrid RL and model-based null-space controller with conditional diffusion sampling for initial poses extends average straight-line path length by 27% over baseline on 10,000 tasks with a 7-DoF Franka arm.
- Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions
  GTR introduces a bounded non-monotonic Gaussian trust region and Mixture Gaussian Anchor to enable effective behavior transitions in non-stationary RL where standard PPO fails.
- Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift
  JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.
- Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
  Coherent IRL learns dense rewards from demos to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse robotic manipulation tasks.
- Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation
  A structured literature survey of safety mechanisms in long-horizon robotic manipulation organized by intervention timing and strength of supporting evidence.
- You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
- Incremental Residual Reinforcement Learning Toward Real-World Learning for Social Navigation