Continuous control with deep reinforcement learning

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa · 2015 · cs.LG · arXiv 1509.02971

40 Pith papers cite this work. Polarity classification is still indexing.

abstract

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
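The "deterministic policy gradient" named in the abstract can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the paper's implementation: the critic is a fixed, hand-written function Q(s, a) = -(a - 2s)^2 rather than a learned network, the policy is the one-parameter linear map mu(s) = theta * s, and every name here (dq_da, dpg_step, train) is hypothetical. The update follows the chain rule: the policy parameter moves along the average of dQ/da, evaluated at a = mu(s), times d mu / d theta.

```python
def dq_da(s, a):
    # Gradient of the assumed toy critic Q(s, a) = -(a - 2*s)**2
    # with respect to the action a.
    return -2.0 * (a - 2.0 * s)


def dpg_step(theta, states, lr):
    # Deterministic policy mu(s) = theta * s, so d mu / d theta = s.
    # Average dQ/da (at a = mu(s)) times d mu / d theta over the batch,
    # then take one gradient-ascent step on theta.
    grad = sum(dq_da(s, theta * s) * s for s in states) / len(states)
    return theta + lr * grad


def train(theta=0.0, states=(-1.0, -0.5, 0.5, 1.0), lr=0.1, steps=200):
    # Repeated ascent steps on a fixed batch of states.
    for _ in range(steps):
        theta = dpg_step(theta, states, lr)
    return theta


if __name__ == "__main__":
    # The toy critic is maximized by a = 2*s, so theta should approach 2.
    print(train())
```

Because the policy is deterministic, the gradient flows through the action itself rather than through a log-probability, which is why the approach applies to continuous action spaces. In the toy above the critic is known in closed form; in an actor-critic method the critic would itself be learned.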

representative citing papers

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

High-Dimensional Continuous Control Using Generalized Advantage Estimation

cs.LG · 2015-06-08 · accept · novelty 8.0

Generalized advantage estimation combined with trust region optimization enables stable neural network policy learning for complex continuous control from raw kinematics.

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.

The Reciprocity Gradient

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The reciprocity gradient allows agents to learn near-optimal context-sensitive policies by analytically propagating reward gradients through reputation chains in multi-agent settings.

Stable GFlowNets with Probabilistic Guarantees

cs.LG · 2026-05-03 · unverdicted · novelty 7.0

Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces the Stable GFlowNets algorithm for improved training stability and distributional fidelity.

A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow

physics.flu-dyn · 2026-04-29 · unverdicted · novelty 7.0

A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.

Leveraging Human Feedback for Semantically-Relevant Skill Discovery

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

SRSD uses human-provided semantic labels to learn rewards that encourage reinforcement learning agents to discover a wide variety of meaningful and distinct behaviors.

Intentional Updates for Streaming Reinforcement Learning

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.

Autonomous Diffractometry Enabled by Visual Reinforcement Learning

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.

To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control

eess.SY · 2026-04-13 · unverdicted · novelty 7.0

A litmus test based on reachset-conformant model identification and correlation analysis of uncertainties predicts if RL-based control is superior to model-based control without any RL training.

Mastering Diverse Domains through World Models

cs.AI · 2023-01-10 · unverdicted · novelty 7.0

DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

Dream to Control: Learning Behaviors by Latent Imagination

cs.LG · 2019-12-03 · accept · novelty 7.0

Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

Soft Actor-Critic Algorithms and Applications

cs.LG · 2018-12-13 · unverdicted · novelty 7.0

SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

cs.LG · 2018-01-04 · accept · novelty 7.0

Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.

Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

The k-step policy gradient converges exponentially close to the optimal deterministic policy in restricted classes, achieving O(1/T) rates under smoothness assumptions without distribution mismatch factors.

AdamO: A Collapse-Suppressed Optimizer for Offline RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.

RL Token: Bootstrapping Online RL with Vision-Language-Action Models

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.

A Systematic Review and Taxonomy of Reinforcement Learning-Model Predictive Control Integration for Linear Systems

eess.SY · 2026-04-22 · unverdicted · novelty 6.0

This review synthesizes existing RL-MPC integration methods for linear systems into a taxonomy across RL roles, algorithms, MPC formulations, costs, and domains while identifying recurring patterns and practical challenges.

Safe Control using Learned Safety Filters and Adaptive Conformal Inference

eess.SY · 2026-04-20 · unverdicted · novelty 6.0

ACoFi adaptively tunes the switching threshold of learned safety filters using conformal inference on the range of predicted safety values, asymptotically bounding the rate of incorrect safety assessments by a user parameter and reducing violations versus fixed thresholds in simulations.

Scalable Neighborhood-Based Multi-Agent Actor-Critic

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.

Distributional Off-Policy Evaluation with Deep Quantile Process Regression

stat.ML · 2026-04-20 · unverdicted · novelty 6.0

DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.

Mean Flow Policy Optimization

cs.LG · 2026-04-16 · conditional · novelty 6.0

Mean Flow Policy Optimization (MFPO) uses few-step flow-based models for RL policies and achieves performance on par with or better than diffusion-based methods while substantially lowering training and inference time on MuJoCo and DeepMind Control Suite.

Physics-guided surrogate learning enables zero-shot control of turbulent wings

physics.flu-dyn · 2026-04-10 · unverdicted · novelty 6.0

Zero-shot RL control trained on matched channel flows reduces skin-friction drag 28.7% and total drag 10.7% on a NACA4412 wing, outperforming opposition control.

citing papers explorer

Showing 40 of 40 citing papers.

Consistency Models cs.LG · 2023-03-02 · conditional · none · ref 35 · internal anchor
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
High-Dimensional Continuous Control Using Generalized Advantage Estimation cs.LG · 2015-06-08 · accept · none · ref 13 · internal anchor
Generalized advantage estimation combined with trust region optimization enables stable neural network policy learning for complex continuous control from raw kinematics.
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic cs.LG · 2026-05-09 · unverdicted · none · ref 30 · internal anchor
A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.
The Reciprocity Gradient cs.LG · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
The reciprocity gradient allows agents to learn near-optimal context-sensitive policies by analytically propagating reward gradients through reputation chains in multi-agent settings.
Stable GFlowNets with Probabilistic Guarantees cs.LG · 2026-05-03 · unverdicted · none · ref 72 · internal anchor
Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces the Stable GFlowNets algorithm for improved training stability and distributional fidelity.
A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow physics.flu-dyn · 2026-04-29 · unverdicted · none · ref 45 · internal anchor
A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.
Leveraging Human Feedback for Semantically-Relevant Skill Discovery cs.LG · 2026-04-27 · unverdicted · none · ref 23 · internal anchor
SRSD uses human-provided semantic labels to learn rewards that encourage reinforcement learning agents to discover a wide variety of meaningful and distinct behaviors.
Intentional Updates for Streaming Reinforcement Learning cs.LG · 2026-04-21 · unverdicted · none · ref 37 · internal anchor
Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.
Autonomous Diffractometry Enabled by Visual Reinforcement Learning cs.LG · 2026-04-13 · unverdicted · none · ref 17 · internal anchor
A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control eess.SY · 2026-04-13 · unverdicted · none · ref 14 · internal anchor
A litmus test based on reachset-conformant model identification and correlation analysis of uncertainties predicts if RL-based control is superior to model-based control without any RL training.
Mastering Diverse Domains through World Models cs.AI · 2023-01-10 · unverdicted · none · ref 6 · internal anchor
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
Dream to Control: Learning Behaviors by Latent Imagination cs.LG · 2019-12-03 · accept · none · ref 33 · internal anchor
Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
Soft Actor-Critic Algorithms and Applications cs.LG · 2018-12-13 · unverdicted · none · ref 8 · internal anchor
SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor cs.LG · 2018-01-04 · accept · none · ref 16 · internal anchor
Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.
Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients cs.LG · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
The k-step policy gradient converges exponentially close to the optimal deterministic policy in restricted classes, achieving O(1/T) rates under smoothness assumptions without distribution mismatch factors.
AdamO: A Collapse-Suppressed Optimizer for Offline RL cs.LG · 2026-05-03 · unverdicted · none · ref 8 · internal anchor
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 100 · internal anchor
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
RL Token: Bootstrapping Online RL with Vision-Language-Action Models cs.LG · 2026-04-24 · unverdicted · none · ref 19 · internal anchor
RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
A Systematic Review and Taxonomy of Reinforcement Learning-Model Predictive Control Integration for Linear Systems eess.SY · 2026-04-22 · unverdicted · none · ref 8 · internal anchor
This review synthesizes existing RL-MPC integration methods for linear systems into a taxonomy across RL roles, algorithms, MPC formulations, costs, and domains while identifying recurring patterns and practical challenges.
Safe Control using Learned Safety Filters and Adaptive Conformal Inference eess.SY · 2026-04-20 · unverdicted · none · ref 25 · internal anchor
ACoFi adaptively tunes the switching threshold of learned safety filters using conformal inference on the range of predicted safety values, asymptotically bounding the rate of incorrect safety assessments by a user parameter and reducing violations versus fixed thresholds in simulations.
Scalable Neighborhood-Based Multi-Agent Actor-Critic cs.LG · 2026-04-20 · unverdicted · none · ref 6 · internal anchor
MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.
Distributional Off-Policy Evaluation with Deep Quantile Process Regression stat.ML · 2026-04-20 · unverdicted · none · ref 115 · internal anchor
DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.
Mean Flow Policy Optimization cs.LG · 2026-04-16 · conditional · none · ref 1 · internal anchor
Mean Flow Policy Optimization (MFPO) uses few-step flow-based models for RL policies and achieves performance on par with or better than diffusion-based methods while substantially lowering training and inference time on MuJoCo and DeepMind Control Suite.
Physics-guided surrogate learning enables zero-shot control of turbulent wings physics.flu-dyn · 2026-04-10 · unverdicted · none · ref 56 · internal anchor
Zero-shot RL control trained on matched channel flows reduces skin-friction drag 28.7% and total drag 10.7% on a NACA4412 wing, outperforming opposition control.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control cs.LG · 2026-04-06 · unverdicted · none · ref 44 · internal anchor
FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of simulators.
TD-MPC2: Scalable, Robust World Models for Continuous Control cs.LG · 2023-10-25 · conditional · none · ref 39 · internal anchor
TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation cs.RO · 2021-08-06 · accept · none · ref 80 · internal anchor
A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all datasets and code.
Behavior Regularized Offline Reinforcement Learning cs.LG · 2019-11-26 · unverdicted · none · ref 13 · internal anchor
Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.
DeepMind Control Suite cs.AI · 2018-01-02 · accept · none · ref 7 · internal anchor
The DeepMind Control Suite supplies a standardized collection of continuous control tasks with interpretable rewards for benchmarking reinforcement learning agents.
Learning to Compress and Transmit: Adaptive Rate Control for Semantic Communications over LEO Satellite-to-Ground Links cs.NI · 2026-05-11 · unverdicted · none · ref 82 · internal anchor
An RL agent adaptively controls the compression rate in semantic satellite communications, using SNR predictions and queue management to achieve 95% qualified image frames with no packet loss.
REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer cs.RO · 2026-05-09 · unverdicted · none · ref 22 · internal anchor
REAP trains an end-to-end SAC policy with behavior cloning and collision penalties inside a 3DGS Real2Sim simulator and transfers it to physical vehicles, succeeding in narrow mechanical parking slots.
Soft Deterministic Policy Gradient with Gaussian Smoothing cs.LG · 2026-05-07 · unverdicted · none · ref 11 · internal anchor
Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discretized-reward variants.
Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning cs.LG · 2026-05-07 · unverdicted · none · ref 7 · 2 links · internal anchor
ME-AM adds mirror-descent entropy maximization and a mixture behavior prior to adjoint matching in flow-based policies to mitigate popularity bias and support binding in offline RL.
E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation cs.RO · 2026-04-30 · unverdicted · none · ref 30 · internal anchor
E²DT couples a Decision Transformer with a k-Determinantal Point Process that scores trajectories on return-to-go quantiles, predictive uncertainty, and stage coverage to improve sample efficiency and policy quality in robotic manipulation.
Reinforcement Learning for Robust Calibration of Multi-Qudit Quantum Gates quant-ph · 2026-04-21 · unverdicted · none · ref 42 · internal anchor
A hybrid optimal-control-plus-contextual-RL framework learns low-dimensional residual pulse corrections that preserve high-fidelity controlled-phase gates on two qutrits under realistic static model mismatch.
RL-ABC: Reinforcement Learning for Accelerator Beamline Control cs.LG · 2026-04-21 · unverdicted · none · ref 21 · internal anchor
RL-ABC is a framework that formulates accelerator beamline tuning as a Markov decision process with a 57-dimensional state and configurable reward, enabling a DDPG agent to reach 70.3% particle transmission on a VEPP-5 test beamline, matching differential evolution.
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework cs.CV · 2026-04-16 · unverdicted · none · ref 37 · internal anchor
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
Fluid Antenna-Enabled Hybrid NOMA and AirFL Networks Under Imperfect CSI and SIC eess.SP · 2026-05-11 · unverdicted · none · ref 31 · internal anchor
Fluid antennas in hybrid NOMA-AirFL networks improve hybrid rate performance under imperfect CSI and SIC by formulating a robust optimization solved via LSTM-DDPG.
Recurrent Deep Reinforcement Learning for Chemotherapy Control under Partial Observability cs.LG · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
Recurrent TD3 with separate LSTM actor-critic networks delivers substantially stronger and more stable chemotherapy control than feed-forward baselines under partial observability on the AhnChemoEnv benchmark.
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems cs.LG · 2020-05-04 · unverdicted · none · ref 226 · internal anchor
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.
