Continuous control with deep reinforcement learning

Alexander Pritzel, Jonathan J. Hunt, Nicolas Heess, Timothy P. Lillicrap, Tom Erez, Yuval Tassa · 2015 · cs.LG · arXiv 1509.02971

Citation behavior is mixed. The most common role is background (62% of classified citations).

121 Pith papers cite this work.

abstract

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
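The abstract's central mechanism, an actor updated by the deterministic policy gradient through a critic, can be illustrated with a toy sketch. This is not the paper's implementation: the one-dimensional state, the linear actor `mu(s) = theta * s`, and the hand-written critic `q_value` (with its analytic gradient `dq_daction`) are all hypothetical choices made only to show the chain rule dQ/da · dmu/dtheta. In the actual algorithm the critic is a learned network rather than a fixed function.

```python
import random


def q_value(state, action):
    # Hypothetical hand-written critic (an assumption, not learned):
    # the best action for a given state is 2 * state.
    return -(action - 2.0 * state) ** 2


def dq_daction(state, action):
    # Analytic gradient of the toy critic with respect to the action.
    return -2.0 * (action - 2.0 * state)


def train_actor(steps=500, lr=0.05, batch_size=32, seed=0):
    """Gradient ascent on a linear deterministic actor mu(s) = theta * s."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        grad = 0.0
        for _ in range(batch_size):
            s = rng.uniform(-1.0, 1.0)   # sampled state
            a = theta * s                # deterministic action
            # Deterministic policy gradient via the chain rule:
            # dQ/da evaluated at a = mu(s), times dmu/dtheta = s.
            grad += dq_daction(s, a) * s
        theta += lr * grad / batch_size  # ascend the critic's value
    return theta


theta = train_actor()
```

Under these assumptions `theta` approaches 2, the slope that maximizes the toy critic for every state. The sketch omits everything that makes the full method practical, such as learning the critic from experience.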

citation-role summary

background: 10 · method: 6

citation-polarity summary

background: 10 · use method: 5 · unclear: 1

representative citing papers

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

cs.RO · 2025-12-22 · conditional · novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

High-Dimensional Continuous Control Using Generalized Advantage Estimation

cs.LG · 2015-06-08 · accept · novelty 8.0

Generalized advantage estimation combined with trust region optimization enables stable neural network policy learning for complex continuous control from raw kinematics.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Wasserstein policy gradient converges globally in entropy-regularized RL via Bellman-induced distributional PL geometry and uniform LSI, yielding geometric contraction up to discretization bias.

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

math.PR · 2026-05-20 · unverdicted · novelty 7.0

Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.

Direct Data-Driven Linear Quadratic Tracking via Policy Optimization

eess.SY · 2026-05-15 · unverdicted · novelty 7.0

A reference-decoupled reformulation makes direct data-driven LQT equivalent to certainty-equivalence solutions and supports convergent offline and online DeePO algorithms.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.

The Reciprocity Gradient

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The reciprocity gradient allows agents to learn near-optimal context-sensitive policies by analytically propagating reward gradients through reputation chains in multi-agent settings.

Stable GFlowNets with Probabilistic Guarantees

cs.LG · 2026-05-03 · unverdicted · novelty 7.0

Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces Stable GFlowNets algorithm for improved training stability and distributional fidelity.

A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow

physics.flu-dyn · 2026-04-29 · unverdicted · novelty 7.0

A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.

Leveraging Human Feedback for Semantically-Relevant Skill Discovery

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

SRSD uses human-provided semantic labels to learn rewards that encourage reinforcement learning agents to discover a wide variety of meaningful and distinct behaviors.

Intentional Updates for Streaming Reinforcement Learning

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.

Autonomous Diffractometry Enabled by Visual Reinforcement Learning

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.

To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control

eess.SY · 2026-04-13 · unverdicted · novelty 7.0

A litmus test based on reachset-conformant model identification and correlation analysis of uncertainties predicts if RL-based control is superior to model-based control without any RL training.

Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

cs.LG · 2025-09-16 · conditional · novelty 7.0

Action aliasing from safety projections harms policy-gradient estimates more severely when the projection is inside the policy than when it is outside, but a penalty term restores competitiveness.

Guided Reinforcement Learning for Omnidirectional 3D Jumping in Quadruped Robots

cs.RO · 2025-07-22 · unverdicted · novelty 7.0

Guided RL using Bézier curves and a UARM model enables efficient, explainable omnidirectional jumping in quadruped robots.

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

cs.RO · 2025-06-18 · unverdicted · novelty 7.0

DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.

DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty

cs.LG · 2025-06-14 · unverdicted · novelty 7.0

DR-SAC is the first actor-critic distributionally robust RL algorithm for offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.

Variational Sequential Optimal Experimental Design using Reinforcement Learning

stat.ML · 2023-06-17 · unverdicted · novelty 7.0

vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.

Mastering Diverse Domains through World Models

cs.AI · 2023-01-10 · unverdicted · novelty 7.0

DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

cs.LG · 2022-08-12 · unverdicted · novelty 7.0

Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.

Dream to Control: Learning Behaviors by Latent Imagination

cs.LG · 2019-12-03 · accept · novelty 7.0

Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

citing papers explorer

Showing 6 of 6 citing papers after filters.

The Reciprocity Gradient

cs.LG · 2026-05-08 · unverdicted · none · ref 6 · internal anchor

The reciprocity gradient allows agents to learn near-optimal context-sensitive policies by analytically propagating reward gradients through reputation chains in multi-agent settings.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

cs.LG · 2018-01-04 · accept · none · ref 16 · internal anchor

Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.

RL Token: Bootstrapping Online RL with Vision-Language-Action Models

cs.LG · 2026-04-24 · unverdicted · none · ref 19 · internal anchor

RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

cs.LG · 2026-04-06 · unverdicted · none · ref 44 · 2 links · internal anchor

FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

cs.RO · 2021-08-06 · accept · none · ref 80 · internal anchor

A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all datasets and code.

Fluid Antenna-Enabled Hybrid NOMA and AirFL Networks Under Imperfect CSI and SIC

eess.SP · 2026-05-11 · unverdicted · none · ref 31 · internal anchor

Fluid antennas in hybrid NOMA-AirFL networks improve hybrid rate performance under imperfect CSI and SIC by formulating a robust optimization solved via LSTM-DDPG.
