Continuous control with deep reinforcement learning

Alexander Pritzel, Jonathan J. Hunt, Nicolas Heess, Timothy P. Lillicrap, Tom Erez, Yuval Tassa · 2015 · cs.LG · arXiv 1509.02971

Mixed citation behavior. Most common role is background (62%).

149 Pith papers citing it

Background 62% of classified citations

abstract

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.

citation-role summary

background: 10 · method: 6

citation-polarity summary

background: 10 · use method: 5 · unclear: 1

representative citing papers

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

cs.RO · 2025-12-22 · conditional · novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

High-Dimensional Continuous Control Using Generalized Advantage Estimation

cs.LG · 2015-06-08 · accept · novelty 8.0

Generalized advantage estimation combined with trust region optimization enables stable neural network policy learning for complex continuous control from raw kinematics.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.

Your GFlowNet Secretly Learns an Optimal Transport Plan

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Minimum-flow GFlowNets on graphs encode optimal transport plans, with the learned policy recovering the optimal coupling between source and target distributions.

Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

A training-free survival regression approach uses tabular foundation models to build an accelerated failure time model and iteratively impute right-censored data with a non-parametric in-context estimator, matching the performance of trained Cox and parametric AFT models on benchmarks.

Target Updates May Stabilize Linear Q-Learning: Periodic and Soft Dynamics

stat.ML · 2026-05-31 · unverdicted · novelty 7.0

Periodic and soft target updates guarantee convergence in linear Q-learning to the exact projected Q-Bellman solution under spectral and step-size conditions via joint spectral radius analysis of switched linear systems.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Wasserstein policy gradient converges globally in entropy-regularized RL via Bellman-induced distributional PL geometry and uniform LSI, yielding geometric contraction up to discretization bias.

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

math.PR · 2026-05-20 · unverdicted · novelty 7.0

Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.

Direct Data-Driven Linear Quadratic Tracking via Policy Optimization

eess.SY · 2026-05-15 · unverdicted · novelty 7.0

A reference-decoupled reformulation makes direct data-driven LQT equivalent to certainty-equivalence solutions and supports convergent offline and online DeePO algorithms.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.

The Reciprocity Gradient

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The reciprocity gradient allows agents to learn near-optimal context-sensitive policies by analytically propagating reward gradients through reputation chains in multi-agent settings.

Stable GFlowNets with Probabilistic Guarantees

cs.LG · 2026-05-03 · unverdicted · novelty 7.0

Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces Stable GFlowNets algorithm for improved training stability and distributional fidelity.

A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow

physics.flu-dyn · 2026-04-29 · unverdicted · novelty 7.0

A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.

Leveraging Human Feedback for Semantically-Relevant Skill Discovery

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

SRSD uses human-provided semantic labels to learn rewards that encourage reinforcement learning agents to discover a wide variety of meaningful and distinct behaviors.

Intentional Updates for Streaming Reinforcement Learning

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.

Autonomous Diffractometry Enabled by Visual Reinforcement Learning

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.

To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control

eess.SY · 2026-04-13 · unverdicted · novelty 7.0

A litmus test based on reachset-conformant model identification and correlation analysis of uncertainties predicts if RL-based control is superior to model-based control without any RL training.

Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

cs.LG · 2025-09-16 · conditional · novelty 7.0

Action aliasing from safety projections harms policy-gradient estimates more severely when the projection is inside the policy than when it is outside, but a penalty term restores competitiveness.

Guided Reinforcement Learning for Omnidirectional 3D Jumping in Quadruped Robots

cs.RO · 2025-07-22 · unverdicted · novelty 7.0

Guided RL using Bezier curves and UARM model enables efficient, explainable omnidirectional jumping in quadruped robots.

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

cs.RO · 2025-06-18 · unverdicted · novelty 7.0

DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.

citing papers explorer

Showing 29 of 29 citing papers after filters.

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

cs.RO · 2025-12-22 · conditional · none · ref 13 · internal anchor

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.

Guided Reinforcement Learning for Omnidirectional 3D Jumping in Quadruped Robots

cs.RO · 2025-07-22 · unverdicted · none · ref 18 · internal anchor

Guided RL using Bezier curves and UARM model enables efficient, explainable omnidirectional jumping in quadruped robots.

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

cs.RO · 2025-06-18 · unverdicted · none · ref 78 · internal anchor

DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.

Structured 4D Latent Predictive Model for Robot Planning

cs.RO · 2026-07-01 · unverdicted · none · ref 9 · internal anchor

A 4D latent predictive model encodes scenes holistically to generate 3D-consistent futures that an inverse dynamics module converts into robot actions, outperforming video-based planners on manipulation tasks.

SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning

cs.RO · 2025-06-17 · unverdicted · none · ref 26 · internal anchor

SENIOR improves feedback efficiency and policy learning speed in PbRL by combining motion-distinction query selection via kernel density estimation with preference-guided intrinsic rewards, showing gains on simulated and real robot tasks.

Diffusion Policy Policy Optimization

cs.RO · 2024-09-01 · unverdicted · none · ref 54 · internal anchor

DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

cs.RO · 2021-08-06 · accept · none · ref 80 · internal anchor

A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all datasets and code.

Environment Probing Interaction Policies

cs.RO · 2019-07-26 · unverdicted · none · ref 13 · internal anchor

EPI policies use a transition-predictability reward to probe environments and condition task policies, outperforming standard generalization methods on novel test environments.

RL-RRT: Kinodynamic Motion Planning via Learning Reachability Estimators from RL Policies

cs.RO · 2019-07-10 · conditional · none · ref 15 · internal anchor

RL-RRT learns an RL policy for local planning and a reachability estimator to guide RRT expansion, yielding faster kinodynamic planning than prior methods on three robot systems including hardware.

Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension

cs.RO · 2026-06-05 · unverdicted · none · ref 55 · 2 links · internal anchor

Reinforcement learning produces a single unified controller that lets an actively suspended planetary rover autonomously cross heterogeneous rough terrains after sim training and zero-shot hardware transfer.

Implicit Action Chunking for Smooth Continuous Control

cs.RO · 2026-05-19 · unverdicted · none · ref 46 · internal anchor

Dual-Window Smoothing uses an execution window for deterministic smoothness and a value window to correct critic bias, plus a first-order temporal regularizer, to achieve smoother RL control than explicit chunking or standard baselines.

Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions

cs.RO · 2026-05-17 · conditional · none · ref 10 · internal anchor

A three-stage framework pre-trains multi-agent RL agents on real safety-critical data, refines them via online learning in CARLA, and generates the VPSCI dataset of over 198,000 realistic vehicle-pedestrian interaction episodes.

REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer

cs.RO · 2026-05-09 · unverdicted · none · ref 22 · internal anchor

REAP trains an end-to-end SAC policy with behavior cloning and collision penalties inside a 3DGS Real2Sim simulator and transfers it to physical vehicles, succeeding in narrow mechanical parking slots.

E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation

cs.RO · 2026-04-30 · unverdicted · none · ref 30 · internal anchor

E²DT couples a Decision Transformer with a k-Determinantal Point Process that scores trajectories on return-to-go quantiles, predictive uncertainty, and stage coverage to improve sample efficiency and policy quality in robotic manipulation.

Morphology-Aware Graph Reinforcement Learning for Tensegrity Robot Locomotion

cs.RO · 2025-10-30 · unverdicted · none · ref 14 · internal anchor

A GNN-augmented SAC policy that encodes tensegrity topology as a graph improves sample efficiency and enables zero-shot sim-to-real locomotion on a 3-bar tensegrity robot.

Learning to Solve a Rubik's Cube with a Dexterous Hand

cs.RO · 2019-07-26 · unverdicted · none · ref 25 · internal anchor

Hierarchical RL combines a model-based cube solver with a model-free hand controller to solve Rubik's cubes in simulation, achieving 90.3% success on 1400 random scrambles.

Learning Safe Unlabeled Multi-Robot Planning with Motion Constraints

cs.RO · 2019-07-11 · unverdicted · none · ref 13 · internal anchor

A multi-agent RL framework for unlabeled multi-robot planning that uses velocity obstacle projections to guarantee collision-free trajectories applicable to arbitrary robot models.

Path Planning Using Deep Deterministic Policy Gradient: A Reinforcement Learning Approach

cs.RO · 2026-06-05 · unverdicted · none · ref 35 · internal anchor

Applies DDPG with a composite reward (attractive destination field, repulsive obstacle fields, control energy penalty) to learn safe paths, claiming faster real-time performance than pseudo-spectral optimal control in simulations.

RDGen: Demonstration Generation for High-Quality Robot Learning via Reinforcement Learning

cs.RO · 2026-05-29 · unverdicted · none · ref 14 · internal anchor

RDGen uses sim-to-real RL policies to generate smoother robot demonstrations that improve downstream VLA performance over human-collected data on pick-and-place tasks.

SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving

cs.RO · 2026-05-27 · unverdicted · none · ref 8 · internal anchor

SARAD is a hybrid LLM-DRL framework for autonomous driving that replaces random exploration with RAG-enhanced LLM guidance, an attention discriminator, and a collision predictor, reporting performance gains in the Highway-Env simulator.

Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

cs.RO · 2026-05-27 · unverdicted · none · ref 4 · internal anchor

A temporal-feature extension to clustering, with self-transition suppression for cluster count selection, yields clearer phase structures than prior methods across Ant-v5, HalfCheetah-v5, and Walker2D-v5.

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

cs.RO · 2026-05-26 · unverdicted · none · ref 40 · internal anchor

SDPG is a new on-policy visual RL algorithm that estimates gradients via stochastic perturbations of rollouts, achieving faster training and lower memory use than baselines on visual MuJoCo tasks while adding new robotics benchmarks and sim-to-real results.

Prior Policy Guided Dual-Agent Coordinated Manipulation Planning of Spacecraft-Manipulator System

cs.RO · 2026-05-25 · unverdicted · none · ref 12 · internal anchor

DACMP applies dual-agent DRL with timestep-level expert switching guidance to achieve simultaneous end-effector precision and base attitude stability in spacecraft-manipulator systems, reporting higher success rates than baselines in simulation.

On Training Flexible Robots using Deep Reinforcement Learning

cs.RO · 2019-06-29 · unverdicted · none · ref 24 · internal anchor

Deep reinforcement learning learns robust policies for flexible robots but is sensitive to sensor choice.

Zero-shot Transfer of Reinforcement Learning Control Policies for the Swing-Up and Stabilization of a Cart-Pole System

cs.RO · 2026-06-20 · unverdicted · none · ref 12 · internal anchor

Zero-shot sim-to-real transfer of independently trained RL policies for cart-pole swing-up and stabilization is achieved via sensitivity-guided domain randomization, linear curriculum learning, and first-order action smoothing with Simulink switching logic.

Motion Planning in Dynamic Environments: A Survey from Classical to Modern Methods

cs.RO · 2026-06-01 · unverdicted · none · ref 117 · internal anchor

Survey of 138 papers (2015-2025) categorizing motion planning in dynamic environments into sampling, graph search, MPC, learning, and classical local methods, plus perception and challenges like prediction uncertainty.

A Hierarchical Architecture for Sequential Decision-Making in Autonomous Driving using Deep Reinforcement Learning

cs.RO · 2019-06-20 · unverdicted · none · ref 10 · internal anchor

A hierarchical DRL architecture generates lane-change commands from occupancy grids for stochastic highway driving and claims improved reliability over end-to-end methods.

Transfer Learning for Customized Car Racing Environments

cs.RO · 2026-05-18 · unverdicted · none · ref 7 · internal anchor

The study applies transfer learning to deep RL in OpenAI car racing, observing that model-based approaches outperform model-free methods and that transfer boosts target domain performance.

An Introduction to Deep Reinforcement and Imitation Learning

cs.RO · 2025-12-08 · unverdicted · none · ref 12 · internal anchor

The paper delivers a concise, self-contained tutorial on foundational DRL algorithms including REINFORCE and PPO and DIL methods including behavioral cloning, DAgger, and GAIL for embodied agents.
