Continuous control with deep reinforcement learning

Alexander Pritzel; Daan Wierstra; David Silver; Jonathan J. Hunt; Nicolas Heess; Timothy P. Lillicrap; Tom Erez; Yuval Tassa

arxiv: 1509.02971 · v6 · submitted 2015-09-09 · 💻 cs.LG · stat.ML

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra

Pith reviewed 2026-05-11 15:37 UTC · model grok-4.3

keywords deep reinforcement learning · continuous control · actor-critic · deterministic policy gradient · policy learning · simulated physics tasks · pixel-based learning

The pith

A single actor-critic algorithm using deterministic policy gradients solves more than twenty continuous control tasks with the same network and hyperparameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts the core ideas from deep Q-learning to continuous action spaces by introducing an actor-critic method based on the deterministic policy gradient. With one fixed learning algorithm, network architecture, and set of hyperparameters, it learns effective policies for more than twenty simulated physics tasks such as cart-pole swing-up, object manipulation, legged walking, and vehicle driving. The approach works even when the agent receives only raw pixel images as input and produces results competitive with planning methods that know the full environment dynamics and their derivatives. A sympathetic reader would care because the result suggests deep reinforcement learning can handle real-world-style control problems without per-task tuning or an explicit model of the physics.

Core claim

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate for

What carries the argument

The deterministic policy gradient actor-critic update, which computes policy gradients by chaining the gradient of the action-value function with respect to actions into the policy parameters.
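
As a hedged illustration of the chaining described above, the following NumPy sketch applies the update to a toy problem. The linear actor mu(s) = W s, the analytic critic Q(s, a) = -||a - K s||^2, the gain matrix K, and the step size are all assumptions made for illustration; the paper uses learned neural networks for both actor and critic.

```python
import numpy as np

def critic_action_grad(state, action, target_gain):
    """Gradient of the toy critic Q(s, a) = -||a - K s||^2 with respect to a."""
    return -2.0 * (action - target_gain @ state)

def actor_gradient(weights, states, target_gain):
    """Batch average of dQ/da * dmu/dW for the linear actor mu(s) = W s."""
    grad = np.zeros_like(weights)
    for s in states:
        a = weights @ s                        # deterministic action mu(s)
        dq_da = critic_action_grad(s, a, target_gain)
        grad += np.outer(dq_da, s)             # chain rule: dQ/da times dmu/dW
    return grad / len(states)

rng = np.random.default_rng(0)
K = np.array([[1.0, -0.5], [0.25, 2.0]])       # hypothetical optimal gain (assumption)
W = np.zeros((2, 2))                           # actor parameters
states = rng.normal(size=(64, 2))              # synthetic batch of states
for _ in range(500):
    W += 0.05 * actor_gradient(W, states, K)   # gradient ascent on Q
final_error = float(np.abs(W - K).max())
```

In this toy setting the actor parameters move toward K because the critic is maximized exactly when the action equals K s. With a learned critic the same chain rule applies, but the gradient is only as good as the critic's estimate.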

If this is right

Policies competitive with full-information planning can be obtained without any model of the dynamics.
The same fixed setup works across manipulation, locomotion, and driving domains.
End-to-end learning directly from raw pixel observations is possible for many of the tasks.
No separate model-learning or planning stage is needed at runtime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be applied to physical robots where building accurate dynamics models is difficult.
Similar actor-critic constructions might stabilize learning in other high-dimensional continuous domains such as process control or molecular design.
The demonstrated robustness across tasks hints that off-policy deterministic updates may reduce the need for on-policy sampling in continuous reinforcement learning.

Load-bearing premise

That the deterministic policy gradient combined with deep networks and standard replay and target tricks will produce stable learning across diverse continuous control tasks without requiring per-task hyperparameter search or model knowledge.

What would settle it

Training the algorithm on an additional continuous-control task drawn from the same class of physics problems, using exactly the same network, hyperparameters, and replay setup, and observing that it fails to produce a policy better than random or requires extensive per-task retuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DDPG shows how to make deterministic policy gradients work with deep nets, replay, and targets for continuous control across many tasks using one setup.

The punchline is that this paper gives a working actor-critic method for continuous actions by combining the deterministic policy gradient with deep networks, experience replay, and target networks, then demonstrates it on more than 20 simulated physics tasks with fixed hyperparameters and architecture. It reaches performance competitive with planning baselines that have full model access, and it handles some tasks end-to-end from pixels. That combination had not been shown at this scale before, and the empirical breadth is the real contribution.

The authors build cleanly on Silver's DPG and Mnih's DQN without overclaiming theoretical novelty, and the results are presented with learning curves that let readers see the training behavior. The math is standard and the citation pattern is appropriate.

The main soft spot is the robustness claim. The paper reports consistent success across diverse tasks without per-task retuning, but it does not include extensive ablations that isolate which pieces (replay buffer size, target update rate, exploration noise schedule) prevent divergence or instability on harder domains. The main results also offer few multi-seed statistics and little explicit failure-mode analysis, so it remains possible that some reported successes benefited from favorable random seeds or implicit choices. These gaps are not fatal, but they mean the “same hyperparameters, robustly solves” statement rests more on the breadth of positive outcomes than on dissected evidence.

This paper is for RL researchers and roboticists who need a practical starting point for continuous control in simulation. It deserves serious peer review because it supplies a reproducible baseline that others could build on or stress-test further, even if later work tightened the stability picture.
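
The experience-replay component named in the letter can be sketched as a fixed-capacity buffer with uniform sampling. This is a minimal illustration, not the authors' implementation; the class name `ReplayBuffer`, its methods, and the tiny capacity are hypothetical.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity, seed=None):
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation
        # between consecutive transitions in the minibatch.
        batch = self._rng.sample(list(self._storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self._storage)

buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0.1 * t, reward=1.0, next_state=t + 1, done=False)
stored_states = sorted(item[0] for item in buffer._storage)
batch = buffer.sample(batch_size=2)
```

Because the buffer holds transitions generated by older versions of the policy, learning from it is off-policy, which is why it pairs naturally with a Q-learning style critic.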

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Deep Deterministic Policy Gradient (DDPG) algorithm, an actor-critic method that adapts deterministic policy gradients together with deep networks, experience replay, and target networks to continuous action spaces. It claims that a single fixed network architecture, hyperparameter set, and learning procedure robustly solves more than 20 simulated physics tasks (cart-pole swing-up, dexterous manipulation, legged locomotion, car driving) and produces policies competitive with a planning baseline that has full access to dynamics and derivatives; it further shows end-to-end learning directly from raw pixel observations.

Significance. If the reported robustness holds under scrutiny, the work is significant: it supplies a practical, model-free algorithm that bridges the discrete-action successes of DQN to continuous control without per-task retuning, and it supplies a clear, reproducible description of the method together with broad empirical coverage across locomotion and manipulation domains. These elements have clear downstream value for robotics and autonomous systems.

major comments (2)

[§4 (Experiments) and associated figures] The central claim that the identical hyperparameter set and architecture 'robustly solves' more than 20 tasks spanning cart-pole to driving is load-bearing for the paper's contribution, yet the reported results consist of single learning curves without error bars, multi-seed statistics, or sensitivity sweeps over initialization, exploration-noise scale, or critic learning rate. This leaves open the possibility that reported successes reflect favorable random seeds or implicit per-task choices rather than intrinsic stability of DPG + replay + target networks.
[§4 and Algorithm 1] No ablation isolates the contribution of replay buffer, target-network soft updates, or the Ornstein-Uhlenbeck exploration process across the task suite. Because the robustness assertion rests on the claim that this specific combination prevents divergence on diverse dynamics, the absence of component-wise controls makes it impossible to determine which elements are necessary for the observed stability.
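
The Ornstein-Uhlenbeck exploration process mentioned in the second major comment can be sketched as follows, assuming a simple Euler discretization. The class name and the parameter values (theta, sigma, dt) are illustrative assumptions rather than settings confirmed by this page; the sketch only shows why the resulting noise is temporally correlated.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Euler-discretized OU process:
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu = np.full(size, mu, dtype=float)
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Restart the process at its long-run mean (e.g. at episode start).
        self.state = self.mu.copy()

    def sample(self):
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.normal(size=self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state.copy()

noise = OrnsteinUhlenbeckNoise(size=1, seed=0)
trace = np.array([noise.sample()[0] for _ in range(5000)])
# Consecutive samples are positively correlated, unlike independent Gaussian noise.
lag1_autocorr = float(np.corrcoef(trace[:-1], trace[1:])[0, 1])
```

An ablation of the kind the referee requests could swap this object for uncorrelated Gaussian noise of matched scale while holding every other setting fixed.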

minor comments (2)

[§3] The description of the critic target in Eq. (2) and the soft-update rule for target networks could be written with explicit time indices to avoid ambiguity when readers re-implement the algorithm.
[§4] Several learning-curve plots lack axis labels or legend entries that distinguish training versus evaluation returns; this reduces clarity but does not affect the central empirical claim.
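
One possible way to write the critic target and the soft target update with explicit indices, as the first minor comment suggests, is sketched below. The update-step index k is an assumption introduced here for illustration; the remaining symbols follow the usual DDPG notation (critic Q with parameters \theta^Q, actor \mu with parameters \theta^\mu, primed symbols for target networks, soft-update rate \tau).

```latex
% Critic target for the i-th sampled transition (s_i, a_i, r_i, s_{i+1}),
% evaluated with the target networks held at update step k:
y_i^{(k)} = r_i + \gamma \, Q'\!\left(s_{i+1},\, \mu'(s_{i+1} \mid \theta^{\mu'}_{k}) \,\middle|\, \theta^{Q'}_{k}\right)

% Critic loss over a minibatch of N transitions:
L(\theta^{Q}_{k}) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i^{(k)} - Q(s_i, a_i \mid \theta^{Q}_{k}) \right)^2

% Soft target updates with \tau \ll 1, applied after the step-k gradient updates:
\theta^{Q'}_{k+1} = \tau \, \theta^{Q}_{k+1} + (1 - \tau) \, \theta^{Q'}_{k}
\theta^{\mu'}_{k+1} = \tau \, \theta^{\mu}_{k+1} + (1 - \tau) \, \theta^{\mu'}_{k}
```

Written this way, it is explicit that the target y_i^{(k)} depends only on the slowly moving target parameters, not on the parameters being optimized at step k.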

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's significance. We address each major comment below, committing to revisions where they strengthen the manuscript without misrepresenting our original results.

Referee: [§4 (Experiments) and associated figures] The central claim that the identical hyperparameter set and architecture 'robustly solves' more than 20 tasks spanning cart-pole to driving is load-bearing for the paper's contribution, yet the reported results consist of single learning curves without error bars, multi-seed statistics, or sensitivity sweeps over initialization, exploration-noise scale, or critic learning rate. This leaves open the possibility that reported successes reflect favorable random seeds or implicit per-task choices rather than intrinsic stability of DPG + replay + target networks.

Authors: We agree that single-run learning curves limit statistical assessment of variability and robustness. The original experiments used a fixed seed for reproducibility across the diverse task suite, and the competitive performance against a full-information planner on more than 20 tasks (from cart-pole to locomotion and driving) with no per-task retuning provides supporting evidence that successes are not merely lucky seeds. To directly address the concern, we will rerun key experiments with multiple random seeds, add mean curves with standard-error bars, and include a brief sensitivity analysis on exploration noise in the revised manuscript. revision: yes
Referee: [§4 and Algorithm 1] No ablation isolates the contribution of replay buffer, target-network soft updates, or the Ornstein-Uhlenbeck exploration process across the task suite. Because the robustness assertion rests on the claim that this specific combination prevents divergence on diverse dynamics, the absence of component-wise controls makes it impossible to determine which elements are necessary for the observed stability.

Authors: We acknowledge that explicit ablations would help isolate each component's role. The design directly extends DQN's replay and target networks to the deterministic policy gradient setting, with OU noise chosen for temporally correlated exploration in continuous spaces; the paper's core demonstration is that this fixed combination succeeds end-to-end across a broad task distribution without retuning. Full ablations on all 20+ tasks are computationally heavy, but we will add a dedicated discussion paragraph motivating each element and include limited ablation results on a representative subset of tasks (e.g., cart-pole and one locomotion task) in the revision. revision: partial
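
The multi-seed reporting promised in the first response (mean curves with standard-error bars) could be computed along these lines. The function name `summarize_runs` is hypothetical, and the three short curves are synthetic placeholders, not results from the paper.

```python
import numpy as np

def summarize_runs(returns_per_seed):
    """Return the per-evaluation mean and standard error across seeds.

    returns_per_seed: array-like of shape (n_seeds, n_evaluations).
    """
    runs = np.asarray(returns_per_seed, dtype=float)
    mean = runs.mean(axis=0)
    # Sample standard deviation (ddof=1) divided by sqrt(number of seeds).
    stderr = runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])
    return mean, stderr

# Synthetic learning curves for three seeds (placeholder data only).
curves = [
    [0.0, 1.0, 2.0],
    [2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0],
]
mean_curve, stderr_curve = summarize_runs(curves)
```

Plotting `mean_curve` with a band of plus or minus `stderr_curve` gives readers a direct view of seed-to-seed variability, which is the evidence the referee found missing.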

Circularity Check

0 steps flagged

No circularity: empirical results from adapted deterministic policy gradient algorithm

The paper adapts the deterministic policy gradient theorem to deep networks with replay buffers and target networks, then reports empirical success on over 20 continuous control tasks using fixed hyperparameters and architecture. No equations derive a 'prediction' that reduces to a fitted parameter or self-defined quantity by construction. Citations to the DPG theorem reference prior independent work (Silver et al. 2014) whose mathematical content stands outside this manuscript. The central claims are performance numbers and robustness observations, not quantities forced by the paper's own inputs or self-citation chains. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard MDP assumptions and empirical choices for network size, learning rates, and replay buffer size that are not derived from first principles.

free parameters (1)

network architecture and hyperparameters
Same architecture and hyper-parameters used across all tasks; these are selected rather than derived.

axioms (1)

domain assumption: The environment can be modeled as a Markov Decision Process
Required for policy gradient and Q-learning methods to apply.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller
cs.RO 2025-12 conditional novelty 8.0

First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
Consistency Models
cs.LG 2023-03 conditional novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
High-Dimensional Continuous Control Using Generalized Advantage Estimation
cs.LG 2015-06 accept novelty 8.0

Generalized advantage estimation combined with trust region optimization enables stable neural network policy learning for complex continuous control from raw kinematics.
Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise
math.PR 2026-05 unverdicted novelty 7.0

Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properti...
Direct Data-Driven Linear Quadratic Tracking via Policy Optimization
eess.SY 2026-05 unverdicted novelty 7.0

A reference-decoupled reformulation makes direct data-driven LQT equivalent to certainty-equivalence solutions and supports convergent offline and online DeePO algorithms.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
cs.LG 2026-05 unverdicted novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...
The Reciprocity Gradient
cs.LG 2026-05 unverdicted novelty 7.0

The reciprocity gradient allows agents to learn near-optimal context-sensitive policies by analytically propagating reward gradients through reputation chains in multi-agent settings.
Stable GFlowNets with Probabilistic Guarantees
cs.LG 2026-05 unverdicted novelty 7.0

Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces Stable GFlowNets algorithm for improved training stability and distributional fidelity.
A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow
physics.flu-dyn 2026-04 unverdicted novelty 7.0

A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on c...
Leveraging Human Feedback for Semantically-Relevant Skill Discovery
cs.LG 2026-04 unverdicted novelty 7.0

SRSD uses human-provided semantic labels to learn rewards that encourage reinforcement learning agents to discover a wide variety of meaningful and distinct behaviors.
Intentional Updates for Streaming Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.
Autonomous Diffractometry Enabled by Visual Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control
eess.SY 2026-04 unverdicted novelty 7.0

A litmus test based on reachset-conformant model identification and correlation analysis of uncertainties predicts if RL-based control is superior to model-based control without any RL training.
Frictional Q-Learning
cs.LG 2025-09 unverdicted novelty 7.0

Frictional Q-Learning encodes supported actions as tangent directions on an action manifold using a contrastive variational autoencoder to reduce extrapolation errors in off-policy reinforcement learning.
Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?
cs.LG 2025-09 conditional novelty 7.0

Action aliasing from safety projections harms policy-gradient estimates more severely when the projection is inside the policy than when it is outside, but a penalty term restores competitiveness.
Guided Reinforcement Learning for Omnidirectional 3D Jumping in Quadruped Robots
cs.RO 2025-07 unverdicted novelty 7.0

Guided RL using Bezier curves and UARM model enables efficient, explainable omnidirectional jumping in quadruped robots.
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
cs.RO 2025-06 unverdicted novelty 7.0

DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty
cs.LG 2025-06 unverdicted novelty 7.0

DR-SAC is the first actor-critic distributionally robust RL algorithm for offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.
Variational Sequential Optimal Experimental Design using Reinforcement Learning
stat.ML 2023-06 unverdicted novelty 7.0

vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and ...
Mastering Diverse Domains through World Models
cs.AI 2023-01 unverdicted novelty 7.0

DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
cs.LG 2022-08 unverdicted novelty 7.0

Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
Dream to Control: Learning Behaviors by Latent Imagination
cs.LG 2019-12 accept novelty 7.0

Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
Benchmarking Model-Based Reinforcement Learning
cs.LG 2019-07 accept novelty 7.0

Introduces a benchmark suite of over 18 MBRL environments, evaluates multiple algorithms under consistent settings, and identifies three core challenges: dynamics bottleneck, planning horizon dilemma, and early-termin...
Finding Needles in a Moving Haystack: Prioritizing Alerts with Adversarial Reinforcement Learning
cs.CR 2019-06 unverdicted novelty 7.0

Adversarial RL approximates a game-theoretic equilibrium to yield a stochastic policy for prioritizing alerts against adaptive attackers in fraud and intrusion detection.
Exploring Model-based Planning with Policy Networks
cs.LG 2019-06 unverdicted novelty 7.0

POPLIN combines policy networks with model-predictive planning by optimizing either action sequences or policy parameters, yielding 3x better sample efficiency than PETS, TD3 and SAC on MuJoCo locomotion tasks.
Soft Actor-Critic Algorithms and Applications
cs.LG 2018-12 unverdicted novelty 7.0

SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
cs.LG 2018-01 accept novelty 7.0

Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.
Goal-Conditioned Agents that Learn Everything All at Once
cs.LG 2026-05 unverdicted novelty 6.0

LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.
Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control
cs.LG 2026-05 unverdicted novelty 6.0

Reflex formalizes axial and bilateral reflection symmetries and adds symmetry regularization to PPO and SAC, reporting better performance and sample efficiency on Gym and DMC benchmarks.
Optimal design of solar-battery hybrid resources considering multi-market participation under weather and price uncertainty
eess.SY 2026-05 unverdicted novelty 6.0

A deep reinforcement learning co-optimization framework is developed for jointly sizing solar-battery hybrids and determining their multi-market bidding strategies under stochastic weather and price conditions.
Policy-DRIFT: Dynamic Reward-Informed Flow Trajectory Steering
physics.flu-dyn 2026-05 unverdicted novelty 6.0

Policy-DRIFT combines conditional flow matching with terminal reward guidance and decoupled DRL to achieve 49% drag reduction in Re_tau=180 channel flow, 16% above DRL benchmarks and with 37 times less actuation energy.
Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients
cs.LG 2026-05 unverdicted novelty 6.0

The k-step policy gradient converges exponentially close to the optimal deterministic policy in restricted classes, achieving O(1/T) rates under smoothness assumptions without distribution mismatch factors.
AdamO: A Collapse-Suppressed Optimizer for Offline RL
cs.LG 2026-05 unverdicted novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
RL Token: Bootstrapping Online RL with Vision-Language-Action Models
cs.LG 2026-04 unverdicted novelty 6.0

RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
A Systematic Review and Taxonomy of Reinforcement Learning-Model Predictive Control Integration for Linear Systems
eess.SY 2026-04 unverdicted novelty 6.0

This review synthesizes existing RL-MPC integration methods for linear systems into a taxonomy across RL roles, algorithms, MPC formulations, costs, and domains while identifying recurring patterns and practical challenges.
Safe Control using Learned Safety Filters and Adaptive Conformal Inference
eess.SY 2026-04 unverdicted novelty 6.0

ACoFi adaptively tunes the switching threshold of learned safety filters using conformal inference on the range of predicted safety values, asymptotically bounding the rate of incorrect safety assessments by a user pa...
Scalable Neighborhood-Based Multi-Agent Actor-Critic
cs.LG 2026-04 unverdicted novelty 6.0

MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.
Distributional Off-Policy Evaluation with Deep Quantile Process Regression
stat.ML 2026-04 unverdicted novelty 6.0

DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.
Mean Flow Policy Optimization
cs.LG 2026-04 conditional novelty 6.0

Mean Flow Policy Optimization (MFPO) uses few-step flow-based models for RL policies and achieves performance on par with or better than diffusion-based methods while substantially lowering training and inference time...
Physics-guided surrogate learning enables zero-shot control of turbulent wings
physics.flu-dyn 2026-04 unverdicted novelty 6.0

Zero-shot RL control trained on matched channel flows reduces skin-friction drag 28.7% and total drag 10.7% on a NACA4412 wing, outperforming opposition control.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...
LLM-Enhanced Reinforcement Learning for Long-Term User Satisfaction in Interactive Recommendation
cs.IR 2026-01 unverdicted novelty 6.0

LERL combines an LLM high-level planner for diverse semantic categories with a low-level RL policy for item selection to improve long-term user satisfaction and reduce content homogeneity in interactive recommenders.
Tensor-Efficient High-Dimensional Q-learning
cs.LG 2025-11 unverdicted novelty 6.0

TEQL uses a low-rank tensor representation of the Q-function plus error-uncertainty guided exploration to achieve better sample efficiency than matrix low-rank or deep RL baselines on classic control tasks under match...
Reinforcement Learning-based Control via Y-wise Affine Neural Networks (YANNs)
eess.SY 2025-08 unverdicted novelty 6.0

YANN-RL initializes RL actor and critic networks with explicit multi-parametric linear MPC solutions via YANNs to start from linear optimal control performance and then learn nonlinear policies through online interaction.
Joint Scheduling of Deferrable and Nondeferrable Demand with Colocated Stochastic Supply
eess.SY 2025-07 unverdicted novelty 6.0

Optimal scheduling of deferrable demands with colocated stochastic supply and piecewise-linear pricing reduces to a finite set of three procrastination thresholds per demand class; a reinforcement learning algorithm l...
Reinforcement Learning with Action Chunking
cs.LG 2025-07 unverdicted novelty 6.0

Q-chunking improves offline-to-online RL sample efficiency on long-horizon sparse-reward manipulation tasks by applying action chunking to TD learning.
SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning
cs.RO 2025-06 unverdicted novelty 6.0

SENIOR improves feedback efficiency and policy learning speed in PbRL by combining motion-distinction query selection via kernel density estimation with preference-guided intrinsic rewards, showing gains on simulated ...
Neural Mean-Field Games: Extending Mean-Field Game Theory with Neural Stochastic Differential Equations
cs.LG 2025-04 unverdicted novelty 6.0

Neural mean-field games integrate mean-field game theory with neural SDEs to learn strategic interactions from data in a model-free way, demonstrated on games and viral dynamics.
Physically Interpretable World Models via Weakly Supervised Representation Learning
cs.LG 2024-12 unverdicted novelty 6.0

PIWM aligns latent states in image-based world models with physical variables and constrains their dynamics to known equations via weak distribution supervision, yielding accurate long-horizon predictions and paramete...
Training Language Models to Self-Correct via Reinforcement Learning
cs.LG 2024-09 unverdicted novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Diffusion Policy Policy Optimization
cs.RO 2024-09 unverdicted novelty 6.0

DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
Koopman-Assisted Reinforcement Learning
cs.AI 2024-03 unverdicted novelty 6.0

Koopman-assisted RL reformulates max-entropy algorithms using controlled Koopman tensors and reports SOTA performance versus neural SAC on Lorenz, fluid flow, and other systems.
TD-MPC2: Scalable, Robust World Models for Continuous Control
cs.LG 2023-10 conditional novelty 6.0

TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
A Survey on Large Language Model based Autonomous Agents
cs.AI 2023-08 accept novelty 6.0

A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
cs.RO 2021-08 accept novelty 6.0

A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all d...
Behavior Regularized Offline Reinforcement Learning
cs.LG 2019-11 unverdicted novelty 6.0

Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 107 Pith papers · 3 internal anchors

[1]

Compatible value gradients for reinforcement learning of continuous deep policies

Balduzzi, David and Ghifary, Muhammad. Compatible value gradients for reinforcement learning of continuous deep policies. arXiv preprint arXiv:1509.03005,

work page arXiv
[2]

Memory-based control with recurrent neural networks

Heess, N., Hunt, J. J, Lillicrap, T. P, and Silver, D. Memory-based control with recurrent neural networks. NIPS Deep Reinforcement Learning Workshop (arXiv:1512.04455),

work page Pith review arXiv
[3]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167,

work page internal anchor Pith review arXiv
[4]

Adam: A Method for Stochastic Optimization

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Evolving deep unsupervised convolutional networks for vision-based reinforcement learning

Koutník, Jan, Schmidhuber, Jürgen, and Gomez, Faustino. Evolving deep unsupervised convolutional networks for vision-based reinforcement learning. In Proceedings of the 2014 conference on Genetic and evolutionary computation, pp. 541–548. ACM, 2014a. Koutník, Jan, Schmidhuber, Jürgen, and Gomez, Faustino. Online evolution of deep convolutional net...

work page 2014
[6]

End-to-End Training of Deep Visuomotor Policies

Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702,

work page Pith review arXiv
[7]

Playing Atari with Deep Reinforcement Learning

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602,

work page internal anchor Pith review Pith/arXiv arXiv 2016
[8]

Trust Region Policy Optimization

Schulman, John, Heess, Nicolas, Weber, Theophane, and Abbeel, Pieter. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3510–3522, 2015a. Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I, and Abbeel, Pieter. Trust region policy optimization. arXiv preprint arXiv:1502.05477...

work page Pith review arXiv
[9]

Synthesis and stabilization of complex behaviors through online trajectory optimization

Tassa, Yuval, Erez, Tom, and Todorov, Emanuel. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 4906–4913. IEEE,

work page 2012
[10]

Proceedings of the 2005, pp. 300–306. IEEE,

work page 2005
[11]

Mujoco: A physics engine for model-based control

Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–

work page 2012
[12]

From Pixels to Torques: Policy Learning with Deep Dynamical Models

Wahlström, Niklas, Schön, Thomas B, and Deisenroth, Marc Peter. From pixels to torques: Policy learning with deep dynamical models. arXiv preprint arXiv:1502.02251,

work page Pith review arXiv
[13]

Published as a conference paper at ICLR 2016. Supplementary Information: Continuous control with deep reinforcement learning. 7 Experiment Details. We used Adam (Kingma & Ba,

work page 2016
[14]

8 Planning Algorithm. Our planner is implemented as a model-predictive controller (Tassa et al., 2012): at every time step we run a single iteration of trajectory optimization (using iLQG, (Todorov & Li, 2005)), starting from the true state of the system. Every single trajectory optimization is planned over a horizon between 250ms and 600ms, and this plan...

work page 2012
[15]

cartpole The classic cart-pole swing-up task

The mass begins each trial in random positions and with random velocities. cartpole The classic cart-pole swing-up task. Agent must balance a pole attached to a cart by applying forces to the cart alone. The pole starts each episode hanging upside-down. cartpoleBalance The classic cart-pole balance task. Agent must balance a pole attached to a cart by a...

work page 2009
