DeepMind Control Suite

Abbas Abdolmaleki; Alistair Muldal; Andrew Lefrancq; David Budden; Diego de las Casas; Josh Merel; Martin Riedmiller; Timothy Lillicrap; Tom Erez; Yazhe Li

arxiv: 1801.00690 · v1 · submitted 2018-01-02 · 💻 cs.AI

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, Martin Riedmiller

Pith reviewed 2026-05-13 07:40 UTC · model grok-4.3

classification 💻 cs.AI

keywords reinforcement learning · continuous control · benchmark suite · MuJoCo · robotics simulation · policy learning

The pith

The DeepMind Control Suite offers a standardized set of continuous control tasks to benchmark reinforcement learning agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the DeepMind Control Suite, a collection of continuous control tasks designed with a standardized structure and interpretable rewards. The tasks are written in Python and use the MuJoCo physics engine, making them straightforward to use and customize. By providing these benchmarks along with performance data for several algorithms, the suite aims to facilitate fair comparisons between different reinforcement learning methods. A sympathetic reader would care because consistent benchmarks can accelerate progress in the field by reducing the need for researchers to create their own test environments.

Core claim

The authors present the Control Suite as a publicly available set of continuous control tasks with standardized structure and interpretable rewards, powered by MuJoCo and implemented in Python, intended to serve as performance benchmarks for reinforcement learning agents.

What carries the argument

The Control Suite, a set of continuous control tasks with standardized structure and interpretable rewards.

If this is right

Algorithms can be evaluated and compared using the same tasks and rewards.
Researchers can easily modify the tasks due to the Python implementation.
The suite includes initial benchmarks for several learning algorithms.
The tasks are accessible to the public via the provided repository.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption could lead to more reproducible results in continuous control research.
Success on these tasks may suggest potential for real-world applications, though further validation would be needed.
The design choices might influence how future control benchmarks are structured.

Load-bearing premise

The selected tasks and their reward functions adequately represent real-world continuous control challenges so that performance generalizes.

What would settle it

Demonstrating that top-performing agents on the Control Suite perform poorly on a new set of similar control tasks not included in the suite would falsify its value as a general benchmark.
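
One way to make this test concrete is to check whether the ranking of agents on the suite predicts their ranking on held-out tasks. The sketch below is illustrative only: the function names and all scores are hypothetical, and it assumes no tied scores. It computes Spearman's rank correlation, where a value near 1 would support the benchmark's generality and a value near 0 or below would count against it.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Return the 0-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman_rank_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical mean returns for four agents; not results from the paper.
    suite_scores = [900.0, 700.0, 500.0, 300.0]
    held_out_scores = [80.0, 60.0, 40.0, 20.0]
    print(spearman_rank_correlation(suite_scores, held_out_scores))
```

A rank-based measure is used here because the suite and a held-out task set need not share a reward scale; only the ordering of agents is compared.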

Original abstract

The DeepMind Control Suite is a set of continuous control tasks with a standardised structure and interpretable rewards, intended to serve as performance benchmarks for reinforcement learning agents. The tasks are written in Python and powered by the MuJoCo physics engine, making them easy to use and modify. We include benchmarks for several learning algorithms. The Control Suite is publicly available at https://www.github.com/deepmind/dm_control . A video summary of all tasks is available at http://youtu.be/rAai4QzcYbs .
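
To make "standardised structure and interpretable rewards" concrete, here is a minimal toy sketch. It is not code from dm_control. The `TimeStep` container, the `ToyPointMass` task, and the `run_episode` helper are hypothetical names invented for illustration, and bounding the per-step reward in [0, 1] is an assumption of this sketch, chosen to show how a bounded reward makes episode returns easy to read.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TimeStep:
    """Hypothetical container for one environment transition."""
    observation: List[float]
    reward: float  # in [0, 1] in this toy
    last: bool     # True on the final step of the episode


class ToyPointMass:
    """Hypothetical 1-D point mass; not one of the suite's tasks."""

    def __init__(self, episode_length: int = 1000, dt: float = 0.01):
        self.episode_length = episode_length
        self.dt = dt
        self.reset()

    def reset(self) -> TimeStep:
        self.position, self.velocity, self.steps = 1.0, 0.0, 0
        return TimeStep([self.position, self.velocity], 0.0, False)

    def step(self, action: float) -> TimeStep:
        action = max(-1.0, min(1.0, action))  # clip to the action bounds
        self.velocity += action * self.dt     # semi-implicit Euler update
        self.position += self.velocity * self.dt
        self.steps += 1
        # Reward is 1 at the target (the origin) and decays with distance.
        reward = math.exp(-self.position ** 2)
        done = self.steps >= self.episode_length
        return TimeStep([self.position, self.velocity], reward, done)


def run_episode(env: ToyPointMass, policy: Callable[[List[float]], float]) -> float:
    """Run one fixed-length episode and return the summed reward."""
    timestep = env.reset()
    total = 0.0
    while not timestep.last:
        timestep = env.step(policy(timestep.observation))
        total += timestep.reward
    return total


if __name__ == "__main__":
    env = ToyPointMass(episode_length=100)
    print(run_episode(env, lambda obs: 0.0))               # idle policy
    print(run_episode(env, lambda obs: -obs[0] - obs[1]))  # simple PD policy
```

Because every task in such a design shares the same reset/step loop and a reward capped at 1 per step, the return of any episode lies between 0 and the episode length, so scores from different tasks can be read on one scale.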

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is mainly a practical code release for a standardized MuJoCo-based RL benchmark suite with baselines, useful for comparisons but not a conceptual leap.

The paper's main value is the release of dm_control, a Python wrapper around MuJoCo that gives a consistent interface, task structure, and interpretable rewards across a set of continuous control problems. They ship the full codebase, clear task definitions, and baseline runs for a handful of algorithms, which makes it immediately usable for running experiments and comparing results across papers. That addresses a real pain point in the field where everyone was rolling their own environments or using inconsistent setups. The video and GitHub link are straightforward additions that help adoption.

What is actually new is the curation and standardization rather than any individual task or algorithm; many of the underlying physics problems existed in prior MuJoCo examples or other suites. The baselines are solid enough to get people started but not exhaustive, which fits the scope. No load-bearing math claims or derivations appear, so nothing to poke holes in there. The main soft spot is that the tasks remain somewhat artificial, and the paper does not provide evidence that success here transfers to messier real-world control; it simply offers the benchmark without overclaiming generalization. Citation patterns are clean and point to relevant prior work on MuJoCo and RL without padding.

This is for RL researchers who need a common testbed for continuous control agents, especially those doing incremental algorithm work or benchmarking. A serious referee should see it because the release is reproducible, the code works, and it fills a practical gap even if the intellectual novelty is modest. I would send it to peer review rather than desk reject.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces the DeepMind Control Suite, a collection of continuous control tasks implemented in Python and powered by the MuJoCo physics engine. The tasks feature a standardized structure and interpretable rewards, are intended to serve as performance benchmarks for reinforcement learning agents, and the paper supplies baseline results for several algorithms along with a public code release at https://www.github.com/deepmind/dm_control.

Significance. The release of a standardized, open-source benchmark suite with working code, clear task definitions, and reported baseline numbers constitutes a useful contribution to the RL community by enabling reproducible comparisons on continuous control problems. The absence of free parameters or invented entities in the central claim, combined with the provision of executable environments, strengthens the practical value if the suite sees adoption.

minor comments (2)

[Baselines] § on baseline experiments: specify the exact number of random seeds and the precise hyperparameter settings used for each algorithm to allow exact reproduction of the reported scores.
[Task descriptions] Figure 1 (task illustrations): ensure all panels use consistent axis scaling and label units explicitly so that reward magnitudes are immediately comparable across tasks.
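
The first comment, on seeds and hyperparameters, could be addressed with a reporting record along these lines. This is a minimal sketch under stated assumptions: `fake_run` is a hypothetical stand-in for training and evaluating one agent, and the hyperparameter names and values are placeholders, not settings from the paper.

```python
import random
import statistics
from typing import Callable, Dict, List


def evaluate_over_seeds(run: Callable[[int], float], seeds: List[int]) -> Dict[str, float]:
    """Run one experiment per seed and summarize the resulting scores."""
    scores = [run(seed) for seed in seeds]
    mean = statistics.mean(scores)
    # Standard error of the mean; defined as 0 when only one seed is given.
    stderr = statistics.stdev(scores) / len(scores) ** 0.5 if len(scores) > 1 else 0.0
    return {"n_seeds": len(scores), "mean": mean, "stderr": stderr}


def fake_run(seed: int) -> float:
    """Hypothetical stand-in for training and evaluating one agent."""
    rng = random.Random(seed)  # seeding makes each run repeatable
    return 500.0 + rng.gauss(0.0, 25.0)


if __name__ == "__main__":
    seeds = [0, 1, 2, 3, 4]
    report = {
        # Placeholder hyperparameters; list every setting used in practice.
        "hyperparameters": {"learning_rate": 1e-4, "batch_size": 64},
        "seeds": seeds,
        **evaluate_over_seeds(fake_run, seeds),
    }
    print(report)
```

Publishing the seed list and the full hyperparameter dictionary alongside the mean and standard error lets another group rerun the same configuration and compare against the same summary statistics.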

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review of the manuscript and their recommendation to accept. We are pleased that the standardized benchmark suite and its public release are viewed as a useful contribution to the reinforcement learning community.

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents the DeepMind Control Suite as a collection of standardized continuous-control environments with interpretable rewards, implemented in Python atop MuJoCo, together with baseline runs of several existing RL algorithms. No derivation chain, predictive claim, or uniqueness theorem is advanced; the central contribution is the release of the task definitions and code at the cited GitHub repository, whose correctness is directly verifiable by inspection and execution rather than by any reduction to author-defined parameters or self-citations. Baseline numbers are simply reported outcomes of running published algorithms on the released tasks and do not constitute fitted predictions that loop back to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software and benchmarking contribution rather than a mathematical derivation. No free parameters are fitted to produce a central claim, no new axioms are introduced, and no invented entities are postulated.

pith-pipeline@v0.9.0 · 5413 in / 1005 out tokens · 28109 ms · 2026-05-13T07:40:16.913388+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Language Game: Talking to Non-Human Systems
cs.LG 2026-05 unverdicted novelty 8.0

A language-game framework enables dialogue with dynamical systems such as GRNs by treating their frozen dynamics as an RL policy core, using an LM to route prompts so the system responds through its own behavior witho...
Adversarial Imitation Learning with General Function Approximation: Theoretical Analysis and Practical Algorithms
cs.LG 2026-05 unverdicted novelty 8.0

OPT-AIL provides the first provably efficient adversarial imitation learning algorithms under general function approximation, achieving polynomial expert sample and interaction complexity.
BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation
cs.RO 2024-03 accept novelty 8.0

BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.
WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents
cs.LG 2026-05 unverdicted novelty 7.0

WMAttack automates finite-budget attack search for world-model agents via SCAS and RGAR, reporting higher normalized reward drops than baselines on Atari and DMC tasks.
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
cs.RO 2026-05 unverdicted novelty 7.0

ARC-RL provides four new MuJoCo continuous-control environments with hexapod and quadruped morphologies inspired by ARC Raiders, a unified multi-component reward without motion capture, CPG expert demonstrators, and e...
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation
cs.LG 2026-05 unverdicted novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
cs.LG 2026-05 unverdicted novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...
Generative Actor-Critic with Soft Bridge Policies
cs.LG 2026-05 unverdicted novelty 7.0

SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
cs.AI 2026-05 unverdicted novelty 7.0

Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
cs.RO 2026-04 unverdicted novelty 7.0

KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
cs.RO 2026-04 unverdicted novelty 7.0

ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
cs.RO 2026-04 unverdicted novelty 7.0

ACO-MoE employs agent-centric mixture-of-experts to decouple task-relevant features from dynamic visual perturbations in RL, recovering 95.3% of clean performance on the new VDCS benchmark.
Leveraging Human Feedback for Semantically-Relevant Skill Discovery
cs.LG 2026-04 unverdicted novelty 7.0

SRSD uses human-provided semantic labels to learn rewards that encourage reinforcement learning agents to discover a wide variety of meaningful and distinct behaviors.
Intentional Updates for Streaming Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.
Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation
stat.ML 2026-04 unverdicted novelty 7.0

High-order generator regression from multi-step trajectories yields a second-order accurate estimator for finite-horizon continuous-time policy evaluation that outperforms the Bellman baseline in calibration studies a...
Learning Interactive Real-World Simulators
cs.AI 2023-10 conditional novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Mastering Diverse Domains through World Models
cs.AI 2023-01 unverdicted novelty 7.0

DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
A Generalist Agent
cs.AI 2022-05 accept novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
Dream to Control: Learning Behaviors by Latent Imagination
cs.LG 2019-12 accept novelty 7.0

Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
Benchmarking Model-Based Reinforcement Learning
cs.LG 2019-07 accept novelty 7.0

Introduces a benchmark suite of over 18 MBRL environments, evaluates multiple algorithms under consistent settings, and identifies three core challenges: dynamics bottleneck, planning horizon dilemma, and early-termin...
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
cs.RO 2026-05 accept novelty 6.0

ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical c...
Sampling-Based Safe Reinforcement Learning
cs.LG 2026-05 conditional novelty 6.0

SBSRL approximates worst-case safety optimization over uncertain dynamics via finite sampling, adds epistemic-uncertainty-constrained exploration, and supplies high-probability safety guarantees plus finite-time sampl...
PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics
cs.LG 2026-05 unverdicted novelty 6.0

PH-Dreamer integrates a port-Hamiltonian framework into generative world models to enforce physical priors, yielding tighter imagined-real reward alignment and reduced latent space volume on visual control benchmarks.
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
cs.LG 2026-05 unverdicted novelty 6.0

R2R2 introduces a non-centered regularization objective for SPL that addresses conflicts with spectral properties, leading to better performance on continuous control tasks at high UTD ratios.
Debiased Model-based Representations for Sample-efficient Continuous Control
cs.LG 2026-05 unverdicted novelty 6.0

DR.Q debiases model-based representations for Q-learning by maximizing mutual information between state-action and next-state representations and applying faded prioritized experience replay, achieving competitive or ...
When Does Non-Uniform Replay Matter in Reinforcement Learning?
cs.LG 2026-05 unverdicted novelty 6.0

Non-uniform replay helps off-policy RL mainly at low replay volumes, high-entropy sampling matters even at similar recency, and Truncated Geometric replay offers a low-overhead practical solution.
Predictive but Not Plannable: RC-aux for Latent World Models
cs.LG 2026-05 unverdicted novelty 6.0

RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
cs.AI 2026-05 unverdicted novelty 6.0

Agentick is a new unified benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human agents across 37 tasks, showing no single approach dominates.
Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Hidden states in recurrent RL policies correspond to PMP co-states, so a derived co-state loss structures the dynamics and yields robust performance on partially observable continuous control tasks.
Extending Differential Temporal Difference Methods for Episodic Problems
cs.LG 2026-05 unverdicted novelty 6.0

A generalization of differential TD extends it to episodic settings while preserving policy ordering, inheriting linear TD guarantees, and improving sample efficiency.
TRAP: Tail-aware Ranking Attack for World-Model Planning
cs.LG 2026-05 unverdicted novelty 6.0

TRAP is a tail-aware ranking attack that plants a backdoor in world models so that a trigger causes the model to reorder a few critical imagined trajectories and redirect planning while preserving normal behavior on c...
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
Improving Zero-Shot Offline RL via Behavioral Task Sampling
cs.AI 2026-04 unverdicted novelty 6.0

Extracting task vectors from the offline dataset for policy training improves zero-shot offline RL performance by an average of 20% over random sampling baselines.
Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation
stat.ML 2026-04 unverdicted novelty 6.0

High-order moment-matching estimation of the time-dependent generator improves continuous-time policy evaluation accuracy over first-order Bellman recursion by canceling lower-order truncation terms, with supporting e...
Mean Flow Policy Optimization
cs.LG 2026-04 conditional novelty 6.0

Mean Flow Policy Optimization (MFPO) uses few-step flow-based models for RL policies and achieves performance on par with or better than diffusion-based methods while substantially lowering training and inference time...
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
cs.LG 2026-03 unverdicted novelty 6.0

LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.
How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?
cs.LG 2026-02 unverdicted novelty 6.0

ALGD augments the Lagrangian to locally convexify the energy landscape in diffusion models, stabilizing safe RL training and generation without changing optimal policies.
Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
cs.LG 2025-10 unverdicted novelty 6.0

MINTO sets bootstrapped targets to the minimum of online and target network estimates, yielding faster stable value learning across online/offline RL and discrete/continuous actions.
Robust Remote Reinforcement Learning over Unreliable Communication Channels using Homomorphic State Encoding
cs.LG 2025-08 unverdicted novelty 6.0

HR3L enables robust remote RL training over unreliable channels via homomorphic state encoding without gradient exchange, outperforming prior methods in sample efficiency and adapting to packet loss, delays, and bandw...
RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation
cs.RO 2025-07 unverdicted novelty 6.0

RoboEval is a new benchmark providing eight bimanual tasks, thousands of expert demonstrations, and standardized metrics for efficiency, coordination, safety, and failure localization in robotic manipulation.
Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight
cs.RO 2025-01 unverdicted novelty 6.0

DreamerV3 enables pixel-to-control policies for drone racing that reach 9 m/s in both simulation and real hardware-in-the-loop tests.
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
cs.RO 2024-11 unverdicted novelty 6.0

DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
A Survey on Vision-Language-Action Models for Embodied AI
cs.RO 2024-05 unverdicted novelty 6.0

This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
Arena: a toolkit for Multi-Agent Reinforcement Learning
cs.LG 2019-07 accept novelty 6.0

Arena introduces a modular Interface design that extends OpenAI Gym wrappers to support complex multi-agent RL scenarios including self-play and cooperative-competitive interactions.
Continual Reinforcement Learning with Diversity Exploration and Adversarial Self-Correction
cs.LG 2019-06 unverdicted novelty 6.0

CDAN framework uses diversity exploration and adversarial self-correction for continual RL in continuous control, evaluated on new CAM environment with NSD metric showing 18.35% NSD improvement over baseline.
Implicit Action Chunking for Smooth Continuous Control
cs.RO 2026-05 unverdicted novelty 5.0

Dual-Window Smoothing uses an execution window for deterministic smoothness and a value window to correct critic bias, plus a first-order temporal regularizer, to achieve smoother RL control than explicit chunking or ...
When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited
cs.LG 2026-05 unverdicted novelty 5.0

Robust minimax task inference in BFMs achieves dynamics-shift robustness from nominal offline data alone and outperforms standard baselines.
Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling
cs.LG 2026-05 unverdicted novelty 5.0

A delay-aware RL approach learns transferable structured representations and dynamics via implicit causal graphs, outperforming baselines on delayed DMC tasks and accelerating adaptation to new tasks.
When Does Non-Uniform Replay Matter in Reinforcement Learning?
cs.LG 2026-05 unverdicted novelty 5.0

Non-uniform replay helps most when replay volume is low; high-entropy sampling remains important, and a truncated geometric distribution delivers better sample efficiency with negligible overhead.
When Does Non-Uniform Replay Matter in Reinforcement Learning?
cs.LG 2026-05 unverdicted novelty 5.0

Non-uniform replay improves RL sample efficiency mainly in low replay-volume regimes, with high-entropy sampling being key even at comparable recency.
HaM-World: Soft-Hamiltonian World Models with Selective Memory for Planning
cs.AI 2026-05 unverdicted novelty 5.0

HaM-World integrates soft-Hamiltonian dynamics with selective state-space memory to reduce long-horizon rollout error by 55% and achieve top returns under 12 OOD perturbations on DeepMind Control Suite tasks.
Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

Recurrent RL policies can have their hidden states aligned with PMP co-states through a derived loss, yielding robust performance on partially observable control tasks.
Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 5.0

LoRA applied to critics in SAC and FastTD3 reduces critic loss and yields best or competitive policy performance on most evaluated tasks.
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
cs.AI 2026-03 unverdicted novelty 5.0

An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
SLOPE: Optimistic Potential Landscape Shaping for Model-based Reinforcement Learning
cs.LG 2026-02 unverdicted novelty 5.0

SLOPE improves MBRL in sparse reward settings by using optimistic distributional regression to build informative potential landscapes that provide better exploration gradients, outperforming baselines across 30+ tasks...
D2 Actor Critic: Diffusion Actor Meets Distributional Critic
cs.LG 2025-10 unverdicted novelty 5.0

D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.
Intention-Conditioned Flow Occupancy Models
cs.LG 2025-06 unverdicted novelty 5.0

InFOM applies flow matching to model intention-conditioned occupancy measures for RL pre-training, reporting 1.8x median return gains and 36% higher success rates on benchmarks.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 54 Pith papers · 4 internal anchors

[1]

Layer Normalization

Anonymous. Distributed prioritized experience replay. Under submission, 2017a. Anonymous. Distributional policy gradients. Under submission, 2017b. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

doi: 10.1109/TSMC.1983.6313077

ISSN 0018-9472. doi: 10.1109/TSMC.1983.6313077. Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research,

work page doi:10.1109/tsmc.1983.6313077 1983
[3]

A Distributional Perspective on Reinforcement Learning

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887,

work page Pith review arXiv
[4]

Simulation tools for model-based robotics: Comparison of bullet, havok, mujoco, ode and physx

Tom Erez, Yuval Tassa, and Emanuel Todorov. Simulation tools for model-based robotics: Comparison of bullet, havok, mujoco, ode and physx. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 4397–4404. IEEE,

work page 2015
[5]

Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control

Riashat Islam, Peter Henderson, Maziar Gomrokchi, and Doina Precup. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133,

work page Pith review arXiv
[6]

Adam: A Method for Stochastic Optimization

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Learning human behaviors from motion capture by adversarial imitation

Josh Merel, Yuval Tassa, TB Dhruva, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201,

work page internal anchor Pith review arXiv
[9]

Asynchronous Methods for Deep Reinforcement Learning

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783,

work page Pith review arXiv
[10]

ISSN 0028-0836. Letter. Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073,

work page Pith review arXiv
[11]

Prioritized Experience Replay

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952,

work page Pith review arXiv
[12]

Synthesis and stabilization of complex behaviors through online trajectory optimization

Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 4906–4913. IEEE,

work page 2012
[13]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE,

work page 2012
