Continual Reinforcement Learning with Diversity Exploration and Adversarial Self-Correction

Fengda Zhu; Mingkui Tan; Runhao Zeng; Xiaojun Chang

arXiv: 1906.09205 · v1 · pith:LXW7ADRA · submitted 2019-06-21 · cs.LG · cs.AI · cs.RO · stat.ML

Pith reviewed 2026-05-25 19:03 UTC · model grok-4.3

classification: cs.LG, cs.AI, cs.RO, stat.ML

keywords: continual reinforcement learning, catastrophic forgetting, diversity exploration, adversarial learning, continuous control, unsupervised skill discovery

The pith

The CDAN framework overcomes catastrophic forgetting in continual reinforcement learning for continuous control by pairing unsupervised diversity exploration with adversarial self-correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve the problem of agents forgetting earlier tasks when learning new ones in sequence, a common failure mode in reinforcement learning for physical control problems. It introduces an end-to-end method that first uses an unsupervised objective to discover diverse, task-specific behaviors even when rewards do not directly indicate which task is active. It then adds an adversarial mechanism that draws on stored past experience to correct and retain earlier policies. The authors treat these two procedures as mutually reinforcing and test the combined system on a new sequential maze environment for continuous control agents.

Core claim

The central claim is that an unsupervised diversity exploration step can extract task-specific skills without relying on task-reward alignment, while an adversarial self-correction step can exploit previous experience to prevent forgetting, and that these two steps support each other in continuous control domains.

What carries the argument

Continual Diversity Adversarial Network (CDAN), which runs unsupervised diversity exploration to generate task-specific skills alongside adversarial self-correction that reuses past trajectories to stabilize earlier policies.
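The summary above does not give the paper's unsupervised objective, so the following is only a minimal sketch of what a diversity-exploration reward can look like. It assumes a DIAYN-style pseudo-reward (in the spirit of Eysenbach et al., listed among the references), where a discriminator q(z|s) tries to guess which skill z produced a state s. The linear-softmax discriminator, the uniform skill prior, and the names `diversity_reward`, `weights`, and `bias` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def diversity_reward(states, skill_ids, weights, bias):
    """DIAYN-style pseudo-reward: log q(z|s) - log p(z).

    Assumptions (not from the paper): q is a linear-softmax discriminator
    with parameters `weights` (state_dim x n_skills) and `bias` (n_skills),
    and the skill prior p(z) is uniform.
    """
    n_skills = weights.shape[1]
    log_q = log_softmax(states @ weights + bias)
    # Pick the log-probability of the skill that actually generated each state.
    log_q_z = log_q[np.arange(len(skill_ids)), skill_ids]
    # With a uniform prior, -log p(z) = log(n_skills).
    return log_q_z + np.log(n_skills)


# Toy usage: three one-hot "states", each visited by a different skill.
states = np.eye(3)
skill_ids = np.array([0, 1, 2])
sharp = diversity_reward(states, skill_ids, 10.0 * np.eye(3), np.zeros(3))
blind = diversity_reward(states, skill_ids, np.zeros((3, 3)), np.zeros(3))
```

Under these assumptions the reward is high when skills visit states that make them easy to tell apart, and zero when the discriminator cannot distinguish them, which is the sense in which the objective needs no task reward.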

If this is right

Agents can acquire distinct behaviors for each task in a sequence even without explicit task labels in the reward.
Stored experience from earlier tasks can be turned into an adversarial signal that actively corrects policy drift on new tasks.
The two procedures together produce measurable gains on a sequential continuous-control benchmark, in both path efficiency (NSD) and average reward.
A dedicated environment and metric become available for comparing future continual reinforcement learning methods on continuous domains.
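The second point, turning stored experience into an adversarial signal, can be illustrated with a small sketch. The page does not state the paper's adversarial loss, so this borrows the standard GAN discriminator objective (Goodfellow et al., listed among the references) as a stand-in: a logistic discriminator is trained to separate stored past samples from current-policy samples, and the policy is rewarded when its behaviour looks like the stored behaviour. The feature vectors, the weight vector `w`, and both function names are hypothetical.

```python
import numpy as np

EPS = 1e-8  # guards the logarithms against log(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def discriminator_loss(w, past_batch, current_batch):
    """Binary cross-entropy with stored past samples labeled 1
    and current-policy samples labeled 0 (an assumed setup)."""
    d_past = sigmoid(past_batch @ w)
    d_curr = sigmoid(current_batch @ w)
    return -(np.log(d_past + EPS).mean() + np.log(1.0 - d_curr + EPS).mean())


def correction_reward(w, current_batch):
    """Per-sample reward that is higher when the discriminator
    mistakes current behaviour for stored past behaviour."""
    return np.log(sigmoid(current_batch @ w) + EPS)


# Toy usage with 2-D features; the values are placeholders, not paper data.
w = np.array([1.0, 0.0])
past = np.array([[5.0, 0.0]])
drifted = np.array([[-5.0, 0.0]])
loss = discriminator_loss(w, past, drifted)
penalty = correction_reward(w, drifted)
```

In this sketch a policy that drifts away from stored behaviour is easy for the discriminator to spot and receives a low correction reward, which is one way an adversarial term could push against forgetting.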

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pairing of unsupervised diversity and adversarial correction could be tried in discrete-action or partially observable settings where task boundaries are also unclear.
If the reciprocal benefit holds, one might expect performance to improve further by iterating the two procedures more tightly within each episode rather than across separate phases.
Real-world robot platforms that must master successive manipulation or navigation skills could adopt the method once the unsupervised exploration is made safe for physical hardware.

Load-bearing premise

The assumption that an unsupervised objective can still produce useful task-specific skills when rewards and task identity are only loosely related, and that the diversity and correction procedures reinforce each other.

What would settle it

A controlled run on the Continual Ant Maze environment in which removing the unsupervised diversity term produces no measurable drop in the ability to handle later tasks, or in which the adversarial correction term fails to improve retention of earlier task performance.
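A check of this kind could be scored as in the sketch below, which compares per-seed scores of the full model against an ablated variant and reports the mean gap with a percentile-bootstrap interval. The function name `ablation_gap`, the choice of bootstrap, and the example scores are assumptions for illustration; the numbers are placeholders, not results from the paper.

```python
import numpy as np


def ablation_gap(full_scores, ablated_scores, n_boot=2000, seed=0):
    """Mean score gap (full minus ablated) with a 95% percentile-bootstrap interval.

    Each input is a sequence of per-seed scores (for example NSD or average reward).
    """
    full = np.asarray(full_scores, dtype=float)
    ablated = np.asarray(ablated_scores, dtype=float)
    rng = np.random.default_rng(seed)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        # Resample seeds with replacement, independently for each variant.
        f = rng.choice(full, size=full.size, replace=True)
        a = rng.choice(ablated, size=ablated.size, replace=True)
        gaps[i] = f.mean() - a.mean()
    low, high = np.percentile(gaps, [2.5, 97.5])
    return full.mean() - ablated.mean(), low, high


# Placeholder per-seed scores, purely illustrative.
gap, low, high = ablation_gap(
    [0.90, 0.92, 0.88, 0.91, 0.89],
    [0.50, 0.52, 0.48, 0.51, 0.49],
)
```

An interval that includes zero for the "no diversity term" or "no correction term" variant would be the outcome that counts against the corresponding component.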

Figures

Figures reproduced from arXiv: 1906.09205 by Fengda Zhu, Mingkui Tan, Runhao Zeng, Xiaojun Chang.

**Figure 2.** The pipeline of the Continual Diversity Adversarial Network (CDAN). The current step …

**Figure 3.** Mazes in our dataset. The mazes differ in size, shape, and complexity.

**Figure 4.** Training and testing results in our environment. (a) compares the training rewards between …

**Figure 5.** Trajectories sampled from baseline and baseline+DE in different tasks. The same color denotes trajectories sampled from the same task. We also show how diversity exploration affects our training process. As …

**Figure 6.** The trajectories of different models. We visualize the trajectories of three models in the …

Original abstract

Deep reinforcement learning has made significant progress in the field of continuous control, such as physical control and autonomous driving. However, it is challenging for a reinforcement model to learn a policy for each task sequentially due to catastrophic forgetting. Specifically, the model would forget knowledge it learned in the past when trained on a new task. We consider this challenge from two perspectives: i) acquiring task-specific skills is difficult since task information and rewards are not highly related; ii) learning knowledge from previous experience is difficult in continuous control domains. In this paper, we introduce an end-to-end framework namely Continual Diversity Adversarial Network (CDAN). We first develop an unsupervised diversity exploration method to learn task-specific skills using an unsupervised objective. Then, we propose an adversarial self-correction mechanism to learn knowledge by exploiting past experience. The two learning procedures are presumably reciprocal. To evaluate the proposed method, we propose a new continuous reinforcement learning environment named Continual Ant Maze (CAM) and a new metric termed Normalized Shorten Distance (NSD). The experimental results confirm the effectiveness of diversity exploration and self-correction. It is worthwhile noting that our final result outperforms baseline by 18.35% in terms of NSD, and 0.61 according to the average reward.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

CDAN pairs unsupervised diversity exploration with adversarial correction for continual RL, but the gains rest on an abstract-level claim without visible ablations or setup details.

The paper introduces CDAN, an end-to-end framework that first uses an unsupervised objective to acquire diverse task-specific skills and then applies adversarial self-correction to retain earlier knowledge. It also defines a new Continual Ant Maze environment and the Normalized Shorten Distance metric. The target problem, catastrophic forgetting in sequential continuous-control tasks, is real and relevant to physical control and driving scenarios. The two-part structure is a reasonable response to the stated difficulties: weak reward-task alignment and the challenge of reusing experience in continuous domains. The idea that the two procedures can reinforce each other is worth testing if the paper actually demonstrates it.

The main limitation is the thin evidence. The abstract states that the method outperforms a baseline by 18.35% on NSD and 0.61 on average reward, yet supplies no description of the baseline, number of seeds, variance, or training protocol. The reciprocity is labeled “presumably,” which signals that the paper may not contain the ablations or transfer analyses needed to show one component actually improves the other. A custom environment and metric are introduced without discussion of whether they isolate the claimed interaction or simply reward environment-specific tuning.

Readers already working on continual RL or robotics control will want to see the full experimental section before deciding whether the framework moves the needle. The work is coherent enough on its own terms to merit referee time, though any review would need to press hard on the missing validation steps.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Continual Diversity Adversarial Network (CDAN) as an end-to-end framework for continual reinforcement learning in continuous control domains. It introduces an unsupervised diversity exploration method to acquire task-specific skills via an unsupervised objective and an adversarial self-correction mechanism to mitigate catastrophic forgetting by exploiting past experience; these two procedures are described as presumably reciprocal. A new environment (Continual Ant Maze, CAM) and metric (Normalized Shorten Distance, NSD) are presented, with experimental results claimed to confirm effectiveness and to outperform a baseline by 18.35% in NSD and 0.61 in average reward.

Significance. If the reciprocity between the unsupervised and adversarial components can be demonstrated and the results shown to generalize beyond the custom CAM environment, the work would address a core challenge in continual RL by providing a mechanism for skill acquisition without strong task-reward correlation. The introduction of a new benchmark and metric could also be useful if properly validated against existing continuous-control suites.

major comments (3)

[Abstract] The assertion that the two learning procedures 'are presumably reciprocal' is load-bearing for the central claim yet is presented without ablations, transfer analysis, or mechanistic evidence showing that diversity exploration improves self-correction (or vice versa); this leaves the interaction unverified.
[Abstract / Experimental Results] The new CAM environment and NSD metric are introduced to support the 18.35% NSD improvement claim, but no validation against standard continuous-control benchmarks (e.g., MuJoCo tasks) or existing metrics is described, raising the possibility that reported gains are environment-specific artifacts rather than general evidence for the framework.
[Abstract] Quantitative claims of outperforming the baseline by 18.35% NSD and 0.61 average reward are stated without any reference to implementation details, baseline algorithms, number of runs, variance, or statistical tests, which prevents assessment of whether the central effectiveness claim is reproducible.

minor comments (2)

[Abstract] The phrase 'presumably reciprocal' is vague; replacing it with a precise statement of the hypothesized interaction would improve clarity.
[Abstract] The abstract provides no information on the unsupervised objective function or the adversarial loss formulation; adding these equations (even at high level) would aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below, indicating where revisions to the manuscript are warranted.

Referee: [Abstract] The assertion that the two learning procedures 'are presumably reciprocal' is load-bearing for the central claim yet is presented without ablations, transfer analysis, or mechanistic evidence showing that diversity exploration improves self-correction (or vice versa); this leaves the interaction unverified.

Authors: We agree that the phrasing 'presumably reciprocal' in the abstract implies a mutual benefit that is not directly verified through dedicated ablations or transfer analysis. The manuscript demonstrates that each component contributes to performance and that the combined CDAN framework outperforms baselines, but does not isolate the reciprocal interaction. We will revise the abstract to remove or qualify this claim and add an ablation study examining the interaction between the two procedures in the revised manuscript.

Revision: yes

Referee: [Abstract / Experimental Results] The new CAM environment and NSD metric are introduced to support the 18.35% NSD improvement claim, but no validation against standard continuous-control benchmarks (e.g., MuJoCo tasks) or existing metrics is described, raising the possibility that reported gains are environment-specific artifacts rather than general evidence for the framework.

Authors: The CAM environment and NSD metric were introduced to specifically evaluate continual learning in sequential task settings with continuous control, which existing MuJoCo suites do not directly provide. However, we acknowledge that demonstrating results on standard benchmarks would better support generality. We will add experiments on selected MuJoCo tasks using both NSD and standard metrics in the revision.

Revision: yes

Referee: [Abstract] Quantitative claims of outperforming the baseline by 18.35% NSD and 0.61 average reward are stated without any reference to implementation details, baseline algorithms, number of runs, variance, or statistical tests, which prevents assessment of whether the central effectiveness claim is reproducible.

Authors: The abstract is intentionally concise, but we agree it should reference key experimental details for the reported numbers. The full manuscript contains implementation details, baseline descriptions, and results averaged over multiple random seeds with variance; we will update the abstract to include the number of runs, a mention of variance, and a note on statistical testing.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on proposed methods and experiments

The paper introduces CDAN with unsupervised diversity exploration and adversarial self-correction, asserts they are 'presumably reciprocal,' and validates via experiments on the new CAM environment and NSD metric, reporting specific improvements (18.35% NSD, 0.61 reward). No equations, derivations, or self-citations are shown that reduce any central claim to fitted inputs or prior self-work by construction. The derivation chain is self-contained against external benchmarks and empirical results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on standard RL assumptions plus the domain assumption that the two proposed procedures work and are reciprocal; no free parameters or invented physical entities are described.

axioms (2)

domain assumption: Unsupervised diversity exploration can learn task-specific skills when task information and rewards are not highly related.
Stated as the first perspective and component of the framework in the abstract.
domain assumption: Adversarial self-correction can learn knowledge from previous experience in continuous control domains.
Stated as the second perspective and component of the framework in the abstract.

invented entities (3)

CDAN · no independent evidence
purpose: End-to-end framework combining the two procedures
Proposed as the overall method.
CAM · no independent evidence
purpose: New continuous reinforcement learning environment for evaluation
Proposed to evaluate the method.
NSD · no independent evidence
purpose: New metric termed Normalized Shorten Distance
Proposed for measuring performance.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 20 internal anchors

[1] J. Achiam, H. Edwards, D. Amodei, and P. Abbeel. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299, 2018.
[2] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch. Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748, 2017.
[3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
[4] J. Booth and J. Booth. Marathon environments: Multi-agent continuous control benchmarks in a modern video game engine. arXiv preprint arXiv:1902.09097, 2019.
[5] S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. Courville. HoME: A household multimodal environment. arXiv preprint arXiv:1711.11017, 2017.
[6] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
[7] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
[8] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
[9] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau, et al. An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning, 11(3-4):219–354, 2018.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[11] K. Gregor, D. J. Rezende, and D. Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
[12] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
[13] C. Kaplanis, M. Shanahan, and C. Clopath. Continual reinforcement learning with complex synapses. arXiv preprint arXiv:1802.07239, 2018.
[14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[15] D. Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
[16] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
[17] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[19] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
[20] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[21] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[22] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
[23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[24] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
[25] P. Sun, X. Sun, L. Han, J. Xiong, Q. Wang, B. Li, Y. Zheng, J. Liu, Y. Liu, H. Liu, et al. TStarBots: Defeating the cheating level builtin AI in StarCraft II in the full game. arXiv preprint arXiv:1809.07193, 2018.
[26] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[27] L. Tai, G. Paolo, and M. Liu. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 31–36. IEEE, 2017.
[28] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
[29] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[30] S. Wang, D. Jia, and X. Weng. Deep reinforcement learning for autonomous driving. arXiv preprint arXiv:1811.11329, 2018.
[31] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
[32] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian. Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209, 2018.
