arxiv: 1907.00664 · v1 · pith:2CI4RNTTnew · submitted 2019-07-01 · 💻 cs.LG · stat.ML

Learning World Graphs to Accelerate Hierarchical Reinforcement Learning

Pith reviewed 2026-05-25 12:28 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords hierarchical reinforcement learning · world graph · pivotal states · goal-conditioned policy · curiosity-driven exploration · maze navigation · task transfer

The pith

A learned world graph of pivotal states lets hierarchical agents solve new tasks by planning over the graph and traversing long paths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes to build a graph abstraction over an environment, with nodes as important pivotal states and edges as feasible traversals between them. The approach has two stages. In the first, a latent pivotal state model and a curiosity-driven goal-conditioned policy are trained jointly, without any task-specific information, to produce the graph. In the second, a high-level Manager uses the graph on new tasks to quickly find solutions and set subgoals at pivotal states for a low-level Worker, which then traverses long distances and explores non-locally using the graph. The authors report ablation studies on a suite of challenging maze tasks showing significant advantages in performance and efficiency over baselines without the world graph. A sympathetic reader would care because the approach reuses learned environment structure to handle multiple tasks without starting from scratch each time.

Core claim

The paper claims that a latent pivotal state model jointly trained with a curiosity-driven goal-conditioned policy in a task-agnostic manner produces a world graph abstraction. Provided with this graph, a high-level Manager quickly finds solutions to new tasks and expresses subgoals in reference to pivotal states to a low-level Worker, which leverages the graph to traverse to those states even across long distances and to explore non-locally, yielding better performance and efficiency than graph-free baselines on maze tasks.
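
One common way to realize a curiosity signal is to reward the agent for transitions that a learned forward model predicts poorly. The sketch below illustrates that idea with a tabular, count-based model. It is an assumption made for illustration: this summary does not state the paper's exact curiosity formulation, and `TabularForwardModel` and `intrinsic_reward` are hypothetical names.

```python
from collections import defaultdict


class TabularForwardModel:
    """Hypothetical count-based forward model over discrete states."""

    def __init__(self):
        # counts[(state, action)][next_state] = times this transition was seen
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, state, action, next_state):
        """Empirical probability of next_state given (state, action)."""
        seen = self.counts[(state, action)]
        total = sum(seen.values())
        return seen.get(next_state, 0) / total if total else 0.0

    def update(self, state, action, next_state):
        self.counts[(state, action)][next_state] += 1


def intrinsic_reward(model, state, action, next_state):
    """Reward is high when the model did not expect this transition."""
    reward = 1.0 - model.prob(state, action, next_state)
    model.update(state, action, next_state)  # learn from the transition
    return reward
```

Repeating the same transition drives its reward toward zero, which pushes a policy trained on this signal toward less familiar parts of the environment.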

What carries the argument

The world graph with nodes as pivotal states and edges as feasible traversals, produced by joint training of a latent pivotal state model and curiosity-driven goal-conditioned policy.
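
A minimal sketch of such a graph as a data structure, assuming discrete pivotal states and edges that store the primitive action sequence of a successful traversal. The class name `WorldGraph`, its methods, and the use of breadth-first search are illustrative choices, not the authors' implementation.

```python
from collections import deque


class WorldGraph:
    """Directed graph: nodes are pivotal states, edges are feasible traversals."""

    def __init__(self):
        # edges[u][v] = primitive actions that moved the agent from u to v
        self.edges = {}

    def add_edge(self, u, v, actions):
        self.edges.setdefault(u, {})[v] = list(actions)
        self.edges.setdefault(v, {})  # make sure v exists as a node

    def plan(self, start, goal):
        """Breadth-first search; returns the route with the fewest hops, or None."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                route = []
                while node is not None:
                    route.append(node)
                    node = parent[node]
                return route[::-1]
            for nxt in self.edges.get(node, {}):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None

    def actions_along(self, route):
        """Concatenate the stored action sequences along a planned route."""
        actions = []
        for u, v in zip(route, route[1:]):
            actions.extend(self.edges[u][v])
        return actions
```

Note that breadth-first search minimizes the number of pivotal states visited, not the number of primitive actions; a weighted search would be needed to minimize total path length.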

If this is right

  • A high-level Manager can quickly find solutions to new tasks by planning with reference to pivotal states.
  • A low-level Worker can traverse long distances to pivotal states and explore non-locally using the graph.
  • The framework produces significant advantages in performance and efficiency on maze tasks over methods lacking the graph.
  • The graph abstraction supports solving multiple tasks within one complex environment by reusing structure learned without task labels.
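
The Figure 3 caption describes a Wide-then-Narrow instruction in which the Manager first picks a wide goal from a predefined candidate set and then narrows to a subgoal near it. The sketch below illustrates that two-step selection on a grid. The scoring functions stand in for learned policy heads, the deterministic argmax replaces sampling, and the function name and `radius` parameter are hypothetical.

```python
def wide_then_narrow(candidates, wide_score, narrow_score, radius=1):
    """Two-step subgoal selection on a grid of (row, col) cells.

    candidates:   predefined set of candidate states, e.g. pivotal states
    wide_score:   stand-in for the Manager's preference over candidates
    narrow_score: stand-in for the Manager's preference over nearby cells
    """
    # Wide step: choose the wide goal from the candidate set.
    g_w = max(candidates, key=wide_score)

    # Narrow step: choose the final subgoal from a square window around g_w.
    # This sketch does not check maze bounds or walls.
    row, col = g_w
    window = [
        (row + dr, col + dc)
        for dr in range(-radius, radius + 1)
        for dc in range(-radius, radius + 1)
    ]
    g_n = max(window, key=narrow_score)
    return g_w, g_n
```

Restricting the second choice to a small window keeps the Manager's effective action space small even when the maze is large, which is one plausible reading of why the two-step design helps.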

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might scale to continuous state spaces if the latent model can identify useful pivotal states without discrete maze structure.
  • Combining the graph with other hierarchical methods could further reduce the cost of adapting to task variations.
  • Dynamically refining the graph during task solving might allow adaptation to changes in the environment.
  • Testing the method in environments with partial observability could reveal whether the graph still enables non-local exploration.

Load-bearing premise

The latent pivotal state model jointly trained with the curiosity-driven goal-conditioned policy in a task-agnostic manner produces nodes that form a useful graph abstraction for solving previously unseen tasks.

What would settle it

Training the model on maze environments and observing no improvement in success rate or sample efficiency on held-out tasks when the learned graph is provided to the Manager and Worker compared to baselines without it.
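
A sketch of how such a comparison could be scored, assuming each run is logged as a learning curve of (environment steps, success rate) pairs. The function names and the threshold-based definition of sample efficiency are illustrative assumptions, not the paper's evaluation protocol.

```python
def steps_to_threshold(curve, threshold):
    """First environment-step count at which success rate reaches threshold.

    curve: list of (env_steps, success_rate) pairs sorted by env_steps.
    Returns None if the threshold is never reached.
    """
    for steps, rate in curve:
        if rate >= threshold:
            return steps
    return None


def compare(curve_graph, curve_baseline, threshold):
    """Return (final success-rate delta, sample-efficiency ratio).

    A ratio above 1 means the baseline needed more steps than the
    graph-based agent to reach the threshold; None means at least one
    run never reached it.
    """
    delta = curve_graph[-1][1] - curve_baseline[-1][1]
    s_graph = steps_to_threshold(curve_graph, threshold)
    s_base = steps_to_threshold(curve_baseline, threshold)
    ratio = s_base / s_graph if s_graph and s_base else None
    return delta, ratio
```

Under this scoring, the claim would be undermined if the delta stayed near zero and the ratio stayed near 1 (or below) across held-out tasks and seeds.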

Figures

Figures reproduced from arXiv: 1907.00664 by Alex Trott, Caiming Xiong, Richard Socher, Stephan Zheng, Wenling Shang.

Figure 1: Top Left: Overall pipeline of our proposed 2-stage framework. Top Right (world graph discovery): a subgraph exemplifies how to forge edges and traverse between pivotal states (in blue). Bottom (Hierarchical RL): an example rollout from our proposed HRL policy with Wide-then-Narrow Manager instructions and world graph traversals, solving a challenging Door-Key task. At first glimpse, the world graph seems su…
Figure 2: Our recurrent latent model with differentiable binary latent units to discover pivotal states. A prior network (left) learns the state-conditioned prior in Beta distribution, p_ψ(z_t | s_t) = Beta(α_t, β_t). An inference encoder learns an approximate posterior in HardKuma distribution [8] inferred from (s_t, a_t)'s, q_φ(z_t | a_t, s_t) = HardKuma(α̃_t, 1). A generation decoder reconstructs the action sequence from {s_t | z_t…
Figure 3: Left: a general configuration of Feudal Network; Manager and Worker are both A2C-LSTMs operating at different temporal resolutions. Right: proposed Wide-then-Narrow Manager instruction, where Manager first outputs a wide goal g_w from a pre-defined set of candidate states V, e.g. V_p, and then zooms its attention to a closer up area around g_w to narrow down the final subgoal g_n. the shortest such actionable …
Figure 4: Validation curves during training (mean and standard deviation of reward, 3 seeds) for MultiGoal. Left: comparison between V_p and V_rand, with or without traversal; all models here use WN and π_g initialization. Observe that (1) traversal evidently speeds up convergence and (2) V_rand carries higher variance and slightly inferior performance than V_p. Right: comparison with or without π_g initialization on V_p, all models…
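
The Figure 2 caption refers to a HardKuma posterior over binary latent units. As a rough illustration of how a sample from such a distribution can be drawn, the sketch below uses the stretch-and-rectify construction: sample a Kumaraswamy variable by inverse CDF, stretch it to an interval slightly wider than [0, 1], then clamp. The support bounds (-0.1, 1.1) and the function name are assumptions for illustration, not values taken from this page.

```python
import random


def sample_hardkuma(a, b=1.0, lower=-0.1, upper=1.1, rng=random):
    """Draw one sample in [0, 1] with point masses at exactly 0 and 1.

    a, b:         Kumaraswamy shape parameters (b defaults to 1)
    lower, upper: stretched support; assumed values for this sketch
    rng:          any object with a .random() method returning U(0, 1)
    """
    u = rng.random()
    # Inverse CDF of Kumaraswamy(a, b), whose CDF is 1 - (1 - x^a)^b.
    k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    # Stretch to (lower, upper), then rectify back into [0, 1].
    t = lower + (upper - lower) * k
    return min(1.0, max(0.0, t))
```

Because the stretched variable can fall below 0 or above 1 before clamping, the result is exactly 0 or exactly 1 with nonzero probability, which is what lets a continuous, reparameterizable sample behave like a binary selection.
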
Original abstract

In many real-world scenarios, an autonomous agent often encounters various tasks within a single complex environment. We propose to build a graph abstraction over the environment structure to accelerate the learning of these tasks. Here, nodes are important points of interest (pivotal states) and edges represent feasible traversals between them. Our approach has two stages. First, we jointly train a latent pivotal state model and a curiosity-driven goal-conditioned policy in a task-agnostic manner. Second, provided with the information from the world graph, a high-level Manager quickly finds solution to new tasks and expresses subgoals in reference to pivotal states to a low-level Worker. The Worker can then also leverage the graph to easily traverse to the pivotal states of interest, even across long distance, and explore non-locally. We perform a thorough ablation study to evaluate our approach on a suite of challenging maze tasks, demonstrating significant advantages from the proposed framework over baselines that lack world graph knowledge in terms of performance and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a two-stage framework for accelerating hierarchical RL across multiple tasks in a shared environment by learning a 'world graph' whose nodes are pivotal states discovered via task-agnostic training. Stage 1 jointly optimizes a latent pivotal-state model together with a curiosity-driven goal-conditioned policy. Stage 2 supplies the resulting graph to a high-level Manager that plans over pivotal states and issues subgoals to a low-level Worker; the Worker in turn uses graph edges for long-range, non-local traversal. The authors report a thorough ablation study on a suite of maze navigation tasks demonstrating performance and sample-efficiency gains relative to baselines that lack the world-graph abstraction.

Significance. If the discovered pivotal states reliably form transferable abstractions rather than task-specific exploration artifacts, the approach would supply a concrete mechanism for reusable hierarchical structure in RL, directly addressing the sample-efficiency bottleneck in long-horizon, multi-task settings. The empirical claims on maze domains, if substantiated by the ablations, would constitute a practical demonstration that curiosity-driven discovery can yield planning-friendly graphs.

major comments (2)
  1. [Abstract] Abstract (two-stage description): the central claim that the jointly trained latent pivotal-state model produces nodes usable for solving previously unseen tasks rests on an unverified assumption that curiosity-driven discovery aligns with task-relevant bottlenecks. No mechanism is stated that would prevent the nodes from being transient visitation artifacts, which would render the Manager/Worker transfer advantage void on held-out mazes.
  2. [Ablation study] Ablation study paragraph: the manuscript asserts 'significant advantages … in terms of performance and efficiency' yet supplies neither quantitative metrics (e.g., success rate deltas, sample-complexity ratios) nor a direct comparison isolating the contribution of the world-graph transfer versus the curiosity policy alone. Without these numbers the load-bearing claim that the graph accelerates new-task learning cannot be evaluated.
minor comments (1)
  1. [Abstract] Abstract: 'finds solution to new tasks' should read 'finds a solution to new tasks'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the mechanisms in our approach and indicating revisions where the presentation can be strengthened.

Point-by-point responses
  1. Referee: [Abstract] Abstract (two-stage description): the central claim that the jointly trained latent pivotal-state model produces nodes usable for solving previously unseen tasks rests on an unverified assumption that curiosity-driven discovery aligns with task-relevant bottlenecks. No mechanism is stated that would prevent the nodes from being transient visitation artifacts, which would render the Manager/Worker transfer advantage void on held-out mazes.

    Authors: The joint optimization of the latent pivotal-state model with the curiosity-driven goal-conditioned policy provides the mechanism: the model is trained to encode states that enable the policy to achieve diverse goals via intrinsic rewards, favoring states that serve as reliable exploration hubs rather than transient visitations. This task-agnostic process produces nodes that transfer to held-out tasks, as shown by the empirical results on unseen mazes. We will revise the abstract to explicitly articulate this joint-training mechanism. revision: yes

  2. Referee: [Ablation study] Ablation study paragraph: the manuscript asserts 'significant advantages … in terms of performance and efficiency' yet supplies neither quantitative metrics (e.g., success rate deltas, sample-complexity ratios) nor a direct comparison isolating the contribution of the world-graph transfer versus the curiosity policy alone. Without these numbers the load-bearing claim that the graph accelerates new-task learning cannot be evaluated.

    Authors: The full paper's ablation study reports quantitative results via success rates, learning curves, and efficiency comparisons against baselines. However, an explicit ablation isolating the world-graph transfer benefit from the curiosity policy alone is not present. We will add this comparison and include specific numerical deltas (e.g., success-rate improvements and sample-complexity ratios) in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical procedure is self-contained

Full rationale

The paper presents a two-stage empirical method: joint task-agnostic training of a latent pivotal state model with a curiosity-driven goal-conditioned policy, followed by using the resulting graph for Manager/Worker hierarchical control on new tasks. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that reduce the central claims to inputs by construction. The approach is validated via ablation on maze tasks rather than a closed derivation, so the derivation chain does not collapse and remains externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that environments contain learnable pivotal states whose graph abstraction transfers to new tasks; no free parameters or invented entities with independent evidence are specified in the abstract.

axioms (1)
  • domain assumption: Environments contain identifiable pivotal states that can be discovered task-agnostically and assembled into a useful graph for subgoal planning.
    Invoked as the foundation for both training stages and the subsequent use of the world graph.
invented entities (1)
  • world graph (no independent evidence)
    purpose: Graph abstraction whose nodes are pivotal states and edges are feasible traversals, used to accelerate hierarchical task solving.
    Newly introduced abstraction whose utility is asserted but not independently verified in the abstract.

pith-pipeline@v0.9.0 · 5705 in / 1357 out tokens · 44742 ms · 2026-05-25T12:28:49.825223+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Graph World Models: Concepts, Taxonomy, and Future Directions

    cs.AI 2026-04 unverdicted novelty 7.0

    The paper unifies emerging graph-based world models under a new paradigm and proposes a taxonomy organized by spatial, physical, and logical relational inductive biases.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 1 Pith paper · 25 internal anchors

  1. [1]

    P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004

  2. [2]

    Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

    J. Achiam and S. Sastry. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732, 2017

  3. [3]

    A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. A fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, pages 1027–1037, 2008

  4. [4]

    M. G. Azar, B. Piot, B. A. Pires, J.-B. Gril, F. Altche, and R. Munos. World discovery model. arXiv, 2019

  5. [5]

    J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014

  6. [6]

    P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017

  7. [7]

    A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pages 4055–4065, 2017

  8. [8]

    J. Bastings, W. Aziz, and I. Titov. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 2019 Conference of the Association for Computational Linguistics, Volume 1 (Long Papers). Association for Computational Linguistics, 2019

  9. [9]

    D. P. Bertsekas. Dynamic programming and optimal control, volume 1. 1995

  10. [10]

    D. P. Bertsekas. Nonlinear Programming. 1999

  11. [11]

    N. Biggs. Algebraic Graph Theory. 1993

  12. [12]

    D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017

  13. [13]

    D. M. Blei and P. J. Moreno. Topic segmentation with an aspect hidden markov model. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 343–348. ACM, 2001

  14. [14]

    Large-Scale Study of Curiosity-Driven Learning

    Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018

  15. [15]

    L. Buşoniu, R. Babuška, and B. De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in multi-agent systems and applications-1, pages 183–221. Springer, 2010

  16. [16]

    W. Chan, Y. Zhang, Q. Le, and N. Jaitly. Latent sequence decompositions. arXiv preprint arXiv:1610.03035, 2016

  17. [17]

    M. Chatzigiorgaki and A. N. Skodras. Real-time keyframe extraction towards video content identification. In 2009 16th International conference on digital signal processing, pages 1–6. IEEE, 2009

  18. [18]

    M. Chevalier-Boisvert and L. Willems. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018

  19. [19]

    J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015

  20. [20]

    J. D. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, and S. Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. arXiv preprint arXiv:1806.02813, 2018

  21. [21]

    P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993

  22. [22]

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv, 2018

  23. [23]

    J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014

  24. [24]

    Z. Dwiel, M. Candadi, M. Phielipp, and A. Bansal. Hierarchical policy learning is sensitive to goal space design. arXiv preprint, (2), 2019

  25. [25]

    Go-explore: a new approach for hard-exploration problems

    A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019

  26. [27]

    Z. Feng, R. Dearden, N. Meuleau, and R. Washington. Dynamic programming for structured continuous markov decision problems. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 154–161. AUAI Press, 2004

  27. [28]

    C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135, 2017

  28. [29]

    D. Fox, W. Burgard, and S. Thrun. Active markov localization for mobile robots. Robotics and Autonomous Systems, 25(3-4):195–207, 1998

  29. [30]

    R. Fox, S. Krishnan, I. Stoica, and K. Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017

  30. [31]

    G. Fritz, C. Seifert, L. Paletta, and H. Bischof. Attentive object detection using an information theoretic saliency measure. In International workshop on attention and performance in computational vision, pages 29–41. Springer, 2004

  31. [32]

    Learning Actionable Representations with Goal-Conditioned Policies

    D. Ghosh, A. Gupta, and S. Levine. Learning actionable representations with goal-conditioned policies. arXiv preprint arXiv:1811.07819, 2018

  32. [33]

    Temporal Difference Variational Auto-Encoder

    K. Gregor and F. Besse. Temporal difference variational auto-encoder. arXiv preprint arXiv:1806.03107, 2018

  33. [34]

    K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. In ICML, 2015

  34. [35]

    S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016

  35. [36]

    Z. D. Guo, M. G. Azar, B. Piot, B. A. Pires, T. Pohlen, and R. Munos. Neural predictive belief representations. arXiv preprint arXiv:1811.06407, 2018

  36. [37]

    World Models

    D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  37. [38]

    Latent Space Policies for Hierarchical Reinforcement Learning

    T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018

  38. [39]

    Learning Latent Dynamics for Planning from Pixels

    D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018

  39. [40]

    K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018

  40. [41]

    P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  41. [42]

    Multi-task Deep Reinforcement Learning with PopArt

    M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt. Multi-task deep reinforcement learning with popart. arXiv preprint arXiv:1809.04474, 2018

  42. [43]

    I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017

  43. [44]

    SCAN: Learning Hierarchical Compositional Visual Concepts

    I. Higgins, N. Sonnerat, L. Matthey, A. Pal, C. P. Burgess, M. Bosnjak, M. Shanahan, M. Botvinick, D. Hassabis, and A. Lerchner. Scan: Learning hierarchical compositional visual concepts. arXiv preprint arXiv:1707.03389, 2017

  44. [45]

    J. Hu, M. P. Wellman, et al. Multiagent reinforcement learning: theoretical framework and an algorithm. Citeseer, 1998

  45. [46]

    A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017

  46. [47]

    Time-Agnostic Prediction: Predicting Predictable Video Frames

    D. Jayaraman, F. Ebert, A. A. Efros, and S. Levine. Time-agnostic prediction: Predicting predictable video frames. arXiv preprint arXiv:1808.07784, 2018

  47. [48]

    Discovering Options for Exploration by Minimizing Cover Time

    Y. Jinnai, J. W. Park, D. Abel, and G. Konidaris. Discovering options for exploration by minimizing cover time. arXiv preprint arXiv:1903.00606, 2019

  48. [49]

    L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998

  49. [50]

    Model-based reinforcement learning for atari

    L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019

  50. [51]

    D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2013

  51. [52]

    T. Kipf, Y. Li, H. Dai, V. Zambaldi, E. Grefenstette, P. Kohli, and P. Battaglia. Compositional imitation learning: Explaining and executing one task at a time. arXiv preprint arXiv:1812.01483, 2018

  52. [53]

    O. Kroemer, C. Daniel, G. Neumann, H. Van Hoof, and J. Peters. Towards learning hierarchical skills for multi-phase manipulation tasks. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1503–1510. IEEE, 2015

  53. [54]

    T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016

  54. [55]

    Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015

  55. [56]

    Options Discovery with Budgeted Reinforcement Learning

    A. Léon and L. Denoyer. Options discovery with budgeted reinforcement learning. arXiv preprint arXiv:1611.06824, 2016

  56. [57]

    A. Levy, R. Platt, and K. Saenko. Hierarchical actor-critic. arXiv preprint arXiv:1712.00948, 2017

  57. [58]

    A. Q. Li, M. Xanthidis, J. M. O’Kane, and I. Rekleitis. Active localization with dynamic obstacles. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1902–1909. IEEE, 2016

  58. [59]

    M. L. Littman. Algorithms for sequential decision making. 1996

  59. [60]

    Learning Sparse Neural Networks through $L_0$ Regularization

    C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through l_0 regularization. arXiv preprint arXiv:1712.01312, 2017

  60. [61]

    S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1–19, 2015

  61. [62]

    C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016

  62. [63]

    B. Marthi and C. Guestrin. Concurrent hierarchical reinforcement learning. 2005

  63. [64]

    All you need is a good init

    D. Mishkin and J. Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015

  64. [65]

    V. Mnih, J. Agapiou, S. Osindero, A. Graves, O. Vinyals, K. Kavukcuoglu, et al. Strategic attentive writer for learning macro-actions. arXiv preprint arXiv:1606.04695, 2016

  65. [66]

    V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016

  66. [67]

    Near-Optimal Representation Learning for Hierarchical Reinforcement Learning

    O. Nachum, S. Gu, H. Lee, and S. Levine. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018

  67. [68]

    O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303–3313, 2018

  68. [69]

    A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9191–9200, 2018

  69. [70]

    Stick-Breaking Variational Autoencoders

    E. Nalisnick and P. Smyth. Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197, 2016

  70. [71]

    The scientific objectives of the mars exploration rover

    NASA. The scientific objectives of the mars exploration rover. 2015

  71. [72]

    S. Niekum and S. Chitta. Incremental semantically grounded learning from demonstration. 2013

  72. [73]

    G. Ostrovski, M. G. Bellemare, A. v. d. Oord, and R. Munos. Count-based exploration with neural density models. ICML, 2017

  73. [74]

    Y. P. Pane, S. P. Nageshrao, and R. Babuška. Actor-critic reinforcement learning for tracking control in robotics. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 5819–5826. IEEE, 2016

  74. [75]

    D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017

  75. [76]

    K. Pertsch, O. Rybkin, J. Yang, K. Derpanis, J. Lim, K. Daniilidis, and A. Jaegle. Keyin: Discovering subgoal structure with keyframe-based video prediction. arXiv, 2019

  76. [77]

    S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in neural information processing systems, pages 5690–5701, 2017

  77. [78]

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019

  78. [79]

    D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014

  79. [80]

    S. M. Ross. Introduction to stochastic dynamic programming. Academic press, 2014

  80. [81]

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015

Showing first 80 references.