pith. machine review for the scientific record.

arxiv: 2010.02193 · v4 · submitted 2020-10-05 · 💻 cs.LG · cs.AI · stat.ML

Mastering Atari with Discrete World Models

Pith reviewed 2026-05-15 01:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords world models · reinforcement learning · Atari · discrete latents · imagination · sample efficiency · human-level performance · latent space planning

The pith

DreamerV2 achieves human-level performance on Atari by learning behaviors inside a separately trained discrete world model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DreamerV2, a reinforcement learning agent that trains a world model using discrete latent representations from image inputs. Policies are then learned entirely by imagining outcomes inside this model rather than through direct environment interaction. A sympathetic reader cares because this separation allows for more sample-efficient learning in complex visual environments like Atari games, where the agent reaches human-level performance on 55 tasks. It also demonstrates applicability to continuous control tasks such as humanoid robot locomotion from pixels.

Core claim

DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model that uses discrete representations. With the same computational budget and wall-clock time, it reaches 200M frames and surpasses the final performance of the top single-GPU agents IQN and Rainbow. The approach is also shown to work on continuous-action tasks by learning an accurate world model of a complex humanoid robot from pixels.

What carries the argument

The discrete latent world model trained separately from the policy, which compresses images into categorical states and predicts future latents, rewards, and discounts to support multi-step imagination for policy optimization.
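
To make the machinery concrete, here is a minimal sketch of straight-through categorical sampling, the trick that keeps discrete latents trainable by gradient descent. The 32-variables-by-32-classes shape matches the configuration the paper reports; the function name, batch size, and use of PyTorch are illustrative.

```python
import torch
from torch.distributions import OneHotCategorical

def sample_discrete_latent(logits: torch.Tensor) -> torch.Tensor:
    """Sample one-hot categorical latents with straight-through gradients.

    logits: (batch, num_vars, num_classes) unnormalized scores. The forward
    pass returns hard one-hot samples; the backward pass routes gradients
    through the class probabilities, keeping the model end-to-end trainable.
    """
    dist = OneHotCategorical(logits=logits)
    sample = dist.sample()                  # hard one-hot, no gradient
    probs = dist.probs                      # differentiable probabilities
    return sample + probs - probs.detach()  # straight-through estimator

# 32 categorical variables with 32 classes each, as reported for DreamerV2.
logits = torch.randn(16, 32, 32, requires_grad=True)
z = sample_discrete_latent(logits)          # one-hot forward value
z.sum().backward()                          # gradients reach the logits
```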

If this is right

  • The agent surpasses IQN and Rainbow on Atari while using the same compute and wall-clock time.
  • It solves stand-up and walking tasks for a humanoid robot using only pixel observations.
  • Behaviors can be optimized purely from model predictions, improving sample efficiency over direct interaction methods.
  • The same architecture applies without modification to both discrete and continuous action spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If prediction accuracy holds for longer horizons, the method could support planning in environments where real interaction is expensive or unsafe.
  • The separation of model learning from policy learning opens the possibility of reusing one world model across multiple tasks or agents.
  • Extending the discrete representation to handle partial observability or stochastic dynamics would test whether the approach scales beyond current benchmarks.

Load-bearing premise

The learned discrete world model must remain accurate enough over multi-step imagined trajectories that compounding prediction errors do not invalidate the returns used for policy learning.
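
The dependence is easiest to see in how the return targets are assembled. The sketch below computes TD(λ) targets along an imagined trajectory, in the style of the generalized advantage estimation family the paper builds on; variable names and the toy inputs are illustrative. Every reward and discount entering the recursion is a model prediction, so an error at any depth propagates into every earlier target.

```python
import numpy as np

def lambda_returns(rewards, values, discounts, lam=0.95):
    """TD(lambda) return targets along an imagined trajectory.

    rewards, discounts: model-predicted quantities for steps 0..H-1.
    values: critic estimates for steps 0..H (bootstrap at the horizon).
    Because every input is a prediction, per-step model errors compound
    into the targets the actor and critic are trained on.
    """
    horizon = len(rewards)
    returns = np.zeros(horizon)
    next_return = values[-1]                # bootstrap with v(s_H)
    for t in reversed(range(horizon)):
        returns[t] = rewards[t] + discounts[t] * (
            (1 - lam) * values[t + 1] + lam * next_return)
        next_return = returns[t]
    return returns

# Toy 15-step rollout, matching the imagination horizon DreamerV2 uses.
H = 15
print(lambda_returns(np.ones(H), np.ones(H + 1), np.full(H, 0.99)))
```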

What would settle it

Deploy the learned policy in the actual Atari environment and compare its scores with those estimated from the imagined trajectories; a substantial shortfall of real scores below imagined ones would falsify the central claim.
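
Concretely, the test reduces to a comparison like the following sketch; the function and the per-episode numbers are hypothetical and only illustrate the shape of the evidence.

```python
import numpy as np

def imagination_gap(imagined_returns, real_returns):
    """Gap between returns estimated inside the world model and returns
    measured by deploying the same fixed policy in the real environment.

    A large positive gap (imagined >> real) would indicate that compounding
    model error inflates imagined returns, undercutting the central claim.
    """
    imagined = np.asarray(imagined_returns, dtype=float)
    real = np.asarray(real_returns, dtype=float)
    return imagined.mean() - real.mean(), imagined.mean() / real.mean()

# Hypothetical per-episode returns for one trained policy.
gap, ratio = imagination_gap([210.0, 198.0, 205.0], [200.0, 190.0, 207.0])
print(f"gap={gap:.1f}, imagined/real={ratio:.2f}")
```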

Original abstract

Intelligent agents need to generalize from past experience to achieve goals in complex environments. World models facilitate such generalization and allow learning behaviors from imagined outcomes to increase sample-efficiency. While learning world models from image inputs has recently become feasible for some tasks, modeling Atari games accurately enough to derive successful behaviors has remained an open challenge for many years. We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in the compact latent space of a powerful world model. The world model uses discrete representations and is trained separately from the policy. DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model. With the same computational budget and wall-clock time, DreamerV2 reaches 200M frames and surpasses the final performance of the top single-GPU agents IQN and Rainbow. DreamerV2 is also applicable to tasks with continuous actions, where it learns an accurate world model of a complex humanoid robot and solves stand-up and walking from only pixel inputs.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DreamerV2, an RL agent that trains a discrete latent world model (RSSM with categorical latents) separately from the policy and optimizes actor-critic objectives entirely inside imagined trajectories in that latent space. It claims to be the first agent to reach human-level performance on the 55-game Atari benchmark by learning behaviors inside a separately trained world model, surpassing the final scores of Rainbow and IQN under matched single-GPU compute while also demonstrating the approach on continuous-control humanoid tasks from pixels.

Significance. If the result holds, this is a significant advance for model-based RL: it shows that compact discrete world models can scale to the full Atari suite and support policy optimization without direct environment interaction during behavior learning. Credit is due for the ablation studies that isolate the benefit of discrete latents, the learning curves reported across all 55 games with multiple seeds, and the direct comparison to independently published baseline numbers.

major comments (1)
  1. [Section 4] Section 4 (Experiments) and Appendix B: no direct measurement of multi-step prediction error (reward or state MSE on held-out rollouts) is reported over the imagination horizon used for actor-critic optimization. While final Atari scores are strong, this leaves the central assumption that imagined returns remain sufficiently correlated with true returns untested by an explicit diagnostic.
minor comments (2)
  1. [Appendix] Hyperparameter table: exact values for the number of discrete classes, KL coefficient, and imagination horizon length are listed as free parameters but not reported in a single consolidated table, which would aid reproducibility.
  2. [Figure 3] Figure 3 caption: clarify whether the human-normalized scores use the same clipping and normalization constants as the Rainbow and IQN papers for the direct comparison.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and for highlighting the significance of the results. We address the single major comment below.

Point-by-point responses
  1. Referee: [Section 4] Section 4 (Experiments) and Appendix B: no direct measurement of multi-step prediction error (reward or state MSE on held-out rollouts) is reported over the imagination horizon used for actor-critic optimization. While final Atari scores are strong, this leaves the central assumption that imagined returns remain sufficiently correlated with true returns untested by an explicit diagnostic.

    Authors: We agree that an explicit multi-step diagnostic would strengthen the presentation. In the revised manuscript we will add, in Appendix B, plots of per-step reward and latent-state MSE computed on held-out rollouts over the exact 15-step imagination horizon used by the actor-critic. These curves will be averaged across the 55 Atari games (and separately for the humanoid tasks) and will be accompanied by a short discussion of how the observed error growth relates to the final policy performance. revision: yes
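
For concreteness, the promised diagnostic could take roughly the following shape. The `model.init`/`model.step` interface is hypothetical, standing in for the paper's recurrent state-space model; only the 15-step horizon is taken from the rebuttal.

```python
import numpy as np

def per_step_reward_mse(model, trajectories, horizon=15):
    """Open-loop diagnostic: reward MSE as a function of prediction depth.

    `model` is assumed to expose init(obs) -> latent state and
    step(state, action) -> (next_state, predicted_reward).  `trajectories`
    holds held-out real-environment episodes as dicts with 'obs',
    'actions', and 'rewards' arrays of length >= horizon.
    """
    errors = np.zeros(horizon)
    count = 0
    for traj in trajectories:
        state = model.init(traj["obs"][0])   # condition on the first frame
        for t in range(horizon):             # then predict open-loop
            state, r_pred = model.step(state, traj["actions"][t])
            errors[t] += (r_pred - traj["rewards"][t]) ** 2
        count += 1
    return errors / max(count, 1)            # a rising curve means compounding error
```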

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on direct environment evaluation.

Full rationale

The paper trains a discrete world model on trajectories collected from the real Atari environments, then optimizes actor-critic policies inside imagined rollouts in the latent space, and finally deploys the resulting policy in the actual environments to obtain the reported scores. These scores are compared against independently published baselines (IQN, Rainbow) and are not obtained by re-using any fitted parameter as a prediction. No equation or section defines a target metric in terms of itself, renames a known result, or imports a uniqueness theorem from the authors' prior work to force the architecture. The multi-step imagination accuracy is an empirical assumption whose validity is tested by the final real-environment returns rather than assumed by construction.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard deep-RL assumptions plus several hyperparameters whose values were selected to maximize reported scores; a consolidated sketch of these parameters follows the ledger.

free parameters (3)
  • number of discrete classes per latent variable
    Chosen as a hyperparameter; controls the granularity of the world-model representation.
  • KL loss coefficient
    Tuned to balance reconstruction and regularization in the world-model objective.
  • imagination horizon length
    Selected to trade off planning depth against compounding model error.
axioms (2)
  • domain assumption The true environment dynamics admit a compact discrete latent representation that remains predictive over dozens of steps.
    Invoked throughout the world-model training and imagination sections.
  • domain assumption Gradient-based optimization of the actor-critic objective in imagined trajectories yields policies that transfer to the real environment.
    Central to the behavior-learning loop.
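
As a reproducibility aid, the ledger's free parameters can be consolidated as below. The latent shape and imagination horizon match values the paper reports; the KL coefficient shown is only an illustrative placeholder, since its exact value depends on the benchmark configuration.

```python
from dataclasses import dataclass

@dataclass
class WorldModelConfig:
    """The three free parameters the ledger identifies, in one place."""
    num_latent_vars: int = 32   # categorical variables per latent state
    num_classes: int = 32       # classes per categorical variable
    kl_scale: float = 0.1       # KL loss coefficient (illustrative value)
    horizon: int = 15           # imagination horizon for the actor-critic

print(WorldModelConfig())
```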

pith-pipeline@v0.9.0 · 5487 in / 1389 out tokens · 34447 ms · 2026-05-15T01:21:23.129242+00:00 · methodology

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  2. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

  3. Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

    cs.LG 2026-05 unverdicted novelty 7.0

    Clin-JEPA supplies a multi-phase co-training method for JEPA pretraining on EHR trajectories that achieves converging latent rollouts and improved multi-task AUROC on MIMIC-IV data.

  4. SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

  5. Advantage-Guided Diffusion for Model-Based Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

  6. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  7. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  8. Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

    cs.LG 2026-05 unverdicted novelty 6.0

    A five-phase co-training framework enables stable JEPA pretraining on EHR trajectories, producing converging latent rollouts and higher multi-task AUROC than baselines on MIMIC-IV ICU data.

  9. Predictive but Not Plannable: RC-aux for Latent World Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  10. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  11. TRAP: Tail-aware Ranking Attack for World-Model Planning

    cs.LG 2026-05 unverdicted novelty 6.0

    TRAP is a tail-aware ranking attack that plants a backdoor in world models so that a trigger causes the model to reorder a few critical imagined trajectories and redirect planning while preserving normal behavior on c...

  12. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  13. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  14. Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.

  15. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    cs.LG 2026-03 unverdicted novelty 6.0

    LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.

  16. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  17. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  18. Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    cs.RO 2023-12 conditional novelty 6.0

    A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

  19. Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

    cs.LG 2026-05 unverdicted novelty 5.0

    A delay-aware RL approach learns transferable structured representations and dynamics via implicit causal graphs, outperforming baselines on delayed DMC tasks and accelerating adaptation to new tasks.

  20. Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

    cs.RO 2026-04 unverdicted novelty 5.0

    Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...

  21. The Cartesian Cut in Agentic AI

    cs.AI 2026-04 unverdicted novelty 5.0

    LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.

  22. CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics

    cs.LG 2026-04 unverdicted novelty 5.0

    CausalVAE plug-in for world models preserves factual prediction and boosts counterfactual retrieval, with large gains on physics benchmarks and recovered physical interaction trends.

  23. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  24. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
