pith. machine review for the scientific record.

arxiv: 2010.02193 · v4 · submitted 2020-10-05 · 💻 cs.LG · cs.AI · stat.ML

Mastering Atari with Discrete World Models

Pith reviewed 2026-05-15 01:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords world models · reinforcement learning · Atari · discrete latents · imagination · sample efficiency · human-level performance · latent space planning

The pith

DreamerV2 achieves human-level performance on Atari by learning behaviors inside a separately trained discrete world model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DreamerV2, a reinforcement learning agent that trains a world model using discrete latent representations from image inputs. Policies are then learned entirely by imagining outcomes inside this model rather than through direct environment interaction. A sympathetic reader cares because this separation allows for more sample-efficient learning in complex visual environments like Atari games, where the agent reaches human-level performance on 55 tasks. It also demonstrates applicability to continuous control tasks such as humanoid robot locomotion from pixels.

Core claim

DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model that uses discrete representations. With the same computational budget and wall-clock time, it reaches 200M frames and surpasses the final performance of the top single-GPU agents IQN and Rainbow. The approach is also shown to work on continuous-action tasks by learning an accurate world model of a complex humanoid robot from pixels.

What carries the argument

The discrete latent world model trained separately from the policy, which compresses images into categorical states and predicts future latents, rewards, and discounts to support multi-step imagination for policy optimization.
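
To make the machinery concrete, here is a minimal sketch of straight-through categorical sampling, the trick that keeps discrete latents trainable by gradient descent. The 32-variables-by-32-classes shape matches the configuration the paper reports; the function name, batch size, and use of PyTorch are illustrative.

```python
import torch
from torch.distributions import OneHotCategorical

def sample_discrete_latent(logits: torch.Tensor) -> torch.Tensor:
    """Sample one-hot categorical latents with straight-through gradients.

    logits: (batch, num_vars, num_classes) unnormalized scores. The forward
    pass returns hard one-hot samples; the backward pass routes gradients
    through the class probabilities, keeping the model end-to-end trainable.
    """
    dist = OneHotCategorical(logits=logits)
    sample = dist.sample()                  # hard one-hot, no gradient
    probs = dist.probs                      # differentiable probabilities
    return sample + probs - probs.detach()  # straight-through estimator

# 32 categorical variables with 32 classes each, as reported for DreamerV2.
logits = torch.randn(16, 32, 32, requires_grad=True)
z = sample_discrete_latent(logits)          # one-hot forward value
z.sum().backward()                          # gradients reach the logits
```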

If this is right

  • The agent surpasses IQN and Rainbow on Atari while using the same compute and wall-clock time.
  • It solves stand-up and walking tasks for a humanoid robot using only pixel observations.
  • Behaviors can be optimized purely from model predictions, improving sample efficiency over direct interaction methods.
  • The same architecture applies without modification to both discrete and continuous action spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If prediction accuracy holds for longer horizons, the method could support planning in environments where real interaction is expensive or unsafe.
  • The separation of model learning from policy learning opens the possibility of reusing one world model across multiple tasks or agents.
  • Extending the discrete representation to handle partial observability or stochastic dynamics would test whether the approach scales beyond current benchmarks.

Load-bearing premise

The learned discrete world model must remain accurate enough over multi-step imagined trajectories that compounding prediction errors do not invalidate the returns used for policy learning.
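
The dependence is easiest to see in how the return targets are assembled. The sketch below computes TD(λ) targets along an imagined trajectory, in the style of the generalized advantage estimation family the paper builds on; variable names and the toy inputs are illustrative. Every reward and discount entering the recursion is a model prediction, so an error at any depth propagates into every earlier target.

```python
import numpy as np

def lambda_returns(rewards, values, discounts, lam=0.95):
    """TD(lambda) return targets along an imagined trajectory.

    rewards, discounts: model-predicted quantities for steps 0..H-1.
    values: critic estimates for steps 0..H (bootstrap at the horizon).
    Because every input is a prediction, per-step model errors compound
    into the targets the actor and critic are trained on.
    """
    horizon = len(rewards)
    returns = np.zeros(horizon)
    next_return = values[-1]                # bootstrap with v(s_H)
    for t in reversed(range(horizon)):
        returns[t] = rewards[t] + discounts[t] * (
            (1 - lam) * values[t + 1] + lam * next_return)
        next_return = returns[t]
    return returns

# Toy 15-step rollout, matching the imagination horizon DreamerV2 uses.
H = 15
print(lambda_returns(np.ones(H), np.ones(H + 1), np.full(H, 0.99)))
```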

What would settle it

Deploy the learned policy in the actual Atari environment and compare its scores with those estimated from the imagined trajectories; a substantial shortfall of real scores below imagined ones would falsify the central claim.
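
Concretely, the test reduces to a comparison like the following sketch; the function and the per-episode numbers are hypothetical and only illustrate the shape of the evidence.

```python
import numpy as np

def imagination_gap(imagined_returns, real_returns):
    """Gap between returns estimated inside the world model and returns
    measured by deploying the same fixed policy in the real environment.

    A large positive gap (imagined >> real) would indicate that compounding
    model error inflates imagined returns, undercutting the central claim.
    """
    imagined = np.asarray(imagined_returns, dtype=float)
    real = np.asarray(real_returns, dtype=float)
    return imagined.mean() - real.mean(), imagined.mean() / real.mean()

# Hypothetical per-episode returns for one trained policy.
gap, ratio = imagination_gap([210.0, 198.0, 205.0], [200.0, 190.0, 207.0])
print(f"gap={gap:.1f}, imagined/real={ratio:.2f}")
```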

Original abstract

Intelligent agents need to generalize from past experience to achieve goals in complex environments. World models facilitate such generalization and allow learning behaviors from imagined outcomes to increase sample-efficiency. While learning world models from image inputs has recently become feasible for some tasks, modeling Atari games accurately enough to derive successful behaviors has remained an open challenge for many years. We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in the compact latent space of a powerful world model. The world model uses discrete representations and is trained separately from the policy. DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model. With the same computational budget and wall-clock time, DreamerV2 reaches 200M frames and surpasses the final performance of the top single-GPU agents IQN and Rainbow. DreamerV2 is also applicable to tasks with continuous actions, where it learns an accurate world model of a complex humanoid robot and solves stand-up and walking from only pixel inputs.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DreamerV2, an RL agent that trains a discrete latent world model (RSSM with categorical latents) separately from the policy and optimizes actor-critic objectives entirely inside imagined trajectories in that latent space. It claims to be the first agent to reach human-level performance on the 55-game Atari benchmark by learning behaviors inside a separately trained world model, surpassing the final scores of Rainbow and IQN under matched single-GPU compute while also demonstrating the approach on continuous-control humanoid tasks from pixels.

Significance. If the result holds, this is a significant advance for model-based RL: it shows that compact discrete world models can scale to the full Atari suite and support policy optimization without direct environment interaction during behavior learning. Credit is due for the ablation studies that isolate the benefit of discrete latents, the learning curves reported across all 55 games with multiple seeds, and the direct comparison to independently published baseline numbers.

major comments (1)
  1. [Section 4] Section 4 (Experiments) and Appendix B: no direct measurement of multi-step prediction error (reward or state MSE on held-out rollouts) is reported over the imagination horizon used for actor-critic optimization. While final Atari scores are strong, this leaves the central assumption that imagined returns remain sufficiently correlated with true returns untested by an explicit diagnostic.
minor comments (2)
  1. [Appendix] Hyperparameter table: exact values for the number of discrete classes, KL coefficient, and imagination horizon length are listed as free parameters but not reported in a single consolidated table, which would aid reproducibility.
  2. [Figure 3] Figure 3 caption: clarify whether the human-normalized scores use the same clipping and normalization constants as the Rainbow and IQN papers for the direct comparison.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and for highlighting the significance of the results. We address the single major comment below.

Point-by-point responses
  1. Referee: [Section 4] Section 4 (Experiments) and Appendix B: no direct measurement of multi-step prediction error (reward or state MSE on held-out rollouts) is reported over the imagination horizon used for actor-critic optimization. While final Atari scores are strong, this leaves the central assumption that imagined returns remain sufficiently correlated with true returns untested by an explicit diagnostic.

    Authors: We agree that an explicit multi-step diagnostic would strengthen the presentation. In the revised manuscript we will add, in Appendix B, plots of per-step reward and latent-state MSE computed on held-out rollouts over the exact 15-step imagination horizon used by the actor-critic. These curves will be averaged across the 55 Atari games (and separately for the humanoid tasks) and will be accompanied by a short discussion of how the observed error growth relates to the final policy performance. revision: yes
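
For concreteness, the promised diagnostic could take roughly the following shape. The `model.init`/`model.step` interface is hypothetical, standing in for the paper's recurrent state-space model; only the 15-step horizon is taken from the rebuttal.

```python
import numpy as np

def per_step_reward_mse(model, trajectories, horizon=15):
    """Open-loop diagnostic: reward MSE as a function of prediction depth.

    `model` is assumed to expose init(obs) -> latent state and
    step(state, action) -> (next_state, predicted_reward).  `trajectories`
    holds held-out real-environment episodes as dicts with 'obs',
    'actions', and 'rewards' arrays of length >= horizon.
    """
    errors = np.zeros(horizon)
    count = 0
    for traj in trajectories:
        state = model.init(traj["obs"][0])   # condition on the first frame
        for t in range(horizon):             # then predict open-loop
            state, r_pred = model.step(state, traj["actions"][t])
            errors[t] += (r_pred - traj["rewards"][t]) ** 2
        count += 1
    return errors / max(count, 1)            # a rising curve means compounding error
```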

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on direct environment evaluation.

Full rationale

The paper trains a discrete world model on trajectories collected from the real Atari environments, then optimizes actor-critic policies inside imagined rollouts in the latent space, and finally deploys the resulting policy in the actual environments to obtain the reported scores. These scores are compared against independently published baselines (IQN, Rainbow) and are not obtained by re-using any fitted parameter as a prediction. No equation or section defines a target metric in terms of itself, renames a known result, or imports a uniqueness theorem from the authors' prior work to force the architecture. The multi-step imagination accuracy is an empirical assumption whose validity is tested by the final real-environment returns rather than assumed by construction.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard deep-RL assumptions plus several hyperparameters whose values were selected to maximize reported scores; a consolidated sketch of these parameters follows the ledger.

free parameters (3)
  • number of discrete classes per latent variable
    Chosen as a hyperparameter; controls the granularity of the world-model representation.
  • KL loss coefficient
    Tuned to balance reconstruction and regularization in the world-model objective.
  • imagination horizon length
    Selected to trade off planning depth against compounding model error.
axioms (2)
  • domain assumption The true environment dynamics admit a compact discrete latent representation that remains predictive over dozens of steps.
    Invoked throughout the world-model training and imagination sections.
  • domain assumption Gradient-based optimization of the actor-critic objective in imagined trajectories yields policies that transfer to the real environment.
    Central to the behavior-learning loop.
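
As a reproducibility aid, the ledger's free parameters can be consolidated as below. The latent shape and imagination horizon match values the paper reports; the KL coefficient shown is only an illustrative placeholder, since its exact value depends on the benchmark configuration.

```python
from dataclasses import dataclass

@dataclass
class WorldModelConfig:
    """The three free parameters the ledger identifies, in one place."""
    num_latent_vars: int = 32   # categorical variables per latent state
    num_classes: int = 32       # classes per categorical variable
    kl_scale: float = 0.1       # KL loss coefficient (illustrative value)
    horizon: int = 15           # imagination horizon for the actor-critic

print(WorldModelConfig())
```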

pith-pipeline@v0.9.0 · 5487 in / 1389 out tokens · 34447 ms · 2026-05-15T01:21:23.129242+00:00 · methodology

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  2. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

  3. Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

    cs.LG 2026-05 unverdicted novelty 7.0

    Clin-JEPA supplies a multi-phase co-training method for JEPA pretraining on EHR trajectories that achieves converging latent rollouts and improved multi-task AUROC on MIMIC-IV data.

  4. SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

  5. Advantage-Guided Diffusion for Model-Based Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

  6. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  7. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  8. Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

    cs.LG 2026-05 unverdicted novelty 6.0

    A five-phase co-training framework enables stable JEPA pretraining on EHR trajectories, producing converging latent rollouts and higher multi-task AUROC than baselines on MIMIC-IV ICU data.

  9. Predictive but Not Plannable: RC-aux for Latent World Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  10. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  11. TRAP: Tail-aware Ranking Attack for World-Model Planning

    cs.LG 2026-05 unverdicted novelty 6.0

    TRAP is a tail-aware ranking attack that plants a backdoor in world models so that a trigger causes the model to reorder a few critical imagined trajectories and redirect planning while preserving normal behavior on c...

  12. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  13. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  14. Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.

  15. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    cs.LG 2026-03 unverdicted novelty 6.0

    LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.

  16. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  17. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  18. Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    cs.RO 2023-12 conditional novelty 6.0

    A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

  19. Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

    cs.LG 2026-05 unverdicted novelty 5.0

    A delay-aware RL approach learns transferable structured representations and dynamics via implicit causal graphs, outperforming baselines on delayed DMC tasks and accelerating adaptation to new tasks.

  20. Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

    cs.RO 2026-04 unverdicted novelty 5.0

    Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...

  21. The Cartesian Cut in Agentic AI

    cs.AI 2026-04 unverdicted novelty 5.0

    LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.

  22. CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics

    cs.LG 2026-04 unverdicted novelty 5.0

    CausalVAE plug-in for world models preserves factual prediction and boosts counterfactual retrieval, with large gains on physics benchmarks and recovered physical interaction trends.

  23. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  24. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
