arXiv: 1912.01603 · v3 · submitted 2019-12-03 · cs.LG · cs.AI · cs.RO


Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi


Pith reviewed 2026-05-12 01:11 UTC · model grok-4.3

classification: cs.LG, cs.AI, cs.RO

keywords: reinforcement learning, world models, latent space, visual control, model-based planning, gradient optimization


The pith

Dreamer learns behaviors for visual control tasks by propagating gradients through imagined trajectories in a learned latent world model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an agent called Dreamer that first learns a world model from high-dimensional image inputs. It then derives behaviors by optimizing policies using analytic gradients of state values backpropagated through trajectories imagined entirely within the compact latent space of that model. This approach allows solving long-horizon tasks without direct interaction during behavior learning. On 20 challenging visual control tasks, it shows improvements over prior methods in how quickly it learns, how much computation it uses, and how well it performs at the end.

Core claim

Dreamer is a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. Behaviors are learned efficiently by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model.

What carries the argument

Latent imagination, the process of generating and optimizing trajectories inside the learned world model's state space to derive control policies.
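As a rough illustration of what backpropagating through an imagined trajectory means, the sketch below uses a toy setup that is entirely an assumption here and not the paper's implementation: a one-dimensional linear "latent model" s' = a*s + b*u, a linear policy u = theta*s, and a quadratic reward. It omits the learned value estimate and the stochasticity that the summary above describes. The names `rollout` and `return_gradient` are hypothetical. The backward pass walks the imagined states in reverse and accumulates the analytic gradient of the summed reward with respect to the policy parameter.

```python
def rollout(theta, s0=1.0, a=0.9, b=0.5, cost_u=0.1, horizon=15):
    """Imagine `horizon` steps inside a toy scalar latent model (assumed).

    Model: s' = a*s + b*u, policy: u = theta*s,
    reward: r = -s^2 - cost_u*u^2. Returns visited states and the return.
    """
    states, total, s = [], 0.0, s0
    for _ in range(horizon):
        u = theta * s
        total += -(s * s) - cost_u * (u * u)
        states.append(s)
        s = a * s + b * u
    return states, total


def return_gradient(theta, s0=1.0, a=0.9, b=0.5, cost_u=0.1, horizon=15):
    """Reverse-mode gradient of the imagined return with respect to theta."""
    states, _ = rollout(theta, s0, a, b, cost_u, horizon)
    grad_theta = 0.0
    adj_next = 0.0  # d(return)/d(s_{t+1}); zero beyond the last imagined step
    for s in reversed(states):
        u = theta * s
        d_u = -2.0 * cost_u * u + b * adj_next   # via the reward and the next state
        grad_theta += d_u * s                    # because u = theta * s
        adj_next = -2.0 * s + a * adj_next + d_u * theta
    return grad_theta


# Gradient ascent on the imagined return, starting from a do-nothing policy.
theta = 0.0
_, before = rollout(theta)
for _ in range(200):
    theta += 0.005 * return_gradient(theta)
_, after = rollout(theta)
```

In this toy the gradient is exact because the model is known and differentiable. With a learned model, the same chain of derivatives passes through the model's own approximation errors, which is why the accuracy of the model over the imagined horizon matters.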

If this is right

Learning in latent space reduces the need for real environment interactions, improving data efficiency.
Gradient propagation through imagined rollouts enables faster optimization compared to sampling-based methods.
The method achieves higher final performance on visual control tasks.
Computation time is reduced because planning happens in a compact latent representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

World models that support long-horizon accuracy could enable planning in even more complex domains like robotics with high-dimensional sensors.
If the latent space captures dynamics well, this could reduce the sample complexity of reinforcement learning in general.
Extending the imagination horizon might require better uncertainty handling in the world model to prevent error accumulation.

Load-bearing premise

The learned world model must stay accurate enough over long imagined horizons for the optimized policies to work when executed in the actual environment.

What would settle it

Measuring the world model's prediction error over the planning horizon, increasing it artificially, and checking whether policies learned via latent imagination still perform as expected in the real environment.
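A crude version of such a stress test can be sketched on a toy problem. Everything below is an assumption made for illustration: the scalar dynamics s' = a*s + b*u, the quadratic reward, the constant bias added to the model's coefficient (much simpler than an error that grows with the horizon), and the grid search that stands in for gradient-based policy learning. The names `episode_return` and `plan_in_model` are hypothetical. A policy gain is chosen inside a deliberately wrong model and then scored under the true dynamics.

```python
def episode_return(theta, a, b=0.5, s0=1.0, horizon=15):
    """Return of policy u = theta*s under scalar dynamics s' = a*s + b*u."""
    s, total = s0, 0.0
    for _ in range(horizon):
        u = theta * s
        total += -(s * s) - 0.1 * (u * u)
        s = a * s + b * u
    return total


def plan_in_model(model_a, candidates):
    """Pick the policy gain that looks best inside the (possibly wrong) model."""
    return max(candidates, key=lambda th: episode_return(th, model_a))


TRUE_A = 0.9
candidates = [-3.0 + 0.05 * i for i in range(61)]   # gains from -3.0 to 0.0
results = {}
for model_error in (0.0, 0.2, 0.4, 0.8):
    theta = plan_in_model(TRUE_A + model_error, candidates)
    results[model_error] = episode_return(theta, TRUE_A)  # score under true dynamics
```

Here `results` maps each injected model error to the return the resulting policy earns under the true dynamics. The entry for zero error is, by construction, the best the candidate grid can achieve, so the other entries show how much return is lost as the model is made less accurate.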

Original abstract

Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Dreamer shows you can optimize policies by backpropagating through imagined latent trajectories in an RSSM and gets clear gains in sample efficiency on visual control, though long-horizon model accuracy is only validated indirectly via task performance.


The main point is that Dreamer trains both actor and critic entirely inside the latent space of a learned world model by rolling out imagined trajectories and propagating value gradients analytically. This produces better data efficiency, lower compute, and higher final performance than prior model-based and model-free baselines across 20 visual tasks from the DeepMind Control Suite. The approach builds directly on PlaNet but adds the full behavior-learning loop in latent space rather than just planning with MPC. That combination is the concrete advance.

The experiments are straightforward and well-executed: multiple random seeds, ablations on horizon length and model components, and consistent improvements reported with standard metrics. Implementation details for the RSSM and the imagination procedure are given, which helps reproducibility. The central performance claims hold up on the evidence presented.

The soft spot is the one the stress-test flags. The paper does not report separate multi-step prediction error on held-out real trajectories at the full imagination horizons used during training. Without those numbers it remains possible that the method benefits more from short-horizon fidelity plus the actor-critic structure than from genuinely reliable long-horizon latent rollouts. Task success shows the policies transfer, so the gap is not fatal, but it does leave the key modeling assumption less directly tested than the rest of the pipeline.

This work is for researchers focused on sample-efficient model-based RL from pixels. Anyone building agents for robotics or games that need to reduce real-world interactions will find the method and the empirical pattern useful. The paper is coherent on its own terms and the empirical case is strong enough that it deserves a serious referee rather than a desk reject.

Referee Report

1 major / 2 minor

Summary. The paper introduces Dreamer, a model-based RL agent that learns a recurrent state-space model (RSSM) from high-dimensional image observations and derives policies by propagating analytic gradients of learned state values through imagined trajectories in the compact latent space, without requiring real-environment rollouts during planning. It reports that this latent imagination approach yields better data efficiency, lower computation time, and higher final performance than prior methods across 20 visual control tasks.

Significance. If the central performance claims hold, the work provides strong empirical evidence that gradient-based optimization over long-horizon latent trajectories can produce transferable behaviors, advancing sample-efficient model-based RL for visual domains. Credit is due for the breadth of evaluation (20 diverse tasks, multiple baselines, ablation studies) and for supplying implementation details that support reproducibility of the world model and imagination procedure.

major comments (1)

[§4 (Experiments) and Appendix] The central claim that analytic gradients through long-horizon imagined trajectories produce policies that transfer to the real environment rests on the RSSM remaining sufficiently accurate; however, no separate quantitative evaluation of multi-step prediction MSE or horizon-length sensitivity is reported on held-out real trajectories independent of task success. This leaves open whether gains derive primarily from short-horizon fidelity plus the actor-critic rather than reliable long-horizon latent imagination.
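One way the requested diagnostic could be computed is sketched below. It is a hypothetical illustration, not the paper's evaluation: `predict_next` stands in for a learned one-step model, trajectories are plain lists of scalar states with actions omitted, and the error at horizon h is the mean squared difference between the open-loop prediction after h model steps and the held-out state.

```python
def multi_step_mse(trajectories, predict_next, max_horizon):
    """Mean squared open-loop prediction error for horizons 1..max_horizon.

    trajectories: held-out state sequences (lists of floats).
    predict_next: hypothetical one-step model, state -> predicted next state.
    """
    sq_err = [0.0] * max_horizon
    counts = [0] * max_horizon
    for traj in trajectories:
        for start in range(len(traj) - 1):
            pred = traj[start]                 # condition on a real state once
            for h in range(1, max_horizon + 1):
                if start + h >= len(traj):
                    break
                pred = predict_next(pred)      # feed predictions back in (open loop)
                sq_err[h - 1] += (pred - traj[start + h]) ** 2
                counts[h - 1] += 1
    return [e / c if c else float("nan") for e, c in zip(sq_err, counts)]


# Toy check: true dynamics s' = 0.9*s, model slightly wrong at s' = 0.88*s.
held_out = [[0.9 ** t for t in range(30)], [2.0 * 0.9 ** t for t in range(30)]]
errors = multi_step_mse(held_out, lambda s: 0.88 * s, max_horizon=10)
```

In this toy case the error at horizon 5 is larger than at horizon 1 even though the one-step model is nearly right. A table of this kind, reported separately from task reward, is the sort of evidence the comment asks for.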

minor comments (2)

[§3.1] The RSSM transition and observation model equations would benefit from an explicit statement of the exact loss terms used for each component to improve clarity for readers implementing the method.
[Figure 4] The caption should specify the precise imagination horizon length and number of gradient steps used for the reported curves to allow direct comparison with the ablation results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and positive recommendation of minor revision. The feedback helps strengthen the presentation of the latent imagination approach. We address the single major comment below.


Referee: [§4 (Experiments) and Appendix] The central claim that analytic gradients through long-horizon imagined trajectories produce policies that transfer to the real environment rests on the RSSM remaining sufficiently accurate; however, no separate quantitative evaluation of multi-step prediction MSE or horizon-length sensitivity is reported on held-out real trajectories independent of task success. This leaves open whether gains derive primarily from short-horizon fidelity plus the actor-critic rather than reliable long-horizon latent imagination.

Authors: We appreciate the referee's emphasis on isolating the contribution of long-horizon model accuracy. The empirical results across 20 tasks show Dreamer outperforming both model-free agents and prior model-based methods that lack comparable long-horizon latent planning; such gains would be difficult to achieve if the RSSM were limited to short-horizon fidelity. That said, we agree that explicit quantitative metrics would provide additional clarity. In the revised manuscript we will add multi-step prediction MSE evaluated on held-out real trajectories (independent of the RL objective) in the appendix, together with an expanded analysis of performance as a function of imagination horizon length. These additions will be presented separately from task success to directly address the concern.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation separates model learning from policy optimization via independent empirical validation.


The paper's core chain learns an RSSM world model from real experience via variational inference, then optimizes actor-critic parameters by back-propagating value gradients through finite-horizon imagined latent trajectories. Neither the model parameters nor the policy objective reduce to a fitted input by construction; the imagined trajectories are generated from the learned dynamics and the final performance is measured on held-out real-environment rollouts across 20 tasks. Self-citations to prior RSSM work supply the model architecture but do not bear the load of the behavior-learning claim, which is tested externally rather than being tautological. No self-definitional equations, fitted-input predictions, or uniqueness theorems imported from overlapping authors appear in the derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard MDP assumptions and the ability of the RSSM to learn useful dynamics; several architecture and optimization hyperparameters are tuned but are not load-bearing for the conceptual contribution.

free parameters (2)

imagination horizon length
Chosen to trade off planning depth against computation; affects how far gradients are propagated.
RSSM and actor-critic network sizes and learning rates
Standard deep RL hyperparameters tuned on validation tasks.

axioms (2)

domain assumption: The environment dynamics can be captured by a latent state-space model that generalizes to imagined trajectories.
Invoked throughout the world model training and imagination procedure.
domain assumption: Gradients through the imagined model provide a useful learning signal for the policy.
Core justification for backpropagating through latent rollouts instead of model-free updates.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model.

IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...
Operator-Guided Invariance Learning for Continuous Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.
Latent State Design for World Models under Sufficiency Constraints
cs.AI 2026-05 unverdicted novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
RopeDreamer: A Kinematic Recurrent State Space Model for Dynamics of Flexible Deformable Linear Objects
cs.RO 2026-04 unverdicted novelty 7.0

RopeDreamer uses quaternionic kinematic chains in a recurrent state space model with a dual decoder to cut open-loop prediction error by 40.52% over 50 steps on simulated DLO trajectories while preserving physical con...
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
cs.RO 2026-04 unverdicted novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
Beyond Static Forecasting: Unleashing the Power of World Models for Mobile Traffic Extrapolation
cs.NI 2026-04 unverdicted novelty 7.0

MobiWM is a multimodal world model for mobile networks that learns state-action dynamics to enable unlimited-horizon counterfactual traffic simulations and optimization.
MoRight: Motion Control Done Right
cs.CV 2026-04 unverdicted novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
Training Agents Inside of Scalable World Models
cs.AI 2025-09 conditional novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
Mastering Diverse Domains through World Models
cs.AI 2023-01 unverdicted novelty 7.0

DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
Mastering Atari with Discrete World Models
cs.LG 2020-10 accept novelty 7.0

DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
Zero-Shot Sim-to-Real Robot Learning: A Dexterous Manipulation Study on Reactive Catching
cs.RO 2026-05 unverdicted novelty 6.0

DRIS improves zero-shot sim-to-real transfer for reactive catching by maintaining and acting on sets of randomized dynamics instances instead of single instances per episode.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations
cs.LG 2026-05 unverdicted novelty 6.0

LaWM induces latent transitions from a learned discrete variational principle rather than an unconstrained neural predictor, yielding improved physical consistency on synthetic dynamics and robot benchmarks.
Predictive but Not Plannable: RC-aux for Latent World Models
cs.LG 2026-05 unverdicted novelty 6.0

RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
Learning to Theorize the World from Observation
cs.LG 2026-05 unverdicted novelty 6.0

NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
TRAP: Tail-aware Ranking Attack for World-Model Planning
cs.LG 2026-05 unverdicted novelty 6.0

TRAP is a tail-aware ranking attack that plants a backdoor in world models so that a trigger causes the model to reorder a few critical imagined trajectories and redirect planning while preserving normal behavior on c...
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models
cs.LG 2026-04 unverdicted novelty 6.0

Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0

OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
Learning Ad Hoc Network Dynamics via Graph-Structured World Models
cs.LG 2026-04 unverdicted novelty 6.0

G-RSSM learns per-node dynamics in wireless ad hoc networks via graph attention and trains clustering policies through imagined rollouts, generalizing from N=50 training to larger networks.
Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization
cs.CV 2026-04 unverdicted novelty 6.0

A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.
Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation
cs.RO 2026-04 conditional novelty 6.0

An end-to-end RL policy trained via high-fidelity differentiable simulation maps depth images straight to bodyrate commands, achieving top success rates, low jerk, and zero-shot real-world generalization up to 7.5 m/s...
Zero-shot World Models Are Developmentally Efficient Learners
cs.AI 2026-04 unverdicted novelty 6.0

A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
Behavior-Constrained Reinforcement Learning with Receding-Horizon Credit Assignment for High-Performance Control
cs.RO 2026-04 unverdicted novelty 6.0

A behavior-constrained RL framework with receding-horizon credit assignment learns high-performance control policies that stay aligned with expert behavior in race car simulation.
Safety, Security, and Cognitive Risks in World Models
cs.CR 2026-04 unverdicted novelty 6.0

World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...
World Action Models are Zero-shot Policies
cs.RO 2026-02 unverdicted novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
cs.AI 2026-01 conditional novelty 6.0

Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
cs.AI 2025-06 unverdicted novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling
cs.LG 2026-05 unverdicted novelty 5.0

A delay-aware RL approach learns transferable structured representations and dynamics via implicit causal graphs, outperforming baselines on delayed DMC tasks and accelerating adaptation to new tasks.
Nautilus: From One Prompt to Plug-and-Play Robot Learning
cs.RO 2026-05 unverdicted novelty 5.0

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
Neural Control: Adjoint Learning Through Equilibrium Constraints
cs.RO 2026-05 unverdicted novelty 5.0

Neural Control introduces adjoint-based differentiation through implicit equilibrium constraints to enable memory-efficient gradient computation and robust receding-horizon MPC for multi-stable deformable object manip...
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
cs.RO 2026-04 unverdicted novelty 5.0

The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
cs.LG 2026-04 unverdicted novelty 5.0

CausalVAE plug-in for world models preserves factual prediction and boosts counterfactual retrieval, with large gains on physics benchmarks and recovered physical interaction trends.
Neural Computers
cs.LG 2026-04 unverdicted novelty 5.0

Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
cs.CV 2026-04 unverdicted novelty 4.0

OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
World Simulation with Video Foundation Models for Physical AI
cs.CV 2025-10 unverdicted novelty 4.0

Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
Cosmos World Foundation Model Platform for Physical AI
cs.CV 2025-01 unverdicted novelty 3.0

The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
eess.SY 2026-04 unverdicted novelty 2.0

A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 39 Pith papers · 10 internal anchors

[1]

A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410,

[2]

E. Banijamali, R. Shu, M. Ghavamzadeh, H. Bui, and A. Ghodsi. Robust locally-linear controllable embedding. arXiv preprint arXiv:1710.05373,

[3]

Distributed distributional deterministic policy gradients

G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617,

[4]

DeepMind Lab

C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, et al. Deepmind lab. arXiv preprint arXiv:1612.03801,

[5]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432,

[6]

Learning and Querying Fast Generative Models for Reinforcement Learning

L. Buesing, T. Weber, S. Racaniere, S. Eslami, D. Rezende, D. P. Reichert, F. Viola, F. Besse, K. Gregor, D. Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006,

[7]

Imagined Value Gradients: Model-Based Policy Optimization With Transferable Latent Dynamics Models

A. Byravan, J. T. Springenberg, A. Abdolmaleki, R. Hafner, M. Neunert, T. Lampe, N. Siegel, N. Heess, and M. Riedmiller. Imagined value gradients: Model-based policy optimization with transferable latent dynamics models. arXiv preprint arXiv:1910.04142,

[8]

P. S. Castro, S. Moitra, C. Gelada, S. Kumar, and M. G. Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110,

[9]

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289,

[10]

J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous. Tensorflow distributions. arXiv preprint arXiv:1711.10604,

[11]

Probabilistic Recurrent State-Space Models

A. Doerr, C. Daniel, M. Schiegg, D. Nguyen-Tuong, S. Schaal, M. Toussaint, and S. Trimpe. Probabilistic recurrent state-space models. arXiv preprint arXiv:1801.10395,

[12]

Self-Supervised Visual Planning with Temporal Skip Connections

F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268,

[13]

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561,

[14]

Model-based value estimation for efficient model-free reinforcement learning

V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018

[15]

C. Gelada, S. Kumar, J. Buckman, O. Nachum, and M. G. Bellemare. Deepmdp: Learning continuous latent space models for representation learning. arXiv preprint arXiv:1906.02736,

Published as a conference paper at ICLR 2020 (page header of the source paper).

[16]

K. Gregor, D. J. Rezende, F. Besse, Y. Wu, H. Merzic, and A. v. d. Oord. Shaping belief states with generative environment models for rl. arXiv preprint arXiv:1906.09237,

[17]

Z. D. Guo, M. G. Azar, B. Piot, B. A. Pires, T. Pohlen, and R. Munos. Neural predictive belief representations. arXiv preprint arXiv:1811.06407,

[18]

World Models

D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122,

[19]

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290,

[20]

Learning Latent Dynamics for Planning from Pixels

D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551,

[22]

Model-Based Planning with Discrete and Continuous Actions

M. Henaff, W. F. Whitney, and Y. LeCun. Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177,

[23]

M. Henaff, A. Canziani, and Y. LeCun. Model-predictive policy learning with uncertainty regularization for driving in dense traffic. arXiv preprint arXiv:1901.02705,

[24]

Reinforcement learning with unsupervised auxiliary tasks

M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397,

work page arXiv
[25]

Model-Based Reinforcement Learning for Atari

L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374,

work page arXiv 1903
[26]

M. Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432,

work page Pith review arXiv
[27]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

R. G. Krishnan, U. Shalit, and D. Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121,

work page Pith review arXiv
[30]

Model-Ensemble Trust-Region Policy Optimization

T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592,

work page Pith review arXiv
[31]


Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551,

work page
[32]

A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953,

work page arXiv 1907
[33]

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971,

work page internal anchor Pith review Pith/arXiv arXiv
[34]


K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848,

work page arXiv
[35]


D. McAllester and K. Stratos. Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251,

work page arXiv
[36]

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937,

work page
[37]

A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748,

work page internal anchor Pith review Pith/arXiv arXiv
[38]

PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos

P. Parmas, C. E. Rasmussen, J. Peters, and K. Doya. PIPPS: Flexible model-based policy search robust to the curse of chaos. arXiv preprint arXiv:1902.01240,

work page Pith review arXiv 1902
[39]


A. Piergiovanni, A. Wu, and M. S. Ryoo. Learning real-world robot policies by dreaming. arXiv preprint arXiv:1805.07813,

work page arXiv
[40]


B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922,

work page arXiv 1905
[41]

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082,

work page Pith review arXiv
[42]

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.

work page arXiv 1911
[43]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Universal Planning Networks

A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn. Universal planning networks. arXiv preprint arXiv:1804.00645,

work page Pith review arXiv
[45]

DeepMind Control Suite

Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690,

work page internal anchor Pith review arXiv
[46]

The information bottleneck method

N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics/0004057,

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Exploring Model-based Planning with Policy Networks

T. Wang and J. Ba. Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649,

work page Pith review arXiv 1906
[48]

T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba. Benchmarking model-based reinforcement learning. CoRR, abs/1907.02057,

work page Pith review arXiv 1907
[49]

Imagination-Augmented Agents for Deep Reinforcement Learning

T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203,

work page Pith review arXiv
[50]


A HYPERPARAMETERS

Model components. We use the convolutional encoder and decoder networks from Ha and Schmidhuber (2018), the RSSM of Hafner et al. (2018), and implement all other functions as three dense layers of size 300 with ELU activations (Clevert et al., 2015). Distributions in latent space are 30-...
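As a rough illustration of the "three dense layers of size 300 with ELU activations" description, the following NumPy sketch builds such a network. It is not the authors' implementation. The function names (`elu`, `init_mlp`, `mlp`), the input and output sizes, the initialization scheme, and the choice to follow the three hidden layers with a linear output head are all assumptions made for the example.

```python
import numpy as np


def elu(x, alpha=1.0):
    # ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise.
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


def init_mlp(rng, in_dim, out_dim, hidden=300, layers=3):
    # Assumption: `layers` hidden layers of width `hidden`, then a linear head.
    sizes = [in_dim] + [hidden] * layers + [out_dim]
    return [
        (rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def mlp(params, x):
    # ELU after every hidden layer; no activation on the output head.
    for w, b in params[:-1]:
        x = elu(x @ w + b)
    w, b = params[-1]
    return x @ w + b


# Hypothetical sizes: a batch of 4 feature vectors of length 8, scalar output.
params = init_mlp(np.random.default_rng(0), in_dim=8, out_dim=1)
outputs = mlp(params, np.zeros((4, 8)))
```

The same helper could stand in for any of the dense components by changing `out_dim`. The excerpt does not state how each output is parameterized, so that detail is left open here.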

work page 2020
[51]


but clip them below 3 free nats as in PlaNet. The imagination horizon is H = 15 and the same trajectories are used to update both action and value models. We compute the Vλ targets with γ = 0.99 and λ = 0.95. We did not find latent overshooting for learning the model, an entropy bonus for the action model, or target networks for the value model necessary. ...
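To make the role of γ, λ, and H concrete, the sketch below computes λ-return targets along one imagined trajectory using the standard recursive form of the λ-return. The excerpt does not reproduce the paper's exact definition of Vλ, so treat this as an assumed, commonly used formulation. The function name `lambda_returns` and the array layout are hypothetical.

```python
import numpy as np


def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Recursive lambda-return targets for one imagined trajectory.

    rewards: shape (H,), reward predicted at each imagined step.
    values:  shape (H + 1,), value estimates for the H + 1 imagined states;
             the last entry bootstraps the return beyond the horizon.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    horizon = rewards.shape[0]
    if values.shape[0] != horizon + 1:
        raise ValueError("values must have one more entry than rewards")

    returns = np.zeros(horizon)
    next_return = values[-1]  # bootstrap from the final value estimate
    for t in reversed(range(horizon)):
        # Mix the one-step bootstrap with the longer lambda-return.
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


# Example with the stated settings: H = 15, gamma = 0.99, lambda = 0.95.
H = 15
targets = lambda_returns(np.ones(H), np.zeros(H + 1))
```

With `lam=0` this reduces to one-step targets, and with `lam=1` it becomes the discounted reward sum plus the bootstrapped final value. Intermediate values such as 0.95 interpolate between the two.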

work page 2018
[52]

for latent dynamics models,

$$\max \; \mathrm{I}\big(s_{1:T};\, (o_{1:T}, r_{1:T}) \mid a_{1:T}\big) \;-\; \beta \, \mathrm{I}\big(s_{1:T},\, i_{1:T} \mid a_{1:T}\big), \tag{13}$$

where $\beta$ is a scalar and $i_t$ are dataset indices that determine the observations $p(o_t \mid i_t) \doteq \delta(o_t - \bar{o}_t)$ as in Alemi et al. (2016). Maximizing the objective leads to model states that can predict the sequence of observations and rewards while limiting the amoun...

work page 2016
[53]

and DeepMind Lab (Beattie et al., 2016). While agents that purely learn through world models are not yet competitive in these domains (Kaiser et al., 2019), the tasks offer a diverse test bed with visual complexity, sparse rewards, and early termination. Agents observe 64 × 64 × 3 images and select one of between 3 and 18 actions. For Atari, we follow the...

work page 2016