Recognition: 2 theorem links · Lean Theorem
Dream to Control: Learning Behaviors by Latent Imagination
Pith reviewed 2026-05-12 01:11 UTC · model grok-4.3
The pith
Dreamer learns behaviors for visual control tasks by propagating gradients through imagined trajectories in a learned latent world model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dreamer is a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. Behaviors are learned efficiently by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model.
What carries the argument
Latent imagination, the process of generating and optimizing trajectories inside the learned world model's state space to derive control policies.
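As a rough illustration of what "propagating analytic gradients through imagined trajectories" means, the sketch below uses a hand-picked one-dimensional linear model in place of a learned world model. Everything in it is an assumption made for illustration (the dynamics `s' = a*s + b*u`, the linear policy `u = -k*s`, the reward `-s**2`, and the function names); it is not the paper's RSSM or actor-critic. The derivative of the imagined return with respect to the policy gain `k` is carried forward step by step with the chain rule, so no sampling-based gradient estimate is needed.

```python
# Toy illustration only: a hand-picked 1-D linear "latent model", not a learned RSSM.

def imagined_return_and_grad(k, s0=1.0, a=0.9, b=0.5, horizon=15, gamma=0.99):
    """Roll the policy u = -k * s through the model s' = a*s + b*u.

    Returns the discounted return of rewards r = -s**2 and its exact
    derivative with respect to the policy gain k, obtained by carrying
    ds/dk forward alongside s (forward-mode differentiation).
    """
    s, ds_dk = s0, 0.0            # imagined state and its sensitivity to k
    ret, dret_dk = 0.0, 0.0
    discount = 1.0
    for _ in range(horizon):
        # Policy and its derivative: u = -k*s  =>  du/dk = -s - k * ds/dk
        u = -k * s
        du_dk = -s - k * ds_dk
        # Model step, with the chain rule applied through the transition
        s, ds_dk = a * s + b * u, a * ds_dk + b * du_dk
        # Reward evaluated on the imagined next state
        ret += discount * (-(s ** 2))
        dret_dk += discount * (-2.0 * s * ds_dk)
        discount *= gamma
    return ret, dret_dk


def improve_policy(k=0.0, lr=0.05, steps=200):
    """Gradient ascent on the imagined return using the analytic gradient."""
    for _ in range(steps):
        _, grad = imagined_return_and_grad(k)
        k += lr * grad
    return k
```

In this toy setting the gain that drives the state to zero in one step is `a / b = 1.8`, and `improve_policy()` moves toward it. In practice an automatic differentiation framework would replace the hand-written derivative bookkeeping.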
If this is right
- Learning in latent space reduces the need for real environment interactions, improving data efficiency.
- Gradient propagation through imagined rollouts enables faster optimization compared to sampling-based methods.
- The method achieves higher final performance on visual control tasks.
- Computation time is reduced because planning happens in a compact latent representation.
Where Pith is reading between the lines
- World models that support long-horizon accuracy could enable planning in even more complex domains like robotics with high-dimensional sensors.
- If the latent space captures dynamics well, this could reduce the sample complexity of reinforcement learning in general.
- Extending the imagination horizon might require better uncertainty handling in the world model to prevent error accumulation.
Load-bearing premise
The learned world model must stay accurate enough over long imagined horizons for the optimized policies to work when executed in the actual environment.
What would settle it
Measuring the world model's multi-step prediction error, increasing it artificially over the planning horizon, and testing whether policies learned via latent imagination still perform as well as expected.
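One way such a measurement could be set up is sketched below: open-loop prediction error is accumulated separately for each horizon length on held-out trajectories, independent of task reward. The interface is hypothetical (`step_fn`, scalar states, and the `(states, actions)` trajectory layout are assumptions for illustration, not anything specified by the paper).

```python
def open_loop_mse_by_horizon(step_fn, trajectories, max_horizon):
    """Mean squared open-loop prediction error for horizons 1..max_horizon.

    step_fn(state, action) -> predicted next state (hypothetical model interface).
    trajectories: list of (states, actions) with len(states) == len(actions) + 1.
    """
    sq_err = [0.0] * max_horizon
    counts = [0] * max_horizon
    for states, actions in trajectories:
        for start in range(len(actions)):
            pred = states[start]                       # condition on one real state
            for h in range(min(max_horizon, len(actions) - start)):
                # Feed the model its own prediction, never the real next state
                pred = step_fn(pred, actions[start + h])
                sq_err[h] += (pred - states[start + h + 1]) ** 2
                counts[h] += 1
    return [e / c if c else float("nan") for e, c in zip(sq_err, counts)]
```

Entry `h` of the result is the error after `h + 1` imagined steps. A curve that stays flat out to the imagination horizon would support the premise; one that rises sharply would suggest that gains come mostly from short-horizon fidelity.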
Original abstract
Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Dreamer, a model-based RL agent that learns a recurrent state-space model (RSSM) from high-dimensional image observations and derives policies by propagating analytic gradients of learned state values through imagined trajectories in the compact latent space, without requiring real-environment rollouts during planning. It reports that this latent imagination approach yields better data efficiency, lower computation time, and higher final performance than prior methods across 20 visual control tasks.
Significance. If the central performance claims hold, the work provides strong empirical evidence that gradient-based optimization over long-horizon latent trajectories can produce transferable behaviors, advancing sample-efficient model-based RL for visual domains. Credit is due for the breadth of evaluation (20 diverse tasks, multiple baselines, ablation studies) and for supplying implementation details that support reproducibility of the world model and imagination procedure.
major comments (1)
- [§4 and Appendix] §4 (Experiments) and Appendix: the central claim that analytic gradients through long-horizon imagined trajectories produce policies that transfer to the real environment rests on the RSSM remaining sufficiently accurate; however, no separate quantitative evaluation of multi-step prediction MSE or horizon-length sensitivity is reported on held-out real trajectories independent of task success. This leaves open whether gains derive primarily from short-horizon fidelity plus the actor-critic rather than reliable long-horizon latent imagination.
minor comments (2)
- [§3.1] §3.1: the RSSM transition and observation model equations would benefit from an explicit statement of the exact loss terms used for each component to improve clarity for readers implementing the method.
- [Figure 4] Figure 4: the caption should specify the precise imagination horizon length and number of gradient steps used for the reported curves to allow direct comparison with the ablation results.
Simulated Author's Rebuttal
We thank the referee for the constructive review and positive recommendation of minor revision. The feedback helps strengthen the presentation of the latent imagination approach. We address the single major comment below.
Point-by-point responses
Referee: [§4 and Appendix] §4 (Experiments) and Appendix: the central claim that analytic gradients through long-horizon imagined trajectories produce policies that transfer to the real environment rests on the RSSM remaining sufficiently accurate; however, no separate quantitative evaluation of multi-step prediction MSE or horizon-length sensitivity is reported on held-out real trajectories independent of task success. This leaves open whether gains derive primarily from short-horizon fidelity plus the actor-critic rather than reliable long-horizon latent imagination.
Authors: We appreciate the referee's emphasis on isolating the contribution of long-horizon model accuracy. The empirical results across 20 tasks show Dreamer outperforming both model-free agents and prior model-based methods that lack comparable long-horizon latent planning; such gains would be difficult to achieve if the RSSM were limited to short-horizon fidelity. That said, we agree that explicit quantitative metrics would provide additional clarity. In the revised manuscript we will add multi-step prediction MSE evaluated on held-out real trajectories (independent of the RL objective) in the appendix, together with an expanded analysis of performance as a function of imagination horizon length. These additions will be presented separately from task success to directly address the concern.
Revision: partial
Circularity Check
No significant circularity; derivation separates model learning from policy optimization via independent empirical validation.
Full rationale
The paper's core chain learns an RSSM world model from real experience via variational inference, then optimizes actor-critic parameters by back-propagating value gradients through finite-horizon imagined latent trajectories. Neither the model parameters nor the policy objective reduce to a fitted input by construction; the imagined trajectories are generated from the learned dynamics and the final performance is measured on held-out real-environment rollouts across 20 tasks. Self-citations to prior RSSM work supply the model architecture but do not bear the load of the behavior-learning claim, which is tested externally rather than being tautological. No self-definitional equations, fitted-input predictions, or uniqueness theorems imported from overlapping authors appear in the derivation.
Axiom & Free-Parameter Ledger
free parameters (2)
- imagination horizon length
- RSSM and actor-critic network sizes and learning rates
axioms (2)
- domain assumption: The environment dynamics can be captured by a latent state-space model that generalizes to imagined trajectories.
- domain assumption: Gradients through the imagined model provide a useful learning signal for the policy.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
  Relation between the paper passage and the cited Recognition theorem: unclear
  We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model.
- IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 41 Pith papers
- JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
  JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...
- Operator-Guided Invariance Learning for Continuous Reinforcement Learning
  VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.
- Latent State Design for World Models under Sufficiency Constraints
  World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
- RopeDreamer: A Kinematic Recurrent State Space Model for Dynamics of Flexible Deformable Linear Objects
  RopeDreamer uses quaternionic kinematic chains in a recurrent state space model with a dual decoder to cut open-loop prediction error by 40.52% over 50 steps on simulated DLO trajectories while preserving physical con...
- Mask World Model: Predicting What Matters for Robust Robot Policy Learning
  Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
- Beyond Static Forecasting: Unleashing the Power of World Models for Mobile Traffic Extrapolation
  MobiWM is a multimodal world model for mobile networks that learns state-action dynamics to enable unlimited-horizon counterfactual traffic simulations and optimization.
- MoRight: Motion Control Done Right
  MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
- Training Agents Inside of Scalable World Models
  Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
- Mastering Diverse Domains through World Models
  DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
- Mastering Atari with Discrete World Models
  DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
- Zero-Shot Sim-to-Real Robot Learning: A Dexterous Manipulation Study on Reactive Catching
  DRIS improves zero-shot sim-to-real transfer for reactive catching by maintaining and acting on sets of randomized dynamics instances instead of single instances per episode.
- One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
  Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
- One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
  Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
- LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations
  LaWM induces latent transitions from a learned discrete variational principle rather than an unconstrained neural predictor, yielding improved physical consistency on synthetic dynamics and robot benchmarks.
- Predictive but Not Plannable: RC-aux for Latent World Models
  RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
- Learning to Theorize the World from Observation
  NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
- TRAP: Tail-aware Ranking Attack for World-Model Planning
  TRAP is a tail-aware ranking attack that plants a backdoor in world models so that a trigger causes the model to reorder a few critical imagined trajectories and redirect planning while preserving normal behavior on c...
- Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
  Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
- Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models
  Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
- Learning Ad Hoc Network Dynamics via Graph-Structured World Models
  G-RSSM learns per-node dynamics in wireless ad hoc networks via graph attention and trains clustering policies through imagined rollouts, generalizing from N=50 training to larger networks.
- Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization
  A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.
- Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation
  An end-to-end RL policy trained via high-fidelity differentiable simulation maps depth images straight to bodyrate commands, achieving top success rates, low jerk, and zero-shot real-world generalization up to 7.5 m/s...
- Zero-shot World Models Are Developmentally Efficient Learners
  A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
- Behavior-Constrained Reinforcement Learning with Receding-Horizon Credit Assignment for High-Performance Control
  A behavior-constrained RL framework with receding-horizon credit assignment learns high-performance control policies that stay aligned with expert behavior in race car simulation.
- Safety, Security, and Cognitive Risks in World Models
  World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...
- World Action Models are Zero-shot Policies
  DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
- Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
  Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
  V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
- Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling
  A delay-aware RL approach learns transferable structured representations and dynamics via implicit causal graphs, outperforming baselines on delayed DMC tasks and accelerating adaptation to new tasks.
- Nautilus: From One Prompt to Plug-and-Play Robot Learning
  NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
- Neural Control: Adjoint Learning Through Equilibrium Constraints
  Neural Control introduces adjoint-based differentiation through implicit equilibrium constraints to enable memory-efficient gradient computation and robust receding-horizon MPC for multi-stable deformable object manip...
- World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
  The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
- CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
  CausalVAE plug-in for world models preserves factual prediction and boosts counterfactual retrieval, with large gains on physics benchmarks and recovered physical interaction trends.
- Neural Computers
  Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
  OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
- World Simulation with Video Foundation Models for Physical AI
  Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
- Cosmos World Foundation Model Platform for Physical AI
  The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
- Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
  A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.
Reference graph
Works this paper leans on
- [1]
-
[2]
E. Banijamali, R. Shu, M. Ghavamzadeh, H. Bui, and A. Ghodsi. Robust locally-linear controllable embedding. arXiv preprint arXiv:1710.05373,
-
[3]
Distributed distributional deterministic policy gradients
G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617,
-
[4]
C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, et al. Deepmind lab. arXiv preprint arXiv:1612.03801,
-
[5]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432,
-
[6]
Learning and Querying Fast Generative Models for Reinforcement Learning
L. Buesing, T. Weber, S. Racaniere, S. Eslami, D. Rezende, D. P. Reichert, F. Viola, F. Besse, K. Gregor, D. Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006,
-
[7]
Imagined Value Gradients: Model-Based Policy Optimization With Transferable Latent Dynamics Models
A. Byravan, J. T. Springenberg, A. Abdolmaleki, R. Hafner, M. Neunert, T. Lampe, N. Siegel, N. Heess, and M. Riedmiller. Imagined value gradients: Model-based policy optimization with transferable latent dynamics models. arXiv preprint arXiv:1910.04142,
-
[8]
P. S. Castro, S. Moitra, C. Gelada, S. Kumar, and M. G. Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110,
-
[9]
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289,
- [10]
-
[11]
Probabilistic Recurrent State-Space Models
A. Doerr, C. Daniel, M. Schiegg, D. Nguyen-Tuong, S. Schaal, M. Toussaint, and S. Trimpe. Probabilistic recurrent state-space models. arXiv preprint arXiv:1801.10395,
-
[12]
Self-Supervised Visual Planning with Temporal Skip Connections
F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268,
-
[13]
L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561,
-
[14]
V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101,
- [15]
- [16]
- [17]
-
[18]
D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122,
-
[19]
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290,
-
[20]
Learning Latent Dynamics for Planning from Pixels
D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551,
-
[22]
Model-Based Planning with Discrete and Continuous Actions
M. Henaff, W. F. Whitney, and Y. LeCun. Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177,
- [23]
-
[24]
Reinforcement learning with unsupervised auxiliary tasks
M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397,
-
[25]
Model-Based Reinforcement Learning for Atari
L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374,
-
[26]
M. Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432,
-
[27]
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
-
[28]
D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,
-
[29]
R. G. Krishnan, U. Shalit, and D. Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121,
-
[30]
Model-Ensemble Trust-Region Policy Optimization
T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592,
- [31]
- [32]
-
[33]
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971,
- [34]
-
[35]
D. McAllester and K. Statos. Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251,
-
[36]
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937,
-
[37]
A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748,
-
[38]
PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos
P. Parmas, C. E. Rasmussen, J. Peters, and K. Doya. Pipps: Flexible model-based policy search robust to the curse of chaos. arXiv preprint arXiv:1902.01240,
-
[39]
A. Piergiovanni, A. Wu, and M. S. Ryoo. Learning real-world robot policies by dreaming. arXiv preprint arXiv:1805.07813,
-
[40]
On variational bounds of mutual information
B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922,
-
[41]
D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082,
-
[42]
J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265,
-
[43]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
-
[44]
Published as a conference paper at ICLR 2020
A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn. Universal planning networks. arXiv preprint arXiv:1804.00645,
-
[45]
Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690,
-
[46]
The information bottleneck method
N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics/0004057,
-
[47]
Exploring Model-based Planning with Policy Networks
T. Wang and J. Ba. Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649,
-
[48]
T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba. Benchmarking model-based reinforcement learning. CoRR, abs/1907.02057,
-
[49]
Imagination-Augmented Agents for Deep Reinforcement Learning
T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203,
-
[50]
A Hyperparameters
Model components: We use the convolutional encoder and decoder networks from Ha and Schmidhuber (2018), the RSSM of Hafner et al. (2018), and implement all other functions as three dense layers of size 300 with ELU activations (Clevert et al., 2015). Distributions in latent space are 30-...
-
[51]
but clip them below 3 free nats as in PlaNet. The imagination horizon is H = 15 and the same trajectories are used to update both action and value models. We compute the Vλ targets with γ = 0.99 and λ = 0.95. We did not find latent overshooting for learning the model, an entropy bonus for the action model, or target networks for the value model necessary. ...
-
[52]
for latent dynamics models,
$$\max\; I\big(s_{1:T};\, (o_{1:T}, r_{1:T}) \mid a_{1:T}\big) \;-\; \beta\, I\big(s_{1:T},\, i_{1:T} \mid a_{1:T}\big), \tag{13}$$
where $\beta$ is scalar and $i_t$ are dataset indices that determine the observations $p(o_t \mid i_t) \doteq \delta(o_t - \bar{o}_t)$ as in Alemi et al. (2016). Maximizing the objective leads to model states that can predict the sequence of observations and rewards while limiting the amoun...
-
[53]
and DeepMind Lab (Beattie et al., 2016). While agents that purely learn through world models are not yet competitive in these domains (Kaiser et al., 2019), the tasks offer a diverse test bed with visual complexity, sparse rewards, and early termination. Agents observe 64 × 64 × 3 images and select one of between 3 and 18 actions. For Atari, we follow the...
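The hyperparameter excerpt above mentions Vλ targets computed with γ = 0.99, λ = 0.95, and an imagination horizon of H = 15. The sketch below shows one common recursive way to compute λ-return targets along a single imagined trajectory. It is an assumption for illustration: the excerpt does not reproduce the paper's exact definition, and indexing conventions (which step a reward belongs to, how the final value bootstraps) vary between implementations. The function and argument names are hypothetical.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Recursive lambda-return targets along one imagined trajectory.

    rewards[t] is the reward predicted for step t (length H).
    values[t] is the value estimate for the state at step t (length H + 1);
    values[H] bootstraps beyond the imagination horizon.
    """
    horizon = len(rewards)
    assert len(values) == horizon + 1
    targets = [0.0] * horizon
    next_target = values[horizon]      # bootstrap from the final value estimate
    for t in reversed(range(horizon)):
        # Blend the one-step bootstrap with the longer lambda-return
        next_target = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_target
        )
        targets[t] = next_target
    return targets
```

With `lam=0.0` this reduces to one-step targets `r_t + gamma * v_{t+1}`; with `lam=1.0` it becomes the discounted sum of imagined rewards plus a bootstrapped value at the horizon. Intermediate values trade off the bias of the value estimate against reliance on long imagined rollouts.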