Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
Recognition: 2 theorem links (Lean)
Pith reviewed 2026-05-16 23:52 UTC · model grok-4.3
The pith
MuZero achieves superhuman performance in Atari, Go, chess and shogi by learning a model that predicts only the reward, policy and value needed for planning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MuZero learns a model that, when applied iteratively, predicts the reward, the action-selection policy, and the value function. When this model is used inside tree-based search, the resulting agent reaches superhuman performance across visually complex domains without any knowledge of their underlying dynamics, and matches AlphaZero on Go, chess and shogi without being given the game rules.
What carries the argument
The MuZero learned model that iteratively predicts reward, policy and value inside Monte Carlo tree search.
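As the abstract describes, the model is applied iteratively: a representation function encodes the observation once, then a dynamics function and a prediction function are unrolled entirely in latent space. A minimal NumPy sketch of that loop; the function names follow the paper's representation/dynamics/prediction decomposition, but every shape and stub computation here is illustrative, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS, HIDDEN = 4, 8

def representation(observation):
    """Encode a raw observation into a hidden state (stub: random projection)."""
    W = rng.standard_normal((HIDDEN, observation.size))
    return np.tanh(W @ observation.ravel())

def dynamics(state, action):
    """(hidden state, action) -> (predicted reward, next hidden state).
    No environment step and no observation reconstruction happen here."""
    a_vec = np.zeros(HIDDEN)
    a_vec[action] = 1.0                  # toy action embedding
    next_state = np.tanh(state + a_vec)  # stand-in transition
    reward = float(state.mean())         # stand-in reward head
    return reward, next_state

def prediction(state):
    """hidden state -> (policy over actions, scalar value)."""
    logits = state[:NUM_ACTIONS]
    policy = np.exp(logits) / np.exp(logits).sum()
    return policy, float(state.sum())

# Iterated application: plan in latent space from a single real observation.
state = representation(rng.standard_normal((3, 3)))
for action in (0, 2, 1):
    reward, state = dynamics(state, action)
    policy, value = prediction(state)
```

The point the sketch makes is structural: after the one call to representation, planning never touches the environment again, and only reward, policy, and value are predicted, never the next observation.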
If this is right
- Planning methods can now be applied to domains that lack perfect simulators, such as real-world control tasks.
- A new state of the art is reached on the full set of 57 Atari games.
- Superhuman performance is obtained in Go, chess and shogi with zero prior knowledge of the rules.
- Predictions can be limited to reward, policy and value rather than full next-state reconstruction while still enabling effective search.
Where Pith is reading between the lines
- Targeted prediction of planning quantities may be sufficient for many sequential decision problems where full model learning is intractable.
- The same architecture could be tested in continuous or partially observable settings where accumulating model error has historically limited planning.
- If prediction accuracy holds at longer horizons, similar learned models might reduce the sample complexity gap between model-based and model-free methods in visual domains.
Load-bearing premise
The learned predictions remain accurate enough over many steps to support planning even when the true dynamics are unknown and high-dimensional.
What would settle it
A direct comparison showing that MuZero's Go performance falls substantially below AlphaZero's, which would indicate that replacing the true game rules with a learned model costs planning strength, or evidence that prediction error compounds quickly enough to prevent superhuman play on any Atari game.
Original abstract
Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the MuZero algorithm, which learns a model to iteratively predict reward, policy, and value for use inside Monte Carlo tree search. This enables planning in environments with unknown dynamics. The method achieves a new state of the art on 57 Atari games and matches the superhuman performance of AlphaZero on Go, chess, and shogi without being given the game rules or dynamics.
Significance. If the results hold, this is a significant advance for model-based RL: it demonstrates that a learned model can support effective long-horizon planning in high-dimensional, visually complex domains where prior model-based methods have struggled. The large-scale evaluation (57 Atari games plus three board games), direct comparisons to AlphaZero and prior SOTA agents, and reported training curves provide strong empirical grounding.
Minor comments (3)
- [§3.2] §3.2 (MuZero algorithm): the mapping from the three learned heads (reward, policy, value) to the MCTS backup and selection steps could be stated more explicitly, perhaps with an additional equation or annotated diagram.
- [Table 2] Table 2 (Atari results): while median human-normalized scores are given, adding per-game statistical significance or variance across seeds would strengthen the 'new state of the art' claim.
- [Figure 4] Figure 4 (board-game learning curves): including the AlphaZero curve on the same plot would make the matching-performance claim easier to assess visually.
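The mapping asked for in the first comment can be summarized in a few lines: the policy head supplies the prior in the pUCT selection rule, while the reward and value heads drive the backup. A simplified sketch; the field and function names here are mine, and the real MuZero backup also normalizes Q-values to [0, 1], which this omits:

```python
import math

class Node:
    """Toy MCTS node holding the statistics a MuZero-style search needs
    (field names are illustrative, not the paper's)."""
    def __init__(self, prior):
        self.prior = prior    # from the policy head
        self.reward = 0.0     # from the reward head of the dynamics function
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.25):
    """Selection: pUCT combines the mean value with a prior-weighted exploration bonus."""
    total = sum(ch.visit_count for ch in node.children.values())
    def score(ch):
        u = c_puct * ch.prior * math.sqrt(total + 1) / (1 + ch.visit_count)
        return ch.value() + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def backup(path, leaf_value, discount=0.997):
    """Backup: propagate the leaf value toward the root, folding in
    each node's predicted reward along the way."""
    g = leaf_value
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += g
        g = node.reward + discount * g
```

Selection favors children with high value estimates or high priors and few visits; backup folds each node's predicted reward into the discounted return as it propagates toward the root.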
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work, the recognition of its significance for model-based RL, and the recommendation for minor revision. We appreciate the detailed summary highlighting the key contributions of MuZero in learning models that directly support planning without access to environment dynamics.
Circularity Check
No significant circularity
Full rationale
The MuZero paper defines an algorithmic procedure (representation, dynamics, and prediction functions trained via a combined loss on observed rewards, policies, and values) and validates it through direct empirical evaluation on external benchmarks (57 Atari games and board games) against independent baselines such as AlphaZero and human performance. No load-bearing derivation step equates a claimed result to its own fitted inputs by construction, nor does any central claim reduce to a self-citation chain or renamed empirical pattern; the reported superhuman performance is measured externally and is not tautological with the training objectives.
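The combined loss referenced above is, as stated in the MuZero paper, a sum over an unrolled trajectory of per-step reward, value, and policy losses plus an L2 regularizer; symbols follow the paper's notation, with u, z, and π the observed targets and r, v, and p the model's predictions at unroll step k:

```latex
l_t(\theta) = \sum_{k=0}^{K} \Big[\, l^{r}\big(u_{t+k},\, r_t^{k}\big)
  + l^{v}\big(z_{t+k},\, v_t^{k}\big)
  + l^{p}\big(\pi_{t+k},\, p_t^{k}\big) \Big]
  + c \,\lVert \theta \rVert^{2}
```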
Axiom & Free-Parameter Ledger
Free parameters (1)
- network architecture and optimizer hyperparameters
Axioms (1)
- domain assumption: iterated application of the learned model inside tree search yields predictions accurate enough for superhuman planning
Invented entities (1)
- Learned dynamics model with reward-policy-value heads (no independent evidence)
Forward citations
Cited by 17 Pith papers
- PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling. PMCTS is the first principled parallel MCTS algorithm that preserves formal policy improvement guarantees and scales with parallel compute.
- Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under τ-Mixing. Finite-sample risk bounds for DQN with ReLU networks are extended to τ-mixing data, showing an extra dimensionality penalty in the convergence rate due to dependence.
- Latent State Design for World Models under Sufficiency Constraints. World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
- Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models. Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.
- Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum. Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.
- Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search. Inverse-RPO derives two variance-aware prior-based UCT policies from UCB-V that outperform PUCT on benchmarks with no extra cost.
- Latent Chain-of-Thought World Modeling for End-to-End Driving. LCDrive unifies chain-of-thought reasoning and action selection for end-to-end driving by interleaving action-proposal tokens and latent world-model tokens that predict action outcomes, yielding faster inference and b...
- Training Agents Inside of Scalable World Models. Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
- Mastering Diverse Domains through World Models. DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
- Mastering Atari with Discrete World Models. DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
- Dream to Control: Learning Behaviors by Latent Imagination. Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
- Plan Before You Trade: Inference-Time Optimization for RL Trading Agents. FPILOT optimizes pre-trained RL trading policies at inference time using forecasted price trajectories to improve portfolio allocations and risk-adjusted returns on the DJ30 benchmark.
- Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning. Ms.PR applies multi-scale predictive supervision to enforce goal-directed alignment in latent spaces for offline GCRL, yielding improved representation quality and performance on vision and state-based tasks.
- Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits. A hybrid agent with variational quantum circuits for feature extraction in hierarchical RL outperforms classical baselines with 66% parameter savings, but quantum value estimation degrades results.
- Is Conditional Generative Modeling all you need for Decision-Making? Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
- Interpretable experiential learning based on state history and global feedback. A transition graph model with utility and evidence counts learns behaviors from state history and feedback, showing performance comparable to neural networks on Atari Breakout.
- Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them. XAI-based correction methods outperform non-XAI baselines for fixing spurious correlations in DNNs, with Counterfactual Knowledge Distillation most effective, but all are limited by reliance on unavailable group label...
Reference graph
Works this paper leans on
[1] Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary C. Lipton, and Animashree Anandkumar. Surprising negative results for generative adversarial tree search. CoRR, abs/1806.05780, 2018.
[2] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
[3] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
[4] Lars Buesing, Theophane Weber, Sebastien Racaniere, SM Eslami, Danilo Rezende, David P Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.
[5] Murray Campbell, A. Joseph Hoane, Jr., and Feng-hsiung Hsu. Deep Blue. Artificial Intelligence, 134(1-2):57–83, January 2002.
[6] R. Coulom. Whole-history rating: A Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games, pages 113–124, 2008.
[7] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pages 72–83. Springer, 2006.
[8] MP. Deisenroth and CE. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, pages 465–472. Omnipress, 2011.
[9] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
[10] Gregory Farquhar, Tim Rocktaeschel, Maximilian Igl, and Shimon Whiteson. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. In International Conference on Learning Representations, 2018.
[11] Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning continuous latent space models for representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2170–2..., 2019.
[12] Cloud TPU. https://cloud.google.com/tpu/. Accessed: 2019.
[13] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 2455–2467, USA, 2018. Curran Associates Inc.
[14] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In 14th European Conference on Computer Vision, pages 630–645, 2016.
[16] Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, and Tom Erez. Learning continuous control policies by stochastic value gradients. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2944–2952, Cambridge, MA, USA.
[17] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[18] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018.
[19] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
[20] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.
[21] Steven Kapturowski, Georg Ostrovski, Will Dabney, John Quan, and Remi Munos. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2019.
[22] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[24] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1071–1079. Curran Associates, Inc., 2014.
[25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[26] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
[27] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. Massively parallel methods for deep reinforcement learning. CoRR, abs/1507.04296, 2015.
[28] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128, 2017.
[29]
[30] Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, et al. Observe and look further: Achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593, 2018.
[31]
[32] Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
[33] Maarten PD Schadd, Mark HM Winands, H Jaap Van Den Herik, Guillaume MJ-B Chaslot, and Jos WHM Uiterwijk. Single-player Monte-Carlo tree search. In International Conference on Computers and Games, pages 1–12. Springer, 2008.
[34] Jonathan Schaeffer, Joseph Culberson, Norman Treloar, Brent Knight, Paul Lu, and Duane Szafron. A world championship caliber checkers program. Artificial Intelligence, 53(2-3):273–289, 1992.
[35] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016.
[36] Simon Schmitt, Matteo Hessel, and Karen Simonyan. Off-policy actor-critic with shared experience replay. arXiv preprint arXiv:1909.11583, 2019.
[37] Marwin HS Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.
[38] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the g..., 2016.
[39] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
[40] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, October 2017.
[41] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The Predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3191–3199. JMLR.org, 2017.
[42] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
[43] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
[44] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.
[45] Hado van Hasselt, Matteo Hessel, and John Aslanides. When to use parametric models in reinforcement learning? arXiv preprint arXiv:1906.05243, 2019.
[46] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, pages 1–5, 2019.
[47] I Vlahavas and I Refanidis. Planning and scheduling. EETN, Greece, Tech. Rep, 2013.
[48] Niklas Wahlström, Thomas B. Schön, and Marc Peter Deisenroth. From pixels to torques: Policy learning with deep dynamical models. CoRR, abs/1502.02251, 2015.
[49] Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2746–2754, Cambridge, MA, USA, 2015. MIT Press. Supplementary Materi...
[50] State transitions. AlphaZero had access to a perfect simulator of the true dynamics process. In contrast, MuZero employs a learned dynamics model within its search. Under this model, each node in the tree is represented by a corresponding hidden state; by providing a hidden state s^{k−1} and an action a^k to the model, the search algorithm can transition to a ne...
[51] Actions available. AlphaZero used the set of legal actions obtained from the simulator to mask the prior produced by the network everywhere in the search tree. MuZero only masks legal actions at the root of the search tree, where the environment can be queried, but does not perform any masking within the search tree. This is possible because the network ra...
[52] Terminal nodes. AlphaZero stopped the search at tree nodes representing terminal states and used the terminal value provided by the simulator instead of the value produced by the network. MuZero does not give special treatment to terminal nodes and always uses the value predicted by the network. Inside the tree, the search can proceed past a terminal no...
[53] This ensures that the total gradient applied to the dynamics function stays constant. In the experiments reported in this paper, we always unroll for K = 5 steps. For a detailed illustration, see Figure 1. To improve the learning process and bound the activations, we also scale the hidden state to the same range as the action input ([0, 1]): s_scaled = (s − mi...
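The hidden-state scaling quoted above is cut off mid-formula in this extraction; given that the stated goal is mapping the state into [0, 1], a per-state min-max normalization is the natural reading. A sketch under that assumption:

```python
import numpy as np

def scale_hidden_state(s, eps=1e-12):
    """Min-max normalize a hidden state into [0, 1], the same range as the
    action input planes. The exact formula past 's_scaled = (s - mi...' is
    truncated in the excerpt; plain min-max scaling is assumed here."""
    s = np.asarray(s, dtype=float)
    lo, hi = s.min(), s.max()
    return (s - lo) / max(hi - lo, eps)  # eps guards a constant state

print(scale_hidden_state([-2.0, 0.0, 2.0]))  # -> [0.  0.5 1. ]
```

Bounding the activations this way keeps the latent state on the same scale at every unroll step, so gradients through the dynamics function are not dominated by drifting state magnitudes.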