Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
Abstract
The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.
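The recipe the abstract describes (start from random play, know nothing but the game rules, and improve purely through self-play) can be sketched at toy scale. The sketch below is not from the paper: it substitutes tic-tac-toe for chess, shogi, and Go, and a tabular Monte Carlo value estimate for AlphaZero's deep network and MCTS search. All class and function names are illustrative.

```python
import random
from typing import Optional

# Tic-tac-toe stands in for chess/shogi/Go: the learner sees only the
# rules (legal moves, terminal outcomes) and improves purely by playing
# against itself, mirroring AlphaZero's tabula-rasa setup at toy scale.
# A tabular value function replaces the paper's neural network.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board: str) -> Optional[str]:
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board: str):
    return [i for i, cell in enumerate(board) if cell == "."]

def play(board: str, move: int, player: str) -> str:
    return board[:move] + player + board[move + 1:]

class SelfPlayAgent:
    """Monte Carlo value learning from self-play; no domain knowledge."""

    def __init__(self, lr: float = 0.3, eps: float = 0.2, seed: int = 0):
        self.V = {}                    # state -> estimated value for "X"
        self.lr, self.eps = lr, eps
        self.rng = random.Random(seed)

    def value(self, board: str) -> float:
        return self.V.get(board, 0.0)  # unseen states start neutral

    def choose(self, board: str, player: str, greedy: bool = False) -> int:
        moves = legal_moves(board)
        if not greedy and self.rng.random() < self.eps:
            return self.rng.choice(moves)       # exploration
        sign = 1 if player == "X" else -1       # "O" minimizes X's value
        return max(moves, key=lambda m: sign * self.value(play(board, m, player)))

    def self_play_game(self) -> float:
        """Play one game against itself, then update every visited state."""
        board, player, states = "." * 9, "X", []
        while True:
            board = play(board, self.choose(board, player), player)
            states.append(board)
            w = winner(board)
            if w or not legal_moves(board):
                z = {"X": 1.0, "O": -1.0, None: 0.0}[w]
                for s in states:   # move value estimates toward the outcome
                    self.V[s] = self.value(s) + self.lr * (z - self.value(s))
                return z
            player = "O" if player == "X" else "X"

def train(agent: SelfPlayAgent, games: int = 5000) -> None:
    for _ in range(games):
        agent.self_play_game()
```

After a few thousand self-play games the greedy policy typically dominates a uniformly random opponent. The loop's shape (self-play, outcome-driven value update, improved play feeding the next round of self-play) is the part AlphaZero scales up with a deep network and Monte Carlo tree search.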
Forward citations
Cited by 17 Pith papers
- AI safety via debate
  AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.
- ASH: Agents that Self-Hone via Embodied Learning
  ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
- JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
  JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling.
- Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
  COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
- Causal inference for social network formation
  Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
- AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
  AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
- Advantage-Guided Diffusion for Model-Based Reinforcement Learning
  Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
- On the Measure of Intelligence
  Intelligence is skill-acquisition efficiency, and the ARC benchmark measures human-like general fluid intelligence by testing abstraction and reasoning with minimal, innate-like priors.
- Solving Rubik's Cube with a Robot Hand
  Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
- Toward Modeling Player-Specific Chess Behaviors
  Champion-specific embeddings and limited MCTS in Maia-2 reduce average Jensen-Shannon divergence to 16 historical chess champions' move distributions in a new latent-space metric, even as standard move accuracy falls.
- Evaluating the False Trust engendered by LLM Explanations
  A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
- Scaling Self-Play with Self-Guidance
  SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
- AlphaCNOT: Learning CNOT Minimization with Model-Based Planning
  AlphaCNOT combines reinforcement learning with Monte Carlo Tree Search planning to reduce CNOT gate counts by up to 32% versus heuristics in quantum circuit synthesis.
- Computer Architecture's AlphaZero Moment: Automated Discovery in an Encircled World
  Automated architectural discovery engines can outperform human design teams by exploring massive design spaces and compressing development cycles from months to weeks.
- Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse
  Probabilistic language tries unify compression, sequential decision making, and inference caching by making explicit the prefix structure of any generative model over sequences.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- PAWN: Piece Value Analysis with Neural Networks
  A CNN autoencoder that encodes the entire chessboard state improves MLP prediction of relative piece values by 16% MAE reduction to roughly 0.65 pawns using 12 million Stockfish-labeled positions from grandmaster games.