Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
Abstract
The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.
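The recipe the abstract describes (start from random play, know nothing but the game rules, and improve purely through self-play) can be sketched at toy scale. The sketch below is not from the paper: it substitutes tic-tac-toe for chess, shogi, and Go, and a tabular Monte Carlo value estimate for AlphaZero's deep network and MCTS search. All class and function names are illustrative.

```python
import random
from typing import Optional

# Tic-tac-toe stands in for chess/shogi/Go: the learner sees only the
# rules (legal moves, terminal outcomes) and improves purely by playing
# against itself, mirroring AlphaZero's tabula-rasa setup at toy scale.
# A tabular value function replaces the paper's neural network.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board: str) -> Optional[str]:
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board: str):
    return [i for i, cell in enumerate(board) if cell == "."]

def play(board: str, move: int, player: str) -> str:
    return board[:move] + player + board[move + 1:]

class SelfPlayAgent:
    """Monte Carlo value learning from self-play; no domain knowledge."""

    def __init__(self, lr: float = 0.3, eps: float = 0.2, seed: int = 0):
        self.V = {}                    # state -> estimated value for "X"
        self.lr, self.eps = lr, eps
        self.rng = random.Random(seed)

    def value(self, board: str) -> float:
        return self.V.get(board, 0.0)  # unseen states start neutral

    def choose(self, board: str, player: str, greedy: bool = False) -> int:
        moves = legal_moves(board)
        if not greedy and self.rng.random() < self.eps:
            return self.rng.choice(moves)       # exploration
        sign = 1 if player == "X" else -1       # "O" minimizes X's value
        return max(moves, key=lambda m: sign * self.value(play(board, m, player)))

    def self_play_game(self) -> float:
        """Play one game against itself, then update every visited state."""
        board, player, states = "." * 9, "X", []
        while True:
            board = play(board, self.choose(board, player), player)
            states.append(board)
            w = winner(board)
            if w or not legal_moves(board):
                z = {"X": 1.0, "O": -1.0, None: 0.0}[w]
                for s in states:   # move value estimates toward the outcome
                    self.V[s] = self.value(s) + self.lr * (z - self.value(s))
                return z
            player = "O" if player == "X" else "X"

def train(agent: SelfPlayAgent, games: int = 5000) -> None:
    for _ in range(games):
        agent.self_play_game()
```

After a few thousand self-play games the greedy policy typically dominates a uniformly random opponent. The loop's shape (self-play, outcome-driven value update, improved play feeding the next round of self-play) is the part AlphaZero scales up with a deep network and Monte Carlo tree search.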
Forward citations
Cited by 17 Pith papers
- AI safety via debate
  AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.
- ASH: Agents that Self-Hone via Embodied Learning
  ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
- JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
  JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling.
- Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
  COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
- Causal inference for social network formation
  Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
- AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
  AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
- Advantage-Guided Diffusion for Model-Based Reinforcement Learning
  Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
- On the Measure of Intelligence
  Intelligence is skill-acquisition efficiency, and the ARC benchmark measures human-like general fluid intelligence by testing abstraction and reasoning with minimal, innate-like priors.
- Solving Rubik's Cube with a Robot Hand
  Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
- Toward Modeling Player-Specific Chess Behaviors
  Champion-specific embeddings and limited MCTS in Maia-2 reduce average Jensen-Shannon divergence to 16 historical chess champions' move distributions in a new latent-space metric, even as standard move accuracy falls.
- Evaluating the False Trust engendered by LLM Explanations
  A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
- Scaling Self-Play with Self-Guidance
  SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
- AlphaCNOT: Learning CNOT Minimization with Model-Based Planning
  AlphaCNOT combines reinforcement learning with Monte Carlo Tree Search planning to reduce CNOT gate counts by up to 32% versus heuristics in quantum circuit synthesis.
- Computer Architecture's AlphaZero Moment: Automated Discovery in an Encircled World
  Automated architectural discovery engines can outperform human design teams by exploring massive design spaces and compressing development cycles from months to weeks.
- Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse
  Probabilistic language tries unify compression, sequential decision making, and inference caching by making explicit the prefix structure of any generative model over sequences.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- PAWN: Piece Value Analysis with Neural Networks
  A CNN autoencoder that encodes the entire chessboard state improves MLP prediction of relative piece values by 16% MAE reduction to roughly 0.65 pawns using 12 million Stockfish-labeled positions from grandmaster games.