Nature

Mastering Atari, Go, chess and shogi by planning with a learned model · 2020

8 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

NonZero introduces an interaction score and bandit-formalized proposal rule for local agent deviations in multi-agent MCTS, delivering a sublinear local-regret guarantee and improved sample efficiency on game benchmarks without full joint-action enumeration.

Behavior-Consistent Deep Reinforcement Learning

cs.LG · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.

On Training in Imagination

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-based policy optimization.

AdamO: A Collapse-Suppressed Optimizer for Offline RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

cs.RO · 2026-04-22 · unverdicted · novelty 6.0

Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.

TD-MPC2: Scalable, Robust World Models for Continuous Control

cs.LG · 2023-10-25 · conditional · novelty 6.0

TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.

Reasoning with Language Model is Planning with World Model

cs.CL · 2023-05-24 · unverdicted · novelty 6.0

RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.

Learning to Theorize the World from Observation

cs.LG · 2026-05-05

citing papers explorer

Showing 8 of 8 citing papers.

NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search
cs.LG · 2026-05-01 · unverdicted · none · ref 30
NonZero introduces an interaction score and bandit-formalized proposal rule for local agent deviations in multi-agent MCTS, delivering a sublinear local-regret guarantee and improved sample efficiency on game benchmarks without full joint-action enumeration.

Behavior-Consistent Deep Reinforcement Learning
cs.LG · 2026-05-20 · unverdicted · none · ref 267 · 2 links
QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.

On Training in Imagination
cs.LG · 2026-05-07 · unverdicted · none · ref 17
The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-based policy optimization.

AdamO: A Collapse-Suppressed Optimizer for Offline RL
cs.LG · 2026-05-03 · unverdicted · none · ref 56
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
cs.RO · 2026-04-22 · unverdicted · none · ref 3
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.

TD-MPC2: Scalable, Robust World Models for Continuous Control
cs.LG · 2023-10-25 · conditional · none · ref 134
TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.

Reasoning with Language Model is Planning with World Model
cs.CL · 2023-05-24 · unverdicted · none · ref 131
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.

Learning to Theorize the World from Observation
cs.LG · 2026-05-05 · unreviewed · ref 84
