Canonical reference

Dota 2 with Large Scale Deep Reinforcement Learning

2019 · cs.LG · arXiv 1912.06680

Canonical reference. 93% of citing Pith papers cite this work as background.

57 Pith papers citing it

Background 93% of classified citations

abstract

On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.

citation-role summary

background 13 · other 1

citation-polarity summary

background 13 · unclear 1

representative citing papers

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Generative Agents: Interactive Simulacra of Human Behavior

cs.HC · 2023-04-07 · accept · novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

Generative Language Modeling for Automated Theorem Proving

cs.LG · 2020-09-07 · unverdicted · novelty 8.0

GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

Sample efficient inductive matrix completion with noise and inexact side information

stat.ML · 2026-05-16 · unverdicted · novelty 7.0

Nonconvex projected gradient descent for noisy inductive matrix completion achieves linear convergence and order-optimal error at sample complexity scaling with side-information dimension a instead of ambient dimension n.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Controllability in preference-conditioned multi-objective reinforcement learning

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.

Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

cs.AI · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.

Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters

cs.LG · 2026-05-07 · accept · novelty 7.0

Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control

cs.MA · 2026-04-15 · unverdicted · novelty 7.0

InfoChess proposes a symmetric adversarial game focused purely on information control and probabilistic king-location inference, with RL agents outperforming heuristic baselines and gameplay dissected via belief entropy, cross-entropy, and predictive scores.

Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

cs.LG · 2026-04-04 · conditional · novelty 7.0

PPO in a new competitive game fails due to five implementation bugs and then competitive overfitting where self-play stays near 50% but generalization drops to 21.6%; mixing 20% random opponents restores generalization to 77.1%.

NePPO: Near-Potential Policy Optimization for General-Sum Multi-Agent Reinforcement Learning

cs.LG · 2026-03-07 · unverdicted · novelty 7.0

NePPO learns a player-independent potential function via a novel objective whose minimization yields an approximate Nash equilibrium for general-sum multi-agent games.

An Information-Geometric Approach to Artificial Curiosity

cs.LG · 2025-04-08 · unverdicted · novelty 7.0

Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-based and max-entropy exploration.

Voyager: An Open-Ended Embodied Agent with Large Language Models

cs.AI · 2023-05-25 · unverdicted · novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

pcsp is a shared RL policy using LLM persona embeddings, low-rank projection, and PPO+InfoNCE+KL training that delivers 17x above-chance zero-shot persona identification and 22x faster inference on a 300-persona benchmark.

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

cs.RO · 2026-05-21 · conditional · novelty 6.0

Multi-agent RL with league self-play trains quadrotors to exceed champion human performance in multi-player races above 22 m/s while cutting collisions by 50% and generalizing zero-shot to safer human interaction.

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

cs.RO · 2026-05-19 · accept · novelty 6.0 · 2 refs

ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical comparisons of online and offline-to-online RL algorithms.

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.

Multi-agent AI systems outperform human teams in creativity

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Multi-agent LLM teams outperform human teams in creativity (d=1.50) across tasks by producing more novel ideas, with distinct semantic exploration patterns predicting success for each group.

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

cs.LG · 2026-05-07 · conditional · novelty 6.0 · 2 refs

SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.

citing papers explorer

Showing 50 of 57 citing papers.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing cs.CV · 2026-02-04 · unverdicted · none · ref 3 · internal anchor
VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
Generative Agents: Interactive Simulacra of Human Behavior cs.HC · 2023-04-07 · accept · none · ref 13 · internal anchor
Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
Generative Language Modeling for Automated Theorem Proving cs.LG · 2020-09-07 · unverdicted · none · ref 14 · internal anchor
GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation cs.LG · 2026-05-18 · unverdicted · none · ref 133 · internal anchor
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
Sample efficient inductive matrix completion with noise and inexact side information stat.ML · 2026-05-16 · unverdicted · none · ref 103 · internal anchor
Nonconvex projected gradient descent for noisy inductive matrix completion achieves linear convergence and order-optimal error at sample complexity scaling with side-information dimension a instead of ambient dimension n.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling cs.LG · 2026-05-14 · unverdicted · none · ref 23 · internal anchor
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Controllability in preference-conditioned multi-objective reinforcement learning cs.LG · 2026-05-11 · unverdicted · none · ref 22 · internal anchor
Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding cs.AI · 2026-05-08 · unverdicted · none · ref 96 · 2 links · internal anchor
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters cs.LG · 2026-05-07 · accept · none · ref 265 · internal anchor
Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks cs.AI · 2026-04-22 · unverdicted · none · ref 1 · internal anchor
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control cs.MA · 2026-04-15 · unverdicted · none · ref 3 · internal anchor
InfoChess proposes a symmetric adversarial game focused purely on information control and probabilistic king-location inference, with RL agents outperforming heuristic baselines and gameplay dissected via belief entropy, cross-entropy, and predictive scores.
Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO cs.LG · 2026-04-04 · conditional · none · ref 1 · internal anchor
PPO in a new competitive game fails due to five implementation bugs and then competitive overfitting where self-play stays near 50% but generalization drops to 21.6%; mixing 20% random opponents restores generalization to 77.1%.
NePPO: Near-Potential Policy Optimization for General-Sum Multi-Agent Reinforcement Learning cs.LG · 2026-03-07 · unverdicted · none · ref 2 · internal anchor
NePPO learns a player-independent potential function via a novel objective whose minimization yields an approximate Nash equilibrium for general-sum multi-agent games.
An Information-Geometric Approach to Artificial Curiosity cs.LG · 2025-04-08 · unverdicted · none · ref 10 · internal anchor
Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-based and max-entropy exploration.
Voyager: An Open-Ended Embodied Agent with Large Language Models cs.AI · 2023-05-25 · unverdicted · none · ref 34 · internal anchor
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills
One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents cs.AI · 2026-05-22 · unverdicted · none · ref 18 · internal anchor
pcsp is a shared RL policy using LLM persona embeddings, low-rank projection, and PPO+InfoNCE+KL training that delivers 17x above-chance zero-shot persona identification and 22x faster inference on a 300-persona benchmark.
Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning cs.RO · 2026-05-21 · conditional · none · ref 20 · internal anchor
Multi-agent RL with league self-play trains quadrotors to exceed champion human performance in multi-player races above 22 m/s while cutting collisions by 50% and generalizing zero-shot to safer human interaction.
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders cs.RO · 2026-05-19 · accept · none · ref 3 · 2 links · internal anchor
ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical comparisons of online and offline-to-online RL algorithms.
GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning cs.LG · 2026-05-19 · unverdicted · none · ref 33 · internal anchor
GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.
Multi-agent AI systems outperform human teams in creativity cs.CL · 2026-05-18 · unverdicted · none · ref 14 · internal anchor
Multi-agent LLM teams outperform human teams in creativity (d=1.50) across tasks by producing more novel ideas, with distinct semantic exploration patterns predicting success for each group.
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data cs.LG · 2026-05-07 · conditional · none · ref 2 · 2 links · internal anchor
SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 22 · internal anchor
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 20 · internal anchor
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient cs.LG · 2026-04-28 · unverdicted · none · ref 7 · internal anchor
Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.
Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models cs.LG · 2026-04-28 · unverdicted · none · ref 3 · internal anchor
Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V cs.AI · 2026-04-09 · unverdicted · none · ref 29 · internal anchor
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution cs.CL · 2026-04-03 · unverdicted · none · ref 1 · internal anchor
Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
Heterogeneous Self-Play for Realistic Highway Traffic Simulation cs.AI · 2026-03-31 · accept · none · ref 16 · internal anchor
PHASE uses heterogeneous self-play and context-conditioned policies to achieve realistic, zero-shot highway traffic simulation that outperforms traditional rule-based and self-play models on real-world datasets.
Tournament Informed Adversarial Quality Diversity cs.NE · 2026-01-27 · unverdicted · none · ref 6 · internal anchor
Tournament-informed task selection in adversarial QD produces higher quality and diversity in coevolved solutions across Pong, cat-and-mouse, and pursuers-evaders games.
RAPTOR: A Foundation Policy for Quadrotor Control cs.RO · 2025-09-15 · unverdicted · none · ref 69 · internal anchor
A 2084-parameter recurrent policy trained by distilling 1000 RL teacher policies enables zero-shot control across 10 real quadrotors differing in mass, motors, frames, propellers, and flight controllers.
SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning cs.RO · 2025-06-17 · unverdicted · none · ref 3 · internal anchor
SENIOR improves feedback efficiency and policy learning speed in PbRL by combining motion-distinction query selection via kernel density estimation with preference-guided intrinsic rewards, showing gains on simulated and real robot tasks.
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments cs.AI · 2025-06-03 · unverdicted · none · ref 9 · internal anchor
VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning cs.LG · 2025-05-30 · conditional · none · ref 2 · internal anchor
AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.
Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 60 · internal anchor
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 47 · internal anchor
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
Proximal Policy Distillation cs.LG · 2024-07-21 · conditional · none · ref 4 · internal anchor
PPD integrates PPO into policy distillation so the student collects and uses its own rewards, yielding better sample efficiency and robustness than standard student-distill or teacher-distill on ATARI, Mujoco, and Procgen tasks.
TD-MPC2: Scalable, Robust World Models for Continuous Control cs.LG · 2023-10-25 · conditional · none · ref 133 · internal anchor
TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization cs.LG · 2023-06-09 · unverdicted · none · ref 9 · internal anchor
TreeDQN is a sample-efficient off-policy RL method for combinatorial optimization that uses tree MDPs, requires up to 10 times less training data than on-policy methods, and outperforms state-of-the-art on ML4CO tasks.
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 95 · internal anchor
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 40 · internal anchor
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning cs.RO · 2021-08-24 · conditional · none · ref 3 · internal anchor
Isaac Gym achieves 2-3 orders of magnitude faster robot policy training by keeping physics simulation and PyTorch-based RL entirely on GPU with direct buffer sharing.
Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 183 · internal anchor
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
Jukebox: A Generative Model for Music eess.AS · 2020-04-30 · unverdicted · none · ref 2 · internal anchor
Jukebox generates high-fidelity and diverse songs with singing and coherence up to multiple minutes by compressing raw audio via multi-scale VQ-VAE and modeling the codes with large autoregressive Transformers conditioned on artist, genre, and unaligned lyrics.
Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games cs.LG · 2026-05-14 · unverdicted · none · ref 31 · internal anchor
DAGS initializes policy-gradient self-play from human-derived intermediate states to reduce exploitability in challenging imperfect-information games, with a multi-task flag fix for resulting bias and new benchmark environments.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse cs.CV · 2026-05-11 · unverdicted · none · ref 123 · 2 links · internal anchor
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length cs.AI · 2026-05-04 · unverdicted · none · ref 26 · internal anchor
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations cs.MA · 2026-04-29 · unverdicted · none · ref 3 · internal anchor
A C++ Dec-POMDP simulator using data-oriented design and zero-copy PyTorch integration achieves up to 33 million steps per second on a 16-core CPU, enabling multi-agent policy training in minutes with PPO, DQN, and SAC.
RAMP: Hybrid DRL for Online Learning of Numeric Action Models cs.AI · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
RAMP learns numeric action models online via a DRL-planning feedback loop and outperforms PPO on IPC numeric domains in solvability and plan quality.
ARROW: Augmented Replay for RObust World models cs.LG · 2026-03-12 · unverdicted · none · ref 24 · internal anchor
ARROW adds a distribution-matching long-term replay buffer to DreamerV3 and shows reduced forgetting versus same-size baselines on Atari and Procgen continual RL benchmarks.
From monoliths to modules: Decomposing transducers for efficient world modelling cs.AI · 2025-12-01 · unverdicted · none · ref 3 · internal anchor
A framework for decomposing transducers into sub-transducers on distinct subspaces to enable parallel and interpretable world models.
