Benchmarking Model-Based Reinforcement Learning

· 2019 · cs.LG · arXiv 1907.02057

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

open full Pith review browse 8 citing papers arXiv PDF

abstract

Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL. However, research in model-based RL has not been very standardized. It is fairly common for authors to experiment with self-designed environments, and there are several separate lines of research, which are sometimes closed-sourced or not reproducible. Accordingly, it is an open question how these various existing MBRL algorithms perform relative to each other. To facilitate research in MBRL, in this paper we gather a wide collection of MBRL algorithms and propose over 18 benchmarking environments specially designed for MBRL. We benchmark these algorithms with unified problem settings, including noisy environments. Beyond cataloguing performance, we explore and unify the underlying algorithmic differences across MBRL algorithms. We characterize three key research challenges for future MBRL research: the dynamics bottleneck, the planning horizon dilemma, and the early-termination dilemma. Finally, to maximally facilitate future research on MBRL, we open-source our benchmark in http://www.cs.toronto.edu/~tingwuwang/mbrl.html.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Mastering Atari with Discrete World Models

cs.LG · 2020-10-05 · accept · novelty 7.0

DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

Advantage-Guided Diffusion for Model-Based Reinforcement Learning

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

Dream to Control: Learning Behaviors by Latent Imagination

cs.LG · 2019-12-03 · accept · novelty 7.0

Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

Is Conditional Generative Modeling all you need for Decision-Making?

cs.LG · 2022-11-28 · unverdicted · novelty 6.0

Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.

Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs

cs.LG · 2026-05-02 · unverdicted · novelty 6.0

An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

cs.RO · 2021-08-06 · accept · novelty 6.0

A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all datasets and code.

D2 Actor Critic: Diffusion Actor Meets Distributional Critic

cs.LG · 2025-10-03 · unverdicted · novelty 5.0

D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.

citing papers explorer

Showing 8 of 8 citing papers.

Mastering Atari with Discrete World Models cs.LG · 2020-10-05 · accept · none · ref 49 · internal anchor
DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
Advantage-Guided Diffusion for Model-Based Reinforcement Learning cs.AI · 2026-04-10 · unverdicted · none · ref 35
Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
Dream to Control: Learning Behaviors by Latent Imagination cs.LG · 2019-12-03 · accept · none · ref 48
Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
Is Conditional Generative Modeling all you need for Decision-Making? cs.LG · 2022-11-28 · unverdicted · none · ref 123 · internal anchor
Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 202
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs cs.LG · 2026-05-02 · unverdicted · none · ref 9
An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation cs.RO · 2021-08-06 · accept · none · ref 66
A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all datasets and code.
D2 Actor Critic: Diffusion Actor Meets Distributional Critic cs.LG · 2025-10-03 · unverdicted · none · ref 35 · internal anchor
D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.

Benchmarking Model-Based Reinforcement Learning

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer