Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

Kurtland Chua; Roberto Calandra; Rowan McAllister; Sergey Levine

arXiv: 1805.12114 · v2 · pith:SDGOT44Fnew · submitted 2018-05-30 · cs.LG · cs.AI · cs.RO · stat.ML


classification cs.LG · cs.AI · cs.RO · stat.ML

keywords algorithms · deep · dynamics · model-free · models · asymptotic · fewer · learning


Abstract

Model-based reinforcement learning (RL) algorithms can attain excellent sample efficiency, but often lag behind the best model-free algorithms in terms of asymptotic performance. This is especially true with high-capacity parametric function approximators, such as deep networks. In this paper, we study how to bridge this gap, by employing uncertainty-aware dynamics models. We propose a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation. Our comparison to state-of-the-art model-based and model-free deep RL algorithms shows that our approach matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples (e.g., 8 and 125 times fewer samples than Soft Actor Critic and Proximal Policy Optimization respectively on the half-cheetah task).
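The idea of combining an ensemble of probabilistic dynamics models with sampling-based uncertainty propagation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper's implementation: the ensemble members are hypothetical toy 1-D linear models rather than learned deep networks, each particle is tied to one member for the whole rollout, and plain random shooting stands in as the planner because the abstract does not specify an optimizer.

```python
import numpy as np


def make_toy_ensemble(n_models, rng):
    """Hypothetical stand-in for learned probabilistic networks.

    Each member maps (state, action) to the mean and variance of a
    Gaussian over the next state. Members differ slightly, which mimics
    disagreement between separately trained models.
    """
    gains = 1.0 + 0.05 * rng.standard_normal(n_models)

    def member(gain):
        def predict(state, action):
            mean = gain * state + action
            var = np.full_like(state, 0.01)  # assumed fixed noise level
            return mean, var
        return predict

    return [member(g) for g in gains]


def trajectory_sampling_return(ensemble, s0, actions, reward_fn, n_particles, rng):
    """Propagate particles through the ensemble and average their returns."""
    states = np.full(n_particles, s0, dtype=float)
    # Assumption: each particle follows one ensemble member for the rollout.
    member_idx = rng.integers(len(ensemble), size=n_particles)
    total = np.zeros(n_particles)
    for a in actions:
        next_states = np.empty_like(states)
        for m, model in enumerate(ensemble):
            mask = member_idx == m
            if not mask.any():
                continue
            mean, var = model(states[mask], a)
            # Sample the next state from the member's predictive Gaussian.
            noise = rng.standard_normal(int(mask.sum()))
            next_states[mask] = mean + np.sqrt(var) * noise
        states = next_states
        total += reward_fn(states, a)
    return float(total.mean())


def plan_random_shooting(ensemble, s0, reward_fn, horizon, n_candidates,
                         n_particles, rng):
    """Score random action sequences and return the first action of the best.

    Random shooting is a simple stand-in planner chosen for this sketch.
    """
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    scores = [
        trajectory_sampling_return(ensemble, s0, seq, reward_fn, n_particles, rng)
        for seq in candidates
    ]
    return float(candidates[int(np.argmax(scores))][0])
```

In a model-predictive control loop, `plan_random_shooting` would be called at every step with the current state, only the returned first action would be executed, and the models would be refit on the newly collected transitions. The reward function, horizon, and particle count are all left to the caller in this sketch.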


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Benchmarking Model-Based Reinforcement Learning
cs.LG · 2019-07 · accept · novelty 7.0
Introduces a benchmark suite of over 18 MBRL environments, evaluates multiple algorithms under consistent settings, and identifies three core challenges: dynamics bottleneck, planning horizon dilemma, and early-termin...

Exploring Model-based Planning with Policy Networks
cs.LG · 2019-06 · unverdicted · novelty 7.0
POPLIN combines policy networks with model-predictive planning by optimizing either action sequences or policy parameters, yielding 3x better sample efficiency than PETS, TD3 and SAC on MuJoCo locomotion tasks.

Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing
cs.LG · 2026-05 · unverdicted · novelty 6.0
Recasts sampling-based nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
cs.RO · 2024-11 · unverdicted · novelty 6.0
DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.

Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics
cs.LG · 2026-05 · unverdicted · novelty 5.0
TRM trains a small horizon-matched pairwise head on trajectory data to improve terminal-state ranking in latent MPC, raising success from 7% to 97% on TwoRoom and 32.7% to 84% on PLDM without changing the encoder or dynamics.