
arXiv: 1812.03381 · v1 · submitted 2018-12-08 · 💻 cs.LG · cs.AI · cs.NE · stat.ML


Learning Montezuma's Revenge from a Single Demonstration

classification 💻 cs.LG cs.AI cs.NE stat.ML
keywords demonstration, learning, agent, game, method, montezuma, revenge, rewards

We propose a new method for learning from a single demonstration to solve hard exploration tasks like the Atari game Montezuma's Revenge. Instead of imitating human demonstrations, as proposed in other recent works, our approach is to maximize rewards directly. Our agent is trained using off-the-shelf reinforcement learning, but starts every episode by resetting to a state from a demonstration. By starting from such demonstration states, the agent requires much less exploration to learn a game compared to when it starts from the beginning of the game at every episode. We analyze reinforcement learning for tasks with sparse rewards in a simple toy environment, where we show that the run-time of standard RL methods scales exponentially in the number of states between rewards. Our method reduces this to quadratic scaling, opening up many tasks that were previously infeasible. We then apply our method to Montezuma's Revenge, for which we present a trained agent achieving a high-score of 74,500, better than any previously published result.
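The scaling claim in the abstract can be illustrated with a small sketch. It is not the paper's environment or training code: it assumes a hypothetical chain task in which any wrong action ends the episode, and it substitutes a trivial tabular learner (one that remembers failed actions) for the off-the-shelf RL algorithm. The names `learn_with_demo_resets`, `expected_episodes_from_start`, and `correct_actions` are made up for this illustration. The reset point starts next to the reward and moves backward one state at a time, so each stage only has to discover a single new action.

```python
import random


def learn_with_demo_resets(correct_actions, n_actions, rng):
    """Backward curriculum on a chain where one wrong action ends the episode.

    correct_actions[s] is the rewarded action in state s of a hypothetical toy
    task. Returns the learned policy and the total environment steps used.
    """
    n_states = len(correct_actions)
    policy = [None] * n_states
    total_steps = 0
    # Reset points move backward along the demonstration: last state first.
    for start in reversed(range(n_states)):
        untried = list(range(n_actions))
        rng.shuffle(untried)  # explore the unknown first action in random order
        for action in untried:
            total_steps += 1  # the exploratory step taken in state `start`
            if action == correct_actions[start]:
                policy[start] = action
                # Later states are already solved, so the rest of the episode
                # follows the learned policy all the way to the reward.
                total_steps += n_states - 1 - start
                break
    return policy, total_steps


def expected_episodes_from_start(n_states, n_actions):
    """Expected episodes before a uniformly random policy, always started at
    state 0, first picks the right action in every state of the chain."""
    return n_actions ** n_states


rng = random.Random(0)
n_states, n_actions = 20, 4
demo = [rng.randrange(n_actions) for _ in range(n_states)]
policy, steps = learn_with_demo_resets(demo, n_actions, rng)

# Worst case for the curriculum: at most (n_actions - 1) one-step failures per
# stage, plus one successful episode of length k for k = 1..n_states.
worst_case = n_states * (n_actions - 1) + n_states * (n_states + 1) // 2
```

Under these assumptions the curriculum needs at most `worst_case` steps, which grows quadratically in the chain length, while `expected_episodes_from_start` grows exponentially. A real agent trained with a stochastic RL algorithm would need many more episodes per stage, but the contrast in how the cost scales with the number of states between rewards is the point of the sketch.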



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Minimizing Worst-Case Weighted Latency for Multi-Robot Persistent Monitoring: Theory and RL-Based Solutions

    cs.RO 2026-05 unverdicted novelty 6.0

    Proposes tail-performance objectives and the TWLO-MDP reformulation solved via RL for multi-robot monitoring, with theoretical properties and experimental outperformance on a new benchmark.

  2. Hypothesis generation and updating in large language models

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.