arxiv: 1810.12894 · v1 · pith:T2T3YSUBnew · submitted 2018-10-30 · 💻 cs.LG · cs.AI · stat.ML

Exploration by Random Network Distillation

classification 💻 cs.LG cs.AI stat.ML
keywords network, bonus, exploration, game, deep, distillation, first, introduce

We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network. We also introduce a method to flexibly combine intrinsic and extrinsic rewards. We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular, we establish state-of-the-art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better-than-average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level.
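
The bonus described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a single-layer random target network and a linear predictor in NumPy, where the paper refers to neural networks, and the names `RNDBonus`, `bonus`, and `update` are hypothetical. It only shows the core idea that the prediction error shrinks for observations the predictor has been trained on and stays larger for ones it has not seen.

```python
import numpy as np


class RNDBonus:
    """Toy random-network-distillation bonus (illustrative assumption, not the paper's code)."""

    def __init__(self, obs_dim, feat_dim, lr=0.25, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed, randomly initialized target network: never trained.
        self.w_target = rng.normal(size=(obs_dim, feat_dim))
        # Predictor: trained to reproduce the target's features.
        self.w_pred = np.zeros((obs_dim, feat_dim))
        self.lr = lr

    def _error(self, obs):
        # Predictor output minus the target network's features.
        return obs @ self.w_pred - np.tanh(obs @ self.w_target)

    def bonus(self, obs):
        # Exploration bonus = squared prediction error on this observation.
        return float(np.sum(self._error(obs) ** 2))

    def update(self, obs):
        # One gradient step on the squared error with respect to w_pred.
        self.w_pred -= self.lr * 2.0 * np.outer(obs, self._error(obs))


rng = np.random.default_rng(1)
# Unit-norm observations keep the fixed step size stable in this toy setup.
familiar = rng.normal(size=8)
familiar /= np.linalg.norm(familiar)
novel = rng.normal(size=8)
novel /= np.linalg.norm(novel)

rnd = RNDBonus(obs_dim=8, feat_dim=4)
bonus_before = rnd.bonus(familiar)
for _ in range(50):
    rnd.update(familiar)  # the agent keeps visiting the same observation
bonus_after = rnd.bonus(familiar)
bonus_novel = rnd.bonus(novel)

print(f"familiar before training: {bonus_before:.4f}")
print(f"familiar after training:  {bonus_after:.2e}")
print(f"novel observation:        {bonus_novel:.4f}")
```

In this sketch the bonus for the repeatedly visited observation decays toward zero, while the unvisited observation keeps a larger bonus. How that bonus is combined with the extrinsic reward is not specified in the abstract, so it is left out here.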

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    cs.CL 2023-09 unverdicted novelty 8.0

    Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

  2. Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Alice uses preservation conflicts from failed candidate updates to create class-stratified hypotheses and guide exploration, improving executable world-model learning under prior misalignment.

  3. Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning

    cs.MA 2026-05 unverdicted novelty 7.0

    A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.

  4. Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    TeLAPA maintains archives of behaviorally diverse yet competent policies aligned in a shared latent space to preserve plasticity and enable faster recovery after interference in continual reinforcement learning.

  5. Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring

    cs.LG 2025-09 unverdicted novelty 7.0

    LPM uses a dual-network design to compute intrinsic rewards from the change in prediction error across iterations, providing a noise-robust signal that is theoretically linked to information gain.

  6. An Information-Geometric Approach to Artificial Curiosity

    cs.LG 2025-04 unverdicted novelty 7.0

    Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-bas...

  7. Dota 2 with Large Scale Deep Reinforcement Learning

    cs.LG 2019-12 accept novelty 7.0

    OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

  8. Solving Rubik's Cube with a Robot Hand

    cs.LG 2019-10 accept novelty 7.0

    Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.

  9. Goal-Conditioned Agents that Learn Everything All at Once

    cs.LG 2026-05 unverdicted novelty 6.0

    LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.

  10. Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficien...

  11. Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math ...

  12. Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration

    cs.RO 2026-05 unverdicted novelty 6.0

    QOED selects identifiable parameter directions via Fisher matrix eigenspace analysis and modifies exploration objectives to approximate ideal information gain under bounded nuisance assumptions, yielding 21-35% perfor...

  13. Shaping Zero-Shot Coordination via State Blocking

    cs.LG 2026-05 unverdicted novelty 6.0

    SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.

  14. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  15. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  16. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  17. Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs

    cs.LG 2026-05 unverdicted novelty 6.0

    An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.

  18. Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields

    cs.AI 2026-04 unverdicted novelty 6.0

    Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.

  19. Improving Zero-Shot Offline RL via Behavioral Task Sampling

    cs.AI 2026-04 unverdicted novelty 6.0

    Extracting task vectors from the offline dataset for policy training improves zero-shot offline RL performance by an average of 20% over random sampling baselines.

  20. Dual-Timescale Memory in a Spiking Neuron-Astrocyte Network for Efficient Navigation

    q-bio.QM 2026-04 unverdicted novelty 6.0

    A neuron-astrocyte network with dual-timescale memory reduces median path lengths up to sixfold in partially observable grid-world navigation tasks.

  21. Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms

    cs.RO 2026-04 unverdicted novelty 6.0

    A PPO-trained transformer policy sparsifies dynamic graphs during RRT frontier exploration, cutting size by up to 96% and yielding the most consistent exploration rates across environments.

  22. SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning

    cs.RO 2025-06 unverdicted novelty 6.0

    SENIOR improves feedback efficiency and policy learning speed in PbRL by combining motion-distinction query selection via kernel density estimation with preference-guided intrinsic rewards, showing gains on simulated ...

  23. Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

    cs.AI 2024-08 conditional novelty 6.0

    Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

  24. RoboNet: Large-Scale Multi-Robot Learning

    cs.RO 2019-10 conditional novelty 6.0

    RoboNet is a multi-robot video dataset that enables pre-training of vision-based manipulation models which, after fine-tuning on a new robot, outperform robot-specific training that uses 4-20 times more data.

  25. Benchmarking Batch Deep Reinforcement Learning Algorithms

    cs.LG 2019-10 unverdicted novelty 6.0

    Many batch RL algorithms underperform both online DQN and the behavioral policy on Atari; an adapted discrete-action BCQ outperforms the others tested.

  26. When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited

    cs.LG 2026-05 unverdicted novelty 5.0

    Robust minimax task inference in BFMs achieves dynamics-shift robustness from nominal offline data alone and outperforms standard baselines.

  27. OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence

    cs.RO 2026-05 unverdicted novelty 5.0

    OrbiSim builds a differentiable physics engine from world models to support gradient-based policy optimization and contact modeling in robotics.

  28. Test-Time Alignment via Hypothesis Reweighting

    cs.LG 2024-12 unverdicted novelty 5.0

    HyRe personalizes reward models at test time by reweighting an ensemble of heads trained on aggregate preferences, using few target examples to outperform uniform averaging and prior methods on RewardBench and 32 tasks.