Exploration by Random Network Distillation
12 Pith papers cite this work. Polarity classification is still indexing.
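The hub paper's mechanism is compact enough to sketch. Below is a minimal NumPy illustration, with linear networks standing in for the paper's convolutional ones: a frozen, randomly initialized target network is distilled into a trained predictor, and the predictor's error on an observation serves as the intrinsic exploration bonus, high for novel states and decaying with repeated visits.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim, lr = 8, 16, 0.05

# Fixed, randomly initialized target network f(s): never trained.
W_target = rng.normal(size=(obs_dim, feat_dim))
# Predictor network f_hat(s): trained to match the target's outputs.
W_pred = np.zeros((obs_dim, feat_dim))

def intrinsic_reward(obs):
    """Exploration bonus = predictor's error against the frozen target.
    Rarely seen observations are poorly predicted, so the bonus is high;
    repeated visits drive the error (and the bonus) toward zero."""
    return float(np.mean((obs @ W_pred - obs @ W_target) ** 2))

def update_predictor(obs):
    """One SGD step on the distillation loss ||f_hat(s) - f(s)||^2."""
    global W_pred
    err = obs @ W_pred - obs @ W_target        # (feat_dim,)
    W_pred -= lr * (2.0 / feat_dim) * np.outer(obs, err)

obs = rng.normal(size=obs_dim)
print("first visit bonus:", intrinsic_reward(obs))
for _ in range(200):                           # revisit the same state
    update_predictor(obs)
print("after 200 visits:", intrinsic_reward(obs))
```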
citing papers explorer
-
Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning
A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.
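The exact schedule isn't given in the summary above; the sketch below assumes one plausible reading: a sigmoid gate over normalized episodic return that anneals a per-agent exploration budget from eps_max toward eps_min as the team improves. All names and constants are illustrative.

```python
import numpy as np

def exploration_budget(ep_return, r_low, r_high,
                       eps_min=0.05, eps_max=0.5, k=8.0):
    """Illustrative return-conditioned sigmoid schedule: while recent
    returns are low the budget stays near eps_max; as returns approach
    r_high the sigmoid gate anneals it toward eps_min."""
    progress = (ep_return - r_low) / max(r_high - r_low, 1e-8)
    gate = 1.0 / (1.0 + np.exp(-k * (progress - 0.5)))   # sigmoid in (0, 1)
    return eps_max - gate * (eps_max - eps_min)

# Budget shrinks smoothly as the team's episodic return improves.
for R in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(R, round(exploration_budget(R, r_low=0.0, r_high=1.0), 3))
```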
-
Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning
TeLAPA maintains archives of behaviorally diverse yet competent policies aligned in a shared latent space to preserve plasticity and enable faster recovery after interference in continual reinforcement learning.
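A hypothetical sketch of the archive idea: a policy is stored only if it clears a competence bar and sits far from existing entries in a shared latent space. The thresholds and the latent embedding are assumptions, not TeLAPA's actual criteria.

```python
import numpy as np

class PolicyArchive:
    """Illustrative archive: accept a policy only if it is competent
    (return above min_return) and behaviorally novel (its latent code is
    at least min_dist from every archived entry), so earlier skills can
    be recovered after interference rather than retrained from scratch."""
    def __init__(self, min_return=0.8, min_dist=0.5):
        self.entries = []                      # (latent, policy, return)
        self.min_return, self.min_dist = min_return, min_dist

    def maybe_add(self, latent, policy, ep_return):
        if ep_return < self.min_return:
            return False                       # not competent enough
        if any(np.linalg.norm(latent - z) < self.min_dist
               for z, _, _ in self.entries):
            return False                       # too close to an archived policy
        self.entries.append((latent, policy, ep_return))
        return True

archive, rng = PolicyArchive(), np.random.default_rng(0)
for _ in range(20):
    archive.maybe_add(rng.normal(size=2), policy=None,
                      ep_return=float(rng.uniform(0.5, 1.0)))
print(len(archive.entries), "archived policies")
```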
-
Dota 2 with Large Scale Deep Reinforcement Learning
OpenAI Five achieved superhuman performance in Dota 2, defeating the world champions through large-scale self-play reinforcement learning.
-
Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration
QOED selects identifiable parameter directions via eigenspace analysis of the Fisher information matrix and modifies exploration objectives to approximate ideal information gain under bounded nuisance assumptions, yielding 21-35% performance gains in robotic tasks.
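The eigenspace-selection step is generic enough to sketch: estimate the Fisher information matrix from per-sample score vectors, eigendecompose it, and keep only directions with non-negligible eigenvalues as identifiable, treating the rest as nuisance. The threshold and toy data are illustrative, not QOED's settings.

```python
import numpy as np

def identifiable_directions(score_samples, rel_threshold=1e-2):
    """Estimate the Fisher information matrix as the empirical second
    moment of per-sample scores (gradients of the log-likelihood w.r.t.
    parameters), then keep eigenvectors whose eigenvalues are
    non-negligible: directions the data can actually identify.
    Everything below the threshold is treated as nuisance."""
    F = score_samples.T @ score_samples / len(score_samples)
    eigvals, eigvecs = np.linalg.eigh(F)
    keep = eigvals > rel_threshold * eigvals.max()
    return eigvecs[:, keep], eigvals[keep]

rng = np.random.default_rng(0)
# Toy scores: 3 parameters, but the data constrains only 2 directions.
scores = rng.normal(size=(500, 3)) @ np.diag([1.0, 0.5, 1e-4])
V, lam = identifiable_directions(scores)
print(V.shape, lam)        # 2 identifiable directions survive
```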
-
Shaping Zero-Shot Coordination via State Blocking
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance, including with human partners.
-
Learning to Theorize the World from Observation
NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-Mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
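A toy sketch of the conditioning swap, with the backbone omitted: in a decision-transformer-style token layout, the dataset return-to-go scalar is replaced by a learned state-conditioned Q estimate. The token layout and the stand-in critic are assumptions.

```python
import numpy as np

def build_tokens(states, actions, goal, q_estimator):
    """Decision-transformer-style layout, with the dataset return-to-go
    replaced by a state-conditioned Q estimate Q(s_t, goal): each
    timestep contributes (conditioning, state, action) tokens."""
    tokens = []
    for s, a in zip(states, actions):
        q = q_estimator(s, goal)      # learned critic, not a dataset RTG
        tokens.extend([("q", q), ("state", s), ("action", a)])
    return tokens

# Stand-in critic: negative distance to goal (assumed, purely illustrative).
q_est = lambda s, g: -float(np.linalg.norm(np.asarray(s) - np.asarray(g)))
seq = build_tokens([[0, 0], [1, 0]], [0, 1], goal=[2, 0], q_estimator=q_est)
print(seq[:3])                        # tokens for the first timestep
```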
-
Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs
An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using only a policy-evaluation oracle.
-
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields
Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.
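The teacher signal can be sketched with a toy particle filter: for a candidate measurement location, simulate the reading under each source hypothesis and score the expected drop in belief entropy. The 1-D exponential field model and the distillation step itself are assumptions; only the expected-information-gain computation is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(w):
    w = w / w.sum()
    return -float(np.sum(w * np.log(w + 1e-12)))

# Belief over a 1-D source position, as weighted particles.
particles = rng.uniform(0, 10, size=200)
weights = np.ones(200) / 200

def signal(src, x):
    return np.exp(-np.abs(x - src))   # assumed field model

def likelihood(src, x, y, sigma=0.5):
    return np.exp(-0.5 * ((y - signal(src, x)) / sigma) ** 2)

def expected_info_gain(x):
    """Teacher signal: expected entropy drop of the belief if we measure
    at x, averaged over source hypotheses. Distill-Belief's student is
    trained to reproduce behavior driven by signals like this one."""
    h0, gain = entropy(weights), 0.0
    for src, w in zip(particles, weights):
        y = signal(src, x)                          # simulated reading
        gain += w * (h0 - entropy(weights * likelihood(particles, x, y)))
    return gain

best_x = max(np.linspace(0, 10, 21), key=expected_info_gain)
print("most informative measurement location:", best_x)
```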
-
Improving Zero-Shot Offline RL via Behavioral Task Sampling
Extracting task vectors from the offline dataset for policy training improves zero-shot offline RL performance by an average of 20% over random sampling baselines.
-
Dual-Timescale Memory in a Spiking Neuron-Astrocyte Network for Efficient Navigation
A neuron-astrocyte network with dual-timescale memory reduces median path lengths by up to a factor of six in partially observable grid-world navigation tasks.
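A hypothetical reading of the dual-timescale mechanism: a fast, neuron-like trace of recent observations alongside a slow, astrocyte-like trace that integrates over many steps, concatenated as policy input. The decay constants and feature encoding are illustrative.

```python
import numpy as np

class DualTimescaleMemory:
    """Fast, neuron-like trace of the last few observations plus a slow,
    astrocyte-like trace that integrates over many steps; concatenated,
    they tell the policy both where it just was and which regions it has
    already covered under partial observability."""
    def __init__(self, dim, fast_decay=0.5, slow_decay=0.99):
        self.fast, self.slow = np.zeros(dim), np.zeros(dim)
        self.fd, self.sd = fast_decay, slow_decay

    def update(self, obs):
        self.fast = self.fd * self.fast + (1 - self.fd) * obs
        self.slow = self.sd * self.slow + (1 - self.sd) * obs
        return np.concatenate([self.fast, self.slow])   # policy input

mem, rng = DualTimescaleMemory(dim=4), np.random.default_rng(0)
for _ in range(10):
    features = mem.update(rng.normal(size=4))
print(features)
```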
-
Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms
A PPO-trained transformer policy sparsifies dynamic graphs during RRT frontier exploration, cutting graph size by up to 96% while yielding the most consistent exploration rates across environments.
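A minimal sketch of the sparsification step, with the learned policy replaced by a stand-in scoring function: each edge of the exploration graph gets a score and only the top fraction is kept. keep_frac=0.04 mirrors the reported up-to-96% reduction; the edge features and scorer are assumptions.

```python
import numpy as np

def sparsify(edges, edge_feats, score_fn, keep_frac=0.04):
    """Score every edge of the exploration graph and keep only the
    highest-scoring fraction. In the paper the scores come from a
    PPO-trained transformer; score_fn here is a stand-in."""
    scores = np.array([score_fn(f) for f in edge_feats])
    k = max(1, int(keep_frac * len(edges)))
    keep = np.argsort(scores)[-k:]            # indices of top-k edges
    return [edges[i] for i in keep]

rng = np.random.default_rng(0)
edges = [(i, i + 1) for i in range(100)]
feats = rng.normal(size=(100, 3))
pruned = sparsify(edges, feats, score_fn=lambda f: float(f.sum()))
print(len(edges), "->", len(pruned))
```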