Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research

Alex Ray; Bob McGrew; Bowen Baker; Glenn Powell; Jonas Schneider; Josh Tobin; Maciek Chociej; Marcin Andrychowicz; Matthias Plappert; Peter Welinder

arXiv: 1802.09464 · v2 · pith:D4HKML3Ynew · submitted 2018-02-26 · 💻 cs.LG · cs.AI · cs.RO

Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, Wojciech Zaremba

keywords: multi-goal, tasks, challenging, learning, reinforcement, research, robotics, additional

Abstract

The purpose of this technical report is two-fold. First of all, it introduces a suite of challenging continuous control tasks (integrated with OpenAI Gym) based on currently existing robotics hardware. The tasks include pushing, sliding and pick & place with a Fetch robotic arm as well as in-hand object manipulation with a Shadow Dexterous Hand. All tasks have sparse binary rewards and follow a Multi-Goal Reinforcement Learning (RL) framework in which an agent is told what to do using an additional input. The second part of the paper presents a set of concrete research ideas for improving RL algorithms, most of which are related to Multi-Goal RL and Hindsight Experience Replay.
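The two ingredients the abstract names, a sparse binary reward and a goal supplied as an additional input, can be illustrated with a small framework-free sketch, together with the "future" goal-relabeling idea behind Hindsight Experience Replay that the research ideas build on. This is not the suite's code: the 0.05 success threshold, the function names `compute_reward` and `her_relabel`, and the `(step, goal, reward)` output layout are hypothetical choices made for illustration.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical success threshold (distance units); assumed for illustration.
DISTANCE_THRESHOLD = 0.05

Goal = Tuple[float, ...]


def compute_reward(achieved_goal: Sequence[float], desired_goal: Sequence[float],
                   threshold: float = DISTANCE_THRESHOLD) -> float:
    """Sparse binary reward: 0.0 when the achieved goal is within
    `threshold` of the desired goal, -1.0 otherwise."""
    return 0.0 if math.dist(achieved_goal, desired_goal) <= threshold else -1.0


def her_relabel(achieved_next: List[Goal], desired_goal: Goal, k: int = 4,
                rng: Optional[random.Random] = None) -> List[Tuple[int, Goal, float]]:
    """Hypothetical HER-style relabeling with the 'future' strategy.

    achieved_next[t] is the goal actually achieved after step t.
    For every step we keep the original goal and add k hindsight goals
    sampled from goals achieved at the same or a later step of the
    episode, recomputing the sparse reward for each substituted goal.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    horizon = len(achieved_next)
    samples: List[Tuple[int, Goal, float]] = []
    for t in range(horizon):
        # Original transition: usually a failure (-1.0) under sparse rewards.
        samples.append((t, desired_goal, compute_reward(achieved_next[t], desired_goal)))
        for _ in range(k):
            future = rng.randint(t, horizon - 1)  # inclusive on both ends
            goal = achieved_next[future]
            samples.append((t, goal, compute_reward(achieved_next[t], goal)))
    return samples


if __name__ == "__main__":
    # An object pushed along x that never reaches the desired goal at x = 1.0.
    achieved = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
    for step, goal, reward in her_relabel(achieved, (1.0, 0.0, 0.0), k=2):
        print(step, goal, reward)
```

In this toy episode every original transition earns -1.0, while relabeled transitions whose substituted goal equals the goal achieved at that step earn 0.0. That is the point of relabeling: under sparse rewards the replay buffer gains non-trivial learning signal even from failed episodes.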

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of multi-task RL (MTRL) to adaptively sample the tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
cs.LG 2026-05 unverdicted novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...
Solving Rubik's Cube with a Robot Hand
cs.LG 2019-10 accept novelty 7.0

Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making
cs.LG 2026-05 unverdicted novelty 6.0

Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.
Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Introduces RAPCs and a contraction Bellman operator for cost-optimal policies that satisfy probabilistic reach-avoid specifications in stochastic MDPs, with almost-sure convergence to local optima.
Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Introduces RAPCs and a contraction Bellman operator that jointly enforce probabilistic reach-avoid constraints while minimizing expected costs in stochastic RL, with almost-sure convergence to local optima.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
Learning Optimal Strategies for Temporal Tasks in Stochastic Games
cs.AI 2021-02 unverdicted novelty 6.0

Model-free RL learns optimal strategies in stochastic games for linear temporal logic (LTL) specifications by constructing a product with a deterministic parity automaton (DPA) and assigning rewards and discounts from its acceptance conditions.
Disentangled Skill Embeddings for Reinforcement Learning
cs.LG 2019-06 unverdicted novelty 6.0

Disentangled Skill Embeddings (DSE) is a variational inference framework for multi-task RL using shared parameters and task-specific latent embeddings for generalization to unseen conditions and as skills in hierarchical RL.
Trajectory-Level Data Augmentation for Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

Trajectory-based data augmentation exploits geometric relationships between rewards, values, and logging policies to enable effective offline RL from few suboptimal trajectories.
When a Robot is More Capable than a Human: Learning from Constrained Demonstrators
cs.RO 2025-10 unverdicted novelty 5.0

Robots outperform constrained human demonstrations by inferring state-only rewards from demos and using temporal interpolation to label and explore better trajectories, achieving 10x faster task completion on a real r...
D2 Actor Critic: Diffusion Actor Meets Distributional Critic
cs.LG 2025-10 unverdicted novelty 5.0

D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.
Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning
cs.LG 2025-06 unverdicted novelty 5.0

SSE improves long-horizon goal-conditioned RL by using failure and partial-success transitions to identify unreliable subgoals, streamline high-level planning, and outperform prior hierarchical methods on benchmarks.
Goal-Conditioned Decision Transformer for Multi-Goal Offline Reinforcement Learning
cs.RO 2024-10 unverdicted novelty 5.0

A Goal-Conditioned Decision Transformer is adapted for offline multi-goal RL and shown to outperform online baselines on a new Franka Emika Panda dataset.
Middle-mile logistics through the lens of goal-conditioned reinforcement learning
stat.ML 2026-05 unverdicted novelty 4.0

Middle-mile logistics is cast as a multi-object goal-conditioned MDP and solved by combining graph neural networks with model-free RL via extraction of small feature graphs.