Intrinsic motivation and automatic curricula via asymmetric self-play

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; and then Bob attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will "propose" the task by doing a sequence of actions and then Bob must undo or repeat them, respectively. Via an appropriate reward structure, Alice and Bob automatically generate a curriculum of exploration, enabling unsupervised training of the agent. When Bob is deployed on an RL task within the environment, this unsupervised training reduces the number of supervised episodes needed to learn, and in some cases converges to a higher reward.
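The abstract does not spell out the "appropriate reward structure", so the following is only a minimal sketch under stated assumptions: Bob is penalized in proportion to the time he takes, and Alice is rewarded only when Bob needs more time than she did. The function names, the `scale` parameter, and the one-dimensional toy environment are all hypothetical and chosen for illustration; they are not taken from this page.

```python
import random


def self_play_rewards(t_alice, t_bob, scale=0.01):
    """Return (alice_reward, bob_reward) for one self-play episode.

    Assumed reward shape (not stated in the abstract):
    Bob pays for every step he takes; Alice gains only when Bob
    takes longer than she did, which nudges her toward tasks that
    are slightly beyond Bob's current ability.
    """
    bob_reward = -scale * t_bob
    alice_reward = scale * max(0, t_bob - t_alice)
    return alice_reward, bob_reward


def alice_propose(start, n_steps, rng):
    """Alice takes n_steps random moves on an integer line.

    The state she ends in defines the task handed to Bob.
    """
    state = start
    for _ in range(n_steps):
        state += rng.choice((-1, 1))
    return state


def bob_undo(state, target, max_steps):
    """Bob tries to undo Alice's moves by walking back to target.

    Returns the number of steps used (at most max_steps).
    This hand-written greedy policy stands in for a learned one.
    """
    steps = 0
    while state != target and steps < max_steps:
        state += 1 if target > state else -1
        steps += 1
    return steps


def run_episode(n_alice_steps=6, max_bob_steps=20, seed=0):
    """Play one reversible-environment episode and return both rewards."""
    rng = random.Random(seed)
    start = 0
    end = alice_propose(start, n_alice_steps, rng)
    t_bob = bob_undo(end, start, max_bob_steps)
    return self_play_rewards(n_alice_steps, t_bob)
```

Because the stand-in Bob here always takes the shortest path back, he never needs more steps than Alice used, so Alice's reward is zero in this toy setting. With a learned, imperfect Bob, the same reward shape would pay Alice for proposing tasks he has not yet mastered, which is the curriculum effect the abstract describes.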

citation-role summary: background 1

citation-polarity summary

fields: cs.LG 3
years: 2026 1, 2019 2
roles: background 1
polarities: background 1

representative citing papers

Scaling Self-Play with Self-Guidance

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.

Growing Action Spaces

cs.LG · 2019-06-28 · unverdicted · novelty 5.0

A curriculum of growing action spaces combined with simultaneous off-policy value estimation accelerates learning in large multi-agent action spaces.

citing papers explorer

  • Dota 2 with Large Scale Deep Reinforcement Learning
    cs.LG · 2019-12-13 · accept · none · ref 36

    OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

  • Scaling Self-Play with Self-Guidance
    cs.LG · 2026-04-22 · unverdicted · none · ref 34

    SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.

  • Growing Action Spaces
    cs.LG · 2019-06-28 · unverdicted · none · ref 10 · internal anchor

    A curriculum of growing action spaces combined with simultaneous off-policy value estimation accelerates learning in large multi-agent action spaces.