Growing Action Spaces

Gabriel Synnaeve; Gregory Farquhar; Laura Gustafson; Nicolas Usunier; Shimon Whiteson; Zeming Lin

arxiv: 1906.12266 · v1 · pith:6TI76Q22new · submitted 2019-06-28 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-25 13:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords growing action spaces · off-policy reinforcement learning · curriculum learning · StarCraft micromanagement · action space restriction · value function transfer · multi-agent tasks · exploration efficiency

The pith

An agent can accelerate learning on large action space tasks by starting with restricted actions and expanding them using off-policy reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a curriculum of progressively growing action spaces lets reinforcement learning make efficient progress in environments where the full combinatorial action space makes random exploration too slow. The agent internally limits its actions at the start, then expands them while off-policy methods estimate value functions for several action space sizes at once and move data, estimates, and state representations forward to the complete task. A sympathetic reader would care because this keeps the environment unchanged yet still produces faster learning on demanding multi-agent problems such as StarCraft micromanagement. The approach is shown to work in simple control tasks and in the large-scale setting.

Core claim

Off-policy reinforcement learning can estimate optimal value functions for multiple action spaces simultaneously and efficiently transfer data, value estimates, and state representations from restricted action spaces to the full task, accelerating learning on large-scale StarCraft micromanagement tasks.
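A minimal tabular sketch of what simultaneous estimation could look like, assuming nested action sets in which level l permits actions 0 to k_l - 1. The function name `update_all_levels`, the table layout, and the hyperparameters are illustrative assumptions and not the paper's implementation, which may use a different bootstrap target and function approximation.

```python
import numpy as np

def update_all_levels(q_tables, level_sizes, transition, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update per action-space level from one transition.

    q_tables[l] has shape (n_states, n_actions). Level l may only use actions
    0 .. level_sizes[l] - 1, so the levels are nested (an assumption here).
    """
    s, a, r, s_next, done = transition
    for q, k in zip(q_tables, level_sizes):
        if a >= k:
            continue  # the action taken is not available at this level
        # Bootstrap only over the actions this level is allowed to use.
        bootstrap = 0.0 if done else gamma * q[s_next, :k].max()
        q[s, a] += alpha * (r + bootstrap - q[s, a])

n_states, n_actions = 3, 4
level_sizes = [2, 4]  # hypothetical: restricted level, then the full space
q_tables = [np.zeros((n_states, n_actions)) for _ in level_sizes]

# A transition using action 1 is valid at both levels, so both tables learn.
update_all_levels(q_tables, level_sizes, (0, 1, 1.0, 1, False))
# A transition using action 3 only updates the full-space table.
update_all_levels(q_tables, level_sizes, (0, 3, 2.0, 1, True))
```

Because the update is off-policy, data gathered while acting in the restricted space still contributes to the estimate for the full space, which is the transfer the core claim describes.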

What carries the argument

The internal curriculum of progressively growing action spaces, supported by off-policy value estimation that operates across different restriction levels at the same time.
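As a rough illustration of an internal curriculum, the sketch below maps an environment step count to an action-space level. The fixed-interval schedule, the function name `current_level`, and the parameter names are assumptions made for this example; the paper's actual growth rule may differ.

```python
def current_level(env_step, lead_in, growth, max_level):
    """Return the action-space level active at a given environment step.

    Assumed schedule: stay at level 0 for `lead_in` steps, then move up one
    level every `growth` steps until `max_level` (the full space) is reached.
    """
    if env_step < lead_in:
        return 0
    return min(max_level, 1 + (env_step - lead_in) // growth)

# Hypothetical settings: 25000-step lead-in, 25000 steps per growth stage.
levels = [current_level(t, 25000, 25000, 3) for t in (0, 25000, 50000, 1_000_000)]
```

The environment never sees this schedule; only the agent's own action selection depends on it.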

Load-bearing premise

Restricting the agent's actions internally leaves the environment dynamics and the optimal policy for the full action space unchanged, so that value estimates from smaller spaces stay useful after expansion.
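The premise can be written out with standard Bellman optimality equations. The notation below (nested action sets A_ℓ, shared transition kernel P and reward r, discount γ) is assumed for this illustration and is not quoted from the paper.

```latex
% Assumed setting: nested action sets over one fixed MDP.
\[
A_0 \subseteq A_1 \subseteq \dots \subseteq A_N = A
\]
% Optimal value at level \ell: same P and r, maximisation restricted to A_\ell.
\[
Q^*_\ell(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
  \left[ r(s,a,s') + \gamma \max_{a' \in A_\ell} Q^*_\ell(s',a') \right],
  \qquad a \in A_\ell
\]
% Any policy over A_\ell is also a policy over A_{\ell+1}, hence:
\[
V^*_\ell(s) = \max_{a \in A_\ell} Q^*_\ell(s,a) \;\le\; V^*_{\ell+1}(s)
\]
```

Only the set under the max changes between levels, so level N is the unrestricted task itself, and each restricted value is a lower bound on the value at the next level.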

What would settle it

Running the same StarCraft micromanagement tasks with and without the growing curriculum and finding equal or faster learning when training directly on the full action space from the beginning would show the transfer does not help.

Figures

Figures reproduced from arXiv: 1906.12266 by Gabriel Synnaeve, Gregory Farquhar, Laura Gustafson, Nicolas Usunier, Shimon Whiteson, Zeming Lin.

**Figure 2.** Architecture for GAS with hierarchical clustering. For clarity, only two levels of hierarchy …

**Figure 3.** StarCraft micromanagement with growing action spaces. We report the mean and standard …

**Figure 4.** Final learned policies of StarCraft micromanagement unit control with growing action …

Original abstract

In complex tasks, such as those with large combinatorial action spaces, random exploration may be too inefficient to achieve meaningful learning progress. In this work, we use a curriculum of progressively growing action spaces to accelerate learning. We assume the environment is out of our control, but that the agent may set an internal curriculum by initially restricting its action space. Our approach uses off-policy reinforcement learning to estimate optimal value functions for multiple action spaces simultaneously and efficiently transfers data, value estimates, and state representations from restricted action spaces to the full task. We show the efficacy of our approach in proof-of-concept control tasks and on challenging large-scale StarCraft micromanagement tasks with large, multi-agent action spaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable curriculum for large action spaces by growing them progressively and transferring value estimates off-policy, which helps on StarCraft micromanagement but stays incremental.

The central takeaway is that a progressive curriculum over action-space size can make exploration more efficient in complex RL tasks: the agent estimates values for several action spaces at once and carries knowledge forward as the space grows. The paper's strength is applying this to multi-agent StarCraft micromanagement, where the action space is very large, and demonstrating that the transfer speeds up learning. Because the method is off-policy, experience gathered in smaller action spaces can be reused, which avoids restarting from scratch at each expansion.

The method appears to rest on standard RL machinery, such as Q-learning or actor-critic with action masking, so I see no major contradictions. The citation pattern likely covers the relevant curriculum and large-action-space literature. The soft spots are minor: the paper could analyze when the transfer is most beneficial and compare against other approaches such as reward shaping, but judging from the abstract and description the core results do not seem overstated.

The audience is RL practitioners and researchers scaling to settings with many actions, such as game AI. A reader looking for implementable ideas in multi-agent RL will find value. The paper deserves a serious referee because the problem is relevant and the solution is tested on non-trivial tasks.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes using a curriculum of progressively growing action spaces in reinforcement learning tasks with large combinatorial action spaces. The agent internally restricts its action space at the outset while using off-policy RL to simultaneously estimate optimal value functions across multiple action spaces; data, value estimates, and state representations are transferred from restricted spaces to the full task. Empirical support is claimed on proof-of-concept control tasks and large-scale StarCraft micromanagement domains.

Significance. If the results hold, the approach offers a practical curriculum strategy for improving sample efficiency when action spaces are large, by reusing off-policy data and representations across action-space sizes. This is a direct, internally consistent extension of standard off-policy RL and could be relevant to multi-agent and combinatorial control settings.

minor comments (2)

[Abstract] Abstract: the description of simultaneous value estimation and transfer would be strengthened by naming the specific off-policy algorithm and the precise mechanism used to mask or grow the action space.
The manuscript should include an explicit statement (with a short derivation or pseudocode) confirming that the environment transition dynamics and the optimal policy for the full action space remain unchanged when the agent applies an internal action mask.
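One way the requested pseudocode might look is sketched below. It is only an illustration of an internal action mask, not the authors' code: the function name `masked_epsilon_greedy` and the example values are hypothetical. The mask appears only in the agent's action selection, so the environment's transition function is never modified.

```python
import numpy as np

def masked_epsilon_greedy(q_values, allowed, epsilon, rng):
    """Choose an action from `allowed` only.

    `q_values` covers the full action space; the mask restricts which entries
    are considered, both when exploring and when acting greedily.
    """
    allowed = np.asarray(allowed)
    if rng.random() < epsilon:
        return int(rng.choice(allowed))  # explore within the allowed set
    return int(allowed[np.argmax(q_values[allowed])])  # greedy within it

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.3, 0.9])  # hypothetical estimates

# With the mask on, the agent picks the best allowed action.
restricted = masked_epsilon_greedy(q_values, [0, 1], epsilon=0.0, rng=rng)
# With the full set allowed, the choice matches ordinary greedy selection.
full = masked_epsilon_greedy(q_values, [0, 1, 2, 3], epsilon=0.0, rng=rng)
# Exploration also stays inside the allowed set.
explored = masked_epsilon_greedy(q_values, [0, 1], epsilon=1.0, rng=rng)
```

Once the allowed set equals the full action space, the selection rule reduces to the unmasked one, which is the sense in which the full task is left unchanged.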

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work on growing action spaces via curriculum learning with off-policy value estimation. The recommendation for minor revision is appreciated, and we note that no specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper presents an algorithmic procedure for curriculum learning via growing action spaces in off-policy RL, with no mathematical derivation, closed-form prediction, or first-principles result that reduces to its inputs by construction. The central claim is that off-policy methods can simultaneously estimate values across action-space restrictions and transfer data/representations; this is an empirical and procedural claim, not a self-referential equation or fitted parameter renamed as a prediction. No load-bearing self-citation, uniqueness theorem, or ansatz is invoked in the provided abstract or description. The approach is self-contained as a standard application of existing off-policy RL techniques (e.g., replay reuse under action masks) without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the domain assumption that internal action restriction is feasible and that off-policy value estimation can be performed simultaneously across action spaces without interference. No free parameters or invented entities are introduced in the abstract.

axioms (2)

Domain assumption: Off-policy RL algorithms can maintain and update value estimates for multiple different action spaces from the same trajectory data.
Invoked when the paper states that value functions for restricted and full action spaces are estimated simultaneously.

Domain assumption: Restricting the agent's action space internally does not alter the underlying MDP dynamics or the optimal policy of the unrestricted task.
Required for the curriculum to be valid; implied by the abstract's premise that the environment is out of the agent's control and that the restriction is internal to the agent.

pith-pipeline@v0.9.0 · 5650 in / 1288 out tokens · 22427 ms · 2026-05-25T13:22:58.838575+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 14 internal anchors

[1]

Mix&Match - Agent Curricula for Reinforcement Learning

Wojciech Marian Czarnecki, Siddhant M Jayakumar, Max Jaderberg, Leonard Hasenclever, Yee Whye Teh, Simon Osindero, Nicolas Heess, and Razvan Pascanu. Mix&match-agent curricula for reinforcement learning. arXiv preprint arXiv:1806.01780,

[2]

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561,

[3]

Reverse Curriculum Generation for Reinforcement Learning

Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300,

[4]

Population Based Training of Neural Networks

Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846,

[5]

Asynchronous methods for deep reinforcement learning

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937,

[6]

Cassl: Curriculum accelerated self-supervised learning

Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. Cassl: Curriculum accelerated self-supervised learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6453–6460. IEEE,

[7]

QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485,

[8]

An Overview of Multi-Task Learning in Deep Neural Networks

Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098,

[9]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815,

[10]

Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play

Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407,

[11]

Value-Decomposition Networks For Cooperative Multi-Agent Learning

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296,

[12]

TorchCraft: a Library for Machine Learning Research on Real-Time Strategy Games

Gabriel Synnaeve, Nantas Nardelli, Alex Auvolat, Soumith Chintala, Timothée Lacroix, Zeming Lin, Florian Richoux, and Nicolas Usunier. Torchcraft: a library for machine learning research on real-time strategy games. arXiv preprint arXiv:1611.00625,

[13]

Episodic Exploration for Deep Deterministic Policies: An Application to StarCraft Micromanagement Tasks

Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala. Episodic exploration for deep deterministic policies: An application to starcraft micromanagement tasks. arXiv preprint arXiv:1609.02993,

[14]

StarCraft II: A New Challenge for Reinforcement Learning

Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. Starcraft ii: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782,

[15]

Dueling Network Architectures for Deep Reinforcement Learning

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581,

[16]

Learning to Execute

Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615,

[17]

attack-closest

8 Appendix, 8.1 Discretised continuous control. For our experiments in discretised continuous control, we use a standard DQN trainer [Mnih et al., 2015] with the following parameters.

Parameter: Value
batch size: 128
replay buffer size: 10000
target update interval: 200
ϵ initial: 1.0
ϵ final: 0.1
ϵ decay: 25000 env steps
ℓ lead-in: 25000 env steps
ℓ growth: 25000 ...
