arxiv: 1906.12266 · v1 · pith:6TI76Q22new · submitted 2019-06-28 · 💻 cs.LG · cs.AI · stat.ML

Growing Action Spaces

Pith reviewed 2026-05-25 13:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords growing action spaces · off-policy reinforcement learning · curriculum learning · StarCraft micromanagement · action space restriction · value function transfer · multi-agent tasks · exploration efficiency

The pith

An agent can accelerate learning on large action space tasks by starting with restricted actions and expanding them using off-policy reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a curriculum of progressively growing action spaces lets reinforcement learning make efficient progress in environments where the full combinatorial action space makes random exploration too slow. The agent internally limits its actions at the start, then expands them while off-policy methods estimate value functions for several action space sizes at once and move data, estimates, and state representations forward to the complete task. A sympathetic reader would care because this keeps the environment unchanged yet still produces faster learning on demanding multi-agent problems such as StarCraft micromanagement. The approach is shown to work in simple control tasks and in the large-scale setting.

Core claim

Off-policy reinforcement learning can estimate optimal value functions for multiple action spaces simultaneously and efficiently transfer data, value estimates, and state representations from restricted action spaces to the full task, accelerating learning on large-scale StarCraft micromanagement tasks.
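
A minimal sketch of the idea of updating value estimates for several action spaces from one transition, assuming tabular Q-learning rather than the deep architecture the paper uses. The names `ACTION_LEVELS`, `make_q_tables`, and `update_all_levels`, and the values of `alpha` and `gamma`, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical nested action spaces: level 0 is the most restricted,
# the last level is the full action space.
ACTION_LEVELS: List[List[int]] = [[0, 1], [0, 1, 2, 3]]

QTable = Dict[Tuple[int, int], float]
Transition = Tuple[int, int, float, int, bool]  # (s, a, r, s_next, done)


def make_q_tables(n_states: int, levels: List[List[int]]) -> List[QTable]:
    """Create one zero-initialised Q-table per action-space level."""
    return [
        {(s, a): 0.0 for s in range(n_states) for a in actions}
        for actions in levels
    ]


def update_all_levels(
    q_tables: List[QTable],
    levels: List[List[int]],
    transition: Transition,
    alpha: float = 0.5,
    gamma: float = 0.9,
) -> None:
    """Apply one Q-learning update per level that contains the taken action."""
    s, a, r, s_next, done = transition
    for q, actions in zip(q_tables, levels):
        if a not in actions:
            # The action is outside this level's action space, so this
            # level has no Q(s, a) entry to update.
            continue
        # Bootstrap with a max restricted to this level's own actions.
        bootstrap = 0.0 if done else max(q[(s_next, b)] for b in actions)
        target = r + gamma * bootstrap
        q[(s, a)] += alpha * (target - q[(s, a)])
```

Because Q-learning is off-policy, the same stored transition can serve every level whose action space contains the action taken; only the set over which the bootstrap maximum is computed differs between levels.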

What carries the argument

The internal curriculum of progressively growing action spaces, supported by off-policy value estimation that operates across different restriction levels at the same time.

Load-bearing premise

Restricting the agent's actions internally leaves the environment dynamics and the optimal policy for the full action space unchanged, so that value estimates from smaller spaces stay useful after expansion.
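
The premise can be stated a little more formally. The notation below (nested action sets $A_\ell$, optimal values $Q^*_\ell$ and $V^*_\ell$, transition kernel $P$, reward $r$, discount $\gamma$) is assumed here for illustration and is not taken from the page above.

```latex
% Nested action sets on one fixed MDP (transition kernel P, reward r, discount \gamma).
A_0 \subseteq A_1 \subseteq \dots \subseteq A_N = A

% Optimal action-value function when the agent may only choose from A_\ell:
Q^*_\ell(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a' \in A_\ell} Q^*_\ell(s', a'),
\qquad a \in A_\ell

% Every policy over A_\ell is also a policy over A_{\ell+1}, so the optimal values are ordered:
V^*_\ell(s) = \max_{a \in A_\ell} Q^*_\ell(s, a), \qquad
V^*_0(s) \le V^*_1(s) \le \dots \le V^*_N(s) = V^*(s)
```

Neither $P$ nor $r$ depends on $\ell$: the restriction only changes the set over which the maximum is taken. Under these assumptions the value estimates from a restricted space are lower bounds on those of the full task, which is one way to read the claim that they stay useful after expansion.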

What would settle it

Running the same StarCraft micromanagement tasks with and without the growing curriculum and finding equal or faster learning when training directly on the full action space from the beginning would show the transfer does not help.

Figures

Figures reproduced from arXiv: 1906.12266 by Gabriel Synnaeve, Gregory Farquhar, Laura Gustafson, Nicolas Usunier, Shimon Whiteson, Zeming Lin.

Figure 1: Discretised continuous control with growing action spaces. We report the mean and ...
Figure 2: Architecture for GAS with hierarchical clustering. For clarity, only two levels of hierarchy ...
Figure 3: StarCraft micromanagement with growing action spaces. We report the mean and standard ...
Figure 4: Final learned policies of StarCraft micromanagement unit control with growing action ...

Original abstract

In complex tasks, such as those with large combinatorial action spaces, random exploration may be too inefficient to achieve meaningful learning progress. In this work, we use a curriculum of progressively growing action spaces to accelerate learning. We assume the environment is out of our control, but that the agent may set an internal curriculum by initially restricting its action space. Our approach uses off-policy reinforcement learning to estimate optimal value functions for multiple action spaces simultaneously and efficiently transfers data, value estimates, and state representations from restricted action spaces to the full task. We show the efficacy of our approach in proof-of-concept control tasks and on challenging large-scale StarCraft micromanagement tasks with large, multi-agent action spaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes using a curriculum of progressively growing action spaces in reinforcement learning tasks with large combinatorial action spaces. The agent internally restricts its action space at the outset while using off-policy RL to simultaneously estimate optimal value functions across multiple action spaces; data, value estimates, and state representations are transferred from restricted spaces to the full task. Empirical support is claimed on proof-of-concept control tasks and large-scale StarCraft micromanagement domains.

Significance. If the results hold, the approach offers a practical curriculum strategy for improving sample efficiency when action spaces are large, by reusing off-policy data and representations across action-space sizes. This is a direct, internally consistent extension of standard off-policy RL and could be relevant to multi-agent and combinatorial control settings.

minor comments (2)
  1. [Abstract] The description of simultaneous value estimation and transfer would be strengthened by naming the specific off-policy algorithm and the precise mechanism used to mask or grow the action space.
  2. The manuscript should include an explicit statement (with a short derivation or pseudocode) confirming that the environment transition dynamics and the optimal policy for the full action space remain unchanged when the agent applies an internal action mask.
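
The kind of pseudocode the second comment asks for might look like the following sketch of an internal action mask applied at action-selection time. It assumes epsilon-greedy selection over a list of Q-values; the function name `masked_epsilon_greedy` and its parameters are hypothetical, not the authors' implementation.

```python
import random
from typing import Sequence


def masked_epsilon_greedy(
    q_values: Sequence[float],
    allowed: Sequence[int],
    epsilon: float,
    rng: random.Random,
) -> int:
    """Pick an action index using only the currently allowed actions.

    The mask lives entirely inside the agent: the environment still
    accepts every action, the agent simply never proposes a masked one.
    """
    if not allowed:
        raise ValueError("the allowed action set must not be empty")
    if rng.random() < epsilon:
        # Explore uniformly, but only within the restricted action space.
        return rng.choice(list(allowed))
    # Exploit: greedy action among the allowed indices.
    return max(allowed, key=lambda a: q_values[a])
```

Growing the action space then amounts to passing a larger `allowed` set; nothing on the environment side changes, which is the property the referee wants stated explicitly.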

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work on growing action spaces via curriculum learning with off-policy value estimation. The recommendation for minor revision is appreciated, and we note that no specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity


The paper presents an algorithmic procedure for curriculum learning via growing action spaces in off-policy RL, with no mathematical derivation, closed-form prediction, or first-principles result that reduces to its inputs by construction. The central claim is that off-policy methods can simultaneously estimate values across action-space restrictions and transfer data/representations; this is an empirical and procedural claim, not a self-referential equation or fitted parameter renamed as a prediction. No load-bearing self-citation, uniqueness theorem, or ansatz is invoked in the provided abstract or description. The approach is self-contained as a standard application of existing off-policy RL techniques (e.g., replay reuse under action masks) without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the domain assumption that internal action restriction is feasible and that off-policy value estimation can be performed simultaneously across action spaces without interference. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: Off-policy RL algorithms can maintain and update value estimates for multiple different action spaces from the same trajectory data.
    Invoked when the paper states that value functions for restricted and full action spaces are estimated simultaneously.
  • domain assumption: Restricting the agent's action space internally does not alter the underlying MDP dynamics or the optimal policy of the unrestricted task.
    Required for the curriculum to be valid; stated as an assumption in the abstract.



Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 14 internal anchors

  1. [1]

    Mix&Match - Agent Curricula for Reinforcement Learning

    Wojciech Marian Czarnecki, Siddhant M Jayakumar, Max Jaderberg, Leonard Hasenclever, Yee Whye Teh, Simon Osindero, Nicolas Heess, and Razvan Pascanu. Mix&match-agent curricula for reinforcement learning. arXiv preprint arXiv:1806.01780,

  2. [2]

    IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

    Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561,

  3. [3]

    Reverse Curriculum Generation for Reinforcement Learning

    Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300,

  4. [4]

    Population Based Training of Neural Networks

    Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846,

  5. [5]

    Asynchronous methods for deep reinforcement learning

    Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937,

  6. [6]

    Cassl: Curriculum accelerated self-supervised learning

    Adithyavairavan Murali, Lerrel Pinto, Dhiraj Gandhi, and Abhinav Gupta. Cassl: Curriculum accelerated self-supervised learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6453–6460. IEEE,

  7. [7]

    QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

    Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485,

  8. [8]

    An Overview of Multi-Task Learning in Deep Neural Networks

    Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098,

  9. [9]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815,

  10. [10]

    Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play

    Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407,

  11. [11]

    Value-Decomposition Networks For Cooperative Multi-Agent Learning

    Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296,

  12. [12]

    TorchCraft: a Library for Machine Learning Research on Real-Time Strategy Games

    Gabriel Synnaeve, Nantas Nardelli, Alex Auvolat, Soumith Chintala, Timothée Lacroix, Zeming Lin, Florian Richoux, and Nicolas Usunier. Torchcraft: a library for machine learning research on real-time strategy games. arXiv preprint arXiv:1611.00625,

  13. [13]

    Episodic Exploration for Deep Deterministic Policies: An Application to StarCraft Micromanagement Tasks

    Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala. Episodic exploration for deep deterministic policies: An application to starcraft micromanagement tasks. arXiv preprint arXiv:1609.02993,

  14. [14]

    StarCraft II: A New Challenge for Reinforcement Learning

    Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. Starcraft ii: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782,

  15. [15]

    Dueling Network Architectures for Deep Reinforcement Learning

    Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581,

  16. [16]

    Learning to Execute

    Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615,

  17. [17]

    attack-closest

    8 Appendix. 8.1 Discretised continuous control. For our experiments in discretised continuous control, we use a standard DQN trainer [Mnih et al., 2015] with the following parameters.

    Parameter: Value
    batch size: 128
    replay buffer size: 10000
    target update interval: 200
    ϵ initial: 1.0
    ϵ final: 0.1
    ϵ decay: 25000 env steps
    ℓ lead-in: 25000 env steps
    ℓ growth: 25000 ...
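
    The ϵ values in the appendix excerpt above (initial 1.0, final 0.1, decay over 25000 environment steps) can be turned into a schedule as sketched below. The excerpt does not state the shape of the decay, so the linear interpolation here is an assumption, and the function name `linear_epsilon` is hypothetical.

```python
def linear_epsilon(
    step: int,
    eps_initial: float = 1.0,
    eps_final: float = 0.1,
    decay_steps: int = 25000,
) -> float:
    """Exploration rate after `step` environment steps.

    Assumes a linear decay from eps_initial to eps_final over decay_steps,
    then a constant eps_final. The linear shape is an assumption; only the
    three default values come from the appendix excerpt.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    fraction = min(step / decay_steps, 1.0)
    return eps_initial + fraction * (eps_final - eps_initial)
```

    Halfway through the decay window (step 12500) this gives an exploration rate of roughly 0.55, and any step at or beyond 25000 gives roughly 0.1.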