pith. machine review for the scientific record.

arxiv: 2602.01453 · v3 · submitted 2026-02-01 · 💻 cs.LG


The Horizon Threshold in Cooperative Multi-Agent Reward-Free Exploration

Authors: no claims on Pith yet
Classification: 💻 cs.LG
Keywords: learning, agents, number, phases, algorithm, epsilon, when, cooperative

Abstract

We study cooperative multi-agent reinforcement learning in the setting of reward-free exploration, where multiple agents jointly explore an unknown MDP in order to learn its dynamics (without observing rewards). We focus on a tabular finite-horizon MDP and adopt a phased learning framework: in each learning phase, every agent is independently assigned a policy, executes it, and observes the resulting trajectory. Our primary goal is to characterize the tradeoff between the number of learning phases and the number of agents, especially when the number of phases is small. Our results identify a regime change governed by the horizon $H$. When the number of learning phases equals $H$, we present a computationally efficient algorithm that uses only $\tilde{O}(S^6 H^6 A / \epsilon^2)$ agents to obtain an $\epsilon$-approximation of the dynamics (i.e., one that yields an $\epsilon$-optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to $\rho < H$ phases requires at least $A^{H/\rho}$ agents to achieve constant accuracy. Thus, $\Theta(H)$ learning phases are both necessary and sufficient when the number of agents is restricted to be polynomial.
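The phased framework in the abstract can be sketched in code. The following is an illustrative scaffold, not the paper's algorithm: in each phase, every agent is assigned a policy, rolls it out for one episode of horizon H, and the resulting trajectories update an empirical model of the transition dynamics. All names (`run_phase`, `random_policy`, `empirical_dynamics`) and the uniform-random policy choice are assumptions made for the sketch.

```python
import numpy as np

def random_policy(S, A, H, rng):
    """A deterministic policy pi[h][s] -> action, drawn uniformly at random
    (a stand-in for whatever assignment rule the learner uses)."""
    return rng.integers(0, A, size=(H, S))

def run_phase(env_step, policies, H, counts, rng):
    """One learning phase: each agent independently executes its assigned
    policy for one episode and logs the observed transitions."""
    for pi in policies:
        s = 0  # fixed initial state, assumed for the sketch
        for h in range(H):
            a = pi[h][s]
            s_next = env_step(h, s, a, rng)   # unknown MDP, queried online
            counts[h, s, a, s_next] += 1      # accumulate transition counts
            s = s_next

def empirical_dynamics(counts):
    """Normalize counts into estimated kernels P_hat[h, s, a, :]."""
    totals = counts.sum(axis=-1, keepdims=True)
    return np.divide(counts, np.maximum(totals, 1))
```

With enough agents per phase, the visited (h, s, a) rows of `P_hat` concentrate around the true transition probabilities; the paper's results concern how many agents suffice (or are required) as a function of the number of phases.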

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Collaborating in Multi-Armed Bandits with Strategic Agents

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    CAOS sustains collaboration as a Nash equilibrium among persistent strategic agents in Bayesian multi-armed bandits via information sharing, with strong regret guarantees.