The Horizon Threshold in Cooperative Multi-Agent Reward-Free Exploration
We study cooperative multi-agent reinforcement learning in the reward-free exploration setting, where multiple agents jointly explore an unknown MDP in order to learn its dynamics (without observing rewards). We focus on tabular finite-horizon MDPs and adopt a phased learning framework: in each learning phase, each agent is independently assigned a policy, executes it, and observes the resulting trajectory. Our primary goal is to characterize the tradeoff between the number of learning phases and the number of agents, especially when the number of phases is small. Our results identify a regime change governed by the horizon $H$. When the number of learning phases equals $H$, we present a computationally efficient algorithm that uses only $\tilde{O}(S^6 H^6 A / \epsilon^2)$ agents to obtain an $\epsilon$-approximation of the dynamics (i.e., one that yields an $\epsilon$-optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to $\rho < H$ phases requires at least $A^{H/\rho}$ agents to achieve constant accuracy. Thus, $\Theta(H)$ learning phases are both necessary and sufficient when the number of agents is restricted to be polynomial.
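The phased framework described in the abstract can be illustrated with a minimal sketch: in a phase, each agent independently rolls out an assigned policy in a tabular finite-horizon MDP, and the pooled trajectories are used to form empirical transition frequencies. This is only an illustration of the interaction protocol, not the paper's algorithm; the toy MDP, the `step_fn` interface, and the uniform-random policies are all assumptions for the example.

```python
import random

def run_phase(step_fn, policies, H):
    """One learning phase: each agent independently executes its assigned
    policy for H steps (all episodes start from state 0 in this toy setup)
    and returns its observed trajectory."""
    trajectories = []
    for pi in policies:
        s, traj = 0, []
        for h in range(H):
            a = pi(h, s)
            s2 = step_fn(h, s, a)  # environment samples the next state
            traj.append((h, s, a, s2))
            s = s2
        trajectories.append(traj)
    return trajectories

def estimate_model(trajectories, S, A, H):
    """Empirical transition frequencies P_hat(s' | h, s, a) from pooled data."""
    counts = [[[[0] * S for _ in range(A)] for _ in range(S)] for _ in range(H)]
    for traj in trajectories:
        for h, s, a, s2 in traj:
            counts[h][s][a][s2] += 1
    p_hat = {}
    for h in range(H):
        for s in range(S):
            for a in range(A):
                n = sum(counts[h][s][a])
                if n:
                    p_hat[(h, s, a)] = [c / n for c in counts[h][s][a]]
    return p_hat

# Toy 2-state, 2-action MDP: action 0 stays put, action 1 flips w.p. 0.8.
random.seed(0)
def step_fn(h, s, a):
    if a == 0:
        return s
    return 1 - s if random.random() < 0.8 else s

H, S, A, n_agents = 3, 2, 2, 2000
uniform = lambda h, s: random.randrange(A)  # hypothetical exploration policy
trajs = run_phase(step_fn, [uniform] * n_agents, H)
p_hat = estimate_model(trajs, S, A, H)
```

With many agents the pooled counts concentrate, so `p_hat[(0, 0, 1)][1]` lands close to the true flip probability 0.8; the paper's contribution is achieving such accuracy for all reachable `(h, s, a)` with only $H$ phases and polynomially many agents.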
Forward citations
Cited by 1 Pith paper
- Collaborating in Multi-Armed Bandits with Strategic Agents: CAOS sustains collaboration as a Nash equilibrium among persistent strategic agents in Bayesian multi-armed bandits via information sharing, with strong regret guarantees.