Follow-the-Perturbed-Leader for Decoupled Bandits: Best-of-Both-Worlds and Practicality

Chaiwon Kim; Jongyeong Lee; Min-hwan Oh

read the original abstract

We study the decoupled multi-armed bandit problem, where the learner separately selects one arm for exploration and one, possibly different, arm for exploitation at each round. In this setting, the loss of the explored arm is observed but not incurred, whereas the loss of the exploited arm is incurred without being observed. We propose an efficient Follow-the-Perturbed-Leader (FTPL) policy that achieves Best-of-Both-Worlds (BOBW) guarantee with constant regret in the stochastic regime and optimal $O(\sqrt{KT})$ regret in the adversarial regime. A key feature of our method is that it completely avoids both the convex optimization required by prior BOBW policies and the resampling procedures typically used in FTPL bandit policies. This allows FTPL to fully realize its computational efficiency advantages, leading to substantial reductions in computational cost. We empirically confirm that our policy not only improves the runtime but also demonstrates superior regret performance in both regimes.

Follow-the-Perturbed-Leader for Decoupled Bandits: Best-of-Both-Worlds and Practicality

discussion (0)