Abstract

The multi-armed bandit problem is a well-established model in online decision making: a single player makes sequential decisions in a non-stationary environment to maximize their cumulative reward.
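
To make the setting concrete, here is a minimal sketch of a single player in a non-stationary bandit environment. The ε-greedy player, the Bernoulli arms whose means drift over time, and all the parameters below are illustrative assumptions, not details from the talk.

```python
import random

def single_player_bandit(num_arms=5, horizon=1000, eps=0.1):
    # Illustrative non-stationary environment: each arm's mean drifts.
    means = [random.random() for _ in range(num_arms)]
    counts = [0] * num_arms
    estimates = [0.0] * num_arms
    total_reward = 0.0
    for t in range(horizon):
        # Environment drift: each arm's mean takes a small random walk in [0, 1].
        means = [min(1.0, max(0.0, m + random.gauss(0, 0.01))) for m in means]
        # epsilon-greedy: explore with probability eps, otherwise exploit.
        if random.random() < eps:
            arm = random.randrange(num_arms)
        else:
            arm = max(range(num_arms), key=lambda a: estimates[a])
        reward = 1.0 if random.random() < means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return total_reward
```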

The problem becomes significantly more challenging when multiple players act in the same environment and each arm offers only one piece of reward per round. In this setting, if two players pick the same arm in the same round, they receive one piece of reward between them instead of two. To maximize reward, the players must collaborate to avoid "collisions" -- i.e. they need to make sure that they do not all rush to the same arm (even if it has the highest reward).
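
The collision rule can be sketched as a simple round simulator: when several players pull the same arm, only one unit of reward is available for that arm. The tie-breaking rule below (the reward goes to one colliding player at random) is an assumption for illustration; other conventions appear in the literature.

```python
import random

def play_round(means, choices):
    """means: per-arm Bernoulli means; choices: the arm picked by each player.
    Returns the reward each player receives this round."""
    rewards = [0.0] * len(choices)
    by_arm = {}
    for player, arm in enumerate(choices):
        by_arm.setdefault(arm, []).append(player)
    for arm, players in by_arm.items():
        reward = 1.0 if random.random() < means[arm] else 0.0
        # Collision: one piece of reward per arm, not one per player
        # (assumed here to go to a random colliding player).
        winner = random.choice(players)
        rewards[winner] = reward
    return rewards
```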

We consider the even more challenging setting where communication between players is completely disabled: e.g. they are separated in different parts of the world without any "Zoom". We show that nearly optimal regret can still be obtained in this setting: players can in fact collaborate in a non-stationary environment without any communication (and, of course, without quantum entanglement either :))
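
To see why collision-free play is even possible without communication, here is one classical trick from the multi-player bandit literature, shown purely as an illustration (the talk's actual algorithm may differ): once every player has learned a ranking of the arms, player i rotates through the top arms with offset i, so no two players ever collide while all of them share the best arms equally.

```python
def arm_for_player(player_id, t, ranked_arms, num_players):
    """Round-robin over the top-`num_players` arms, staggered by player id,
    so distinct players never pick the same arm in the same round."""
    top = ranked_arms[:num_players]
    return top[(player_id + t) % num_players]

# Example: 3 players, arms ranked [2, 0, 4, 1, 3] by estimated mean.
# At t=0 the players occupy arms 2, 0, 4; at t=1 they rotate to 0, 4, 2; etc.
```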

Video Recording