Abstract

In this work, we study structured multi-armed bandits, which is the problem of online decision-making under uncertainty in the presence of structural information. In this problem, the decision-maker needs to discover the best course of action despite observing only uncertain rewards over time. The decision-maker is aware of certain structural information regarding the reward distributions and would like to minimize their regret by exploiting this information, where the regret is its performance difference against a benchmark policy that knows the best action ahead of time. In the absence of structural information, the classical upper confidence bound (UCB) and Thomson sampling algorithms are well known to suffer only minimal regret. As recently pointed out, neither algorithms are, however, capable of exploiting structural information that is commonly available in practice. We propose a novel learning algorithm that we call DUSA whose worst-case regret matches the information-theoretic regret lower bound up to a constant factor and can handle a wide range of structural information. Our algorithm DUSA solves a dual counterpart of the regret lower bound at the empirical reward distribution and follows its suggested play. Our proposed algorithm is the first computationally viable learning policy for structured bandit problems that has asymptotic minimal regret.

Attachment

Video Recording