Abstract

In multi-armed bandit models, there is a fundamental difference between optimization (a.k.a. best arm identification or pure exploration) and regret minimization. As we shall see, in both cases a notion of (asymptotically) optimal algorithm can be defined, and the behaviors of optimal algorithms for the two objectives are indeed quite different. Moreover, the complexities of the two problems feature different information-theoretic quantities. Building on these observations, we will investigate the best possible performance of an Explore-Then-Commit strategy for regret minimization, which starts with a pure-exploration phase to identify the best arm and then plays this estimated best arm until the end of the budget. In two-armed bandit models with Gaussian arms, we will see that the regret of such strategies is at least twice as large as that of the best strategies interleaving exploration and exploitation. This talk is based on joint work with Aurélien Garivier and Tor Lattimore.
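
The sketch below is a minimal illustration (not taken from the talk) of the Explore-Then-Commit scheme described above, on a two-armed Gaussian bandit with unit variance: each arm is sampled a fixed number of times, then the empirically best arm is played for the rest of the budget. The means `mu`, the exploration length `m`, the horizon `T`, and the function name are illustrative assumptions.

```python
import numpy as np

def explore_then_commit(mu, m, T, seed=None):
    """Run ETC on a two-armed Gaussian bandit (unit variance); return the pseudo-regret."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    pulls = np.zeros(2, dtype=int)

    # Exploration phase: pull each arm m times and record empirical means.
    samples = [rng.normal(mu[a], 1.0, size=m) for a in (0, 1)]
    pulls += m
    estimated_best = int(np.argmax([s.mean() for s in samples]))

    # Commitment phase: play the estimated best arm for the remaining budget.
    pulls[estimated_best] += T - 2 * m

    # Pseudo-regret: gap between the means times the pulls of the suboptimal arm.
    gap = mu.max() - mu.min()
    return gap * pulls[int(np.argmin(mu))]

# Example usage: average regret over repeated runs for a gap of 0.5.
regrets = [explore_then_commit([0.0, 0.5], m=50, T=10_000, seed=s) for s in range(200)]
print(f"average ETC regret: {np.mean(regrets):.1f}")
```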

Video Recording