Abstract

To evaluate the performance of a bandit learner in a changing environment, the standard notion of regret is insufficient. Instead, "dynamic regret" is a more suitable measure, as it evaluates the learner's ability to track the changes. How to achieve the optimal dynamic regret without prior knowledge of the number of times the environment changes remained open for a long time, and was recently resolved by Auer, Gajane, and Ortner in their COLT 2019 paper. We will discuss their consecutive sampling technique, which is rarely seen in the bandit literature, and see how their idea can be elegantly generalized to a wide range of bandit/RL problems. Finally, we will discuss important open problems that remain in the area.
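
For concreteness, here is one standard way to write down dynamic regret (a minimal LaTeX sketch; the notation K, T, mu_t(a), a_t is illustrative and not taken from the talk):

    \documentclass{article}
    \usepackage{amsmath}
    \begin{document}
    % Dynamic regret: the learner is compared to the best arm of
    % every individual round, so the benchmark itself may change.
    % Illustrative notation: K arms, horizon T, mean reward
    % \mu_t(a) of arm a at round t, arm a_t chosen by the learner.
    \[
      R_T^{\mathrm{dyn}}
      = \sum_{t=1}^{T} \Bigl( \max_{a \in [K]} \mu_t(a) - \mu_t(a_t) \Bigr)
    \]
    % Standard (static) regret instead competes with a single
    % fixed arm, which is a weak benchmark once the means change:
    \[
      R_T^{\mathrm{static}}
      = \max_{a \in [K]} \sum_{t=1}^{T} \mu_t(a) - \sum_{t=1}^{T} \mu_t(a_t)
    \]
    \end{document}

Because the benchmark max_a mu_t(a) moves with t, a small dynamic regret certifies that the learner keeps up with the changes, which a small static regret does not.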
