Abstract

Multi-armed bandit algorithms can offer enormous efficiency benefits in problems where learning to make effective decisions requires careful experimentation. They achieve this through adaptivity: early feedback is used to identify competitive parts of the decision space, and future experimentation effort is focused there. Unfortunately, because of this adaptivity, these algorithms risk confounding in problems where nonstationary contexts influence performance. As a result, many practitioners resort to non-adaptive randomized experimentation, which provides robustness but forgoes the efficiency benefits. We develop a new model to study this issue and propose deconfounded Thompson sampling, which involves a simple modification to one of the leading multi-armed bandit algorithms. We argue that this method strikes a delicate balance, allowing one to build in robustness to nonstationarity while, when possible, preserving the efficiency benefits of adaptivity.
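
To make the adaptivity described above concrete, here is a minimal sketch of standard (non-deconfounded) Thompson sampling on a Bernoulli bandit; it illustrates how posterior sampling concentrates effort on competitive arms, but it does not show the paper's deconfounding modification, and all names and parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.7]   # unknown arm success probabilities (assumed for illustration)
n_arms = len(true_means)
alpha = np.ones(n_arms)        # Beta posterior parameters: successes + 1
beta = np.ones(n_arms)         # Beta posterior parameters: failures + 1

for t in range(1000):
    # Sample a plausible mean for each arm from its posterior and play the best sample.
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    reward = int(rng.random() < true_means[arm])
    # Update the chosen arm's posterior with the observed reward.
    alpha[arm] += reward
    beta[arm] += 1 - reward

# Over time, plays concentrate on high-performing arms -- the efficiency benefit,
# and the source of confounding risk, that the abstract discusses.
```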

Video Recording