Abstract

Over the past decade, there has been impressive progress in designing state-of-the-art algorithms for contextual bandits (CB) and reinforcement learning (RL) that minimize regret with respect to the best policy in hindsight from a policy class fixed a priori. Until recently, relatively little attention had been devoted to the choice of the policy class itself. This choice is critical in high-stakes, data-hungry applications of CB and RL: a policy class that is insufficiently expressive will heavily underfit the data and yield low regret but suboptimal reward, while a policy class that is overly expressive will heavily overfit the data and incur higher regret than necessary. In general, the best choice of policy class is itself data-dependent, and can be formalized as a problem of online model selection. This model selection problem was posed as an open problem at COLT 2020, and a principal challenge in both CB and RL is to delicately balance exploration with the information acquisition needed for effective model selection.

In this talk, I will provide a brief overview of the first approaches to the CB model selection problem and a description of this core challenge. Next, I will discuss recent progress on resolving the open problem. On the “optimistic” side, I will present improved algorithms and estimators that achieve universal and adaptive model selection guarantees for linear (stochastic) contextual bandits under minimal assumptions. On the “pessimistic” side, I will discuss recently proved fundamental limits on model selection for general (stochastic) contextual bandits. I will conclude with a brief summary of remaining open directions for CB and RL model selection. The “optimistic” guarantees are joint work with Akshay Krishnamurthy, Jonathan Lee, Weihao Kong, Aldo Pacchiano and Emma Brunskill; the “pessimistic” results were established very recently by Teodor Marinov and Julian Zimmert.