Ambuj Tewari (University of Michigan)
This tutorial will cover the basics of regret analysis in (tabular) Markov Decision Processes (MDPs). The first part will focus on learning in an unknown but fixed MDP. Two families of algorithms will be covered: those based on optimism in the face of uncertainty (OFU) and those based on posterior (or Thompson) sampling. The second part will focus on learning in a harder setting where the MDP transition function and/or the reward function can change adversarially at each time step. Techniques from online learning such as exponential weights, online mirror descent (OMD), and follow the regularized leader (FTRL) will be used to study this challenging problem. Algorithms and their regret analyses establish upper bounds on the minimax regret in various settings. To complement these upper bounds, we will also include a discussion of hardness results especially in the form of regret lower bounds.