Online Learning and Bandits (Part 1)

Abstract

Online learning provides a method to study repeated decision making; the objective is to derive algorithms with small regret, defined to be the difference between the performance of the algorithm and the performance of the best in some comparator class of policies or actions. Unlike the MDP model, long term reasoning is not explicitly built into the model but is rather implicitly injected through the regret term. In effect, we need to be mindful of what comparators have performed well in the past and what comparators may perform best in the future. This general framework allows us to handle stochastic and adversarial data as well as full and partial information. We will begin by presenting well known algorithms for the full-information case, such as exponential weights (a.k.a. Hedge), follow-the-regularized-leader, online gradient descent, and mirror descent, and derive bounds on their worst-case regret. We will look in detail at linear and quadratic losses, to which the convex and curved convex cases reduce. We will discuss an application to learning saddle points in two-player zero-sum games by means of repeatedly playing a pair of online learners against each other (aka “self-play”).

Attachment

Slides (Part I and Part II)

Online Learning and Bandits (Part 1)

Abstract

Attachment

Video Recording