In the second part, we will study the "multi-armed bandit" problem where only the reward of the chosen action is revealed to the algorithm, and show that we may formalize this problem as either a modified full-information game or as a MDP with a trivial state space. When the target is to minimize regret, we may apply full-information algorithms by constructing estimates of the full loss (as is done by the EXP3 algorithm) or apply the "optimism in the face of uncertainty" (OFU) principle to choose the action with the highest plausible reward (as is done by the UCB algorithm). We will then discuss the "pure exploration" problem (a.k.a. the best-arm-identification problem), where we wish to maximize our chances of finding a (nearly) optimal arm, and introduce "action elimination" algorithms that do well in this setting. Finally, we discuss the setting where the learner also receives some contextual information before having to select an action: the contextual bandit problem requires the learner to compete with a class of mappings from contexts to actions. Throughout, we will highlight important lower bounds on regret which, when combined with the upper bounds for specific algorithms, often fully characterize the difficulty of these settings.


Video Recording