Abstract

This tutorial will cover the basics of performing offline / batch reinforcement learning. There is significant potential to better leverage existing datasets of decisions made and their outcomes in many applications, including commerce and healthcare. The basic problems may be described as batch policy evaluation -- how to estimate the performance of a policy given old data -- and batch policy optimization -- how to find the best policy to deploy in the future. I will discuss common assumptions underlying estimators and optimizers, recent progress in this area, and ideas for relaxing some of those assumptions, such as that the data collection policy sufficiently covers the space of states and actions, and that there are no confounding variables that might have influenced prior data. These topics are also of significant interest in epidemiology, statistics, and economics; here I will focus particularly on the sequential decision process setting (such as Markov decision processes) with more than two actions, which is of particular interest in RL and has been much less well studied than the binary-action, single-time-step setting.
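
For concreteness, the batch policy evaluation problem can be illustrated with a per-trajectory importance-sampling estimator, one of the basic off-policy estimators. The sketch below is illustrative only: it assumes the behavior (logging) policy's action probabilities were recorded alongside the data and that the behavior policy covers every action the target policy might take.

```python
import numpy as np

def per_trajectory_is_estimate(trajectories, target_policy, gamma=1.0):
    """Estimate the value of `target_policy` from logged trajectories.

    Each trajectory is a list of (state, action, reward, behavior_prob) tuples,
    where `behavior_prob` is the probability the logging (behavior) policy
    assigned to the logged action. `target_policy(state, action)` returns the
    probability the evaluation policy assigns to that action in that state.
    """
    estimates = []
    for traj in trajectories:
        weight = 1.0   # cumulative importance weight along the trajectory
        ret = 0.0      # discounted return observed along the trajectory
        for t, (state, action, reward, behavior_prob) in enumerate(traj):
            weight *= target_policy(state, action) / behavior_prob
            ret += (gamma ** t) * reward
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Toy usage (hypothetical data): single-step trajectories logged by a
# uniform behavior policy over two actions.
rng = np.random.default_rng(0)

def target_policy(state, action):
    return 0.9 if action == 1 else 0.1   # evaluation policy prefers action 1

logged = []
for _ in range(1000):
    a = int(rng.integers(2))             # behavior policy: uniform over {0, 1}
    r = float(a)                         # action 1 yields reward 1, action 0 yields 0
    logged.append([(0, a, r, 0.5)])      # (state, action, reward, behavior_prob)

print(per_trajectory_is_estimate(logged, target_policy))   # close to the true value 0.9
```

When the coverage assumption fails (the behavior policy never takes some actions the target policy would take), or when unobserved confounders influenced the logged actions, such estimators can be badly biased; those are exactly the assumptions the tutorial discusses relaxing.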

Video Recording