Policy gradient methods are among the most effective approaches to challenging reinforcement learning problems with large state and/or action spaces. This talk will cover their basic theoretical convergence properties, with a focus on tabular, log-linear, and neural policy classes. We will cover provable characterizations of the computational, approximation, and sample size properties of policy gradient methods. One central issue is providing approximation guarantees that are average case, in that they avoid explicit worst-case dependencies on the size of the state space, by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).
Joint work with: Alekh Agarwal, Jason Lee, Gaurav Mahajan