Abstract

Policy gradient methods are among the most effective approaches to challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: 1) whether and how fast they converge to a globally optimal solution (say, with a sufficiently rich policy class); 2) how they cope with the approximation error introduced by using a restricted class of parametric policies; and 3) their finite-sample behavior. In this talk, we will study all of these issues and provide a broad understanding of when first-order approaches to direct policy optimization in RL succeed. We will also identify the relevant notions of policy class expressivity underlying these guarantees in the approximate setting. Throughout, we will highlight the interplay of exploration with policy optimization, both in our upper bounds and in illustrative lower bounds. This talk is based on joint work with Sham Kakade, Jason Lee and Gaurav Mahajan.
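For orientation, the "first-order approaches to direct policy optimization" referred to above can be sketched in standard form (this generic formulation and its notation, such as the step size $\eta$ and the advantage function $A^{\pi_\theta}$, are not from the abstract itself and only illustrate the setting): one performs gradient ascent on the expected discounted return of a parametric policy,

$$\max_{\theta} \; J(\theta) := \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\Big], \qquad \theta^{(k+1)} = \theta^{(k)} + \eta\, \nabla_\theta J(\theta^{(k)}),$$

where, by the policy gradient theorem,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)\Big].$$

The questions in the abstract concern when iterates of this kind reach a globally optimal policy, how fast, and how the guarantees degrade when $\pi_\theta$ ranges over a restricted parametric class.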

Video Recording