The off-policy setting presents a unique set of challenges for reinforcement learning algorithms, bringing issues such as distribution shift and generalization to the forefront. In this talk, I will present an approach to off-policy learning through the lens of duality. Specifically, I will present a pair of primal and dual linear programs (LPs) encapsulating the Q-values of a policy. I will then show that regularizing the dual variables of this LP can alleviate issues of distribution shift, while regularizing the primal variables can enforce better generalization. Using cleverly chosen regularizers in conjunction with convex duality, I will derive algorithms for policy optimization and policy evaluation that are especially well-suited to off-policy and stochastic settings.
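For concreteness, a standard form of the Q-value LP from the duality literature is sketched below; the abstract does not fix notation, so the symbols here (policy \(\pi\), discount \(\gamma\), initial state distribution \(\mu_0\), transition kernel \(P\), reward \(r\)) are assumptions, and the talk's exact formulation may differ.

```latex
% A common form of the Q-LP (notation assumed, not specified in the abstract).
\begin{align*}
\text{(primal)}\quad
  \min_{Q}\ & (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\;a_0\sim\pi(\cdot|s_0)}\!\left[Q(s_0,a_0)\right] \\
  \text{s.t.}\ & Q(s,a)\ \ge\ r(s,a)
      + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a),\;a'\sim\pi(\cdot|s')}\!\left[Q(s',a')\right]
      \quad \forall\,(s,a), \\[6pt]
\text{(dual)}\quad
  \max_{d\,\ge\,0}\ & \sum_{s,a} d(s,a)\,r(s,a) \\
  \text{s.t.}\ & d(s,a) \,=\, (1-\gamma)\,\mu_0(s)\,\pi(a|s)
      + \gamma \sum_{s',a'} P(s\,|\,s',a')\,\pi(a|s)\,d(s',a')
      \quad \forall\,(s,a).
\end{align*}
```

Under this reading, the dual variables \(d(s,a)\) form the discounted state-action occupancy of \(\pi\), which is why regularizing them speaks to distribution shift, while the primal variables are the Q-values themselves, whose regularization speaks to generalization.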

