Abstract

Shortly before arriving at Berkeley in 2018, we discovered that Watkins' Q-learning algorithm has infinite asymptotic variance (in the sense of the Central Limit Theorem), and obtained methods to speed up convergence and even optimize the variance by modifying the gain in the recursion. An optimal algorithm, which we call Zap Q-learning, involves a matrix inverse similar to the one appearing in the classical Newton-Raphson algorithm. Stability was established only for the so-called tabular case.

Big surprises came in 2019: the 2018 analysis admits an extremely broad generalization. Zap stochastic approximation (SA) algorithms are stable and have approximately optimal variance under minimal assumptions. This has significant implications for reinforcement learning.
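For orientation, here is a rough sketch of the Zap SA recursion in generic notation (the symbols θ, A-hat, f, Φ and the step sizes α, γ below are my own shorthand, not taken from the lecture). The idea is to replace the scalar gain of a standard SA update with a matrix gain that tracks the Jacobian of the mean dynamics on a faster timescale:

\[
\theta_{n+1} = \theta_n - \alpha_{n+1}\,\widehat{A}_{n+1}^{-1} f(\theta_n,\Phi_{n+1}),
\qquad
\widehat{A}_{n+1} = \widehat{A}_n + \gamma_{n+1}\bigl(A_{n+1} - \widehat{A}_n\bigr),
\qquad
A_{n+1} = \partial_\theta f(\theta_n,\Phi_{n+1}),
\]

with \(\gamma_n/\alpha_n \to \infty\), so that \(\widehat{A}_n\) approximates the Jacobian of the averaged dynamics and each update mimics a stochastic Newton-Raphson step; this is the sense in which the matrix inverse above resembles Newton-Raphson.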

The lecture will lay out design principles for reinforcement learning based on these discoveries, building on the "Hidden Theory" tutorial from 2018. Two implications for RL will be discussed:

(i) It is known that the natural generalization of Watkins' Q-learning algorithm is not always stable in function approximation settings. Parameter estimates may diverge to infinity even with linear function approximation and a simple finite state-action model. The new Zap Q-learning algorithm is stable and convergent under minimal assumptions, even in the case of nonlinear function approximation (a schematic comparison of the two updates is sketched after item (ii)).

(ii) The GQ algorithm of Maei et al. (2010) was designed to address the stability challenge. However, just as in Watkins' algorithm, convergence can be absurdly slow. Theory for Zap GQ is not yet complete, but initial numerical experiments are promising.
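Relating to item (i), the following is a minimal illustrative sketch, not the speakers' implementation: it contrasts a single Watkins-style Q-learning update with a Zap-style matrix-gain update under linear function approximation Q_theta(s, a) = theta . phi(s, a). The feature vectors, step-size values, and function names are assumptions made for the example.

# Minimal sketch (assumed notation): one scalar-gain Watkins step versus one
# Zap-style matrix-gain step with linear function approximation.
import numpy as np

def watkins_update(theta, phi_sa, phi_next_greedy, reward, beta, alpha):
    # Scalar-gain temporal-difference update of Watkins' Q-learning.
    td_error = reward + beta * theta @ phi_next_greedy - theta @ phi_sa
    return theta + alpha * td_error * phi_sa

def zap_update(theta, A_hat, phi_sa, phi_next_greedy, reward, beta, alpha, gamma):
    # A_hat tracks the mean "TD Jacobian" on a faster timescale (gamma >> alpha);
    # its (pseudo-)inverse replaces the scalar gain, mimicking Newton-Raphson.
    td_error = reward + beta * theta @ phi_next_greedy - theta @ phi_sa
    A_sample = np.outer(phi_sa, beta * phi_next_greedy - phi_sa)
    A_hat = A_hat + gamma * (A_sample - A_hat)
    theta = theta - alpha * np.linalg.pinv(A_hat) @ (td_error * phi_sa)
    return theta, A_hat

# Toy usage with random feature vectors (illustration only).
rng = np.random.default_rng(0)
d = 4
theta = np.zeros(d)
A_hat = -np.eye(d)                  # any negative-definite initialization
phi_sa = rng.standard_normal(d)     # features of current (state, action)
phi_next = rng.standard_normal(d)   # features of greedy action at next state
theta_w = watkins_update(theta, phi_sa, phi_next, reward=1.0, beta=0.95, alpha=0.1)
theta_z, A_hat = zap_update(theta, A_hat, phi_sa, phi_next, reward=1.0,
                            beta=0.95, alpha=0.1, gamma=0.5)
print("Watkins step:", theta_w)
print("Zap step:    ", theta_z)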

Based on joint research with Adithya Devraj and Shuhang Chen (UF), and Ana Busic (Inria Paris).