![Theory of Reinforcement Learning_hi-res logo](/sites/default/files/styles/workshop_banner_sm_1x/public/2023-05/Theory%20of%20Reinforcement%20Learning_hi-res.png.jpg?itok=LJoCeC_d)
## Abstract
This talk will revolve around two performance criteria that have been studied in the context of episodic reinforcement learning: an old one, Best Policy Identification (BPI) [Fiechter, 1994], and a new one, Reward-Free Exploration (RFE) [Jin et al., 2020]. We will see that a variant of the very first BPI algorithm can actually be used for the more challenging reward-free exploration problem. This Reward-Free UCRL algorithm, which adaptively explores the MDP and adaptively decides when to stop exploration, requires fewer exploration episodes than state-of-the-art algorithms. We will then present alternative algorithms for the BPI objective and discuss the relative complexity of BPI and RFE.
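To make the "adaptively explores and adaptively decides when to stop" idea concrete, here is a minimal count-based sketch of a reward-free exploration loop on a small tabular MDP. This is not the Reward-Free UCRL algorithm from the talk: the `1/sqrt(n)` exploration bonus, the least-visited action rule, and the stopping threshold `epsilon` are simplified stand-ins chosen for illustration.

```python
import math
import random

def reward_free_exploration(P, H, epsilon, max_episodes=2000, seed=0):
    """Toy reward-free exploration sketch (not the paper's algorithm).

    P[s][a] is a list of next-state probabilities; H is the horizon.
    The agent ignores rewards, explores by picking the least-visited
    action in the current state, and stops adaptively once every
    state-action pair is visited often enough that its uncertainty
    proxy 1/sqrt(n) falls below epsilon.
    """
    rng = random.Random(seed)
    S, A = len(P), len(P[0])
    counts = [[0] * A for _ in range(S)]
    for episode in range(1, max_episodes + 1):
        s = 0  # each episode restarts from the initial state
        for _ in range(H):
            # exploration rule: largest bonus = least-visited action
            a = min(range(A), key=lambda i: counts[s][i])
            counts[s][a] += 1
            s = rng.choices(range(S), weights=P[s][a])[0]
        # adaptive stopping rule: all (s, a) pairs sufficiently explored
        if all(1.0 / math.sqrt(max(n, 1)) < epsilon
               for row in counts for n in row):
            return episode, counts
    return max_episodes, counts
```

On a small 3-state, 2-action MDP this loop typically stops after a handful of episodes, well before the episode budget, which mirrors the abstract's point that the number of exploration episodes is data-dependent rather than fixed in advance.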