Abstract

We consider the reinforcement-learning problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) when the transition function can be accessed only through samples. Given such a DMDP with states $S$, actions $A$, discount factor $\gamma \in (0,1)$, and rewards in the range $[0,1]$, we provide an algorithm which computes an $\epsilon$-optimal policy with high probability, where both the time spent and the number of samples taken are upper bounded by $\tilde{O}\big(|S||A|(1-\gamma)^{-3}\epsilon^{-2}\big)$. This improves upon the previously best-known bounds by a factor of $(1-\gamma)^{-1}$ and matches the information-theoretic lower bound up to logarithmic factors.
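For concreteness, a minimal statement of the standard definitions assumed (but not spelled out) above: for a policy $\pi$, its value function and the $\epsilon$-optimality criterion are
$$
V^{\pi}(s) \;=\; \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r\big(s_t, \pi(s_t)\big) \;\Big|\; s_0 = s\Big],
\qquad
V^{*}(s) \;=\; \max_{\pi} V^{\pi}(s),
$$
and $\pi$ is called $\epsilon$-optimal if $V^{\pi}(s) \ge V^{*}(s) - \epsilon$ for all $s \in S$. The notation $\tilde{O}(\cdot)$ suppresses factors polylogarithmic in $|S|$, $|A|$, $(1-\gamma)^{-1}$, $\epsilon^{-1}$, and the failure probability.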