Fall 2020

Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

Nov. 24, 2020, 10:00 am to 11:00 am



Simon S. Du (University of Washington)


Virtual meeting link will be sent out to program participants.

Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to its long planning horizon and unknown, state-dependent transitions. This paper shows that the long planning horizon and the unknown state-dependent transitions pose (at most) little additional difficulty in terms of sample complexity. We consider episodic reinforcement learning with S states, A actions, planning horizon H, total reward bounded by 1, and an agent that plays for K episodes. We propose a new algorithm, Monotonic Value Propagation (MVP), which relies on a new Bernstein-type bonus. The new bonus only requires tweaking the constants to ensure optimism, and is thus significantly simpler than existing bonus constructions. We show that MVP enjoys O((√(SAK) + S²A) · polylog(SAHK)) regret, approaching the Ω(√(SAK)) lower bound of contextual bandits. Notably, this result 1) exponentially improves on the state-of-the-art polynomial-time algorithms of Dann et al. [2019], Zanette et al. [2019], and Zhang et al. [2020] in terms of the dependency on H, and 2) exponentially improves the running time of Wang et al. [2020] and significantly improves the dependency on S, A, and K in sample complexity.
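To give a flavor of the Bernstein-type bonus the abstract mentions, here is a minimal, illustrative sketch of the standard Bernstein-style form: a variance-aware leading term plus a lower-order 1/n term. The function name, constants c1 and c2, and the confidence parameter delta are all assumptions for illustration, not MVP's actual construction; the abstract's point is that with a bonus of this shape, only the constants need tuning to guarantee optimism.

```python
import math

def bernstein_bonus(var_hat, n, c1=1.0, c2=1.0, delta=0.01):
    """Illustrative Bernstein-style exploration bonus (not MVP's exact bonus).

    var_hat: empirical variance of the value estimate at a state-action pair
    n:       visit count of that state-action pair
    c1, c2:  constants; in MVP-style analyses these are tuned to ensure optimism
    delta:   failure probability for the confidence bound
    """
    if n == 0:
        # Unvisited pairs get an infinite bonus, forcing exploration.
        return float("inf")
    log_term = math.log(1.0 / delta)
    # Variance-dependent leading term + lower-order 1/n correction.
    return math.sqrt(c1 * var_hat * log_term / n) + c2 * log_term / n
```

The bonus shrinks as the visit count n grows, so optimistic value estimates concentrate around the truth; the variance-aware leading term is what enables horizon-free regret analyses, in contrast to Hoeffding-style bonuses that scale with the worst-case range.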

Authors: Zihan Zhang, Xiangyang Ji, Simon S. Du