Description

While policy-based reinforcement learning (RL) has achieved tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. The paper proves that, in the problem of an episodic Markov decision process with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves O(\sqrt{d^2 H^3 T}) regret. Here d is the feature dimension, H is the episode horizon, and T is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.

Authors: Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang
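
To illustrate the structure the abstract describes, below is a minimal, hedged sketch (not the authors' code) of one OPPO-style update at a single step of an episodic linear MDP: an optimistic policy-evaluation step via ridge regression with an elliptical bonus, followed by a KL-regularized (mirror-descent) policy-improvement step. All names (phi, alpha, beta, lam) and the tiny synthetic data are illustrative assumptions.

```python
import numpy as np

# Sketch of one OPPO-style update at a single step h of an episodic MDP with
# linear function approximation. Synthetic data and parameter values are
# placeholders chosen only to make the example runnable.

rng = np.random.default_rng(0)

d, n_states, n_actions = 4, 5, 3
alpha, beta, lam = 0.5, 1.0, 1.0          # step size, bonus scale, ridge parameter

# Feature map phi(x, a) in R^d (assumed known, as in the linear MDP setting).
phi = rng.normal(size=(n_states, n_actions, d))

# Data collected so far: features and regression targets
# (reward plus next-step value estimate), here synthetic.
K = 50
feat = rng.normal(size=(K, d))
target = rng.normal(size=K)

# --- Optimistic policy evaluation: ridge regression plus an elliptical bonus ---
Lambda = lam * np.eye(d) + feat.T @ feat
w = np.linalg.solve(Lambda, feat.T @ target)
Lambda_inv = np.linalg.inv(Lambda)

def optimistic_Q(x):
    """Q estimate with a UCB-style bonus beta * ||phi(x, a)||_{Lambda^{-1}}."""
    f = phi[x]                                          # (n_actions, d)
    bonus = beta * np.sqrt(np.einsum("ad,de,ae->a", f, Lambda_inv, f))
    return f @ w + bonus

# --- Policy improvement: KL-regularized (mirror-descent) update ---
# pi_new(a|x) is proportional to pi_old(a|x) * exp(alpha * Q(x, a)),
# i.e. a step along an "optimistic version" of the policy gradient direction.
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from uniform
for x in range(n_states):
    logits = np.log(pi[x]) + alpha * optimistic_Q(x)
    logits -= logits.max()                              # numerical stability
    pi[x] = np.exp(logits) / np.exp(logits).sum()

print(pi.round(3))
```

In the full algorithm this pair of steps is applied backward over the horizon in every episode, with the bonus term driving exploration; the sketch above only shows the shape of a single update under the stated assumptions.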
