Abstract

Offline reinforcement learning (RL) is crucial in applications where experimentation is limited, such as medicine, but it is also notoriously difficult: the similarity between the observed trajectories and those generated by any proposed policy diminishes exponentially as the horizon grows, a phenomenon known as the curse of horizon. To better understand this limitation, we study the statistical efficiency limits of two central tasks in offline RL: estimating the policy value and the policy gradient from off-policy data. The efficiency bounds reveal that the curse is generally insurmountable without assuming additional structure, and as such it plagues many standard estimators that work in general problems, but it may be overcome in Markovian settings and even further attenuated in stationary settings. We develop the first estimators achieving the efficiency limits in finite- and infinite-horizon MDPs using a meta-algorithm we term Double Reinforcement Learning (DRL). We provide favorable guarantees for DRL and for off-policy policy optimization via efficiently estimated policy gradient ascent.
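
As a rough illustration of the curse of horizon (not the estimators developed in this work), the sketch below computes the exact variance of a cumulative importance weight in a toy setting. The setting is an assumption made only for this example: both the behavior policy `pi_b` and the evaluation policy `pi_e` ignore the state and choose among two actions, so the per-step density ratios are i.i.d. and the variance of their product has a closed form. The function names and the policy probabilities are hypothetical.

```python
def per_step_second_moment(pi_e, pi_b):
    """Second moment of the one-step ratio pi_e(a)/pi_b(a) when a ~ pi_b.

    Assumes state-independent policies given as lists of action probabilities.
    """
    return sum(pe * pe / pb for pe, pb in zip(pi_e, pi_b))


def cumulative_weight_variance(pi_e, pi_b, horizon):
    """Variance of the product of `horizon` i.i.d. one-step ratios.

    Each ratio has mean 1 under pi_b, so the product also has mean 1 and
    its variance is (per-step second moment) ** horizon - 1.
    """
    return per_step_second_moment(pi_e, pi_b) ** horizon - 1.0


# Hypothetical policies chosen only for illustration.
pi_b = [0.5, 0.5]
pi_e = [0.8, 0.2]

for horizon in (1, 10, 50):
    print(horizon, cumulative_weight_variance(pi_e, pi_b, horizon))
```

Whenever the two policies differ, the per-step second moment exceeds 1, so the variance of the trajectory-level weight grows exponentially in the horizon. This is the behavior that estimators relying on cumulative density ratios inherit in this toy case.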
