Abstract

The sample and computational complexity of reinforcement learning algorithms under the generative model has been a central focus of the literature. Most existing approaches that can guarantee epsilon-optimality have complexity that either depends on the size of the state space or scales exponentially in the time horizon T, in both cases suffering from the curse of dimensionality. Other approaches may be more efficient, but typically either make strong structural assumptions or come without optimality guarantees. We reconcile this dilemma for the celebrated problem of high-dimensional optimal stopping, a special case of RL with important applications to financial engineering, operations research, economics, and computer science, and one well known to be computationally challenging when the state space and time horizon are large. We propose the first algorithm with sample and computational complexity polynomial in T, and effectively independent of the underlying state space, which outputs epsilon-optimal stopping policies and value function approximations. Our algorithm is inspired by a novel connection between network flow theory in combinatorial optimization and the classical martingale duality theory for stopping problems, and can be viewed as a higher-order generalization of the celebrated prophet inequalities. Our approach yields an expansion of the value functions as an infinite sum, fundamentally different from standard expansions based on backwards induction, with the following properties: 1. each term is the expectation of an elegant, recursively defined infimum; 2. the first k terms require only T^k samples to approximate; and 3. truncating at the first k terms yields a (normalized) error of 1/k.
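The abstract does not spell out the recursion, so the following is only an illustrative sketch, on a toy problem of our own construction: minimizing E[Z_tau] over stopping times for T i.i.d. costs, with the expansion taken to be of the recursively-defined-infimum form described above (Z^1_t = Z_t and Z^{k+1}_t = Z^k_t minus the conditional expectation of min_s Z^k_s given the history through time t, each term of the sum being E[min_t Z^k_t]). All expectations are computed by exact enumeration over the few cost paths of this toy, rather than by the ~T^k-sample simulation the talk describes.

```python
import itertools

T = 3
VALUES = (1.0, 2.0, 3.0)  # i.i.d. costs, uniform on {1, 2, 3} -- a toy choice
paths = list(itertools.product(VALUES, repeat=T))  # 27 equally likely paths
prob = 1.0 / len(paths)

# Exact optimal stopping value via backward induction, for reference.
cont = sum(VALUES) / len(VALUES)            # E[V_T] with V_T = Z_T
for _ in range(T - 1):
    cont = sum(min(z, cont) for z in VALUES) / len(VALUES)
OPT = cont                                   # equals 13/9 for this toy

def cond_exp_of_min(Zk):
    """E[min_s Z^k_s | F_t]: average path-minima over paths sharing a prefix."""
    path_min = {p: min(Zk[p]) for p in paths}
    out = {}
    for p in paths:
        row = []
        for t in range(T):
            group = [q for q in paths if q[:t + 1] == p[:t + 1]]
            row.append(sum(path_min[q] for q in group) / len(group))
        out[p] = row
    return out

Zk = {p: list(p) for p in paths}             # Z^1_t = Z_t
terms = []
for k in range(6):
    terms.append(prob * sum(min(Zk[p]) for p in paths))  # E[min_t Z^k_t]
    ce = cond_exp_of_min(Zk)
    # Z^{k+1}_t = Z^k_t - E[min_s Z^k_s | F_t]  (remains nonnegative and adapted)
    Zk = {p: [Zk[p][t] - ce[p][t] for t in range(T)] for p in paths}

partial_sums = [sum(terms[:k + 1]) for k in range(len(terms))]
```

On this toy, the partial sums increase toward OPT from below; the very first term, E[min_t Z_t], is the classical prophet-style benchmark, and the later terms refine it.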
The terms of the associated sum can also be viewed as a provably good set of basis functions which can be efficiently evaluated by simulation, in contrast with past approximate dynamic programming approaches, which derive basis functions heuristically. Our work shows that although the state space is exponentially large, one need only explore it to the extent required to compute these (few) basis functions, which can be done efficiently. Time permitting, we will also explore applications of our approach to other high-dimensional RL problems (including those arising in dynamic pricing and online combinatorial optimization), and discuss connections to parallel and quantum computing. Joint work with Ph.D. student Yilun Chen.
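As a hedged illustration of evaluating such a term by simulation alone, the sketch below estimates the first expansion term E[min_t Z_t] for a toy 50-dimensional Gaussian random walk with a scalar adapted cost. The process, payoff, and all parameters are our own toy choices, not the authors'; the point is only that the estimate uses sampled paths, never a sweep over the (huge) state space, so its cost does not grow with the dimension.

```python
import random

def first_term_estimate(sample_costs, num_paths, rng):
    """Monte Carlo estimate of E[min_t Z_t] from simulated cost paths."""
    return sum(min(sample_costs(rng)) for _ in range(num_paths)) / num_paths

def make_toy_costs(d=50, horizon=10):
    """Toy adapted cost along a d-dimensional Gaussian random walk."""
    def sample_costs(rng):
        state = [0.0] * d
        costs = []
        for _ in range(horizon):
            state = [x + rng.gauss(0.0, 1.0) for x in state]
            costs.append(max(0.0, 1.0 + sum(state) / d))  # toy nonnegative cost
        return costs
    return sample_costs

rng = random.Random(0)
estimate = first_term_estimate(make_toy_costs(), num_paths=2000, rng=rng)
```

Doubling the dimension d changes only the per-path simulation cost, not the number of paths needed for a given accuracy, which is the sense in which the evaluation is effectively independent of the state space.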
