We consider off-policy evaluation (OPE) in reinforcement learning, which evaluates the performance of a new policy purely from observed data collected from previous experiments, without requiring the actual execution of the new policy. Off-policy evaluation is of key importance in many areas with high execution cost or safety concerns, such as medical diagnosis, recommendation systems, and robotics. We will introduce an approach for constructing rigorous, non-asymptotic confidence intervals, not just point estimation, for evaluating the policy performance. The idea is to reduce the problem of calculating tight confidence bounds in OPE into an optimization problem of finding the "extreme" Q-functions that yield the largest and smallest expected reward inside a high-probability feasible set. The feasible set is constructed based on data by controlling the Bellman residual error using tail bounds of kernel Bellman loss. Compared with the popular approaches based on importance sampling, our approach has the advantages of 1) yielding much tighter bounds, especially on infinite-horizon problems (on which IS methods tend to fail); 2) working for arbitrarily collected data without a known or fixed behavior policy (i.e., being behavior-agnostic); and 3) requiring no independence assumptions between the transition pairs. We will also discuss how our method can leverage prior knowledge and be used to perform post hoc diagnosis and correction for existing estimations.

Video Recording