Abstract

For many A/B tests, the treatment is a new algorithm or policy (e.g., a matching policy in a ridesharing system, delivery system, or ad auction). These policies change the state of the system, and so classical experimental designs that randomize platform participants to treatment or control are infeasible. Instead, the typical design is a "switchback" experiment: the two policies are applied alternately in successive time intervals, and the performance of each policy is estimated by its sample average reward across the intervals in which it was applied. Despite their appeal, switchback designs suffer from *temporal interference*: the application of policy A in one interval affects the initial state of the system as seen by policy B in the next interval. Motivated by the need to mitigate temporal interference, in this talk we study optimal experimental design and estimation for the comparison of two policies. Our key innovation is to model the problem as one of efficiently estimating the difference in steady-state reward between two Markov chains (the two policies) defined on the same state space. We completely characterize the optimal experimental design and estimator in this setting, and discuss implications for practice. Joint work with Peter Glynn and Mohammad Rasouli.
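As a concrete illustration of the setup (a minimal sketch, not taken from the talk), the code below simulates a switchback experiment over two hypothetical policies modeled as Markov chains on a shared two-state space. All transition matrices, rewards, and interval lengths are invented for illustration; the point is that the naive sample-average estimator of the reward difference is biased by temporal interference, since each interval starts from the state the other policy left behind, and the bias shrinks as intervals lengthen.

```python
# Sketch: temporal interference in a switchback experiment.
# Two assumed policies are Markov chains on states {0, 1}; all numbers
# below are illustrative, not from the talk.
import numpy as np

rng = np.random.default_rng(0)

# Transition matrices for the two policies (rows index the current state).
P = {"A": np.array([[0.9, 0.1], [0.5, 0.5]]),
     "B": np.array([[0.6, 0.4], [0.2, 0.8]])}
reward = np.array([1.0, 0.0])  # reward depends only on the current state

def stationary(Pm):
    """Stationary distribution via the left eigenvector for eigenvalue 1."""
    vals, vecs = np.linalg.eig(Pm.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return pi / pi.sum()

# Ground truth: difference in steady-state reward between the two chains.
true_diff = reward @ stationary(P["A"]) - reward @ stationary(P["B"])

def switchback(interval_len, n_intervals=2000):
    """Alternate policies over fixed intervals; naive estimate of A - B."""
    state, sums = 0, {"A": 0.0, "B": 0.0}
    for k in range(n_intervals):
        policy = "A" if k % 2 == 0 else "B"
        for _ in range(interval_len):
            sums[policy] += reward[state]
            state = rng.choice(2, p=P[policy][state])
    per_policy_steps = interval_len * n_intervals / 2
    return (sums["A"] - sums["B"]) / per_policy_steps

print(f"true steady-state difference: {true_diff:.3f}")
for L in (1, 5, 50):
    # Short intervals never reach steady state, so the estimate is attenuated.
    print(f"interval length {L:3d}: estimate {switchback(L):.3f}")
```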