Abstract

Web applications typically optimize their product offerings using randomized controlled trials (RCTs), commonly called A/B testing.  These tests are usually analyzed via p-values and confidence intervals presented through an online platform.  Used properly, these measures both control Type I error (false positives) and deliver nearly optimal Type II error (false negatives).

Unfortunately, inferences based on these measures are wholly unreliable if users make decisions while continuously monitoring their tests.  On the other hand, users have good reason to monitor continuously: there are often significant opportunity costs to letting experiments run longer, so optimal inference depends on the user's relative preference for shorter run time versus greater power of detection.  Furthermore, the platform does not know these preferences in advance; indeed, they can evolve depending on the data observed during the test itself.
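
To make the failure mode concrete, here is a minimal simulation sketch (not from the talk itself): an A/A test with no true effect, where the experimenter recomputes a standard fixed-horizon z-test after every few observations and stops at the first "significant" result. The sample sizes, peeking schedule, and threshold are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_max=1000, peek_every=10, alpha=0.05, n_sims=2000):
    """Simulate A/A tests (no true effect) in which a fixed-horizon z-test is
    checked after every `peek_every` observations and the experiment stops at
    the first rejection. Returns the realized Type I error rate."""
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, n_max)              # null data: mean truly zero
        for n in range(peek_every, n_max + 1, peek_every):
            z = np.sqrt(n) * x[:n].mean()            # known variance 1
            p = 2 * (1 - stats.norm.cdf(abs(z)))     # fixed-horizon p-value
            if p < alpha:
                rejections += 1
                break
    return rejections / n_sims

# Continuous monitoring pushes the realized rate far above the nominal 0.05.
print(peeking_false_positive_rate())
```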

This sets up the challenge we address in our work: can we deliver valid and essentially optimal inference in an environment where users continuously monitor experiments, without knowing their risk preferences in advance? We provide a solution leveraging methods from sequential hypothesis testing, and refer to our measures as *always valid* p-values and confidence intervals. Our solution led to a practical implementation in a commercial A/B testing platform, where it has served thousands of customers since 2015.
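
The authors' published work on this problem builds always-valid p-values from mixture sequential probability ratio tests (mSPRTs); the sketch below shows that basic construction for Gaussian data with a Gaussian mixing distribution. The parameter choices (`sigma2`, `tau2`) are illustrative assumptions, and this is a sketch of the idea rather than the platform's exact implementation.

```python
import numpy as np

def msprt_likelihood_ratio(n, xbar, theta0, sigma2, tau2):
    """Closed-form mixture likelihood ratio for H0: theta = theta0 with
    i.i.d. N(theta, sigma2) observations and a N(theta0, tau2) mixing
    distribution over the alternative."""
    v = sigma2 + n * tau2
    return np.sqrt(sigma2 / v) * np.exp(
        (tau2 * n**2 * (xbar - theta0) ** 2) / (2.0 * sigma2 * v)
    )

def always_valid_p_values(x, theta0=0.0, sigma2=1.0, tau2=1.0):
    """Running always-valid p-value: p_n = min(p_{n-1}, 1 / Lambda_n).
    Lambda_n is a nonnegative martingale under H0, so by Ville's inequality
    P(p_n <= alpha for some n) <= alpha, no matter when the user stops."""
    p, total, out = 1.0, 0.0, []
    for n, xi in enumerate(x, start=1):
        total += xi
        lam = msprt_likelihood_ratio(n, total / n, theta0, sigma2, tau2)
        p = min(p, 1.0 / lam)
        out.append(p)
    return out

# Example: data with a small true lift; the running p-value can be read at any n.
rng = np.random.default_rng(1)
pvals = always_valid_p_values(rng.normal(0.2, 1.0, 500))
print(pvals[-1])
```

Because the guarantee holds uniformly over time, the user can trade run time against detection however they like after seeing the data, which is exactly the flexibility the abstract describes.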

Joint work with Leo Pekelis and David Walsh.  This work was carried out with Optimizely, a leading commercial A/B testing platform.

Video Recording