Abstract

Statistical machine learning uses data to make inferences about the distribution from which that data was drawn. A key challenge is to prevent overfitting -- that is, inferring a property that only occurs in the data by chance and is not reflected in the underlying distribution. The problem of overfitting is exacerbated by adaptive re-use of data; if we have previously used the same data, then we can no longer assume that it is "fresh" when used again, as the later analysis may now depend on the data via the outcome of the earlier analysis.

This talk will discuss the use of information-theoretic stability as a means of preventing overfitting when data is re-used adaptively. Intuitively, an algorithm is stable if it "learns" very little about each individual datum (which does not preclude learning about the data as a whole). I will describe how techniques from the differential privacy literature can be used to construct stable algorithms for data analysis tasks, and explain how stability protects against overfitting.
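To make the idea concrete, the following is a minimal sketch of one stable mechanism from the differential privacy literature: answering a bounded mean query with Laplace noise calibrated to the query's sensitivity, so that the answer depends very little on any single datum. The function name noisy_mean, the [0, 1] clipping range, and the parameter values are illustrative assumptions, not details from the talk.

    import numpy as np

    def noisy_mean(data, epsilon, lower=0.0, upper=1.0):
        """Answer a mean query with Laplace noise calibrated to its sensitivity.

        After clipping to [lower, upper], changing one datum shifts the mean
        by at most (upper - lower) / n, so Laplace noise of scale
        sensitivity / epsilon yields an epsilon-differentially-private
        (and hence stable) answer.
        """
        clipped = np.clip(data, lower, upper)
        n = len(clipped)
        sensitivity = (upper - lower) / n
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return clipped.mean() + noise

    # Re-using the same sample for several adaptively chosen mean queries:
    # each noisy answer reveals little about any individual datum, which in
    # turn limits how badly later, data-dependent queries can overfit.
    sample = np.random.default_rng(0).uniform(size=1000)
    print(noisy_mean(sample, epsilon=0.1))

The design choice here is that the noise scale grows with the per-datum influence on the answer and shrinks with the sample size, which is what keeps the mechanism both stable and useful for large datasets.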

Video Recording