Abstract

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to fairly quantify the value of data in algorithmic predictions and decisions. For example, Gov. Newsom recently proposed a "data dividend" whereby companies compensate consumers for the data that they generate. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on n data points to produce a predictor, we study data Shapley as an equitable metric to quantify the value of each training datum to the predictor's performance. Data Shapley uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, data Shapley has several other benefits, as our experiments across biomedical, image, and synthetic data demonstrate: (1) it gives actionable insights into which types of data benefit or harm the prediction model; (2) weighting training data by Shapley value improves domain adaptation. This is joint work with Amirata Ghorbani.
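
To make the Monte Carlo idea concrete, here is a minimal Python sketch of one standard way to estimate Shapley values: sample random orderings of the training points and average each point's marginal contribution to the performance score. It is an illustration under stated assumptions, not the speakers' implementation. The function name `monte_carlo_shapley`, the toy 1-D dataset, the 1-nearest-neighbor scoring rule, and the chance-level score of 0.5 for an empty training set are all hypothetical choices made for this example. The gradient-based estimator mentioned in the abstract is not shown.

```python
import random
from typing import Callable, List, Sequence


def monte_carlo_shapley(
    n: int,
    utility: Callable[[Sequence[int]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate the Shapley value of each of n training points.

    `utility` maps a list of training indices to a performance score.
    Each sampled permutation adds one marginal contribution per point.
    """
    rng = random.Random(seed)
    values = [0.0] * n
    indices = list(range(n))
    for _ in range(num_permutations):
        rng.shuffle(indices)
        prefix: List[int] = []
        prev_score = utility(prefix)  # score with no training data
        for i in indices:
            prefix.append(i)
            score = utility(prefix)
            values[i] += score - prev_score  # marginal contribution of i
            prev_score = score
    return [v / num_permutations for v in values]


# Hypothetical toy data: (feature, label) pairs; index 3 is deliberately mislabeled.
train = [(0.0, 0), (0.2, 0), (1.0, 1), (0.9, 0), (1.2, 1)]
valid = [(0.1, 0), (0.3, 0), (0.8, 1), (1.1, 1)]


def one_nn_accuracy(subset: Sequence[int]) -> float:
    """Validation accuracy of a 1-nearest-neighbor rule trained on `subset`."""
    if not subset:
        return 0.5  # assumed chance-level score for an empty training set
    correct = 0
    for x, y in valid:
        nearest = min(subset, key=lambda i: abs(train[i][0] - x))
        correct += int(train[nearest][1] == y)
    return correct / len(valid)


shapley = monte_carlo_shapley(len(train), one_nn_accuracy)
for idx, value in enumerate(shapley):
    print(f"point {idx}: estimated value {value:+.3f}")
```

Because the marginal contributions along each permutation telescope, the estimates always sum to the full-data score minus the empty-set score, which gives a quick sanity check. The cost is one model evaluation per point per permutation, so in practice the choice of `num_permutations` trades accuracy against compute.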
