Abstract
Modern machine learning models exhibit amazing accuracy on tasks from image classification to natural-language processing, but accuracy does not tell the entire story of what these models have learned. Does a model memorize and leak its training data? Does it "accidentally" learn representations and tasks uncorrelated with its training objective? Do techniques for censoring representations and adversarial regularization prevent unwanted learning? Is it possible to train "least-privilege" models that learn only the task they were given? I will demonstrate unwanted-learning phenomena in several practical models and discuss why obvious solutions do not appear to work.