Abstract

Data sharing is a cornerstone of many modern applications. A common concern in data sharing pipelines is privacy: organizations are responsible for protecting the privacy of their data, whether it represents user data or enterprise trade secrets. In this talk, we will discuss emerging challenges related to learning large machine learning models from private, federated data. Existing approaches (namely, differentially private federated learning, DP-FL) involve training models on client devices and are difficult to scale to large models. We will explore the feasibility of replacing DP-FL with centralized training over differentially private (DP) synthetic data. We will show that a model finetuned on DP synthetic data can perform similarly to DP-FL in downstream model performance, with order(s)-of-magnitude lower communication and computation. We will also demonstrate conditions under which synthetic data is theoretically guaranteed to approach the underlying private data distribution.
