From the Inside: Foundations of Data Privacy

by Katrina Ligett (Hebrew University)

A big tech firm wishes to make “smart keyboard” predictions of what users will type next by learning from what other users have typed in the past, without risking user privacy. The US Census Bureau must publish statistics that will be used to allocate governmental resources but is bound by Title 13 to protect the privacy of respondents. Until fairly recently, organizations wishing to use large, sensitive data sets for machine learning or other statistical purposes such as these had little choice but to wing it with ad hoc techniques that provided no formal privacy guarantees: delete some columns here, swap some data around there, add a bit of noise, and cross your fingers. Unfortunately, ad hoc privacy techniques are sitting ducks just waiting for a stronger, cleverer, or better-informed adversary to come along and break them, and history has shown us that such an adversary does come along eventually — and often quite soon.

In the early 2000s, theoretical computer scientists, including Irit Dinur, Kobbi Nissim, Cynthia Dwork, Moni Naor, Frank McSherry, and Adam Smith, took on the problem of formalizing what it means to provide provable privacy to participants in a statistical computation, and the result was the notion now known as differential privacy. Differential privacy essentially restricts computations so that they cannot depend too much on any one person’s data. A rich line of work in theoretical computer science has since emerged, exploring variants of this restriction and their consequences for various computations of interest.
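To make the restriction concrete: the simplest and best-known way to achieve differential privacy is the Laplace mechanism, which answers a numeric query by adding noise calibrated to how much any one person's record can change the answer. The sketch below is an illustration of that idea, not code from the article; the function name `laplace_count` and the toy data are invented for the example.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query under epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one
    person's record changes the true count by at most 1. Laplace
    noise with scale 1/epsilon therefore masks any single person's
    contribution, so the output cannot depend too much on any one
    individual's data.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Toy example: how many people in a (hypothetical) dataset are over 40?
ages = [23, 45, 31, 52, 38, 67, 29, 44]
private_count = laplace_count(ages, lambda age: age > 40, epsilon=0.5)
```

Smaller values of epsilon mean stronger privacy but noisier answers; the theory referenced throughout the program concerns exactly this trade-off, and how it compounds across many queries.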

Fast-forward 15 years. Between the announcement that the Census Bureau will protect outputs of the 2020 census with differential privacy and the increasing adoption of differential privacy in industry — including by Apple, Google, and Microsoft — the past year has been an incredibly exciting one for the CS theory community that studies data privacy. This wave of deployments, as well as the many theory challenges they raise, presented a perfect occasion for members of the privacy community to gather together for a semester of sharing progress, collaborating intensively, and broadening engagement with a wide range of disciplines and stakeholders. The Simons Institute program on Data Privacy: Foundations and Applications, held in Spring 2019, provided a marvelous opportunity for this gathering.

The program kicked off with a boot camp consisting of a series of in-depth tutorials on topics that are active frontiers of data privacy research today: large-scale private learning, statistical inference and privacy, algorithms for answering linear queries, formal methods approaches, and the economics of privacy and personal data. The program was also sprinkled with “perspectives” talks, providing an opportunity for long-term participants from outside theoretical computer science to introduce their approaches to data privacy, from ethics to survey statistics.

The program featured three additional workshops, each of them an opportunity to dig deeper into research challenges. The first workshop focused on the various challenges of bringing the theory of data privacy into practical use, ranging from the technical challenges of building tools that meet Internet-scale needs (e.g., using techniques from programming languages and formal methods); to the disciplinary challenges of developing privacy-preserving techniques that meet the needs of a wide variety of governmental, academic, and industrial users of data; to the challenges of meaningfully bridging between technical and legal notions of privacy. The speakers spanned a wide variety of backgrounds, from government to industry, from law to ethics, and from formal methods to algorithms. 

The second workshop, on privacy and the science of data analysis, surfaced two key ways in which these two areas interact: first, exciting advances in techniques for performing key data analysis and learning tasks in a privacy-preserving fashion, and second, recent work establishing the notion of differential privacy as a direction for dealing with the risks of overfitting and false discovery that arise in adaptive data analysis.

The final workshop took a step beyond the CS theory comfort zone of differential privacy to examine a broad range of ideas in the space of data rights, including fairness, accountability, and privacy (beyond differential privacy), with the goal of surfacing problems that would benefit from the attention of the CS theory community. The workshop brought in experts from law, statistics, economics, and the social sciences more broadly, and used an innovative format to push for meaningful engagement among these communities.

The program highlighted the unique ability of the Simons Institute to foster deep interactions and collaborations between theoretical computer science and a broad range of other disciplines. The field of data privacy, both the theory and the practice, will surely benefit from this investment in the coming years.
