A Simons Institute white paper developed with support from a grant from the Alfred P. Sloan Foundation
Modern computing and communications make it possible to gather and analyze enormous quantities of granular information about individuals. The structure of our economy not only enables the collection and analysis of such information, but requires it. Participation in everyday life in the industrialized world now requires the constant sharing of highly detailed personal data. This transformation took place so quickly that policymaking institutions in industry and government have not had time to come to terms with it. They have so far failed to account not only for the magnitude of the data being collected, but also for the potential harms and misuses of these data via cleverly designed algorithms that can extract information that had supposedly been safeguarded.
Nevertheless, individual privacy is essential for the functioning of democratic society and for the individuals who compose it.
It is not possible to simply undo the erosion of privacy that has taken place in the past half century: too much has come to depend on it. Privacy has been eroded by a proliferation not only of information, but also of inferences that can be made on the basis of that information. Machine learning, statistical data analysis, and other artificial intelligence techniques stand poised to play an ever more significant role in commerce and human interaction more broadly. These techniques rely on data — in many cases on sensitive personal data. Like computerization more generally, the efficiencies they promise make them sufficiently appealing that although we could, as a society, simply decide not to use them, that outcome appears highly unlikely. For one, we cannot take back data that have already been widely disseminated. Furthermore, we don’t want to destroy the insights and services that rely on data.
Thus, policymakers and technologists are faced with an urgent question: how might we enjoy the benefits of modern information technology while at the very least mitigating the impacts on privacy or, better yet, reclaiming some substantial measure of vanished privacy? This might sound like an impossible task. However, a mathematical framework known as differential privacy, together with related techniques developed in the last decade and a half, holds substantial promise. These techniques are not a panacea: there are types of privacy violations for which they have little relevance. However, their applicability is broad: they can be applied to almost any domain in which the statistical analysis of large data sets is performed. Differential privacy does not get you something for nothing. But it does provide a framework for understanding the trade-offs between privacy and utility in a rigorous fashion, and therefore a way to better balance these two aims.
Notably, this framework applies to the US census and other similar compilations of government statistics, which are not only invaluable for research purposes, but also necessary to the operation of modern government. Prior to the line of research that led to differential privacy, it was widely believed that anonymizing data was a relatively straightforward and sufficient solution to the privacy challenge. Statistical aggregates could be released, many people thought, without revealing underlying personally identifiable data. Data sets scrubbed of names but otherwise rich in individual information could, it was believed, be released to researchers as anonymized data.
We now know that this is not the case: multiple uncoordinated releases of data related to the same individual can lead to dramatic reidentification attacks. Fortunately, another line of inquiry has led to a different solution.
Differential privacy is a formal mathematical framework for quantifying and managing privacy risks. It provides provable privacy protection against a wide range of potential attacks. It applies to analyses of collections of individual information. The goal is to make modifications to the outcome of an analysis in such a way that the probability of every possible outcome would be about the same regardless of whether any given individual's information is used in the analysis. This, in turn, means that calculations on the modified collection cannot be used to infer, with confidence, information that is specific to individuals.
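This guarantee has a standard formal statement (paraphrased here, not drawn from this paper): a randomized algorithm M is ε-differentially private if, for every pair of data sets D and D′ that differ in one individual's record, and for every set S of possible outcomes,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S].
```

Smaller values of ε mean these two probability distributions are closer together, and hence that any single individual's presence or absence has less influence on what an observer can learn.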
There are multiple ways of implementing differential privacy. The principal idea is to introduce noise into statistical procedures so as to hide the contribution of any single individual, while preserving statistical properties of data in the aggregate. In other words, achieving a differential privacy guarantee involves introducing noise into any calculation of interest so that its results cannot leak too much information that is specific to any one individual participant in the underlying database. This introduction of noise changes the results of the calculation. The magnitude of the change depends on the size of the database, properties of the calculation of interest, and the desired level of privacy. As a rule of thumb, stronger privacy protection generally leads to less accurate results, and this inaccuracy will usually be significant when trying to analyze small or sparse subsets of the data. For inquiries that rely on large subpopulations (for example, the average income of all Americans), the use of differential privacy leads to only a negligible, if any, loss of accuracy.
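As a minimal sketch of this idea (the function names and the choice of a counting query are illustrative assumptions, not drawn from this paper), the widely used Laplace mechanism adds noise whose scale is the query's sensitivity, meaning the most that any one record can change the answer, divided by the privacy parameter epsilon:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) random variable.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    """Differentially private count. Adding or removing one record changes
    the true count by at most 1, so noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

Because the noise has mean zero, statistics over large populations remain accurate while the contribution of any single record stays hidden in the noise.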
But consider a smaller example: suppose there exists a database of the household incomes of students. A researcher might want to know the average income. Without differential privacy, she might be given the average, under the assumption that the average does not disclose information about any individual student. But what if, for instance, the researcher knew the average income, and then another student arrived, and the researcher again asked for the average income? She could then calculate the household income of the new arrival from the difference between the previous and new averages. Under differential privacy, any query for the average income would return the actual average plus some tolerable degree of inaccuracy. What the mathematical framework behind differential privacy enables is certainty about how much information about individuals “leaks” into calculations that are disclosed: how big must the random distortion to the average be in order to keep such leaks to an acceptable level? Specifying this level is what is called choosing “epsilon,” the Greek letter that is usually used as a measure of an upper bound on how much privacy may be lost.
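The differencing attack in this example, and the effect of a noisy release, can be sketched as follows (the income figures, and the cap used to bound each person's influence on the average, are illustrative assumptions):

```python
import math
import random

def exact_average(incomes):
    return sum(incomes) / len(incomes)

def noisy_average(incomes, epsilon, cap=200_000.0):
    """Average with Laplace noise. Incomes are clamped to [0, cap], so any
    one person can shift the average by at most cap / n; the noise scale
    is that sensitivity divided by epsilon."""
    n = len(incomes)
    clamped = [min(max(x, 0.0), cap) for x in incomes]
    scale = (cap / n) / epsilon
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return sum(clamped) / n + noise

# Without noise, two exact releases reveal the newcomer's income:
before = exact_average([40_000, 52_000, 61_000])
after = exact_average([40_000, 52_000, 61_000, 95_000])
recovered = after * 4 - before * 3  # exactly 95_000.0
```

With `noisy_average`, each individual release is perturbed enough that the subtraction trick recovers only noise, while the average itself remains usable.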
The trade-off between accuracy and privacy is not straightforward. Indeed, a major area of research in differential privacy is figuring out techniques that can improve that trade-off, making it possible to give stronger privacy guarantees at a given level of noise, or less noise at a given level of privacy. Another important area of research is “local differential privacy,” in which the introduction of noise happens before any information is sent to a central location, giving the end user much stronger privacy protections. Other approaches that allow for accurate analysis without pooling data centrally include techniques that allow computations to take place on encrypted data. In such systems, data are sent in encrypted form to centralized processors, where computations take place. The encrypted results are then returned to users, who decrypt them locally. The centralized processors never get a chance to see unencrypted data, and so privacy is preserved but the computation can combine information from many sources.
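Local differential privacy can be illustrated with randomized response, one of its simplest instantiations (the function names below are illustrative assumptions): each user randomizes a sensitive yes/no answer on their own device before sending it anywhere, yet the aggregator can still recover an accurate population-level estimate.

```python
import math
import random

def randomized_response(truth, epsilon):
    """Answer truthfully with probability e^eps / (e^eps + 1), otherwise
    lie; the bit leaves the user's device already randomized."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return truth if random.random() < p_truth else not truth

def estimate_proportion(responses, epsilon):
    """Debias the noisy responses to estimate the true 'yes' proportion."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(responses) / len(responses)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```

No central party ever sees a user's true answer, which is what makes the local model's privacy guarantee so much stronger than trusting a curator with raw data.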
Machine learning and differential privacy
Many of the most promising ways to make good use of user data rely on machine learning. For example, machine learning algorithms are used to forecast traffic, predict the next word a user will type, and identify fraudulent credit card transactions. In many of those applications, the algorithms are trained on sensitive user data such as users’ location history, medical records, and keyboard inputs. The trained model is often accessible to third parties either in its entirety or through an interface that can be queried. It is thus necessary to ensure that such access cannot be used to compromise the privacy of user data.
Many of the most successful applications of machine learning rely on training of deep neural networks — models consisting of numerous layers of units capable of performing simple computations. The number of parameters that these models have usually exceeds the size of the data set used to train them. As a result, these networks are capable of storing a lot of information about the data set they are trained on. In fact, many recent works demonstrate that networks trained using standard learning algorithms are highly susceptible to attacks on privacy. For example, on the basis of only partial information, the membership of an individual in the training data set can be inferred with high accuracy, as can missing parts of a user’s record; complex sentence prediction models (such as those that offer to complete emails) can spit out entire blocks of their training data verbatim.
As in many other areas of data analysis, the only known technique that provably limits the leakage of private user information when training deep neural networks is differential privacy. Training deep neural networks while ensuring differential privacy presents a number of significant technical challenges. A learning algorithm for training deep neural networks typically consists of a large number of steps, each updating the current values of the parameters. Ensuring that a large number of parameters and steps cumulatively preserves differential privacy requires the addition of a significant amount of noise in each step. As a result, known differentially private learning algorithms do not achieve state-of-the-art accuracy on many of the problems in this area. Another challenge is the high computational cost of deep learning algorithms. Additional operations are required to ensure that differential privacy criteria are met, such as restricting the influence of each user’s data on the update. These overheads make the process of training a neural network significantly less efficient — potentially too inefficient to run on very large data sets (for the moment — this may change as computational power grows). Accuracy, in particular on types of records for which there are only small numbers of examples, might be a more lasting problem. Finally, the algorithms and models used for training deep neural networks are relatively poorly understood from a theoretical viewpoint. This impedes the development of more principled and efficient algorithms for solving this problem with differential privacy.
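The per-step recipe described above, bounding each example's influence and then adding noise, is the core of standard differentially private gradient descent; the following is a simplified sketch (function and parameter names are illustrative assumptions, and a real implementation would also track the cumulative privacy loss across all steps):

```python
import math
import random

def dp_sgd_step(weights, per_example_grads, clip_norm, noise_std, lr):
    """One differentially private gradient step: clip each example's
    gradient so no single record can dominate the update, sum, add
    Gaussian noise scaled to the clip norm, average, and step."""
    d = len(weights)
    summed = [0.0] * d
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(d):
            summed[i] += grad[i] * scale
    n = len(per_example_grads)
    noisy_mean = [(summed[i] + random.gauss(0.0, noise_std * clip_norm)) / n
                  for i in range(d)]
    return [weights[i] - lr * noisy_mean[i] for i in range(d)]
```

The clipping step is what makes the added noise meaningful: without a bound on each example's contribution, no finite amount of noise could hide an arbitrarily large outlier gradient.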
In recent years, significant research has been devoted to overcoming these challenges. This effort has led to a number of techniques that improve the trade-off between privacy guarantees and the accuracy of models. Such techniques include reducing the dimension of the problem (or the effective number of trained parameters), advanced accounting of privacy parameters, and tuning of the various choices in the algorithm’s design. In addition, techniques that exploit the availability of public data of a similar type (referred to as transfer learning) or public unlabeled data (semi-supervised learning) have been shown to improve the trade-off.
Recent research also points to some inherent limits to the achievable privacy-utility trade-off. In many domains, such as image and text, data distribution tends to be “long tailed.” In a long-tailed distribution, rare types of data points — those that occur only a few times in the entire data set — make up a significant fraction of the entire data distribution. By their nature, differentially private algorithms cannot be very sensitive to individual data points and as a result will not be as accurate on rare types of data. Thus, achieving the same accuracy while preserving privacy will require a larger data set than would learning without the constraints of privacy. This is also a concern for census data. Providing information about small or sparse subpopulations (say, for instance, Alaska Natives) is hard to do while providing strong privacy guarantees. Outliers stick out. This drives home the importance of getting good data and of thinking hard about what information is most important to release about small subpopulations.
On the other hand, an important side benefit of this low sensitivity to individual data points is that differentially private algorithms also do not fit outliers in the data. As a result, they are not prone to “overfitting” to the data set — a phenomenon in which the accuracy of the algorithm is much higher on the training data set than on “unseen” data to which it might be applied. This property has been widely observed empirically and has also been established theoretically. Thus, ensuring differential privacy can reduce the need for other techniques that limit overfitting.
Broadly, research on private data analysis raises two major questions for policymakers:
- What technological solutions can we put in place to stay true to the spirit of current laws and policies, many of which were written when large-scale data collection was less common and privacy research was less advanced?
- How should policy evolve and be formulated going forward to reflect our modern understanding?
In the remainder of this white paper, we break down the challenges facing policymakers. While we focus on examples related to differential privacy, the challenges are not specific to a particular technology or definitional approach.
Challenge 1: Reconciling rigorous measures of leakage with existing legal requirements. Existing laws regulating private information vary enormously in the sophistication and precision with which they discuss sensitive data. For example, US federal regulations define what it means for health information to not be “identifiable”2 (and hence not subject to privacy protections). Under these regulations, it is sufficient to remove identifiers on a specific list — exact addresses, Social Security numbers, etc. — as long as the entity releasing information is not aware that what it is releasing can be used to identify someone “alone or in combination with other information.” The power of linking attacks — and the extent of other available data sets — has grown dramatically in the past two decades. Almost any detailed record can be used to identify someone. Does the regulation’s criterion that the entity releasing the information not be aware that the release can be used to identify someone thus prohibit all or most releases of health-related records? What does it imply about aggregate statistics?
The Family Educational Rights and Privacy Act of 1974 (FERPA) has stood the test of time somewhat better. It allows for the release of information about students (high school grades or college discipline records, say) without consent:
after the removal of all personally identifiable information provided that the educational agency or institution or other party has made a reasonable determination that a student’s identity is not personally identifiable, whether through single or multiple releases, and taking into account other reasonably available information.3
Even this standard is hard to interpret. Nissim et al. argue via formal mathematical modeling that differentially private data releases are consistent with the standard.4 But they point out that the legal standard’s vagueness makes interpretation difficult: it is not clear how it applies to aggregate information and to releases that involve random noise or other probabilistic methods.
The GDPR, the European Union’s recent sweeping data protection regulation, imposes limitations on what sorts of processed information are exempt from privacy protections. Cummings and Desai argue that state-of-the-art machine learning models do not meet the GDPR’s standard: “most recommender systems (and many other types of machine learning models) memoize [emphasis added] individual data entries as they are trained, and thus are not sufficiently anonymized to be GDPR compliant.”5 However, understanding what does comply with the GDPR is far from simple. Cohen and Nissim6, for example, examine the GDPR’s concept of “singling out” (as opposed to identification), providing a precise interpretation but also highlighting its many potential ambiguities given the way modern systems collect data.
FERPA and the GDPR are of course far from the only laws or regulations that mandate privacy requirements in ways that are no longer believed to make sense. Coming to terms with this regulatory patchwork, as well as clarifying the concepts we currently use to discuss private information, is a major challenge.
Challenge 2: Writing new legislation that takes into account the capabilities and limitations of modern privacy attacks and protections. Rather than taking existing laws and regulations as a given, it would be ideal to write new laws that reflect present-day scientific knowledge. At the same time, we need laws that can account for future innovations. How can privacy legislation take into account differential privacy without being too narrowly tailored to it?
When too little noise is added to a computation or to a sequence of computations, it is possible to use the results to draw strong inferences about individuals in the original data. As a result of this fundamental trade-off, any deployment of differential privacy requires a value judgment. What is the proper balance of privacy and accuracy in a particular context?
Better privacy may be desirable for a number of reasons. Ethical obligations to customers might require it. So too might legal obligations to protect data. More broadly, there are societal benefits to living in a world in which appropriate limits have been set on data flows. But accurate computations are also important to individuals and society. Accuracy can lead to clearer scientific insights and better decision-making and can be important to corporate profitability. Only by analyzing the benefits and harms of various achievable privacy-accuracy combinations can one decide which combination is most appropriate for a given setting.
Such trade-offs are not unique to privacy-preserving decision-making, of course. A similar trade-off exists when deciding how much data to collect or how many subjects to enroll in a study. A larger quantity of data is more expensive to gather but will result in higher-accuracy, more-reliable computations on the data. Because it would be difficult and time-consuming to fully analyze the pros and cons of specific data set sizes for every data-based computation, scientists have developed rules of thumb to help them make these decisions. Essentially, scientists have tools to help them estimate how much data they will need in order to get a certain level of confidence in a conclusion, and academic communities have established the levels of confidence they consider worth publishing or worth acting on.
As academics and decision-makers gain expertise in quantifying privacy-accuracy trade-offs, they may develop similar rules of thumb to decide what constitutes an acceptable level of privacy loss in certain types of settings.
Challenge 3: Maintaining usable data for researchers. Social scientists who rely on the census are greatly concerned by the Census Bureau’s use of differential privacy. They worry that differential privacy is an unrealistically ambitious privacy standard that would make it impossible to do the sort of research that has become standard in recent years in subjects such as “poverty, inequality, immigration, internal migration, ethnicity, residential segregation, disability, transportation, fertility, nuptiality, occupational structure, education, and family change.”7 Such research is done using both the decennial census and the American Community Survey (ACS) — a much more detailed Census Bureau survey that covers a sample of the US population and is conducted continuously, as opposed to once every 10 years. They are particularly concerned with the potential application of differential privacy to ACS microdata.
While known differentially private algorithms handle simple summaries reasonably well, challenges remain in the design of algorithms that (a) perform efficiently and accurately for large-scale, complex data sets and (b) produce output that is usable in day-to-day analysis.
Challenge 4: Governance of private data. We now have theoretical understandings of some of the fundamental limits on the number of queries that can be answered on a data set while preserving privacy, and at what level of accuracy. Hence, there will necessarily be tensions among researchers who will all want to use the same data for different purposes. How should these be resolved? Who gets to allocate priorities among researchers from different disciplines and methodological traditions? In the past, databases were seen as an evergreen resource — one person asking a question did not then limit another person’s ability to ask something. This is a fundamental shift in how data are viewed and will require significant structural changes in data governance.
Given the explosion of data in recent years, an increasing awareness of the limitations of earlier approaches to anonymizing data, and a growing realization of the power of differential privacy and related techniques to enable positive applications of data while mitigating negative impacts, the time is ripe for policymakers to work closely with researchers to jointly formulate recommendations for privacy regulations. We hope this white paper will serve as a useful guide to the issues that should be addressed.
Articles and Book Chapters
- “Differential Privacy: A Primer for a Non-Technical Audience,” Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, James Honaker, Kobbi Nissim, David R. O’Brien, Thomas Steinke, and Salil Vadhan, Vanderbilt Journal of Entertainment & Technology Law 21, issue 1 (Fall 2018): 209–76, http://www.jetlaw.org/wp-content/uploads/2018/12/4_Wood_Final.pdf.
- “Differential privacy: an introduction for statistical agencies,” Hector Page, Charlie Cabot, and Kobbi Nissim, December 2018, https://gss.civilservice.gov.uk/wp-content/uploads/2018/12/12-12-18_FINAL_Privitar_Kobbi_Nissim_article.pdf.
- “Privacy and Data-Based Research,” Ori Heffetz and Katrina Ligett, Journal of Economic Perspectives 28, no. 2 (Spring 2014): 75–98, https://www.aeaweb.org/articles?id=10.1257/jep.28.2.75.
- The Ethical Algorithm, Michael Kearns and Aaron Roth (New York: Oxford University Press, 2019), Chapter 1.
- “The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks,” Nicholas Carlini et al., July 2019 preprint, https://arxiv.org/pdf/1802.08232.pdf.
- “Differential Privacy and Social Science: An Urgent Puzzle,” Daniel L. Oberski and Frauke Kreuter, Harvard Data Science Review issue 2.1 (January 2020): https://doi.org/10.1162/99608f92.63a22079.
- “Protecting Privacy With Math,” MinutePhysics in collaboration with the US Census Bureau, September 12, 2019, https://youtu.be/pT19VwBAqKA.
- “Four Facets of Differential Privacy,” videos from a 2016 symposium at the Institute for Advanced Study, https://www.ias.edu/ideas/2016/differential-privacy-symposium.
1. Kobbi Nissim contributed to this white paper some passages from his article with Alexandra Wood et al., “Differential Privacy: A Primer for a Non-Technical Audience,” Vanderbilt Journal of Entertainment & Technology Law 21, issue 1 (Fall 2018): 209–76, http://www.jetlaw.org/wp-content/uploads/2018/12/4_Wood_Final.pdf.↩︎
2. See Electronic Code of Federal Regulations, 45 CFR § 164.514(b), accessed March 30, 2020, https://www.law.cornell.edu/cfr/text/45/164.514.↩︎
3. “34 CFR Part 99—Family Educational Rights and Privacy,” U.S. Department of Education, accessed June 25, 2020, https://studentprivacy.ed.gov/ferpa.↩︎
4. Kobbi Nissim et al., “Bridging the Gap Between Computer Science and Legal Approaches to Privacy,” Harvard Journal of Law & Technology 31, no. 2 (Spring 2018): 687–780, https://jolt.law.harvard.edu/assets/articlePDFs/v31/02.-Article-Wood-7.21.pdf.↩︎
5. Rachel Cummings and Deven Desai, “The Role of Differential Privacy in GDPR Compliance,” FATREC2018, https://piret.gitlab.io/fatrec2018/program/fatrec2018-cummings.pdf.↩︎
6. Aloni Cohen and Kobbi Nissim, “Towards Formalizing the GDPR’s Notion of Singling Out,” forthcoming, preliminary version (April 12, 2019): arXiv:1904.06009 [cs.CY], https://arxiv.org/abs/1904.06009.↩︎
7. See, for example, Steven Ruggles et al., “Differential Privacy and Census Data: Implications for Social and Economic Research,” AEA Papers and Proceedings 109 (May 2019): 403–8, https://doi.org/10.1257/pandp.20191107.↩︎