### Wednesday, December 11th, 2013

The analysis of Big Data often involves selecting a few promising findings out of an extremely large number of potential ones. Such selection may affect our inferences about the significance, the size, and the uncertainty of the selected findings. In this tutorial I shall motivate our interest in this 'selective inference' problem, present the false discovery rate (FDR) as an effective approach to assessing the significance of selected discoveries in large-scale problems, and review the available methodologies for controlling it (and some open problems). I shall then carry the insight gained from FDR into selective estimation and confidence intervals, as well as model selection. I shall end by describing the currently active area of post-model-selection confidence intervals and estimators.
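The Benjamini-Hochberg step-up procedure at the heart of the FDR approach is compact enough to sketch in code (a plain-Python illustration of the standard procedure, not material from the tutorial itself):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Sort the m p-values, find the largest k such that the k-th smallest
    p-value is at most q*k/m, and reject the hypotheses with the k
    smallest p-values.  Controls the FDR at level q for independent tests.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k = rank
    return set(order[:k])  # indices of the rejected hypotheses

# Five p-values pass the naive 0.05 cutoff, but only two survive
# the multiplicity correction.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.5, 0.74, 0.9]
rejected = benjamini_hochberg(pvals, q=0.05)  # {0, 1}
```

Note the selective flavor: which hypotheses are rejected depends on the whole ensemble of p-values, not on each one in isolation.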

Models of human mobility have broad applicability in urban planning, ecology, epidemiology, and other fields. Starting with Call Detail Records (CDRs) from a cellular telephone network that have gone through a straightforward anonymization procedure, the prior WHERE modeling approach produces synthetic CDRs for a synthetic population. The accuracy of WHERE has been validated against billions of location samples for hundreds of thousands of cell phones in the New York and Los Angeles metropolitan areas. In this work, we introduce DP-WHERE, which modifies WHERE by adding controlled noise to achieve differential privacy. We present experiments showing that the accuracy of DP-WHERE remains close to that of WHERE and of real CDRs. With this work, we aim to enable the creation and possible release of synthetic models that capture the mobility patterns of real metropolitan populations while preserving privacy.

This is joint work with Darakhshan Mir, Sibren Isaacman, Ramón Cáceres, and Margaret Martonosi, and appears in the 2013 IEEE International Conference on Big Data.

Real-time aggregate statistics of data collected from individuals can be shared to enable many applications such as disease surveillance and traffic monitoring. However, it must be ensured that the privacy of individuals is not compromised. While differential privacy has emerged as a de facto standard for private data analysis, directly applying the differential privacy mechanisms on time-series has limited utility due to high correlations between data values. In this talk, I will present FAST, a novel Filtering and Adaptive Sampling based framework for monitoring aggregate Time-series under differential privacy. FAST adaptively samples long time-series according to detected data dynamics and simultaneously uses filtering techniques to dynamically predict and correct released data values. I will present experimental studies using real datasets demonstrating the feasibility and benefit of FAST and conclude with open questions.
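As a rough illustration of the sampling-plus-filtering idea, here is a toy sketch with a fixed sampling rate and a fixed correction gain; the actual FAST framework adapts the sampling rate to detected data dynamics and uses a Kalman filter for prediction and correction:

```python
import random

def fast_release(series, epsilon, sample_every=3, gain=0.5, seed=0):
    """Toy FAST-style release: spend privacy budget only at sampled
    steps, and in between release the filter's prediction.

    Values are assumed to be counts with sensitivity 1, so each sampled
    step gets Laplace noise of scale (number of samples)/epsilon.
    """
    rng = random.Random(seed)
    n_samples = -(-len(series) // sample_every)   # ceil: number of noisy samples
    scale = n_samples / epsilon                   # budget split across samples
    estimate, released = 0.0, []
    for t, value in enumerate(series):
        if t % sample_every == 0:
            # Laplace noise as the difference of two exponentials.
            noisy = value + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
            estimate = estimate + gain * (noisy - estimate)   # correction step
        released.append(estimate)                             # prediction otherwise
    return released
```

Sampling less often leaves more budget per sample (less noise per release) at the cost of staleness between samples; adapting this tradeoff to the data dynamics is the core of FAST.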

We propose a private query release mechanism, related to private boosting, that solves the "dual" of the query release linear program. It is known that no private query release algorithm capable of accurately answering large numbers of general queries can run in polynomial time in the worst case, and our algorithm is no exception. The difference in our approach compared to private multiplicative weights is that the computational difficulty does not stem from needing to represent a large state, but instead from the need to solve an NP-hard problem. Importantly, the hard problem that needs to be solved to run our algorithm does not need to be solved privately. As a result, we can attempt to solve it using existing optimization algorithms: we express it as an integer program, and solve it using CPLEX. We find that in practice, this hard problem is not too hard, and that our approach scales to high-dimensional data on real datasets. We report the results of recent experiments we have run to privately compute marginal queries on real datasets.

Stochastic gradient descent (SGD) methods are very popular for large scale optimization and learning because they are scalable and have good theoretical guarantees. In particular, they are robust to noise, which is particularly appealing in the context of privacy because it leads to a natural optimization method using noisy gradients. However, from a practical perspective using SGD is often an art, since the user must carefully pick parameters to yield good performance on real data. In this talk I will discuss how minibatching gives practical improvements for differentially private SGD and how that interacts with the step size of the algorithm.
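A minimal sketch of the kind of noisy minibatch SGD under discussion, here for logistic regression; mapping the noise multiplier `sigma` to a formal (eps, delta) guarantee via a composition analysis is omitted:

```python
import numpy as np

def dp_sgd_logistic(X, y, epochs=5, batch_size=25, lr=0.5, clip=1.0,
                    sigma=0.1, seed=0):
    """Noisy minibatch SGD for logistic regression (illustrative sketch).

    Each per-example gradient is clipped to L2 norm `clip`, the minibatch
    sum is perturbed with Gaussian noise, and the average is applied.
    The effective noise per step scales like sigma*clip/batch_size, which
    is why larger minibatches help in practice.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            margins = X[batch] @ w
            residual = 1.0 / (1.0 + np.exp(-margins)) - y[batch]
            grads = residual[:, None] * X[batch]           # per-example gradients
            norms = np.linalg.norm(grads, axis=1, keepdims=True)
            grads = grads / np.maximum(1.0, norms / clip)  # clip each example
            noise = rng.normal(0.0, sigma * clip, size=d)  # one draw per minibatch
            w = w - lr * (grads.sum(axis=0) + noise) / len(batch)
    return w
```

The interaction with the step size mentioned in the talk shows up directly here: the update applies `lr` to a gradient whose noise level depends on the minibatch size.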

Joint work with Shuang Song and Kamalika Chaudhuri.

### Thursday, December 12th, 2013

We describe a framework for designing efficient active learning algorithms that are tolerant to random classification noise and differentially-private. The framework is based on active learning algorithms that are statistical in the sense that they rely on estimates of expectations of functions of filtered random examples.

We show that any efficient active statistical learning algorithm can be automatically converted to an efficient active learning algorithm which is tolerant to random classification noise. We show that commonly studied concept classes including thresholds, rectangles, and linear separators can be efficiently actively learned in our framework. These results combined with our generic conversion lead to the first computationally-efficient algorithms for actively learning some of these concept classes in the presence of random classification noise that provide exponential improvement in the dependence on the error parameter over their passive counterparts. In addition, we show that our algorithms can be automatically converted to efficient active differentially-private algorithms. This leads to the first differentially-private active learning algorithms with exponential label savings over the passive case.

In this talk we establish a connection between some notions of algorithmic stability and differential privacy. The main thesis is that a stable algorithm (under certain notions of stability) can be transformed into a differentially private algorithm with good utility guarantee. In particular, we discuss two notions of stability: i) perturbation stability, and ii) sub-sampling stability. Based on these notions of stability, we provide two generic approaches for the design of differentially private algorithms.

In the second part of the talk, we apply the generic approaches designed in the first part to the problem of sparse linear regression in high dimensions. We show that one can design differentially private model (feature) selection algorithms for this problem. Moreover, these algorithms have (nearly) optimal sample complexity. We use the celebrated LASSO estimator as our basic building block.

Based on joint work with Adam Smith.

Differential privacy is a cryptographically motivated definition of privacy which has gained considerable attention in the algorithms, machine-learning and data-mining communities. While there has been an explosion of work on differentially private machine learning algorithms, an important barrier to achieving end-to-end differential privacy in practical machine learning applications is the lack of an effective procedure for differentially private parameter tuning, i.e., determining the parameter value (such as a regularization parameter for SVMs or bin size for histograms) that is suitable for a particular application.

In this paper, we introduce a generic validation procedure for differentially private machine learning algorithms that applies when a certain stability condition holds on the training algorithm and the validation performance metric. The training data size and the privacy budget used for training in our procedure are independent of the number of parameter values searched over. We apply our generic procedure to two fundamental tasks in statistics and machine learning, training a regularized linear classifier and building a histogram density estimator, resulting in end-to-end differentially private solutions for these problems.

Based on joint work with Staal Vinterbo.

Following up on the overview talk on stability and privacy, I will discuss connections between another notion of stability, called forward stability, and privacy. Specifically, we will see how differentially private learning algorithms can be converted to **online** learning algorithms via a generalization of the "follow the perturbed leader" framework. This leads to new algorithms for private analysis, as well as to new online learning algorithms.

Based on work to appear at NIPS 2013.

Viewing differential privacy through the operational lens of hypothesis testing, we derive tight data processing inequalities. Applications include optimal composition theorems that quantify the privacy degradation level during interactive querying and the optimality of the staircase privacy mechanism in several canonical settings.
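The operational view can be made concrete: any test that tries to decide between two neighboring databases from the mechanism's output has jointly bounded false-alarm and missed-detection probabilities. The following small illustration computes that tradeoff curve (a standard consequence of the definition, not the talk's new results):

```python
import math

def min_type2_error(alpha, eps, delta=0.0):
    """Smallest missed-detection probability achievable by any test that
    distinguishes two neighboring databases from the output of an
    (eps, delta)-DP mechanism, at false-alarm level alpha."""
    return max(0.0,
               1.0 - delta - math.exp(eps) * alpha,
               math.exp(-eps) * (1.0 - delta - alpha))

# Under eps = 0 the output is independent of the data, so the best the
# adversary can do is blind guessing: alpha + beta = 1.
```

Composition then corresponds to the degradation of this tradeoff curve under repeated querying, which is exactly what optimal composition theorems quantify.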

### Friday, December 13th, 2013

Principal components analysis is a popular dimension reduction method with wide applications. In high dimensional data analysis, sparse PCA offers simultaneous dimension reduction and variable selection. This talk will report some recent developments in sparse PCA with a special focus on subspace estimation. The results include minimax rates of estimation, a new convex relaxation, and some statistical properties of this new method such as matrix norm convergence and variable selection consistency. Some possible future research topics related to differential privacy will also be mentioned.

We compare the sample complexity of private learning and sanitization tasks under pure $\epsilon$-differential privacy and approximate $(\epsilon, \delta)$-differential privacy. We show that the sample complexity of these tasks under approximate differential privacy can be significantly lower than that under pure differential privacy.

We define a family of optimization problems, which we call Concave Promise Problems, that generalizes some of the tasks we consider. We observe that a concave promise problem can be privately approximated using a solution to a smaller instance of a concave promise problem. This allows us to construct an efficient recursive algorithm solving such problems privately.

Joint work with Amos Beimel and Uri Stemmer.

We review models and algorithms for private analysis of massive dynamic data. We first consider algorithms that satisfy the combined constraints of the streaming model and differential privacy: they make a constant number of passes over the data, keep sublinear space, and output a differentially private statistic. Then we consider a natural further requirement for dynamic data: that the algorithm continuously updates a statistic as new data arrives, while keeping the entire sequence of updates private. Finally, we explore the connections between streaming algorithms and pan-private algorithms, i.e. algorithms whose memory state is private, so that their guarantees hold even in the face of a security breach. While the strong pan-privacy requirement forbids explicitly storing sketches of the data, we show we can store sufficient statistics on a sketch in order to estimate interesting quantities, such as the number of distinct elements or the number of heavy hitters.
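A toy version of a pan-private distinct-elements estimator conveys the idea of a noisy memory state (a hypothetical simplification that stores one randomized bit per universe element; the algorithms in the talk hash down to sublinear space):

```python
import random

def pan_private_distinct(stream, universe_size, alpha=0.25, eps=1.0, seed=0):
    """Toy pan-private distinct-count estimator.

    One randomized bit per universe element: initialized Bernoulli(1/2),
    redrawn as Bernoulli(1/2 + alpha) whenever that element arrives.
    The memory state consists only of noisy bits, so an intrusion learns
    little about who appeared, and repeat arrivals do not change a bit's
    distribution.  The released count adds Laplace(1/eps) noise.
    """
    rng = random.Random(seed)
    bits = [rng.random() < 0.5 for _ in range(universe_size)]
    for x in stream:
        bits[x] = rng.random() < 0.5 + alpha
    ones = sum(bits)
    laplace = rng.expovariate(eps) - rng.expovariate(eps)
    # E[ones] = universe_size/2 + alpha * (#distinct elements), so invert:
    return (ones - universe_size / 2 + laplace) / alpha
```

The estimate is unbiased but noisy; the bias-free redraw is what makes the state itself, not just the output, private.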

Privacy definitions provide ways of trading off the privacy of individuals in a statistical database for the utility of downstream analysis of the data. Recently we proposed Pufferfish, a general framework for privacy definitions that provides data publishers with very fine-grained control over the privacy definition. In this talk I will describe a sub-class of Pufferfish privacy definitions that allows data publishers to extend differential privacy using a policy, which specifies (a) secrets, or information that must be kept secret, and (b) constraints that may be known about the data. Policies help data publishers explore a larger space of privacy-utility tradeoffs while allowing composable mechanisms. I will formalize policies and present novel algorithms that can handle general specifications of sensitive information and certain count constraints. I will also briefly describe applications of policies that are being implemented in the US Census, and conclude with directions for future work.

Empirical risk minimization (ERM) is a canonical learning algorithm for learning a hypothesis with small generalization error. Moreover, there exist several generic results that characterize the generalization error in terms of the hypothesis space and other problem-specific characteristics.

However, most of these results focus on non-private learning where the learned hypothesis might reveal significant information about individual points. In this talk, we will survey some of the recent results for privacy-preserving ERM. In particular, we will discuss privacy and generalization error guarantees of these methods and also discuss their pros/cons in terms of theoretical bounds as well as empirical results.

The talk is based on joint works with Abhradeep Guha Thakurta and Pravesh Kothari.

We consider the power of linear reconstruction attacks in statistical data privacy, showing that they can be applied to a much wider range of settings than previously understood. Consider a database curator who manages a database of sensitive information but wants to release statistics about how a sensitive attribute (say, disease) in the database relates to some nonsensitive attributes (e.g., postal code, age, gender, etc). We show one can mount linear reconstruction attacks based on any release that gives: a) the fraction of records that satisfy a given non-degenerate boolean function. Such releases include contingency tables as well as more complex outputs like the error rate of certain classifiers such as decision trees; b) any one of a large class of M-estimators including the standard estimators for linear and logistic regression.

We make two contributions: first, we show how these types of releases can be transformed into a linear format, making them amenable to existing polynomial-time reconstruction algorithms. Second, we show how to analyze the resulting attacks under various distributional assumptions on the data.
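A small self-contained demonstration of the first contribution, with hypothetical data: subset-count answers with bounded per-answer noise are already in linear form, and ordinary least squares followed by rounding recovers the hidden bits:

```python
import numpy as np

def reconstruction_attack(queries, answers):
    """Estimate a hidden 0/1 attribute vector from noisy linear query
    answers (queries @ secret, plus bounded noise) by least squares
    and rounding."""
    estimate, *_ = np.linalg.lstsq(queries, answers, rcond=None)
    return (estimate > 0.5).astype(int)

# Hypothetical demo: 50 users with one secret bit each; the curator
# answers 400 random subset-count queries, each distorted by noise of
# magnitude at most 2 (well below sqrt(n)).
rng = np.random.default_rng(1)
n, m = 50, 400
secret = rng.integers(0, 2, n)
Q = rng.integers(0, 2, (m, n)).astype(float)
noisy_answers = Q @ secret + rng.uniform(-2.0, 2.0, m)
recovered = reconstruction_attack(Q, noisy_answers)
```

The point of the paper is that many seemingly nonlinear releases (classifier error rates, M-estimators) can be massaged into exactly this linear format before running such an attack.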

Joint work with Mark Rudelson and Adam Smith.

### Saturday, December 14th, 2013

We discuss differentially private algorithms for analysis of network data. Such algorithms work on data sets that contain sensitive relationship information (for example, romantic ties). There are two natural variants of differential privacy for graphs: edge differential privacy and node differential privacy. We will survey results for both notions and then focus on the stronger of the two—node differential privacy. We present several techniques for designing node differentially private algorithms, based on combinatorial analysis, network flow, and linear and convex programming. We also discuss the accuracy of such algorithms on realistic networks, showing theoretical and empirical results.
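As a minimal illustration of the projection idea for node privacy, here is a naive greedy truncation to bounded degree; the flow- and LP-based projections from the talk are more refined, and the stability analysis of the projection step, glossed over here, is where the real work lies:

```python
import random

def node_private_edge_count(edges, degree_bound, eps, seed=None):
    """Sketch: count edges under node differential privacy.

    Greedily drop edges once an endpoint reaches `degree_bound`; in the
    resulting bounded-degree graph one node touches at most
    `degree_bound` edges, suggesting Laplace noise of scale
    degree_bound/eps.  (The greedy truncation itself is order-sensitive,
    so this calibration is only heuristic.)
    """
    rng = random.Random(seed)
    degree, kept = {}, 0
    for u, v in edges:
        if degree.get(u, 0) < degree_bound and degree.get(v, 0) < degree_bound:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
            kept += 1
    # Laplace noise as the difference of two exponentials.
    noise = rng.expovariate(eps / degree_bound) - rng.expovariate(eps / degree_bound)
    return kept + noise
```

The contrast with edge privacy is visible in the noise scale: under edge privacy the sensitivity of the edge count is 1, while under node privacy it grows with the degree bound.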

In this talk, I will survey some fruitful connections between convex geometry and differential privacy for linear queries. The connection is established through a polytope that captures the sensitivity of a set of queries. I will describe several techniques on using the geometry of sensitivity polytope to bound (both upper and lower) the noise level needed to achieve DP. This leads to nearly optimal mechanisms for linear queries, under both pure and approximate privacy definition.
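For concreteness, the basic object is easy to compute: the sensitivity polytope of a query matrix $W$ is generated by its columns (and their negations), and its $\ell_1$ radius calibrates the basic Laplace mechanism (a standard baseline, not the near-optimal mechanisms from the talk):

```python
import numpy as np

def l1_sensitivity(W):
    """L1 sensitivity of the linear query map x -> W x when neighboring
    histograms differ by 1 in a single count: the largest column L1
    norm of W."""
    return np.abs(W).sum(axis=0).max()

def laplace_mechanism(W, x, eps, rng):
    """Answer all queries with i.i.d. Laplace noise scaled to the L1
    sensitivity: eps-differentially private, but suboptimal for
    correlated queries, which is where the geometry of the sensitivity
    polytope pays off."""
    scale = l1_sensitivity(W) / eps
    return W @ x + rng.laplace(0.0, scale, size=W.shape[0])
```

The geometric mechanisms surveyed in the talk improve on this baseline by shaping the noise to the polytope rather than adding it independently per query.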

This talk surveys a surprising connection between Matrix Completion and privacy-preserving singular vector computation. Matrix Completion is the problem of finding the dominant singular vectors of an unknown matrix from a subsample of its entries. Alternating Minimization is a widely used and empirically successful heuristic for Matrix Completion. We will cast Alternating Minimization as a noisy instance of the well-known Subspace Iteration algorithm (aka Power Method). Building on a new robust convergence analysis of Subspace Iteration, we prove the strongest known sample complexity bounds in the Alternating Minimization framework. Surprisingly, this analysis of Subspace Iteration also leads to an algorithm for approximating the top singular vectors of a given matrix while achieving the privacy guarantee known as Differential Privacy. The algorithm achieves nearly optimal error rates together with a nearly linear running time. But the connection between the two problems goes even further. In both settings the fundamental parameter that enters the picture is the coherence of the underlying matrix—a measure of how delocalized the singular vectors of the matrix are. In fact, our analysis of Alternating Minimization uses ideas for reasoning about coherence that were developed in the privacy setting.
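A bare-bones sketch of noisy Subspace Iteration for the top singular vector is below; the noise level is hand-picked, and relating it to a formal $(\epsilon, \delta)$ guarantee, and to the coherence of the matrix, is exactly the content of the analysis described above and is omitted here:

```python
import numpy as np

def private_top_singular_vector(A, noise_scale, iters=50, seed=0):
    """Noisy Subspace Iteration (Power Method) for the top right
    singular vector: every multiplication by A^T A is perturbed with
    Gaussian noise, then the iterate is renormalized."""
    rng = np.random.default_rng(seed)
    cov = A.T @ A
    v = rng.normal(size=A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = cov @ v + rng.normal(0.0, noise_scale, size=v.shape)
        v /= np.linalg.norm(v)
    return v
```

Matrix Completion enters by viewing Alternating Minimization as this same iteration, with the "noise" coming from subsampling the matrix entries rather than from a privacy mechanism.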

We consider the problem of privately releasing a low dimensional approximation to a set of data records, represented as a matrix $A$ with each row corresponding to a user and each column to an attribute. Our primary goal is to compute a subspace that captures the covariance of $A$ as much as possible, a task classically known as principal component analysis (PCA). We aim to protect the privacy of each individual: we assume that each row of $A$ has unit $\ell_2$ norm, and privacy is defined with respect to any single row change. We show that the well-known randomized response algorithm, with properly tuned parameters, can be both private and achieve a nearly optimal additive quality gap, compared to the best possible singular subspace of $A$. We further show that when $A^TA$ has a large gap between its eigenvalues (a reason often cited for PCA), the randomized response algorithm may perform much better. For the upper bounds, we utilize results from random matrix theory and matrix perturbation theory. Our lower bound is inspired by the recent work of Bun, Ullman, and Vadhan on applying Tardos's fingerprinting codes to constructing hard instances for private mechanisms. In order to extend their technique to subspace approximation, we first prove a stronger property of Tardos codes, and then we show how to find a useful vector hidden in a subspace, via a game which we call the list culling game. Both of these might be of independent interest.
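One way to realize a noise-addition mechanism of this kind is to perturb the covariance matrix with a symmetric Gaussian matrix before eigendecomposition (a sketch under the unit-row-norm assumption; whether this matches the talk's exact tuning of randomized response is an assumption of this sketch):

```python
import numpy as np

def private_pca_subspace(A, k, sigma, seed=0):
    """Noise-addition sketch for private PCA: perturb A^T A with a
    symmetric Gaussian matrix and return the top-k eigenvectors.
    Assumes rows of A have unit L2 norm, so one row change moves A^T A
    by a bounded amount; calibrating sigma to an (eps, delta) guarantee
    via the Gaussian mechanism is omitted."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    noise = rng.normal(0.0, sigma, (d, d))
    noise = (noise + noise.T) / np.sqrt(2)          # symmetrize
    vals, vecs = np.linalg.eigh(A.T @ A + noise)
    return vecs[:, np.argsort(vals)[::-1][:k]]      # top-k eigenvectors
```

Matrix perturbation theory then bounds how far the noisy eigenvectors can drift from the true singular subspace, with the eigenvalue gap controlling the bound.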

We then consider the online setting and show that by combining the randomized response mechanism with the well-known follow-the-perturbed-leader (FPL) algorithm, we can obtain a private online algorithm with nearly optimal regret. The regret of our algorithm actually outperforms all previously known online FPL algorithms, even in the non-private setting. We achieve this better bound by, satisfyingly, borrowing insights and tools from differential privacy!

Joint work with Cynthia Dwork, Abhradeep Thakurta and Li Zhang.