I will discuss some new hypothesis testing problems on random graph models. A popular example is the problem of detecting community structure in a graph. Here we will consider more exotic situations, such as testing one of the basic assumptions in social network analysis: whether the graph has "geometric structure". We will also talk about dynamical models of random graphs (such as preferential attachment), and how to test different hypotheses about the "history" of these graphs.

### Monday, December 15th, 2014

The area of inference under shape constraints (that is, inference about a probability distribution under the constraint that its density function satisfies certain qualitative properties) is a classical topic in statistics. Shape-restricted inference is well motivated in its own right, and has seen a recent surge of research activity, in part due to the ubiquity of structured distributions in the natural sciences. The hope is that, under such structural constraints, the quality of the resulting estimators may improve dramatically, both in terms of sample size and in terms of computational efficiency.

In this talk, we will describe a framework that yields new, provably efficient estimators for several natural and well-studied classes of distributions. Our approach relies on a single, unified algorithm that provides a comprehensive picture of the sample and computational complexities for fundamental inference tasks, including density estimation and hypothesis testing.

The talk will be based in part on the following papers: http://www.iliasdiakonikolas.org/papers/piecewise-poly-learning.pdf, and http://www.iliasdiakonikolas.org/papers/identity-testing-structured.pdf.

We show how sampling techniques allow processors in a network to build and update a minimum spanning tree using polylogarithmic bits of communication per processor, with nearly no local memory requirements; we assume only that each processor knows its neighbors. This is the first method to build such a tree without requiring communication across every edge, and the first to process an edge deletion or insertion without requiring communication across every edge or requiring each processor to keep a copy of the tree.

In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR), the expected fraction of false discoveries among all discoveries, is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. This talk introduces the knockoff filter, a new variable selection procedure that controls the FDR in the statistical linear model whenever there are at least as many observations as variables. The method achieves exact FDR control in finite-sample settings regardless of the design, the number of variables in the model, or the amplitudes of the unknown regression coefficients, and it does not require any knowledge of the noise level. As the name suggests, the method operates by manufacturing knockoff variables that are cheap (their construction requires no new data) and that are designed to mimic the correlation structure found within the existing variables, in a way that allows for accurate FDR control, beyond what is possible with permutation-based methods. The method of knockoffs is very general and flexible, and can work with a broad class of test statistics. We test the method in combination with statistics from the Lasso for sparse regression, and obtain empirical results showing that the resulting method has far more power than existing selection rules when the proportion of null variables is high. We also apply the knockoff filter to HIV data with the goal of identifying mutations associated with a form of resistance to treatment plans.

This is joint work with Rina Foygel Barber.
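As a rough illustration of the selection step in knockoff-type procedures (not code from the paper, and with made-up statistic values): once a knockoff statistic W_j has been computed for each variable, large positive values suggest genuine signals, signs are symmetric under the null, and a data-dependent threshold delivers the FDR guarantee.

```python
import numpy as np

def knockoff_threshold(W, q):
    """Data-dependent threshold of the 'knockoff+' selection rule.

    W : array of knockoff statistics, one per variable.
    q : target false discovery rate.
    """
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        # Under the null, signs of W_j are symmetric, so the count of
        # W_j <= -t estimates the number of false positives at level t.
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf

# Select variables whose statistic exceeds the threshold (toy values).
W = np.array([5.0, 4.2, 3.9, 3.5, 3.1, 2.8, -0.3, 0.2])
t = knockoff_threshold(W, q=0.2)
selected = np.where(W >= t)[0]
```

If no threshold achieves the target rate, the rule returns infinity and nothing is selected, which is what keeps the guarantee exact in finite samples.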

**Location:** Calvin Lab 116

## Effective-Resistance-Reducing Flows, Spectrally Thin Trees and Asymmetric TSP

We show that the integrality gap of the natural LP relaxation of the Asymmetric Traveling Salesman Problem is at most polyloglog(n). To prove this, we show that any k-edge-connected graph (for k=Omega(log(n))) has a polylog(k)/k thin tree. In this talk I will highlight the main ideas of our proof.

This is joint work with Nima Anari.

A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of holdout sets and sophisticated cross-validation techniques, to procedures for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of science: the theory assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas science is by definition an adaptive process, in which data are shared and re-used, and hypotheses and new studies are generated on the basis of data exploration and previous outcomes.

Surprisingly, the challenges of adaptivity can be addressed using insights from differential privacy, a rigorous notion of privacy developed for the analysis of sensitive data. As a corollary, we show how to safely reuse a holdout set a great many times without undermining its validation power, even when hypotheses and computations are chosen adaptively. Armed with this technique, the analyst is free to explore the data ad libitum, generating and evaluating hypotheses, verifying results on the holdout set, and backtracking as needed.

Joint work with Cynthia Dwork, Vitaly Feldman, Toni Pitassi, Omer Reingold and Aaron Roth.
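As an illustrative sketch of this idea (parameter names and values are ours, not the talk's), a reusable holdout in the style of the Thresholdout mechanism answers each adaptive query from the training set when the training and holdout means agree up to a noisy threshold, and only "spends" the holdout otherwise:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01):
    """Answer adaptively chosen queries about a holdout set.

    Each query is a function phi mapping a data point to [0, 1]; we return
    an estimate of its holdout mean. If the training and holdout means agree
    to within a Laplace-noised threshold, we answer with the (cheap) training
    mean, so the holdout is not consumed; otherwise we return a noised
    holdout mean.
    """
    def query(phi):
        train_mean = np.mean([phi(x) for x in train_vals])
        holdout_mean = np.mean([phi(x) for x in holdout_vals])
        if abs(train_mean - holdout_mean) > threshold + rng.laplace(0, sigma):
            return holdout_mean + rng.laplace(0, sigma)
        return train_mean
    return query
```

The Laplace noise is what connects this to differential privacy: because each answer reveals only a noisy, thresholded function of the holdout, many adaptive queries can be answered before the holdout's validation power degrades.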

Varying coefficient models have been successfully applied in a number of scientific areas ranging from economics and finance to the biological and medical sciences. They allow for flexible, yet interpretable, modeling when traditional parametric models are too rigid to explain heterogeneity across sub-populations. Currently, as a result of technological advances, scientists are collecting large amounts of high-dimensional data from complex systems, which require new analysis techniques. We focus on the high-dimensional linear varying-coefficient model and develop a novel procedure for estimating the coefficient functions in the model based on penalized local linear smoothing. Our procedure works in regimes where the number of explanatory variables may be much larger than the sample size, allows arbitrary heteroscedasticity in the residuals, and is robust to model misspecification as long as the model can be approximated by a sparse model. We further derive an asymptotic distribution for the normalized maximum deviation of the estimated coefficient function from the true coefficient function. This result can be used to test hypotheses about a particular coefficient function of interest (for example, whether the coefficient function is constant) as well as to construct confidence bands covering the true coefficient function. Construction of the uniform confidence bands relies on a double selection technique that guards against omitted variable bias arising from potential model selection mistakes. We demonstrate how these results can be used to make inference in high-dimensional dynamic graphical models.

Joint work with Damian Kozbur.
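To fix ideas, here is a minimal, hypothetical sketch of local linear smoothing in a varying-coefficient model y_i = X_i' beta(t_i) + noise; for simplicity it uses a ridge penalty in place of the sparsity-inducing penalty and double selection machinery of the talk:

```python
import numpy as np

def local_linear_coef(t0, t, X, y, h, lam=1e-4):
    """Estimate beta(t0) by kernel-weighted least squares.

    Locally around t0, beta(t) is approximated by the linear expansion
    b + c * (t - t0); observations are weighted by a Gaussian kernel with
    bandwidth h, and a small ridge penalty lam stabilizes the solve
    (a stand-in for the sparsity penalty used in the talk).
    """
    n, p = X.shape
    w = np.exp(-0.5 * ((t - t0) / h) ** 2)        # kernel weights
    Z = np.hstack([X, X * (t - t0)[:, None]])     # intercept/slope design
    G = Z.T @ (w[:, None] * Z) + lam * np.eye(2 * p)
    coef = np.linalg.solve(G, Z.T @ (w * y))
    return coef[:p]                               # beta-hat(t0)

# Sanity check on synthetic data with beta1(t) = t and beta2(t) = 1.
rng = np.random.default_rng(3)
t = rng.uniform(0, 1, 500)
X = rng.normal(size=(500, 2))
y = t * X[:, 0] + X[:, 1]                         # noise-free for illustration
b = local_linear_coef(0.5, t, X, y, h=0.1)
```

Because the local linear expansion is exact for coefficient functions that are linear in t, the estimate at t0 = 0.5 recovers (0.5, 1) up to the negligible ridge shrinkage.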

### Tuesday, December 16th, 2014

We continue the study of streaming interactive proofs (SIPs). In this setting, a client (verifier) needs to process a massive stream of data, arriving online, but is unable to store even a small fraction of the data. It outsources the processing to a commercial cloud computing service (prover), but is unwilling to blindly trust answers returned by this service. Thus, the service must both supply the final answer and convince the verifier of its correctness.

Our primary objects of study are "barely interactive" SIPs. Specifically, we show that constant-round SIPs are surprisingly powerful, by giving polylogarithmic-cost two- and three-round protocols for several "query" problems, including Index, Nearest Neighbor Search, and Range Counting. Previous work had suggested that such protocols were impossible.

On the other hand, to study the **limitations** of constant-round SIPs, we introduce a new hierarchy of communication models that we call "Online Interactive Proofs" (OIPs). We give upper and lower bounds that (1) characterize every finite level of the OIP hierarchy in terms of previously studied communication complexity classes, and (2) separate the first four levels of the hierarchy. Our study of OIPs reveals marked contrasts, and some parallels, with the classical Turing machine theory of interactive proofs.

Joint work with Amit Chakrabarti, Graham Cormode, Andrew McGregor, and Suresh Venkatasubramanian.

Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. We propose a communication-efficient framework, COCOA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that, compared to state-of-the-art mini-batch versions of the SGD and SDCA algorithms, COCOA reaches the same .001-accurate solution quality on average 25× more quickly.

This is joint work with Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan.

Linear Gaussian covariance models are Gaussian models with linear constraints on the covariance matrix. Such models arise in many applications, such as stochastic processes estimated from repeated time series data, Brownian motion tree models used for phylogenetic analyses, and network tomography models used for analyzing connections in the Internet. Maximum likelihood estimation in this model class leads to a non-convex optimization problem that typically has many local maxima. However, I will explain that the log-likelihood function is in fact concave over a large region of the positive definite cone. Using recent results on the asymptotic distribution of extreme eigenvalues of the Wishart distribution, I will show that running any hill-climbing method in this region leads to the MLE with high probability.

In the turnstile model of data streams, an underlying vector x in {-m, -m+1, ..., m-1, m}^n is presented as a long sequence of arbitrary positive and negative integer updates to its coordinates. A randomized algorithm seeks to approximate a function f(x) with constant probability while making only a single pass over this sequence of updates and using a small amount of space. All known algorithms in this model are linear sketches: they sample a matrix A from a distribution over integer matrices in a preprocessing phase, and maintain the linear sketch Ax while processing the stream. At the end of the stream, they output an arbitrary function of Ax. One cannot help but ask: are linear sketches universal?

In this work we answer this question by showing that any 1-pass constant probability streaming algorithm for approximating an arbitrary function f of x in the turnstile model can also be implemented by sampling a matrix A from the uniform distribution on O(n log m) integer matrices, with entries of magnitude poly(n), and maintaining the linear sketch Ax. Furthermore, the logarithm of the number of possible states of Ax, as x ranges over {-m, -m+1, ..., m}^n, plus the amount of randomness needed to store A, is at most a logarithmic factor larger than the space required by the space-optimal algorithm. Our result shows that to prove space lower bounds for 1-pass streaming algorithms, it suffices to prove lower bounds in the simultaneous model of communication complexity, rather than the stronger 1-way model. Moreover, the fact that we can assume we have a linear sketch with polynomially bounded entries further simplifies existing lower bounds; e.g., for frequency moments we present a simpler proof of the tilde{Omega}(n^{1-2/k}) bit complexity lower bound without using communication complexity.

Joint work with Yi Li and Huy L. Nguyen.
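The linearity that makes this model so clean is easy to see in code. The sketch below (a generic random sign matrix, purely illustrative) maintains Ax under turnstile updates without ever storing x, and the sketch of a concatenation of two streams is the sum of their sketches:

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 1000, 50                       # dimension of x, rows of the sketch
A = rng.choice([-1, 1], size=(k, n))  # one random linear sketch matrix

def process_stream(updates):
    """Maintain Ax over a turnstile stream of (coordinate, signed delta) pairs.

    Linearity is the whole point: the update x_i += delta changes the
    sketch by delta * A[:, i], so x itself is never stored.
    """
    sketch = np.zeros(k)
    for i, delta in updates:
        sketch += delta * A[:, i]
    return sketch

# The sketch of a stream equals A times the final vector x.
updates = [(3, 5), (7, -2), (3, -5), (10, 1)]
x = np.zeros(n)
for i, d in updates:
    x[i] += d
assert np.allclose(process_stream(updates), A @ x)
```

Composability under concatenation and deletion is exactly what makes linear sketches so natural for the turnstile model, and the result above says that, up to small factors, nothing more general is possible.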

### Wednesday, December 17th, 2014

Although graphs and matrices are popular ways to model data drawn from many application domains, the size scale, sparsity properties, noise properties, and many other aspects of graphs and matrices arising in realistic machine learning and data analysis applications present fundamental challenges for traditional matrix and graph algorithms. Informally, this is rooted in the observation that small-scale or local structure in many data sets is often better behaved or more meaningful than large-scale or global structure, which is exactly the opposite of what many traditional asymptotic tools implicitly assume. For example, there might be meaningful small-scale clusters in social networks that are tree-like or expander-like at large size scales. Here, I will describe how eigenvector localization can be used to justify these sorts of claims; how localization can be engineered in a principled way into popular matrix and graph algorithms in order to characterize local properties in large data sets much more generally; and how seemingly minor details in the approximate computation of localized (as well as non-localized) objectives can make a large difference in the usefulness of a given method in downstream applications. A common theme will be the critical role of what can be called implicit regularization: regularization arising not in the usual sense of explicitly adding a regularization term and solving the modified objective exactly, but as an implicit side effect of the heuristics often used to make approximation algorithms scale well. A related question is: what objective function does an approximation algorithm implicitly solve exactly?

This talk discusses a general technique for lower bounding Bayes risks under arbitrary prior distributions and loss functions. A lower bound on the Bayes risk not only serves as a lower bound on the minimax risk, but also characterizes the fundamental statistical difficulty of a decision problem under a given prior.

In our method, the key quantity for computing Bayes risk lower bounds is the $f$-informativity. We derive a new upper bound on the $f$-informativity for a class of $f$ functions, which leads to tight Bayes risk lower bounds. We present Bayes risk lower bounds for several estimation problems as examples, including Gaussian location models, the Bayesian Lasso, Bayesian generalized linear models and principal component analysis for the spiked covariance model. Our technique also leads to generalizations of a variety of classical minimax bounds. In particular, our result gives the most general Fano's inequality under an arbitrary prior distribution over the parameter space. In addition to the Kullback-Leibler-based Fano's inequality, our technique yields a suite of lower bounds on testing risks with different $f$-divergences, such as the chi-squared, total variation and Hellinger distances. Further, classical results in minimax theory, including Le Cam's, Assouad's and Birgé's inequalities, can all be easily derived by our method.
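For orientation, the classical form of Fano's inequality that this framework generalizes can be stated as follows (standard notation, not taken from the talk itself):

```latex
% Fano's inequality: for \theta uniform on a finite set \{\theta_1,\dots,\theta_N\}
% and any estimator \hat{\theta} based on data X,
\Pr[\hat{\theta} \neq \theta] \;\ge\; 1 - \frac{I(\theta; X) + \log 2}{\log N},
% where I(\theta; X) is the mutual information. The KL divergence is the
% f-divergence with f(t) = t \log t; substituting other choices of f
% (chi-squared, total variation, Hellinger) gives the other testing bounds
% mentioned above.
```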

We present here a new classification model for data mining/machine learning that is based on combinatorial optimization, solving the optimization problem of "normalized cut prime" (NC'). NC' is closely related to the NP-hard normalized cut problem, yet is polynomial-time solvable. NC' is shown to be effective in image segmentation and in approximating the objective function of normalized cut, as compared to the spectral technique. Its adaptation as a supervised classification data mining technique is called Supervised Normalized Cut (SNC). In a comparative study with the most commonly used data mining and machine learning methods, including support vector machines (SVMs), neural networks, PCA and logistic regression, SNC was shown to deliver highly accurate results within competitive run times.

In scaling SNC to large data sets, its use of pairwise similarities poses a challenge, since the matrix of pairwise comparisons grows quadratically in the size of the dataset. We describe a new approach, called sparse computation, that generates only the significant weights without ever generating the entire matrix of pairwise comparisons. Sparse computation runs in time linear in the number of non-zeros in the output; in this it contrasts with known "sparsification" approaches, which must first generate the full set of pairwise comparisons and thus take at least quadratic time. Sparse computation is applicable in any setup where pairwise similarities are employed, and can be used to scale the spectral method and k-nearest neighbors as well. The efficacy of sparse computation for SNC is manifested by its retention of accuracy, compared to the use of the fully dense matrix, while achieving a dramatic reduction in matrix density and thus in run times.

No abstract available. Joint work with Y. Zhang, X. Chen, and D. Zhou.

**Location:** Calvin Lab 116

## Learning Arbitrary Statistical Mixtures of Discrete Distributions

We study the problem of learning very general statistical mixture models on large finite sets from unlabeled samples. Specifically, we want to learn a model that is a mixture of distributions $p$ on $\{1,2,\dots,n\}$. When we sample from the model, we do not observe a specific $p$ from the mixture directly, but only indirectly and in a very noisy fashion: we observe $K$ independent samples of $\{1,2,\dots,n\}$ drawn according to a distribution $p$ chosen from the mixture. We give efficient algorithms for learning this mixture model without making any restrictive assumptions on the structure of the mixture. We bound the quality of the solution as a function of the sample size $K$ and the number of samples used. Our model and results have applications to a variety of unsupervised learning scenarios, including learning topic models and collaborative filtering.

Most of the talk is based on a joint paper with Jian Li, Leonard Schulman, and Chaitanya Swamy. We will also mention some earlier work with some of the above coauthors.

No abstract available.

This talk is about a new direction in my research that arose from exposure to a broad set of topics during the big data program. Given a real-valued, differentiable multivariate function, an active subspace is a low-dimensional subspace of linear combinations of the function's inputs along which the function exhibits large changes in its values. Although this sounds similar to principal components, the idea is distinct. Checking whether a function has an active subspace involves computing the eigenvalues of a matrix defined by a high-dimensional integral. This integral is typically intractable in the cases where active subspaces are used in science and engineering applications. We discuss a randomized procedure to accomplish this check using the randomized matrix methods discussed during the program. In terms of big data, this problem illustrates a case where only a small amount of data seems necessary for a useful approximation, whereas a tremendously large amount of data would be necessary for an exact computation.
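A minimal Monte Carlo sketch of this check (our own illustration, with a made-up test function): estimate the matrix C = E[grad f(x) grad f(x)^T] by averaging gradient outer products at random inputs, then look for a gap in its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate_active_subspace(grad_f, dim, n_samples, sampler):
    """Randomized check for an active subspace (illustrative sketch).

    Approximates C = E[grad f(x) grad f(x)^T] by averaging outer products
    of gradients at sampled inputs, then eigendecomposes the estimate.
    A large gap after the j-th eigenvalue suggests a j-dimensional active
    subspace spanned by the leading eigenvectors.
    """
    C = np.zeros((dim, dim))
    for _ in range(n_samples):
        g = grad_f(sampler())
        C += np.outer(g, g)
    C /= n_samples
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]          # sort largest first
    return eigvals[order], eigvecs[:, order]

# Example: f(x) = sin(a . x) varies only along direction a, so the estimate
# is rank one with leading eigenvector parallel to a.
a = np.array([3.0, 4.0, 0.0]) / 5.0
grad = lambda x: np.cos(a @ x) * a
vals, vecs = estimate_active_subspace(grad, 3, 200, lambda: rng.normal(size=3))
```

The point of the talk is that a modest number of gradient samples already reveals the eigenvalue gap, even though evaluating the underlying integral exactly would require far more data.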

For more technical detail, see: Paul Constantine and David Gleich, "Computing Active Subspaces," http://arxiv.org/abs/1408.0545.