We discuss new techniques for approximating the mean of a Gaussian in the presence of a large fraction of adversarial errors. We show that by taking advantage of higher moments of these distributions, we can obtain errors close to the information-theoretic optimum, and present an application of this to learning mixtures of spherical Gaussians.

### Monday, October 29th, 2018

We use the Sum of Squares (SoS) method to develop a new efficient algorithm for clustering and mean estimation in well-separated high-dimensional mixture models, substantially improving upon the statistical guarantees achieved by previous efficient algorithms. In particular, we study mixtures of k distributions, where every pair of distributions has means separated by separated by at least k^epsilon for any epsilon > 0. In the special case of spherical Gaussian mixtures, we give a k^O(1/epsilon^2)-time algorithm that learns the means of the components in the mixture and accurately clusters samples from the mixture. This is the first algorithm to improve on greedy (?single-linkage?) and spectral clustering, breaking a long-standing barrier for efficient algorithms at separation k^1/4. Our techniques are based on adapting algorithmic ideas from robust statistics, and are potentially of independent interest. Our main algorithm for learning mixture models provides an entirely SoS interpretation of the convex programming framework of [Diakonikolas et al, FOCS 16]. We show that many of the proofs from that paper can be replaced with much simpler proofs using only basic concentration and Holder's inequality, which allows us to approach this problem via SoS. As a corollary of this, we also obtain improved rates for robust mean estimation in certain regimes. Joint work with Sam Hopkins (Berkeley).

What is the effect of robustness on the computational complexity of high-dimensional estimation? In this talk, I will describe a technique that establishes computational-statistical tradeoffs for a range of robust estimation problems using the lens of the Statistical Query (SQ) complexity. The prototypical applications of our technique will be for the problems of robust mean and covariance estimation. The talk will be based on joint works with Daniel Kane (UCSD) and Alistair Stewart (USC).

We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn i.i.d. from distribution D and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to D, assuming that the marginal distribution of D over the input space is certifiably hypercontractive. This natural property is satisfied by many well-studied distributions such as Gaussian, strongly log-concave distributions and, uniform distribution on the hypercube among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work.

Classical hypothesis testing seeks to decide whether given data is signal or noise. Likelihood ratio (LR) tests are known to minimize the probability of false positive (FP) for any given probability of false negative (FN).

We consider data which is either all noise - Eg. drawn from a standard Gaussian N(0,1) - or mostly noise with a weak signal - Eg. drawn from a Gaussian mixture: (epsilon) N(\mu,1) + (1-epsilon)N(0,1). We seek tests for which both FP and FN go to zero as the number of iid samples goes to infinity. We show essentially that ideal tests exist if the chi-squared distance between signal and noise is higher than a certain threshold.

Interestingly, it turns out that the best tests do not use LR, but a related, yet different quantity.

The proofs are simple and the result is work in progress. The talk will describe things from first principles.

Joint Work with Richard Karp.

Graph distances have proven quite useful in machine learning/statistics, particularly in the estimation of Euclidean or geodesic distances. The talk will include a partial review of the literature, and then present more recent developments on the estimation of curvature-constrained distances on a surface, as well as on the estimation of Euclidean distances based on an unweighted and noisy neighborhood graph.

We develop model-based methods for solving stochastic convex optimization problems, introducing the approximate-proximal point, or aProx, family, which includes stochastic subgradient, proximal point, and bundle methods. When the modeling approaches we propose are appropriately accurate, the methods enjoy stronger convergence and robustness guarantees than classical approaches, even though the model-based methods typically add little to no computational overhead over stochastic subgradient methods. For example, we show that improved models converge with probability 1 and enjoy optimal asymptotic normality results under weak assumptions; these methods are also adaptive to a natural class of what we term easy optimization problems, achieving linear convergence under appropriate strong growth conditions on the objective. Our substantial experimental investigation shows the advantages of more accurate modeling over standard subgradient methods across many smooth and non-smooth optimization problems.

### Tuesday, October 30th, 2018

Heterogeneity across different sub-populations or "homogeneous blocks" can be beneficially exploited for causal inference and novel robustness, with wide-ranging prospects for various applications. The key idea relies on a notion of probabilistic invariance or stability: it opens up new insights for formulating causality as a certain risk minimization problem with a corresponding notion of robustness. The novel methodology has connections to instrumental variable regression and robust optimization.

Uniform stability of a learning algorithm is a classical notion of algorithmic stability introduced to derive high-probability bounds on the generalization error (Bousquet and Elisseeff, 2002). Specifically, for a loss function with range bounded in $[0,1]$, the generalization error of $\gamma$-uniformly stable learning algorithm on $n$ samples is known to be at most $O((\gamma +1/n) \sqrt{n \log(1/\delta)})$ with probability at least $1-\delta$. Unfortunately, this bound does not lead to meaningful generalization bounds in many common settings where $\gamma \geq 1/\sqrt{n}$. Here we prove substantially stronger generalization bounds for uniformly stable algorithms without any additional assumptions. First, we show that the generalization error in this setting is at most $O(\sqrt{(\gamma + 1/n) \log(1/\delta)})$ with probability at least $1-\delta$. In addition, we prove a tight bound of $O(\gamma^2 + 1/n)$ on the second moment of the generalization error. The best previous bound on the second moment of the generalization error is $O(\gamma + 1/n)$. Our proofs are based on new analysis techniques and results imply substantially stronger generalization guarantees for several well-studied algorithms. Joint work with Jan Vondrak (Stanford).

Double machine learning provides $\sqrt{n}$-consistent estimates of parameters of interest even when high-dimensional or nonparametric nuisance parameters are estimated at an $n^{-1/4}$ rate. The key is to employ Neyman-orthogonal moment equations which are first-order insensitive to perturbations in the nuisance parameters. We show that the $n^{-1/4}$ requirement can be improved to $n^{-1/(2k+2)}$ by employing a k-th order notion of orthogonality that grants robustness to more complex or higher-dimensional nuisance parameters. In the partially linear regression setting popular in causal inference, we show that we can construct second-order orthogonal moments if and only if the treatment residual is not normally distributed. Our proof relies on Stein's lemma and may be of independent interest. We conclude by demonstrating the robustness benefits of an explicit doubly-orthogonal estimation procedure for treatment effect.

The talk will first review the problem of robust subspace recovery, which seeks an underlying low-dimensional subspace in a data set that is possibly corrupted with outliers. The emphasis will be on surveying existing theoretical guarantees and tradeoffs. New results for adversarial outliers will also be mentioned. Following this, other related problems will be discussed, along with new results for one of these problems.

We consider the Sherrington-Kirkpatrick model of spin glasses with ferromagnetically biased couplings. For a specific choice of the couplings mean, the resulting Gibbs measure is equivalent to the Bayesian posterior for a high-dimensional estimation problem known as "Z2 synchronization". Statistical physics suggests to compute the expectation with respect to this Gibbs measure (the posterior mean in the synchronization problem), by minimizing the so-called Thouless-Anderson-Palmer (TAP) free energy, instead of the mean field (MF) free energy. We prove that this identification is correct, provided the ferromagnetic bias is larger than a constant (i.e. the noise level is small enough in synchronization). Namely, we prove that the scaled l_2 distance between any low energy local minimizers of the TAP free energy and the mean of the Gibbs measure vanishes in the large size limit. Our proof technique is based on upper bounding the expected number of critical points of the TAP free energy using the Kac-Rice formula.

Logistic regression is arguably the most widely used and studied non-linear model in statistics. Classical maximum likelihood theory provides asymptotic distributions for the maximum likelihood estimate (MLE) and the likelihood ratio test (LRT), which are universally used for inference. Our findings reveal, however, when the number of features p and the sample size n both diverge, with the ratio p/n converging to a positive constant, classical results are far from accurate. For a certain class of logistic models, we observe, (1) the MLE is biased, (2) variability of the MLE is much higher than classical results and (3) the LRT is not distributed as a Chi-Squared. We develop a new theory that quantifies the asymptotic bias and variance of the MLE, and characterizes asymptotic distribution of the LRT under certain assumptions on the distribution of the covariates. Empirical results demonstrate that our asymptotic theory provides extremely accurate inference in finite samples. These novel results depend on the underlying regression coefficients through a single scalar, the overall signal strength, which can be estimated efficiently. This is based on joint work with Emmanuel Candes and Yuxin Chen.

This talk studies hypothesis testing and confidence interval construction in high-dimensional linear models where robustness is view in the context of whether the data is truly sparse. We will show new concepts of uniform and essentially uniform non-testability that allow the study of the limitations of tests across a broad set of alternatives. Uniform non-testability identifies an extensive collection of alternatives such that the power of any test, against any alternative in this group, is asymptotically at most equal to the nominal size, whereas minimaxity shows the existence of one, particularly "bad" alternative. Implications of the new constructions include new minimax testability results that in sharp contrast to existing results do not depend on the sparsity of the model parameters and are therefore robust. We identify new tradeoffs between testability and feature correlation. In particular, we show that in models with weak feature correlations minimax lower bound can be attained by a confidence interval whose width has the parametric rate regardless of the size of the model sparsity.

### Wednesday, October 31st, 2018

In modern science, often data collection precedes the careful specification of hypotheses. Large datasets are mined, testing a large number of possible hypotheses, with the goal of identifying those that hold promise for follow-up. In this framework, controlling the False Discovery Rate is an appropriate criterion to avoid investing time and resources into non viable leads. In many contexts, the initial large collection of explored hypotheses is somewhat redundant: in an effort to maximize power, the same scientific statement can be probed with a number of related hypotheses. For example, the association between a phenotype and one genetic locus can be investigated by exploring the association between the phenotype and many genetic variants in the locus. After the first pass through the data is completed, however, and it is time to take stock of the identified scientific leads, this redundancy is corrected: in the example above, rather than reporting all variants associated with a phenotype, scientists routinely report only a ?lead? variant, selected to represent the entire locus. Because the false discovery proportion is crucially defined with reference to the total set of discoveries, however, these subsets of discoveries identified post-hoc are not equipped with guarantees of FDR control. To overcome this problem, we note that if the criterion with which discoveries will be filtered can be specified in advance, it is possible to modify the Benjamini Hochberg procedure to result in a focused set of discoveries with FDR guarantees. Our framework allows not only subsetting of discoveries, but also their prioritization with weights reflecting the extent to which they provide insight into distinct scientific questions. We illustrate our methodology with examples from gene set enrichment on the Gene Ontology, a collection of hypotheses organized in a directed acyclic graph.

The talk will be based on joint work with Eugene Katsevich (Stanford) and Marina Bogomolov (Technion).

The Benjamin--Hochberg (BH) procedure and many of its generalizations are empirically observed to control the false discovery rate (FDR) much beyond the regimes that are known to enjoy provable FDR control. To address this gap, this talk introduces some new results that imply the robustness of the BH procedure and certain related procedures for FDR control. First, we show that FDR control is maintained up to a small multiplicative factor under arbitrary dependence between false null test statistics and independent true null test statistics. The proof technique is based on a new backward submartingale argument. Next, we further extend the FDR control to the case where the null test statistics exhibit certain positive dependence, implying that the null distribution plays an essential role in FDR control. We conclude the talk by introducing a weak version of FDR control for which the BH procedure is robust against any adversarial false null test statistics. Part of this talk is based on joint work with Cynthia Dwork (Harvard) and Li Zhang (Google).

We consider linear regression in the high-dimensional regime where the number of parameters exceeds the number of samples (p > n) and assume that the high-dimensional parameters vector is sparse. We develop a framework for testing general hypotheses regarding the model parameters. Our framework encompasses testing whether the parameter lies in a convex cone, testing the signal strength, and testing arbitrary functionals of the parameter. We show that the proposed procedure controls the false positive rate and also analyze the power of the procedure. Our numerical experiments confirm our theoretical findings and demonstrate that we control false positive rate near the nominal level, and have high power. By duality between hypotheses testing and confidence intervals, the proposed framework can be used to obtain valid confidence intervals for various functionals of the model parameters. For linear functionals, the length of confidence intervals is shown to be minimax rate optimal. [This talks is based on a joint work with Jason Lee.]

### Thursday, November 1st, 2018

Estimating a high-dimensional sparse covariance matrix from a limited number of samples is a fundamental problem in contemporary data analysis. Most proposals to date, however, are not robust to outliers or heavy tails. Towards bridging this gap, we consider estimating a sparse shape matrix from n samples following a possibly heavy tailed elliptical distribution. We propose estimators based on thresholding either Tyler's M-estimator or its regularized variant. We prove that under suitable conditions the regularized variant can be computed efficiently in practical polynomial time. Furthermore, we prove that in the joint limit as dimension p and sample size n tend to infinity with p/n tending to a constant, our proposed estimators are minimax rate optimal. Results on simulated data support our theoretical analysis.

We study polynomial time algorithms for estimating the mean of a d-dimensional random vector X from n independent samples X1,?,Xn when X may be heavy-tailed. In particular, we assume only that X has finite mean and covariance. In this setting, the radius of confidence intervals achieved by the empirical mean are large compared to the case that X is Gaussian or sub-Gaussian. We offer the first polynomial time algorithm to estimate the mean with sub-Gaussian-style confidence intervals even when X has only finite second moments. Our algorithm is based on a new semidefinite programming relaxation of a high-dimensional median. Previous estimators which assumed only existence of O(1) moments of X either sacrifice sub-Gaussian performance or are only known to be computable via brute-force search procedures requiring exp(d) time.

Robust estimation under Huber's $\epsilon$-contamination model has become an important topic in statistics and theoretical computer science. Rate-optimal procedures such as Tukey's median and other estimators based on statistical depth functions are impractical because of their computational intractability. In this talk, I will discuss an intriguing connection between f-GANs and various depth functions through the lens of f-Learning. Similar to the derivation of f-GAN, I will show that these depth functions that lead to rate-optimal robust estimators can all be viewed as variational lower bounds of the total variation distance in the framework of f-Learning. This connection opens the door of computing robust estimators using tools developed for training GANs. In particular, I will show that a JS-GAN that uses a neural network discriminator with at least one hidden layer is able to achieve the minimax rate of robust mean and covariance matrix estimation under Huber's $\epsilon$-contamination model. Interestingly, the hidden layers for the neural net structure in the discriminator class is shown to be necessary for robust estimation.

Optimizing a non-convex function is hard in general. Existing analysis for non-convex optimization then relies on problem specific structures that may not be robust to adversarial perturbations. In this talk, we will see two scenarios where the standard non-convex optimization techniques are not robust, and show how to modify the algorithms to handle adversarial perturbations. The first scenario considers the matrix completion problem against a semi-random adversary that can reveal more entries of the matrix. Although this weak adversary is harmless to convex relaxations, we show that it can ruin non-convex approaches, and give a fast algorithm to fix this problem. The second scenario considers the more general setting where one does not have access to the exact function to optimize, where we show that it is still possible to find an approximate local optimal solution. Based on joint works with Yu Cheng, Chi Jin, Lydia Liu, Michael I. Jordan.

Probabilistic models are widely used in statistical inference, and for understanding the computational tractability of several unsupervised learning tasks. However, a common criticism of most existing learning algorithms is that their guarantees strongly rely on the unrealistic assumption that the instance is generated exactly from the model. Semi-random models provide a framework to reason about robustness of algorithms to modeling errors by incorporating both adversarial and random choices in instance generation. In this talk, I will describe new semi-random models for two different problems in unsupervised learning: dictionary learning, and clustering mixtures of Gaussians. In dictionary learning the task is to learn a hidden incoherent basis/dictionary A given data Y=AX that represents unknown sparse linear combinations (corresponding to columns of X) of the basis elements. Existing algorithms learn A and X efficiently, assuming strong distributional assumptions about both the supports of the columns of X and the values of these non-zero entries. In the first part of the talk, I will describe a more general semi-random model for dictionary learning aimed at capturing almost arbitrary distributions for the sparse supports, and give a new polynomial time algorithm for learning incoherent over-complete dictionaries under the semi-random model. Finally (time permitting), I will describe a natural semi-random model for k-means clustering that generalizes mixtures of k Gaussians, and demonstrate how the Lloyds algorithm successfully recovers the ground-truth clustering up to near optimal accuracy. Based on joint works with Pranjal Awasthi.

Modern applications increasingly involve high-dimensional and heterogeneous data, e.g., datasets formed by combining numerous measurements from myriad sources. Principal Component Analysis (PCA) is a classical method for reducing dimensionality by projecting data onto a low-dimensional subspace capturing most of their variation, but it does not robustly recover underlying subspaces in the presence of heteroscedastic noise. Specifically, PCA suffers from treating all data samples as if they are equally informative. We will discuss the consequences of this on performance, which lead us naturally to consider weighting PCA in such a way that we give less influence to samples with larger noise variance. Doing so better recovers underlying principal components, but precisely how to choose the weights turns out to be an interesting problem. Surprisingly, we show that whitening the noise by using inverse noise variance is sub-optimal. Our analysis provides expressions for the asymptotic recovery of underlying low-dimensional components from samples with heteroscedastic noise in the high-dimensional regime. We derive optimal weights and characterize the performance of optimally weighted PCA. This is work in collaboration with David Hong and Jeff Fessler.

We study transformations of shapes into representations that allow for analysis using standard statistical tools. The transformations are based on Euler integration and are of interest for their mathematical properties as well as their applications to science and engineering, because they provide a way of summarizing shapes in a topological, yet quantitative, way. By using an inversion theorem, we show that both transforms are injective on the space of shapes---each shape has a unique transform. By making use of a stratified space structure on the sphere, induced by hyperplane divisions, we prove additional uniqueness results in terms of distributions on the space of Euler curves. The main theoretical result provides the first (to our knowledge) finite bound required to specify any shape in certain uncountable families of shapes, bounded below by curvature. This result is perhaps best appreciated in terms of shattering number or the perspective that any point in these particular moduli spaces of shapes is indexed using a tree of finite depth. We also show how these transformations can be used in practice for medical imaging applications as well as for evolutionary morphology questions.

### Friday, November 2nd, 2018

We present a new method for high-dimensional linear regression when a scale parameter of the error is unknown. The proposed estimator is based on a penalized Huber M-estimator, for which theoretical results on estimation error have recently been proposed in high-dimensional statistics literature. However, variance of the error term in the linear model is intricately connected to the parameter governing the shape of the Huber loss. The main idea is to use an adaptive technique, based on Lepski's method, to overcome the difficulties in solving a joint nonconvex optimization problem with respect to the location and scale parameters.

Consider an instance of Euclidean k-means or k-medians clustering. We prove that a dimension reduction projection onto a space of dimension d ~ log k preserves the cost of the optimal clustering within a factor of 1 + epsilon w.h.p. Crucially, the dimension d does not depend on the total number of points n in the instance. The result also applies to other variants of the k-clustering problem. The result strengthens the result by Cohen, Elder, Musco, Musco, and Persu, who showed that the value of k-means is approximately preserved when d ~ k. No bounds on d were previously known for k-medians. Joint work with Konstantin Makarychev and Ilya Razenshteyn.

Symmetric properties of distributions arise in multiple settings. For each of these, separate estimators and analysis techniques have been developed. Recently, Orlitsky et al showed that a single estimator that maximizes profile maximum likelihood (PML) is sample competitive for all symmetric properties. Further, they showed that even a 2^{n^{1-delta}}-approximate maximizer of the PML objective can serve as such a universal plug-in estimator. (Here n is the size of the sample). Unfortunately, no polynomial time computable PML estimator with such an approximation guarantee was known. We provide the first such estimator and show how to compute it in time nearly linear in n. We also present some preliminary experimental results. Joint work with Kiran Shiragur and Aaron Sidford.

Over the last few years, there has been significant theoretical work in robust high-dimensional statistical estimation. These results seem aligned with modern goals in practical machine learning, where high-dimensional data is ubiquitous, and robustness and security are paramount. This raises the question: are these advances purely theoretical, or can we reap their benefits in the real world? In this talk, I will describe some first steps towards answering this question positively, including evidence that these theoretical advances may be realizable. I will discuss applications to exploratory data analysis and robust stochastic optimization on synthetic and real-world datasets.

Based on joint works with Ilias Diakonikolas, Daniel Kane, Jerry Li, Ankur Moitra, Jacob Steinhardt, and Alistair Stewart. Papers available at https://arxiv.org/abs/1703.00893 and https://arxiv.org/abs/1803.02815.