Estimation of low-rank matrices is of significant interest in a range of contemporary applications. In this talk we introduce a rank-one projection model for low-rank matrix recovery and propose a constrained nuclear norm minimization method for stable recovery of low-rank matrices in the noisy case. The procedure is adaptive to the rank and robust against small perturbations. The proposed estimator is shown to be rate-optimal under certain conditions. Applications to phase retrieval and estimation of spiked covariance matrices will also be discussed.

### Monday, March 16th, 2015

We give algorithms for regression for a wide class of M-Estimator loss functions. These generalize l_p-regression to fitness measures used in practice such as the Huber measure, which enjoys the robustness properties of l_1 as well as the smoothness properties of l_2. For such estimators we give the first input sparsity time algorithms. Our techniques are based on the sketch and solve paradigm. The same sketch works for any M-Estimator, so the loss function can be chosen after compressing the data.

Joint work with Ken Clarkson.

Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. There has been considerable recent work on designing algorithms for sparse coding with provable guarantees, but somewhat surprisingly these simple heuristics outperform them in practice. Here we give a general framework for understanding alternating minimization which we leverage to analyze existing heuristics and to design new ones also with provable guarantees.

We study this problem in a natural generative model, and obtain a variety of new algorithmic results: We give the first neurally plausible algorithm — closely related to the original heuristic of Olshausen and Field — that (provably) converges to a globally optimal sparse code. We also give the first algorithm for sparse coding that works almost up to the information theoretic limit for sparse recovery on incoherent dictionaries. All previous algorithms that approached or surpassed this limit run in time exponential in some natural parameter. Finally, our algorithms improve upon the sample complexity of existing approaches. We believe that our framework will have applications beyond sparse coding, and could be used to show that simple, iterative algorithms can be powerful in other contexts as well by suggesting new ways to analyze them.

I will describe a general framework of inferring a hidden basis from imperfect measurements. It turns out that a number of problems from the classical eigendecompositions of symmetric matrices to such topics of recent interest as multiclass spectral clustering, Independent Component Analysis and certain problems in Gaussian mixture learning can be viewed as examples of hidden basis learning.

I will then describe algorithms for basis recovery and provide theoretical guarantees in terms of computational complexity and perturbation size. The proposed algorithms are based on what may be called "gradient iteration" and are simple to describe and to implement. They can be viewed as generalizations of both the classical power method for recovering eigenvectors of symmetric matrices as well as the recent work on power methods for tensors. Unlike these methods, our analysis is based not on tensorial properties, but on certain "hidden convexity" of contrast functions.

Joint work with L. Rademacher and J. Voss.

We consider the problems of aggregating ordinal data in the form of permutations under a number of constraints that arise in practical applications such as gene prioritization. To perform constrained aggregation under the constrained Kendall tau distance, we introduce an LP relaxation of the problem and prove that it provides a constant approximation solution. We proceed to describe a number of related methods that may also be used for solving constrained correlation clustering arising in community detection.

This is a joint work with Farzad Farnoud, Gregory Puleo and Fardad Raisali.

Distribution property testing has generated significant interest over the past decade. Given sample access to a distribution, the problem is to decide if the distribution has a property or is far from it. The problem of testing identity of a single distribution has been studied extensively and was solved recently. However, very few results are known for the problem of testing if a distribution belongs to a class of distributions. In this work we consider the well-studied class of Poisson Binomial distributions, which are sums of $n$ independent, but not necessarily identical Bernoullis. We provide a sample near-optimal algorithm for testing whether a distribution supported on $\{0,...,n\}$ to which we have sample access is an unknown Poisson Binomial distribution, or far from all Poisson Binomial distributions. The sample complexity of our algorithm is $O(n^{1/4})$ to which we provide a matching information theoretic lower bound. Our algorithm improves quadratically over the known results. This is one of the first optimal results on testing against a class of distributions. We will discuss some recent results on testing other families of distributions.

The stochastic block model (SBM) is a popular network model with community structures. It has (re)gained major attention in the past years due to new phase transition phenomena discovered for the two-cluster case. In this talk, we establish the fundamental limit of community recovery in the general SBM. We emphasize the strong analogy with the channel coding theorem, and provide a quasi-linear time algorithm that achieves the limit. Joint work with Colin Sandon.

Nonasymptotic estimates on the spectral norm of random matrices have proved to be indispensable in many problems in statistics, computer science, signal processing, and information theory. The noncommutative Khintchine inequality and its recent matrix concentration analogues are widely used in this context. Unfortunately, such simple bounds fail to be sharp in many applications. In this talk, I will discuss new bounds on the norm of random matrices with independent entries that are (provably) sharp in almost all cases of practical interest. Beside providing a substantial improvement on previously known results, the proof of these bounds is surprisingly illuminating and effectively explains when and why the norm of a random matrix is large. Joint work with Afonso Bandeira.

Sparse representations have emerged as powerful tools in signal processing theory, algorithms, and applications. However, real-world signal and image ensembles often exhibit rich structure beyond mere sparsity; for example, a natural image can be modeled as (approximately) tree-sparse in the wavelet domain. A general approach for capturing this type of additional structure is to model the signal of interest as belonging to a particular union-of-subspaces in the signal space. Such a modeling approach has proved to be beneficial in a number of applications including compression, denoising, and machine learning. However, enforcing prior structure often leads to more cumbersome and resource-intensive algorithms. This is undesirable in many scenarios where high-dimensional signal and image datasets are commonplace.

In this talk, I will outline some of our recent efforts towards building highly efficient algorithms for structured sparse modeling. Interestingly, our algorithms borrow several elements from combinatorial optimization, graph theory, and approximation algorithms, and are in some sense fundamentally non-convex. Despite this feature, our algorithms enjoy a (nearly) linear running time thereby enabling their applicability to very large datasets. In particular, I will show that our methods lead to the first nearly-linear time algorithms for stable compressive sensing recovery of signals and images which are tree-sparse in the wavelet-domain, as well as their generalizations.

Joint work with Chinmay Hegde and Ludwig Schmidt.

### Tuesday, March 17th, 2015

During the discussion on Monday afternoon, the audience will select a talk to continue on Tuesday morning.

In this talk I will re-state the results presented yesterday in my talk, concerning the recovery of the communities in the stochastic block model. I will spend some time on the connection between coding and clustering network models with latent variables. If time allows, I will discuss the algorithms and proofs, in particular our algorithms that recover clusters with quasi-linear time in the constant and logarithmic degree regimes.

Estimating the support size of a distribution is a classical problem in probability and statistics, dating back to the early work of Fisher, Good-Turing, and the famous paper on "how many words did Shakespeare know" by Efron-Thisted. When the samples are scarce, the challenge lies in how to extrapolate the number of unseen symbols based on what have been seen so far.

We consider the estimation of the support size of a discrete distribution using independent samples whose minimum non-zero mass is at least . We show that, to achieve an additive error of with probability at least 0.9, the sample complexity is within universal constant factors of , which improves the state-of-the-art result due to Valiant-Valiant. The apparatus of best polynomial approximation and its duality to the moment matching problem play a key role in both the impossibility result and the construction of linear-time estimators with provable optimality.

We also investigate the closely-related species problem where the goal is to estimate the number of distinct colors in an ball-urn model from repeated draws. Experimental results on real data will be presented. Time permitted, extensions to other distributional properties will be discussed. (Joint work with Pengkun Yang).

We propose a general approach to estimating functionals of discrete distributions, and illustrate its efficacy by analyzing the cases of estimating the Shannon entropy H(P) and the alpha-moment of the distribution, based on n i.i.d. samples from the unknown discrete distribution P with support size S. The resulting estimators achieve the minimax L2 rates in all regimes, including those where the support size S grows with n. They have linear complexity and do not assume knowledge of S. The performance of these estimators with n samples turns out to be essentially that of the respective Maximum Likelihood Estimators (MLE) with n log n samples. Approximation theory plays key roles in analyzing the MLE (the plug-in rule for functional estimation), and in both the construction and analysis of the minimax rate-optimal estimators. As corollaries, we show how our results are consistent with other recent results, such as the optimal sample complexity n = \Theta(S/log S) for entropy estimation, first discovered by Valiant and Valiant in 2011.

Furthermore, we show that our entropy estimator is an adaptive estimator in the sense that it achieves the minimax rates simultaneously over a nested sequence of subsets of distributions P, without knowing the support size S or which subset P lies in. Each subset is characterized by an upper bound on the distribution entropy. We also characterize the maximum risk of the MLE over this nested sequence, and show, for every subset in the sequence, that the performance of the minimax rate-optimal estimator with n samples is still essentially that of the MLE with n log n samples.

We discuss the extension of our approach to estimating divergence functions, such as the total variation distance. We demonstrate that the construction is far more complicated than the univariate case, and showcase a general interpretation of our methodology in estimating functionals in arbitrary statistical models.

We highlight the practical advantages of our schemes for the estimation of entropy and mutual information. We compare our performance with various existing approaches, and demonstrate that our approach reduces running time and boosts the accuracy. Moreover, we show that the minimax rate-optimal mutual information estimator leads to significant performance boosts over the Chow–Liu algorithm in learning graphical models. Given the wide applicability of information measure estimation, we believe that the new insights and near optimal estimators have the potential for far reaching applications.

Based on joint work with Kartik Venkat (Stanford EE), Yanjun Han (Tsinghua University), and Tsachy Weissman (Stanford EE).

Estimating the entropy of a distribution by observing independent samples from it is an important problem with varied applications such as neural signal processing, extraction of secret keys from noisy observations, and image segmentation. In particular, the estimation of Shannon entropy has received a lot of attention, and results have emerged recently that establish sharp bounds on the number of samples needed for its reliable estimation. In this talk, we will consider the sample complexity for estimating a generalization of Shannon entropy, namely the Rényi entropy, and will establish almost tight lower and upper bounds on it. As a consequence, we will note that estimating Rényi entropy of order 2 requires samples proportional to roughly the square root of the support-size of the underlying distribution, the least among all Rényi entropies of different orders.

Given low order moment information over the random variables $\bX = (X_1,X_2,\ldots,X_p)$ and $Y$, what distribution maximizes the Rényi maximal correlation coefficient between $\bX$ and $Y$, while remains faithful to the given moments? The answer to this question is important especially in order to fit models over $(\bX,Y)$ with minimum dependence among the random variables $\bX$ and $Y$. In this talk, we investigate this question first in the continuous setting by showing that the jointly Gaussian distribution achieves the minimum Rényi correlation coefficient among distributions with the same first and second order moments. Then, we pose a similar question in the discrete scenario by fixing the pairwise marginals of the random variables $\bX$ and $Y$. We first derive a lower bound for the Rényi correlation coefficient over the class of distributions with fixed pairwise marginals. Then we show that this lower bound is tight if there exists a distribution with certain additive model structure satisfying the given pairwise marginals. Moreover, the distribution with the additive structure achieves the minimum Rényi maximal correlation coefficient.

We study the problem of lossless universal source coding for stationary memoryless sources on countably infinite alphabets. This task is generally not achievable without restricting the class of sources over which universality is desired. We propose natural families of sources characterized by a common dominating envelope.

We particularly emphasize the notion of adaptivity, which is the ability to perform as well as an oracle knowing the envelope, without actually knowing it. This is closely related to the notion of hierarchical universal source coding, but with the important difference that families of envelope classes are not discretely indexed and there is no well-ordered nesting.

We extend the classes of envelopes over which adaptive universal source coding is possible, namely by including max-stable (heavy-tailed) and sub-exponential (light-tailed) envelopes which are relevant models in many applications, such as natural language modeling. Minimax lower bounds on the redundancy of any code on such envelope classes have been recently clarified. We propose constructive codes that do not use knowledge of the envelope. These codes are computationally efficient and structured to use an expanding threshold for auto-censoring (ETAC). We prove that the ETAC-code achieves the lower bound on the minimax redundancy within a factor logarithmic in the sequence length, and can be therefore qualified as a near-adaptive code over families of heavy-tailed envelopes. For finite and light-tailed envelopes the penalty is even less, and the same code follows closely related results that explicitly make the light-tailed assumption. Our technical results are founded on methods from regular variation theory and concentration of measure.

Joint work with D. Bontemps, A. Garivier, E. Gassiat, M Ohannessian.

One of the most natural and important questions in statistical learning is: how well can a distribution be approximated from its samples? Surprisingly, this question has so far been resolved for only a few approximation measures, for example the KL-divergence, and even there, the answer is ad hoc and not well understood. We resolve the question for other important approximation measures, such as chi squared divergence and L1 distance, and if the probabilities are bounded away from zero, we resolve the question for all smooth f-divergence approximation measures, thereby providing a coherent understanding of the rate at which a distribution can be approximated from its samples.

In game-theoretic formulations of prediction problems, a strategy makes a decision, observes an outcome and pays a loss. The aim is to minimize the regret, which is the amount by which the total loss incurred exceeds the total loss of the best decision in hindsight. This talk will focus on the minimax optimal strategy, which minimizes the regret, in three settings: prediction with log loss (a formulation of sequential probability density estimation that is closely related to sequential compression, coding, gambling and investment problems), sequential least squares (where decisions and outcomes lie in a subset of a Hilbert space, and loss is squared distance), and fixed-design linear regression (where the aim is to predict real-valued labels as well as the best linear function). We obtain the minimax optimal strategy for these problems under appropriate conditions, and show that it can be efficiently computed.

The stochastic multi-armed bandit model is a simple abstraction that has proven useful in many different contexts in statistics and machine learning. Whereas the achievable limit in terms of regret minimization is now well known, our aim is to contribute to a better understanding of the performance in terms of identifying the m-best arms. We introduce generic notions of complexity for the two dominant frameworks considered in the literature: fixed-budget and fixed-condence settings. In the fixed-confidence setting, we provide the first known distribution-dependent lower bound on the complexity that involves information-theoretic quantities and holds when m>=1 under general assumptions. In the specific case of two armed-bandits, we derive rened lower bounds in both the fixed-confidence and fixed-budget settings, along with matching algorithms for Gaussian and Bernoulli bandit models. These results show in particular that the complexity of the fixed-budget setting may be smaller than the complexity of the fixed-condence setting, contradicting the familiar behavior observed when testing fully specied alternatives. In addition, we also provide improved sequential stopping rules that have guaranteed error probabilities and shorter average running times. The proofs rely on two technical results that are of independent interest : a deviation lemma for self-normalized sums, and a novel change of measure inequality for bandit models.

Joint work with Emilie Kaufmann and Olivier Cappé.

### Wednesday, March 18th, 2015

During the discussion on Tuesday afternoon, the audience will select a talk to continue on Wednesday morning.

Estimating the support size of a distribution is a classical problem in probability and statistics, dating back to the early work of Fisher, Good-Turing, and the famous paper on "how many words did Shakespeare know" by Efron-Thisted. When the samples are scarce, the challenge lies in how to extrapolate the number of unseen symbols based on what have been seen so far.

We consider the estimation of the support size of a discrete distribution using independent samples whose minimum non-zero mass is at least . We show that, to achieve an additive error of with probability at least 0.9, the sample complexity is within universal constant factors of , which improves the state-of-the-art result due to Valiant-Valiant. The apparatus of best polynomial approximation and its duality to the moment matching problem play a key role in both the impossibility result and the construction of linear-time estimators with provable optimality.

We also investigate the closely-related species problem where the goal is to estimate the number of distinct colors in an ball-urn model from repeated draws. Experimental results on real data will be presented. Time permitted, extensions to other distributional properties will be discussed. (Joint work with Pengkun Yang).

There are two main paradigms in neighborhood-based collaborative filtering: the user-user paradigm and the item-item paradigm. In the user-user paradigm, to recommend items to a user, one first looks for similar users (a common measure of similarity is the so-called cosine similarity), and then recommends items liked by those similar users. In the item-item paradigm, to recommend items to a user, one first looks for items similar to items that the user has liked, and then recommends those similar items.

There is much empirical evidence that the item-item paradigm is more practical and works better in many cases (see Linden, Smith, and York [2003], Koren and Bell [2011]). We provide the theoretical justification to support this empirical observation by analyzing the item-item paradigm in a statistical decision theoretic framework. The performance loss or regret under the item-item collaborative filtering, along with appropriate exploration-exploitation, primarily depends on the 'item space dimensionality' rather than scaling with number of items or users.

This is based on joint work with Guy Bresler and Luis Voloch.

We propose a topic modeling approach to the prediction of preferences in pairwise comparisons. We develop a new generative model for pairwise comparisons that accounts for multiple shared latent global rankings that are prevalent in a population of users. This new model also captures inconsistent user behavior in a natural way. We establish a formal statistical equivalence between the new generative model and topic models. We leverage recent advances in the topic modeling literature to develop an algorithm that can learn shared latent rankings with provable consistency as well as sample and computational complexity guarantees. We demonstrate that the new approach is empirically competitive with the current state-of-the-art approaches in predicting preferences on some semi-synthetic and real world datasets.

This is joint work with Weicong Ding and Prakash Ishwar.

Human experts are crucial to data analysis. Their roles include sifting through large datasets for entries that are relevant to a particular task, identifying anomalous data, and annotating raw data to facilitate subsequent search, retrieval, and learning. Humans often perform much better than machines at such tasks. However, while the capacity to collect, transmit, and store big datasets has been increasing rapidly, the capacity of human experts to provide feedback and to leverage all available information has hardly changed. *Humans are the information bottleneck in data analysis *and will remain so in the future.

I will discuss new theory and algorithms that enable machines to learn efficiently from human experts, using a minimal amount of human interaction. The models so learned then inform the understanding of human cognition and the design of better algorithms for data processing. I will focus on *active learning from human experts *based on crowdsourcing training data *actively* to a pool of people. Rather than randomly selecting training examples for labeling, active crowdsourcing sequentially and adaptively selects the most informative examples for human evaluation, based on information gleaned from the analysis to date. By making optimal use of human expert judgment, these active learning method can speed up the training of a variety of machine learning algorithms.

Repeated patterns and related phenomena in words are known to play a central role in many facets of computer science, telecommunications, coding, intrusion detection, twitter classification, data compression, data mining, and molecular biology. We will first review several pattern matching problems (i.e., exact string matching, constrained pattern matching, generalized pattern matching, and subsequence pattern matching). Then we discuss selected applications (i.e., digital trees, suffix trees, Lempel-Ziv'77 and Lempel-Ziv'78 data compression schemes, and finally the string complexity).

This talk is based on yet unpublished book, co-authored by P. Jacquet, with the same title as the talk.

In this talk, I explore a few instances of the application of strong versions of data processing inequalities to different estimation problems where the estimation procedures are restricted in some way. In particular, I will survey several recent results, with applications in (1) private statistical inference, (2) large-scale distributed estimation, and (3) stochastic optimization, all of which rely on a few new—but related—information theoretic tools that help us to understand the fundamental limits of, and optimal algorithms for, a variety of inference problems.

Strong (or quantitative) data processing inequalities provide sharp estimates of the rate at which a noisy channel “destroys” information. In this talk, I will present some recent results on strong data processing inequalities in discrete settings, with a focus on their use for quantifying the mixing behavior of Markov Chain Monte Carlo (MCMC) algorithms and the decay of correlations in graphical models.

We discuss a general approach to hypothesis testing. The main "building block" of the approach is the notion of a "good" observation scheme (o.s.); examples include the cases where an observation is (a) corrupted by white Gaussian noise affine image of a vector ("signal"), (b) a vector with independent Poisson entries with parameters affinely depending on the signal, (c) random variable taking finitely many values with probabilities affinely parameterized by the signal, and (d) naturally defined direct products of o.s.'s (a) - (c) reflecting what happens when passing from a single observation to a collection of independent observations. We show that given a good o.s., a near-optimal test deciding on a (finite) collection of convex hypotheses (i.e., hypotheses stating that the signal underlying observations belongs to a convex compact set associated with the hypothesis) and the risk of the test can be computed efficiently via Convex Optimization. We discuss sequential and dynamical versions of the tests and outline some applications, including near-optimal in the minimax sense recovery of a linear form of a signal observed via a good o.s. The talk is based on joint research with Alexander Goldenshluger and Anatoli Iouditski.

We are motivated by applications that need rich model classes to represent the application, such as the set of all discrete distributions over large, countably infinite supports. But such rich classes may be too complex to admit estimators that converge to the truth with convergence rates that can be uniformly bounded over the entire model class as the sample size increases (uniform consistency). However, these rich classes may still allow for estimators with pointwise guarantees whose performance can be bounded in a model-dependent way. But the pointwise angle has a drawback as well—estimator performance is a function of the very unknown model that is being estimated, and is therefore unknown. Therefore, even if an estimator is consistent, how well it is doing may not be clear no matter what the sample size.

Departing from the uniform/pointwise dichotomy, we consider a a new framework that modifies the world of pointwise consistent estimators—retaining as far as possible the richness of model classes possible but ensuring that all information needed about the unknown model to evaluate estimator accuracy can be derived from the data. We call this "data-derived pointwise" consistency. As we delve deeper into this formulation, we will see that data-driven consistency shifts focus from the global complexity of model class to the local variation of properties within model classes.

### Thursday, March 19th, 2015

Devavrat Shah, Massachusetts Institute of Technology

During the discussion on Wednesday afternoon, the audience will select a talk to continue on Thursday morning.

We study agnostic active learning, where the goal is to learn a classifier in a pre-specified hypothesis class interactively with as few label queries as possible, while making no assumptions on the true function generating the labels. The main algorithms for this problem are *disagreement-based active learning*, which has a high label requirement, and *margin-based active learning*, which only applies to fairly restricted settings. A major challenge is to find an algorithm which achieves better label complexity, is consistent in an agnostic setting, and applies to general classification problems.

In this paper, we provide such an algorithm. Our solution is based on two novel contributions — a reduction from consistent active learning to confidence-rated prediction with guaranteed error, and a novel confidence-rated predictor.

Feature learning forms the cornerstone for tackling challenging classification problems in domains such as speech, computer vision and natural language processing. While traditionally, features were hand-crafted, the modern approach is to automatically learn good features through deep learning or other frameworks. Feature learning can exploit unlabeled samples, which are usually present in larger amounts, for improved classification performance.

In this talk, we provide a concrete theoretical framework for obtaining informative features which can be used to learn a discriminative model for the label given the input. We show that (higher order) Fisher score functions of the input are informative features, and we provide a differential operator interpretation. We show that given access to these score features, we can obtain the (expected) derivatives of the label as a function of the input (or some model parameters). Having access to these derivatives forms the key to learning complicated discriminative models such as multi-layer neural networks and mixture of classifiers. Thus, the main ingredient for learning discriminative models lies in accurate unsupervised estimation of (higher order) score functions of the input. This is joint work with my students Majid Janzamin and Hanie Sedghi.

Robust inference is an extension of probabilistic inference, where some of the observations are adversarially corrupted. We model it as a zero-sum game between the adversary, who can select a modification rule, and the predictor, who wants to accurately predict the state of nature.

There are two variants of the model, one where the adversary needs to pick the modification rule in advance and one where the adversary can select the modification rule after observing the realized input. For both settings we derive efficient near optimal policy runs in polynomial time. Our efficient algorithms are based on methodologies for developing local computation algorithms.

Based on joint works with Uriel Feige, Aviad Rubinstein, Robert Schapira, Moshe Tennenholtz, Shai Vardi.

Many optimization problems that arise in science and engineering are those in which we only have a stochastic approximation to the underlying problem at hand (e.g. linear regression or other problems where our objective function is a sample average). Such problems highlight some of the challenges we face at the interface of computer science and statistics: should we use a highly accurate algorithm (with costly time and space requirements yet returns numerically precise estimates) or a crude stochastic approximation scheme like stochastic gradient descent (which is light on memory and simple to implement, yet has a poor convergence rate)? The tension arises due to the statistical issue of how much accuracy we actually desire of our algorithm. In the absence of computational constraints, the minimizer on a sample average of observed data — the empirical risk minimizer (ERM) — is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties.

This talk will survey results on both these fronts (stochastic approximation algorithms and recent work on linearly convergent algorithms). Furthermore, this work will provide a simple to implement procedure (applicable to many estimation problems including linear regression) that, under mild regularity conditions, enjoys the following properties:

- The algorithm can be implemented in linear time with just a single pass of the observed data, using space linear in the size of a single sample.
- The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer, on every problem (even considering constant factors).
- The algorithm's performance depends on the initial error at a rate that decreases super-polynomially.

Furthermore, we quantify this rate of convergence, showing that after the size of our dataset exceeds some (problem dependent) threshold, the streaming algorithm rapidly approaches the rate of ERM in expectation.

Faced with massive data, is it possible to trade off statistical risk and computational time? This challenge lies at the heart of large-scale machine learning. I will show in this talk that we can indeed achieve such risk-time tradeoffs by strategically **summarizing** the data, in the unsupervised learning problem of probabilistic k-means, i.e. vector quantization. In particular, there exist levels of summarization for which as the data size **increases**, the running time **decreases**, while a given risk is maintained. Furthermore, there exists a constructive algorithm that provably finds such tradeoff levels. The summarization in question is based on coreset constructions from computational geometry. I will also show that these tradeoffs exist and may be harnessed for a wide range of real data. This adds data summarization to the list of methods, including stochastic optimization, that allow us to perceive data as a resource rather than an impediment.

Over the past fifteen years there has been much work tackling various aspects of the basic question: given independent draws from some fixed distribution (or a pair of distributions), and a statistical or property of interest (e.g. entropy, support size, distance between distributions, distance to a uniform distribution, etc.), how many draws does one need to accurately estimate the property value? The general punchline that has emerged from this body of work is that for most of these properties of interest, one can accurately estimate the property value, and decide whether the distribution possesses certain properties of interest, using far fewer draws from the distribution than would be necessary to actually learn the distribution. In this talk, I will briefly discuss some new results on these questions in the setting where the samples are not drawn independently from a fixed distribution, but instead are drawn from a hidden Markov model.

Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation tasks, involve a large number of variables. As regards the statistical estimation tasks themselves, methods developed over the past decade have been shown to have **statistical or sample complexity** that depends only weakly on the number of parameters, when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their **computational complexity** as well. In this talk, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. One specialization of our algorithm combines greedy coordinate descent with locality sensitive hashing, using which we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large.

Joint work with Inderjit Dhillon and Ambuj Tewari.

I will focus on a core problem in statistical estimation: how to infer information about a distribution based on random samples. I will describe an algorithmic framework that yields new, provably efficient estimators for several natural and well-studied statistical models, including mixtures of structured distribution families (e.g., gaussian, log-concave, etc.). This framework provides a fairly complete picture of the sample and computational complexities for fundamental inference tasks, including density estimation and hypothesis testing.

PAC (proper) learning estimates a distribution in a class by some distribution in the same class to a desired accuracy. Using spectral projections we show that spherical Gaussian mixtures in $d$ dimensions can be PAC learned with $\tilde{O}(d)$ samples, and that the same applies for learning the distribution's parameters. Both results significantly improve previously known bounds.

Joint work with Jayadev Acharya, Ashkan Jafarpour, and Alon Orlitsky.

### Friday, March 20th, 2015

During the discussion on Thursday afternoon, the audience will select a talk to continue on Friday morning.

We describe the intuition behind the proof of our algorithmic characterization of a class of inequalities that generalizes Cauchy-Schwarz, Holder’s inequality, and the monotonicity of the Lp norm, and products of such inequalities. Such inequalities arise in many settings, and our algorithmic characterization provides an efficient algorithm that either proves or provides a witness that falsifies a given inequality in the class.Teaching a course on *Detection and Estimation* from the perspective of *Information Theory* is a fun thing to do. This talk collects some of its nuggets.

We study a recently introduced framework for property testing of probability distributions, by considering distribution testing algorithms that have access to a conditional sampling oracle. This is an oracle that takes as input a subset S of the domain [N]={1,2,...,N} of the unknown probability distribution D and returns a draw from the conditional probability distribution D restricted to S. This model allows considerable flexibility in the design of distribution testing algorithms; in particular, testing algorithms in this model can be adaptive.

The talk will survey a range of results on testing probability distributions with a conditional sampling oracle and compare what is achievable in this framework versus the standard framework (in which algorithms can only draw i.i.d. samples from the unknown distribution). We will see that testinga lgorithms in the conditional sample framework can be significantly more efficient than in the standard framework: for many testing problems where poly(N) many draws from D are required to solve the problem in the standard framework, there are conditional sampling algorithms that require only poly(log N, 1/epsilon) calls to the conditional oracle (or sometimes even just poly(1/epsilon) calls to the conditional oracle independent of N).

Joint work with Clement Canonne and Dana Ron.

Correlation mining is an emerging area of data mining where the objective is to learn patterns of correlation between a large number of observed variables based on a limited number of samples. Correlation mining can be framed as the mathematical problem of reliably reconstructing different attributes of the correlation matrix, or its inverse, from the sample covariance matrix empirically constructed from the data. Reconstructing some attributes requires relatively few samples, e.g., screening for the presence of variables that are hubs of high correlation in a sparsely correlated population, while others require many more samples, e.g., accurately estimating all entries of the inverse covariance matrix in a densely correlated population. We will discuss correlation mining in the context of sampling requirements and as a function of the inference task. This is joint work with Bala Rajaratnam at Stanford University.

For estimating a source's distribution histogram, Orlitsky and co-workers have proposed the pattern maximum likelihood (PML) estimate, which says that one should choose the distribution histogram that has the largest likelihood of producing the pattern of the observed symbol sequence. It can be shown that finding the PML estimate is equivalent to finding the distribution histogram that maximizes the permanent of a certain non-negative matrix.

However, in general this optimization problem appears to be intractable and so one has to compute suitable approximations of the PML estimate. In this talk, we discuss various efficient PML estimate approximation algorithms, along with their connections to the Valiant-Valiant estimate of the distribution histogram. These connections are established by associating an approximately doubly stochastic matrix with the Valiant-Valiant estimate and comparing this approximately doubly stochastic matrix with the doubly stochastic matrices that appear in the free energy descriptions of the PML estimate and its approximations.

Anonymous messaging platforms, such as Secret and Whisper and Yik Yak, have emerged as important social media for sharing one's thoughts without the fear of being judged by friends, family, or the public. Further, such anonymous platforms are crucial in nations with authoritarian governments, where the right to free expression and sometimes the personal safety of the message author depends on anonymity. Whether for fear of judgment or personal endangerment, it is crucial to keep anonymous the identity of the users who initially posted sensitive messages in these platforms.

In this paper, we consider an adversary who has a snapshot of the spread of the messages at a certain time and pose the problem of *designing* a messaging protocol that spreads the message fast while keeping the identity of the source hidden from the adversary. We present an anonymous messaging protocol, which we call *adaptive diffusion*, and show that it spreads fast and achieves *perfect obfuscation* of the source: all users with the message are nearly equally likely to have been the origin of the message.

Compressive sensing, in which a k-sparse signal in an ambient real space of much larger dimension (n) is to be reconstructed via linear measurements is a canonical example of sparse recovery problems. There are others — network tomography (in which the structure of a network is to be probed via measurements that are linear but are constrained by the (unknown) structure of the network), group testing (in which the sparse signal is binary, and the measurements are non-linear (disjunctive)), and phase-recovery (in which a real sparse signal is to be reconstructed via amplitude measurements). for each of these we provide algorithms that are information-theoretically near-optimal (up to at most poly-log(n) factors) in both computational- and sample-complexity, and can be made robust to noise. "Structured sparse measurements" play a key role in this framework, but there's a need to tweak the framework for the idiosyncrasies of each measurement process as well.

We consider a large alphabet channel model with noiseless feedback in which the channel is not known to the encoder or the decoder, and we provide several results to the effect that communication at capacity is possible only if it is possible for the encoder and decoder to completely learn the channel prior to the end of transmission.