We present decision/optimization problems driven by uncertain and online data, and show how analytical models and computational algorithms can be used to achieve solution efficiency and near optimality. We first describe the so-called Distributionally or Likelihood Robust optimization models and their algorithms in dealing stochastic decision problems when the exact uncertainty distribution is unknown but certain statistical moments and samples can be estimated. Then we describe an online combinatorial auction problem using online linear programming technologies. We discuss near-optimal algorithms for solving this surprisingly general class of online problems under the assumption of random order of arrivals and some conditions on the data and size of the problem. These algorithms have a feature of "learning while doing" by dynamically updating a threshold price vector at certain time intervals, where the dual prices learned from revealed data from the previous periods are used to determine the sequential decision making in the current period.

### Tuesday, November 28th, 2017

We describe a convergence acceleration technique for generic optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple linear system, whose solution can be updated online. This acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. We apply this acceleration technique to stochastic algorithms such as SGD, SAGA, SVRG and Katyusha in different settings, and show significant performance gains.

We consider the problem of learning a one-hidden-layer neural network: we assume the input x is from Gaussian distribution and the label $y = a \sigma(Bx) + \xi$, where a is a nonnegative vector and $B$ is a full-rank weight matrix, and $\xi$ is a noise vector. We first give an analytic formula for the population risk of the standard squared loss and demonstrate that it implicitly attempts to decompose a sequence of low-rank tensors simultaneously.

Inspired by the formula, we design a non-convex objective function $G$ whose landscape is guaranteed to have the following properties:

1. All local minima of $G$ are also global minima.

2. All global minima of $G$ correspond to the ground truth parameters.

3. The value and gradient of $G$ can be estimated using samples.

With these properties, stochastic gradient descent on $G$ provably converges to the global minimum and learn the ground-truth parameters. We also prove finite sample complexity results and validate the results by simulations.

Interactivity is an important feature of data analysis, as the choice of questions asked about a dataset often depends on previous interactions with the same dataset. Interaction invalidates most statistical methods for preventing overfitting and ensuring generalization. In this talk I will give an introduction to a recent line of work that aims to understand when and how it is possible to ensure generalization in interactive data analysis. The main theme of this work is that strong notions of algorithmic stability ensure generalization, even in interactive settings.

Statistical machine learning uses data to make inferences about the distribution from which that data was drawn. A key challenge is to prevent overfitting -- that is, inferring a property that only occurs in the data by chance and is not reflected in the underlying distribution. The problem of overfitting is exacerbated by adaptive re-use of data; if we have previously used the same data, then we can no longer assume that it is "fresh" when used again, as the later analysis may now depend on the data via the outcome of the earlier analysis.

This talk will discuss the use of information-theoretic stability as a method to prevent overfitting when data is re-used adaptively. Intuitively, an algorithm is stable if it "learns" very little about each individual datum (which does not preclude learning about the data as a whole). I will discuss how techniques from the differential privacy literature can be used to construct stable algorithms to perform data analysis tasks and how stability protects against overfitting.

We study the convergence of Hamiltonian Monte Carlo, a general (and practical) method for sampling Gibbs distributions. Our analysis shows that the rate of convergence is bounded in terms of natural smoothness parameters of an associated Riemannian manifold. For the problem of uniformly sampling polytopes, we show that RHMC using the log barrier function improves the state of the art. Then we use RHMC together with Gaussian Cooling on manifolds to obtain a faster algorithm for estimating the volume of a polytope. For polytopes in R^n specified by O(n) inequalities, the complexity is O*(n^{5/3}) steps. For volume computation, this is a considerable improvement over the previous best bound of O(n^4) steps. A key ingredient is the proof of a variant of the KLS isoperimetry conjecture for manifolds defined by barrier functions.

We study distributionally robust formulations for stochastic optimization problems and empirical risk minimization. By considering a worst-case notion of risk in non-parametric uncertainty sets around the empirical distribution, including Wasserstein balls and f-divergence balls, we are able to provide certificates of generalization and robustness for optimization procedures. As part of this, we can give principled approaches for selecting confidence regions for stochastic optimization problems, and we can regularize optimization problems by their variance, enabling computationally efficient trading between estimation and approximation error. We also develop efficient solution methods, and show applications to convex and non-convex optimization.

### Wednesday, November 29th, 2017

Topic models are extremely useful to extract latent degrees of freedom form large unlabeled datasets. Variational Bayes algorithms are the approach most

commonly used by practitioners to learn topic models. Their appeal lies in the promise of reducing the problem of variational inference to an optimization problem.

I will show that, even within an idealized Bayesian scenario, variational methods display an instability that can lead to misleading results.

In this talk, we will describe a few connections between deterministic inequalities that have an "optimization flavor" and probabilistic inequalities for martingales or i.i.d. processes. We will discuss the Burkholder special functions and their role in attaining optimal Rademacher-based regret bounds in online learning.

Recent years have seen astounding progress both in theory and practice of nonconvex optimization. Carefully designed nonconvex procedures simultaneously achieve optimal statistical accuracy and computational efficiency for many problems. Due to the highly nonconvex landscape, the state-of-the-art results often require proper regularization procedures (e.g. trimming, projection, or extra penalization) to guarantee fast convergence. For vanilla algorithms, however, the prior theory usually suggests conservative step sizes in order to avoid overshooting.

This talk uncovers a striking phenomenon: even in the absence of explicit regularization, nonconvex gradient descent enforces proper regularization automatically and implicitly under a large family of statistical models. In fact, the vanilla nonconvex procedure follows a trajectory that always falls within a region with nice geometry. This "implicit regularization" feature allows the algorithm to proceed in a far more aggressive fashion without overshooting, which in turn enables faster convergence. We will discuss several concrete fundamental problems including phase retrieval, matrix completion, blind deconvolution, and recovering structured probability matrices, which might shed light on the effectiveness of nonconvex optimization for solving more general structured recovery problems.

Consider the matrix completion problem: Given only a subset of the entries of a low-rank matrix, recover the remaining ones. How many entries do we need to see to complete a matrix of rank r? How can such entries be sampled? If we sample entries adaptively, do we need to see as many samples? The last question is still quite open, but we propose an two-phase sampling algorithm and show precisely how adaptive sampling can require fewer samples for low-rank completion under general conditions on the matrix leverage scores.

### Thursday, November 30th, 2017

In clustering and other unsupervised tasks, one often seeks information about the data from the top eigenvectors of a graph-based operator. However, these may not always be the informative eigenvectors, due to various outliers, resulting from degree variations, tangles and more. Graph powering is a technique that tries to modify the graph to get rid of such outliers and bring back the informative eigenvectors at the top. We will argue that powering can handle both stochastic block models, and some 'geometric block model' where short loops are much more present. Joint work with C. Sandon and E. Boix.

We provide new approximation guarantees for greedy low rank matrix estimation under standard assumptions of restricted strong convexity and smoothness. Our novel analysis also uncovers previously unknown connections between the low rank estimation and combinatorial optimization, so much so that our bounds are reminiscent of corresponding approximation bounds in submodular maximization. Additionally, we provide also provide statistical recovery guarantees. Finally, we present empirical comparison of greedy estimation with established baselines on two real-world problems.

We consider the problem of bandit optimization, inspired by stochastic optimization and online learning problems with bandit feedback. In this problem, the objective is to minimize a global loss function of all the actions, not necessarily a cumulative loss. This framework allows us to study a very general class of problems, with applications in statistics, machine learning, and other fields. To solve this problem, we analyze the Upper-Confidence Frank-Wolfe algorithm, inspired by techniques for bandits and convex optimization. We give theoretical guarantees for the performance of this algorithm over various classes of functions, and discuss the optimality of these results.

It is becoming increasingly clear that implicit regularization afforded by the optimization algorithms play a central role in machine learning, and especially so when using large, deep, neural networks. We have a good understanding of the implicit regularization afforded by stochastic approximation algorithms, such as SGD, and as I will review, we understand and can characterize the implicit bias of different algorithms, and can design algorithms with specific biases. But in this talk I will focus on implicit biases of deterministic algorithms on underdetermined problem. In an effort to uncover the implicit biases of gradient-based optimization of neural networks, which holds the key to their empirical success, I will discuss recent work on implicit regularization for matrix factorization and for linearly separable problems with monotone decreasing loss functions.

In the last few years we have seen remarkable progress in machine learning in large part due to neural networks trained on large data. While shallow methods, in particular methods based on positive definite kernels, show excellent results on smaller data sets, their performance suffers at large scale. It turns out that some of this lack of performance can be attributed to the computational limitations of kernel learning. The span of smooth kernel functions (such as Gaussians) is dense in the space of all functions and theoretically any function can be approximated arbitrarily well. However, the space of functions reachable within a fixed computational budget is quite limited. This can be formalized in terms of the fat shattering dimension and other measures of complexity. This computational barrier turns out to be related to the smoothness of the kernel and to the types of computations available for large data. One way to address this issue is by constructing new, less smooth kernels, designed for efficient computation. The resulting algorithms show significant improvements over the state of the art in large scale kernel methods, typically at a fraction of the computational budget reported in the literature. Finally, we hope that deeper understanding of kernel-based learning can also shed light on the success of neural networks.

Based on joint work with Siyuan Ma.

We consider the "Best Set" combinatorial pure exploration problem. In this problem, we are given n arms with unknown reward distributions, as well as a family F of feasible subsets over the arms. Our goal is to identify the feasible subset in F with the maximum total mean using as few samples as possible. The problem generalizes the classical top-k arm identification problems.

We discuss an instance-wise lower bound for the sample complexity of the problem, as well as a sampling algorithm that matches the lower bound up to a factor of ln|F| (which we show is existentially tight). We also discuss poltnomial-time implementations of these sampling algorithms, and other issues.

Joint work with Lijie Chen, Jian Li, Mingda Qiao, and Ruosong Wang.

### Friday, December 1st, 2017

From an information-theoretic viewpoint, randomized statistical decision procedures are channels (or Markov kernels) that map observations to probability distributions over actions. Any sufficiently complex statistical decision procedure is a composition of simpler procedures, and it is of both theoretical and practical interest to obtain a precise characterization of the overall procedure from local descriptions of the constituent subprocedures. In this talk, I will show how this problem can be addressed using information-theoretic methods.

As the scale and complexity of machine learning algorithms continue to increase, methods for fast and efficient optimization are becoming increasingly important. Gradient methods have proven to be central both in theory and in practice for a wide range of machine learning tasks. A thorough understanding of generalization performance of iterative algorithms is as crucial as those for their convergence. Algorithmic stability is a powerful tool for analyzing generalization performance of learning algorithms, leading to better bounds than classical uniform convergence analysis in many cases.

In this talk, we propose a new stability-based framework for deriving convergence lower bounds for a general class of iterative optimization algorithms. In particular, we characterize the trade-off between algorithmic stability and optimal convergence rate of iterative methods through information theoretic arguments. And it is shown that stable algorithms can not converge too fast. Further, we derive algorithmic stability bounds in the convex and smooth setting for many first order methods such as gradient descent and momentum methods. Applying this framework, we translate these stability bounds into lower bounds for convergence, of which some are believed to be new. These lower bounds match the well established upper bounds up to constants. Finally, the stability bounds observed in several numerical experiments agree with our theoretical findings. (This talk is based on joint work with Yuansi Chen (Statistics, UCB) and Chi Jin (CS, UCB)).

Traditional approaches to differential privacy assume a fixed privacy requirement ϵ for a computation, and attempt to maximize the accuracy of the computation subject to the privacy constraint. As differential privacy is increasingly deployed in practical settings, it may often be that there is instead a fixed accuracy requirement for a given computation and the data analyst would like to maximize the privacy of the computation subject to the accuracy constraint. This raises the question of how to find and run a maximally private empirical risk minimizer subject to a given accuracy requirement. We propose a general ""noise reduction"" framework that can apply to a variety of private empirical risk minimization (ERM) algorithms, using them to ""search"" the space of privacy levels to find the empirically strongest one that meets the accuracy constraint, incurring only logarithmic overhead in the number of privacy levels searched. The privacy analysis of our algorithm leads naturally to a version of differential privacy where the privacy parameters are dependent on the data, which we term ex-post privacy, and which is related to the recently introduced notion of privacy odometers. We also give an ex-post privacy analysis of the classical AboveThreshold privacy tool, modifying it to allow for queries chosen depending on the database. Finally, we apply our approach to two common objectives, regularized linear and logistic regression, and empirically compare our noise reduction methods to (i) inverting the theoretical utility guarantees of standard private ERM algorithms and (ii) a stronger, empirical baseline based on binary search.

Joint work with Seth Neel, Aaron Roth, Bo Waggoner, and Steven Wu.