No abstract available.

### Monday, October 2nd, 2017

First-order methods provide a simple general framework to quickly recover approximate solutions to smooth and non-smooth convex problems. In the context of linear programming, the running time of these algorithms has a quadratic dependence on the largest entry in the constraint matrix, i.e., the width of the LP. While this dependence is inevitable for algorithms solving general non-smooth convex problems, it can surprisingly be eliminated in the context of packing and covering LPs, where the constraint matrices are non-negative, yielding truly polynomial-time algorithms.

In this talk, I will review some of the rich history of width-independent algorithms, discuss some recent progress through an optimization perspective and highlight a number of open problems. Conceptually, I will attempt to provide a simple intuition of why non-negativity enables width-independence.

We present an algorithmic connection between nature and humans arising in entirely different contexts. The first is the famous Iteratively Reweighted Least Squares (IRLS) algorithm used for the sparse recovery problem, while the second is the dynamics of a slime mold. Despite its simplicity, the convergence of the IRLS method has been shown only for a certain regularization of it, and remains an open problem. We show that the two dynamics can be obtained as projections of a single dynamical system in higher dimensions. Subsequently, we take inspiration from the analysis of the slime mold dynamics to show convergence and obtain complexity bounds for a damped version of the IRLS algorithm.

Joint work with Damian Straszak.
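As background, a minimal numpy sketch of a damped IRLS iteration for the sparse recovery problem min ||x||_1 subject to Ax = b (an illustration of the classical scheme; the specific damping rule is an assumption, not the talk's algorithm):

```python
import numpy as np

def irls_sparse_recovery(A, b, num_iters=100, damping=1e-8):
    """Damped IRLS sketch for min ||x||_1 subject to Ax = b.
    Each round solves a weighted least-norm problem in closed form;
    the damping term keeps the weights bounded away from zero."""
    n = A.shape[1]
    w = np.ones(n)                       # reweighting vector
    x = np.zeros(n)
    for _ in range(num_iters):
        W = np.diag(w)
        # weighted least-norm solution: x = W A^T (A W A^T)^{-1} b
        x = W @ A.T @ np.linalg.solve(A @ W @ A.T, b)
        w = np.abs(x) + damping          # damped reweighting
    return x
```

On well-posed instances the weights concentrate on the support of a sparse solution, driving the other coordinates toward zero.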

In the last few years, Michael Cohen has developed faster algorithms for many important problems in optimization, such as Laplacian and directed Laplacian system solving, min cost flow, geometric median, matrix scaling, linear/lp regression, and clustering.

In the first few minutes of the talk, I will review some of the great achievements in optimization by Michael.

Then, I will present yet another of his great achievements: the first input sparsity polynomial time algorithm for lp regression for 1 < p < inf. Except for l2 regression, this is the first problem that we can solve in input sparsity time with running time polynomial in log(1/eps).

Joint work with Sébastien Bubeck, Michael Cohen and Yuanzhi Li.

Chordal graphs have deep roots in discrete optimization and also arise naturally in several areas of continuous optimization, including sparse matrix algorithms, positive semidefinite and Euclidean distance matrix completion, and the analysis of partial separability. In semidefinite optimization, properties of chordal graphs have been used in algorithms that exploit sparsity in the coefficients. Large semidefinite optimization problems often have coefficient matrices with a common, 'aggregate' sparsity pattern. The aggregate sparsity has important implications for the structure (sparsity and rank) of the solutions of the problem.

The talk will focus on applications of classical results in sparse matrix theory, and, in particular, the fact that for chordal sparsity patterns it is easy to compute zero-fill Cholesky factorizations, maximum-determinant and minimum-rank positive semidefinite completions, and minimum-dimension Euclidean distance matrix completions. These properties are very useful in theoretical analysis and in the development of algorithms (interior-point, first-order, and decomposition algorithms). Data structures and techniques from the sparse matrix literature, such as elimination trees, supernodal elimination trees, and multifrontal algorithms will play a unifying role in the derivation of the key results and algorithms.
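A tiny numerical illustration of the zero-fill property: an arrowhead matrix has a star sparsity graph, which is chordal, and with the leaves ordered first its Cholesky factor introduces no fill-in outside the original pattern (a toy example, not tied to any particular algorithm from the talk):

```python
import numpy as np

# An arrowhead matrix has a star sparsity graph, which is chordal: every leaf
# is simplicial, so ordering the leaves before the hub is a perfect
# elimination ordering, and Cholesky factorization produces no fill-in.
n = 6
A = 4.0 * np.eye(n)
A[-1, :] = 1.0
A[:, -1] = 1.0
A[-1, -1] = 4.0                          # positive definite arrowhead matrix

L = np.linalg.cholesky(A)                # A = L L^T, L lower triangular

pattern_A = np.tril(A) != 0              # lower-triangular pattern of A
pattern_L = np.abs(L) > 1e-12            # nonzero pattern of the factor
```

Checking `pattern_L` against `pattern_A` confirms that every nonzero of the factor already appears in the lower triangle of A.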

Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It becomes increasingly important to understand why these algorithms can converge to high-quality solutions on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that "all (or most) local optima are (approximately) global optima", and thus they can be solved efficiently by local search algorithms. However, establishing such property can be difficult.

In this talk, I will showcase the mathematical techniques used to establish results of this type. As a warm-up, I will start with the top eigenvector problem and discuss its relation to the natural non-convex objective function for matrix completion. Then, we will focus on analyzing the optimization landscape of tensor decomposition using the Kac-Rice formula from random field theory and geometry. I will end with some conjectures along this line of research.
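For the warm-up problem, a short demonstration that local search succeeds on the non-convex top eigenvector objective: projected gradient ascent on the Rayleigh quotient over the sphere reaches the global optimum (a generic sketch; the step size and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
M = B @ B.T                              # random symmetric PSD matrix

# Projected gradient ascent on the non-convex problem max_{||x||=1} x^T M x.
# Every critical point on the sphere is an eigenvector, and the only
# maximizers are top eigenvectors, so local search finds the global optimum.
x = rng.standard_normal(5)
x /= np.linalg.norm(x)
for _ in range(2000):
    x = x + 0.1 * (M @ x)                # ascent step on the Rayleigh quotient
    x /= np.linalg.norm(x)               # project back onto the unit sphere

rayleigh = x @ M @ x
top_eig = np.linalg.eigvalsh(M)[-1]
```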

### Tuesday, October 3rd, 2017

Recent advances linking optimization, machine learning, and systems research have enabled the rapid solution of data analysis problems on terabytes of data. In this talk, I will survey both the algorithmic and computational state-of-the-art, describing some of the bottlenecks and complexities not captured by simple flop or message counting. I will also highlight new architectures and computing substrates that could potentially enable scalable implementations of algorithms with superlinear scaling, widening the scope of optimization problems we can solve on large data volumes.

Block coordinate descent (BCD) methods are widely used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, and amenability to parallelization. Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. We explore these three building blocks and propose variations for each that can lead to significantly faster BCD methods. We support all of our findings with numerical results for the classic machine learning problems of logistic regression, multi-class logistic regression, support vector machines, and label propagation.
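A minimal BCD sketch on least squares that makes the three building blocks concrete, with one baseline choice for each (fixed contiguous partition, uniform random selection, gradient update with the exact per-block Lipschitz constant); these are illustrative defaults, not the talk's proposed variations:

```python
import numpy as np

def bcd_least_squares(A, b, block_size=2, num_iters=2000, seed=0):
    """Randomized BCD sketch for min_x 0.5*||Ax - b||^2."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # block partitioning strategy: fixed contiguous blocks
    blocks = [np.arange(i, min(i + block_size, n)) for i in range(0, n, block_size)]
    x = np.zeros(n)
    r = -b                                         # residual r = Ax - b, kept up to date
    for _ in range(num_iters):
        blk = blocks[rng.integers(len(blocks))]    # block selection rule: uniform
        g = A[:, blk].T @ r                        # gradient restricted to the block
        L = np.linalg.norm(A[:, blk], 2) ** 2      # exact block Lipschitz constant
        step = g / L                               # block update rule: gradient step
        x[blk] -= step
        r -= A[:, blk] @ step                      # incremental residual update
    return x
```

Greedy selection rules, smarter partitions, and matrix (Newton-like) block updates slot into the same skeleton.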

We will present a very general trust-region framework for unconstrained stochastic optimization which is based on the standard deterministic trust-region framework. In particular, this framework retains desirable features such as the step acceptance criterion, trust-region adjustment, and the ability to utilize second-order models.

We show that this framework has favorable convergence properties when the true objective function is smooth.

We then show that this framework can be used to directly optimize the 0-1 loss function or the AUC of a classifier, objectives which are considered NP-hard in the finite-sum case.
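For reference, the deterministic skeleton the stochastic framework builds on, with the step acceptance criterion and trust-region adjustment made explicit (a textbook Cauchy-point sketch, not the talk's algorithm; the acceptance and resizing thresholds are conventional choices):

```python
import numpy as np

def trust_region_minimize(f, grad, hess, x0, delta=1.0, tol=1e-8, max_iters=200):
    """Deterministic trust-region loop with Cauchy-point steps (a sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g, H = grad(x), hess(x)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        # Cauchy point: minimize the quadratic model along -g inside the region
        gHg = g @ H @ g
        tau = 1.0 if gHg <= 0 else min(1.0, gnorm ** 3 / (delta * gHg))
        p = -(tau * delta / gnorm) * g
        pred = -(g @ p + 0.5 * p @ H @ p)          # model-predicted decrease
        rho = (f(x) - f(x + p)) / pred             # actual-to-predicted ratio
        if rho > 0.1:                              # step acceptance criterion
            x = x + p
        # trust-region adjustment based on model quality
        delta = 2.0 * delta if rho > 0.75 else (0.5 * delta if rho < 0.25 else delta)
    return x
```

The stochastic framework replaces f, g, and H with sufficiently accurate random estimates while keeping this same accept/adjust logic.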

Unconstrained optimization of a smooth nonconvex objective over many variables is a classic problem in optimization. Several effective techniques have been proposed over the years, along with results about global and local convergence. There has been an upsurge of interest recently in techniques with good global complexity properties. (This interest is being driven largely by researchers in machine learning, who want to solve the nonconvex problems arising from neural network training and robust statistics, but it has roots in the optimization literature.) In this talk we describe the algorithmic tools that can be used to design methods with appealing practical behavior as well as provably good global convergence properties. These tools include the conjugate gradient and Lanczos algorithms, Newton's method, cubic regularization, trust regions, and accelerated gradient. We show how these elements can be assembled into a comprehensive method, and compare a number of proposals that have been made to date. If time permits, we will consider the behavior of first-order methods in the vicinity of saddle points, showing that accelerated gradient methods are as unlikely as gradient descent to converge to saddle points, and escape from such points faster.

This talk presents joint work with Clement Royer and Mike O'Neill (both of U Wisconsin-Madison).

In the first part of the talk we study the problem of minimizing a noisy function when derivatives are not available. In order to obtain scalability, the algorithm updates a quadratic model of the objective in O(n) work using noise estimation techniques. Next, we discuss a technique for dynamically increasing the accuracy of gradient approximations to achieve optimal complexity as well as efficiency in practice.
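A small sketch of the underlying idea of noise-aware finite differencing: when function values carry noise of known level, the difference interval should balance truncation bias against noise amplification (the specific constant in h is an illustrative choice, not the talk's method):

```python
import numpy as np

def fd_gradient(f, x, noise_level):
    """Forward-difference gradient of a noisy f in n+1 evaluations (O(n) work).
    The interval h ~ sqrt(noise) balances the O(h) truncation bias against
    the O(noise/h) error from noisy function values (a sketch)."""
    h = 2.0 * np.sqrt(noise_level) if noise_level > 0 else 1e-8
    n = len(x)
    g = np.empty(n)
    fx = f(x)
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        g[i] = (f(x + e) - fx) / h
    return g
```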

A simple yet powerful idea for dealing with large-scale problems is to decompose them into smaller subproblems that are much easier to solve. We have seen the success of this idea in methods such as SGD and coordinate descent. It is natural to apply the same idea to linearly constrained problems via an algorithm called ADMM (Alternating Direction Method of Multipliers). Unfortunately, when there are three blocks, ADMM was previously shown to be possibly divergent even for solving a linear system. In this talk, we discuss a new variant of ADMM, called RP-ADMM (Randomly Permuted ADMM), and prove its expected convergence for solving linear systems.

A main technical lemma is that the eigenvalues of the expected iteration matrix of RP-CD (Randomly Permuted Coordinate Descent) lie in (-1/3, 1). As a byproduct, the analysis of the convergence speed of RP-CD reduces to an open question on a matrix AM-GM inequality. Motivated by the proof technique, we also propose a new update rule, named the Bernoulli-randomized rule. We will discuss some interesting numerical observations about RP-ADMM and a few other variants, which further validate the effectiveness of random permutation.
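RP-CD itself is easy to state: it is Gauss-Seidel-style coordinate descent that draws a fresh random permutation of the coordinates each epoch. A minimal sketch for a symmetric positive definite system:

```python
import numpy as np

def rp_cd_solve(A, b, num_epochs=200, seed=0):
    """RP-CD sketch: coordinate descent on 0.5 x^T A x - b^T x for symmetric
    positive definite A, sweeping the coordinates in a fresh random
    permutation each epoch."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    for _ in range(num_epochs):
        for i in rng.permutation(n):          # random permutation each epoch
            # exact minimization over coordinate i (Gauss-Seidel update)
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x
```

RP-ADMM applies the same random-permutation idea to the block updates of multi-block ADMM.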

### Wednesday, October 4th, 2017

We will discuss how sketching tools impacted the design of fast algorithms in a variety of settings. The sketching concept was originally proposed in the context of streaming algorithms, as a computational generalization of the classic dimension reduction method, for the purpose of reducing *space* usage. However, in recent years, sketching has been increasingly used to speed-up algorithms in, e.g., high-dimensional geometry, numerical linear algebra, iterative methods, and linear programs. A typical application of sketching proceeds by reducing either the size of the input or the size of some intermediate computations to be iterated over by an algorithm. In this talk we will survey some examples of such applications of sketching to obtain fast(er) algorithms.
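The canonical example of reducing the input size is sketch-and-solve for overdetermined least squares: compress the m-row problem to k rows with a random sketching matrix and solve the small problem (a minimal sketch using a dense Gaussian sketch; faster structured sketches exist):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.01 * rng.standard_normal(m)

# Sketch-and-solve: replace the m-row least-squares problem by a k-row
# sketched problem, k << m, and solve that instead.
k = 200                                        # sketch size
S = rng.standard_normal((k, m)) / np.sqrt(k)   # oblivious Gaussian sketch
x_sketch = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
```

With k moderately larger than n, the sketched solution attains a residual within a small factor of the optimum.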

We show how to compute a *relative-error* low-rank approximation to any positive semidefinite (PSD) matrix in *sublinear time*, i.e., for any $n \times n$ PSD matrix $A$, in $\tilde O(n \cdot \mathrm{poly}(k/\epsilon))$ time we output a rank-$k$ matrix $B$, in factored form, for which $\|A-B\|_F^2 \leq (1+\epsilon)\|A-A_k\|_F^2$, where $A_k$ is the best rank-$k$ approximation to $A$. When $k$ and $1/\epsilon$ are not too large compared to the sparsity of $A$, our algorithm does not need to read all entries of the matrix. Hence, we significantly improve upon previous $\mathrm{nnz}(A)$ time algorithms based on oblivious subspace embeddings, and bypass an $\mathrm{nnz}(A)$ time lower bound for general matrices (where $\mathrm{nnz}(A)$ denotes the number of non-zero entries in the matrix). We prove time lower bounds for low-rank approximation of PSD matrices, showing that our algorithm is close to optimal. Finally, we extend our techniques to give sublinear time algorithms for low-rank approximation of $A$ in the (often stronger) spectral norm metric $\|A-B\|_2^2$ and for ridge regression on PSD matrices.

Joint work with Cameron Musco.
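To see why PSD structure enables sublinear access, consider the classical Nyström method (an illustration only, not the authors' algorithm): sampling s columns of a PSD matrix yields an approximation that reads only an n x s submatrix, and is exact when the matrix has low rank:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 300, 5
G = rng.standard_normal((n, r))
A = G @ G.T                                # rank-r PSD matrix

# Nystrom approximation from s sampled columns: B = C W^+ C^T.
# Only an n x s submatrix of A is ever read, not all n^2 entries.
s = 40
idx = rng.choice(n, size=s, replace=False)
C = A[:, idx]                              # sampled columns
W = C[idx, :]                              # s x s intersection block
B = C @ np.linalg.pinv(W) @ C.T
```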

Given a set of points in d dimensions, imagine putting a Gaussian distribution around each of them. How quickly can we evaluate the sum of these Gaussian densities at a new point? This computational problem (and its generalization for densities other than the Gaussian) is called kernel density estimation. This problem arises as a basic primitive in statistics (non-parametric density estimation), machine learning (kernel methods) and scientific computing (physical simulations).

The batch version of this question (compute the sum of n kernels at m given query points) is addressed by the celebrated fast multipole method from the late 1980s, which has linear or near-linear complexity for points in 2 and 3 dimensions. The high-dimensional case has been challenging because, at a high level, typical space partitioning approaches have an exponential dependence on the dimension.

In this talk, I will show that locality sensitive hashing (introduced in the late 90s for the approximate nearest neighbor problem in high dimensions) can be adapted to devise unbiased estimators for kernel density in high dimensions.
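For concreteness, the exact kernel density sum and the simplest unbiased estimator of it, uniform subsampling (a baseline sketch; the LSH-based estimators from the talk replace the uniform sampler with one biased toward points near the query):

```python
import numpy as np

def kde_gaussian(points, query, bandwidth=1.0):
    """Exact Gaussian kernel density at `query`: the naive O(n*d) baseline,
    (1/n) * sum_i exp(-||query - x_i||^2 / (2 h^2))."""
    d2 = np.sum((points - query) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2.0 * bandwidth ** 2)))

def kde_sampled(points, query, bandwidth=1.0, num_samples=200, seed=0):
    """Unbiased estimate from a uniform random subsample. Its variance blows
    up when the true density is small, which is what hashing-based samplers
    are designed to fix."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(len(points), size=num_samples)
    return kde_gaussian(points[idx], query, bandwidth)
```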

Consider Markov decision processes with S states and A actions per state.

We provide the first set of nearly-linear and sublinear running time upper bounds and a nearly matching lower bound for the discounted Markov decision problem. For the upper bounds, we propose a randomized linear programming algorithm that takes advantage of the linear duality between value and policy functions. The method achieves nearly-linear run time for general problems, and sublinear run time for ergodic decision processes when the input is given in specific ways. To further improve the complexity, we propose a randomized value iteration method that leverages the contractive property of the Bellman operator and variance reduction techniques. For the lower bound, we show that any algorithm needs at least Omega(S^2 A) run time to get a constant-error approximation to the optimal policy. Surprisingly, this lower bound reduces to Omega(SA/epsilon) when the input is specified in suitable data structures. The upper and lower bounds suggest that the complexity of MDPs depends on the data structures of the input. These results provide new complexity benchmarks for solving stochastic dynamic programs.
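As a baseline, classic value iteration, which exposes the contraction property of the Bellman operator that the randomized method accelerates (a standard sketch, not the talk's algorithm):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, num_iters=500):
    """Classic value iteration for a discounted MDP.
    P has shape (A, S, S): P[a, s, s'] is a transition probability.
    R has shape (S, A): reward for taking action a in state s.
    The Bellman operator is a gamma-contraction in the max norm, so the
    iterates converge geometrically to the optimal value function."""
    S = P.shape[1]
    v = np.zeros(S)
    for _ in range(num_iters):
        q = R + gamma * np.einsum('aij,j->ia', P, v)   # Q-values, shape (S, A)
        v = q.max(axis=1)                               # Bellman update
    return v
```

Each iteration costs O(S^2 A), which is what the variance-reduced randomized updates improve upon.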

In this talk, we consider a fundamental class of convex matrix optimization problems with low-rank solutions. We show it is possible to solve these problems using far less memory than the natural size of the decision variable when the problem data has a concise representation. Our proposed method, SketchyCGM, is the first algorithm to offer provable convergence to an optimal point with an optimal memory footprint. SketchyCGM modifies a standard convex optimization method — the conditional gradient method — to work on a sketched version of the decision variable, and can recover the solution from this sketch. In contrast to recent work on non-convex methods for this problem class, SketchyCGM is a convex method, and our convergence guarantees do not rely on statistical assumptions.

Based on joint work with Alp Yurtsever, Volkan Cevher, and Joel Tropp.
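For context, the plain conditional gradient method over a nuclear-norm ball: each iterate changes by a rank-1 matrix (the top singular pair of the gradient), which is the structure that lets SketchyCGM store only a sketch of the iterate (this is the standard method, not the authors' memory-optimal variant):

```python
import numpy as np

def frank_wolfe_nuclear(grad_f, shape, tau=1.0, num_iters=500):
    """Conditional gradient over the nuclear-norm ball {X : ||X||_* <= tau}.
    The linear minimization oracle is a single rank-1 matrix, computed from
    the top singular pair of the gradient (a sketch)."""
    X = np.zeros(shape)
    for t in range(num_iters):
        G = grad_f(X)
        U, s, Vt = np.linalg.svd(G)
        S = -tau * np.outer(U[:, 0], Vt[0])    # LMO: argmin_{||S||_* <= tau} <G, S>
        X = X + 2.0 / (t + 2.0) * (S - X)      # standard step size 2/(t+2)
    return X
```

Because every update is rank-1, SketchyCGM can apply the same update to a low-dimensional sketch of X and reconstruct a low-rank solution at the end.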

### Thursday, October 5th, 2017

The focus of this talk is on nonconvex optimization, especially in the large-scale setting. Our starting point is stochastic gradient descent (SGD), a cornerstone of machine learning that was introduced over 60 years ago! Then we move on to an important new development of recent years: stochastic variance reduced methods. These methods excel in settings where more than one pass through the training data is allowed and converge faster than SGD both in theory and practice. Typical theoretical guarantees ensure convergence to stationary points; however, we will also talk about recent work on methods for escaping saddle points and targeting local minima. Beyond the usual SGD setup, I will also comment on some fruitful setups where the nonconvexity has sufficient geometric structure to permit even efficient global optimization. Ultimately, the aim of my talk is to provide a brief survey of this fast-moving area. We hope to unify and simplify its presentation, outline common pitfalls, and raise awareness about open research challenges.
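The core variance reduction idea can be sketched in a few lines (SVRG-style; the epoch length and step size below are illustrative choices):

```python
import numpy as np

def svrg(grads, x0, step=0.1, num_epochs=50, seed=0):
    """SVRG sketch: `grads[i](x)` returns the gradient of the i-th component
    function. Each epoch recomputes the full gradient at a snapshot and uses
    it to de-bias the cheap stochastic steps, removing the variance that
    forces plain SGD to decay its step size."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_epochs):
        x_snap = x.copy()
        full_grad = sum(g(x_snap) for g in grads) / n   # one full pass per epoch
        for _ in range(2 * n):                           # inner stochastic loop
            i = rng.integers(n)
            x = x - step * (grads[i](x) - grads[i](x_snap) + full_grad)
    return x
```

The correction term `grads[i](x_snap) - full_grad` has zero mean, so each inner step is still an unbiased gradient estimate, but with variance that vanishes as x approaches the snapshot.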

Computing the stationary distribution of a Markov chain is one of the most fundamental problems in optimization. It lies at the heart of numerous computational tasks, including computing personalized PageRank vectors, evaluating the utility of policies in Markov decision processes, and solving asymmetric diagonally dominant linear systems. Despite the ubiquity of these problems, until recently the fastest known running times for computing the stationary distribution either depended polynomially on the mixing time or appealed to generic linear system solving machinery, and therefore ran in super-quadratic time.

In this talk I will present recent results showing that the stationary distribution and related quantities can all be computed in almost linear time. I will present new iterative methods for extracting structure from directed graphs and show how they can be tailored to achieve this new running time. Moreover, I will discuss connections between this work and recent developments in solving Laplacian systems and optimizing over directed graphs.

This talk reflects joint work with Michael B. Cohen (MIT), Jonathan Kelner (MIT), John Peebles (MIT), Richard Peng (Georgia Tech) and Adrian Vladu (MIT).
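The classical baseline being improved upon is power iteration, whose iteration count scales with the mixing time of the chain (a minimal sketch):

```python
import numpy as np

# Power iteration for the stationary distribution: repeat pi <- pi P until
# pi P = pi. The number of iterations needed scales with the mixing time of
# the chain -- exactly the dependence the new methods avoid.
rng = np.random.default_rng(0)
S = 6
P = rng.random((S, S)) + 0.1             # strictly positive: irreducible, aperiodic
P /= P.sum(axis=1, keepdims=True)        # row-stochastic transition matrix

pi = np.full(S, 1.0 / S)                 # start from the uniform distribution
for _ in range(1000):
    pi = pi @ P
```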

Many problems of contemporary interest in signal processing and machine learning involve highly non-convex optimization problems. While nonconvex problems are known to be intractable in general, simple local search heuristics such as (stochastic) gradient descent are often surprisingly effective at finding global optima on real or randomly generated data. In this talk I will discuss some results explaining the success of these heuristics focusing on two problems.

The first problem is about learning the optimal weights of the shallowest of neural networks: one consisting of a single Rectified Linear Unit (ReLU). I will discuss this problem in the high-dimensional regime, where the number of observations is smaller than the number of ReLU weights. I will show that projected gradient descent on a natural least-squares objective, when initialized at 0, converges at a linear rate to globally optimal weights with a number of samples that is optimal up to numerical constants. The second problem focuses on maximizing continuous submodular functions that emerge in a variety of areas in machine learning, including utility functions in matrix approximations and network inference. Despite the apparent lack of convexity/concavity in such functions, I will show that stochastic projected gradient methods can provide strong approximation guarantees for maximizing continuous submodular functions subject to convex constraints. In particular, this result allows us to approximately maximize discrete, monotone submodular optimization problems via projected gradient descent on a continuous relaxation, directly connecting the discrete and continuous domains.

Parts of this talk are based on joint work with Hamed Hassani and Amin Karbasi.
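A toy version of the first problem: gradient descent on the least-squares loss of a single planted ReLU. This illustration deviates from the talk's setting in two assumed ways: it uses more samples than dimensions, and a moment-based initialization (2/n * X^T y, whose expectation is the planted weight vector for Gaussian inputs) rather than zero initialization with a projection step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.maximum(X @ w_star, 0.0)          # labels from a planted single ReLU

# Gradient descent on f(w) = (1/2n) * sum_i (relu(x_i . w) - y_i)^2,
# starting from a moment-based estimate correlated with w_star.
w = (2.0 / n) * X.T @ y
for _ in range(2000):
    z = X @ w
    grad = X.T @ ((np.maximum(z, 0.0) - y) * (z > 0)) / n
    w -= 0.5 * grad
```

Despite the non-convexity of the loss, the iterates recover the planted weights on this well-conditioned instance.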

No abstract available.

No abstract available.

No abstract available.

No abstract available.

### Friday, October 6th, 2017

Much empirical work in deep learning has gone into avoiding vanishing gradients, a necessary condition for the success of stochastic gradient methods. This raises the question of whether we can provably rule out vanishing gradients for some expressive model architectures. I will point out several obstacles, as well as positive results for some simplified architectures, specifically linearized residual networks and linear dynamical systems.

Based on joint work with Ma and Recht.

Submodular function minimization is a fundamental optimization problem that arises in several applications in machine learning and computer vision. The problem is known to be solvable in polynomial time, but general purpose algorithms have high running times and are unsuitable for large-scale problems. Recent work aims to obtain very practical algorithms for minimizing functions that are sums of "simple" functions. This class of functions provides an important bridge between functions with a rich combinatorial structure -- notably, the cut function of a graph -- and general submodular functions, and it brings along powerful combinatorial structure reminiscent of graphs as well as a fair bit of modeling power that greatly expands its applications. In this talk, we describe recent progress on designing very efficient algorithms for minimizing decomposable functions and understanding their combinatorial structure.

We design a stochastic algorithm to train any smooth neural network to eps-approximate local minima, using O(eps^{-3.25}) backpropagations. The best previous result, essentially O(eps^{-4}), was achieved by SGD.

More broadly, the algorithm finds eps-approximate local minima of any smooth nonconvex function at rate O(eps^{-3.25}), with only oracle access to stochastic gradients and Hessian-vector products.