The introduction of randomization in the design and analysis of algorithms for matrix computations (such as matrix multiplication, least-squares regression, the Singular Value Decomposition (SVD), etc.) over the past 20 years provided a new paradigm and a complementary perspective to traditional numerical linear algebra approaches. These novel approaches were motivated by technological developments in many areas of scientific research that permit the automatic generation of large data sets, which are often modeled as matrices. In this talk, we will present a brief overview of Randomized Numerical Linear Algebra (RandNLA).

### Monday, September 24th, 2018

We consider the solution of full-rank column regression problems, arising from Gaussian linear models, by randomized (row-wise sampling and sketching) algorithms. Motivated by the ground breaking work of Ma, Mahoney and Yu, we analyze expectation and variance of the minimum-norm solution with regard to both, model- and algorithm-induced uncertainties. Our expressions are exact, hold for general sketching matrices, and make no assumptions on the rank of the sketched matrix. We show that the relative differences between the total bias and variance, compared to the model bias and variance, respectively, are governed by two factors: (i) the expected rank deficiency of the sketched matrix, and (ii) the expected difference between projectors associated with the original and the sketched problems. The structural perturbation bounds imply that least squares problems are less sensitive to multiplicative perturbations than to additive perturbations. This is joint work with Jocelyn Chi.

Low-rank matrix approximations have become a technique of central importance in large scale data science. In this talk, we discuss a set of novel low-rank matrix approximation algorithms that are tailored for all levels of accuracy requirements for maximum computational efficiency. These algorithms include spectrum-revealing matrix factorizations that are optimal up to dimension-dependent constants, and an efficient truncated SVD (singular value decomposition) that is accurate up to a given tolerance. We provide theoretical error bounds and numerical evidence that demonstrate the superiority of our algorithms over existing ones, and show their usefulness in a number of data science applications.

In this talk, I will discuss recent advances in sampling methods for positive semidefinite (PSD) matrix approximation. I will start by discussing how fast leverage score approximation schemes can be combined with the popular Nystrom method for PSD kernel matrix approximation. These techniques yield the first provably accurate, linear time approximation algorithms for fundamental kernel methods like kernel ridge regression and kernel PCA, avoiding a traditional quadratic time bottleneck. I will then discuss how the same sampling techniques can be leveraged to give a surprising result: a relative error low-rank approximation to any positive semidefinite matrix can be computed in sublinear time, without any assumptions on incoherence or condition number. This result demonstrates the ability of randomized methods to exploit the structure of PSD matrices and go well beyond what is possible with traditional algorithmic techniques. I will conclude with a discussion of open questions, especially focused on oblivious sketching methods for kernel matrices.

Reconstructing signals based on a small number of potentially noisy samples is a fundamental problem in signal processing. In practice, we are often interested in signals with ``simple'' Fourier transforms -- e.g. those involving frequencies within a bounded range, a small number of frequencies, or a few blocks of frequencies (bandlimited, sparse, and multiband signals, respectively). We generally find that "simpler" signals require fewer samples to learn. In this talk, I will introduce a general framework for learning continuous signals that formalizes this tradeoff between sample complexity and simplicity. Our framework is based on a connection between the signal reconstruction problem, low-rank matrix approximation, and randomized numerical linear algebra. In particular, we use tools based on leverage score sampling, and show that the number of samples required for learning a signal depends on a natural statistical dimension parameter related to these scores. Using this framework, I will also present a simple and efficient algorithm for signal reconstruction. For well-studied classes of signals (bandlimited, sparse, and multiband) our algorithm essentially matches the runtime and sample complexity of state-of-the-art methods. At the same time, it extends to a much broader class of ``simple'' signals. The main computation cost of our method is a black-box call to any existing algorithm for kernel ridge regression.

Large samples have been generated routinely from various sources. Classic statistical and analytical methods are not well equipped to analyze such large samples due to expensive computational costs. In this talk, I will present an asympirical (asymptotic + empirical) analysis in large samples. The proposed method can significantly reduce computational costs in high-dimensional and large-scale data. We show the estimator based on the proposed methods achieves the optimal convergence rate. Extensive simulation studies will be presented to demonstrate numerical advantages of our method over competing methods. I will further illustrate the empirical performance of the proposed approach using two real data examples.

Randomized algorithms have shown great success in many large-scale applications. However, beyond algorithmic advantages, randomization can offer insight in understanding many data scientific models. In this talk, we discuss two such examples.

In the context of analyzing neural networks, we show how randomized analysis can be used to answer questions such as “why training neural networks gets harder as the depth of the network increases?” and “what could potentially be appropriate strategies for initialization of weights in the training process?”

We then discuss an example where such style of analysis can help study graph spectral clustering algorithms. In particular, we show that, under certain assumptions, given an existing clustering, with high probability, one can efficiently and yet accurately, perform an out-of-sample extension to observations not seen in the initial clustering procedure.

Random Matrix Theory (RMT) and Randomized Numerical Linear Algebra (RandNLA) are applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a ``size scale'' separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit self-regularization can depend strongly on the many knobs of the training process. In particular, by exploiting the generalization gap phenomena, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that---all else being equal---DNN optimization with larger batch sizes leads to less-well implicitly-regularized models, and it provides an explanation for the generalization gap phenomena.

### Tuesday, September 25th, 2018

We develop a new family of variance reduced stochastic gradient descent methods for minimizing the average of a very large number of smooth functions. Our method---JacSketch---is motivated by novel developments in randomized numerical linear algebra, and operates by maintaining a stochastic estimate of a Jacobian matrix composed of the gradients of individual functions. In each iteration, JacSketch efficiently updates the Jacobian matrix by first obtaining a random linear measurement of the true Jacobian through (cheap) sketching, and then projecting the previous estimate onto the solution space of a linear matrix equation whose solutions are consistent with the measurement. The Jacobian estimate is then used to compute a variance-reduced unbiased estimator of the gradient, followed by a stochastic gradient descent step. Our strategy is analogous to the way quasi-Newton methods maintain an estimate of the Hessian, and hence our method can be seen as a {\em stochastic quasi-gradient method}. Indeed, quasi-Newton methods project the current Hessian estimate onto a solution space of a linear equation consistent with a certain linear (but non-random) measurement of the true Hessian. Our method can also be seen as stochastic gradient descent applied to a {\em controlled stochastic optimization reformulation} of the original problem, where the control comes from the Jacobian estimates. We prove that for smooth and strongly convex functions, JacSketch converges linearly with a meaningful rate dictated by a single convergence theorem which applies to general sketches. We also provide a refined convergence theorem which applies to a smaller class of sketches, featuring a novel proof technique based on a {\em stochastic Lyapunov function}. This enables us to obtain sharper complexity results for variants of JacSketch with importance sampling. By specializing our general approach to specific sketching strategies, JacSketch reduces to the celebrated stochastic average gradient (SAGA) method, and its several existing and many new minibatch, reduced memory, and importance sampling variants. Our rate for SAGA with importance sampling is the current best-known rate for this method, resolving a conjecture by Schmidt et al (2015). The rates we obtain for minibatch SAGA are also superior to existing rates. Moreover, we obtain the first minibatch SAGA method with importance sampling. This is joint work with Robert M. Gower (Telecom ParisTech) and Francis Bach (INRIA). https://arxiv.org/abs/1805.02632

We consider optimization problems of the form max f(x) s.t. x^T B x=1 where B is a symmetric positive definite matrix which is the Gram matrix of some input n-by-d data matrix X. This problem naturally occurs in quite a few application, such as finding the largest or smallest singular value, linear discriminant analysis and canonical correlation analysis. Since the constraint set is a manifold, Riemannian optimization is natural tool for tackling such problems. However, the prevalent method in which this method is materilized leads to an O(nd^2) cost, which is not attractive if n is large and/or X is sparse. The underlying reason is that these methods are based on a Riemannian metric that require the factorization of the Gram matrix B. We propose to use an alternative metric which is based on randomized precoditioning of X (e.g. the construction used in Blendenpik). Theoretical and empirical aspects will be discussed.

Randomized methods for computing eigenvalues of large matrices are influenced by a number of fundamental spectral properties that affect algorithm performance and the quality of the resulting answers. Should one gauge convergence by eigenvalue errors, residual norms, or subspace angles? What matrix properties dictate the convergence rate? How does performance depend on the location of the sought-after eigenvalues, relative to the rest of the spectrum? How sensitive are eigenvalues and eigenvectors to changes in the matrix? How does such sensitivity influence convergence? We shall address these questions in the context of traditional (deterministic) eigensolvers, in the hope that such an overview will provide a helpful framework for designers of randomized algorithms.

At this point in time, we understand fairly well how randomized projections can be used to very efficiently compute a low rank approximation to a given matrix. We have seen that randomized methods are often substantially faster than traditional deterministic methods, and that they enable the processing of matrices that are simply too large to be handled using traditional deterministic methods. In this talk, we will describe how randomization can also be used to accelerate the computation of a *full* factorization (e.g. a column pivoted QR decomposition) of a matrix. These methods achieve high speed primarily by reducing the amount of communication that is required, as compared to existing methods. This makes them particularly competitive in communication constrained environments such as data stored out-of-core, GPU computing, distributed memory machines, etc. While the methods presented were developed primarily for the purpose of computing a full factorization, they are also very convenient to use for computing partial factorizations in situations where we have no prior information about the numerical rank of the input matrix.

I'll discuss our experience in using randomized methods to sample the vertices of a zonotope, which arose when we were studying fast algorithm to solve low-rank correlation clustering problems. I'll also discuss some experience with a new type of Monte Carlo algorithms that appears to have better empirical convergence properties and better parallalization potential when solving linear systems of equations. This work is based on the following papers: https://arxiv.org/abs/1611.07305 and https://arxiv.org/abs/1608.04361

In recent years, many randomized algorithms have been proposed for computing approximate solutions to large-scale problems in numerical linear algebra. However, the user rarely knows the actual error of a randomized solution. For this reason, it is common to rely on theoretical worst-case error bounds as a source of guidance. As a more practical alternative, we propose bootstrap methods to obtain direct error estimates for randomized solutions. Specifically, in the contexts of matrix multiplication and least-squares, we show that bootstrap error estimates are theoretically justified, and incur modest computational cost.

The study of large scale matrices and networks often utilize high-level algorithmic primitives for them: tools such as solving linear systems and computing eigenvectors are now taught in courses on machine learning and statistics. On the other hand, the rapid growth in data sizes exposed significant shortcomings in the scalability of such primitives, and accentuated the need for provably efficient algorithms. Here the Laplacian paradigm for graph algorithms has led to improvements on many fundamental problems in scientific computing and combinatorial optimization, as well as highly efficient code packages that have been applied to image processing and data mining. These works started from the rich mathematical connections between graphs and matrices provided by spectral graph theory, but increasingly rely on fine-grained coupling between combinatorial and numerical components. This talk will discuss some of the key ideas in these algorithms, with focus on the uses of efficient structure-preserving sampling and high dimensional concentration.

### Wednesday, September 26th, 2018

A common statistical analysis problem is to determine the mean and the variance of a (few) parameter(s), marginalizing over a large number of latent variables, which are all correlated, so that the Hessian is a full rank matrix. This requires inverting the Hessian, which becomes impossible using linear algebra in a very high number of dimensions. I will present a method to obtain the Hessian inverse matrix elements using parametric bootstrap samples (simulations), where only a few samples already give a reliable estimate. I will present an application of this method to the cosmological data analysis problem, where we operate with up to 10^{10} fully correlated observations of galaxy positions, and we wish to determine a handful of cosmological parameters.

Ridge leverage scores provide a balance between low-rank approximation and regularization, and are ubiquitous in randomized linear algebra and machine learning. Deterministic algorithms are also of interest in the moderately big data regime, because deterministic algorithms provide interpretability to the practitioner by having no failure probability and always returning the same results. We provide provable guarantees for deterministic column sampling using ridge leverage scores. The matrix sketch returned by our algorithm is a column subset of the original matrix, yielding additional interpretability. Like the randomized counterparts, the deterministic algorithm provides $(1+\epsilon)$ error column subset selection, $(1+\epsilon)$ error projection-cost preservation, and an additive-multiplicative spectral bound. We also show that under the assumption of power-law decay of ridge leverage scores, this deterministic algorithm is provably as accurate as randomized algorithms. Lastly, ridge regression is frequently used to regularize ill-posed linear least-squares problems. While ridge regression provides shrinkage for the regression coefficients, many of the coefficients remain small but non-zero. Performing ridge regression with the matrix sketch returned by our algorithm and a particular regularization parameter forces coefficients to zero and has a provable $(1+\epsilon)$ bound on the statistical risk. As such, it is an interesting alternative to elastic net regularization.

The next milestone for machine learning is the ability to train on massively large datasets. The de facto method used for training neural networks is stochastic gradient descent, a sequential algorithm with poor convergence properties. One approach to address the challenge of large scale training, is to use large mini-batch sizes which allows parallel training. However, large batch size training often results in poor generalization performance. The exact reasons for this are still not completely understood, and all the methods proposed so far for resolving it (such as scaling learning rate or annealing batch size) are specific to a particular problem and do not generalize.

In the first part of the talk, I will show results analyzing large batch size training through the lens of the Hessian operator. The results rule out some of the common theories regarding large batch size training, such as problems with saddle points. In the second part, I will present our results on a novel Hessian based method in combination with robust optimization, that avoids many of the issues with first order methods such as stochastic gradient descent for large scale training.

A wide range of phenomena in the physical, engineering, biological, and social sciences feature rich dynamics that give rise to multiscale structures in both space and time, including fluid dynamics, atmospheric-ocean interactions, climate modeling, epidemiology, and neuroscience. Constrained matrix factorizations are powerful techniques for modern data analysis, providing improved interpretation of low-rank structures by identifying localized spatial structures in the data and disambiguating between distinct time scales. However, the emergence of large-scale datasets has severely challenged our ability to analyze data using such techniques. We discuss randomized methods as a computational strategy to accelerate techniques such as sparse principal component analysis (SPCA) and nonnegative matrix factorization (NMF). The proposed algorithms are demonstrated using both synthetic and real world data, showing great computational efficiency and diagnostic performance.

Sampling and Projection (or sketching) are the two fundamental hammers in Randomized Numerical Linear Algebra for reducing the size and scale of the problem at hand. At extreme scale, conventional sampling and projections themselves are prohibitively expensive.

In this talk, I will discuss some of my recent and surprising findings on the use of hashing algorithms for large-scale estimations. Locality Sensitive Hashing (LSH) is a hugely popular algorithm for sub-linear near neighbor search. However, it turns out that fundamentally LSH is a constant time (amortized) adaptive sampler from which efficient near-neighbor search is one of the many possibilities. LSH offers a unique capability to do smart sampling and statistical estimations at the cost of few hash lookups. Our observation bridges data structures (probabilistic hash tables) with efficient unbiased statistical estimations. I will demonstrate how this dynamic and efficient sampling beak the computational barriers in adaptive estimations where it is possible that we pay roughly the cost of uniform sampling but get the benefits of adaptive sampling. I will demonstrate the power of one simple idea for three favorite problems 1) Partition function estimation for large NLP models such as word2vec, 2) Adaptive Gradient Estimations for efficient SGD and 3) Sub-Linear Deep Learning with Huge Parameter Space.

In the end, if time permits, I will switch to memory cost show a simple hashing algorithm, which is a count-min sketch in disguise, can shrink memory requirements associated with k-class classification exponentially! Using our algorithms, we can train 100,000 classes with 400,000 features, on a single Titan X while only needing 5% or less memory required to store all the weights. Running a simple logistic regression on this data, the model size of 320GB is unavoidable.

We study the following basic machine learning task: Given a fixed set of n input points in a d-dimensional linear regression problem, we wish to predict a hidden response value for each of the points. We can only afford to attain the responses for a small subset of the points that are then used to construct linear predictions for all points in the dataset. The performance of the predictions is evaluated by the total square loss on all responses. We show that a good approximate solution to this least squares problem can be obtained from just dimension d many responses by using a joint sampling technique called volume sampling. Moreover, the least squares solution obtained for the volume sampled subproblem is an unbiased estimator of optimal solution based on all n responses. This unbiasedness is a desirable property that is not shared by standard subset selection techniques.

Motivated by these basic properties, we develop a theoretical framework for studying volume sampling, which leads to a number of new expectation formulas and statistical guarantees that are of importance not only to least squares regression but also numerical linear algebra in general. Our methods lead to several extensions of volume sampling, including a regularized variant, and we propose the first efficient algorithms which make this technique a practical tool in the machine learning toolbox.

Semidefinite programs (SDP) are important in learning and combinatorial optimization with numerous applications. In pursuit of low-rank solutions and low complexity algorithms, we consider the Burer--Monteiro factorization approach for solving SDPs. We show that all approximate local optima are global optima for the penalty formulation of appropriately rank-constrained SDPs as long as the number of constraints scales sub-quadratically with the desired rank of the optimal solution. Our result is based on a simple penalty function formulation of the rank-constrained SDP along with a smoothed analysis to avoid worst-case cost matrices.

### Thursday, September 27th, 2018

An active research area in recent years is the development of fast hierarchical matrix tools for linear and eigen solvers. Although the theoretical foundation for the hierarchical matrices has been solidified, there is a lack of robust algorithms and software that can handle many practical issues. For example, randomized sampling has been shown to be an effective tool to reveal the low rank structures, but choosing the right number of samples is difficult.

In this talk we present new results for HSS compression using an Adaptive Randomized Sampling strategy. In particular, we developed a robust stopping criterion based on a new stochastic relative error estimation. We present results for situations when the dense matrix is given and in the matrix-free setting. We show some practical difficulties about implementation and present some solutions that get around the problems.

In the presence of measurement noise, the quality of fast randomized low-rank approximations deteriorates with noise level. Under suitable assumptions, this phenomenon can be quantified (as function of the noise level, sketch size and underlying matrix rank) and mitigated.

A fundamental algorithmic result for matroids is that the maximum weight base can be computed using the greedy algorithm. For linear matroids, an important question is the time complexity of computing such a base. In this work, we give a fast algorithm that computes the same output as the greedy algorithm for linear matroids.

A matrix martingale is a sequence of random matrices where the expectation of each matrix in the sequence equals the preceding matrix, conditional on the earlier parts of the sequence. Concentration results for matrix martingales have become an important tool in the analysis of algorithms in randomized numerical linear algebra. Simple and fast algorithms for many well-studied problems can be analyzed in using martingales. We survey recent results on using matrix martingales for randomized constructions of sparse approximations of graphs and discuss in detail the use of matrix martingales for analyzing approximate Gaussian elimination on Laplacian matrices associated with undirected and directed graphs.

Sampling logconcave functions arising in statistics and machine learning has been a subject of intensive study. Recent developments include analyses for Langevin dynamics and Hamiltonian Monte Carlo (HMC). While both approaches have dimension-independent bounds for the underlying continuous processes under sufficiently strong smoothness conditions, the resulting discrete algorithms have complexity and number of function evaluations growing with the dimension. Here we give a nearly linear implementation of HMC for sampling from a broad class of densities, with the number of iterations and function evaluations being polylogarithmic in the dimension (rather than polynomial as in previous work). This class includes the widely-used loss function for logistic regression with incoherent weight matrices. Our main contribution is a fast solver for multivariate Ordinary Differential Equations whose solution is close to a low-degree polynomial. We establish logarithmic bounds on the degree for the differential equations arising in HMC. Joint work with Yin Tat Lee and Zhao Song.

We discuss two generalizations of the graph isomorphism problem:

(a) Spectral Dominance: On input of two PSD matrices A and B, does there exist a permutation matrix P such that P'*B*P is spectrally dominated by A?

(b) Spectrally Robust Isomorphism: On input of two PSD matrices A and B, and a constant k, does there exists a permutation matrix P such that the condition number of the pair (A,P'*B*P) is bounded above by k?

We show that Spectral Dominance is NP-hard, even when A and B are graph Laplacians. On the other hand, we show that Spectrally Robust Isomorphism is tractable when A and B are Laplacians of trees that have bounded degree. We also discuss several related open questions.

[The talk is based on joint work with A. Kolla, V. Madan and A. K. Sinop that appeared in ICALP '18.]

We discuss a novel approach to the subspace clustering problem that leverages ensembles of the K-subspaces (KSS) algorithm with random initializations. We extend existing results in subspace clustering to show that correct clustering can be achieved by any algorithm that obtains (possibly perturbed) measurements of any monotonic function of the inner product between points -- so it is possible to cluster with missing data, compressed data, or in other scenarios where only an approximation of inner products is available. We then show that a single iteration of KSS with random initialization produces such a measurement, and hence our algorithm can provably recover subspaces under the same conditions as state-of-the-art algorithms. The finding is, to the best of our knowledge, the first recovery guarantee for evidence accumulation clustering and for KSS variants. It also provides insight into random initializations of subspace arrangements. We show on synthetic data that our method performs well in the traditionally challenging settings of subspaces with large intersection, subspaces with small principal angles, and noisy data. Finally, we show our algorithm achieves state-of-the-art performance across several benchmark datasets, including a resulting error for the COIL-20 database that is less than half that achieved by existing algorithms.