Abstract
Motivated by problems in large-scale data analysis, randomized algorithms for matrix problems such as regression and low-rank matrix approximation have been the focus of a great deal of attention in recent years. These algorithms exploit novel random sampling and random projection methods; and implementations of these algorithms have already proven superior to traditional state-of-the-art algorithms, as implemented in Lapack and high-quality scientific computing software, for moderately-large problems stored in RAM on a single machine. Here, we describe the extension of these methods to computing high-precision solutions in parallel and distributed environments that are more common in very large-scale data analysis applications.
In particular, we consider both the Least Squares Approximation problem and the Least Absolute Deviation problem, and we develop and implement randomized algorithms that take advantage of modern computer architectures in order to achieve improved communication profiles. Our iterative least-squares solver, LSRN, is competitive with state-of-the-art implementations on moderately-large problems; and, when coupled with the Chebyshev semi-iterative method, scales well for solving large problems on clusters that have high communication costs such as on an Amazon Elastic Compute Cloud cluster. Our iterative least-absolute-deviations solver is based on fast ellipsoidal rounding, random sampling, and interior-point cutting-plane methods; and we demonstrate significant improvements over traditional algorithms on MapReduce. In addition, this algorithm can also be extended to solve more general convex problems on MapReduce.