Abstract

In this talk I will describe a family of randomized parallel coordinate descent methods for minimizing a convex loss/objective function. The methods can be applied to smooth and composite (e.g., L1-regularized) functions [1, 2] as well as nonsmooth functions via a pre-processing smoothing step [3]. The family includes methods updating an (almost) arbitrary random subset of coordinates per iteration. Depending on the way the coordinates are selected and updated, the generic method specializes to, for instance, serial coordinate descent, partially parallel coordinate descent, distributed coordinate descent and gradient descent.

The critical observation is that parallelization of a coordinate descent method does not necessarily lead to acceleration. The amount of acceleration depends on the way coordinates are chosen at each iteration (which, thankfully, we can control) and, more importantly, on certain structural separability/sparsity properties of the loss function. I will hence describe several classes of 'good' functions which do admit parallelization speedup.

I will describe computational results involving billions of variables and terabyte data using ACDC: a fast implementation of the methods. Time permitting, I will also briefly comment on extensions to distributed and asynchronous settings and on the connection of the methods with stochastic gradient descent.

References:

[1] Peter Richtárik and Martin Takáč, Parallel coordinate descent methods for big data optimization, arXiv:1212.0873, 2012

[2] Martin Takáč, Avleen Bijral, Peter Richtárik and Nati Srebro, Mini-batch primal and dual methods for SVMs, ICML 2013

[3] Olivier Fercoq and Peter Richtárik, Smooth minimization of nonsmooth functions with parallel coordinate descent methods, arXiv:1309.5885, 2013

[4] ACDC: http://code.google.com/p/ac-dc/

Video Recording