Abstract

Large-scale distributed training of neural networks is often limited by
network bandwidth, wherein the communication time overwhelms the local
computation time. Motivated by the success of sketching methods in
sub-linear/streaming algorithms, we introduce Sketched-SGD, an algorithm for
carrying out distributed SGD by communicating sketches instead of full
gradients. We show that Sketched-SGD has favorable convergence rates on
several classes of functions. When considering all communication -- both of
gradients and of updated model weights -- Sketched-SGD reduces the amount of
communication required compared to other gradient compression methods from
O(d) or O(W) to O(log d), where d is the number of model parameters and W is
the number of workers participating in training. We run experiments on a
transformer model, an LSTM, and a residual network, demonstrating up to a
40x reduction in total communication cost with no loss in final model
performance. We also show experimentally that Sketched-SGD scales to at
least 256 workers without increasing communication cost or degrading model
performance.
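
To make the communication pattern concrete, here is a minimal illustrative Python sketch of the core idea (an assumption-laden toy, not the authors' implementation): each worker compresses its gradient with a Count Sketch whose size is logarithmic in d, the server sums the sketches (sketching is linear), and the approximate top-k coordinates are recovered and used as the update. It omits refinements used in the paper, such as error accumulation and a second communication round to fetch exact values of the recovered coordinates; all names and sizes below are my own.

import numpy as np

class CountSketch:
    def __init__(self, rows, cols, dim, seed=0):
        rng = np.random.default_rng(seed)            # shared seed so all workers agree on hashes
        self.buckets = rng.integers(0, cols, size=(rows, dim))
        self.signs = rng.choice([-1.0, 1.0], size=(rows, dim))
        self.rows, self.cols, self.dim = rows, cols, dim

    def sketch(self, g):
        # compress a length-d gradient into a (rows x cols) table
        S = np.zeros((self.rows, self.cols))
        for r in range(self.rows):
            np.add.at(S[r], self.buckets[r], self.signs[r] * g)
        return S

    def estimate(self, S):
        # median-of-rows estimate of every coordinate (heavy hitters are recovered well)
        est = np.stack([self.signs[r] * S[r, self.buckets[r]] for r in range(self.rows)])
        return np.median(est, axis=0)

def sketched_sgd_step(worker_grads, cs, k):
    # server sums the workers' sketches (linearity of sketching), then
    # recovers an approximate top-k update from the combined sketch
    S = sum(cs.sketch(g) for g in worker_grads)
    est = cs.estimate(S)
    topk = np.argsort(np.abs(est))[-k:]
    update = np.zeros(cs.dim)
    update[topk] = est[topk]
    return update

The point of the sketch is that each worker sends only the (rows x cols) table rather than the full d-dimensional gradient, and the tables from many workers can be merged by simple addition.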

Joint work with Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Ion Stoica,
Raman Arora (NeurIPS 2019).

If time permits, we will also discuss streaming coresets for M-estimators.

We introduce a new method of maintaining a (k,epsilon)-coreset for
clustering M-estimators over insertion-only streams. Let (P,w) be a weighted
set (where w : P -> [0, infty) is the weight function) of points in a
rho-metric space (meaning a set X equipped with a positive-semidefinite
symmetric function D such that D(x,z) <= rho(D(x,y) + D(y,z)) for all x,y,z
in X). For any set of points C, we define COST(P,w,C) = sum_{p in P} w(p)
min_{c in C} D(p,c). A (k,epsilon)-coreset for (P,w) is a weighted set (Q,v)
such that for every set C of k points, (1-epsilon)COST(P,w,C) <= COST(Q,v,C)
<= (1+epsilon)COST(P,w,C). M-estimators are functions D(x,y) that can be
written as psi(d(x,y)), where (X, d) is a true metric (i.e., 1-metric)
space. Special cases of M-estimators include the well-known k-median (psi(x)
= x) and k-means (psi(x) = x^2) functions. Our technique takes an existing
offline construction for an M-estimator coreset and converts it into the
streaming setting, where n data points arrive sequentially. Our streaming
construction does not rely on the merge-and-reduce tree. For example, our
coreset for streaming metric k-means uses O(epsilon^{-2} k log k log n)
points of storage. The previous state-of-the-art required storing at least
O(epsilon^{-2} k log k log^{4} n) points. 
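
As a concrete reading of these definitions, the following minimal Python sketch (names and helpers are mine, not from the paper) spells out COST(P,w,C) and checks the (k,epsilon)-coreset guarantee against a supplied collection of candidate center sets; in the actual definition the guarantee must hold for every set of k centers.

import numpy as np

def cost(points, weights, centers, D):
    # COST(P, w, C) = sum_{p in P} w(p) * min_{c in C} D(p, c)
    return sum(w * min(D(p, c) for c in centers) for p, w in zip(points, weights))

def looks_like_coreset(P, w, Q, v, candidate_center_sets, D, eps):
    # spot-checks the (k, eps)-coreset guarantee on the given center sets only
    for C in candidate_center_sets:
        full, small = cost(P, w, C, D), cost(Q, v, C, D)
        if not ((1 - eps) * full <= small <= (1 + eps) * full):
            return False
    return True

# k-means as an M-estimator: D(x, y) = psi(d(x, y)) with psi(x) = x^2
D_kmeans = lambda x, y: float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))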

Joint work with Dan Feldman, Harry Lang and Daniela Rus (RANDOM 2019).