Abstract

High-dimensional matrix computations are a key component of various algorithms within machine learning and scientific computing. Such computations are often deployed on large scale distributed computing clusters. The widespread usage of these clusters presents several advantages over traditional computing paradigms. However, they also present new challenges, e.g., such clusters are well known to suffer from the problem of “stragglers” (slow or failed nodes in the system) which can end up dominating the overall job execution time. In the past few years, ideas from coding theory have been adapted to these problems; these allow for recovery of the intended result as long as a minimum number (threshold) of worker nodes complete their assigned tasks.

Video Recording