Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high-performance computing. I'll present a new parallel algorithm, based on Strassen's fast matrix multiplication, that minimizes communication. The algorithm outperforms all other parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice.
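For readers unfamiliar with Strassen's method, here is a minimal sequential sketch of the recursion it is built on: seven half-sized multiplications instead of the classical eight, giving O(n^log2(7)) arithmetic. This is only an illustration of the underlying recursion, not the parallel algorithm discussed in the talk; the function name, the `leaf` cutoff, and the power-of-two assumption are choices made here for simplicity.

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Multiply square matrices via Strassen's recursion.

    Illustrative sketch: assumes A and B are n-by-n with n a power of two.
    Below the `leaf` cutoff, falls back to the classical product.
    """
    n = A.shape[0]
    if n <= leaf:
        return A @ B  # classical multiply at small sizes
    h = n // 2
    # Partition each matrix into four h-by-h blocks.
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products (instead of the classical eight).
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Recombine the products into the four blocks of C.
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

In a parallel setting, each level of this recursion creates seven independent subproblems, and how those subproblems and their operands are distributed across processors is exactly what determines the communication cost.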
A critical bottleneck in parallelizing Strassen's algorithm is the communication between processors. We have proved lower bounds on these communication costs using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds and is therefore communication-optimal.
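As a point of reference, the bandwidth-cost lower bound for Strassen-based algorithms is commonly stated in the following form, where $n$ is the matrix dimension, $P$ the number of processors, and $M$ the local memory size per processor (these symbols are introduced here, not in the abstract):

\[
W \;=\; \Omega\!\left( \left(\frac{n}{\sqrt{M}}\right)^{\log_2 7} \cdot \frac{M}{P} \right),
\]

which mirrors the classical bound with exponent $3$ replaced by Strassen's exponent $\log_2 7 \approx 2.81$.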
I'll describe the algorithm and its theoretical computation and communication costs, show performance results from several machines, and discuss the practicality of using Strassen-based rather than classical parallel algorithms.