Abstract

The focus of this talk is on nonconvex optimization, especially in the large-scale setting. Our starting point is stochastic gradient descent (SGD), a cornerstone of machine learning that was introduced over 60 years ago! We then move on to an important development of recent years: stochastic variance reduced methods. These methods excel in settings where more than one pass through the training data is allowed, and they converge faster than SGD both in theory and in practice. Typical theoretical guarantees ensure convergence to stationary points; however, we will also discuss recent work on methods that escape saddle points and target local minima. Beyond the usual SGD setup, we will also comment on some fruitful settings where the nonconvexity has enough geometric structure to permit efficient global optimization. Ultimately, the aim of this talk is to provide a brief survey of this fast-moving area. We hope to unify and simplify its presentation, outline common pitfalls, and raise awareness about open research challenges.
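
To make the contrast between plain SGD and a variance-reduced method concrete, below is a minimal, illustrative Python sketch of the two update rules on a toy nonconvex finite-sum problem. The objective, step sizes, and loop lengths are assumptions chosen for illustration only; the variance-reduced update follows the standard SVRG-style snapshot scheme, not any specific method from the talk.

# Toy comparison of SGD vs. an SVRG-style variance-reduced update
# on a nonconvex finite-sum objective (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_i(w, i):
    # Gradient of the per-example nonconvex loss 1 - cos(A_i w - b_i).
    r = A[i] @ w - b[i]
    return np.sin(r) * A[i]

def full_grad(w):
    # Average gradient over all n examples.
    r = A @ w - b
    return (np.sin(r) @ A) / n

def sgd(w, steps=2000, lr=0.05):
    # Plain SGD: one stochastic gradient per step.
    for _ in range(steps):
        i = rng.integers(n)
        w = w - lr * grad_i(w, i)
    return w

def svrg(w, epochs=10, inner=2 * n, lr=0.05):
    # SVRG-style updates: each epoch takes one full pass to form a snapshot
    # gradient, then runs inner steps with a variance-reduced estimate.
    for _ in range(epochs):
        w_snap, mu = w.copy(), full_grad(w)
        for _ in range(inner):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - lr * g
    return w

w0 = rng.normal(size=d)
print("||grad|| after SGD :", np.linalg.norm(full_grad(sgd(w0.copy()))))
print("||grad|| after SVRG:", np.linalg.norm(full_grad(svrg(w0.copy()))))

Both methods measure progress here by the norm of the full gradient, i.e., proximity to a stationary point, which matches the kind of guarantee discussed in the abstract.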

Video Recording