Abstract

Consider learning a two-layer neural network. This requires optimizing a nonconvex high-dimensional objective, and is often done, successfully, by running stochastic gradient descent (SGD). However, the reasons for the success of SGD and its domain of applicability are still poorly understood. Recently, it was shown that in a suitable scaling limit, the SGD dynamics is captured by a nonlinear partial differential equation, the `distributional dynamics'. In this mean-field limit, the evolution of the network weights is well approximated by an evolution in the space of probability distributions, given by a Wasserstein gradient flow. In this talk, I will present the distributional dynamics and show how to derive non-asymptotic approximation guarantees between the SGD process and the limiting PDE. In particular, the mean-field description is accurate as soon as the number of hidden units exceeds a quantity that depends on the regularity properties of the data and is independent of the dimension. I will then consider an example to illustrate how this description allows one to "average out" some of the complexities of the landscape of neural networks and can be used to prove global convergence results for SGD.
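For concreteness, here is a minimal sketch of the limiting object; the notation ($f$, $\sigma_*$, $\rho_t$, $\Psi$, $V$, $U$, $\xi$) is not fixed by the abstract and follows the standard mean-field presentation of this result. Writing the two-layer network as $f(x;\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \sigma_*(x;\theta_i)$ with hidden-unit parameters $\theta_i$, the empirical distribution of the weights under SGD is approximated by a density $\rho_t$ solving the distributional dynamics

\[
\partial_t \rho_t \;=\; 2\,\xi(t)\,\nabla_\theta \cdot \bigl( \rho_t \, \nabla_\theta \Psi(\theta;\rho_t) \bigr),
\qquad
\Psi(\theta;\rho) \;=\; V(\theta) + \int U(\theta,\theta')\,\rho(\mathrm{d}\theta'),
\]

where $V(\theta) = -\mathbb{E}\{ y\,\sigma_*(x;\theta) \}$, $U(\theta,\theta') = \mathbb{E}\{ \sigma_*(x;\theta)\,\sigma_*(x;\theta') \}$, and $\xi(t)$ accounts for the step-size schedule. This PDE is the Wasserstein gradient flow of the population risk viewed as a functional of $\rho$.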
