Abstract

The success of deep learning is due, to a great extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. I will discuss some general mathematical principles, inspired by that success, that allow for efficient optimization in over-parameterized systems of non-linear equations, a setting that includes deep neural networks. In particular, optimization problems corresponding to such systems are not convex, even locally, but instead satisfy the Polyak-Lojasiewicz (PL) condition, which guarantees efficient convergence of gradient descent or SGD. I will connect the PL condition of these systems to the condition number associated with the tangent kernel and develop a non-linear theory parallel to the classical analyses of over-parameterized linear equations. Finally, I will discuss the separate, remarkable phenomenon of "transition to linearity", in which certain large non-linear systems approach linearity as the number of variables increases, and show how our analysis sheds light on the recently observed properties of Neural Tangent Kernels. Joint work with Chaoyue Liu and Libin Zhu.
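As a rough sketch of the key objects (the notation below is my own gloss, not fixed by the abstract): a smooth loss $L$ is said to satisfy the PL condition with constant $\mu > 0$ if

\[ \tfrac{1}{2}\,\|\nabla L(w)\|^2 \;\ge\; \mu\,\bigl(L(w) - L^*\bigr) \quad \text{for all } w, \]

where $L^*$ is the minimal loss value. For a $\beta$-smooth $L$, gradient descent with step size $\eta \le 1/\beta$ then converges at a linear rate,

\[ L(w_t) - L^* \;\le\; (1 - \eta\mu)^t\,\bigl(L(w_0) - L^*\bigr), \]

without any convexity assumption. To indicate the link to the tangent kernel, consider the square loss $L(w) = \tfrac{1}{2}\|F(w) - y\|^2$ of an over-parameterized system $F:\mathbb{R}^m \to \mathbb{R}^n$ with $m \gg n$, and write $K(w) = DF(w)\,DF(w)^\top$ for its tangent kernel. Assuming the system can be solved exactly, so that $L^* = 0$, one has

\[ \tfrac{1}{2}\,\|\nabla L(w)\|^2 \;=\; \tfrac{1}{2}\,(F(w)-y)^\top K(w)\,(F(w)-y) \;\ge\; \lambda_{\min}\!\bigl(K(w)\bigr)\, L(w), \]

so a uniform lower bound on $\lambda_{\min}(K)$ supplies the PL constant, and the ratio $\lambda_{\max}(K)/\lambda_{\min}(K)$ plays the role of a condition number.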