What enables learning with multi-layer networks? What makes it possible to optimize the error, despite the problem being
hard in the worst case? What causes the network to generalize well despite the model class having extremely high capacity? In this talk I will explore these questions through experimentation, analogy to matrix factorization (including some new results on the energy
landscape and implicit regularization in matrix factorization), and a study of alternative geometries and optimization approaches.