Abstract
Stochastic gradient descent (SGD) has been the core optimization method for deep neural networks and a key driver of their resurgence. Despite some progress, it remains unclear why SGD steers the learning dynamics of overparameterized networks toward solutions that generalize well. Here we show that, for overparameterized networks whose loss function has a degenerate valley, SGD on average decreases the trace of the Hessian. We also show that isotropic noise in the non-degenerate subspace of the Hessian decreases its determinant. These results open the door to a new optimization approach that guides the model toward solutions with better generalization. We test our results with experiments on toy models and deep neural networks.