How Noise during Training Affects the Hessian Spectrum

Workshop

The Brain and Computation Reunion

Speaker(s)

David Schwab (City University of New York)

Location

Date

Tuesday, June 18, 2019

Time

11:30 a.m. – 12 p.m. PT

Abstract

Stochastic gradient descent (SGD) has been the core optimization method for deep neural networks, contributing to their resurgence. While some progress has been made, it remains unclear why SGD leads the learning dynamics in overparameterized networks to solutions that generalize well. Here we show that for overparameterized networks with a degenerate valley in their loss function, SGD on average decreases the trace of the Hessian. We also show that isotropic noise in the non-degenerate subspace of the Hessian decreases its determinant. This opens the door to a new optimization approach that guides the model to solutions with better generalization. We test our results with experiments on toy models and deep neural networks.
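The trace claim can be illustrated with a minimal sketch. It is not taken from the talk, and every choice in it is an assumption made for illustration: the two-parameter loss L(w1, w2) = 0.5 * (w1*w2 - 1)^2, whose minima form the degenerate valley w1*w2 = 1; Gaussian noise on the target as a stand-in for minibatch noise; and the hypothetical helper names `hessian_trace` and `run`. For this loss the Hessian trace is w1^2 + w2^2, which along the valley is smallest at (1, 1).

```python
import random


def hessian_trace(w1, w2):
    # For L = 0.5 * (w1*w2 - 1)**2: d2L/dw1^2 = w2**2 and d2L/dw2^2 = w1**2,
    # so the Hessian trace is w1**2 + w2**2 at every point.
    return w1 * w1 + w2 * w2


def run(noise_std, steps=5000, lr=0.1, tail=1000, seed=0):
    """Gradient steps on the toy loss with a noisy target (assumed noise model).

    Returns the Hessian trace averaged over the last `tail` steps.
    """
    rng = random.Random(seed)
    # Start on the valley w1*w2 = 1, away from its flattest point (1, 1).
    w1, w2 = 2.0, 0.5
    tail_sum = 0.0
    for step in range(steps):
        # Noisy target: plays the role of stochastic gradient noise here.
        target = 1.0 + rng.gauss(0.0, noise_std)
        residual = w1 * w2 - target
        # Gradient of 0.5 * (w1*w2 - target)**2 is residual * (w2, w1).
        w1, w2 = w1 - lr * residual * w2, w2 - lr * residual * w1
        if step >= steps - tail:
            tail_sum += hessian_trace(w1, w2)
    return tail_sum / tail


trace_start = hessian_trace(2.0, 0.5)
trace_gd = run(noise_std=0.0)     # noiseless gradient descent
trace_noisy = run(noise_std=0.5)  # noisy updates

print("initial trace:      ", trace_start)
print("noiseless GD trace: ", trace_gd)
print("noisy-update trace: ", trace_noisy)
```

In this toy setting each update multiplies the imbalance w1^2 - w2^2 by (1 - lr^2 * residual^2). Noiseless descent started on the valley has zero residual and never moves, while noisy updates keep the residual nonzero, so the imbalance shrinks and the iterate drifts along the valley toward the lower-trace region.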