Abstract

The next milestone for machine learning is the ability to train on massively large datasets.  The de facto method used for training neural networks is stochastic gradient descent, a sequential algorithm with poor convergence properties. One approach to address the challenge of large scale training, is to use large mini-batch sizes which allows parallel training. However, large batch size training often results in poor generalization performance. The exact reasons for this are still not completely understood, and all the methods proposed so far for resolving it (such as scaling learning rate or annealing batch size) are specific to a particular problem and do not generalize.

In the first part of the talk, I will show results analyzing large batch size training through the lens of the Hessian operator. The results rule out some of the common theories regarding large batch size training, such as problems with saddle points. In the second part, I will present our results on a novel Hessian based method in combination with robust optimization, that avoids many of the issues with first order methods such as stochastic gradient descent for large scale training.

Video Recording