Abstract

Stochastic Gradient Descent (SGD) is the basic first-order stochastic optimization algorithm behind the powerful deep learning architectures that are becoming increasingly prevalent in society. In this lecture, we motivate the use of stochastic first-order methods and recall some convergence results for SGD. We then discuss the notion of importance sampling for SGD and how it can improve the convergence rate. Finally, we discuss methods for making SGD more "robust" to the hyperparameters of the algorithm, such as the step size, using "on the fly" adaptive step-size methods such as AdaGrad, and present some theoretical results.
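To fix ideas before the lecture, the following is a minimal illustrative sketch (not taken from the lecture) contrasting a plain SGD step with an AdaGrad-style per-coordinate adaptive step on a toy least-squares problem; the problem instance, base step size `eta`, and smoothing constant `eps` are assumptions made for the example.

```python
import numpy as np

# Toy least-squares problem: minimize (1/2n) * sum_i (A[i] @ x - b[i])**2
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

def stochastic_grad(x, i):
    # Gradient of the i-th term 0.5 * (A[i] @ x - b[i])**2
    return (A[i] @ x - b[i]) * A[i]

eta, eps = 0.1, 1e-8          # illustrative choices
x_sgd = np.zeros(5)
x_ada = np.zeros(5)
g_sq_sum = np.zeros(5)        # AdaGrad: running sum of squared gradients

for t in range(1000):
    i = rng.integers(len(b))                       # uniformly sampled example

    g = stochastic_grad(x_sgd, i)
    x_sgd -= eta / np.sqrt(t + 1) * g              # SGD with a decaying step size

    g = stochastic_grad(x_ada, i)
    g_sq_sum += g**2
    x_ada -= eta / (np.sqrt(g_sq_sum) + eps) * g   # AdaGrad per-coordinate step
```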

Video Recording