Abstract

Gradient descent (GD) and stochastic gradient descent (SGD) are the fundamental algorithms for optimizing machine learning models, particularly in the context of deep learning. However, certain observed behaviors of GD and SGD cannot be fully explained by classical optimization and statistical learning theories. For example, (1) the training loss induced by GD often oscillates locally yet still converges in the long run, and (2) SGD-trained models often generalize well even when the number of training samples is smaller than the number of parameters. I will discuss two new results on the risk convergence and algorithmic regularization effects of GD and SGD:

(1) Large-stepsize GD can minimize risk in a non-monotonic manner for logistic regression with separable data.
(2) Online SGD (and its variant) can effectively learn linear regression and a ReLU neuron in the overparameterized regime.
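
To make claim (1) concrete, below is a minimal sketch (an assumption-laden illustration, not the speaker's setup): GD with a deliberately large stepsize on logistic regression over linearly separable data. The data-generating rule, stepsize, and iteration count are my assumptions; the point is that the loss may go up at some steps early on yet still end up near zero.

```python
# Sketch of (1): large-stepsize GD on separable logistic regression.
# All numerical choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 2
w_star = np.array([1.0, 0.5])                  # assumed ground-truth separator
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)                        # labels from a linear rule => separable data

def loss(w):
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))      # (1/n) sum_i log(1 + e^{-y_i x_i^T w})

def grad(w):
    margins = y * (X @ w)
    s = np.exp(-np.logaddexp(0.0, margins))          # sigmoid(-margin), computed stably
    return -(X * (s * y)[:, None]).mean(axis=0)

w, eta, losses = np.zeros(d), 20.0, []               # eta intentionally large (assumption)
for _ in range(2000):
    losses.append(loss(w))
    w -= eta * grad(w)
losses.append(loss(w))

increases = int(np.sum(np.diff(losses) > 0))         # counts non-monotone steps
print(f"initial loss {losses[0]:.4f} | final loss {losses[-1]:.4f} | "
      f"steps where the loss went up: {increases}")
```

For the linear-regression half of claim (2), the sketch below runs one-pass ("online") SGD that sees far fewer samples than it has parameters and compares its population excess risk against the zero predictor. The covariance spectrum, signal profile, stepsize, and averaging scheme are illustrative assumptions, not the setting analyzed in the talk.

```python
# Sketch of (2): one-pass SGD for overparameterized linear regression.
# All numerical choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
d, n_steps = 1000, 200                               # parameters >> samples
lam = 1.0 / np.arange(1, d + 1) ** 2                 # assumed power-law covariance spectrum
w_star = rng.normal(size=d) / np.arange(1, d + 1)    # signal concentrated on leading directions
noise_std = 0.1

def excess_risk(w):
    # population excess risk (w - w*)^T Sigma (w - w*) for diagonal Sigma = diag(lam)
    return float(np.sum(lam * (w - w_star) ** 2))

w = np.zeros(d)
w_avg = np.zeros(d)
eta = 0.2                                            # constant stepsize (assumption)
for t in range(1, n_steps + 1):
    x = rng.normal(size=d) * np.sqrt(lam)            # a fresh sample per step (one pass)
    y = x @ w_star + noise_std * rng.normal()
    w -= eta * (x @ w - y) * x                       # SGD step on the squared loss
    w_avg += (w - w_avg) / t                         # running iterate average

print(f"excess risk of the zero predictor : {excess_risk(np.zeros(d)):.4f}")
print(f"excess risk of averaged online SGD: {excess_risk(w_avg):.4f}")
```

The printed numbers vary with the random seed; the averaged SGD iterate typically attains a much smaller excess risk than the zero predictor despite having seen only 200 samples for 1000 parameters.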
