Mikhail Belkin (The Ohio State University)
A model with zero training error is overfit to the training data and will typically generalize poorly" goes statistical textbook wisdom. Yet, in modern practice, over-parametrized deep networks with near perfect fit on training data still show excellent test performance. As I will discuss in the talk, this apparent contradiction is key to understanding the practice of modern machine learning.
While classical methods rely on a trade-off balancing the complexity of predictors with training error, modern models are best described by interpolation, where a predictor is chosen among functions that fit the training data exactly, according to a certain (implicit or explicit) inductive bias. Furthermore, classical and modern models can be unified within a single "double descent" risk curve, which extends the classical U-shaped bias-variance curve beyond the point of interpolation. This understanding of model performance delineates the limits of the usual ''what you see is what you get" generalization bounds in machine learning and points to new analyses required to understand computational, statistical, and mathematical properties of modern models.
I will proceed to discuss some important implications of interpolation for optimization, both in terms of "easy" optimization (due to the scarcity of non-global minima), and to fast convergence of small mini-batch SGD with fixed step size.