Playlist: 17 videos

### Deep Learning Theory Workshop and Summer School

0:59:25

Spencer Frei (UC Berkeley)

https://simons.berkeley.edu/talks/tutorial-statistical-learning-theory-and-neural-networks-i

Deep Learning Theory Workshop and Summer School

In the first tutorial, we review tools from classical statistical learning theory that are useful for understanding the generalization performance of deep neural networks. We describe uniform laws of large numbers and how they depend upon the complexity of the class of functions that is of interest. We focus on one particular complexity measure, Rademacher complexity, and upper bounds for this complexity in deep ReLU networks. We examine how the behaviors of modern neural networks appear to conflict with the intuition developed in the classical setting.
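
For reference, the Rademacher complexity discussed above can be stated precisely. The following is a standard formulation for function classes taking values in [0, 1]; the notation is ours rather than the tutorial's:

```latex
% Empirical Rademacher complexity of a class F on a sample S = (x_1, ..., x_n),
% with i.i.d. sign variables sigma_i uniform on {-1, +1}:
\hat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\!\left[ \sup_{f \in \mathcal{F}}
      \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right]

% Standard uniform law of large numbers: with probability at least 1 - delta,
% simultaneously for every f in F,
\mathbb{E}[f(x)] \le \frac{1}{n} \sum_{i=1}^{n} f(x_i)
  + 2\,\hat{\mathfrak{R}}_S(\mathcal{F})
  + 3\sqrt{\frac{\log(2/\delta)}{2n}}
```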

In the second tutorial, we review approaches for understanding neural network training from an optimization perspective. We review the classical analysis of gradient descent on convex and smooth objectives. We describe the Polyak-Lojasiewicz (PL) inequality and discuss how to interpret such an inequality in the context of neural network training. We describe a particular regime of neural network training that is well approximated by kernel methods, known as the neural tangent kernel (NTK) regime. We show how to establish a PL inequality for neural networks using two approaches: one, a general approach based on the NTK approximation; the other, specific to the setting of linearly separable data.
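
The convergence mechanism behind the PL inequality can be illustrated with a minimal sketch (our own toy example, not code from the tutorial): for an L-smooth objective satisfying the PL inequality with constant mu, gradient descent with step size 1/L contracts the suboptimality gap by a factor of (1 - mu/L) per step. A strongly convex quadratic is the simplest function satisfying both conditions:

```python
# Gradient descent on f(x, y) = 0.5 * (a*x^2 + b*y^2), which is
# b-smooth and satisfies the PL inequality 0.5*||grad f||^2 >= mu*(f - f*)
# with mu = a (here f* = 0 and 0 < a <= b).
a, b = 1.0, 10.0          # curvatures; smoothness L = b, PL constant mu = a
eta = 1.0 / b             # step size 1/L

def f(x, y):
    return 0.5 * (a * x**2 + b * y**2)

def grad(x, y):
    return a * x, b * y

x, y = 3.0, -2.0
losses = [f(x, y)]
for _ in range(200):
    gx, gy = grad(x, y)
    x, y = x - eta * gx, y - eta * gy
    losses.append(f(x, y))

# PL + smoothness predicts f(x_t) - f* <= (1 - mu/L)^t * (f(x_0) - f*).
rate = 1.0 - a / b
for t, loss in enumerate(losses):
    assert loss <= rate**t * losses[0] + 1e-12
```

The same argument goes through for non-convex objectives that satisfy a PL inequality, which is exactly why establishing such an inequality in the NTK regime yields convergence guarantees for neural network training.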

1:02:20

Spencer Frei (UC Berkeley)

https://simons.berkeley.edu/talks/tutorial-statistical-learning-theory-and-neural-networks-ii

Deep Learning Theory Workshop and Summer School

In the first tutorial, we review tools from classical statistical learning theory that are useful for understanding the generalization performance of deep neural networks. We describe uniform laws of large numbers and how they depend upon the complexity of the class of functions that is of interest. We focus on one particular complexity measure, Rademacher complexity, and upper bounds for this complexity in deep ReLU networks. We examine how the behaviors of modern neural networks appear to conflict with the intuition developed in the classical setting.

In the second tutorial, we review approaches for understanding neural network training from an optimization perspective. We review the classical analysis of gradient descent on convex and smooth objectives. We describe the Polyak-Lojasiewicz (PL) inequality and discuss how to interpret such an inequality in the context of neural network training. We describe a particular regime of neural network training that is well approximated by kernel methods, known as the neural tangent kernel (NTK) regime. We show how to establish a PL inequality for neural networks using two approaches: one, a general approach based on the NTK approximation; the other, specific to the setting of linearly separable data.

1:00:23

Aditi Raghunathan (Stanford)

https://simons.berkeley.edu/node/21926

Deep Learning Theory Workshop and Summer School

Deep networks often fail catastrophically in the presence of distribution shift, when the test distribution differs in some systematic way from the training distribution. Robustness to distribution shifts is typically studied for its crucial role in the reliable real-world deployment of deep networks. In this talk, we will see that robustness can also provide new insights into the functioning of deep networks, beyond the standard generalization puzzle. First, we will dive into the popular setting of transferring a pre-trained model to a downstream task. We study the optimization dynamics of the transfer process in a stylized setting that reproduces empirically observed behavior and allows us to devise a new heuristic that outperforms previous methods. Next, we will go over several observations from robustness in the standard supervised setting that provide a new perspective on the role of overparameterization and the inductive biases of deep networks.

0:57:45

Matus Telgarsky (University of Illinois at Urbana-Champaign)

https://simons.berkeley.edu/node/21927

Deep Learning Theory Workshop and Summer School

This work studies gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint). The first regime is near initialization, specifically until the weights have moved by O(sqrt(m)), where m denotes the network width, which is in sharp contrast to the O(1) weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains the NTK margin itself, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lies in extremely well-separated groups, and the sample complexity scales with the number of groups, and the main contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner-layer weights are constrained to change in norm only and not rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast with prior work, which required infinite width and a tricky dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions which will hopefully aid future work.
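
To fix notation for this abstract, a width-m two-layer ReLU network and its normalized margin can be written as follows (one common parameterization, chosen by us for illustration and not necessarily the paper's):

```latex
% Two-layer ReLU network of width m, inner weights w_j, outer weights a_j:
f(x; \theta) = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} a_j \, \sigma(\langle w_j, x \rangle),
\qquad \sigma(z) = \max\{0, z\}, \quad \theta = (a, W)

% f is 2-homogeneous in theta, so the normalized margin on a sample
% (x_1, y_1), ..., (x_n, y_n) with y_i in {-1, +1} is
\bar{\gamma} = \min_{1 \le i \le n} y_i \, f\!\left(x_i; \theta / \|\theta\|\right)
  = \min_{1 \le i \le n} \frac{y_i \, f(x_i; \theta)}{\|\theta\|^2}
```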

1:20:40

Nati Srebro (Toyota Technological Institute at Chicago)

https://simons.berkeley.edu/talks/tutorial-implicit-bias-i

Deep Learning Theory Workshop and Summer School

1:36:30

Nati Srebro (Toyota Technological Institute at Chicago)

https://simons.berkeley.edu/talks/tutorial-implicit-bias-ii

Deep Learning Theory Workshop and Summer School

0:54:41

Niladri Chatterji (Stanford)

https://simons.berkeley.edu/node/21930

Deep Learning Theory Workshop and Summer School

In this talk, I shall present two research vignettes on the generalization of interpolating models.

Prior work has presented strong empirical evidence demonstrating that importance weights can have little to no effect on interpolating neural networks. We show that importance weighting fails not because of interpolation itself, but rather as a result of using exponentially-tailed losses such as the cross-entropy loss. As a remedy, we show that polynomially-tailed losses restore the effects of importance reweighting in correcting distribution shift in interpolating models trained by gradient descent. Surprisingly, our theory reveals that using biased importance weights can improve performance in interpolating models.
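
The mechanism can be seen in a toy calculation (a hedged sketch under our own simplifications, not the paper's analysis): at interpolation, margins grow without bound, and under an exponentially-tailed loss a modest margin gap between two examples cancels even a large importance weight, whereas a polynomially-tailed loss preserves the weight ratio.

```python
import math

# Per-example gradient magnitudes w.r.t. the margin z = y * f(x),
# scaled by an importance weight w.
def grad_exp_tail(z, w):
    return w * math.exp(-z)            # exponential-tail loss, e.g. exp(-z)

def grad_poly_tail(z, w, p=2):
    return w * (1.0 + z) ** (-p)       # polynomially-tailed loss for z >= 0

# Two interpolated examples: the first carries a 10x importance weight
# but has a slightly larger margin (23 vs. 20).
ratio_exp = grad_exp_tail(23.0, 10.0) / grad_exp_tail(20.0, 1.0)
ratio_poly = grad_poly_tail(23.0, 10.0) / grad_poly_tail(20.0, 1.0)

# A margin gap of 3 wipes out the 10x weight under the exponential tail ...
assert ratio_exp < 1.0                 # ~0.50
# ... while the weighting survives almost intact under the polynomial tail.
assert ratio_poly > 7.0                # ~7.66
```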

Second, I shall present lower bounds on the excess risk of sparse interpolating procedures for linear regression. Our result shows that the excess risk of the minimum L1-norm interpolant can converge at an exponentially slower rate than the minimum L2-norm interpolant, even when the ground truth is sparse. Our analysis exposes the benefit of an effect analogous to the "wisdom of the crowd", except here the harm arising from fitting the noise is ameliorated by spreading it among many directions.
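
A hand-computable sketch of the two interpolants being compared (our own one-equation toy, not the paper's regression setting): with a single observation and two unknowns, the minimum L2-norm interpolant spreads mass across coordinates, while the minimum L1-norm interpolant concentrates it on one.

```python
# One observation, two unknowns: find theta = (t1, t2) with x . theta = y.
x = (3.0, 4.0)
y = 10.0

# Minimum L2-norm interpolant: theta = y * x / ||x||^2.
nrm2 = x[0]**2 + x[1]**2                         # = 25
theta_l2 = (y * x[0] / nrm2, y * x[1] / nrm2)    # (1.2, 1.6): spreads mass

# Minimum L1-norm interpolant: put all mass on the largest |x_j|
# (the basis-pursuit solution for a single linear constraint).
theta_l1 = (0.0, y / x[1])                       # (0.0, 2.5): sparse

# Both interpolate the observation exactly:
assert abs(x[0] * theta_l2[0] + x[1] * theta_l2[1] - y) < 1e-12
assert abs(x[0] * theta_l1[0] + x[1] * theta_l1[1] - y) < 1e-12

# Compare L1 norms of the two interpolants:
l1_of_l2 = abs(theta_l2[0]) + abs(theta_l2[1])   # 2.8
l1_of_l1 = abs(theta_l1[0]) + abs(theta_l1[1])   # 2.5
```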

Based on joint work with Tatsunori Hashimoto, Saminul Haque, Philip Long, and Alexander Wang.

0:57:45

Neil Mallinar (UC San Diego) & Jamie Simon (UC Berkeley)

https://simons.berkeley.edu/node/21931

Deep Learning Theory Workshop and Summer School

The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied benign overfitting, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime tempered overfitting, and we initiate its systematic study. We first explore this phenomenon in the context of kernel (ridge) regression (KR) by obtaining conditions on the ridge parameter and kernel eigenspectrum under which KR exhibits each of the three behaviors. We find that kernels with power-law spectra, including Laplace kernels and ReLU neural tangent kernels, exhibit tempered overfitting. We then empirically study deep neural networks through the lens of our taxonomy, and find that those trained to interpolation are tempered, while those stopped early are benign. We hope our work leads to a more refined understanding of overfitting in modern learning.

Joint Work With: Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran.

Link: https://arxiv.org/abs/2207.06569
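
The kernel (ridge) regression setup in the abstract can be sketched in a few lines (a two-point toy example of ours, not the paper's experiments), using the Laplace kernel named above; setting the ridge parameter to zero gives the interpolating predictor.

```python
import math

# Kernel (ridge) regression with a Laplace kernel on two 1-D training points.
def k(u, v):
    return math.exp(-abs(u - v))       # Laplace kernel (power-law spectrum)

xs = [0.0, 1.0]
ys = [0.0, 1.0]
ridge = 0.0                            # > 0 gives the regularized variant

# 2x2 regularized kernel matrix and its explicit inverse: alpha = K^{-1} y.
a, b = k(xs[0], xs[0]) + ridge, k(xs[0], xs[1])
c, d = k(xs[1], xs[0]), k(xs[1], xs[1]) + ridge
det = a * d - b * c
alpha = ((d * ys[0] - b * ys[1]) / det, (a * ys[1] - c * ys[0]) / det)

def predict(x):
    return alpha[0] * k(x, xs[0]) + alpha[1] * k(x, xs[1])

# With ridge = 0 the predictor interpolates the training labels exactly:
assert abs(predict(0.0) - 0.0) < 1e-9
assert abs(predict(1.0) - 1.0) < 1e-9
```

Increasing `ridge` moves the predictor away from interpolation; the abstract's taxonomy concerns how test risk behaves as this parameter and the kernel eigenspectrum vary.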

0:58:40

Ahmed El Alaoui (Cornell)

https://simons.berkeley.edu/talks/methods-statistical-physics-i

Deep Learning Theory Workshop and Summer School

1:06:17

Ahmed El Alaoui (Cornell)

https://simons.berkeley.edu/talks/methods-statistical-physics-ii

Deep Learning Theory Workshop and Summer School
