### Monday, August 1st, 2022

**Abstract:** In the first tutorial, we review tools from classical statistical learning theory that are useful for understanding the generalization performance of deep neural networks. We describe uniform laws of large numbers and how they depend upon the complexity of the class of functions of interest. We focus on one particular complexity measure, Rademacher complexity, and on upper bounds on this complexity for deep ReLU networks. We examine how the behaviors of modern neural networks appear to conflict with the intuition developed in the classical setting.
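The Rademacher complexity mentioned above can be estimated by Monte Carlo for simple function classes. A minimal sketch (illustrative, not from the tutorial), using the closed-form supremum for norm-bounded linear functions:

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= B} on the sample X (shape n x d).
    By Cauchy-Schwarz the inner supremum has the closed form
    sup_w (1/n) sum_i s_i <w, x_i> = (B/n) * ||sum_i s_i x_i||_2."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = []
    for _ in range(n_draws):
        s = rng.choice([-1.0, 1.0], size=n)      # Rademacher signs
        vals.append(B / n * np.linalg.norm(s @ X))
    return float(np.mean(vals))

X = np.random.default_rng(1).normal(size=(200, 5))
est = empirical_rademacher_linear(X)
```

The estimate sits below the standard bound (B / n) * sqrt(sum of squared sample norms), which follows from Jensen's inequality.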

In the second tutorial, we review approaches for understanding neural network training from an optimization perspective. We review the classical analysis of gradient descent on convex and smooth objectives. We describe the Polyak--Lojasiewicz (PL) inequality and discuss how to interpret such an inequality in the context of neural network training. We describe a particular regime of neural network training that is well-approximated by kernel methods, known as the neural tangent kernel (NTK) regime. We show how to establish a PL inequality for neural networks using two approaches: one, a general argument based on the NTK approximation; the other, specific to the setting of linearly separable data.
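As a toy illustration of the PL-based convergence argument (a sketch on a quadratic objective, not the neural-network setting of the tutorial): for an L-smooth function satisfying the PL inequality ||grad f(x)||^2 >= 2 mu (f(x) - f*), gradient descent with step size 1/L contracts the suboptimality by a factor (1 - mu/L) per step.

```python
import numpy as np

# A strongly convex quadratic f(x) = 0.5 x^T A x satisfies the PL inequality
# with mu the smallest eigenvalue of A, and is L-smooth with L the largest.
rng = np.random.default_rng(0)
A = np.diag([1.0, 4.0, 10.0])            # eigenvalues give mu = 1, L = 10
mu, L = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x            # minimum value f* = 0, at x = 0
grad = lambda x: A @ x

x = rng.normal(size=3)
gaps = [f(x)]                            # suboptimality f(x_t) - f* at each step
for _ in range(50):
    x = x - (1.0 / L) * grad(x)          # gradient descent with step 1/L
    gaps.append(f(x))

# PL-based guarantee: f(x_t) - f* <= (1 - mu/L)^t * (f(x_0) - f*)
```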

**Abstract:** Deep networks often fail catastrophically in the presence of distribution shift, when the test distribution differs in some systematic way from the training distribution. Robustness to distribution shift is typically studied for its crucial role in reliable real-world deployment of deep networks. In this talk, we will see that robustness can also provide new insights into the functioning of deep networks, beyond the standard generalization puzzle. First, we will dive into the popular setting of transferring a pre-trained model to a downstream task. We study the optimization dynamics of the transfer process in a stylized setting that reproduces observed empirical behavior and allows us to devise a new heuristic that outperforms previous methods. Next, we will go over several observations about robustness in the standard supervised setting that provide a new perspective on the role of overparameterization and the inductive biases of deep networks.

**Abstract:** This work studies gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint). The first regime is near initialization, specifically until the weights have moved by O(sqrt(m)), where m denotes the network width, in sharp contrast to the O(1) weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD need only a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains the NTK margin itself, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lies in extremely well-separated groups and the sample complexity scales with the number of groups; the main contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner-layer weights are constrained to change in norm only and not rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast with prior work, which required infinite width and a tricky dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions which will hopefully aid future work.
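The sqrt(m) scale of weight movement can be probed numerically. A hedged sketch (the synthetic data and hyperparameters are illustrative choices, not from the paper) that trains only the inner layer of a two-layer ReLU network with small-step GD as a stand-in for gradient flow, and reports the movement on the sqrt(m) scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 5, 1000                       # samples, input dimension, width
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.sign(X[:, 0])                        # simple synthetic labels

# Two-layer ReLU net f(x) = a . relu(W x), standard-style initialization;
# only the inner weights W are trained here.
W0 = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
W = W0.copy()

lr = 0.1
losses = []
for _ in range(200):                        # small-step GD as a proxy for GF
    act = X @ W.T                           # pre-activations, shape (n, m)
    resid = np.maximum(act, 0.0) @ a - y    # squared-loss residuals
    losses.append(0.5 * np.mean(resid ** 2))
    # gradient of 0.5 * mean(resid^2) with respect to W
    G = ((resid[:, None] * (act > 0)) * a[None, :]).T @ X / n
    W -= lr * G

# how far the weights moved, on the sqrt(m) scale discussed in the abstract
movement = np.linalg.norm(W - W0) / np.sqrt(m)
```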

### Tuesday, August 2nd, 2022

No abstract available.

No abstract available.

**Abstract:** In this talk, I shall present two research vignettes on the generalization of interpolating models.

First, prior work has presented strong empirical evidence demonstrating that importance weights can have little to no effect on interpolating neural networks. We show that importance weighting fails not because of the interpolation, but instead as a result of using exponentially-tailed losses like the cross-entropy loss. As a remedy, we show that polynomially-tailed losses restore the effects of importance reweighting in correcting distribution shift in interpolating models trained by gradient descent. Surprisingly, our theory reveals that using biased importance weights can improve performance in interpolating models.
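A minimal numerical illustration of the tail effect (the polynomial loss below is an illustrative choice, not necessarily the paper's):

```python
import numpy as np

# Exponentially-tailed loss (logistic): ell(z) ~ exp(-z) as the margin z grows.
def logistic_loss(z):
    return np.log1p(np.exp(-z))

# A polynomially-tailed surrogate (illustrative, alpha = 2):
# ell(z) = (1 + z)^(-alpha) for z >= 0, extended linearly for z < 0 so the
# function stays convex and differentiable at 0.
def poly_loss(z, alpha=2.0):
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0, (1.0 + np.maximum(z, 0)) ** (-alpha), 1.0 - alpha * z)

# As margins grow under interpolation, the exponential tail swamps the
# importance weights: a weight of 100 on one example scales its gradient
# contribution by 100 * exp(-z), which vanishes far faster than the
# polynomial tail's 100 * alpha * (1 + z)^(-alpha - 1).
z = 20.0
exp_contrib = 100 * np.exp(-z)                   # upweighted, exponential tail
poly_contrib = 100 * 2.0 * (1.0 + z) ** -3.0     # upweighted, polynomial tail
```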

Second, I shall present lower bounds on the excess risk of sparse interpolating procedures for linear regression. Our result shows that the excess risk of the minimum L1-norm interpolant can converge at an exponentially slower rate than the minimum L2-norm interpolant, even when the ground truth is sparse. Our analysis exposes the benefit of an effect analogous to the "wisdom of the crowd", except here the harm arising from fitting the noise is ameliorated by spreading it among many directions.

Based on joint work with Tatsunori Hashimoto, Saminul Haque, Philip Long, and Alexander Wang.

**Abstract:** The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied benign overfitting, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but finite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime tempered overfitting, and we initiate its systematic study. We first explore this phenomenon in the context of kernel (ridge) regression (KR) by obtaining conditions on the ridge parameter and kernel eigenspectrum under which KR exhibits each of the three behaviors. We find that kernels with power-law spectra, including Laplace kernels and ReLU neural tangent kernels, exhibit tempered overfitting. We then empirically study deep neural networks through the lens of our taxonomy, and find that those trained to interpolation are tempered, while those stopped early are benign. We hope our work leads to a more refined understanding of overfitting in modern learning.
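The KR behaviors described above are easy to play with directly. A small sketch (the bandwidth, data, and ridge values are illustrative) contrasting a vanishing ridge, which interpolates the noisy labels, with a substantial ridge, which does not:

```python
import numpy as np

def laplace_kernel(X1, X2, bandwidth=1.0):
    """Laplace kernel k(x, x') = exp(-||x - x'|| / bandwidth)."""
    dists = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return np.exp(-dists / bandwidth)

def kernel_ridge_coefs(X, y, ridge):
    """Solve (K + ridge * I) alpha = y for the KR coefficient vector."""
    K = laplace_kernel(X, X)
    return np.linalg.solve(K + ridge * np.eye(len(y)), y)

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=n)   # noisy labels

K = laplace_kernel(X, X)
# Vanishing ridge: the fit interpolates the noisy training labels.
interp_preds = K @ kernel_ridge_coefs(X, y, ridge=1e-10)
# Substantial ridge: the fit smooths rather than interpolates.
ridged_preds = K @ kernel_ridge_coefs(X, y, ridge=1.0)
```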

Joint work with: Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran.

### Wednesday, August 3rd, 2022

Panelists: Peter Bartlett, Bin Yu, Nathan Srebro

Moderator: Preetum Nakkiran

**Abstract:** Causal inference from high-dimensional observational studies poses intriguing challenges. In this context, the augmented inverse probability weighting estimator is widely used for average treatment effect estimation. This estimator exhibits fascinating properties, such as double robustness. However, existing statistical guarantees rely on some form of sparsity in the underlying model, and may fail to apply in practical settings when these assumptions are violated.
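For concreteness, a sketch of the AIPW estimator itself (the synthetic data and plug-in nuisances are illustrative; the talk's setting is high-dimensional):

```python
import numpy as np

def aipw_ate(y, t, prop, mu1, mu0):
    """Augmented inverse probability weighting estimate of the average
    treatment effect. prop: estimated propensity e(x) = P(T=1|x);
    mu1, mu0: estimated outcome regressions E[Y|T=1,x] and E[Y|T=0,x]."""
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / prop
                   - (1 - t) * (y - mu0) / (1 - prop))

# Synthetic check with known nuisances; the true ATE is 2.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))            # true propensity
t = rng.binomial(1, e)
y = 1.0 + 2.0 * t + x + 0.5 * rng.normal(size=n)
est = aipw_ate(y, t, prop=e, mu1=3.0 + x, mu0=1.0 + x)
```

Double robustness means the estimate stays consistent if either the propensity or the outcome model is correct; here both are, so the estimate lands near 2.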

In this talk, we present a new central limit theorem for this estimator that applies in high dimensions, without sparsity-type assumptions on the underlying signals. Specifically, we work in the proportional asymptotics regime, where the number of features and samples are both large and comparable. Our work uncovers novel high-dimensional phenomena that are strikingly different from their classical counterparts.

To conclude, we discuss opportunities that arise in our framework when modern machine-learning-based estimators are used for learning the high-dimensional nuisance parameters. On the technical front, our work utilizes a novel interplay between three distinct tools: the theory of deterministic equivalents, approximate message passing theory, and the leave-one-out approach (also known as the cavity method in statistical physics).

This is based on joint work with Kuanhao Jiang, Rajarshi Mukherjee, and Subhabrata Sen (Harvard).

### Thursday, August 4th, 2022

**Abstract:** Modern multi-modal models such as CLIP require significant engineering effort to train, evaluate, and deploy efficiently. Furthermore, such models typically serve as a backbone feature extractor for many downstream tasks. This talk will provide an overview of how we’ve accomplished this at Apple, where CLIP now powers a large number of user experiences on iOS. We’ll cover concepts such as multi-node, multi-GPU distributed training on billions of examples, transfer learning for downstream tasks, model pruning, efficient on-device inference with transformers, and more.

**Abstract:** Recent empirical work has shown that hierarchical convolutional kernels inspired by convolutional neural networks (CNNs) significantly improve the performance of kernel methods in image classification tasks. A widely accepted explanation for the success of these architectures is that they encode hypothesis classes that are suitable for natural images. However, understanding the precise interplay between approximation and generalization in convolutional architectures remains a challenge.

In this talk, we consider a stylized model of the covariates (image pixels) and fully characterize the RKHS of kernels composed of single layers of convolution, pooling, and downsampling operations. We then study the gain in sample efficiency of kernel methods using these kernels over standard inner-product kernels. In particular, we show that 1) the convolution layer breaks the curse of dimensionality by restricting the RKHS to 'local' functions; 2) global average pooling constrains the learned function to be translation invariant; 3) local pooling biases learning towards low-frequency functions. Notably, our results quantify how choosing an architecture adapted to the target function leads to a large improvement in sample complexity.
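Point 2, that global average pooling enforces translation invariance, can be seen in a few lines (a 1-D cyclic toy example, not the talk's setting):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)                    # a 1-D "image" with cyclic pixels
w = rng.normal(size=5)                     # a local convolution filter

def circular_conv(x, w):
    """Cyclic convolution: the filter response at every pixel location."""
    n, q = len(x), len(w)
    return np.array([np.dot(w, np.roll(x, -i)[:q]) for i in range(n)])

feat = np.maximum(circular_conv(x, w), 0.0)          # local nonlinearity
gap = feat.mean()                                    # global average pooling

# A cyclic translation of the input permutes the patch responses,
# so the global average is unchanged.
x_shift = np.roll(x, 3)
gap_shift = np.maximum(circular_conv(x_shift, w), 0.0).mean()
```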

**Abstract:** Approximate Message Passing (AMP) is a class of efficient iterative algorithms that has been extensively utilized for signal recovery in high-dimensional inference problems. At each iteration, the algorithm computes a matrix-vector product, followed by a coordinate-wise application of a non-linear map to the resulting vector. The main attraction of AMP arises from the fact that the limiting empirical distributions of AMP iterates are Gaussian, with means and variances that can be characterized in terms of a low-dimensional recursion known as *state evolution*. These guarantees are usually derived under very specific distributional assumptions on the matrix, e.g., i.i.d. Gaussian entries or orthogonally invariant matrices. However, numerical investigations indicate that AMP algorithms enjoy a remarkable degree of universality with respect to the data distribution. We will discuss universality of AMP algorithms on a class of *semi-random* matrices, which can be significantly less random than matrices with i.i.d. entries. Time permitting, I will discuss implications for statistical learning problems.
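A minimal AMP sketch for sparse linear regression (the soft-thresholding denoiser and data-driven threshold are illustrative choices), showing the matrix-vector product, the coordinate-wise non-linear map, and the Onsager correction:

```python
import numpy as np

def soft_threshold(x, tau):
    """Coordinate-wise soft-thresholding denoiser."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def amp(A, y, alpha=1.2, iters=25):
    """AMP for y = A x + noise with a soft-thresholding denoiser.
    The Onsager term in z makes the effective observations x + A^T z
    behave like the signal plus Gaussian noise (state evolution)."""
    n, p = A.shape
    x = np.zeros(p)
    z = y.copy()
    for _ in range(iters):
        tau = alpha * np.linalg.norm(z) / np.sqrt(n)   # data-driven threshold
        x = soft_threshold(x + A.T @ z, tau)           # non-linear map step
        z = y - A @ x + (z / n) * np.count_nonzero(x)  # Onsager correction
    return x

rng = np.random.default_rng(0)
n, p = 300, 600
A = rng.normal(size=(n, p)) / np.sqrt(n)     # iid Gaussian, normalized columns
x_true = np.zeros(p)
x_true[:30] = 3.0 * rng.normal(size=30)      # sparse signal
y = A @ x_true + 0.01 * rng.normal(size=n)
x_hat = amp(A, y)
```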

This is based on joint work with Rishabh Dudeja and Yue Lu (Harvard).

### Friday, August 5th, 2022

**Abstract:** Deep learning continues its march of performance progress as models and datasets are scaled up. This talk will discuss work investigating the predictability of performance with model, dataset, and compute scale for deep learning in general and large language models in particular. I will review scaling in linear models -- a simple analytic system exhibiting many of the phenomena characteristic of realistic networks. I will also discuss empirical work attempting to investigate what types of problems can practically be solved by scale alone and what types cannot.
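Scaling-law fits of the kind discussed are often done by linear regression in log-log space. A sketch on synthetic measurements (the exponent and constants are made up for illustration):

```python
import numpy as np

# Fit a power law L(N) = a * N^(-b) to synthetic "loss vs dataset size"
# measurements, then extrapolate to a larger scale.
sizes = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
rng = np.random.default_rng(0)
# true law: 5 * N^(-0.35), with small multiplicative measurement noise
losses = 5.0 * sizes ** -0.35 * np.exp(0.02 * rng.normal(size=sizes.size))

# A power law is a straight line in log-log space, so fit by least squares.
coeffs = np.polyfit(np.log(sizes), np.log(losses), deg=1)
b_hat = -coeffs[0]            # estimated scaling exponent
a_hat = np.exp(coeffs[1])     # estimated prefactor

# Extrapolate the fitted law one order of magnitude beyond the data.
pred_1e8 = a_hat * 1e8 ** -b_hat
```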

Tutorial: Deep Learning Applications in Structural Biology and Protein Engineering

**Abstract:** There are about 20,000 different proteins in each of us. These proteins carry out a diverse set of functions to keep us alive and healthy. Recently, deep learning has been increasingly used to both 1) help us visualize and gain insights into naturally existing proteins and 2) design novel proteins for therapeutic and environmental applications. In this talk, we will take a deep dive into the inner workings of AlphaFold2 and other emerging deep learning methods in structural biology and protein design. We will also examine the assumptions on biological data distributions and discuss hypotheses for the crucial ingredients of successful deep learning applications.