All scheduled dates:
Speaker: Vatsal Sharan
Title: On the Statistical Complexity of Sample Amplification: Increasing Dataset Size even when Learning is Impossible
Abstract:
Is learning a distribution always necessary for generating new samples from the distribution? To study this, we introduce the problem of “sample amplification”: given n independent draws from an unknown distribution, D, to what extent is it possible to output a set of m > n datapoints that are indistinguishable from m i.i.d. draws from D? Curiously, we show that nontrivial amplification is often possible in the regime where the number of datapoints n is too small to learn D to any nontrivial accuracy. We prove upper bounds and matching lower bounds on sample amplification for a number of distribution families including discrete distributions, Gaussians, and any continuous exponential family.
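For intuition, here is one natural amplification baseline in the Gaussian case (a minimal sketch assuming known identity covariance; not necessarily the paper's optimal procedure): return the original samples together with fresh draws from the fitted empirical Gaussian.

```python
import numpy as np

def amplify_gaussian(samples: np.ndarray, m: int, seed=None) -> np.ndarray:
    """Naive sample amplification for an unknown Gaussian N(mu, I).

    Returns the n original samples plus (m - n) draws from the Gaussian
    fit to the empirical mean -- a simple baseline for illustration, not
    necessarily the optimal amplifier from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = samples.shape
    mu_hat = samples.mean(axis=0)                      # empirical mean
    extra = rng.normal(loc=mu_hat, scale=1.0, size=(m - n, d))
    return np.concatenate([samples, extra], axis=0)

# Example: amplify n = 100 draws from N(mu, I_10) to m = 110 points.
rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=1.0, size=(100, 10))
X_amp = amplify_gaussian(X, m=110, seed=1)
print(X_amp.shape)  # (110, 10)
```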
Speaker: Yuchen Li (lightning talk)
Title: Towards Mathematical Understanding of Modern Language Models
Abstract: To mathematically reason about how neural networks learn languages, our methodology involves three major components: (1) mathematically characterizing key structures in language data distributions, (2) theoretically proving how neural networks capture such structures through self-supervision during pre-training, and (3) conducting controlled experiments using synthetic data. In this talk, I will survey a few applications of this methodology: understanding Transformer training dynamics via the lens of topic models, proving pitfalls in common Transformer interpretability heuristics via the lens of a formal language (Dyck), proving limitations on the expressivity of autoregressive models via the lens of another formal language (PCFG), and analyzing the potential of a non-autoregressive alternative (masked language models) in terms of learning and sampling efficiency via the lens of Ising models.
Speaker: Stephen Huan (lightning talk)
Title: Computing transport maps by cumulant matching
Abstract: In this talk, we design algorithms for statistical inference and generative modeling by exploiting interplays between linear algebra, information theory, and combinatorics. In the first part, we study fast inference in Gaussian processes, or equivalently, fast computation with symmetric positive-definite matrices. Previous state-of-the-art methods for constructing sparse approximate inverse Cholesky factors by minimizing Kullback-Leibler divergence rely only on the geometry of the evaluation points to construct the ordering and sparsity pattern. Instead, we propose a greedy selection algorithm that maximizes conditional mutual information, improving over k-nearest-neighbors selection in a variety of experimental settings. A randomized ordering is competitive in high-dimensional, low-rank settings; we propose a deterministic ordering based on mutual information that improves upon these orderings. We apply our approximation framework to earthquake modeling and (entropy-regularized) optimal transport. In the second part of this talk, we attempt to generalize the previous results to non-Gaussian random variables through the formalism of measure transport. The Cholesky factor becomes the Knothe-Rosenblatt rearrangement, a unique transport map whose Jacobian is lower triangular with a positive diagonal. In the non-Gaussian case, we cannot hope to match only the mean and covariance, as in the Gaussian setting. Instead, we use higher-order cumulants, which can be estimated directly from samples, to characterize probability distributions. Expansions based on cumulants and Möbius inversion give simple polynomial expressions for approximate entropy, densities, log-densities, and transport maps. We can also estimate functionals such as entropy.
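For context, the KL-minimizing sparse inverse Cholesky factor that these orderings and sparsity patterns feed into has a simple closed form per column. A minimal numpy sketch (assuming the ordering and pattern are given; choosing them well is the talk's contribution):

```python
import numpy as np

def kl_optimal_inverse_cholesky(Theta, sparsity):
    """KL-optimal sparse approximate inverse Cholesky factor.

    Given an SPD covariance Theta and a sparsity pattern sparsity[i]
    (indices >= i, containing i), returns lower-triangular L with
    L @ L.T ~= inv(Theta), minimizing the KL divergence between the two
    Gaussians over the pattern.  Closed form per column:
        L[s_i, i] = Theta[s_i, s_i]^{-1} e_1 / sqrt(e_1^T Theta[s_i, s_i]^{-1} e_1)
    """
    n = Theta.shape[0]
    L = np.zeros((n, n))
    for i in range(n):
        s = np.array(sorted(sparsity[i]))          # first entry must be i
        e1 = np.zeros(len(s)); e1[0] = 1.0
        col = np.linalg.solve(Theta[np.ix_(s, s)], e1)
        L[s, i] = col / np.sqrt(col[0])
    return L

# Tiny example: 1D exponential kernel, pattern = next few points in order.
pts = np.sort(np.random.default_rng(0).uniform(size=30))
Theta = np.exp(-np.abs(pts[:, None] - pts[None, :]))
pattern = {i: list(range(i, min(i + 5, 30))) for i in range(30)}
L = kl_optimal_inverse_cholesky(Theta, pattern)
print(np.linalg.norm(L @ L.T - np.linalg.inv(Theta)))  # tiny: OU precision is tridiagonal
```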
Speaker: Ramya Vinayak
Title: Towards Plurality: Learning from Diverse Human Preferences
Abstract: Large pre-trained models trained on internet-scale data are often not ready for safe deployment out of the box. They are heavily fine-tuned and aligned using large quantities of human preference data, usually elicited via pairwise comparisons. When aligning an AI/ML model to human preferences or values, it is worth asking whose preferences and values we are aligning it to. Current approaches to alignment are severely limited by their inherent uniformity assumption. There is also a rich literature on learning preferences from human judgements via comparison queries, which plays a crucial role in applications ranging from cognitive and behavioral psychology to crowdsourced democracy, social science surveys, and recommendation systems. However, models in this literature typically either learn an average preference over the population, owing to limits on the data available per individual, or learn a single individual's preferences using a large number of queries. Furthermore, the metric, i.e., the way humans judge similarity and dissimilarity, is assumed to be known, which does not hold in practice. We aim to overcome these limitations by building mathematical foundations for learning from diverse human preferences.
In this talk, I will present a personalizable reward modeling framework for pluralistic alignment (PAL), which captures diversity in preferences while pooling data across individuals to learn commonalities. I will also discuss theoretical results on per-user sample complexity for generalization in preference prediction, and fundamental limitations when pairwise comparisons are limited.
This talk is based on the following papers:
- Chen, Chen, Rege, Vinayak. PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences. arXiv preprint, 2024. https://arxiv.org/pdf/2406.08469
- Canal, Mason, Vinayak, Nowak. One-for-all: Simultaneous Metric and Preference Learning over Multiple Users. NeurIPS 2022. https://arxiv.org/pdf/2207.03609
- Wang, So, Vinayak. Metric Learning via Limited Pairwise Comparison. UAI 2024. https://arxiv.org/pdf/2403.19629
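As a rough illustration of the PAL idea above, one can model each user's reward as a convex mixture of shared prototype rewards and fit everything with a Bradley-Terry loss on pairwise comparisons. A toy PyTorch sketch (the names and the linear-prototype choice are illustrative assumptions, not the paper's implementation):

```python
import torch

# K shared prototype reward functions (linear here); each user u mixes
# them with convex weights (softmax of per-user logits).
d, K, U = 16, 4, 50                                   # feature dim, prototypes, users
protos = torch.randn(K, d, requires_grad=True)        # shared across all users
user_logits = torch.zeros(U, K, requires_grad=True)   # per-user mixture weights

def reward(x, u):
    w = torch.softmax(user_logits[u], dim=-1)         # convex weights, shape (K,)
    return (w @ protos) @ x                           # user u's reward for item x

def bt_loss(x_pref, x_rej, u):
    # Bradley-Terry: P(x_pref preferred) = sigmoid(r(x_pref) - r(x_rej))
    return -torch.nn.functional.logsigmoid(reward(x_pref, u) - reward(x_rej, u))

opt = torch.optim.Adam([protos, user_logits], lr=1e-2)
# One illustrative update on a random comparison from user 3.
x_a, x_b = torch.randn(d), torch.randn(d)
loss = bt_loss(x_a, x_b, u=3)
opt.zero_grad(); loss.backward(); opt.step()
```

The point of the mixture is that prototypes are learned from everyone's data while only the low-dimensional weights are learned per user, which is what keeps the per-user sample complexity small.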
Speaker: Frauke Kreuter
Title: Large Language Models in Social Science: Opportunities and Challenges
Abstract:
This talk explores the growing role of Large Language Models (LLMs) in the social sciences, where they are transforming not only data collection and analysis but also pushing the limits of our theoretical understanding. While LLMs offer powerful tools for working with unstructured data, they raise fundamental questions around interpretability, bias, and robustness in the study of human behavior and social phenomena. This talk will address how these challenges present new opportunities for interdisciplinary collaboration, inviting insights from theoretical computer science to improve the alignment, fairness, and reliability of LLMs in social research.
Speaker: Reinhard Heckel
Title: Measuring Bias of Web-filtered Text Datasets and Bias Propagation Through Training
Abstract:
In this talk, we first discuss how pretraining datasets for LLMs are sourced from the web. We then investigate biases in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of biases in popular computer vision datasets, we analyze popular open-source pretraining text datasets derived from CommonCrawl, including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and others. Despite these datasets being obtained with similar filtering and deduplication steps, LLMs can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that popular pretraining datasets carry their own unique biases or fingerprints.
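A toy stand-in for the dataset-classification protocol (the paper trains LLM classifiers; here a bag-of-words classifier on two synthetic "corpora" with slightly different word distributions illustrates the experimental setup):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Fake two "pretraining corpora" whose word distributions differ slightly;
# the question is whether a classifier can tell single sequences apart.
rng = np.random.default_rng(0)
vocab_a = ["the", "model", "data", "web", "crawl", "filter"]
vocab_b = ["the", "model", "data", "news", "sports", "recipe"]
fake_doc = lambda vocab: " ".join(rng.choice(vocab, size=30))
sequences = [fake_doc(vocab_a) for _ in range(500)] + \
            [fake_doc(vocab_b) for _ in range(500)]
source = ["corpus-A"] * 500 + ["corpus-B"] * 500

X_tr, X_te, y_tr, y_te = train_test_split(sequences, source, random_state=0)
vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_tr), y_tr)
acc = accuracy_score(y_te, clf.predict(vec.transform(X_te)))
print(f"dataset-classification accuracy: {acc:.3f}")  # chance would be 0.5
```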
Speaker: Russell Impagliazzo
Title: Smoothed Boosting with simple weak learners: a fable about out-of-distribution generalization.
Joint work with Maya Josyula
Abstract:
Boosting is a highly successful technique for converting weak learners into strong learners, introduced by Schapire in 1990. Although there is a well-known lower bound of $\log(1/\epsilon)/\gamma^2$ on the number of calls to a weak learner for boosting, Alon, Gonen, Hazan and Moran showed that boosting can be done in $\tilde{O}(1/\gamma)$ iterations if the weak learner produces concepts from a class of bounded VC dimension.
Here, we show a similar result for {\em smoothed boosting} algorithms (Impagliazzo 95; Klivans and Servedio 99; Holenstein 01; Servedio 03; Barak, Hardt and Kale 09). In smoothed boosting, the boosting algorithm is restricted to reweightings of the input distribution that limit the amount of weight that can be placed on small portions of the original distribution. This is significant when applying boosting, since otherwise a small fraction of outliers can gather inordinate weight, and in computational complexity theory, where boosting techniques are used to prove hardcore distribution lemmas.
Moreover, we show that the same algorithm can be used for batch boosting over an arbitrary set of learning problems, as long as the data distribution for each is not too sparse within a common reference distribution. The total number of uses of all weak learners for this set has an upper bound similar to that for a single run of the boosting algorithm, independent of the number of learning problems. The intuition is that the hypotheses found by the weak learners for earlier problems are also useful for later problems in the sequence. In this sense, the boosting algorithm is learning features that are useful for any learning problem related to the small VC class, i.e., learning a cover for functions in the class. This is an analog of conjectures about what happens in the first few layers of a neural net in deep learning, which might help explain why the features learned by training on one problem are helpful in solving other ``out-of-distribution'' problems.
Under different assumptions about the weak learning algorithm, the same approach yields similar results for both no-access and sample-access outcome indistinguishability (Dwork, Kim, Reingold, Rothblum, Yona), which were shown to be essentially equivalent to multi-accuracy and multi-calibration, respectively. As a consequence, we get batch multi-accuracy and batch multi-calibration algorithms for fixed VC dimension classes with similar numbers of iterations of an approximate optimization subroutine.
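The defining constraint of smoothed boosting is the cap on how much weight the booster may place on any part of the sample. A minimal numpy sketch of that idea (capped multiplicative-weights boosting with a threshold-stump weak learner; illustrative, not the algorithm from the talk):

```python
import numpy as np

def weak_stump(X, y, w):
    """Best threshold stump under weights w: a toy weak learner."""
    best = (0, 0.0, 1, 1.0)                        # (feature, thresh, sign, err)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.sign(X[:, j] - t + 1e-12)
                err = w @ (pred != y)
                if err < best[3]:
                    best = (j, t, s, err)
    return best[:3]

def smoothed_boost(X, y, rounds=30, eta=0.5, cap=10.0):
    """Boosting with multiplicatively updated, *capped* weights.

    The cap keeps every example's weight at most `cap` times uniform, so
    no small set of outliers can accumulate inordinate weight -- the
    defining constraint of smoothed boosting.
    """
    m = len(y)
    w = np.full(m, 1.0 / m)
    hyps = []
    for _ in range(rounds):
        j, t, s = weak_stump(X, y, w)
        hyps.append((j, t, s))
        pred = s * np.sign(X[:, j] - t + 1e-12)
        w = w * np.exp(-eta * pred * y)            # upweight mistakes
        w = np.minimum(w / w.sum(), cap / m)       # smoothness cap
        w = w / w.sum()                            # (may slightly exceed cap; fine for a sketch)
    return hyps

# Toy usage: boost stumps on a linearly separable 2D problem (y in {-1, +1}).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)); y = np.sign(X[:, 0] + 0.5 * X[:, 1])
hyps = smoothed_boost(X, y)
```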
Speaker: Johannes Schmidt-Hieber
Title: Statistical learning in biological neural networks
Abstract:
Compared to artificial neural networks (ANNs), the brain learns faster, generalizes better to new situations and consumes much less energy. ANNs are motivated by the functioning of the brain, but differ in several crucial aspects. For instance, ANNs are deterministic while biological neural networks (BNNs) are stochastic. Moreover, it is biologically implausible that the learning of the brain is based on gradient descent. In this talk we look at biological neural networks as a statistical method for supervised learning and derive some first statistical risk bounds.
Speaker: Sivaraman Balakrishnan
Title: Yatracos, Birgé and Le Cam
Abstract:
A central question in robust estimation is the following: given samples from some (arbitrary) distribution and a set of candidates, select the candidate that is "closest" to the sampling distribution. I will discuss some of the history of this problem and explain its connections to robust estimation. When "closeness" is measured by the total variation distance, variants of this problem have been studied for many years. I will then discuss some new results when closeness is measured by the Hellinger distance. The emphasis of the talk will mostly be on introducing some interesting old ideas.
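For background, the classical Yatracos/Scheffé-style tournament for selecting a candidate close in total variation can be sketched in a few lines (densities discretized on a grid for simplicity; this illustrates the old ideas, not the talk's new Hellinger results):

```python
import numpy as np
from scipy import stats

def scheffe_select(candidates, samples, grid):
    """Pick the candidate distribution closest in TV to the data.

    Scheffe tournament (a sketch): for each pair (p_i, p_j), form
    A = {x : p_i(x) > p_j(x)}; the duel winner is the candidate whose
    probability of A is closer to the empirical frequency of A.
    Output the candidate winning the most duels.
    """
    dx = grid[1] - grid[0]
    dens = np.array([c.pdf(grid) for c in candidates])      # (k, len(grid))
    cell = np.clip(np.searchsorted(grid, samples), 0, len(grid) - 1)
    k = len(candidates)
    wins = np.zeros(k, dtype=int)
    for i in range(k):
        for j in range(i + 1, k):
            A = dens[i] > dens[j]                           # Scheffe set on grid
            emp = A[cell].mean()                            # empirical mass of A
            pi, pj = dens[i][A].sum() * dx, dens[j][A].sum() * dx
            wins[i if abs(pi - emp) <= abs(pj - emp) else j] += 1
    return int(np.argmax(wins))

# Example: which of three Gaussian candidates generated the data?
grid = np.linspace(-8, 8, 4001)
cands = [stats.norm(-1, 1), stats.norm(0, 1), stats.norm(2, 1)]
x = stats.norm(0, 1).rvs(size=2000, random_state=0)
print(scheffe_select(cands, x, grid))   # expect index 1
```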
Speaker: Denny Wu
Title: Learning single-index models with neural networks
Abstract:
Single-index models are given by a univariate link function applied to a one-dimensional projection of the input. Recent works have shown that the statistical complexity of learning this function class with online SGD is governed by the information exponent of the link function. In this talk, we discuss two variations of prior analyses. First, we consider the learning of single-index polynomials via SGD, but with reused training data. We show that two-layer neural networks optimized by an SGD-based algorithm can learn this target with almost linear sample complexity, regardless of the information exponent; this complexity surpasses the CSQ lower bound and matches the information-theoretic limit up to polylogarithmic factors. Next, we consider an in-context learning (ICL) setting where a nonlinear transformer is pretrained via gradient descent. We show that when the single-index target is drawn from a rank-r subspace, the in-context sample complexity of the pretrained transformer scales with the subspace dimension r but not the ambient dimension d, which outperforms estimators that only have access to the in-context data.
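For reference, the information exponent of a link function is the index of its first nonvanishing Hermite coefficient under the standard Gaussian; it can be estimated numerically. A small sketch using Gauss-Hermite quadrature (the tolerance and degree are illustrative choices):

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

def information_exponent(link, kmax=8, tol=1e-8, deg=200):
    """Smallest k >= 1 with c_k = E[g(z) He_k(z)] / k! nonzero, z ~ N(0,1).

    Expectations are computed by Gauss-Hermite quadrature for the
    probabilists' weight exp(-z^2 / 2).
    """
    z, w = He.hermegauss(deg)                  # nodes/weights, sum(w) = sqrt(2*pi)
    w = w / np.sqrt(2 * np.pi)                 # normalize to an N(0,1) expectation
    for k in range(1, kmax + 1):
        Hk = He.hermeval(z, np.eye(kmax + 1)[k])    # He_k evaluated at the nodes
        ck = np.sum(w * link(z) * Hk) / factorial(k)
        if abs(ck) > tol:
            return k
    return None

print(information_exponent(np.tanh))                # odd link: exponent 1
print(information_exponent(lambda z: z**2))         # even quadratic: exponent 2
print(information_exponent(lambda z: z**3 - 3*z))   # He_3 itself: exponent 3
```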
Speaker: Yusu Wang
Title: Size generalization of neural models via Algorithmic alignment
Abstract:
Extensive literature on classical algorithm design has led to the development of many elegant frameworks. With the recent surge in the capabilities of modern AI, a natural question arises: can we combine traditional algorithmic ideas with neural networks to create more powerful frameworks that can adapt to data? In particular, when can a neural model with bounded complexity in fact generalize to problem instances of arbitrary size? In this talk, I will present three examples of achieving such size generalization by "aligning" the neural models with algorithmic structures: (1) a mixed neural algorithmic framework for the (NP-hard) Steiner tree problem, (2) neural approximation of Wasserstein distances between point sets (discrete measures), and (3) a neural Bellman-Ford model for computing shortest paths. The last example (the neural Bellman-Ford model) shows an interesting phenomenon due to the alignment of the neural model with algorithmic structure: we can construct a set of only a constant number of small graphs, such that if the neural Bellman-Ford model (even when over-parameterized) has low loss over these graphs, then this model will provably generalize to arbitrary graphs with positive weights.
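For reference, the classical Bellman-Ford recursion that the neural model aligns with (plain Python):

```python
import math

def bellman_ford(n, edges, source=0):
    """Classical Bellman-Ford: the algorithmic scaffold the neural model
    aligns with.  Each round relaxes every edge; with positive weights,
    n - 1 rounds of the same local update suffice for any graph size --
    the kind of size-independent structure that enables generalization.
    """
    dist = [math.inf] * n
    dist[source] = 0.0
    for _ in range(n - 1):                     # n - 1 relaxation rounds
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist

# Example: 4-node graph with positive weights.
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0), (2, 3, 1.0)]
print(bellman_ford(4, edges))   # [0.0, 1.0, 3.0, 4.0]
```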
Speaker: Wei Hu
Title: Toward Demystifying and Eliminating Grokking
Abstract:
Grokking is a surprising "emergent" phenomenon in which a neural network first memorizes the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. I will talk about some recent work toward theoretically explaining the grokking phenomenon as well as finding ways to eliminate grokking. First, I will show that a dichotomy of early and late-phase implicit biases can provably induce grokking. Then, I will present a new method that leverages the embedding learned by a weaker, smaller model to eliminate grokking for a larger target model.
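For concreteness, a minimal sketch of the canonical setting in which grokking is observed (modular addition with a small network and strong weight decay, following Power et al.; the architecture and hyperparameters here are illustrative assumptions, not the talk's experiments):

```python
import torch
import torch.nn as nn

# Modular addition a + b mod p, small MLP over learned embeddings,
# heavy weight decay, long training: test accuracy stays near chance
# long after training accuracy saturates, then jumps ("grokking").
p, frac_train = 97, 0.4
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs), generator=torch.Generator().manual_seed(0))
n_tr = int(frac_train * len(pairs))
tr, te = perm[:n_tr], perm[n_tr:]

model = nn.Sequential(nn.Embedding(p, 64), nn.Flatten(),
                      nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(20_000):                     # grokking needs many steps
    opt.zero_grad()
    loss = loss_fn(model(pairs[tr]), labels[tr])
    loss.backward(); opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            acc = (model(pairs[te]).argmax(-1) == labels[te]).float().mean()
        print(step, float(loss), float(acc))   # test acc jumps long after train loss drops
```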
Speaker: Jiaqi Li (lightning talk)
Title: Asymptotics of Stochastic Gradient Descent with Dropout Regularization in Linear Models
Abstract: In this talk, we propose an asymptotic theory for online inference of the stochastic gradient descent (SGD) iterates with dropout regularization in linear regression. Specifically, we establish the geometric-moment contraction (GMC) for constant step-size SGD dropout iterates to show the existence of a unique stationary distribution of the dropout recursive function. By the GMC property, we provide quenched central limit theorems (CLT) for the difference between dropout and $\ell^2$-regularized iterates, regardless of initialization. The CLT for the difference between the Ruppert-Polyak averaged SGD (ASGD) with dropout and $\ell^2$-regularized iterates is also presented. Based on these asymptotic normality results, we further introduce an online estimator for the long-run covariance matrix of ASGD dropout to facilitate inference in a recursive manner with efficiency in computational time and memory. The numerical experiments demonstrate that for sufficiently large samples, the proposed confidence intervals for ASGD with dropout nearly achieve the nominal coverage probability. This is a joint work with Johannes Schmidt-Hieber and Wei Biao Wu.
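A toy simulation of the setting (constant step-size SGD with feature dropout in a linear model, plus the Ruppert-Polyak average; illustrative only, not the talk's estimators or proofs):

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, lr, steps = 10, 0.8, 0.01, 50_000        # q = keep-probability of dropout
beta_true = rng.normal(size=d)

beta = np.zeros(d)
avg = np.zeros(d)
for t in range(1, steps + 1):
    x = rng.normal(size=d)
    y = x @ beta_true + 0.1 * rng.normal()
    mask = rng.binomial(1, q, size=d)          # dropout mask on features
    xm = mask * x
    beta -= lr * (xm @ beta - y) * xm          # squared-loss SGD step with dropout
    avg += (beta - avg) / t                    # running Ruppert-Polyak average

# With isotropic features, the dropout fixed point coincides with beta_true;
# more generally, dropout SGD tracks an l2-regularized iterate, which is the
# comparison the CLTs in the talk are about.
print(np.linalg.norm(avg - beta_true))         # small
```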
Speaker: Christina Baek (lightning talk)
Title: Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance
Abstract: Large Language Models are instruction-finetuned to enhance their ability to follow user instructions and comprehend input context. Still, state-of-the-art models often struggle to follow the input context, especially when it is ``counterfactual'' to the model's parametric knowledge. This manifests as various failures, such as outdated or biased responses, or hallucinations where the model inserts unwarranted facts into its response. In this work, we try to understand why counterfactual context reliance remains poor even after instruction finetuning. We observe an intriguing phenomenon: during instruction finetuning, context reliance initially increases as expected, but then gradually decreases as finetuning progresses, even as performance on standard benchmarks continues to improve. We call this phenomenon ``context-parametric inversion'' and observe it on multiple general-purpose instruction tuning datasets, such as TULU, Alpaca and Ultrachat, across multiple model families like Llama, Mistral and Pythia. In a simple theoretical setup, we isolate why context reliance decreases after an initial increase along the gradient descent trajectory of instruction finetuning. We tie this phenomenon to examples in the instruction finetuning data mixture where the model can answer the query using its parametric knowledge alone, even when an input context is present. Our analysis naturally suggests mitigation strategies, which we find improve performance on real-world counterfactual examples without significant losses on standard benchmarks. By highlighting the cause of this drop in context reliance, we hope our work serves as a starting point for addressing this failure mode in a staple part of LLM training.
Speaker: Arsen Vasilyan (lightning talk)
Title: Efficient Testable Agnostic Learning for Halfspaces
Abstract: We give the first efficient algorithm for learning halfspaces in the testable agnostic learning model recently defined in [Rubinfeld, Vasilyan 23]. In this model, a learner certifies that the accuracy of its output hypothesis is near optimal whenever the training set passes an associated test, and training sets drawn from some target distribution -- e.g., the Gaussian -- must pass the test. This model is more challenging than the distribution-specific agnostic noise model where the learner is allowed to fail arbitrarily if the distributional assumption does not hold.
We consider the setting where the target distribution is Gaussian (or more generally any strongly log-concave distribution) in d dimensions and the noise model is adversarial (agnostic). Our testable agnostic learning algorithm obtains error O(opt)+ϵ in polynomial time, where opt is the smallest classification error achievable by a halfspace.
Joint work with Aravind Gollakota, Adam R. Klivans and Konstantinos Stavropoulos.
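To convey the flavor of the model, a tester might accept a training set only if its low-degree empirical moments are close to the Gaussian's. A toy degree-two sketch (the actual algorithm's test is more sophisticated, e.g., involving higher-degree moment concentration; the tolerance here is an illustrative assumption):

```python
import numpy as np

def moment_tester(X, tol=0.1):
    """Toy tester in the spirit of testable learning (not the paper's test):
    accept the training set only if its mean and covariance are close to
    those of the standard Gaussian N(0, I).
    """
    n, d = X.shape
    mean_ok = np.linalg.norm(X.mean(axis=0)) <= tol
    cov_ok = np.linalg.norm(X.T @ X / n - np.eye(d), ord=2) <= tol
    return mean_ok and cov_ok

rng = np.random.default_rng(0)
print(moment_tester(rng.normal(size=(20_000, 5))))    # True: Gaussian data passes
print(moment_tester(rng.laplace(size=(20_000, 5))))   # False: variance 2 fails
```

The certification logic is: if the test passes, the learner's accuracy guarantee must hold; if the data really is Gaussian, the test must pass. The learner may not simply fail silently when the distributional assumption is violated.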
Speaker: Andrew Ilyas
Title: Predicting and Optimizing the Behavior of Large-Scale Models
Abstract: In this talk, we study the problem of estimating (and optimizing) the counterfactual behavior of large-scale ML models. We start by focusing on “data counterfactuals,” where the goal is to estimate the effect of modifying a training dataset on the resulting machine learning outputs. We introduce a method that almost perfectly estimates such counterfactuals (under a “smoothness” condition that we show is fundamental), and also unlocks some new possibilities in the design and evaluation of ML models.
Based on several joint works with collaborators at MIT, primarily Logan Engstrom.
Speaker (lightning talk): Geelon So
Title: The online consistency of the nearest neighbor rule
Abstract: We study the 1-nearest neighbor learner for online classification in the realizable setting (R* = 0). We show it is consistent, i.e., its average mistake rate eventually vanishes, under conditions far broader than previously known.
Stated in the language of non-worst-case online learning, sequences on which 1-NN fails to learn are extremely rare---under suitable classes of measures, they have measure zero. The “suitable” non-worst-case classes we introduce bridge two recent, independent lines of research in smoothed online learning and optimistic universal learning.
Joint work with Sanjoy Dasgupta.
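A minimal sketch of the learner being analyzed (the online 1-nearest-neighbor rule in the realizable setting):

```python
import numpy as np

def online_one_nn(stream, label_of):
    """Online 1-NN: predict with the label of the closest point seen so
    far, then receive the true label (realizable setting).  Returns the
    running average mistake rate, which the talk shows vanishes for all
    but a negligible set of sequences.
    """
    seen_x, seen_y, mistakes, rates = [], [], 0, []
    for t, x in enumerate(stream, start=1):
        if seen_x:
            i = int(np.argmin([np.linalg.norm(x - z) for z in seen_x]))
            pred = seen_y[i]
        else:
            pred = 0                            # arbitrary first guess
        y = label_of(x)
        mistakes += int(pred != y)
        seen_x.append(x); seen_y.append(y)
        rates.append(mistakes / t)
    return rates

# Example: a threshold concept on uniform points in the unit square.
rng = np.random.default_rng(0)
stream = rng.uniform(size=(2000, 2))
rates = online_one_nn(stream, lambda x: int(x[0] + x[1] > 1.0))
print(rates[-1])   # average mistake rate after 2000 rounds (small)
```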
Speaker: Shuheng Zhou
Title: Concentration of measure bounds for matrix-variate data with missing values
Abstract: We consider a data perturbation model in which the covariates incur multiplicative errors. For two random matrices U and X, we denote by U \circ X the Hadamard (Schur) product, defined as (U \circ X)_{ij} = U_{ij} X_{ij}. We study a subgaussian matrix-variate model in which we observe the data through a random mask: \mathcal{X} = U \circ X, with X = B^{1/2} Z A^{1/2}, where Z is a random matrix with independent subgaussian entries and U is a mask matrix with zero or positive entries satisfying $E[U_{ij}] \in [0,1]$, all entries mutually independent. Under the assumption of independence between X and U, we introduce componentwise unbiased estimators for the covariance matrices A and B, and prove concentration of measure bounds guaranteeing that the restricted eigenvalue (RE) condition holds for the unbiased estimator of B when columns of the data matrix are sampled at different rates. We further develop multiple regression methods for estimating the inverse of B and show statistical rates of convergence. Our results provide insight into sparse recovery of relationships among entities (samples, locations, items) when features (variables, time points, user ratings) are present in the observed data matrix \mathcal{X} with heterogeneous rates. Our proof techniques can be extended to other scenarios, and we provide simulation evidence illuminating the theoretical predictions.
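A toy version of the componentwise unbiased estimation idea, in the simplified case where each column is masked independently at its own rate (illustrative; the talk's matrix-variate estimators are more refined):

```python
import numpy as np

# Rows x_i ~ N(0, A); column j of the data is observed with probability
# p_j.  Dividing each raw second moment by E[U_ij U_ik] makes the
# estimator componentwise unbiased despite heterogeneous sampling rates.
rng = np.random.default_rng(0)
n, m = 20_000, 6
A = 0.5 * np.ones((m, m)) + 0.5 * np.eye(m)        # true column covariance
X = rng.multivariate_normal(np.zeros(m), A, size=n)
p = rng.uniform(0.3, 0.9, size=m)                  # heterogeneous column rates
U = rng.binomial(1, p, size=(n, m))                # random mask
Xobs = U * X                                       # Hadamard product U o X

S = Xobs.T @ Xobs / n                              # biased raw second moments
W = np.outer(p, p)                                 # E[U_ij U_ik] = p_j p_k for j != k
np.fill_diagonal(W, p)                             # ... and p_j on the diagonal
A_hat = S / W                                      # componentwise unbiased estimate
print(np.round(np.abs(A_hat - A).max(), 3))        # small for large n
```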
Speaker: Spencer Frei
Title: Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context
Abstract: Transformers have the capacity to act as supervised learning algorithms: by properly encoding a set of labeled training ("in-context") examples and an unlabeled test example into an input sequence of vectors of the same dimension, the forward pass of the transformer can produce predictions for that unlabeled test example. A line of recent work has shown that when linear transformers are pre-trained on random instances for linear regression tasks, these trained transformers make predictions using an algorithm similar to that of ordinary least squares. In this work, we investigate the behavior of linear transformers trained on random linear classification tasks. Via an analysis of the implicit regularization of gradient descent, we characterize how many pre-training tasks and in-context examples are needed for the trained transformer to generalize well at test-time. We further show that in some settings, these trained transformers can exhibit "benign overfitting in-context": when in-context examples are corrupted by label flipping noise, the transformer memorizes all of its in-context examples (including those with noisy labels) yet still generalizes near-optimally for clean test examples. Based on joint work with Gal Vardi.
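As background for the linear-transformer setting, a stylized numpy illustration of the observation quoted above, that trained linear transformers predict with an algorithm similar to ordinary least squares (Gamma below plays the role of an idealized learned matrix; this is a textbook-style regression sketch, not the paper's classification construction):

```python
import numpy as np

# A linear-attention-style readout y_hat = x_test^T Gamma (1/n) sum_i x_i y_i.
# With Gamma = (X^T X / n)^{-1}, this is exactly ordinary least squares on
# the in-context examples, matching the prior findings the abstract cites.
rng = np.random.default_rng(0)
d, n = 5, 200
w = rng.normal(size=d)                       # task vector for this context
X = rng.normal(size=(n, d))                  # in-context (labeled) examples
y = X @ w                                    # noiseless linear labels
x_test = rng.normal(size=d)                  # unlabeled query

Gamma = np.linalg.inv(X.T @ X / n)           # idealized learned weights
y_hat = x_test @ Gamma @ (X.T @ y / n)       # attention-style prediction
print(y_hat, x_test @ w)                     # agree in the noiseless case
```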