Results 501 - 510 of 23762

Workshop Talk | Jan. 21, 2026

Talk By

Abstract not available.

Workshop Talk | Jan. 21, 2026

What Makes Collaborative Learning Worth It?

The era of “free” or loosely regulated data is drawing to a close as copyright constraints and user-consent requirements tighten, legal challenges to large models grow, and data-driven competitive advantages become more pronounced. This shift makes collaborative and federated learning systems central, since these protocols can reduce harmful data silos in healthcare, science, and finance. Yet deployments often fail because systems hit hard efficiency limits, models lose reliability under heterogeneity, and incentives fail to sustain high-quality participation under acute information asymmetry. This talk asks a simple question: what makes collaborative learning worth it in such settings? I will describe how my work tackles these barriers through three lenses: (i) efficiency, via communication-efficient distributed optimization; (ii) reliability and safety, via privacy-preserving personalization; and (iii) sustainability, via mechanism design that improves the truthfulness, quality, and quantity of client contributions. I will close with open questions that I hope will spark discussion during the program.
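As a point of reference for the collaborative and federated setting the abstract describes, here is a minimal federated-averaging-style (FedAvg) sketch. It is purely illustrative and not drawn from the speaker's work; the synthetic least-squares clients, step counts, and all variable names are assumptions.

```python
# Minimal, illustrative FedAvg-style loop: each sampled client takes a few local
# gradient steps on its own least-squares objective, and the server averages the
# returned models weighted by client dataset size. Synthetic data and all names
# are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d, n_clients = 5, 8
w_true = rng.normal(size=d)

# Heterogeneous clients: different sample sizes and feature scales (data silos).
clients = []
for _ in range(n_clients):
    n = int(rng.integers(20, 100))
    X = rng.normal(scale=rng.uniform(0.5, 2.0), size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    clients.append((X, y))

def local_update(w, X, y, lr=0.05, steps=10):
    """A few local gradient-descent steps on this client's mean-squared error."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

w_global = np.zeros(d)
for _ in range(50):                                   # communication rounds
    sampled = rng.choice(n_clients, size=4, replace=False)
    updates, sizes = [], []
    for i in sampled:
        X, y = clients[i]
        updates.append(local_update(w_global.copy(), X, y))
        sizes.append(len(y))
    # Server aggregation: dataset-size-weighted average of client models.
    w_global = np.average(np.stack(updates), axis=0, weights=np.array(sizes, float))

print("distance to ground-truth weights:", np.linalg.norm(w_global - w_true))
```

The abstract's three lenses correspond to tightening exactly this loop: reducing or compressing the communication rounds, personalizing the local models under privacy constraints, and designing incentives so that truthful, high-quality client participation is sustained.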

Workshop | January 21, 2026, 2:00 pm – 4:30 pm
Meet the Fellows Talks Spring 2026

A welcome event for all new Simons fellows to introduce them to the Simons Institute community. All new fellows will present a 10-minute talk followed by 5 minutes for Q&A with the aim of making introductions to each other, program participants, and the...

Workshop Talk | Jan. 21, 2026

Talk By

Abstract not available.

Workshop Talk | Jan. 21, 2026

Talk By

Abstract not available.

Workshop Talk | Jan. 21, 2026

Talk By

Abstract not available.

Workshop Talk | Jan. 21, 2026

Talk By

Abstract not available.

Workshop Talk | Jan. 21, 2026

Deep sequence models tend to memorize geometrically; it is unclear why.

In sequence modeling, the parametric memory of atomic facts has been predominantly abstracted as a brute-force lookup of co-occurrences between entities. We will contrast this associative view against a geometric view of how memory is stored. We begin by isolating a clean and analyzable instance of Transformer reasoning that is incompatible with memory as strictly a store of the local co-occurrences specified during training. Instead, the model must have somehow synthesized its own geometry of atomic facts, encoding global relationships between all entities, including non-co-occurring ones. This in turn has simplified a hard reasoning task involving an ℓ-fold composition into an easy-to-learn 1-step geometric task. From this phenomenon, we will extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, despite optimizing over mere local associations, cannot be straightforwardly attributed to typical architectural or optimization pressures. Counterintuitively, an elegant geometry is learned even when it is not more succinct than a brute-force lookup of associations. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that, in contrast to prevailing theories, indeed arises naturally despite the lack of various pressures. This analysis also points practitioners to visible headroom for making Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.
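To make the contrast concrete, here is a small toy sketch (our own illustration, not from the talk): entities co-occur only with their neighbors on a cycle, yet a spectral embedding of that co-occurrence graph yields a global geometry in which distances between never-co-occurring entities track graph distance. The graph, the eigenvector choice, and all names are assumptions.

```python
# Toy illustration (not from the talk) of the "geometric" view: entities are nodes on a
# cycle, training only ever pairs adjacent nodes (local co-occurrences), yet a spectral
# embedding of that co-occurrence graph places all nodes on a circle, so distances
# between non-co-occurring entities are meaningful. Construction details are assumptions.
import numpy as np

n = 12                                   # number of entities
A = np.zeros((n, n))
for i in range(n):                       # local co-occurrences only: i with i±1
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

# Symmetric-normalized adjacency; its leading non-trivial eigenvectors give the embedding.
deg = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(deg, deg))
vals, vecs = np.linalg.eigh(A_norm)
emb = vecs[:, -3:-1]                     # two leading non-trivial eigenvectors (2-D geometry)

# Non-adjacent (never co-occurring) pairs: embedding distance grows with graph distance,
# i.e. the geometry encodes global relations beyond the stored local associations.
for hops in (1, 3, 6):
    dist = np.linalg.norm(emb[0] - emb[hops])
    print(f"{hops}-hop pair: embedding distance {dist:.3f}")
```

In this cycle example the only stored associations are the 1-hop edges, yet the learned coordinates make ℓ-hop relations directly readable as distances, which is the essence of the composition-to-geometry simplification the abstract describes.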

Workshop Talk | Jan. 20, 2026

Talk By

Abstract not available.

Workshop Talk | Jan. 20, 2026

Training Dynamics of Softmax Self-Attention: Global Convergence and Neural Scaling Laws

We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and obtain the first mathematically rigorous derivation of a neural scaling law for softmax self-attention. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a matrix factorization problem. Second, we exploit this connection to design a tuned variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our new optimization algorithm features several innovations over standard gradient descent, including a preconditioner and regularizer which help avoid spurious stationary points, and a specific data-dependent initialization point which lies near the manifold of global minima with high probability. We show that when our algorithm is run on the empirical loss, it identifies parameters which are globally optimal for the population loss, up to a small additive error which quickly tends to zero as more data and compute are used to train the model. Remarkably, we show that self-attention is able to match the minimax-optimal statistical rate achieved by the ordinary least-squares estimator, despite the nonconvexity of the loss in the model parameters. Additionally, our new algorithm attains a fast geometric convergence rate instead of the slow power law rate which is empirically observed using standard gradient descent with random initialization.
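For orientation, the sketch below sets up the kind of task the abstract studies: a single softmax self-attention layer trained on in-context linear regression, here with plain gradient descent from random initialization (the slow baseline the abstract contrasts against), not the tuned preconditioned algorithm it introduces. The token layout, dimensions, and hyperparameters are assumptions.

```python
# Minimal sketch of the setup (assumptions throughout, not the talk's tuned algorithm):
# one softmax self-attention head trained by plain gradient descent to do in-context
# linear regression. Each sequence carries (x, y) example tokens and a query token
# whose label is hidden; the attention output at the query position is the prediction.
import torch

torch.manual_seed(0)
d, n_ctx, batch, dim = 4, 16, 64, 5          # dim = d + 1 (features plus a label slot)

def make_batch():
    """Each task draws its own regression vector w; the query token's label is zeroed."""
    w = torch.randn(batch, d, 1)
    X = torch.randn(batch, n_ctx + 1, d)
    y = (X @ w).squeeze(-1)                   # (batch, n_ctx + 1)
    tokens = torch.cat([X, y.unsqueeze(-1)], dim=-1)
    tokens[:, -1, -1] = 0.0                   # hide the query's label
    return tokens, y[:, -1]                   # target is y at the query position

# One attention head plus a linear readout, all trained with vanilla SGD.
Wq = (0.1 * torch.randn(dim, dim)).requires_grad_()
Wk = (0.1 * torch.randn(dim, dim)).requires_grad_()
Wv = (0.1 * torch.randn(dim, dim)).requires_grad_()
w_out = (0.1 * torch.randn(dim, 1)).requires_grad_()

def predict(tokens):
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.transpose(-2, -1) / dim ** 0.5
    out = torch.softmax(scores, dim=-1) @ V   # (batch, n_ctx + 1, dim)
    return (out[:, -1, :] @ w_out).squeeze(-1)

opt = torch.optim.SGD([Wq, Wk, Wv, w_out], lr=0.05)
for step in range(2001):
    tokens, target = make_batch()
    loss = torch.mean((predict(tokens) - target) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  mse {loss.item():.3f}")
```

Plain gradient descent from random initialization, as in this sketch, is exactly the baseline whose slow empirical power-law rate the abstract contrasts with the fast geometric convergence of its preconditioned, carefully initialized variant.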
