Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low stable rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
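The two quantities the abstract compares are easy to compute from singular values. A minimal numpy sketch (not the paper's code; the function names and the random test matrices are illustrative assumptions, and the comparison direction follows the abstract's claim that spectral updates are predicted to help when the gradient's squared nuclear-to-Frobenius ratio exceeds the activations' stable rank):

```python
import numpy as np

def stable_rank(A):
    """Stable rank: squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

def layerwise_condition(grad, acts):
    """Return ((||G||_* / ||G||_F)^2, stable_rank(acts)) for one parameter block.

    Per the abstract, the first quantity exceeding the second predicts that a
    spectral update decreases the loss more than a Euclidean gradient step.
    """
    s = np.linalg.svd(grad, compute_uv=False)
    nuc_frob_sq = s.sum() ** 2 / (s ** 2).sum()
    return nuc_frob_sq, stable_rank(acts)

# Illustrative setup: low-stable-rank activations (rank-4 product) and a
# dense Gaussian gradient whose singular values are all of similar size.
rng = np.random.default_rng(0)
acts = rng.standard_normal((256, 4)) @ rng.standard_normal((4, 128))
grad = rng.standard_normal((128, 64))
ratio, srank = layerwise_condition(grad, acts)
print(ratio > srank)  # spectral update predicted to win in this regime
```

Note that both quantities lie between 1 and the matrix rank, so the condition is a genuine competition between the spectral concentration of the gradient and the spectral spread of the activations.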
Many generative models—autoregressive models and denoising diffusion models in particular—are thought to be inherently sequential, requiring $\Omega(n)$ time to sample, where $n$ is the size of the output. I'll show that both can be parallelized to take $\tilde{O}(n^{1/2})$ parallel time under very mild assumptions. The main idea is a new method we call speculative rejection sampling, which builds a speculative distribution from the model's own oracle and validates entire sequences in parallel. This improves the best known bounds for autoregressive sampling and gives the first provable parallel speedup for diffusion models in the high-accuracy regime.
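For intuition, the "draft a block, validate in parallel" pattern can be sketched with the standard speculative-sampling accept/reject rule. This is a generic illustration under assumed toy oracles (`target_probs`, `draft_probs` are hypothetical), not the speculative rejection sampling algorithm from the talk; the key point it shows is that the target-model evaluations at all drafted positions are independent given the drafted sequence, so they can run in parallel:

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB = 4

def target_probs(prefix):
    """Hypothetical expensive target oracle: next-token distribution."""
    h = (hash(tuple(int(t) for t in prefix)) % 97) / 97.0
    p = np.array([h, 1.0 - h, 0.5, 0.5]) + 0.1
    return p / p.sum()

def draft_probs(prefix):
    """Hypothetical cheap draft distribution used to speculate a block."""
    return np.full(VOCAB, 1.0 / VOCAB)

def speculative_block(prefix, k=8):
    """Draft k tokens, then validate them with the accept/reject rule.

    The k target_probs calls in the validation loop depend only on the
    already-drafted sequence, so in a real system they run in parallel;
    only the accept/reject bookkeeping is sequential.
    """
    spec = list(prefix)
    for _ in range(k):  # cheap sequential drafting
        spec.append(int(rng.choice(VOCAB, p=draft_probs(spec))))
    out = list(prefix)
    for i in range(len(prefix), len(spec)):
        p, q = target_probs(spec[:i]), draft_probs(spec[:i])
        t = spec[i]
        if rng.random() < min(1.0, p[t] / q[t]):
            out.append(t)  # accepted: distribution matches the target exactly
        else:
            resid = np.maximum(p - q, 0.0)  # resample from the residual
            probs = resid / resid.sum() if resid.sum() > 0 else p
            out.append(int(rng.choice(VOCAB, p=probs)))
            break  # tokens after a rejection are discarded
    return out

seq = [0]
while len(seq) < 16:
    seq = speculative_block(seq)
```

Each call extends the sequence by at least one token, and a run of accepted drafts extends it by up to k target evaluations' worth of progress in one parallel round.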
João is currently a postdoctoral researcher at the Alfréd Rényi Institute of Mathematics, Budapest. His main research interests are communication complexity, query complexity, Boolean analysis, quantum algorithms, and quantum finance.