Abstract

State-of-the-art performance in deep learning is usually achieved through a series of modifications to existing neural architectures and their training procedures. A common feature of these networks is their large scale: modern neural networks often have billions, if not hundreds of billions, of trainable parameters. While empirical evidence generally supports the claim that increasing the scale of neural networks (width, depth, etc.) improves model performance when done correctly, optimizing the training process across different scales remains a significant challenge, and practitioners tend to rely on extrapolated scaling rules.
In this talk, I will present a theoretical framework for efficient learning at large scale. The framework allows us to derive efficient learning rules that automatically adjust to model scale, ensuring stability and optimal performance. The results offer new insights into the fundamental principles governing neural network training and provide practical guidelines for training these models efficiently.
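As a rough illustration of what a scale-adjusted learning rule can look like, the sketch below scales per-layer Adam learning rates inversely with fan-in so that update sizes stay comparable as the hidden width grows. This is a simplified example in the spirit of width-aware parameterizations such as µP, not the specific framework presented in the talk; the names base_lr and base_width are illustrative.

    # Simplified, width-aware learning-rate rule (illustrative sketch only):
    # matrix-like ("hidden") weights get a learning rate proportional to
    # base_width / fan_in, so their updates stay roughly constant as the
    # model is widened; vectors (biases) keep the base learning rate.
    import torch
    import torch.nn as nn

    def width_scaled_param_groups(model, base_lr, base_width):
        groups = []
        for _, p in model.named_parameters():
            if p.ndim >= 2:                      # weight matrices
                fan_in = p.shape[1]
                groups.append({"params": [p], "lr": base_lr * base_width / fan_in})
            else:                                # biases and other vectors
                groups.append({"params": [p], "lr": base_lr})
        return groups

    width = 4096                                 # widened model
    model = nn.Sequential(nn.Linear(512, width), nn.ReLU(), nn.Linear(width, 10))
    optimizer = torch.optim.Adam(width_scaled_param_groups(model, base_lr=1e-3, base_width=512))

In a complete parameterization, the scaling exponents also depend on the layer's role (input, hidden, output) and on the optimizer; pinning down such details is precisely what a principled framework of this kind provides.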