Abstract: This work studies gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint). The first regime is near initialization, specifically until the weights have moved by O(sqrt(m)), where m denotes the network width; this is in sharp contrast to the O(1) weight motion allowed by the Neural Tangent Kernel (NTK). Here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains the NTK margin itself, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lies in extremely well-separated groups and the sample complexity scales with the number of groups; the main contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner layer weights are constrained to change in norm only and not rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast with prior work, which required infinite width and a tricky dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions that will hopefully aid future work.
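As a rough illustration of the quantities named in the first regime, the following minimal sketch trains a two-layer ReLU network with SGD and reports how far the weights have moved relative to sqrt(m), along with how much the inner-layer weights have rotated. Everything specific here is an assumption made for illustration and is not taken from the abstract: the Gaussian initialization scales, the logistic loss, the toy data, the step size, and the helper names `init_network`, `sgd_step`, and `movement_and_rotation`.

```python
import numpy as np

def init_network(m, d, rng):
    # Assumed initialization (hypothetical, not specified in the abstract):
    # Gaussian inner weights with rows of norm about 1, Gaussian outer
    # weights scaled by 1/sqrt(m).
    V = rng.standard_normal((m, d)) / np.sqrt(d)
    a = rng.standard_normal(m) / np.sqrt(m)
    return a, V

def forward(a, V, X):
    # f(x) = sum_j a_j * relu(<v_j, x>), evaluated on each row of X.
    return np.maximum(X @ V.T, 0.0) @ a

def sgd_step(a, V, x, y, lr):
    # One SGD step on the logistic loss log(1 + exp(-y f(x))) for one sample.
    pre = V @ x
    act = np.maximum(pre, 0.0)
    f = act @ a
    g = -y / (1.0 + np.exp(y * f))  # derivative of the loss w.r.t. f
    grad_a = g * act
    grad_V = g * (a * (pre > 0))[:, None] * x[None, :]
    return a - lr * grad_a, V - lr * grad_V

def movement_and_rotation(a0, V0, a, V):
    # Euclidean displacement of all weights from initialization, and the
    # smallest cosine between an inner-layer row and its initial direction.
    dist = np.sqrt(np.sum((a - a0) ** 2) + np.sum((V - V0) ** 2))
    cos = np.sum(V * V0, axis=1) / (
        np.linalg.norm(V, axis=1) * np.linalg.norm(V0, axis=1)
    )
    return dist, cos.min()

rng = np.random.default_rng(0)
m, d, n = 256, 10, 200                      # width, input dimension, samples
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0])                        # toy labels, for illustration only

a0, V0 = init_network(m, d, rng)
a, V = a0.copy(), V0.copy()
for i in range(n):                          # one pass of SGD over the data
    a, V = sgd_step(a, V, X[i], y[i], lr=0.1)

dist, min_cos = movement_and_rotation(a0, V0, a, V)
print(f"weight movement: {dist:.3f}   sqrt(m): {np.sqrt(m):.3f}")
print(f"smallest cosine to initial inner-layer direction: {min_cos:.3f}")
```

A minimum cosine close to 1 means that no inner-layer weight has rotated much from its initial direction. The sketch only measures these quantities on a toy problem and makes no claim about the margins or sample complexities established in the work.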
