Abstract

Normalization methods such as batch normalization and its variants are commonly used in overparametrized models like neural networks for their beneficial effects on training. Yet, their theoretical understanding is only emerging. Here, we study a variant of normalization in overparametrized least squares regression, where we reparametrize the weights with a scale $g$ and a unit vector such that the objective function becomes \emph{non-convex}. We show theoretically that, for initializations \emph{not necessarily close to zero}, this algorithm adaptively regularizes the weights and converges to the minimum $\ell_2$ norm solution (or close to it) \emph{with sufficiently small learning rates for $g$}. We verify this experimentally and suggest that there may be a similar phenomenon for nonlinear problems such as matrix sensing.
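
To fix ideas, here is a minimal illustrative sketch of the setup described above; the symbols $X$, $y$, $u$, and $w^{\star}$ are assumed notation for this sketch and are not taken from the paper. With a data matrix $X \in \mathbb{R}^{n \times d}$ in the overparametrized regime $d > n$ (and with full row rank), the reparametrized least squares objective and the minimum $\ell_2$ norm interpolant may be written as
\[
  \min_{g \in \mathbb{R},\; \|u\|_2 = 1} \; L(g, u) \;=\; \tfrac{1}{2}\,\bigl\| X (g\,u) - y \bigr\|_2^2,
\]
\[
  w^{\star} \;=\; X^{\top} \bigl( X X^{\top} \bigr)^{-1} y ,
\]
so the claim is that gradient descent on $(g, u)$, with a sufficiently small learning rate for $g$, reaches a solution $g\,u$ at or near $w^{\star}$.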