Abstract

Deep residual networks have been widely adopted for computer vision applications because they exhibit fast training, even for very deep networks. The key difference between these networks and other feedforward networks is that a residual network with near-zero parameters computes a near-identity function, rather than a near-zero function. We consider mappings of this kind: compositions of near-identity functions. Rather than fixing a finite parameterization, such as fixed-size ReLU or sigmoid layers, we allow arbitrary Lipschitz deviations from the identity map; these might be Lipschitz-constrained sigmoid layers of unbounded width, for instance. We investigate the representational and optimization properties of these near-identity compositions. In particular, we show that a smooth bi-Lipschitz function can be represented exactly as a composition of functions that are Lipschitz-close to the identity, and that greater depth allows each factor's deviation from the identity to have a smaller Lipschitz constant. We also consider the optimization problem that arises from regression with a composition of near-identity nonlinear maps. Taking functional gradients with respect to these nonlinear layer maps, we show that any critical point of a convex objective in the near-identity region must be a global minimizer.
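As a sketch in symbols (the notation g_i, \epsilon, m, and Q below is chosen here for illustration and need not match the paper's): each layer is the identity plus an \epsilon-Lipschitz perturbation, the representation result factors a smooth bi-Lipschitz map exactly into such layers with \epsilon shrinking as the depth m grows, and, for a convex objective such as squared-error regression, a critical point of the functional gradients in this near-identity region is a global minimizer.

    % Near-identity layer: identity plus an \epsilon-Lipschitz deviation
    % (\epsilon, g_i, m, Q are illustrative symbols, not the paper's notation).
    \[ h_i(x) = x + g_i(x), \qquad \|g_i(x) - g_i(y)\| \le \epsilon\,\|x - y\| \ \ \text{for all } x, y. \]
    % Exact representation of a smooth bi-Lipschitz map h by m such layers;
    % the admissible \epsilon shrinks as the depth m grows.
    \[ h = h_m \circ h_{m-1} \circ \cdots \circ h_1, \qquad \epsilon = \epsilon(m) \to 0 \ \text{as } m \to \infty. \]
    % Optimization: for a convex objective (squared error shown as an example),
    % a critical point of the functional gradients with respect to the layer
    % maps g_i, taken where every layer remains near-identity, is a global minimum.
    \[ Q(h) = \mathbb{E}\,\| h(X) - Y \|^{2}, \qquad \nabla_{g_i} Q = 0 \ \text{for all } i \ \Longrightarrow \ Q(h) = \min Q. \]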
 
Joint work with Steve Evans and Phil Long.
