Diving into the shallows: a computational perspective on large-scale shallow learning
The remarkable recent success of deep neural networks has not been easy to analyze theoretically. It has been particularly hard to disentangle the relative significance of architecture and optimization in achieving accurate classification on large datasets. Rather than attacking this issue directly, it may be just as useful to understand the limits of what is achievable with shallow learning. It turns out that the smoothness of kernels typically used in shallow learning presents an obstacle to effective scaling to large datasets with iterative methods. While shallow kernel methods have the theoretical capacity to approximate arbitrary functions, typical scalable algorithms, such as gradient descent (GD) and stochastic gradient descent (SGD), drastically restrict the space of functions that can be reached on a fixed computational budget. The issue is purely computational and persists even when infinite data are available.
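The computational obstacle can be illustrated numerically. The sketch below (a toy construction, not from the talk; the sample size, bandwidth, and iteration budget are arbitrary choices) builds a smooth Gaussian kernel matrix and counts how many eigendirections gradient descent with the standard step size 1/λ₁ can substantially fit within a fixed number of iterations; the fast spectral decay of a smooth kernel leaves most directions essentially untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-1, 1, size=(n, 1))

# Smooth (Gaussian) kernel matrix; the bandwidth 0.2 is an arbitrary choice.
K = np.exp(-((X - X.T) ** 2) / (2 * 0.2**2))

lam = np.linalg.eigvalsh(K)[::-1]  # eigenvalues in descending order

# GD on the kernel system converges along the i-th eigendirection at rate
# (1 - eta * lam_i)^t, with step size eta <= 1/lam_1 required for stability.
eta = 1.0 / lam[0]
t = 1000  # the fixed computational budget (number of iterations)
residual = (1 - eta * lam) ** t

# A direction is substantially fit in t steps only if lam_i > lam_1 / t.
reachable = np.sum(lam > lam[0] / t)
untouched = np.mean(residual > 0.5)
print(f"eigendirections reachable in {t} GD steps: {reachable} of {n}")
print(f"fraction of directions with >50% residual: {untouched:.2f}")
```

More data does not help: adding points refines the small-eigenvalue tail of the spectrum, which is exactly the part GD cannot reach on a fixed budget.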
To address this issue, we introduce EigenPro, a preconditioning technique based on direct (approximate) modification of the kernel spectrum. Despite the simplicity of this approach, EigenPro matched or beat a number of state-of-the-art classification results from the kernel literature (which typically required large computational resources) with only a few hours of GPU time. I will continue with some speculation on the architecture/optimization dichotomy in deep and shallow networks. Finally, I will argue that optimization analysis concentrating on optimality within a fixed computational budget may be more reflective of the realities of machine learning than the more traditional approaches based on the rates/complexity of achieving a given error.
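To make the idea of spectral modification concrete, here is a minimal toy sketch (my own illustration, not the talk's implementation: the problem size, bandwidth, number of modified directions k, and exact eigendecomposition are all simplifying assumptions; the actual method uses randomized, approximate top eigenvectors). The preconditioner shrinks the top-k eigendirections so that a step size of order 1/λ_{k+1}, rather than 1/λ₁, becomes stable, accelerating convergence along the remaining directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 300, 20  # sample size and number of modified eigendirections (illustrative)
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(4 * np.pi * X[:, 0])  # a target with high-frequency content

K = np.exp(-((X - X.T) ** 2) / (2 * 0.2**2))  # smooth Gaussian kernel

lam, E = np.linalg.eigh(K)
lam, E = lam[::-1], E[:, ::-1]  # descending eigenvalue order

def iterate(K, y, eta, precond, t):
    """Run t preconditioned Richardson/GD steps on the system K a = y."""
    a = np.zeros_like(y)
    for _ in range(t):
        a = a - eta * precond(K @ a - y)
    return a

identity = lambda g: g

# Spectral-modification preconditioner: damp the top-k eigendirections so
# that every eigenvalue of the preconditioned system is at most lam_{k+1}.
Ek, lam_top = E[:, :k], lam[:k]
def spectral_precond(g):
    c = Ek.T @ g
    return g - Ek @ ((1 - lam[k] / lam_top) * c)

t = 200
err_gd = np.linalg.norm(K @ iterate(K, y, 1 / lam[0], identity, t) - y)
err_ep = np.linalg.norm(K @ iterate(K, y, 1 / lam[k], spectral_precond, t) - y)
print(err_gd, err_ep)  # the preconditioned run leaves a smaller residual
```

Per eigendirection, the plain run contracts the residual by (1 - λᵢ/λ₁) per step while the preconditioned run contracts it by (1 - λᵢ/λ_{k+1}), so the preconditioned residual is never larger, and for fast-decaying spectra it is dramatically smaller on the same budget.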
Based on joint work with Siyuan Ma.