Abstract
The recent dramatic success of Deep Neural Networks (DNNs) in many applications highlights the statistical benefits of marrying near-nonparametric models with large datasets, using efficient optimization algorithms running in distributed computing environments. In the 1990s, kernel methods became the toolset of choice for a wide variety of machine learning problems, replacing neural nets in many settings, thanks to their theoretical appeal and algorithmic roots in convex optimization. So what changed between then and the modern deep learning revolution? Perhaps the advent of "big data", the notion of "depth", or better DNN training algorithms, or all of the above. Or perhaps the development of kernel methods has simply lagged behind in scalable training techniques, effective mechanisms for kernel learning, and parallel implementations.
I will describe new efforts to resolve the scalability challenges of kernel methods, for both scalar and multivariate prediction settings, allowing them to be trained on big data using a combination of randomized data embeddings, Quasi-Monte Carlo (QMC) acceleration, distributed convex optimization, and input-output kernel learning. I will report that on classic speech recognition and computer vision datasets, randomized kernel methods and deep neural networks turn out to have essentially identical performance. Curiously, in the process, randomized kernel methods begin to look a bit like neural networks, but with a clear mathematical basis for their architecture. Conversely, invariant kernel learning and matrix-valued kernels may offer a new way to construct deeper architectures. This talk will describe recent research results and personal perspectives on points of synergy between these fields.
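To make the randomized-embedding idea concrete, the sketch below approximates a Gaussian kernel with random Fourier features and fits a ridge regressor in the embedded space. This is only a minimal illustration of the general technique, not the exact pipeline of the talk; the function name `rff_features`, the dimensions, and the hyperparameters are illustrative assumptions, and the QMC variant is only indicated in a comment.

```python
# Minimal sketch: random Fourier features approximating a Gaussian (RBF) kernel,
# followed by ridge regression in the randomized feature space.
# All names, sizes, and hyperparameters here are illustrative assumptions.
import numpy as np

def rff_features(X, num_features=512, gamma=1.0, seed=None):
    """Map X (n x d) to a num_features-dimensional randomized embedding whose
    inner products approximate the kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Sample spectral frequencies of the Gaussian kernel; a Quasi-Monte Carlo
    # variant would replace this i.i.d. draw with a low-discrepancy sequence.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

# Toy usage: ridge regression on the randomized embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1000)

Z = rff_features(X, num_features=512, gamma=0.5, seed=0)
lam = 1e-2  # ridge regularization strength
w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
print("train MSE:", np.mean((Z @ w - y) ** 2))
```

The resulting model is a one-hidden-layer network with random cosine units and a learned linear output layer, which is one way to see why randomized kernel methods "begin to look a bit like neural networks" while retaining an explicit mathematical justification.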