Faced with massive data, is it possible to trade off statistical risk and computational time? This challenge lies at the heart of large-scale machine learning. I will show in this talk that we can indeed achieve such risk-time tradeoffs by strategically summarizing the data, in the unsupervised learning problem of probabilistic k-means, i.e. vector quantization. In particular, there exist levels of summarization for which as the data size increases, the running time decreases, while a given risk is maintained. Furthermore, there exists a constructive algorithm that provably finds such tradeoff levels. The summarization in question is based on coreset constructions from computational geometry. I will also show that these tradeoffs exist and may be harnessed for a wide range of real data. This adds data summarization to the list of methods, including stochastic optimization, that allow us to perceive data as a resource rather than an impediment.

Video Recording