Big Data All Hands Meeting

Parent Program

Theoretical Foundations of Big Data Analysis

Location

Calvin Lab

Speaker(s)

Yuval Rabani (Hebrew University of Jerusalem)

Eric Price (University of Texas, Austin)

Date

Tuesday, Dec. 10, 2013

Time

12:30 – 1:30 p.m. PT

Back to calendar

Description

Yuval Rabani

Learning mixtures of arbitrary distributions over large discrete domains

We give an algorithm for learning a mixture of unstructured distributions. This problem arises in various unsupervised learning scenarios, for example in learning topic models from a corpus of documents spanning several topics. We show how to learn the constituents (the topic distributions and the mixture weights) of a mixture of k (constant) arbitrary distributions over a large discrete domain [n] = {1,2,...,n}, using O(n polylog n) samples. This task is information-theoretically impossible for k > 1 under the usual sampling process from a mixture distribution. However, there are situations (such as the above-mentioned topic model case) in which each sample point consists of several observations from the same mixture constituent. This number of observations, which we call the "sampling aperture", is a crucial parameter of the problem. We show that efficient learning is possible exactly at the information-theoretically least-possible aperture of 2k - 1. (Independent work by others places certain restrictions on the model, which enables learning with smaller aperture, albeit using, in general, a significantly larger sample size.)

A sequence of tools contribute to the algorithm, such as concentration results for random matrices, dimension reduction, moment estimations, and sensitivity analysis.

Joint work with Leonard Schulman and Chaitanya Swamy

Eric Price

Sparse Fourier Transforms: Optimizing Time and Sample Complexity

The Fast Fourier Transform (FFT) is a fundamental algorithm that computes the Discrete Fourier Transform of an n-dimensional signal in O(n log n) time. In applications such as image, audio, and video compression where the output is "sparse" (i.e., k << n coordinates are "large" and contain most of the energy), it is possible to estimate the large coordinates in less than O(n log n) time. This talk will describe two related algorithms, one achieving the best known time complexity of

O(k log (n/k) log n) and the other achieving the best known sample complexity of O(k log n (log log n)^c).

Based on http://arxiv.org/abs/1201.2501 (STOC 2012) and http://www.mit.edu/~ecprice/papers/fouriermeasurements.pdf (SODA 2014).

Joint work with Haitham Hassanieh, Piotr Indyk, Michael Kapralov and Dina Katabi.

All scheduled dates:

Upcoming

No Upcoming activities yet

Big Data All Hands Meeting

All scheduled dates:

Upcoming

Past