Abstract
Probabilistic models are widely used in statistical inference, and for understanding the computational tractability of several unsupervised learning tasks. However, a common criticism of most existing learning algorithms is that their guarantees strongly rely on the unrealistic assumption that the instance is generated exactly from the model. Semi-random models provide a framework to reason about robustness of algorithms to modeling errors by incorporating both adversarial and random choices in instance generation. In this talk, I will describe new semi-random models for two different problems in unsupervised learning: dictionary learning, and clustering mixtures of Gaussians. In dictionary learning the task is to learn a hidden incoherent basis/dictionary A given data Y=AX that represents unknown sparse linear combinations (corresponding to columns of X) of the basis elements. Existing algorithms learn A and X efficiently, assuming strong distributional assumptions about both the supports of the columns of X and the values of these non-zero entries. In the first part of the talk, I will describe a more general semi-random model for dictionary learning aimed at capturing almost arbitrary distributions for the sparse supports, and give a new polynomial time algorithm for learning incoherent over-complete dictionaries under the semi-random model. Finally (time permitting), I will describe a natural semi-random model for k-means clustering that generalizes mixtures of k Gaussians, and demonstrate how the Lloyds algorithm successfully recovers the ground-truth clustering up to near optimal accuracy. Based on joint works with Pranjal Awasthi.