# Semi-Random Models and Robustness in Unsupervised Learning

Aravindan Vijayaraghavan (Northwestern University)

Probabilistic models are widely used in statistical inference, and for understanding the computational tractability of several unsupervised learning tasks. However, a common criticism of most existing learning algorithms is that their guarantees strongly rely on the unrealistic assumption that the instance is generated exactly from the model. Semi-random models provide a framework to reason about robustness of algorithms to modeling errors by incorporating both adversarial and random choices in instance generation. In this talk, I will describe new semi-random models for two different problems in unsupervised learning: dictionary learning, and clustering mixtures of Gaussians. In dictionary learning the task is to learn a hidden incoherent basis/dictionary A given data Y=AX that represents unknown sparse linear combinations (corresponding to columns of X) of the basis elements. Existing algorithms learn A and X efficiently, assuming strong distributional assumptions about both the supports of the columns of X and the values of these non-zero entries. In the first part of the talk, I will describe a more general semi-random model for dictionary learning aimed at capturing almost arbitrary distributions for the sparse supports, and give a new polynomial time algorithm for learning incoherent over-complete dictionaries under the semi-random model. Finally (time permitting), I will describe a natural semi-random model for k-means clustering that generalizes mixtures of k Gaussians, and demonstrate how the Lloyds algorithm successfully recovers the ground-truth clustering up to near optimal accuracy. Based on joint works with Pranjal Awasthi.