Abstract

SVD is known to solve the inference problem for Topic Modeling when each document is pure (on a single topic). Known provable algorithms for non-pure models require strong assumptions on the data. We formulate a model with two natural assumptions: (i) each document has a dominant topic and (ii) each topic has dominant words. We provably solve the model fitting problem. Our algorithm has three main steps: Thresholding, SVD and k- means. We test both the assumptions and the performance of the algorithm on real corpora with success. Pre-SVD thresholding can be used in other contexts. For the “Planted Gaussian” problem, they offer an exponential advantage in terms of the signal-to-noise ratio.

Joint with T. Bansal and C. Bhattacharyya, Indian Institute of Science.

Video Recording