Abstract

We consider the problem of clustering unlabelled data using a crowd of non-experts. Take, for example, a database of dog (or bird) images that we would like to cluster into breeds (or species) with the collective help of workers who are not experts in dog breeds (or bird species). Questions of interest include which queries to ask, how many, and which algorithms to use to reliably cluster the data. We focus on simple comparison queries (on pairs, triples, and perhaps more items) and on convex optimization clustering algorithms that attempt to decompose a partially observed adjacency matrix into low-rank and sparse components. In this setting, we obtain necessary and sufficient conditions for the success of the clustering algorithm as a function of the number of queries, the number of data points, the number of clusters, the number of outliers, the size of the smallest cluster, and the quality of the workers. We contrast this with information-theoretic lower bounds on the required number of queries, obtained by viewing the crowdsourced clustering problem as one of joint source-channel coding. A key feature of the approach is to quantify the cost of a query by its entropy. We also present various empirical results to put the theoretical results into perspective.
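To make the low-rank-plus-sparse decomposition concrete, the sketch below sets up the convex program on synthetic data. This is an illustration, not the authors' implementation: the synthetic data, the noise rate, the regularization weight `lam`, and the final thresholding step are all assumptions chosen for the example, and `cvxpy` is used simply as a convenient modeling tool.

```python
# Minimal sketch: decompose a partially observed worker-answer matrix
# into a low-rank part L (ideal cluster block structure) and a sparse
# part S (worker errors). All parameters below are illustrative.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)

# Synthetic ground truth: n = 30 points in 3 equal clusters.
n, k = 30, 3
labels = np.repeat(np.arange(k), n // k)
A_true = (labels[:, None] == labels[None, :]).astype(float)

# Observe a random subset of pairwise answers; flip a fraction of them
# to model unreliable (non-expert) workers.
observed = (rng.random((n, n)) < 0.5).astype(float)
noise = rng.random((n, n)) < 0.1
A_obs = np.where(noise, 1.0 - A_true, A_true)

# Convex program:  minimize ||L||_* + lam * ||S||_1
# subject to L + S matching the observations on observed entries,
# with 0 <= L <= 1.
L = cp.Variable((n, n))
S = cp.Variable((n, n))
lam = 1.0 / np.sqrt(n)  # heuristic weight; an assumption, not from the paper
objective = cp.Minimize(cp.normNuc(L) + lam * cp.sum(cp.abs(S)))
constraints = [
    cp.multiply(observed, L + S) == cp.multiply(observed, A_obs),
    L >= 0,
    L <= 1,
]
cp.Problem(objective, constraints).solve()

# Threshold the recovered low-rank part to read off cluster blocks.
A_hat = (L.value > 0.5).astype(int)
print("entries recovered correctly:", np.mean(A_hat == A_true))
```

When recovery succeeds, the thresholded matrix `A_hat` equals the true block-diagonal adjacency matrix (up to permutation), so the clusters can be read off as its connected components.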
