Abstract

Clustering is a fundamental technique in data analysis: the objective is to identify a set of k points (centroids) that minimizes the clustering cost, a function of the distances of data points to their nearest centroid. With very large data sets, we seek efficient constructions of small samples that act as surrogates for the full data. Existing methods that provide quality guarantees are either worst-case, and thus unable to benefit from the structure of real data sets, or make explicit assumptions about that structure. We show here how to avoid both pitfalls using adaptive algorithms, and we demonstrate experimentally large gains of our adaptive constructions over worst-case ones.
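
For concreteness, one standard instantiation of the clustering cost referred to above is the (k,z) objective; this is a minimal sketch assuming that standard form, and the paper's exact objective may differ:

\[
  \mathrm{cost}(X, C) \;=\; \sum_{x \in X} \min_{c \in C} d(x, c)^{z},
\]

where \(X\) is the data set, \(C\) is a set of \(k\) centroids, and \(d\) is the underlying distance; taking \(z = 1\) yields the k-median objective and \(z = 2\) yields k-means.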