Sublinear Algorithms for Estimating Single-Linkage Clustering Costs

Abstract

Single-linkage clustering is a fundamental method for data analysis. A single-linkage clustering is obtained by iteratively merging the closest pair of clusters, where initially every data point forms its own cluster and the distance between clusters is defined by their closest pair of points. Algorithmically, one can compute a single linkage $k$-clustering (a partition into $k$ clusters) by computing a minimum spanning tree and dropping the $k-1$ most costly edges. The resulting forest spans the clusters. This motivates to define the cost $\mathrm{cost}_k$ of a single linkage $k$-clustering by the weight of the corresponding spanning forest. If we consider single linkage clustering as computing a hierarchy of clusterings, the cost of the hierarchy is defined as the sum of the individual clusterings. We assume that the distances between data points are given as a graph $G$ with average degree $d$ and edge weights from $\{1,\dots, W\}$. If there is no edge, we assume the distance to be infinite. We give a sampling based algorithm that is given access to the adjacency list representation of $G$ and that computes in $\tilde O(d\sqrt{W}/\varepsilon^3)$ time a succinct representation of estimates $\widehat{\mathrm{cost}}_k$ of the cost of every $k$-clustering such that $\sum_{i=1}^n |\widehat{\mathrm{cost}} - \mathrm{cost}_k| \le \varepsilon \mathrm{cost}(G)$, where $\mathrm{cost}(G) = \sum_{i=1}^n \mathrm{cost}_k$ is the cost of the hierarchical clustering and $1>\varepsilon >0$. Thus we can approximate the cost of every $k$-clustering upto an absolute error that \emph{on average} is a $(1+\varepsilon)$-approximation of the true cost. In particular, our result also implies that we can estimate $\mathrm{cost}(G)$ upto a factor of $1\pm \varepsilon$ in the same running time. We extend our results to the setting when edges represent similarities and the clusters are defined by a maximum spanning tree. In this setting, the running time of our algorithm is $\tilde O(dW/\varepsilon^3)$. We show that our algorithm are almost optimal by giving near matching lower bounds for estimating $\cost(G)$. We conduct extensive experiments that validate our theoretical findings. Joint work with Christian Sohler and Yi Xu.