Abstract

We initiate the study of coresets for clustering in graph metrics, i.e., the shortest-path metric of edge-weighted graphs. Such clustering problems (on graph metrics) are essential to data analysis and used for example in road networks and data visualization. Specifically, we consider k-Median clustering, where given a metric space (V, d), the goal is to find a k-point center set C that minimizes the objective sum_{x in V} d(x, C). A coreset is a compact summary of the data that (1+epsilon)-approximates the objective for every possible center set C. Coresets offer significant efficiency improvements in terms of running time, storage, and communication, including in streaming and distributed settings. I will show that every graph G of treewidth tw(G) has a coreset for k-Median whose size is O(tw(G)) (omitting dependence on k and epsilon). Moreover, the coreset can be constructed in near-linear time. This construction is based on the framework of Feldman and Langberg [STOC 2011], and our main technical contribution, as required by this framework, is a uniform bound of O(tw(G)) on the shattering dimension under any point weights. Previously, the only coreset construction applicable to graph metrics was a generic one, of size O(log n) for n=|V| [Feldman and Langberg, STOC 2011]. We further show a matching lower bound of Omega(tw(G)) on the coreset size. Thisis actually the first proof that the O(log n) factor in the generic upper bound is indeed necessary, and also justifies restricting the graph's topology. Joint work with Vladimir Braverman, Lingxiao Huang, Shaofeng H.-C. Jiang, and Xuan Wu.