Note: The event time listed is in Pacific Time.
Many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the $k$-means and $k$-medians objectives. In terms of negative results, we show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and any clustering based on a tree with $k$ leaves must incur an $\Omega(\log k)$ approximation factor compared to the optimal clustering. On the positive side, for two means/medians, we show that a single threshold cut can achieve a constant factor approximation, and we give nearly-matching lower bounds; for general $k \geq 2$, we design an efficient algorithm that leads to an $O(k)$ approximation to the optimal $k$-medians and an $O(k^2)$ approximation to the optimal $k$-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size. This is joint work with Sanjoy Dasgupta, Nave Frost, and Cyrus Rashtchian.
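To make the threshold-cut idea concrete, here is a minimal sketch for the two-means case: an exhaustive search over all axis-aligned threshold cuts (feature, value) that picks the cut with the lowest two-cluster $k$-means cost. This is an illustration under assumed names (`best_threshold_cut` and the toy data are not from the paper), not the authors' algorithm or its approximation analysis.

```python
import numpy as np

def best_threshold_cut(X):
    """Search all axis-aligned threshold cuts (feature i, value t) and
    return the one minimizing the two-cluster k-means cost, where each
    cluster's cost is the sum of squared distances to its mean."""
    n, d = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, cost)
    for i in range(d):
        # Candidate thresholds: every data value except the max,
        # so both sides of the cut are nonempty.
        for t in np.unique(X[:, i])[:-1]:
            left, right = X[X[:, i] <= t], X[X[:, i] > t]
            cost = (((left - left.mean(axis=0)) ** 2).sum()
                    + ((right - right.mean(axis=0)) ** 2).sum())
            if cost < best[2]:
                best = (i, t, cost)
    return best

# Toy data: two blobs separated only along feature 0, so the best
# single cut should split on that feature.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.1, (20, 2)),
               rng.normal([5.0, 0.0], 0.1, (20, 2))])
feature, threshold, cost = best_threshold_cut(X)
```

The exhaustive search costs $O(n^2 d)$ distance evaluations here; the point is only that a single interpretable rule ("feature $i \leq t$") can define the two clusters, which is the object the constant-factor guarantee in the talk is about.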
To register for this event and receive the Zoom link, please email the organizers Shai Ben-David (bendavid.shai [at] gmail.com) or Ruth Urner (ruth.urner [at] gmail.com) with the subject line "Inquiry to register for Interpretable Machine Learning event June 29, 2020".