Abstract

In this talk, we discuss the first improvement in over a decade to the approximation factor for nearest neighbor search under the Earth Mover's Distance (EMD). Given two sets of s vectors A, B in high-dimensional space (R^d), the EMD between A and B is the minimum cost of a perfect matching between the vectors in A and B, where the cost of matching a pair of vectors is their Euclidean distance. EMD is a classic metric in computer science (dating back over 100 years to Jacobi's precursor of the Hungarian algorithm) and a standard distance between two sets of points. In nearest neighbor search, one is given a collection of n such sets A1,...,An and pre-processes them so that, given a query set Q, one can quickly return a set Ai whose EMD to Q is within a factor C of the EMD from Q to its nearest neighbor.
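
To make the definition concrete, here is a minimal sketch (not from the talk) that computes the exact EMD between two equal-size point sets by solving the assignment problem on the pairwise Euclidean distance matrix; the function name `emd` and the example data are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd(A: np.ndarray, B: np.ndarray) -> float:
    """Exact EMD between two equal-size sets of vectors in R^d.

    Builds the s x s matrix of Euclidean distances and solves the
    minimum-cost perfect matching (assignment) problem on it.
    """
    assert A.shape == B.shape, "A and B must contain the same number of vectors"
    cost = cdist(A, B)                        # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)  # optimal perfect matching
    return cost[rows, cols].sum()

# Example: two sets of s = 4 points in R^3 (arbitrary data).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(4, 3))
print(emd(A, B))
```

Exact computation takes polynomial time in s per pair; the point of LSH is to avoid comparing the query against all n sets.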

To date, the only algorithm with sublinear O(n^ε) query time was given by Andoni, Indyk, and Krauthgamer [AIK, SODA '08], who designed a (data-independent) locality sensitive hash (LSH) function for EMD with approximation O(log^2 s). In this work, we improve the approximation to Õ(log s) within the same query time by designing the first data-dependent LSH functions for EMD. The talk will cover the main techniques behind sublinear algorithms for EMD, including the tree embedding techniques of [AIK '08] and [Chen, Jayaram, Levi, Waingarten, STOC '22], as well as the key insights needed to glue them together into an improved LSH for EMD.
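
To give a flavor of the tree-embedding idea, here is a minimal illustrative sketch (not the construction of [AIK '08] or of this paper): impose randomly shifted grids at geometrically varying scales, count how many points of a set land in each cell, and weight each count by the cell's side length; the l1 distance between the resulting vectors coarsely tracks EMD, and embeddings of this kind are what the hash functions build on. All names and parameters below are illustrative.

```python
import numpy as np
from collections import Counter

def grid_embedding(points, shifts, scales):
    """Sketch of a randomly-shifted-grid (quadtree-style) embedding:
    at each scale, snap every point to its grid cell and record counts
    weighted by the cell side length."""
    emb = Counter()
    for scale, shift in zip(scales, shifts):
        cells = np.floor((points + shift) / scale).astype(int)
        for cell in map(tuple, cells):
            emb[(scale, cell)] += scale   # weight counts by cell size
    return emb

def l1_distance(e1, e2):
    """l1 distance between two sparse embeddings; a coarse proxy for EMD."""
    return sum(abs(e1[k] - e2[k]) for k in set(e1) | set(e2))

# Example usage with arbitrary data in R^3.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(4, 3))
scales = [2.0 ** i for i in range(-2, 3)]                    # grid side lengths
shifts = [rng.uniform(0, s, size=3) for s in scales]         # one random shift per scale
print(l1_distance(grid_embedding(A, shifts, scales),
                  grid_embedding(B, shifts, scales)))
```

Such data-independent embeddings lose polylogarithmic factors in the approximation; the data-dependent hash functions in this work are what bring the loss down to Õ(log s).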

Joint work with Erik Waingarten and Tian Zhang (STOC ‘24, https://arxiv.org/abs/2403.05041).