Distribution-Specific Analysis of Nearest Neighbor Search and Classification

Abstract

In both statistical and algorithmic analysis, a bound f(C) in terms of instance-specific characteristics C (e.g., the data distribution) is often described as "optimal" if there exists some C* for which it happens to be tight. Yet the same bound might be extremely loose for the types of C that arise in practice.

Fortunately, for nonparametric methods (that is, algorithms or statistical estimators whose behavior is essentially local), it is frequently possible to do much better and obtain bounds f(C) that are tight for all C.

We'll discuss two such results, one statistical and one algorithmic, both relating to nearest neighbor methods.

1. What properties of the data distribution influence the statistical rate of convergence of nearest neighbor classification?

For any data distribution, we can obtain distribution-specific upper and lower bounds that are closely matching. These yield, as by-products, solutions to several open questions around nearest neighbor methods, such as a characterization of the metric measure spaces in which nearest neighbor is universally consistent. They also suggest a new notion of smoothness, different from the usual Lipschitz or Hölder conditions, that is well-suited to nearest neighbor.

2. What properties of the data distribution influence the computational complexity of finding the nearest neighbor?

Here, we can obtain a tight characterization of the performance of tree-based NN search, as a simple function of the specific data configuration. This general result is then easily specialized to common types of structure in data.
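For readers unfamiliar with tree-based NN search, the following is a minimal illustrative sketch, not the method analyzed in the talk. It assumes a k-d tree as the tree structure (the abstract does not say which trees are considered), and the names `build`, `nearest`, and `sq_dist` are hypothetical. The search also reports how many tree nodes it visited, a quantity that depends on the particular configuration of the data rather than only on its size.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point                      # splitting point stored at this node
    axis: int                         # coordinate used for the split
    left: Optional["Node"] = None     # points with smaller coordinate on `axis`
    right: Optional["Node"] = None    # points with larger coordinate on `axis`


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a k-d tree by median splits, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))


def nearest(root: Node, query: Point) -> Tuple[Point, int]:
    """Exact NN search with backtracking; returns (neighbor, nodes visited)."""
    best_point: Optional[Point] = None
    best_d2 = float("inf")
    visited = 0

    def visit(node: Optional[Node]) -> None:
        nonlocal best_point, best_d2, visited
        if node is None:
            return
        visited += 1
        d2 = sq_dist(node.point, query)
        if d2 < best_d2:
            best_point, best_d2 = node.point, d2
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)
        # Explore the far side only if the splitting plane is closer
        # than the best distance found so far.
        if diff * diff < best_d2:
            visit(far)

    visit(root)
    return best_point, visited


if __name__ == "__main__":
    random.seed(0)
    data = [(random.random(), random.random()) for _ in range(1000)]
    tree = build(data)
    neighbor, visited = nearest(tree, (0.5, 0.5))
    print(neighbor, "found after visiting", visited, "of", len(data), "nodes")
```

The number of visited nodes varies with both the query and the geometry of the data, which is the kind of instance-specific behavior that a distribution-specific analysis aims to characterize.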

We'll end with some statistical and algorithmic open problems.
