Section: Scientific Foundations
Nearest neighbor estimates
In pattern recognition and statistical learning, nearest neighbor
algorithms are among the simplest available. They are nevertheless
very powerful, and since the pioneering works of Fix and Hodges [41],
[42] they have generated a large body of literature and developments.
Basically, given a training set of data, i.e. an N-sample of i.i.d. object–feature pairs (Xi, Yi), for i = 1, ..., N, with real-valued features, we want to be able to generalize, that is to guess the feature Y associated with any new object X drawn with the same probability distribution as the Xi's. To achieve this, one chooses some integer k smaller than N, and takes the mean value of the k features associated with the k objects that are nearest to the new object X, for some given metric.
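For concreteness, here is a minimal sketch of this estimate in Python, assuming the Euclidean distance as the metric and NumPy for the computations (neither of which is specified in the text):

```python
import numpy as np

def knn_estimate(X_train, Y_train, x_new, k):
    """Plain k-nearest-neighbor estimate: mean of the k features whose
    objects are closest to x_new (Euclidean metric assumed here)."""
    # distances from the new object to every training object
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k nearest training objects
    nearest = np.argsort(dists)[:k]
    # mean value of the associated features
    return Y_train[nearest].mean()

# toy usage: N = 200 i.i.d. object-feature pairs in dimension d = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X.sum(axis=1) + 0.1 * rng.normal(size=200)  # hypothetical regression plus noise
print(knn_estimate(X, Y, x_new=np.zeros(3), k=10))
```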
From the beginning it was clear that, however simple, this method is very powerful.
In general, there is no way to guess the value of Y exactly, and the minimal error that can be achieved is that of the Bayes estimate, which cannot be computed for lack of knowledge of the distribution of the pair (X, Y); the Bayes estimate nevertheless serves to characterize the strength of the method. So the best we can hope for is that our estimate converges to the Bayes estimate as the sample size grows. This is what was proved in great generality by Stone [71] for mean-square convergence, provided that X is a d-dimensional vector, Y is square-integrable, k goes to infinity and the ratio k/N goes to 0.
The nearest neighbor estimate is not the only local averaging estimate with this property, but it is arguably the simplest.
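In formulas (a standard formulation consistent with the conditions quoted above; the notation is ours), the Bayes estimate is the regression function and the k-nearest neighbor estimate is the corresponding local average:

```latex
% Regression function (Bayes estimate) and k-nearest neighbor estimate
\[
  \eta(x) = \mathbb{E}\bigl[\, Y \mid X = x \,\bigr],
  \qquad
  \eta_N(x) = \frac{1}{k} \sum_{i=1}^{k} Y_{(i)}(x),
\]
% where Y_{(i)}(x) denotes the feature attached to the i-th nearest neighbor
% of x among X_1, ..., X_N.  If Y is square-integrable, k -> infinity and
% k/N -> 0 as N -> infinity, then Stone's theorem gives
\[
  \mathbb{E}\bigl[\, \lvert \eta_N(X) - \eta(X) \rvert^{2} \,\bigr]
  \;\longrightarrow\; 0 .
\]
```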
The situation is radically different in general infinite dimensional
spaces. In this respect, Cérou and Guyader [3] present
counterexamples indicating that the estimate is not consistent,
and they argue that restrictions on the state space and on the distribution of (X, Y) cannot be dispensed with. First of all, the state space must be separable for the norm used to compute the neighbors, as already noticed by Cover and Hart [33]. But this is
not enough. By working out arguments in Preiss [67], Cérou and Guyader [3] exhibit a random variable X with Gaussian distribution in a separable Hilbert space for which the estimate fails to be consistent. On the positive side, these authors provide a general continuity condition which ensures the consistency of the estimate. Even with these recent results, the situation in infinite dimension is not completely clear, and it remains an interesting field for investigation.
In settings for which the estimate is convergent, there is still the question of the rate of convergence, and of how to choose the parameter k in order to achieve the best rate. As noticed by Kulkarni and Posner [57], the rate of convergence of the nearest neighbor estimate is closely related to the notion of entropy, introduced in the late fifties by Kolmogorov and Tikhomirov [56]. These tools will be used to study cases and algorithmic refinements that have not yet appeared in the literature.
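The text does not prescribe a practical rule for selecting k. As an illustration only, the following sketch uses simple hold-out validation (our choice, unrelated to the entropy analysis of Kulkarni and Posner) to pick k by minimizing an empirical squared error on held-out data:

```python
import numpy as np

def choose_k_holdout(X, Y, candidate_ks, split=0.8, seed=0):
    """Pick k by hold-out validation: a practical stand-in for the
    bias/variance trade-off behind the optimal rate of convergence."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(split * len(X))
    tr, va = idx[:n_train], idx[n_train:]

    def predict(x_new, k):
        # k-nearest-neighbor estimate built on the training split only
        dists = np.linalg.norm(X[tr] - x_new, axis=1)
        return Y[tr][np.argsort(dists)[:k]].mean()

    errors = {}
    for k in candidate_ks:
        preds = np.array([predict(x, k) for x in X[va]])
        errors[k] = np.mean((preds - Y[va]) ** 2)  # empirical squared error
    return min(errors, key=errors.get), errors

# toy usage on simulated data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
best_k, errs = choose_k_holdout(X, Y, candidate_ks=[1, 5, 10, 25, 50])
print(best_k)
```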