## Section: Scientific Foundations

### Multidimensional Indexing Techniques

Techniques for indexing multimedia data are needed to preserve the efficiency of search processes as soon as the data to be searched becomes large in volume and/or in dimensionality. These techniques aim at reducing the number of I/Os and CPU cycles needed to perform a search. Two classes of multidimensional indexing methods can be distinguished: exact nearest-neighbor (NN) searches and approximate NN-search schemes.

Traditional multidimensional indexing techniques typically divide the
data space into cells containing vectors [47].
Cell construction strategies can be classified in two broad categories:
*data-partitioning* indexing methods that divide the data space according
to the distribution of data, and *space-partitioning* indexing methods that
divide the data space along predefined lines and store each descriptor in
the appropriate cell. NN-algorithms typically use the geometrical properties of (minimum
bounding) cells to eliminate cells that cannot have any impact on the
result of the current query [48].
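This cell-elimination idea can be sketched as follows; all names are illustrative, and the MINDIST lower bound assumes axis-aligned minimum bounding rectangles:

```python
# Sketch of cell pruning with minimum bounding rectangles (MBRs).
# Hypothetical layout: each cell is a (lo, hi, vectors) triple, where
# lo and hi are the corners of the cell's MBR.
import numpy as np

def mindist(query, lo, hi):
    """Smallest possible distance from `query` to any point in the MBR [lo, hi]."""
    # Clamping the query onto the box yields the closest point of the MBR.
    closest = np.clip(query, lo, hi)
    return np.linalg.norm(query - closest)

def nn_search(query, cells):
    """Exact NN search that skips cells whose MINDIST exceeds the best distance."""
    best, best_d = None, np.inf
    # Visit cells in order of increasing MINDIST so good candidates come first.
    for lo, hi, vecs in sorted(cells, key=lambda c: mindist(query, c[0], c[1])):
        if mindist(query, lo, hi) >= best_d:
            break  # no remaining cell can contain a closer vector
        for v in vecs:
            d = np.linalg.norm(query - v)
            if d < best_d:
                best, best_d = v, d
    return best, best_d
```

Because cells are visited by increasing MINDIST, the loop can stop at the first cell whose bound exceeds the best distance found so far, which is exactly how geometric properties of cells prune the search.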

Many data-partitioning index methods derive from the seminal R-Tree [53]; they differ in the shapes used to build cells and/or in the degree of overlap allowed between cells. Well-known space-partitioning techniques are related to the K-D-B-Tree [63] and differ in the way space is split and cells are encoded.
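The contrast between the two families can be illustrated with a minimal sketch (all helper names are hypothetical): space partitioning assigns vectors to a predefined grid, while data partitioning builds R-Tree-style minimum bounding rectangles whose extents follow the data itself.

```python
# Toy contrast between space-partitioning and data-partitioning cells.
import numpy as np

def space_partition(vectors, cells_per_dim):
    """Map each vector in [0, 1)^d to a cell of a fixed uniform grid,
    i.e. cell boundaries are predefined lines independent of the data."""
    return [tuple(np.floor(v * cells_per_dim).astype(int)) for v in vectors]

def data_partition(vectors, groups):
    """Given a grouping of the vectors (e.g. obtained by clustering),
    return one minimum bounding rectangle (lo, hi) per group, so cell
    extents are driven by the data distribution."""
    cells = {}
    for v, g in zip(vectors, groups):
        lo, hi = cells.get(g, (v, v))
        cells[g] = (np.minimum(lo, v), np.maximum(hi, v))
    return cells
```

Note that the grid cells are disjoint by construction, whereas the bounding rectangles of different groups may overlap, which is precisely the overlap that R-Tree variants try to control.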

Unfortunately, the “curse of dimensionality” phenomenon makes these
traditional approaches ineffective in high-dimensional
spaces [46].
This phenomenon is particularly prevalent when performing *exact*
NN-searches. There is therefore an increasing interest in performing
*approximate* NN-searches, where result quality is traded for
reduced query execution time. Many approaches to approximate
NN-searches have been published; their description can be found
in [46].

Some approaches simply rely on dimensionality-reduction techniques, such as PCA, but their use remains problematic for very high-dimensional datasets. Other approaches abort the search early, after having accessed a fixed, predetermined number of cells; while this is highly effective, it gives no guarantee on the quality of the result returned to the user. Yet other approaches replace the exact extent of each cell with an approximation, in effect making cells "smaller". Shrunk cells speed up retrieval because they overlap less in space, but interesting vectors may be missed.
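The PCA route mentioned above can be sketched as follows: project the database and the query onto the top-k principal components and search in the reduced space (the function names are illustrative, not a reference implementation):

```python
# Approximate NN search after PCA dimensionality reduction.
import numpy as np

def pca_project(X, k):
    """Return a function projecting vectors onto the top-k principal
    components of the database X."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return lambda Y: (Y - mean) @ Vt[:k].T

def approx_nn(X, q, k):
    """Index of the nearest row of X to q, with distances measured in the
    k-dimensional PCA space (hence only approximately correct)."""
    project = pca_project(X, k)
    Xk, qk = project(X), project(q[None, :])[0]
    return int(np.argmin(np.linalg.norm(Xk - qk, axis=1)))
```

The approximation error comes from the variance discarded by the projection: two vectors far apart in the original space can collapse onto nearby points in the reduced space, which is why the approach degrades on very high-dimensional data whose variance is spread over many components.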

Recently, several approaches have transformed costly nearest neighbor searches
in multidimensional space into efficient uni-dimensional accesses.
One approach, based on locality-sensitive hashing (LSH)
techniques [52], uses several hash functions chosen so that co-located
vectors are likely to collide in the same buckets. Fagin *et al.* [50]
proposed a framework based on projecting
the descriptors onto a limited set of random lines, each
line giving a ranking of the database descriptors with
respect to the query descriptor.
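The LSH idea can be sketched with random-hyperplane hash functions, under which nearby vectors are likely to share a bucket. This is a toy single-table version with hypothetical names; practical LSH schemes use several tables and carefully tuned parameters:

```python
# Toy locality-sensitive hashing with random hyperplanes.
import numpy as np

class HyperplaneLSH:
    def __init__(self, dim, n_bits, seed=0):
        rng = np.random.default_rng(seed)
        # Each row is the normal of one random hyperplane through the origin.
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = {}

    def _key(self, v):
        # One bit per hyperplane: which side of the plane does v fall on?
        # Vectors at a small angle are likely to agree on every bit.
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, idx, v):
        self.buckets.setdefault(self._key(v), []).append(idx)

    def query(self, v):
        # Candidate set: only the vectors that collided in v's bucket,
        # turning a multidimensional search into a single hash lookup.
        return self.buckets.get(self._key(v), [])
```

A query thus costs one hash computation plus a scan of a single bucket, which is how these schemes convert a multidimensional search into an efficient uni-dimensional access.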