Section: Scientific Foundations
Multidimensional Indexing Techniques
Techniques for indexing multimedia data are needed to preserve the efficiency of search processes when the data to be searched grows large in volume and/or in dimensionality. These techniques aim at reducing the number of I/Os and CPU cycles needed to perform a search. Two classes of multidimensional indexing methods can be distinguished: exact nearest-neighbor (NN) searches and approximate NN-search schemes.
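As a baseline for both classes, the exact NN search over a flat collection of descriptors can be sketched as an exhaustive linear scan (a minimal illustration, not taken from the cited work; the function name `exact_nn` is ours):

```python
import math

def exact_nn(query, database):
    """Exhaustive exact nearest-neighbor search: compare the query
    against every descriptor, i.e. O(n) distance computations."""
    best, best_dist = None, math.inf
    for i, vec in enumerate(database):
        d = math.dist(query, vec)  # Euclidean distance
        if d < best_dist:
            best, best_dist = i, d
    return best, best_dist
```

Indexing techniques exist precisely to avoid this full scan: they prune or approximate so that only a fraction of the descriptors is ever compared to the query.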
Traditional multidimensional indexing techniques typically divide the data space into cells containing vectors [47]. Cell construction strategies can be classified into two broad categories: data-partitioning indexing methods, which divide the data space according to the distribution of the data, and space-partitioning indexing methods, which divide the data space along predefined lines and store each descriptor in the appropriate cell. NN-algorithms typically use the geometrical properties of (minimum bounding) cells to eliminate cells that cannot have any impact on the result of the current query [48].
Many data-partitioning index methods derive from the seminal R-Tree [53], and their differences lie in the shapes used to build cells and/or in the degree of overlap between cells. Well-known space-partitioning techniques are broadly related to the K-D-B-Tree [63], and differ in the way space is split and cells are encoded.
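The space-partitioning idea can be illustrated with a minimal k-d tree, a close relative of the K-D-B-Tree family: space is split along one coordinate axis per level, and the NN query backtracks while pruning subtrees that cannot contain a closer point (a simplified textbook sketch, not the disk-based structure of [63]):

```python
import math

def build_kdtree(points, depth=0):
    """Split along one coordinate axis at each level (space partitioning)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def kdtree_nn(node, query, best=None):
    """Descend towards the query, then backtrack, pruning subtrees whose
    half-space cannot contain a point closer than the best found so far."""
    if node is None:
        return best
    d = math.dist(query, node["point"])
    if best is None or d < best[1]:
        best = (node["point"], d)
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = kdtree_nn(near, query, best)
    if abs(diff) < best[1]:  # splitting plane closer than current best:
        best = kdtree_nn(far, query, best)  # the far side may still help
    return best
```

The pruning test `abs(diff) < best[1]` is exactly the geometric elimination mentioned above: a cell whose bounding plane lies farther than the current best distance cannot affect the result.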
Unfortunately, the “curse of dimensionality” phenomenon makes these traditional approaches ineffective in high-dimensional spaces [46]. This phenomenon is particularly prevalent when performing exact NN-searches. There is therefore increasing interest in performing approximate NN-searches, where result quality is traded for reduced query execution time. Many approaches to approximate NN-searches have been published; their description can be found in [46].
Some approaches simply rely on dimensionality-reduction techniques, such as PCA, but their use remains problematic on very high-dimensional datasets. Other approaches abort the search early, after having accessed an arbitrary, predetermined number of cells. While this is highly effective, it gives no indication of the quality of the result returned to the user. Yet other approaches consider an approximation of the sizes of cells instead of their exact sizes, effectively making the cells “smaller”. Shrunk cells increase retrieval efficiency because they reduce overlap in space, but relevant vectors may be missed.
Recently, several approaches have transformed costly nearest-neighbor searches in multidimensional space into efficient uni-dimensional accesses. One approach, based on locality-sensitive hashing (LSH) [52], uses several hash functions designed so that co-located vectors are likely to collide in the same buckets. Fagin et al. [50] proposed a framework based on projecting the descriptors onto a limited set of random lines, each line giving a ranking of the database descriptors with respect to the query descriptor.
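The LSH principle can be sketched with random-hyperplane hashing: each bit of a bucket key records the sign of the vector's projection onto a random direction, so nearby vectors tend to land in the same bucket. This is a minimal sign-of-projection variant for illustration, not the exact scheme of [52]; the helper names (`make_hash`, `build_index`, `lsh_query`) are ours:

```python
import math
import random

def make_hash(dim, n_bits, seed=0):
    """Random-hyperplane LSH: n_bits sign-of-projection bits per vector."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def h(v):
        return tuple(int(sum(a * b for a, b in zip(p, v)) >= 0) for p in planes)
    return h

def build_index(vectors, h):
    """Hash every descriptor into its bucket (uni-dimensional access key)."""
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(h(v), []).append(i)
    return buckets

def lsh_query(buckets, vectors, q, h):
    """Probe only the query's bucket, then rank the candidates exactly."""
    cand = buckets.get(h(q), [])
    return min(cand, key=lambda i: math.dist(q, vectors[i]), default=None)
```

Only one bucket is read per hash table, so the multidimensional search reduces to a handful of key lookups; in practice several independent hash tables are combined to raise the collision probability for true neighbors.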