## Section: Scientific Foundations

Keywords : Statistics, Data Analysis, Data Quality, Indexing.

### Efficient Exploitation of Descriptors and Metadata

Even if documents can be described automatically, this is not enough to build a complete indexing and retrieval system that is usable in practice. The system must also answer queries in a reasonable amount of time, and thus needs tools that guarantee this. This section is devoted to some of these tools.

Exploitation falls into two main categories: on-line and off-line processing. Off-line processing usually corresponds to techniques that need to consider all the data, so time complexity is not the main issue. On-line processing, on the other hand, must be very fast. To achieve this, on-line procedures use the results of off-line processing to restrict the work to the smallest subset of the data needed to answer the query.

#### Statistics and Data Quality over Huge Datasets

Keywords : Exploratory Data Analysis, Statistics, Sampling, Data Quality Metrics.

The situation where few data are available has been well studied, but a huge amount of data raises different kinds of problems. For instance, with classical inferential statistics, hypothesis tests very often end up rejecting the null hypothesis. Besides, model identification methods often fail, or the quality of the model is overestimated. This raises the question of how to draw a representative sample from such datasets. In addition, some clustering algorithms are unusable on such large datasets. Working with huge datasets is therefore difficult because of computational complexity, data quality, and the scaling problem in inferential statistics.
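The scaling problem in hypothesis testing can be illustrated with a small sketch. It assumes a two-sided z-test on a mean with known standard deviation (our choice of test, not one named in the text), and the function name `two_sided_p_value` is hypothetical. The observed difference is held at a negligible 0.01 standard deviations while the sample size grows.

```python
import math

def two_sided_p_value(sample_mean, mu0, sigma, n):
    """p-value of a two-sided z-test for the mean, sigma assumed known."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # For Z ~ N(0, 1): P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny observed difference, increasingly large samples.
for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(sample_mean=0.01, mu0=0.0, sigma=1.0, n=n)
    print(f"n={n:>9}: p-value = {p:.3g}")
```

With a small sample the difference is far from significant, whereas with a million observations the same difference leads to rejecting the null hypothesis at any usual level.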

However, statistical methods can be used, with caution, if the data quality is good. So the first step is to clean and check the data to ensure their coherence. The second step depends on the goal: either we want to build a global model, or we are looking for hidden structures in the data. In the first case, we can work on a sample of the data and use methods such as clustering, segmentation or regression models. In the second case, sampling is not appropriate and other heuristics are needed.

Exploratory data analysis (EDA) is an essential tool to deal with huge
amounts of data. EDA describes data in an interactive way, without
*a priori* hypotheses, and provides useful graphical representations.
Visualization methods suited to data with more than three dimensions,
such as parallel coordinates, are also necessary. All these
methods examine the data to discover their properties.

Let us add that most available data mining programs are very expensive, and that the contents of most of them are disappointing.

#### Multidimensional Indexing Techniques

Keywords : Multidimensional Indexing Techniques, Databases, Curse of Dimensionality, Approximate Searches, Nearest-Neighbors (NN).

This section gives an overview of the techniques used in databases for indexing multimedia data (with a focus on still images). Database indexing techniques are needed as soon as the space required to store all the descriptors becomes too large to fit in main memory. They are therefore used for storing descriptors on disk and for accelerating the search process by means of multidimensional indexing structures. Their main goal is to minimize the resulting number of I/Os. This section first gives an overview of traditional multidimensional indexing approaches achieving exact nearest-neighbors searches. We especially focus on the filtering rules these techniques use to dramatically reduce their response times. We then move to approximate NN-search schemes.

##### Traditional Approaches, Cells and Filtering Rules

Traditional database multidimensional indexing techniques typically
divide the data space into cells containing vectors. Cell construction
strategies can be classified into two broad categories:
*data-partitioning* indexing methods [45], [105] that
divide the data space according to the distribution of data and
*space-partitioning* [78], [104]
indexing methods that divide the data space along predefined lines
regardless of the actual values of data and store each descriptor in
the appropriate cell.

Data-partitioning index methods, like the SS-Tree [105] or the SR-Tree [82], all derive from the seminal R-Tree [75], originally designed for indexing bi-dimensional data used in Geographical Information Systems.

Space-partitioning techniques like grid-file [93], K-D-B-Tree [97] and LSD-Tree [78] typically divide the data space along predetermined lines regardless of data clusters. Actual data are subsequently stored in the appropriate cells.
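As a minimal illustration of space partitioning, the sketch below assigns vectors to the cells of a uniform grid whose boundaries are fixed in advance, regardless of where the data actually lie. This is a deliberate simplification and not the grid-file, K-D-B-Tree or LSD-Tree themselves; the names `grid_cell` and `build_grid` and the bounds `lower` and `upper` are hypothetical.

```python
from collections import defaultdict

def grid_cell(vector, lower, upper, cells_per_dim):
    """Return the coordinates (tuple of ints) of the grid cell containing `vector`."""
    cell = []
    for x, lo, hi in zip(vector, lower, upper):
        idx = int((x - lo) / (hi - lo) * cells_per_dim)
        # Clamp so that points on or beyond the bounds fall in a border cell.
        cell.append(min(max(idx, 0), cells_per_dim - 1))
    return tuple(cell)

def build_grid(vectors, lower, upper, cells_per_dim):
    """Store each vector in the cell it falls into."""
    grid = defaultdict(list)
    for v in vectors:
        grid[grid_cell(v, lower, upper, cells_per_dim)].append(v)
    return grid

grid = build_grid([(0.1, 0.9), (0.15, 0.8), (0.7, 0.2)],
                  lower=(0.0, 0.0), upper=(1.0, 1.0), cells_per_dim=4)
```

Because the cell boundaries ignore the data distribution, clustered data leave many cells empty and a few cells crowded, which is the trade-off against data-partitioning methods.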

NN-algorithms typically use the geometrical properties of cells to
eliminate cells that cannot have any impact on the result of the
current query [49]. Eliminating irrelevant cells avoids
having to subsequently analyze all the vectors they contain, which, in
turn, reduces response times. Eliminating irrelevant cells is often
enforced at run-time by applying two rather similar *filtering
rules*.

The first rule is applied at the very beginning of the search process and identifies irrelevant cells as follows:

if d_{min}(q, C_{i}) ≥ d_{max}(q, C_{j}) then C_{i} is irrelevant,

where d_{min}(q, C_{i}) is the minimum distance between the query point
q and the cell C_{i} and d_{max}(q, C_{j}) the maximum distance
between q and cell C_{j}.
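The first filtering rule can be sketched as follows for the single nearest-neighbor case. The sketch assumes that cells are non-empty axis-aligned boxes given by their lower and upper corners and that distances are Euclidean; the names `d_min`, `d_max` and `first_filter` are hypothetical. A cell is discarded when its minimum distance to q exceeds the smallest maximum distance from q to any cell, and ties are kept to stay on the safe side.

```python
import math

def d_min(q, lo, hi):
    """Smallest Euclidean distance from q to the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def d_max(q, lo, hi):
    """Largest Euclidean distance from q to any point of the box [lo, hi]."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(q, lo, hi)))

def first_filter(q, cells):
    """Keep the cells that the rule cannot discard (1-NN, non-empty cells)."""
    # Some vector lies within this distance of q, whatever cell holds it.
    bound = min(d_max(q, lo, hi) for lo, hi in cells)
    return [(lo, hi) for lo, hi in cells if d_min(q, lo, hi) <= bound]

cells = [((0.0, 0.0), (1.0, 1.0)),   # contains q
         ((5.0, 5.0), (6.0, 6.0)),   # far away, discarded
         ((1.0, 0.0), (2.0, 1.0))]   # adjacent, kept
kept = first_filter((0.0, 0.0), cells)
```

For k neighbors the bound would have to be derived from enough cells to guarantee k vectors, which this sketch does not attempt.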

The search process ranks the remaining cells by increasing distance to q. It then accesses the cells, one after the other, fetches all the vectors each cell contains, and computes the distance between q and each vector of the cell. This may update the current set of the k best neighbors found so far.

The second filtering rule is applied to stop the search as soon as it is detected that none of the vectors in any remaining cell can possibly impact the current set of neighbors; all remaining cells are skipped. This second rule is:

if d_{min}(q, C_{i}) ≥ d(q, nn_{k}) then stop,

where C_{i} is the cell to process next and d(q, nn_{k}) is the distance
between q and the current k^{th}-NN.
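The ranked scan and the stopping rule can be sketched together. As before, this assumes axis-aligned box cells and Euclidean distances; `knn_search` and the `(lo, hi, vectors)` cell layout are hypothetical. The scan stops as soon as the next cell's minimum distance to q is at least the distance to the current k-th neighbor.

```python
import heapq
import math

def d_min(q, lo, hi):
    """Smallest Euclidean distance from q to the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def knn_search(q, cells, k):
    """cells: list of (lo, hi, vectors). Returns (neighbors, cells_read)."""
    ranked = sorted(cells, key=lambda c: d_min(q, c[0], c[1]))
    best = []        # max-heap of (-distance, vector), at most k entries
    cells_read = 0
    for lo, hi, vectors in ranked:
        # Stopping rule: no remaining cell can improve the current k-th NN.
        if len(best) == k and d_min(q, lo, hi) >= -best[0][0]:
            break
        cells_read += 1
        for v in vectors:
            d = math.dist(q, v)
            if len(best) < k:
                heapq.heappush(best, (-d, v))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, v))
    neighbors = sorted((-neg_d, v) for neg_d, v in best)
    return neighbors, cells_read

cells = [((0.0, 0.0), (1.0, 1.0), [(0.1, 0.1), (0.9, 0.9)]),
         ((5.0, 5.0), (6.0, 6.0), [(5.0, 5.0)])]
neighbors, cells_read = knn_search((0.0, 0.0), cells, k=2)
```

Because cells are visited by increasing minimum distance, once the rule fires for one cell it also holds for every cell after it, so breaking out of the loop is safe.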

Unfortunately, the "curse of dimensionality" phenomenon makes these filtering rules ineffective in high-dimensional spaces [104], [46], [94], [49], [83].
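One symptom of this phenomenon is that distances concentrate: the farthest and nearest points end up at nearly the same distance from the query, so bounds based on minimum and maximum distances discard almost nothing. The sketch below measures this on uniformly random points, which is an assumption made for illustration (real descriptors are not uniform); the name `relative_contrast` is hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    q = [rng.random() for _ in range(dim)]
    dists = [math.dist(q, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 20, 200):
    print(f"dim={dim:>3}: relative contrast = {relative_contrast(dim):.2f}")
```

The contrast shrinks as the dimension grows, which is why a cell's minimum distance is rarely larger than the distance to the current neighbors.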

##### Approximate NN-Searches

This phenomenon is particularly prevalent when performing *exact*
NN-searches. There is therefore an increasing interest in performing
*approximate* NN-searches, where result quality is traded for
reduced query execution time. Many approaches to approximate
NN-searches have been published.

**Dimensionality Reduction Approaches**. Dimension reduction techniques have been used to overcome the "curse of dimensionality" phenomenon. These techniques, such as PCA, SVD or DFT [70], exploit the underlying correlation of vectors and/or their self-similarity [83], which are frequent in real datasets. NN-search schemes using dimension reduction techniques are approximate because the reduction only coarsely preserves the distances between vectors. Therefore, the neighbors of query points found in the transformed feature space might not be the ones that would be found using the original feature space. These techniques introduce an imprecision in the results of NN-searches which can be neither controlled nor precisely measured. In addition, such techniques are effective only when the number of dimensions of the transformed space becomes very small; otherwise the "curse of dimensionality" phenomenon remains. This makes their use problematic when facing very high-dimensional datasets.

**Early Stopping Approaches**. Weber and Böhm with their approximate version of the VA-File [103] and Li *et al.* with Clindex [85] perform approximate NN-searches by interrupting the search after having accessed an arbitrary, predetermined and fixed number of cells. These two techniques are efficient in terms of response time, but give no indication of the quality of the result returned to the user. Ferhatosmanoglu *et al.* [65] combine this approach with a dimensionality reduction technique: the quality of an approximate result can be improved either by reading more cells or by increasing the number of dimensions used for distance calculations. Yet, this scheme suffers from the drawbacks of both approaches, as described here and above.

**Geometrical Approaches**. Geometrical approaches typically consider an approximation of the sizes of cells instead of their exact sizes. They typically account for an additional value ε when computing the minimum and maximum distances to cells, which in effect makes cells "smaller". Shrunk cells make the filtering rules more effective, which, in turn, increases the number of cells considered irrelevant. Cells containing interesting vectors might be filtered out, however.

The VA-BND scheme [103] empirically estimates this additional value, ε, by sampling database vectors. It is shown that this ε is large enough to increase the filtering power of the rules while small enough, in the majority of cases, to avoid missing the true nearest neighbors. The main drawback of this approach is that the same ε is applied to all existing cells; it does not account for the very different data distributions that cells may have.

The AC-NN scheme for M-Trees [57] also relies on a single value ε set by the user. Here, ε represents the maximum relative error allowed between the distance from q to its exact NN and the distance from q to its approximate NN. In this scheme, setting ε is far from intuitive. The experiments showed that the actual relative error is generally much smaller than ε. Ciaccia and Patella also present an extension to AC-NN called PAC-NN, which uses a probabilistic technique to estimate the distance between q and its NN. It then stops the search as soon as it finds a vector closer than this estimated distance. Unfortunately, AC-NN and PAC-NN cannot search for k neighbors.
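The relative-error idea can be sketched on a flat list of cells. This is not the M-Tree algorithm of [57]: it assumes axis-aligned box cells scanned by increasing minimum distance, and `approx_nn` and the parameter `eps` are hypothetical names. The scan stops once the next cell's minimum distance is at least the current best distance divided by (1 + eps), which ensures that the returned distance is at most (1 + eps) times the exact NN distance.

```python
import math

def d_min(q, lo, hi):
    """Smallest Euclidean distance from q to the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def approx_nn(q, cells, eps):
    """cells: list of (lo, hi, vectors). Returns (vector, distance, cells_read)."""
    ranked = sorted(cells, key=lambda c: d_min(q, c[0], c[1]))
    best_d, best_v, cells_read = math.inf, None, 0
    for lo, hi, vectors in ranked:
        # Relaxed stopping rule: tolerate a relative error of eps.
        if d_min(q, lo, hi) >= best_d / (1.0 + eps):
            break
        cells_read += 1
        for v in vectors:
            d = math.dist(q, v)
            if d < best_d:
                best_d, best_v = d, v
    return best_v, best_d, cells_read

cells = [((1.0, 0.0), (2.0, 1.0), [(1.2, 0.0)]),
         ((0.0, 1.1), (1.0, 2.0), [(0.0, 1.1)])]
exact = approx_nn((0.0, 0.0), cells, eps=0.0)    # reads both cells
approx = approx_nn((0.0, 0.0), cells, eps=0.5)   # stops after one cell
```

With `eps=0.0` the rule reduces to the exact stopping rule; with `eps=0.5` the second cell is skipped and a slightly farther vector is returned, still within the allowed relative error.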

**Hashing-based Approaches**. Approximate NN-searches using locality sensitive hashing (LSH) techniques [71] project the vectors into the Hamming cube and then use several hash functions such that co-located vectors are likely to collide in buckets. LSH techniques tune the hash functions based on a value for ε, which drives the precision of searches. As for the above schemes, setting the right value for ε is both key and tricky. The maximum distance between any query point and its NN is also key for tuning the hash functions. While finding the appropriate setting is, in general, very hard, it was observed [71] that choosing only one value for this maximum distance gives good results in practice. This, however, makes it more difficult to assess the quality of the returned result. Finally, the LSH scheme [71] might, in certain cases, return fewer than k vectors in the result.
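The collision idea can be sketched for vectors that are already embedded in the Hamming cube, that is, tuples of 0/1 bits. Each hash function below samples a few bit positions, a common construction for Hamming spaces; the class `HammingLSH` and its parameters `n_tables` and `n_bits` are hypothetical and are not tuned from ε or from a maximum distance as in [71].

```python
import random
from collections import defaultdict

def make_hash(dim, n_bits, rng):
    """Hash function that reads n_bits randomly chosen bit positions."""
    positions = rng.sample(range(dim), n_bits)
    return lambda bits: tuple(bits[p] for p in positions)

class HammingLSH:
    def __init__(self, dim, n_tables=8, n_bits=6, seed=0):
        rng = random.Random(seed)
        self.hashes = [make_hash(dim, n_bits, rng) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def add(self, bits):
        """Insert a bit vector (tuple of 0/1) into every hash table."""
        for h, table in zip(self.hashes, self.tables):
            table[h(bits)].append(bits)

    def query(self, bits, k):
        """Return up to k colliding vectors, closest in Hamming distance first."""
        candidates = set()
        for h, table in zip(self.hashes, self.tables):
            candidates.update(table.get(h(bits), []))
        ranked = sorted(candidates,
                        key=lambda c: sum(a != b for a, b in zip(c, bits)))
        return ranked[:k]

index = HammingLSH(dim=16)
v = tuple([0, 1] * 8)
index.add(v)
result = index.query(v, k=5)
```

Only vectors that collide with the query in at least one table are examined, so the result can contain fewer than k vectors, as noted above.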

**Probabilistic Approaches**. DBIN [44] clusters data using the EM (Expectation Maximization) algorithm. It aborts the search when the estimated probability for a remaining database vector to be a better neighbor than the one currently known falls below a predetermined threshold. DBIN bases its computations on the assumption that the points are IID samples from the estimated mixture-of-Gaussians probability density function. Unfortunately, DBIN cannot search for k neighbors.

P-Sphere Trees [73] investigate the trading of (disk) space for time when searching for the approximate NN of query points. In this scheme, some vectors are first picked from a sample of the DB, and each picked vector becomes the center of one hypersphere. Then, the DB is scanned and all the vectors that have one particular center as nearest neighbor go into the corresponding hypersphere. Vectors belonging to overlapping hyperspheres are replicated. Hyperspheres are built in such a manner that the probability of finding the true NN can be enforced at run time by solely scanning the sphere whose center is the closest to the query point. P-Sphere Trees cannot search for k neighbors either.

To our knowledge, no technique linking the precision of the search to a probability of improving the result can search for k neighbors.

**Rank Aggregation-based Approaches**. Recently, Fagin *et al.* [64] proposed a framework for very efficiently evaluating single-descriptor nearest-neighbor queries over high-dimensional collections. This framework is based on projecting the descriptors onto a limited set of random lines. Each random line is used to give a ranking of the database descriptors with respect to the query descriptor. These rankings are then efficiently aggregated to produce a fairly good approximation of the actual Euclidean k-nearest neighbors. The fastest algorithm to aggregate the rankings was called OMEDRANK.

The OMEDRANK algorithm has several nice properties: it is based on a cheap aggregation of rankings instead of a complex distance function; it uses standard B^{+}-trees to index the data, therefore handling updates gracefully; and it allows for a clever dimensionality reduction, by varying the number of random lines that are indexed.
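The projection and aggregation steps can be sketched as follows. This is a simplified reading of median-rank aggregation and not the implementation of [64]: in-memory sorted lists stand in for the B^{+}-trees, both cursors of every line are advanced at each step (our assumption about how the optimized variant proceeds), and a descriptor is output once it has been seen on more than half of the lines. The names `build_index` and `median_rank_knn` are hypothetical.

```python
import bisect
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_index(data, n_lines, seed=0):
    """Project descriptors onto random unit vectors; one sorted list per line."""
    rng = random.Random(seed)
    dim = len(data[0])
    lines = []
    for _ in range(n_lines):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(dot(v, v))
        lines.append([x / norm for x in v])
    # Each list holds (projection, descriptor id), sorted by projection.
    sorted_lists = [sorted((dot(line, v), i) for i, v in enumerate(data))
                    for line in lines]
    return lines, sorted_lists

def median_rank_knn(q, lines, sorted_lists, k):
    """Return ids of k approximate neighbors of q by rank aggregation."""
    m, n = len(lines), len(sorted_lists[0])
    k = min(k, n)
    threshold = m // 2 + 1          # "seen on more than half of the lines"
    cursors = []
    for line, lst in zip(lines, sorted_lists):
        up = bisect.bisect_left(lst, (dot(line, q), -1))
        cursors.append([up, up - 1])  # [upward cursor, downward cursor]
    seen, result = {}, []
    while len(result) < k:
        for cur, lst in zip(cursors, sorted_lists):
            for side, step in ((0, 1), (1, -1)):
                pos = cur[side]
                if 0 <= pos < n:
                    i = lst[pos][1]
                    seen[i] = seen.get(i, 0) + 1
                    if seen[i] == threshold:
                        result.append(i)
                    cur[side] = pos + step
    return result[:k]

data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (-3.0, 2.0)]
lines, sorted_lists = build_index(data, n_lines=5)
neighbors = median_rank_knn((1.0, 1.0), lines, sorted_lists, k=2)
```

No distance between descriptors is ever computed at query time: only positions in the sorted lists are compared, and indexing fewer lines gives a cheaper but coarser approximation.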