# Project: texmex

## Section: Scientific Foundations

Keywords: Statistics, Data Analysis, Indexing.

### Efficient Exploitation of Descriptors

Even if the description of the documents can be produced automatically, this is not enough to build a complete indexing and retrieval system usable in practice. The system must also answer a query in a reasonable amount of time, and therefore requires tools that guarantee this. This section is devoted to some of these tools.

On-line and off-line processing define the two main categories of exploitation. Off-line processing usually covers the techniques that need to consider all the data, so time complexity is not the main issue. On-line processing, on the other hand, must be very fast. To achieve such performance, on-line procedures use the results of the off-line processing to restrict the work to the smallest data subset necessary to answer the query.

#### Statistics for Huge Datasets

Keywords: exploratory data analysis, statistics, sampling.

The situation where few data are available has been well studied, but a huge amount of data raises different kinds of problems. For instance, classical inferential statistics applied to hypothesis testing tend to reject the null hypothesis far too often; model identification methods frequently fail, or the quality of the model is overestimated. The question is: how can we draw a representative sample from such datasets? Moreover, some clustering algorithms are simply unusable on datasets this large. Working with huge datasets is therefore difficult because of the computational complexity involved, because of data quality, and because of the scaling problems of inferential statistics.

However, statistical methods can be used with caution if the data quality is good. The first step is therefore to clean and check the data to ensure their coherence. The second step depends on the goal: either we want to build a global model, or we are looking for hidden structures in the data. In the first case, we can work on a sample of the data and use methods such as clustering, segmentation, or regression models. When looking for hidden structures, sampling is not appropriate and other heuristics must be used.
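The sample-then-model strategy of the first case can be sketched in a few lines. The toy 1-D k-means below is purely illustrative (the function name, parameters and data are our own assumptions, not the project's actual pipeline):

```python
import random

def sample_then_cluster(data, sample_size, k, iters=20, seed=0):
    """Uniform sampling followed by a naive 1-D k-means (illustrative only)."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    centroids = rng.sample(sample, k)  # initialize centroids from the sample
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in sample:  # assign each sampled point to its nearest centroid
            clusters[min(range(k), key=lambda i: abs(x - centroids[i]))].append(x)
        # recompute each centroid as its cluster mean (keep the old one if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Toy data: two well-separated groups around 0 and 100.
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(500)] + [rng.gauss(100, 1) for _ in range(500)]
centroids = sample_then_cluster(data, sample_size=200, k=2)
```

Working on a sample of 200 points instead of the full 1000 is enough here to recover centroids close to 0 and 100, which is the point of sampling for global models.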

Exploratory data analysis (EDA) is an essential tool for dealing with huge amounts of data. EDA describes data interactively, without a priori hypotheses, and provides useful graphical representations. Visualization methods for data of dimension greater than three, such as parallel coordinates, are also indispensable. All these methods analyse the data to discover their properties.

We can add that most of the available data mining programs are very expensive, and that for most of them the contents are disappointing and poor.

#### Multidimensional Indexing Techniques

Keywords: Multidimensional Indexing Techniques, Databases, Curse of Dimensionality, Approximate Searches, Nearest-Neighbors.

This section gives an overview of the techniques used in databases for indexing multimedia data (often focusing on still images). Database indexing techniques are needed as soon as the space required to store all the descriptors gets too big to fit in main memory. They are therefore used for storing descriptors on disks and for accelerating the search process with multi-dimensional index structures, the goal being mainly to minimize the resulting number of I/Os. This section first gives an overview of traditional multidimensional indexing approaches achieving exact NN-searches, focusing especially on the filtering rules these techniques use to dramatically reduce their response times. We then move to approximate NN-search schemes.

##### Traditional Approaches, Cells and Filtering Rules.

Traditional database multidimensional indexing techniques typically divide the data space into cells containing vectors. Cell construction strategies can be classified in two broad categories: *data-partitioning* indexing methods [34] [79], which divide the data space according to the distribution of the data, and *space-partitioning* indexing methods [57] [78], which divide the data space along predefined lines regardless of the actual values of the data and store each descriptor in the appropriate cell.

Data-partitioning index methods all derive from the seminal R-Tree [54], originally designed for indexing bi-dimensional data used in Geographical Information Systems. The R-Tree was later extended to cope with multi-dimensional data. The SS-Tree [79] is an extension that relies on spheres instead of rectangles. The SR-Tree [60] specifies its cells as being the intersection of a bounding sphere and a bounding rectangle.

Space-partitioning techniques like grid-file [68], K-D-B-Tree [72], LSD ${}^{\text{h}}$ -Tree [57] typically divide the data space along predetermined lines regardless of data clusters. Actual data are subsequently stored in the appropriate cells.
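The core of space partitioning is a purely positional cell assignment. The sketch below illustrates the idea with a fixed grid (the function name and split lines are our own assumptions, not the actual grid-file algorithm):

```python
import bisect

def grid_cell(vector, lines_per_dim):
    """Map a vector to its grid cell id: for each dimension, find which
    interval between the predefined split lines the coordinate falls in.
    The split lines are fixed in advance, regardless of data clusters."""
    return tuple(bisect.bisect_right(lines, x)
                 for x, lines in zip(vector, lines_per_dim))

# Split each of two dimensions at 0.0 and 1.0, giving a 3x3 grid of cells.
lines = [[0.0, 1.0], [0.0, 1.0]]
```

For example, `grid_cell((0.5, -2.0), lines)` yields `(1, 0)`: the middle interval on the first axis and the leftmost on the second, independently of where the rest of the data lies.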

NN-algorithms typically use the geometrical properties of cells to eliminate those cells that cannot have any impact on the result of the current query [37]. Eliminating irrelevant cells avoids having to subsequently analyze all the vectors they contain, which, in turn, reduces response times. This elimination is often enforced at run time by applying two rather similar *filtering rules*. The first rule is applied at the very beginning of the search process and identifies irrelevant cells as follows:

$$\text{if } dmin(q, C_i) \geqslant dmax(q, C_j) \text{ then } C_i \text{ is irrelevant,} \qquad (1)$$

where $dmin(q, C_i)$ is the minimum distance between the query point *q* and the cell $C_i$, and $dmax(q, C_j)$ is the maximum distance between *q* and the cell $C_j$.

The search process then ranks the remaining cells by increasing distance to *q*. It then accesses the cells one after the other, fetches all the vectors each cell contains, and computes the distance between *q* and each vector of the cell, possibly updating the current set of the *k* best neighbors found so far.

The second filtering rule is applied to stop the search as soon as it is detected that none of the vectors in any remaining cell can possibly impact the current set of neighbors; all remaining cells are then skipped. This second rule is:

$$\text{if } dmin(q, C_i) \geqslant d(q, nn_k) \text{ then stop,} \qquad (2)$$

where $C_i$ is the cell to process next and $d(q, nn_k)$ is the distance between *q* and the current ${k}^{th}$-NN.
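Both filtering rules can be sketched for axis-aligned rectangular cells. In the code below, the cell representation, function names and toy data are our own assumptions; rule (1) is applied as stated in the text, i.e. with the best single-cell $dmax$ as the bound:

```python
import heapq
import math

def dmin(q, box):
    """Minimum distance from q to an axis-aligned cell (lo, hi)."""
    lo, hi = box
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def dmax(q, box):
    """Maximum distance from q to an axis-aligned cell (lo, hi)."""
    lo, hi = box
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(q, lo, hi)))

def knn(q, cells, k):
    """k-NN search using the two filtering rules; `cells` maps a box
    (lo, hi) to the list of vectors it contains."""
    boxes = list(cells)
    # Rule (1): C_i is irrelevant if dmin(q, C_i) >= dmax(q, C_j).
    bound = min(dmax(q, b) for b in boxes)
    boxes = [b for b in boxes if dmin(q, b) <= bound]
    # Rank the surviving cells by increasing distance to q.
    boxes.sort(key=lambda b: dmin(q, b))
    best = []  # max-heap of (-distance, vector): the k best neighbors so far
    for b in boxes:
        # Rule (2): stop once no remaining cell can improve the result.
        if len(best) == k and dmin(q, b) >= -best[0][0]:
            break
        for v in cells[b]:
            d = math.dist(q, v)
            if len(best) < k:
                heapq.heappush(best, (-d, v))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, v))
    return sorted((-nd, v) for nd, v in best)
```

The break triggered by rule (2) is what saves I/Os in practice: once cells are ranked by $dmin$, all cells after the first one failing the test can be skipped without reading any of their vectors.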

The "curse of dimensionality" phenomenon makes these filtering rules ineffective in high-dimensional spaces [78] [35] [69] [37] [61].

##### Approximate NN-Searches.

This phenomenon is particularly prevalent when performing *exact*
NN-searches. There is therefore an increasing interest in performing
*approximate* NN-searches, where result quality is traded for
reduced query execution time. Many approaches to approximate
NN-searches have been published.

##### Dimensionality Reduction Approaches.

Dimension reduction techniques have been used to overcome the "curse of dimensionality" phenomenon. These techniques, such as PCA, SVD or DFT (see [50]), exploit the underlying correlation of vectors and/or their self-similarity [61], which is frequent in real datasets. NN-search schemes using dimension reduction are approximate because the reduction only coarsely preserves the distances between vectors: the neighbors of query points found in the transformed feature space might not be the ones that would be found in the original feature space. The imprecision these techniques introduce in the results of NN-searches can be neither controlled nor precisely measured. In addition, such techniques are effective only when the number of dimensions of the transformed space becomes very small; otherwise the "curse of dimensionality" phenomenon remains. This makes their use problematic with very high-dimensional datasets.
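A toy example shows why reduced-space neighbors can differ from true neighbors. Keeping the first coordinates is used here as a very crude stand-in for a PCA/SVD projection (our simplification); the key property is that an orthogonal projection can only shrink distances:

```python
import math

def project(v, m):
    """Keep the first m coordinates: an orthogonal projection, standing in
    (crudely) for projecting onto the first m principal axes of PCA/SVD."""
    return v[:m]

q, a, b = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 0.0, 10.0)
# In the full space, a is the true NN of q (distance 1 vs. about 10.01),
# but after projecting to 2 dimensions, b looks closer (0.5 vs. 1.0):
# the discarded third coordinate carried the distance that mattered.
```

Because projected distances lower-bound true distances, the reduced space can make a far vector look close, but never the reverse; this asymmetry is exactly what makes such schemes approximate.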

##### Early Stopping Approaches.

Weber and Böhm with their approximate version of the VA-File [77], and Li et al. with Clindex [63], perform approximate NN-searches by interrupting the search after having accessed an arbitrary, predetermined and fixed number of cells. These two techniques are efficient in terms of response times, but give no clue on the quality of the result returned to the user. Ferhatosmanoglu et al. [45] combine this approach with a dimensionality reduction technique: the quality of an approximate result can be improved either by reading more cells or by increasing the number of dimensions used for distance calculations. This scheme suffers from both sets of drawbacks mentioned above.
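The early-stopping idea reduces to scanning a fixed prefix of the ranked cells. The sketch below (our own naming, not the VA-File or Clindex code) assumes the cells are already ranked by distance to the query, as an index structure would provide:

```python
import math

def approx_knn(q, ranked_cells, k, max_cells):
    """Scan only the first `max_cells` cells of a list assumed pre-ranked
    by increasing distance to q; fast, but no guarantee on result quality."""
    candidates = [v for cell in ranked_cells[:max_cells] for v in cell]
    return sorted(candidates, key=lambda v: math.dist(q, v))[:k]

ranked = [[(1.0, 0.0)], [(0.5, 0.5)], [(10.0, 10.0)]]
# With max_cells=1 the true NN (0.5, 0.5) is missed; with max_cells=2
# it is found: the quality/time trade-off is entirely in max_cells.
```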

##### Geometrical Approaches.

Geometrical approaches typically consider an approximation of the sizes of cells instead of their exact sizes. They typically account for an additional $\epsilon $ value when computing the minimum and maximum distances to cells, in effect making cells "smaller". Shrunk cells make the filtering rules more effective, which, in turn, increases the number of irrelevant cells. However, cells containing interesting vectors might be filtered out.
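In terms of filtering rule (2), shrinking a cell amounts to adding $\epsilon$ to its minimum distance before pruning. A minimal sketch, with our own function name:

```python
def cell_is_irrelevant(dmin_q_c, d_q_nnk, eps):
    """Epsilon-shrunk version of rule (2): pruning when
    dmin(q, C) + eps >= d(q, nn_k) discards more cells than the exact
    rule, at the risk of discarding cells holding true neighbors."""
    return dmin_q_c + eps >= d_q_nnk
```

With `eps = 0.2`, a cell at $dmin = 0.9$ is pruned against a current k-th NN at distance 1.0, whereas the exact rule (`eps = 0`) would keep it and scan its vectors.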

In [77], the VA-BND scheme empirically estimates $\epsilon $ by sampling database vectors. It is shown that this $\epsilon $ is large enough to increase the filtering power of the rules, while small enough, in the majority of cases, to avoid missing the true nearest-neighbors. The main drawback of this approach is that the same $\epsilon $ is applied to all existing cells, which does not account for the possibly very different data distributions in cells.

The AC-NN scheme for M-Trees presented in [41] also relies on a single value $\epsilon $ set by the user. Here, $\epsilon $ represents the maximum relative error allowed between the distance from *q* to its exact NN and the distance from *q* to its approximate NN. In this scheme, setting $\epsilon $ is far from intuitive, and the authors' experiments showed that, in general, the actual relative error is always much smaller than $\epsilon $. Ciaccia and Patella also present an extension to AC-NN called PAC-NN, which uses a probabilistic technique to estimate the distance between *q* and its NN and stops the search as soon as it finds a vector closer than this estimated distance. Unfortunately, AC-NN and PAC-NN cannot search for *k* neighbors.

##### Hashing-based Approaches.

Approximate NN-searches using locality sensitive hashing (LSH) techniques are described in [51]. These schemes project the vectors into the Hamming cube and then use several hash functions such that co-located vectors are likely to collide in buckets. LSH techniques tune the hash functions based on a value of $\epsilon $ which drives the precision of searches. As for the above schemes, setting the right value for $\epsilon $ is key and tricky. The maximum distance between any query point and its NN is also key for tuning the hash functions. While finding the appropriate setting is, in general, very hard, [51] observes that choosing a single value for this maximum distance gives good results in practice. This, however, makes any assessment of the quality of the returned result more difficult. Finally, the LSH scheme presented in [51] might, in certain cases, return fewer than *k* vectors in the result.
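A common LSH flavor hashes a vector to the signs of random projections. The sketch below is our own simplification, not the exact construction of [51]; it shows why nearby vectors tend to collide in the same bucket:

```python
import random

def make_lsh(dim, n_bits, seed=0):
    """Build one hash function: each bit is the sign of the dot product
    with a random hyperplane, so vectors at a small angle from each
    other are likely to share most of their bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def h(v):
        return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0)
                     for p in planes)
    return h

h = make_lsh(dim=3, n_bits=8)
# Sign bits are invariant to positive scaling, so a vector and any
# positive multiple of it always fall in the same bucket.
```

Several such functions are used together in practice, so that true neighbors missed by one hash table are likely caught by another.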

##### Probabilistic Approaches.

DBIN [33] clusters data using the EM (Expectation-Maximization) algorithm. It aborts the search when the estimated probability that a remaining database vector is a better neighbor than the one currently known falls below a predetermined threshold. DBIN bases its computations on the assumption that the points are IID samples from the estimated mixture-of-Gaussians probability density function. Unfortunately, DBIN cannot search for *k* neighbors.

P-Sphere Trees [52] investigate trading (disk) space for time when searching for the approximate NN of query points. In this scheme, some vectors are first picked from a sample of the DB, and each picked vector becomes the center of one hypersphere. The DB is then scanned, and all the vectors that have one particular center as their nearest neighbor go into the corresponding hypersphere; vectors belonging to overlapping hyperspheres are replicated. Hyperspheres are built in such a manner that the probability of finding the true NN can be enforced at run time by scanning only the sphere whose center is closest to the query point. P-Sphere Trees also cannot search for *k* neighbors.
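The construction can be sketched as follows (function names and toy data are ours; the replication across overlapping hyperspheres that real P-Sphere Trees perform is omitted for brevity):

```python
import math
import random

def build_spheres(db, n_centers, seed=0):
    """P-Sphere-style sketch: sample some vectors as sphere centers,
    then assign every DB vector to the bucket of its nearest center."""
    rng = random.Random(seed)
    centers = rng.sample(db, n_centers)
    buckets = {c: [] for c in centers}
    for v in db:
        nearest = min(centers, key=lambda c: math.dist(c, v))
        buckets[nearest].append(v)
    return centers, buckets

def approx_nn(q, centers, buckets):
    """At query time, scan only the sphere whose center is closest to q."""
    nearest = min(centers, key=lambda c: math.dist(c, q))
    return min(buckets[nearest], key=lambda v: math.dist(q, v))
```

Only one bucket is read per query, which is where the speed comes from; without the omitted replication step, the true NN can be missed when it falls in a neighboring sphere.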

To our knowledge, no technique linking the precision of the search to a probability of improving the result can search for *k* neighbors.