Section: New Results
Automatic annotation and learning
Categorical object retrieval
Keywords : interest points, local descriptors, interpretability, feature selection.
Participants : Ahmed Rebai, Alexis Joly, Nozha Boujemaa.
Unlike similarity-search techniques, which have to some extent become reliable over the past few years, object retrieval still raises many open issues. Searching for a concept involves various problems related both to feature extraction (e.g. invariance to viewpoint, illumination and affine transformations) and to machine learning (robustness, over-fitting, genericity, computation time, etc.).
Our research focuses on building concise yet powerful models that make it possible to retrieve objects in large heterogeneous image collections. To this end, we designed a feature selection algorithm based on both boosting and lasso techniques.
Unlike most training algorithms used in automatic object recognition, this new algorithm generates sparse models from the complete space of all the local features of the training images. The intuitive idea is to add an extra term to the loss function; this term acts as a constraint that shrinks the solutions towards zero.
Given a loss function $L(y, f(x))$, where $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ is a sum of base learners $h_t$, the objective is to minimize the following function:
$$\sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \sum_{t=1}^{T} |\alpha_t|$$
where $n$ is the number of training examples and $\lambda \geq 0$ is a parameter that controls the amount of shrinkage applied: the larger $\lambda$, the sparser the model. Sparsity is known to be a good trade-off between model simplicity and good category representation. Moreover, sparsity favors interpretability, which is valuable for subsequent human interaction.
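As a rough illustration (not the authors' implementation), the following sketch shows a coordinate-wise boosting update combined with a lasso soft-thresholding step. It assumes an exponential surrogate loss and a precomputed matrix of weak-learner outputs; all names, the step size and the stopping rule are illustrative.

```python
import numpy as np

def exp_loss(y, f):
    """Exponential surrogate loss; labels y are in {-1, +1}."""
    return np.exp(-y * f)

def objective(y, F, alpha, lam):
    """Penalized risk: sum_i L(y_i, f(x_i)) + lam * ||alpha||_1,
    with f(x_i) = (F @ alpha)[i] and F[i, t] = h_t(x_i)."""
    return exp_loss(y, F @ alpha).sum() + lam * np.abs(alpha).sum()

def fit(y, F, lam, n_iter=200, step=0.1):
    """Coordinate-wise boosting with a soft-thresholding (lasso) step.
    Many coefficients stay exactly at zero, giving a sparse model."""
    alpha = np.zeros(F.shape[1])
    for _ in range(n_iter):
        w = exp_loss(y, F @ alpha)        # per-example weights
        grad = -(y * w) @ F               # gradient of the data-fit term
        t = int(np.argmax(np.abs(grad)))  # most useful weak learner
        a = alpha[t] - step * grad[t]     # unpenalized coordinate update
        # the lasso term shrinks the coefficient toward (and often to) zero
        alpha[t] = np.sign(a) * max(abs(a) - step * lam, 0.0)
    return alpha
```

Weak learners whose coefficients remain exactly zero are simply discarded, which is what yields the sparse, interpretable model discussed above.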
Preliminary experiments carried out on the Pascal VOC database using two successful state-of-the-art descriptors (SIFT and SURF) are promising.
Automatic image annotation through visual word co-occurences
Keywords : automatic image annotation, object detection, bag-of-word, visual descriptor.
Participants : Nicolas Hervé, Nozha Boujemaa.
The bag-of-visual-words is a popular image representation that has proven quite effective for automatic annotation.
The main idea behind bag-of-visual-words is to represent an image as a collection of visual patches and to compute, as a global signature, a histogram counting the occurrences of these patches. This representation can then be used in any learning framework to address the automatic annotation problem. It is simple to implement and provides state-of-the-art performance on several evaluation benchmarks. One of the main characteristics of bags of visual words is their orderless nature: the spatial positions of the visual patches are dropped and never used. On the one hand, this choice brings flexibility and robustness to the representation, as it can deal with changes in viewpoint or occlusion. On the other hand, the spatial relations between patches could be useful to describe the internal structure of objects or to highlight the importance of contextual visual information for these objects.
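A minimal sketch of such an orderless signature, assuming a precomputed codebook (e.g. k-means centroids over training descriptors); the function name and normalization are illustrative, not the system's actual code.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word and count
    occurrences; the spatial positions of patches are deliberately ignored."""
    # squared Euclidean distances between descriptors and codebook entries
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                       # nearest visual word
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)              # L1-normalized signature
```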
We extend this representation to include weak geometric information by using visual word pairs. We consider the co-occurrence of words within a predefined local neighborhood of each patch; thus, only the distance between two patches matters, whatever their relative orientation. This way, we include both contextual and structural information in our new visual signature, as sketched below.
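A rough sketch of the pair signature under the assumptions above (only inter-patch distance is used, orientation is ignored); the neighborhood radius and helper names are illustrative.

```python
import numpy as np
from itertools import combinations

def pair_histogram(positions, words, n_words, radius):
    """Co-occurrence histogram over visual-word pairs whose patches lie
    within `radius` of each other; relative orientation is ignored.

    positions: (n_patches, 2) array of patch centers
    words:     (n_patches,) array of visual-word indices in [0, n_words)
    """
    hist = np.zeros((n_words, n_words))
    for i, j in combinations(range(len(words)), 2):
        if np.linalg.norm(positions[i] - positions[j]) <= radius:
            a, b = sorted((words[i], words[j]))   # unordered pair
            hist[a, b] += 1
    return hist.ravel()   # can be concatenated with the standard BoW histogram
```

Concatenating this pair histogram with the standard word-occurrence histogram gives the combined signature fed to the learning stage.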
Following our previous work, we choose to extract standard low-level visual patches on a regular grid [20] before creating the pairs and we use SVMs as a learning strategy.
On a standard image database (Pascal VOC 2007), considering word pairs improves annotation performance by 10%.
Embedding the word pairs in a standard bag-of-visual-words representation thus brings a very significant improvement for the automatic annotation task: the weak geometric information they encode is complementary to the standard word-occurrence histogram.
This work has been published in [19]. The overall system is described in the thesis [8].
Objects retrieval with efficient boosting
Keywords : scalability, relevance feedback, local descriptors, Adaboost, multi-probe locality sensitive hashing, feature selection.
Participants : Saloua Ouertani-Litayem, Alexis Joly, Nozha Boujemaa.
Most recent and effective recognition techniques are based on high-dimensional and sparse representations induced by the large number of local visual features. Classifiers learned on such representations are usually applied to test images one by one, so the complexity in a retrieval context is intrinsically linear in the dataset size. We therefore explored an efficient boosting strategy to reduce retrieval complexity when using feature-rich image representations. For the learning step we used the AdaBoost algorithm with weak learners based on distances between training local features. Instead of predicting the scores of the images one by one, we performed $T$ range queries in the dataset according to the parameters $(q_t, r_t)$ of the $T$ weak classifiers. To perform these range queries efficiently, we used the a posteriori multi-probe locality sensitive hashing similarity search structure [44]. Each range query returns a set of features $R_t$ such that:
$$R_t = \{\, x \in \Omega \mid d(x, q_t) \leq r_t \,\}$$
where $\Omega$ is the set of all local features in the dataset, $q_t$ the query feature and $r_t$ the range of the $t$-th weak classifier.
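As a rough illustration of this query-driven scoring (the actual system relies on the a posteriori multi-probe LSH structure of [44]; the helper names `range_query` and `image_id` below are assumptions, not the authors' API), the following sketch accumulates weak-classifier votes per image instead of scoring images one by one.

```python
from collections import defaultdict

def retrieve(index, weak_classifiers):
    """index: a similarity-search structure over all local features of the
    dataset (e.g. an LSH index); range_query(q, r) is assumed to return the
    stored features within distance r of q, each carrying its image id.

    weak_classifiers: list of (q_t, r_t, alpha_t) tuples learned by AdaBoost.
    Returns images sorted by accumulated boosted score.
    """
    scores = defaultdict(float)
    for q_t, r_t, alpha_t in weak_classifiers:
        matched = {f.image_id for f in index.range_query(q_t, r_t)}
        for img in matched:
            scores[img] += alpha_t   # weak vote for every image hit by R_t
    # Since h_t(x) is in {-1, +1}, ranking by the sum of alpha_t over matched
    # queries is equivalent to ranking by the full score sum_t alpha_t h_t(x)
    # (the two differ only by a monotone transform).
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

The retrieval cost is thus driven by the $T$ range queries rather than by the dataset size.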
Experiments on the Caltech-256 dataset show that the technique is about 250 times faster than the naive exhaustive method, with surprisingly better performance (see Table 2).
Table 2. Retrieval time and mean average precision (M.A.P.) per class on Caltech-256: exhaustive method [46] vs. the proposed approximate method.

|                  | Exhaustive [46] |        | Approximate |        |
| Classes          | Time (sec)      | M.A.P. | Time (sec)  | M.A.P. |
| airplanes        | 9008.92         | 0.2037 | 35.43       | 0.3881 |
| american-flag    | 8935.19         | 0.2922 | 35.72       | 0.3903 |
| chess-board      | 9537.36         | 0.7156 | 33.21       | 0.7446 |
| golf-ball        | 8908.83         | 0.1156 | 39.31       | 0.2361 |
| mars             | 9017.57         | 0.1603 | 31.2        | 0.0909 |
| motorbikes       | 9001.33         | 0.2863 | 34.43       | 0.4516 |
| sunflower        | 8942.48         | 0.5797 | 32.63       | 0.6214 |
| swiss-army-knife | 1604.16         | 0.0201 | 31.52       | 0.1196 |
| tennis-racket    | 8923.87         | 0.2266 | 33.08       | 0.2715 |
| tower-pisa       | 8911.52         | 0.2683 | 40.01       | 0.5512 |
| Mean             | 8279.123        | 0.2868 | 34.65       | 0.3865 |
We also applied the proposed method to a real-time relevance feedback mechanism based on freely selected image regions. Experiments show that this active learning scheme provides significant effectiveness improvements (see Figure 14 and [22], [30] for more details).