Team Imedia


Section: New Results

Automatic annotation and learning

Categorical object retrieval

Keywords : interest points, local descriptors, interpretability, feature selection.

Participants : Ahmed Rebai, Alexis Joly, Nozha Boujemaa.

Unlike search-by-similarity techniques, which have to some extent become reliable over the past few years, object retrieval still faces many open issues. Searching for a concept raises problems related both to feature extraction (e.g. invariance to viewpoint, illumination and affine transformations) and to machine learning (robustness, over-fitting, genericity, computation time, etc.).

Our research focuses on building concise and powerful models that make it possible to retrieve objects in large heterogeneous image collections. To this end, we integrated a feature selection algorithm based on both boosting and lasso techniques.

Contrary to most training algorithms used in automatic object recognition, this new algorithm generates sparse models from the complete space of all the local features of the training images. The intuitive idea is to add an extra term to the loss function. This term represents a constraint which causes shrinkage of the solutions towards zero.

Given a loss function $L(y, f(x))$, where $f(x)$ is the sum of base learners $f(x)=\sum_{j}\beta_j h_j(x)$, the objective is to minimize the following function:

$\Gamma(\beta,\lambda)=\sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \cdot \lVert \beta \rVert$ (1)

where $n$ is the number of training examples and $\lambda \ge 0$ is a parameter that controls the amount of shrinkage applied. The larger $\lambda$, the sparser the model. Sparsity is known to offer a good tradeoff between model simplicity and good category representation. In addition, sparsity tends to favor interpretability, which is practical for subsequent human interaction.
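As a minimal sketch of the penalized objective (1), the snippet below evaluates $\Gamma(\beta,\lambda)$ for a toy problem. It rests on several assumptions that are not stated in the text: the norm is taken to be the L1 norm (as the word "lasso" suggests), the loss is an exponential loss chosen only for illustration, and the base learners, sample values and function names are hypothetical.

```python
import math

def predict(beta, learners, x):
    """f(x) = sum_j beta_j * h_j(x)."""
    return sum(b * h(x) for b, h in zip(beta, learners))

def exp_loss(y, fx):
    """Exponential loss; an illustrative choice, not necessarily the one used."""
    return math.exp(-y * fx)

def penalized_objective(beta, lam, learners, samples, loss=exp_loss):
    """Gamma(beta, lambda) = sum_i L(y_i, f(x_i)) + lambda * ||beta||_1."""
    data_term = sum(loss(y, predict(beta, learners, x)) for x, y in samples)
    l1_norm = sum(abs(b) for b in beta)  # assumed L1 (lasso) penalty
    return data_term + lam * l1_norm

# Hypothetical base learners: two decision stumps on a scalar input.
learners = [
    lambda x: 1.0 if x > 0.0 else -1.0,
    lambda x: 1.0 if x > 2.0 else -1.0,
]
# Toy training set of (x_i, y_i) pairs with labels in {-1, +1}.
samples = [(-1.0, -1), (0.5, 1), (1.5, 1), (3.0, 1)]

dense = [1.0, 0.3]   # both learners active
sparse = [1.0, 0.0]  # second coefficient shrunk to zero

for lam in (0.0, 2.0):
    print(lam,
          penalized_objective(dense, lam, learners, samples),
          penalized_objective(sparse, lam, learners, samples))
```

With $\lambda = 0$ only the data term matters; as $\lambda$ grows, every nonzero coefficient adds to the cost, so solutions with fewer active base learners become preferable.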

Preliminary experiments carried out on the Pascal VOC database using two successful state-of-the-art descriptors (SIFT and SURF) are promising.

Automatic image annotation through visual word co-occurrences

Keywords : automatic image annotation, object detection, bag-of-words, visual descriptor.

Participants : Nicolas Hervé, Nozha Boujemaa.

The bag-of-visual-words is a popular representation for images that has proven to be quite effective for automatic annotation.

The main idea behind bag-of-visual-words is to represent an image with a collection of visual patches and to compute a histogram counting the occurrences of these patches as a global signature. This representation can then be used in any learning framework to address the automatic annotation problem. It is simple to implement and provides state-of-the-art performance on several evaluation benchmarks. One of the main characteristics of bag-of-visual-words is its orderless nature: the spatial position of the visual patches is dropped and never used. On the one hand, this choice brings flexibility and robustness to the representation, as it is able to deal with changes in viewpoint or occlusion. On the other hand, the spatial relations between patches could be useful to describe the internal structure of objects or to highlight the importance of contextual visual information for these objects.
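The histogram construction described above can be sketched as follows. This is only an illustration under stated assumptions: the vocabulary is a hypothetical list of centroids (in practice it would typically come from clustering), descriptors are quantized to their nearest centroid with a squared Euclidean distance, and all names and values are invented for the example.

```python
from collections import Counter

def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary centroid closest to the descriptor (squared L2)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(vocabulary)),
               key=lambda k: sq_dist(descriptor, vocabulary[k]))

def bag_of_words(descriptors, vocabulary):
    """Orderless signature: how many patches fall on each visual word."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    return [counts.get(k, 0) for k in range(len(vocabulary))]

# Hypothetical 2-D vocabulary of three visual words.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
# Hypothetical patch descriptors extracted from one image.
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]

histogram = bag_of_words(descriptors, vocabulary)
print(histogram)  # [1, 2, 1]
```

Because only counts are kept, shuffling the patches leaves the signature unchanged, which is the orderless property discussed above.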

We extend this representation in order to include weak geometrical information by using visual word pairs. We choose to consider the co-occurrence of words in a predefined local neighborhood of each patch. Thus, we only consider the distance between two patches, whatever their relative orientation. This way, we include both contextual and structural information in our new visual signature.
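A rough sketch of such a pair signature is given below. It assumes that each patch is reduced to a centre position and a visual word index, that the neighborhood is a disc of fixed radius, and that pairs are unordered; the function name, the radius value and the toy patches are hypothetical.

```python
import math
from collections import Counter

def word_pair_histogram(patches, radius):
    """Count unordered visual word pairs whose patch centres are close.

    patches: list of (x, y, word_id) tuples.
    radius:  size of the local neighborhood (assumed to be a disc).
    Only the distance between two patches is used, not their orientation.
    """
    pairs = Counter()
    for i in range(len(patches)):
        xi, yi, wi = patches[i]
        for j in range(i + 1, len(patches)):
            xj, yj, wj = patches[j]
            if math.hypot(xi - xj, yi - yj) <= radius:
                # Sort the word ids so that (a, b) and (b, a) are the same pair.
                pairs[(min(wi, wj), max(wi, wj))] += 1
    return pairs

# Hypothetical patches sampled on a regular grid with step 8.
patches = [(0, 0, 3), (8, 0, 1), (0, 8, 3), (40, 40, 1)]
pair_counts = word_pair_histogram(patches, radius=10)
print(dict(pair_counts))
```

The double loop is quadratic in the number of patches; with a regular grid and a small radius, a practical implementation would only visit the grid cells around each patch.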

Following our previous work, we choose to extract standard low-level visual patches on a regular grid [20] before creating the pairs and we use SVMs as a learning strategy.

On a standard image database (Pascal VOC 2007), considering the word pairs improves annotation performance by 10%.

Embedding the word pairs in a standard bag-of-visual-words representation brings a very significant improvement for an automatic annotation task. The weak geometrical information they encode is complementary to the standard word occurrence histogram.

This work has been published in [19]. The overall system is described in the thesis [8].

Object retrieval with efficient boosting

Keywords : scalability, relevance feedback, local descriptors, Adaboost, multi-probe locality sensitive hashing, feature selection.

Participants : Saloua Ouertani-Litayem, Alexis Joly, Nozha Boujemaa.

Most recent and effective recognition techniques are based on high-dimensional and sparse representations induced by the large number of local visual features. Classifiers learned on such representations are usually applied to test images one by one, so the complexity in a retrieval context is intrinsically linear in the dataset size. Hence, we explored an efficient boosting strategy to reduce the retrieval complexity when using feature-rich representations of images. For the learning step, we used the AdaBoost algorithm with a weak learner based on distances between training local features. Instead of predicting the scores of the images one by one, we performed $T$ range queries in the dataset $\Omega$ according to the parameters $(\mathbf{v}_t, \theta_t)$ of the $T$ weak classifiers. To do so, we used the a posteriori multi-probe locality sensitive hashing similarity search structure [44]. Each range query returns a set of features $R_t$ such that:

$R_t = \mathrm{range}_{\Omega}(\mathbf{v}_t, \theta_t) = \{\, \mathbf{v} \in V_{\Omega} \mid d(\mathbf{v}, \mathbf{v}_t) < \theta_t \,\}$
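The query-driven prediction can be sketched as below. Several points are assumptions made for illustration and are not taken from the text: the range query is a brute-force scan standing in for the multi-probe locality sensitive hashing structure, $d$ is the Euclidean distance, each weak classifier carries a hypothetical AdaBoost weight `alpha_t`, and an image receives that weight once whenever at least one of its features falls in $R_t$.

```python
import math
from collections import defaultdict

def range_query(features, v_t, theta_t):
    """Indices of the features v with d(v, v_t) < theta_t.

    Brute-force stand-in for an approximate similarity search structure.
    """
    return [idx for idx, (_, v) in enumerate(features)
            if math.dist(v, v_t) < theta_t]

def score_images(features, weak_classifiers):
    """Accumulate weak classifier votes through T range queries.

    features:         list of (image_id, vector) over the whole dataset.
    weak_classifiers: list of (v_t, theta_t, alpha_t); alpha_t is assumed.
    """
    scores = defaultdict(float)
    for v_t, theta_t, alpha_t in weak_classifiers:
        hits = range_query(features, v_t, theta_t)
        # Each image votes at most once per weak classifier.
        for image_id in {features[idx][0] for idx in hits}:
            scores[image_id] += alpha_t
    return dict(scores)

# Hypothetical dataset of local features, tagged with their source image.
features = [
    ("img_a", (0.0, 0.0)),
    ("img_a", (5.0, 5.0)),
    ("img_b", (0.2, 0.1)),
    ("img_c", (9.0, 9.0)),
]
# Hypothetical weak classifiers (v_t, theta_t, alpha_t).
weak_classifiers = [((0.0, 0.0), 0.5, 0.7), ((5.0, 5.0), 1.0, 0.4)]

scores = score_images(features, weak_classifiers)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Images that are never returned by any query simply get no score, so the cost depends on the $T$ queries rather than on a pass over every image.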

Experiments on the Caltech 256 dataset show that the technique is about 250 times faster than the naive exhaustive method, with surprisingly better performance (see Table 2).

Table 2. Mean Average Precision (M.A.P) and prediction time for the 10 studied classes of Caltech 256

                      Exhaustive [46]         Approximate
  Classes             Time (s)    M.A.P       Time (s)    M.A.P
  airplanes           9008.92     0.2037      35.43       0.3881
  american-flag       8935.19     0.2922      35.72       0.3903
  chess-board         9537.36     0.7156      33.21       0.7446
  golf-ball           8908.83     0.1156      39.31       0.2361
  mars                9017.57     0.1603      31.2        0.0909
  motorbikes          9001.33     0.2863      34.43       0.4516
  sunflower           8942.48     0.5797      32.63       0.6214
  swiss-army-knife    1604.16     0.0201      31.52       0.1196
  tennis-racket       8923.87     0.2266      33.08       0.2715
  tower-pisa          8911.52     0.2683      40.01       0.5512
  Means               8279.123    0.2868      34.65       0.3865

We also applied the proposed method to a real-time relevance feedback mechanism based on freely selected image regions. Experiments show that active learning provides significant effectiveness improvements (see Figure 14 and [22], [30] for more details).

Figure 14. Mean Average Precision of the probabilistic prediction method versus exhaustive prediction.

