Team Imedia

Section: New Results

Interactive retrieval and navigation

Relevance feedback on local image features, with application to the identification of plant species

Keywords : relevance feedback, local features, feature selection, active learning.

Participants : Wajih Ouertani, Nozha Boujemaa, Michel Crucianu.

The characterization, evaluation and use of plant biodiversity are based on the precise and efficient identification of its components, especially of the species. The identification keys derived from systematic botany mainly rely on characteristics that are ineffective in many real-world situations. The development of species inventories, of community ecology and of the monitoring of self-propagating plants is limited because it requires the active and continuing involvement of the very few highly specialized botanists. The collaboration between the UMR AMAP and the IMEDIA team aims to address this challenge by exploiting image analysis and recognition in a generic interactive species identification system. Since the identification process should be interactive, we decided to further explore relevance feedback on sets of local image features that describe regions of interest of an image. In the case under focus here, such regions would correspond to plant organs whose attributes are potentially relevant for identification. It should be noted that the problem we address is very difficult, since there is significant variability in pose and the relevant plant organs often correspond to sets of patches that are scattered in a region of interest.

During the first year of Wajih Ouertani's PhD, we studied and tested state-of-the-art kernels for matching sets of vectors, in order to extend SVM-based relevance feedback to the use of local features. The first experiments concerned the Pyramid Match Kernel (PMK) [43] , for which interesting results are reported in the literature on object class recognition. PMK is based on a hierarchical uniform quantization of the feature space and represents a set of local features as a multi-resolution histogram. The kernel is obtained as a modified histogram intersection, with level-specific weights. Our experiments on the Graz-02 dataset with several descriptors, including SIFT, show that there are two significant problems with this approach. The first is related to PMK and more specifically to the construction of the hierarchy. Consider a quantization level that is too coarse to separate local features that have low similarity. When moving from this level to the next, more refined level, each quantization interval is divided by two in all dimensions. If the dimension of the description space is high, the resulting quantization intervals are too small, and local features that should be considered similar actually fall in different intervals. To address this issue, we investigated the random histogram representation of feature sets [39] , associated with a linear kernel; this representation appears to be less affected by this kind of problem.
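The pyramid match idea described above can be sketched in a few lines. This is only an illustrative sketch, not the code used in our experiments: it assumes non-negative feature coordinates, bins of side 2^i at level i, and a weight of 1/2^i for matches that first appear at level i (a common choice in the PMK literature). The function names are hypothetical.

```python
from collections import Counter
from math import floor


def histogram(points, side):
    """Count points per bin of a uniform grid with the given bin side."""
    return Counter(tuple(floor(c / side) for c in p) for p in points)


def intersection(h1, h2):
    """Histogram intersection: number of points matched within shared bins."""
    return sum(min(n, h2[b]) for b, n in h1.items())


def pyramid_match(xs, ys, levels):
    """Weighted sum of the new matches found at each, increasingly coarse, level."""
    score = 0.0
    previous = 0
    for i in range(levels + 1):
        side = 2 ** i  # assumption: bin side doubles at each level
        matched = intersection(histogram(xs, side), histogram(ys, side))
        score += (matched - previous) / side  # assumption: weight 1 / 2^i
        previous = matched
    return score


xs = [(0.2, 0.3), (5.1, 5.2)]
ys = [(0.4, 0.1), (7.5, 7.9)]
similarity = pyramid_match(xs, ys, levels=3)
```

In this toy example the first pair of points matches at the finest level, while the second pair only matches once the bins have side 4. In a d-dimensional space each refinement splits a bin into 2^d sub-bins, which is the source of the quantization problem discussed above.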

The second problem is not specific to PMK but rather follows from the fact that all the local features selected in a region of interest are used as a single (positive or negative) example. Set kernels tend to “bind” the local features of a set together, so it becomes harder to ignore those that are actually irrelevant (e.g. because they come from the background) or to be robust to strong occlusions. We are exploring solutions based on replacing a large set of local features with several localized subsets of features.
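One simple way to obtain localized subsets is to group the local features of a region by the cell of a regular grid in which their keypoint falls. The sketch below only illustrates the idea; the grid-based grouping, the `(x, y, descriptor)` layout and the function name are assumptions, not the solution retained in this work.

```python
from collections import defaultdict


def localized_subsets(features, cell_size):
    """Group (x, y, descriptor) features by the grid cell containing (x, y)."""
    cells = defaultdict(list)
    for x, y, descriptor in features:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append(descriptor)
    return dict(cells)


# Hypothetical features: keypoint position followed by a descriptor placeholder.
features = [(3, 4, "a"), (8, 2, "b"), (12, 3, "c"), (14, 18, "d")]
subsets = localized_subsets(features, cell_size=10)
```

Each subset can then be used as a separate example, so that a subset dominated by background features no longer affects the whole region.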

Another issue is object/noise separation: since the relevant plant organs often correspond to sets of patches that are scattered in a region of interest, many of the local features falling in the region selected by the user actually belong to the background or have in their description a strong influence from the background. It is then necessary to find appropriate feature selection solutions in order to reduce the level of such noise.

As part of our work, several programs and software modules were developed to handle, integrate and evaluate this type of feedback. In order to evaluate the performance in the target application, a botanical database with a local ground truth (region-based annotations) was prepared by AMAP using annotation software developed by IMEDIA.

Multi-criteria classification for Plant Identification

Keywords : computational botany, multi-criteria classification.

Participants : Hervé Goëau, Donald Geman, Nozha Boujemaa.

In the frame of the Pl@ntNet project, we began to work on classification methods that help botanists identify plant species. One recently started line of investigation concerns classification based on multiple biological criteria. Indeed, botanists are used to observing and analysing specimens according to various visual aspects, “characteristics” or “biological criteria”, in order to determine the taxonomy of a plant and to distinguish plants from one another.

Figure 10. Multi-criteria description of a plant of the species (family - genus - species) “Ebenaceae Diospyros Elliotii” (flora of Mali - Ph. Birnbaum). Each picture focuses on one or several botanical characteristics such as “leaf”, “bark”, “petiole”, “limbe apex”, .... These multiple views represent one single sample of the species, and other samples may describe other botanical characteristics not shown in this sample.

Figure 10 shows an example of this biological description for one specimen. This sample, an “Ebenaceae Diospyros Elliotii” plant in its natural environment, is represented by several pictures, each annotated with a set of labels, i.e. usual botanical characteristics such as “bark”, “flowers”, “inflorescence”, “limb marge”, “leaf”, “petiole”, .... In this botanical context, these annotations lead to an original image classification problem where each individual sample in the training data is represented by several multi-labeled pictures. Moreover, each class (i.e. each species) is represented by several specimens which do not necessarily cover the same botanical characteristics. Furthermore, this is a challenging classification problem because even the flora of a limited geographical area can contain several hundred species.
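The structure of such training data can be made concrete with a small sketch. The class names, fields and file names below are hypothetical and only illustrate the "one specimen, several multi-labeled pictures" organization described above; they are not the data model of the project.

```python
from dataclasses import dataclass, field


@dataclass
class Picture:
    path: str
    labels: frozenset  # botanical characteristics visible in this picture


@dataclass
class Specimen:
    species: str
    pictures: list = field(default_factory=list)

    def characteristics(self):
        """Union of the labels of all pictures of this specimen."""
        covered = set()
        for picture in self.pictures:
            covered |= picture.labels
        return covered


def coverage_by_species(specimens):
    """Map each species to the characteristics covered by its specimens."""
    coverage = {}
    for specimen in specimens:
        coverage.setdefault(specimen.species, set()).update(specimen.characteristics())
    return coverage


# Two hypothetical specimens of the same species covering different characteristics.
s1 = Specimen("Diospyros Elliotii", [Picture("s1_a.jpg", frozenset({"leaf", "petiole"})),
                                     Picture("s1_b.jpg", frozenset({"bark"}))])
s2 = Specimen("Diospyros Elliotii", [Picture("s2_a.jpg", frozenset({"flowers"}))])
coverage = coverage_by_species([s1, s2])
```

The example makes the difficulty visible: two specimens of the same class may share no characteristic at all, so they cannot be compared picture by picture.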

First investigations are centered on a hierarchical classification model, an extension of previous work on information fusion [42] . This classification method combines the visual signatures of the partial and complementary views of the known species with the annotations of botanical experts. Our future challenge is to take into account the botanical expertise of a user in an interactive approach in order to improve classification performance.

Logo retrieval with a contrario visual query expansion

Keywords : visual query expansion, logo retrieval, a contrario, geometric consistency, SIFT, LSH.

Participants : Alexis Joly, Olivier Buisson [ INA ] .

In the scope of a use case of the VITALAS European project, we worked on a new content-based retrieval framework applied to logo retrieval in large natural image collections. The first contribution is a new challenging dataset, called BelgaLogos , which was created in collaboration with professionals of the BELGA press agency in order to evaluate logo retrieval technologies in real-world scenarios. The dataset and baseline results have been made available to the community on a dedicated web page, and exchanges with other partners have started on the topic.

Figure 11. Some images of the BelgaLogos dataset

The second and main contribution is a new visual query expansion method that uses an a contrario thresholding strategy in order to improve the accuracy of expanded query images. Whereas previous methods based on the same paradigm used a purely hand-tuned fixed threshold, we provide a fully adaptive method that improves both genericity and effectiveness. This new technique has been evaluated on both the OxfordBuilding dataset and our new BelgaLogos dataset. The results show that the proposed technique outperforms both the baseline method and the previous state-of-the-art visual query expansion method. Mean Average Precision results on the BelgaLogos dataset are provided in Table 1 . More details can be found in [21] .
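To give an intuition of what an a contrario threshold looks like, the sketch below shows the generic principle: a detection is accepted when its expected number of false alarms (NFA) under a background model falls below a bound epsilon, so the threshold adapts to the number of candidates. The binomial background model, the parameters `p_chance` and `n_tests`, and the function names are assumptions made for illustration; the actual formulation is the one given in [21] .

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X following a Binomial(n, p) distribution."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


def adaptive_threshold(n_matches, p_chance, n_tests, epsilon=1.0):
    """Smallest number of consistent matches whose NFA is below epsilon.

    Returns None when no count in 0..n_matches satisfies the bound.
    """
    for k in range(n_matches + 1):
        nfa = n_tests * binomial_tail(n_matches, k, p_chance)
        if nfa < epsilon:
            return k
    return None


# Hypothetical setting: 20 candidate matches, each consistent by chance with
# probability 0.1, evaluated over collections of different sizes.
small = adaptive_threshold(20, 0.1, n_tests=10)
large = adaptive_threshold(20, 0.1, n_tests=10000)
```

With such a rule, the required number of consistent matches grows with the number of tested candidates instead of being fixed by hand.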

Table 1. BelgaLogos results (Mean Average Precision)

                    Baseline          Qexp a contrario
Logo name          Qset1   Qset2      Qset1   Qset2
Adidas               7.8     0.7       13.3     0.7
Adidas-text          5.6     1.1        7.8     1.1
Base                14.4    38.9       21.5    58.2
Bouygues            18.2    11.3       18.6    15.3
Citroën              6.1     4.5       38.4     4.5
Citroën-text         5.3     0.1       18.8     0.1
CocaCola            23.0     0.1       48.6     0.1
Cofidis             26.0    55.2       26.6    65.3
Dexia               16.6    29.3       24.0    51.3
Ecusson              1.1     0.1        5.9     0.1
Eleclerc            78.1    74.1       80.6    80.1
Ferrari             24.7     7.5       41.4    17.5
Gucci               50.0     0.0       50.0     0.0
Kia                 32.8    61.3       67.5    75.6
Mercedes             9.7    18.5       15.0    19.2
Nike                 1.4     1.2        3.5     2.6
Peugeot             20.0    20.7       20.2    23.2
US President        64.3    60.3       96.6   100.0
Puma                 8.6     2.2       20.0     2.2
Puma-text           51.6     0.7       56.6     0.7
Quick               24.4    39.0       41.4    56.6
Roche               50.0     0.2       50.0     0.2
SNCF                33.3    27.9       35.4    33.7
StellaArtois        32.7    31.8       39.3    43.4
TNT                 22.5     2.5      33.54     4.4
VRT                 11.1     5.8      12.53    11.2
All                 20.8    19.0      34.11    25.7
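For reference, Mean Average Precision can be computed as sketched below. This is the usual non-interpolated definition, assuming each ranked list contains no duplicate item; the exact evaluation protocol used for Table 1 is the one described in [21] , and the function names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values measured at the rank of each relevant item."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    # Relevant items that are never retrieved contribute zero precision.
    return total / len(relevant)


def mean_average_precision(runs):
    """Average of the average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


ap = average_precision(["a", "x", "b", "y"], {"a", "b"})
map_score = mean_average_precision([(["a"], {"a"}), (["x", "a"], {"a"})])
```

In the first example the relevant items are found at ranks 1 and 3, giving precisions 1 and 2/3 and an average precision of 5/6.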

Video database navigation

Keywords : navigation, video, key frame, clustering.

Participants : Raffi Enficiaud, Alexis Joly, Olivier Buisson [ INA ] .

In the scope of the VITALAS project, we developed a graphical interface that exploits the temporal relationships of images within videos. The tests were conducted on a database of 10 hours of news videos (approximately 75000 images). The interface combines the classical similarity search of Maestro with the temporal information available from news events. Based on this information, we allow a user to navigate more efficiently through a large collection of audio-visual data. Indeed, the navigation combines image similarity search with the temporal relationships between images by proposing two views. The first one shows an unordered similarity search result and, for each image, the videos in which it appears and its time stamp within each video. The events that are closer to the beginning of the videos are, for instance, more related to the hot news. The second view shows the main topics and the key frames of the video associated with the selected images, along with their temporal occurrence. The main topics are drawn from a clustering of the whole database. The clusters with a large number of elements are considered as structuring each video (reports, interviews, jingles...), while clusters of smaller size are considered as providing information on the topics covered by the videos. In the example shown in figure 12 , we used maps as entry points to the semantic content of news reports, which are in this case the events related to identifiable and geographically located parts of the world. The first view, on the left, reveals that the first map is used in three different videos in the 10 hours, and might be a location covered by a series of reports. By selecting an image, a time-line view (on the right) is shown. This view stresses the contents co-occurring with the map within the same time period. It may then provide information on events, people, polls or popular opinions that, to some extent, are related to this geographical event and might be hard to infer from visual similarity alone.
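The distinction between structuring clusters and topic clusters can be illustrated as follows. The size threshold, the frame and cluster identifiers and the function name are hypothetical; the sketch only shows the size-based split described above, not the clustering itself.

```python
from collections import Counter


def split_clusters_by_size(cluster_of_frame, min_structural_size):
    """Separate large (structuring) clusters from small (topic) clusters.

    cluster_of_frame maps a key frame identifier to its cluster identifier.
    """
    sizes = Counter(cluster_of_frame.values())
    structural = {c for c, n in sizes.items() if n >= min_structural_size}
    topical = set(sizes) - structural
    return structural, topical


# Hypothetical assignment of six key frames to three clusters.
assignment = {"f1": "studio", "f2": "studio", "f3": "studio", "f4": "studio",
              "f5": "map", "f6": "stadium"}
structural, topical = split_clusters_by_size(assignment, min_structural_size=3)
```

Here the recurring studio shots end up in the structuring group, while the map and stadium frames are kept as topic indicators.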

We demonstrated the functionalities of this interface during the VITALAS annual review.

Figure 12. Screenshots of the video navigation tool. Left: panel showing a classical similarity search on a news map, along with the videos in which it appears. Right: time-line view of two videos, each of them containing several maps. The upper side contains the key images of the database, arranged in a temporal manner. The small images show the key frames of the video. The lower side shows the temporal occurrences of these images.

Scene Pathfinder: Unsupervised Clustering Techniques for Movie Scenes Extraction

Keywords : video segmentation, scene detection, shots clustering.

Participants : Mehdi Ellouze, Nozha Boujemaa, Adel Alimi [ ENIS, Tunisia ] .

Demand for watching movies keeps growing with the spread of the Internet and the increasing popularity of video-on-demand (VOD) services. The large mass of movies stored on the Internet or on VOD servers needs to be structured to speed up browsing. We propose in [10] a new system called "The Scene Pathfinder" that segments movies into scenes, giving users non-sequential access so that they can watch particular scenes of a movie. This helps them judge the movie quickly and decide whether to buy or download it, avoiding wasted time and money. The proposed approach is multimodal (see also [48] , [47] , [38] ): we use both visual and auditory information to perform the segmentation. We rely on the assumption that every movie scene is either an action or a non-action scene. Non-action scenes are generally characterized by static backgrounds and take place in a single location. For this reason, we rely on content information and on a Kohonen map to extract this kind of scene (shot agglomerations). Action scenes are characterized by high tempo and motion. For this reason, we rely on tempo features and on Fuzzy C-Means to classify shots and to localize the action zones. The two processes are complementary. Indeed, the over-segmentation that may occur when action scenes are extracted from content information alone is repaired by the fuzzy clustering. Our system has been tested on a varied database, and the results obtained show the merit of our approach (compared to [38] ) and support our assumptions.

Figure 13. The Scene Pathfinder, the general framework.

In figure 13 , we present our framework. We divide the scenes of a movie into two main classes: action scenes and non-action scenes. To detect non-action scenes (dialog, monolog, landscape, romance...) we use content information and a Kohonen map to discover agglomerations of shots (scenes) that share backgrounds and objects. On the other hand, we use audio-visual tempo features and the Fuzzy C-Means classifier to delimit the core of action scenes (fight, car chase, war, gunfire...) and to remedy the over-segmentation that may occur in action scenes.
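As an illustration of the fuzzy clustering step, the sketch below runs the standard Fuzzy C-Means updates on one-dimensional values. It assumes a single scalar tempo value per shot, a fuzzifier m = 2 and a deterministic initialization; the actual tempo features and settings are those described in [10] , and the function names are hypothetical.

```python
def membership(value, centers, m):
    """Fuzzy memberships of one value to each cluster center."""
    dists = [abs(value - c) for c in centers]
    if 0.0 in dists:
        # The value coincides with at least one center: share full membership.
        zeros = [d == 0.0 for d in dists]
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    inverse = [d ** -power for d in dists]
    total = sum(inverse)
    return [w / total for w in inverse]


def fuzzy_c_means_1d(values, n_clusters=2, m=2.0, n_iter=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    lo, hi = min(values), max(values)
    # Assumption: centers start evenly spread over the data range (n_clusters >= 2).
    centers = [lo + (hi - lo) * i / (n_clusters - 1) for i in range(n_clusters)]
    for _ in range(n_iter):
        u = [membership(v, centers, m) for v in values]
        centers = [
            sum(row[i] ** m * v for row, v in zip(u, values))
            / sum(row[i] ** m for row in u)
            for i in range(n_clusters)
        ]
    return centers, [membership(v, centers, m) for v in values]


# Hypothetical tempo values for six shots: three calm, three fast.
tempo = [0.10, 0.15, 0.20, 0.90, 0.95, 1.00]
centers, memberships = fuzzy_c_means_1d(tempo)
```

Shots whose membership to the high-tempo cluster is large would be candidates for the core of an action zone, while the soft memberships leave room for ambiguous shots at scene boundaries.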