Section: New Results

Advanced Algorithms of Data Analysis, Description and Indexing

Advanced Image Description Techniques

Morphological processing of images

Participant : Sébastien Lefèvre.

Despite its widespread use in image processing, mathematical morphology has mainly been ignored in the context of multimedia analysis. However, this toolbox makes possible the design of scale-spaces and subsequent global or local image descriptors, which may offer several advantages over the state-of-the-art: efficient algorithms to produce local descriptors with a limited computation time, strategies to ensure compactness of the image description, preservation of edge information through the multiresolution scheme, and finally intrinsic ability to offer invariance to several image transforms. So we have started to study the potential interest of mathematical morphology for image description.

Image Joint Description and Compression

Participants : Ewa Kijak, Joaquin Zepeda.

This is a joint work with the Temics project-team (C. Guillemot).

The objective of this study, initiated in 2007 in collaboration with Christine Guillemot from Temics, is to design scalable signal representation and approximation methods amenable to both compression (that is, with sparseness properties) and description.

In this work, we investigate sparse representation methods for local image description. These methods provide several advantages. First, unlike critically-sampled transforms such as separable wavelet transforms, curvelets and bandlets that have been considered for compression and de-noising applications, they allow the extraction of low-level signal features (points, edges, ridges, blobs) or of local descriptors. Second, sparse representations allow the use of inverted files, which provide a solution to the indexing of high-dimensional data by taking advantage of the properties of sparse vectors. Indeed, document similarity calculations are then carried out efficiently using the scalar products between these sparse vectors. In this context, two approaches have been considered: descriptor sparse representation and signal sparse description.
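As an illustration of why sparse vectors and inverted files fit together, the following minimal sketch scores a query by scalar product while visiting only the dimensions where the query is non-zero. The dictionary-based storage, the function names and the toy data are assumptions made for this example; they are not taken from the system described above.

```python
from collections import defaultdict


def build_inverted_file(sparse_vectors):
    """Map each dimension to the (doc_id, weight) pairs that are non-zero on it."""
    index = defaultdict(list)
    for doc_id, vec in sparse_vectors.items():
        for dim, weight in vec.items():
            index[dim].append((doc_id, weight))
    return index


def score_query(index, query):
    """Accumulate scalar products, visiting only the query's non-zero dimensions."""
    scores = defaultdict(float)
    for dim, q_weight in query.items():
        for doc_id, weight in index.get(dim, ()):
            scores[doc_id] += q_weight * weight
    return dict(scores)


# Toy sparse vectors stored as {dimension: weight} (hypothetical data).
docs = {
    "img_a": {0: 1.0, 5: 2.0},
    "img_b": {5: 1.0, 9: 3.0},
    "img_c": {2: 4.0},
}
index = build_inverted_file(docs)
scores = score_query(index, {5: 0.5, 9: 1.0})
```

Documents that share no non-zero dimension with the query (here `img_c`) are never touched, which is where the efficiency comes from.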

Concerning descriptor sparse representation, we proposed an approach that applies a pursuit-based sparse decomposition to each SIFT descriptor of an image to obtain a sparse vector [28]. The sparsity of the descriptors still enables the use of inverted-file indices. The aim is to tackle the problem of the high dimensionality of SIFT descriptors while retaining the local property of the input descriptors. We compare our approach, in the context of local querying, to the Video Google one, where multiple input SIFT descriptors are aggregated into a single sparse descriptor, resulting in the loss of description locality.
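To make the idea of a pursuit-based decomposition concrete, here is a minimal sketch of plain matching pursuit, one common member of that family. The report does not say which pursuit variant or dictionary was used, so the random unit-norm dictionary, the 128-dimensional input standing in for a SIFT descriptor, and the number of atoms are all assumptions.

```python
import numpy as np


def matching_pursuit(x, dictionary, n_atoms):
    """Greedy sparse decomposition of x over unit-norm dictionary columns."""
    residual = x.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        # Pick the atom most correlated with what is left to explain.
        correlations = dictionary.T @ residual
        best = int(np.argmax(np.abs(correlations)))
        coeffs[best] += correlations[best]
        residual -= correlations[best] * dictionary[:, best]
    return coeffs, residual


rng = np.random.default_rng(0)
D = rng.standard_normal((128, 512))   # hypothetical overcomplete dictionary
D /= np.linalg.norm(D, axis=0)        # normalize each atom to unit norm
x = rng.standard_normal(128)          # stand-in for one 128-D descriptor
coeffs, residual = matching_pursuit(x, D, n_atoms=10)
```

The resulting `coeffs` vector has at most 10 non-zero entries out of 512, so it can be stored in an inverted file like any other sparse vector.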

Concerning signal sparse description, our aim is to adapt existing work in sparse decomposition (designed with compression and prediction in mind) to the construction of image descriptors displaying covariance to the set of admissible transformations. In this context we have identified three problems to address: 1) dictionary design, 2) atom selection method and 3) descriptor comparison method.

Regarding the descriptor comparison method, we introduced a new method to search for approximate nearest neighbors (ANN) under the normalized inner product distance, using sparse image representations. The approach relies on the construction of new sparse image vectors designed to approximate the normalized inner product between the underlying signal vectors. The resulting ANN search algorithm shows a significant improvement over querying with the original sparse query vectors, which is the approach considered in the literature for content-based image search.

Regarding dictionary design, we study a method to construct dictionaries such that a different dictionary is used at each iteration of the decomposition, and such that these iteration-tuned dictionaries satisfy some desirable properties. This method and the associated algorithm gave promising results and should be validated next year for compression and indexing purposes.

NLP techniques for Image Description

Participants : Vincent Claveau, Patrick Gros, Pierre Tirilly.

Natural Language Processing (NLP) and text retrieval techniques can help to describe and retrieve images at two stages:

We worked in 2009 on each of these two stages.

First, we continued the work on the use of weighting schemes [64], [45] and Minkowski distances for visual word-based image retrieval [68] that we initiated in 2008. We showed that the optimal parameters (weighting scheme and distance) of visual word-based systems strongly depend on the kind of query considered. This questions some common habits in visual word-based retrieval, which consider tf.idf as the state-of-the-art weighting scheme and L1 as the state-of-the-art distance. It also shows that using only one dataset to evaluate visual word-based systems, as is the case most of the time, should be avoided, because the properties of the queries can change sharply from one dataset to another, making the results unstable. Finally, this work highlights a correlation between the use of certain weighting schemes and the effect of the Minkowski distance parameter. This result might be particularly interesting in the case of fractional distances [44], which generally offer the best performance but do not satisfy the triangle inequality: using the L1 distance with adapted weights could overcome the triangle inequality limitation while offering the same performance as a fractional distance.
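The triangle inequality issue mentioned for fractional distances can be checked on a three-point toy example. The sketch below is only an illustration with hand-picked points; it does not reproduce the weighting schemes or the experiments discussed above.

```python
def minkowski(u, v, p):
    """Minkowski dissimilarity (sum |u_i - v_i|^p)^(1/p); a true metric only for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


a, b, c = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)

# p = 1 (L1): going directly from a to c is never longer than the detour through b.
l1_direct = minkowski(a, c, 1)
l1_detour = minkowski(a, b, 1) + minkowski(b, c, 1)

# p = 0.5 (fractional): the direct path is longer than the detour,
# so the triangle inequality is violated.
frac_direct = minkowski(a, c, 0.5)
frac_detour = minkowski(a, b, 0.5) + minkowski(b, c, 0.5)
```

Here the fractional dissimilarity between `a` and `c` is 4 while the detour through `b` costs 2, which is why index structures that prune with the triangle inequality cannot be used directly with such distances.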

Then, we worked on high-level image description, using NLP techniques to extract textual image descriptors from the text accompanying images in a large parallel text-image corpus of news articles. We proposed to annotate images containing logos with named entities from suitable categories (names of brands, products, companies, organizations, events and artistic groups). We first developed a fast logo detector relying on the visual word scheme. This detector can achieve 95% detection precision with 60% recall, or 80% precision with 80% recall. We used this detector to select images containing one or more logos, then annotated each of these images with the most frequent suitable named entity appearing in the article that accompanies the image, following a method similar to the one we already used to annotate face images [69]. This annotation method achieves an acceptable annotation accuracy (between 40% and 70%) on our news corpus. We then explored more complex entity selection criteria, using the document frequency or annotation frequency of named entities. We showed that including annotation frequency in entity scores can provide slightly better results than pure frequency, for logo annotation as well as face annotation. This result means that, on news corpora, images tend to contain similar general information rather than varied precise information.

Advanced Data Analysis Techniques

Intensive Use of Factorial Analysis for Text and Textual Streams Mining

Participant : Annie Morin.

Textual data can easily be transformed into frequency tables, and any method working on contingency tables can be used to process them. Besides, given the large amount of available textual data, we need to find convenient ways to process the data and to extract valuable information. It appears that the use of factorial correspondence analysis allows us to get most of the information included in the data. But even after the data processing, we still have a large amount of material and we need visualization tools to display it. We study the relevance of different indicators used to cluster the words on the one hand and the documents on the other, and we are concerned with the visualization of the outputs of factorial analysis: we need to help the user go through the huge amount of information we get and select the most relevant points. Most of the time, we do not pre-process the texts, which means that there is no lemmatization. We have also started exploring temporal changes in textual data, and the first experiments have been done on a newspaper corpus from 1987 to 2003. For the moment, we mainly focus on the visualization of results.

Intensive Use of SVM for Text Mining and Image Mining

Participants : Nguyen Khang Pham, François Poulet.

Support Vector Machines (SVM) and kernel methods are known to provide accurate models, but the learning task usually requires solving a quadratic program, so for very large datasets it requires a large memory capacity and a long time. We have developed new algorithms: a boosting of least squares SVM to classify very large datasets on standard personal computers, and incremental and parallel SVMs. The incremental part of the algorithm spares us from loading the whole dataset into main memory; we only need a small part of the dataset in main memory to build a part of the data model. Then we put together the partial models to get the full one, with the same accuracy as the usual algorithm; this solves the memory capacity problem of SVM algorithms.
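The block-wise principle can be illustrated with a regularized least-squares linear classifier, used here as a simplified stand-in in the spirit of a linear LS-SVM and not as the algorithm developed in this work. Each block contributes only small sums to the normal equations, so the blocks never need to be in memory together, and the result matches the batch solution. The formulation, the regularization constant `c` and the synthetic data are assumptions.

```python
import numpy as np


def accumulate_block(gram, rhs, X_block, y_block):
    """Add one block's contribution to the normal-equation sums (in place)."""
    Z = np.hstack([X_block, np.ones((X_block.shape[0], 1))])  # append bias column
    gram += Z.T @ Z
    rhs += Z.T @ y_block
    return gram, rhs


def solve_model(gram, rhs, c=1.0):
    """Solve (Z^T Z + I/c) w = Z^T y for the weights and the bias."""
    return np.linalg.solve(gram + np.eye(gram.shape[0]) / c, rhs)


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)  # labels in {-1, +1}

d = X.shape[1] + 1
gram, rhs = np.zeros((d, d)), np.zeros(d)
for start in range(0, len(X), 100):  # only one block of 100 rows is used at a time
    accumulate_block(gram, rhs, X[start:start + 100], y[start:start + 100])
w_incremental = solve_model(gram, rhs)

# Reference: the same model computed from the whole dataset at once.
gram_full, rhs_full = accumulate_block(np.zeros((d, d)), np.zeros(d), X, y)
w_batch = solve_model(gram_full, rhs_full)
```

Because the per-block sums are independent of each other, they could also be computed on different machines or devices and added afterwards, which is the property that parallel versions rely on.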

To solve the computational time problem, we have distributed the computation of the data blocks over different computers by means of parallel and distributed algorithms. The first versions of the algorithms were based on a CPU distributed software program; we then used GP-GPU (General Purpose GPU) versions to significantly improve the algorithm speed [11]. The GPU version of the algorithm is 130 times faster than the CPU one. The time needed by usual SVM algorithms like libSVM, SVMPerf or CB-SVM is divided by at least 2500 with one GPU, or 5000 with two GPU cards.

We have extended the least squares SVM algorithm (LS-SVM). The first step was to adapt the algorithm to deal with datasets having a very large number of dimensions (as in text or image mining). Then we applied boosting to LS-SVM for mining, on standard computers, huge datasets having simultaneously a very large number of datapoints and dimensions. The performance of the new algorithm has been evaluated on large datasets from the Machine Learning repository, like Reuters-21578 or Forest Cover Type, and on image datasets. The accuracy is increased on almost all datasets compared to LibSVM.

We are currently studying the same kind of principles (incremental and parallel) with other classification algorithms in order to deal with very large image datasets.

Intensive Use of Data Analysis Methods for Image Mining

Participants : Patrick Gros, Annie Morin, Nguyen Khang Pham, François Poulet.

This work is done with Institut francophone pour l'informatique, Hanoï, Vietnam.

To analyze and retrieve information in image databases, we use the same method as in textual data analysis. This means that, in order to apply Correspondence Analysis (CA) to images, we must define "words" in images. This is usually achieved in two stages: (1) automatically extracting local image descriptors (e.g., SIFT) and (2) vector quantizing them with a clustering algorithm (e.g., k-means clustering) applied to the set of descriptors to form "visual words". Once the visual words are defined, we construct a contingency table by crossing visual words and images. This work is part of the Ph.D. thesis of Pham Nguyen Khang.

We began experiments by applying CA to a small database (961 images). First, a vocabulary of 1000 visual words was computed from the first 60 images using SIFT descriptors (code of D. Lowe) and a k-means algorithm. CA was then applied to a 961 x 1000 contingency table. We kept only the first 30 axes and used them for computing image similarity (with the Euclidean distance). Surprisingly, we found some groups of images belonging to the same categories (toys, houses, Eiffel tower...).
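The pipeline described above (contingency table, CA, a few retained axes, Euclidean distance) can be sketched with the standard SVD formulation of correspondence analysis. The 6 x 8 toy table below, with two groups of images that favor different visual words, is invented for the illustration; the real experiments used a 961 x 1000 table and 30 axes.

```python
import numpy as np


def correspondence_analysis(table, n_axes):
    """Row principal coordinates of a contingency table on the first n_axes axes."""
    P = table / table.sum()           # correspondence matrix
    r = P.sum(axis=1)                 # row masses (images)
    c = P.sum(axis=0)                 # column masses (visual words)
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_axes] * sing[:n_axes]) / np.sqrt(r)[:, None]


# Hypothetical table: images 0-2 mostly use words 0-3, images 3-5 mostly use words 4-7.
table = np.ones((6, 8))
table[:3, :4] += 5
table[3:, 4:] += 5
table += np.arange(48).reshape(6, 8) % 3   # small deterministic variation

coords = correspondence_analysis(table, n_axes=2)
same_group = np.linalg.norm(coords[0] - coords[1])
other_group = np.linalg.norm(coords[0] - coords[3])
```

On this toy table, images from the same group end up closer to each other in the reduced space than images from different groups, which is the behavior exploited for similarity search.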

Motivated by this promising result, we continued our approach on the "caltech4" database (4090 images of 5 categories: faces, airplanes, motorbikes, cars (rear) and backgrounds) [67]. About 3000 descriptors sampled from the set of all descriptors (a third for every category) were clustered to form 2224 visual words. We explored this database on two tasks: image categorization and image retrieval. For the first task, we applied CA to the contingency table representing the database and kept only 7 axes (for comparison with PLSA trained with 7 hidden topics). A k-means algorithm was then invoked to form clusters (categories). The result showed that CA performed slightly better than PLSA.

For the image retrieval task, we compared our approach to PLSA and TF*IDF using the L1 distance, the L2 distance and cosine similarity. In the case of PLSA and CA, retrieval was very fast because the problem dimension was reduced from 2224 to 7. For PLSA and CA, cosine similarity gave better results than the L1 and L2 distances. The performances of CA and PLSA were equivalent, and much better than that of TF*IDF.

We have also proposed a method for scaling up the problem using inverted files based on the quality of representation of images on the axes. Every inverted file is associated with the images that are well represented on the axis to which the file belongs. Given an image query, the search begins by choosing the appropriate inverted files and intersecting them. The similarity computation is done only on the subset of images resulting from the intersection of the inverted files (about 1/5 to 1/8 of the entire database). The performance is degraded by only about 0.3% with respect to the exhaustive method.

Besides, in order to reduce the learning time and to deal with large databases, we have developed an incremental version of the CA algorithm which splits the data into blocks and processes them block after block [34]. The parallelization of this algorithm on a Graphics Processing Unit (GPU) showed that the GPU version performs 20 to 30 times faster than the CPU one [26]. The retrieval accuracy is also improved when integrating contextual information into our index structure. For that, we have integrated the Contextual Dissimilarity Measure [56] into our retrieval platform using inverted files. This integration was explored in two directions: offline (correction terms are computed before the retrieval task) and online (correction terms are computed only on images in the candidate list). Tests carried out on the Nister dataset [60] have shown a significant improvement in accuracy. Our algorithm has been assessed on large-scale datasets: we have merged the Nister dataset with 1 million other images. With our method, only 0.06% of the dataset was explored (in 1/8 second), and that 0.06% contains 86.7% of the relevant images. We have also investigated [30] a new approach for unsupervised classification with random oblique decision trees for very high-dimensional data. Nguyen Khang Pham defended his Ph.D. thesis [10] in November 2009.

Advanced Indexing Algorithms

Indexing for Very Large High Dimensional Spaces

Participant : Laurent Amsaleg.

This work is done in close cooperation with Hervé Jégou from the Lear project-team and is a joint work with researchers from Reykjavík University.

It is well known that high-dimensional nearest-neighbor retrieval is very expensive. Dramatic performance gains are obtained using approximate search schemes, such as the popular Locality-Sensitive Hashing (LSH). Several extensions have been proposed to address the limitations of this algorithm, in particular by choosing more appropriate hash functions to better partition the vector space. All the proposed extensions, however, rely on a structured quantizer for hashing, which fits real data sets poorly and limits their performance in practice.

We have studied the two families of quantization schemes for hashing when used in high-dimensional settings. We conclude that using a k-means unstructured quantizer for hashing significantly improves the accuracy of LSH, as it closely fits the data in the feature space. We have also designed two variants of the k-means approach offering different trade-offs in terms of memory usage, efficiency and accuracy.
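The idea of using a k-means quantizer as a hash function can be sketched as follows: the hash key of a vector is the index of its nearest centroid, and a query is compared only with the vectors stored under its own key. This is a deliberately simplified single-table, single-probe sketch; the number of centroids, the plain Lloyd iterations and the random data are assumptions, and the two variants mentioned above are not reproduced.

```python
import numpy as np


def assign(points, centroids):
    """Hash key of each point: index of its nearest centroid."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(data, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; the learned centroids adapt to the data distribution."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(data, centroids)
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids


def build_buckets(data, centroids):
    """Group vector indices by hash key."""
    buckets = {}
    for i, key in enumerate(assign(data, centroids)):
        buckets.setdefault(int(key), []).append(i)
    return buckets


def query(q, data, centroids, buckets):
    """Approximate nearest neighbor: exhaustive search inside the query's bucket only."""
    key = int(assign(q[None, :], centroids)[0])
    candidates = buckets.get(key, [])
    if not candidates:
        return None
    dists = np.linalg.norm(data[candidates] - q, axis=1)
    return candidates[int(dists.argmin())]


rng = np.random.default_rng(1)
data = rng.standard_normal((2000, 16))
centroids = kmeans(data, k=32)
buckets = build_buckets(data, centroids)
noisy_query = data[42] + 1e-3            # slightly perturbed copy of vector 42
result = query(noisy_query, data, centroids, buckets)
```

Only the vectors in one bucket are compared with the query, at the price of missing neighbors that fall on the other side of a cell boundary; multiple tables or multiple probes are the usual ways to trade memory and time for accuracy.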

The joint work with researchers from Reykjavík University has been partially focused on ways to very efficiently compute local image descriptors as video analysis using local descriptors demands high throughput of the local descriptor creation process. The most practical method to achieve the high throughput is to use the GPUs delivered with most recent computers. We have been working on adapting the computation of the Eff2 descriptors, a variant of SIFT, to the GPU. We have compared our GPU-Eff2 descriptors to SiftGPU, another GPU-based variant of SIFT, and showed that while both GPU-based variants yield similar results, the GPU-Eff2 descriptors require less than half of the processing time required by SiftGPU. Furthermore, our analysis shows that the SiftGPU descriptors also use the CPU, making them less scalable than GPU-Eff2.

Challenging the Security of CBIR Systems

Participants : Laurent Amsaleg, Ewa Kijak.

This work is done in collaboration with Teddy Furon from the Temics project-team.

Content-Based Retrieval (CBR) is the generic term describing the problem of searching for digital material in large multimedia databases. CBR systems are of great diversity: they deal with a plurality of media (text, still images, music, videos) and offer a wide range of querying modes, from simple query-by-example schemes to more complex search processes involving user feedback aiming at best bridging the semantic gap. In a way, CBR systems promote the cultural and historical value of multimedia contents. They make the multimedia databases very useful and their contents reusable, spurring the enrichment of the artistic and cultural patrimony. CBR has proved to be a marvellous technology, recognizing content even when deeply distorted. Overall, CBR systems have so far been used in very cooperative and “friendly” settings where they benefit content providers' business while increasing users' enjoyment of their digital experience.

However, we have recently witnessed another use of this technology. CBR is used to filter multimedia contents in order to protect the creation of the few from the piracy of the many. CBR techniques are used to “clean the Internet”, stopping the upload of copyrighted material on User Generated Contents sharing platforms such as YouTube, or forbidding downloads from P2P networks. Overall, filtering is an application of CBR techniques that is quite different from its primary goal: the environment is now hostile in the sense that filtering restricts users' freedom, controlling and/or forbidding distribution of content.

While the cryptographic community has been investigating the security of systems for years, almost no work addresses this issue in the computer-vision community. We therefore started to consider how security issues can impact the typical components of a complete content-based information retrieval system (CBIRS). We are exploring three avenues of research: (i) a threat analysis where we study the crucial elements to consider for assessing the security of CBIRS; (ii) attacking the core technologies of state-of-the-art CBIRS to discover an initial set of potential security flaws; (iii) trying to attack various specific techniques, at the description level and at the database level, to show that challenging the security of CBIRS is feasible in practice.

Large scale evaluation of global GIST descriptors

Participant : Laurent Amsaleg.

Our work on this topic is done in close collaboration with Matthijs Douze, Hervé Jégou, Harsmirat Sandhawalia and Cordelia Schmid and the Lear project-team.

To build an image index at web scale, each server has to handle 10 to 100 million images. At this scale, it is no longer possible to use local descriptors: the memory usage of the descriptors becomes prohibitive. More importantly, the amount of memory scanned to perform a single search increases, slowing the search down beyond what is acceptable for interactive use. Therefore, we have investigated global descriptors.

We evaluated the search accuracy and complexity of the global GIST descriptor for two applications: same location/object recognition and copy detection. We identified the cases in which a global description can reasonably be used. The comparison is performed against a state-of-the-art bag-of-features representation.

We proposed an indexing strategy for global descriptors that optimizes the trade-off between memory usage and precision. Our scheme provides a reasonable accuracy in some widespread application cases together with very high efficiency: In our experiments, querying an image database of 110 million images takes 0.18 second per image on a single machine.

The system is intended as a rough pre-filter. Images that are preselected (the short-list) can then be reprocessed by a more costly algorithm to produce more accurate results. We evaluated the method on standard datasets used to test image matching methods. The method and the experiments have been published in [20] .

Describing Sequences for Audio/Video Retrieval

Participants : Laurent Amsaleg, Romain Tavenard.

Our work on this topic is done in close collaboration with Guillaume Gravier from the Metiss project-team.

Today, we can exploit rather large databases of still images quite well, and we know how to query them efficiently by content. The next step is to turn our focus to more complex documents, typically video and audio. There are today several description techniques for audio and video, but only very few techniques efficiently perform query-by-content on video or audio databases at large scale. Being able to use such techniques is particularly crucial for professional multimedia archivists.

The state of the art makes such searches possible, but only at a very small scale, i.e., on a very small amount of data. Today, no search technique is efficient enough to allow any practical usage of real-scale audio or video archives. In addition, simply extending existing multidimensional indexing techniques is not possible since they were designed for description schemes in which the concept of sequence is lacking.

We started investigating this issue in 2007. Overall, deciding whether two sequences of descriptors are similar requires clarifying which elements should be compared, and how the comparison should be carried out. We have tried two very different approaches, where the elements to compare were either the descriptors themselves or a new feature based on the whole sequence of descriptors. Directly comparing sequences of descriptors is done using the traditional Dynamic Time Warping approach. It is in fact an a posteriori alignment of the sequences to compare. Here, the similarity of sequences is directly related to the similarity of the descriptions. One of the key points is that computing the optimal alignment is costly in terms of computing time, which is why we investigated ways to approximate the alignment using few computations.
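For reference, the textbook Dynamic Time Warping recurrence can be sketched as below. It fills a table with one cell per pair of elements, which is the quadratic cost that motivates looking for approximations; the approximation schemes investigated in this work are not shown. The scalar sequences and the absolute-difference local distance are assumptions chosen to keep the example small.

```python
def dtw_distance(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Cost of the optimal monotone alignment between two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j]: best alignment cost of the first i items of seq_a
    # with the first j items of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # seq_b element repeated
                                    cost[i][j - 1],      # seq_a element repeated
                                    cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]


reference = [0, 1, 2, 3, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 3, 3, 2, 1, 0]   # same shape, locally slowed down
different = [3, 3, 3, 0, 0, 0, 3]

d_stretched = dtw_distance(reference, stretched)
d_different = dtw_distance(reference, different)
```

The temporally stretched copy aligns with the reference at zero cost, whereas the unrelated sequence does not, which illustrates the tolerance to temporal distortions that makes this kind of alignment attractive for audio and video sequences.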

These initial results encourage us to push the investigations forward. We will look for ways to insert these techniques into large-scale indexing schemes.

We also compared sequence models, where each sequence is modeled using a Support Vector Machine approach used in regression (and not in classification, as is usually done). Each model is in some sense a translation of the temporal behavior of its corresponding sequence. Overall, we have shown that relying on models (instead of relying on descriptors) provides better robustness to severe modifications of sequences, such as temporal distortions. These results were obtained using a sequence collection made of real audio data broadcast on radio. We first tried to use metrics based on cross-similarity estimation to compare models, as direct comparison between models is impossible. Another option we investigated was to build sequence features based on the representability of a sequence with respect to a set of reference models. Such a feature space could then be indexed by any classical indexing technique.

Browsing Personal Image Collections

Participant : Laurent Amsaleg.

Our work on this topic is done in close collaboration with Kári Harðarson from the University of Reykjavík.

The Database Lab at Reykjavík University is currently writing a photo browser. One of the main ideas being tested is to use the location of thumbnails on screen to indicate properties of the underlying photos. Users can select which properties of photographs map to which attributes of the thumbnail: its location on screen, its brightness, its size, etc.

One of the uses for this feature is to let the computer group together on screen pictures that may show the same people. The user can see at a glance whether a cluster of photos contains the same person or whether he needs to drag some photos to a different cluster. If he does, the browser notifies the face recognition module that the photos did not portray the same person so that it can learn from it.

This mechanism is general and could be used to classify any property the images may have but using it to represent the results of facial recognition seems promising.

During the preceding years, we focused on the underlying database and on the on-screen presentation. We did not have, however, a module that would recognize faces and return the degree of likeness between faces as a numeric value that could be displayed.

We used the financial support of the Eff2 Associate Team to send an intern to Reykjavík University to develop such a tool and integrate the resulting software modules into the current browser prototype. The intern had to find and evaluate an already available face recognition library, and then find a way to package it so that our browser could use it to decide where to place thumbnails on the screen.

He finished the project successfully and handed in a module before he left Iceland. Although the browser has not been connected to the library yet, we have seen a demonstration where the library is informed of pictures to scan and returns information about the location of faces in the pictures and their respective likenesses. When fall semester ends, the browser work will continue and the library will be connected to the browser sometime in the spring.

In addition, it is worth noting that another French intern did some work on that image browser, paid, however, by Reykjavík University. Both works make a consistent story: in addition to face recognition, the second project was to design a "slideshow" presentation module which takes a set of photos and analyzes them to look for similar photos. Photos that have something in common are then displayed in sequence during the presentation. If many photos have similar subjects, they are reduced in size and shown side by side during the show. Even though the pattern recognition is primitive, it nevertheless serves to make the slide presentations more pleasing to the eye, and it looks as if someone had planned them carefully.

