
Section: New Results

Image Retrieval for Large Databases

Our work on image description does not aim at finding new general descriptors: the IMEDIA and LEAR teams are very active in this field, and we use their results. The originality of our work comes from the size of the databases we want to handle. In large databases, most images will be compressed. Is it possible to describe an image without decompressing it? Without sticking too closely to the JPEG 2000 format, we try to find new description schemes based on the wavelet decomposition of images. This is our first direction of research.
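
As an illustration of this first direction, a descriptor can be built directly from the energies of wavelet subbands, i.e., from quantities a wavelet codec already manipulates. The sketch below uses a plain Haar transform rather than the actual JPEG 2000 filters, and is only a toy version of the idea:

```python
import numpy as np

def haar_level(img):
    """One level of a 2-D Haar wavelet transform.
    Returns the approximation and the three detail subbands (LH, HL, HH)."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # vertical average
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # vertical difference
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, (lh, hl, hh)

def wavelet_energy_descriptor(img, levels=3):
    """Describe an image by the energy of each wavelet subband,
    i.e. by quantities a wavelet codec computes anyway."""
    img = img.astype(np.float64)
    feats = []
    for _ in range(levels):
        img, details = haar_level(img)
        feats.extend(float(np.mean(band ** 2)) for band in details)
    feats.append(float(np.mean(img ** 2)))    # coarsest approximation
    return np.array(feats)

rng = np.random.default_rng(0)
img = rng.random((64, 64))
desc = wavelet_energy_descriptor(img)
print(desc.shape)   # 3 detail bands x 3 levels + 1 approximation
```

The point of the sketch is that such a descriptor is computed entirely in the transform domain, so no full decompression is needed.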

A second direction concerns the combination of descriptors: when documents are described by many descriptors, how should a query be processed in order to provide an answer as fast as possible? To answer this question, we study the information that each descriptor provides about the others. The aim is to determine the order in which the descriptors should be considered, using data mining techniques applied to visual descriptors.
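
A possible way to measure what one descriptor tells about another, as a basis for such an ordering, is the empirical mutual information between quantized descriptors. The sketch below is illustrative only; the descriptor names and their correlation structure are invented for the example:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete label arrays."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        p = c / n
        mi += p * np.log(p * n * n / (px[a] * py[b]))
    return mi

# Hypothetical example: three quantized descriptors over the same 1000 images.
rng = np.random.default_rng(1)
color = rng.integers(0, 4, 1000)
texture = (color + rng.integers(0, 2, 1000)) % 4   # correlated with color
shape = rng.integers(0, 4, 1000)                   # independent of color

# A descriptor that is informative about another is a good one to process first.
print(mutual_information(color, texture) > mutual_information(color, shape))
```

Descriptors could then be ordered greedily, picking next the one with the highest information gain given those already examined.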

The third direction tackles the problem of indexing and retrieving large collections of descriptors. With a local description scheme, one million images can give rise to 600 million descriptors, and retrieving any information among such an amount of data requires very fast access techniques, whatever the purpose of the access.

A fourth direction stems from our collaboration with the roboticists of the LAGADIC team. They work on visual servoing, and using a database is a good way to extend the applicability of their techniques to large displacements. Our description technique appears particularly well suited to such an application, where a matching between images is required, not only a global similarity link between images.

Image Description, Compression and Watermarking

Keywords : Image Indexing, Image Description, Image Compression.

Participants : Patrick Gros, François Tonnin.

This is a joint work with the TEMICS team (C. Guillemot).

During the last two decades, image representations obtained with various transforms, e.g., the Laplacian pyramid, separable wavelet transforms, curvelets and bandlets, have been considered for compression and de-noising applications. Yet, these critically-sampled transforms do not allow the extraction of low-level signal features (points, edges, ridges, blobs) or of local descriptors. Many visual tasks, such as segmentation, motion detection, object tracking and recognition, and content-based image retrieval, require the prior extraction of these low-level features. The Gaussian scale space is almost the only image representation used for this detection problem. Managing large databases is therefore difficult, as extracting features first requires decompressing the whole database and then converting the images into the Gaussian scale space.

It is thus desirable to find representations suitable for both problems, compression and signal feature extraction, but their design criteria are somewhat antagonistic. Feature extraction requires the image representation to be covariant under a set of admissible transformations, which ideally is the set of perspective transformations. Reducing this set to the group of isometries, and adding the constraint of causality, the image representation is uniquely characterized as the Gaussian scale space. From a compression perspective, one seeks to reconstruct the image from a minimal amount of information, provided by quantized transform coefficients. Thus, the image representation should be sparse and critically sampled (or minimally redundant), and transform coefficients should be as independent as possible. However, critically-sampled representations suffer from shift variance and are thus not adapted to feature extraction.
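
For concreteness, the Gaussian scale space and its use for feature detection can be sketched as follows: build a stack of increasingly blurred images, take differences of adjacent scales (DoG), and keep the local extrema. This is the textbook construction, not the project's code:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian filtering, the basic step of the Gaussian scale space."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, r, mode='edge')
    tmp = np.apply_along_axis(lambda row: np.convolve(row, k, 'valid'), 1, pad)
    return np.apply_along_axis(lambda col: np.convolve(col, k, 'valid'), 0, tmp)

def dog_keypoints(img, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=0.004):
    """Difference-of-Gaussians extrema: a standard scale-space feature detector."""
    stack = np.stack([gaussian_blur(img, s) for s in sigmas])
    dog = stack[1:] - stack[:-1]                  # DoG slices of one octave
    pts = []
    for s in range(1, dog.shape[0] - 1):
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                v = dog[s, y, x]
                cube = dog[s-1:s+2, y-1:y+2, x-1:x+2]
                if abs(v) > thresh and (v == cube.max() or v == cube.min()):
                    pts.append((y, x, sigmas[s]))
    return pts

img = np.zeros((32, 32))
img[16, 16] = 1.0
img = gaussian_blur(img, 2.0)            # a single bright blob
pts = dog_keypoints(img)
print(pts)                               # the blob is detected at the centre
```

Note that this pipeline starts from raw pixels, which is exactly why feature extraction from a compressed database normally forces a full decompression first.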

This year, the collaboration between TEMICS (Christine Guillemot) and TexMex (Patrick Gros) through the thesis of François Tonnin came to an end with the defense of the thesis in June. This work led to the design of a feature point extractor and of a local descriptor in the signal representations given by over-sampled steerable transforms. Although steerable transforms, thanks to their covariance under translations and rotations and to their angular selectivity, provide signal representations well suited to feature point and descriptor extraction, the opposite constraints of image description and compression were not fully reconciled.

The final problem is rarely addressed in the literature and consists in the proper quantization of transformed coefficients for both good image reconstruction and preservation of description quality. As the transform is redundant, one image has many possible representations. We first used POCS (Projection Onto Convex Sets) to find a sparser representation, and we adapted the classical technique in order to preserve the content of the neighborhoods of extracted points. Then, we designed a compression scheme allowing the reconstruction of steerable coefficients from the information required by description, which is reduced to an energy and an orientation coefficient for each spatial point. The final step is the quantization of these coefficients. This compression scheme allows us to detect illegal copies in image databases compressed at one bit per pixel.
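
The reduction to an energy and an orientation coefficient per spatial point, followed by scalar quantization, can be illustrated as follows. The sketch substitutes plain image gradients for the actual steerable coefficients, so it only mimics the structure of the scheme:

```python
import numpy as np

def energy_orientation(img):
    """Reduce each spatial point to an energy and an orientation coefficient.
    Here they are computed from simple image gradients, as a stand-in for the
    steerable coefficients used by the actual scheme."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.hypot(gx, gy), np.arctan2(gy, gx)

def quantize(a, n_levels, lo, hi):
    """Uniform scalar quantization to n_levels bins over [lo, hi];
    returns the reconstruction values (bin centres)."""
    step = (hi - lo) / n_levels
    idx = np.clip(((a - lo) / step).astype(int), 0, n_levels - 1)
    return lo + (idx + 0.5) * step

rng = np.random.default_rng(2)
img = rng.random((16, 16))
e, o = energy_orientation(img)
e_hat = quantize(e, 16, 0.0, float(e.max()))   # 4 bits of energy per point
o_hat = quantize(o, 16, -np.pi, np.pi)         # 4 bits of orientation per point

# A gradient field can be re-synthesized from the two quantized coefficients.
gx_hat = e_hat * np.cos(o_hat)
gy_hat = e_hat * np.sin(o_hat)
```

The quantization error is bounded by half a bin width, which is the trade-off the scheme tunes between bit rate and description quality.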

Videos appear to be the next challenge. On the one hand, HDTV represents sets of images even bigger than what is present in most still image collections. On the other hand, some functionalities are required by the professionals of the domain in order to develop their services: scalable coding to allow an easy distribution of the content on many platforms, copy detection as a complementary tool to DRM. These aspects form the core of the ICOS-HD project that will begin in 2007.

Scalability of Local Image Descriptors

Keywords : Local image descriptors, scalability, PvS-framework, high-dimensional indexing, median rank aggregation.

Participants : Laurent Amsaleg, Hervé Jegou.

This is a joint work with researchers from Reykjavík University. This work is done in the context of the INRIA Associate teams program. This program links two research teams (one INRIA, one foreign) willing to cross-leverage their respective excellence and their complementarity. Björn Þór-Jónsson (Associate Professor) leads the team of researchers involved in Iceland.

With the proliferation of digital media and online access, multimedia retrieval, and image retrieval in particular, is growing in importance. The computer vision community has recently started a trend towards advanced image description schemes using local descriptors (e.g., SIFT). The applications of local descriptor schemes include face recognition, shape recognition and image copyright protection. With these schemes, each image yields many descriptors (several hundred for high-quality images), where each descriptor describes a small "local" area of the image. Two images are typically considered similar when many of their descriptors are found to be similar.
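
The matching rule just described (two images are similar when many of their descriptors are) can be sketched with a simple voting procedure. This brute-force version is for illustration only; at any realistic scale the linear scan must be replaced by an index:

```python
import numpy as np

def match_images(query_descs, db):
    """Vote-based matching: each query descriptor votes for the image owning
    its nearest database descriptor; images are ranked by vote count.
    `db` is a list of (image_id, descriptor_matrix) pairs."""
    all_descs = np.vstack([d for _, d in db])
    owners = np.concatenate([[img_id] * len(d) for img_id, d in db])
    votes = {}
    for q in query_descs:
        nearest = int(np.argmin(np.linalg.norm(all_descs - q, axis=1)))
        img = owners[nearest]
        votes[img] = votes.get(img, 0) + 1
    return sorted(votes.items(), key=lambda kv: -kv[1])

# Synthetic example: the query is a noisy crop of image "a".
rng = np.random.default_rng(3)
a = rng.random((50, 8))
b = rng.random((50, 8)) + 10.0                   # far away in descriptor space
query = a[:20] + rng.normal(0, 0.01, (20, 8))    # noisy copies of a's descriptors
ranked = match_images(query, [("a", a), ("b", b)])
print(ranked[0])
```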

All of these approaches, however, have only been studied and compared at a small scale. Typically, less than 10,000 images are used, which makes it hard to predict how they will perform with collections of hundreds of thousands of images or more.

There are three primary issues associated with the scalability of image descriptors.

Our work addresses all three issues.

In [84], we demonstrated that our PvS-framework achieves efficient query processing for large collections of local descriptors. We therefore decided to compare three major local descriptor schemes (SIFT, PCA-SIFT and RDTQ) to study their recognition power at large scale. This comparison included a fourth scheme that we designed, called Eff2. Using a collection of almost thirty thousand images, we showed that our new descriptor scheme gives the best results in almost all cases. We then gave two stop rules to reduce query processing time and showed that in many cases only a few query descriptors must be processed to find matching images. Finally, we tested our descriptors on a collection of over three hundred thousand images, resulting in over 200 million local descriptors, and showed that even at such a large scale the results are still of high quality, with no change in query processing time [32], [31].

This collection was obtained through a formal agreement of cooperation, signed in December 2005, between IRISA, Reykjavík University and Morgunbladid, the main newspaper in Iceland. The agreement defines the terms under which Reykjavík University and the TexMex team can access the image collection of Morgunbladid, while Morgunbladid will have use of PvS software. The collection consists of about 300,000 high-resolution images, delivered to us after being thumbnailed to 512x512 pixels, which is sufficient for performing extensive recognition-based performance measurements. We can keep the images for two years and must then destroy them; their descriptions, however, can be kept as long as needed for research and development purposes, since this format does not allow any presentation or reconstruction of the images. It is extremely difficult to get access to real image collections, and signing this agreement gave us a real push, since we were able to conduct a series of experiments at a scale never reached before.

Based on the experience gained with the PvS-framework, we have designed a more sophisticated and general index which is also based on ranking, projections and partitions. This index is called the NV-tree (Nearest Vector tree) and we are in the process of patenting it. With respect to the PvS-framework, the NV-tree yields better performance and space utilization, is better able to capture the real distribution of data by self-tuning the projection and partitioning strategies, copes with on-the-fly updates of the descriptor collections, can be used stand-alone or by aggregating the results from two or more indices, and lends itself effectively to distributed processing to further reduce response times. All in all, the NV-tree yields efficient query processing and good result quality with extremely large descriptor collections.
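
The idea of ranking descriptors by projections and scanning only one partition can be sketched as follows. This is a hypothetical single-projection toy, far simpler than the PvS-framework or the NV-tree, which cascade several projections and partitions and aggregate ranks:

```python
import numpy as np

class ProjectionIndex:
    """Simplified sketch in the spirit of the PvS-framework / NV-tree:
    descriptors are ranked by their projection onto a random line, and a
    query scans only the small window of descriptors whose projections
    surround its own."""

    def __init__(self, descs, window=64, seed=0):
        rng = np.random.default_rng(seed)
        line = rng.normal(size=descs.shape[1])
        self.line = line / np.linalg.norm(line)
        proj = descs @ self.line
        self.order = np.argsort(proj)       # sorted position -> original id
        self.proj = proj[self.order]
        self.descs = descs[self.order]
        self.half = window // 2

    def query(self, q):
        """Approximate nearest neighbour: scan one projection window only."""
        pos = int(np.searchsorted(self.proj, q @ self.line))
        lo = max(0, pos - self.half)
        hi = min(len(self.descs), pos + self.half)
        d = np.linalg.norm(self.descs[lo:hi] - q, axis=1)
        return int(self.order[lo + int(np.argmin(d))])

rng = np.random.default_rng(4)
data = rng.random((1000, 16))
idx = ProjectionIndex(data)
print(idx.query(data[123]))      # scans ~64 descriptors instead of 1000
```

The approximation is what buys the scalability: query cost depends on the window size, not on the collection size, at the price of occasionally missing the true nearest neighbour.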

Describing Sequences for Audio/Video Retrieval

Participant : Laurent Amsaleg.

Today we can exploit rather large databases of still images quite well, and we know how to query them efficiently by content. The next step requires turning our focus to more complex documents, typically video and audio. There are today several description techniques for audio and video, but only very few techniques to efficiently perform query-by-content on video or audio databases at large scale. Being able to use such techniques is particularly crucial for professional multimedia archiving organizations.

People working in such organizations typically want to annotate incoming video or audio streams before archiving. Those annotations are then used by any subsequent search, since they are at the root of document matching. It is key to note that document annotation is an entirely manual process, and to understand that this process cannot scale with the constantly increasing number of streams to annotate. Therefore, one salient application is the automated segmentation of multimedia streams into separate units, followed by the automatic annotation of each unit, before archiving the documents. It is thus necessary to perform searches in streams to detect, for example, jingles, trailers, or the periodic broadcast of elements. Those searches are more complex than simply searching for the repetition of identical patterns, since it is necessary to find correlations despite distortions, duration variations, superimposition of noise, text or additional music, inclusion of multiple side-streams, etc.

The state of the art makes such searches possible, but only at a very small scale, i.e., on a very small amount of data. Today, no search technique is efficient enough to allow any practical use of real-scale audio or video archives. In addition, it has been observed that it is not possible to simply extend existing multidimensional indexing techniques, since they were designed for description schemes in which the concept of sequence is lacking.

One of the most prevalent difficulties comes from the temporal aspect of video and audio descriptions. Describing video and audio means creating sequences of descriptions in which the notion of order between descriptions is central. That notion of order is ignored by all traditional search techniques, which only search for independent elements that are, at most, very loosely coupled.

We therefore try to understand how multidimensional indexing techniques can integrate the notion of sequences of descriptions into their principles. This needs to be done to make content-based searches possible in very large archives of video and audio documents. We have started to work on this topic: we have implemented a few techniques from the state of the art (exhaustive search, dynamic time warping, Gaussian mixture models and SVM-based modelling) and ran performance evaluations on audio recognition. Using a collection of real audio samples, we checked the ability of each technique to handle recognition despite time shifts, time distortions and other signal distortions. It turns out that SVM-based models perform quite well but are very inefficient in terms of response time. This leaves room for improvement.
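
Among the evaluated techniques, dynamic time warping is the one that most explicitly models the order between descriptions while tolerating time shifts and duration changes. A textbook version over 1-D feature sequences:

```python
import numpy as np

def dtw_distance(a, b):
    """Classical dynamic time warping between two feature sequences.
    Unlike frame-by-frame comparison, the optimal alignment may stretch
    or compress time, which is what duration variations require."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A slowed-down copy of the same "jingle" should be closer than a different one.
ref = np.sin(np.linspace(0, 2 * np.pi, 50))
stretched = np.sin(np.linspace(0, 2 * np.pi, 70))   # same signal, 40% slower
other = np.cos(np.linspace(0, 2 * np.pi, 70))
print(dtw_distance(ref, stretched) < dtw_distance(ref, other))
```

The quadratic cost of this alignment is precisely why such techniques, taken as-is, do not scale to real archives.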

Navigation in Personal Image Collections

Participant : Laurent Amsaleg.

As for the work on the scalability of local image descriptors, this is a joint work with the team of Björn Þór-Jónsson at Reykjavík University, in the context of the INRIA Associate teams program.

In recent years, the world has seen a tremendous increase in the capability to create, share and store digital images. As a result, personal image collections are growing at an astounding rate and it is clear that in the future individuals will need to access tens of thousands, or even hundreds of thousands, of digital images. It is therefore imperative to start studying ways to access these images in a useful and interesting manner. What is needed is software that will allow users to seamlessly organize, search and browse their images.

In many households, organizing a home photo collection has long been a neglected task. This is still true even with the latest digital photo browsers, which typically simply dump pictures into folders, an electronic version of the good old shoe-boxes our parents used for paper-printed pictures. They offer no support for browsing and searching by image content, and are therefore inadequate for handling such large collections. Despite numerous features (effective packing of thumbnails on screen, identification of representative images, zoomable user interfaces...), all current photo browsers share limitations, such as using either a time-line view or a folder view at any one time, failing to use the two dimensions of the screen. Most have clumsy annotation capabilities and, more than anything else, completely separate the search and browsing functions. This key flaw is not unique to image browsers: on the Web, browsing means clicking hyperlinks, while searching goes through Google or others, typically returning a flat list of results from which browsing can start. Overall, presentation is typically linear, and the contents of the images are not used to guide the search and presentation.

Each image may be described by a number of attributes, based on image contents and image meta-data (such as camera and time information, stored in so-called EXIF headers). Some of these attributes may be linear or spatial, such as the time and location at which the image was taken, while others may be textual, hierarchical or categorical. These attributes may be considered dimensions in an image hyper-space, which we must be able to traverse dynamically to fully enjoy our digital images. In on-line analytical processing (OLAP), multi-dimensional data is handled by considering a few dimensions at a time and pivoting between dimensions when necessary. In advanced computer games such as EVE Online, large three-dimensional worlds are explored by simulating space travel. Both approaches have been very successful in keeping their users occupied and focused on their task for a long time.

We propose that a browsing interface for images should merge these features into a multi-dimensional interface that allows flexible, space-travel-like exploration of the image hyperspace. What is novel in this work is the integration into an image browser of OLAP browsing concepts, such as pivoting and filtering, that were typically designed to facilitate the browsing of huge financial data collections. To begin exploring the possibilities of such a browsing interface, we have implemented a prototype, based on the PartiView browser, which allows us to browse images in a three-dimensional space. The dimensions may be based on image contents and image meta-data, and different dimensions may be combined in an arbitrary manner. Our conclusion is that while the prototype has shortcomings, this is a very promising research direction that merits further exploration.
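
The pivoting operation borrowed from OLAP amounts to regrouping the same collection along another dimension. A minimal sketch on hypothetical EXIF-like metadata (file names, dates, places and people are invented for the example):

```python
from collections import defaultdict
from datetime import date

# Hypothetical image metadata, as could be read from EXIF headers and annotations.
photos = [
    {"file": "img1.jpg", "when": date(2006, 7, 14), "where": "Rennes", "who": "family"},
    {"file": "img2.jpg", "when": date(2006, 7, 15), "where": "Rennes", "who": "friends"},
    {"file": "img3.jpg", "when": date(2006, 12, 24), "where": "Paris", "who": "family"},
]

def pivot(photos, axis):
    """OLAP-style pivot: regroup the same collection along another dimension."""
    groups = defaultdict(list)
    for p in photos:
        groups[p[axis]].append(p["file"])
    return dict(groups)

by_place = pivot(photos, "where")     # browse by location...
by_people = pivot(photos, "who")      # ...then pivot to browse by people
print(sorted(by_place))
```

In the actual interface, each such grouping would become an axis of the three-dimensional browsing space rather than a flat list.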

General Frame for Clustering High Dimensional Datasets using Random Projections

Participants : Laurent Amsaleg, Zied Jemai, Annie Morin.

Random Projections (RPs) have recently emerged as a powerful method for dimension reduction. Random projection is computationally much less expensive than other techniques and preserves distances quite well. It has been shown that RP yields results comparable to conventional dimensionality reduction techniques [42], [48], [58]. RPs, however, are highly unstable, which becomes problematic when they are used for data clustering purposes.

We made an experimental observation showing that data points belonging to a "natural" cluster are very likely to be clustered together in many different possible clusterings. Multiplying projections of the data space onto many lines, and segmentations of those lines, gives many different clusterings of the points. Note that this method relies heavily on our previous PvS work [84]. It is therefore possible to create a matrix of co-association of points across all the produced clusterings. The rationale of our approach is to weight associations between data points by the number of times they co-occur in a cluster, where the clusters are produced by independent runs of RP-based clusterings. This matrix becomes the support for determining consistent and stable clusters, which are subsequently refined using a classical hierarchical approach.
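
The co-association construction can be sketched as follows: repeat a random projection, cut each projected line into segments, and count how often two points land in the same segment. This toy version assumes quantile-based segmentation, in the spirit of the PvS scheme; the actual framework then feeds the matrix to a hierarchical clustering step:

```python
import numpy as np

def rp_coassociation(data, n_runs=20, n_bins=4, seed=0):
    """Evidence accumulation over random-projection clusterings: the entry
    (i, j) of the returned matrix is the fraction of runs in which points
    i and j fell into the same segment of the projection line."""
    rng = np.random.default_rng(seed)
    n = len(data)
    co = np.zeros((n, n))
    for _ in range(n_runs):
        line = rng.normal(size=data.shape[1])
        proj = data @ line
        # segment the projected line at its quantiles
        cuts = np.quantile(proj, np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.searchsorted(cuts, proj)
        co += bins[:, None] == bins[None, :]
    return co / n_runs

# Two well-separated clusters: co-association should be far higher within
# a cluster than across clusters.
rng = np.random.default_rng(5)
a = rng.normal(0.0, 0.1, (20, 8))
b = rng.normal(5.0, 0.1, (20, 8))
co = rp_coassociation(np.vstack([a, b]))
within = co[:20, :20].mean()    # pairs inside the first cluster
across = co[:20, 20:].mean()    # pairs across the two clusters
print(within, across)
```

Each individual projection is unstable, but averaging over many of them makes the "natural" clusters stand out in the matrix.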

We applied this RP-based clustering framework to four different databases with various dimensionalities and cardinalities (up to 43 million 128-d descriptors). So far, this method appears to be scalable and can easily address the large-scale clustering problem in data mining.

Intensive Use of Factorial Analysis for Image Mining

Keywords : Correspondence Analysis, Visualization.

Participants : Annie Morin, Nguyen Khang Pham, Patrick Gros.

We need to define "words" in images in order to use the same tools for image analysis as for textual data analysis. We use local descriptors to describe still images. For each image, we get a large number of descriptors lying in a high-dimensional space. The first step is to cluster the descriptors and replace each descriptor by its cluster number. We then build a frequency table crossing the images and the clusters: in cell (i,j), we have the number of descriptors of image i belonging to cluster j. To make the link with textual analysis, an image is equivalent to a document, while a descriptor cluster is equivalent to a word. We then use factorial correspondence analysis to process the frequency table and to obtain groups of "words" defining a concept and, related to those concepts, groups of images very close to each other. This method is very similar to the one used by Andrew Zisserman et al. [101]. The work is just starting with the Ph.D. thesis of N. K. Pham.
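
Once the frequency table is built, factorial correspondence analysis reduces to an SVD of the standardized residuals. The sketch below applies this textbook formulation to a small invented table of 4 images by 4 visual words:

```python
import numpy as np

def correspondence_analysis(F, n_axes=2):
    """Factorial correspondence analysis of a frequency table F
    (images x visual-word clusters), via SVD of the standardized residuals.
    Returns the principal coordinates of the rows (images) and
    columns (visual words)."""
    P = F / F.sum()
    r = P.sum(axis=1, keepdims=True)        # row masses
    c = P.sum(axis=0, keepdims=True)        # column masses
    S = (P - r @ c) / np.sqrt(r @ c)        # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U[:, :n_axes] * sv[:n_axes]) / np.sqrt(r)
    col_coords = (Vt[:n_axes].T * sv[:n_axes]) / np.sqrt(c.T)
    return row_coords, col_coords

# Invented table: images 0-1 mostly use words 0-1, images 2-3 words 2-3.
F = np.array([[10.0, 8.0, 1.0, 0.0],
              [9.0, 11.0, 0.0, 1.0],
              [1.0, 0.0, 12.0, 9.0],
              [0.0, 2.0, 10.0, 11.0]])
rows, cols = correspondence_analysis(F)
print(rows[:, 0])    # first factorial axis separates the two image groups
```

Images and words sharing a "concept" end up close together on the factorial axes, which is what makes the method usable for both mining and visualization.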

Coupling Action and Perception by Image Indexing and Visual Servoing

Keywords : Robot Motion control, Visual Servoing.

Participants : Patrick Gros, Anthony Remazeilles.

This is a joint work with the LAGADIC team (F. Chaumette).

Our work aims at developing an integrated approach combining image recognition, to allow a robot to localize itself with respect to an image collection depicting its environment, and visual servoing, to control the motions and displacements of the robot.

This year, the work of Anthony Remazeilles was mainly dedicated to the implementation of the system developed on the Cycab platform (see the Lagadic project annual report). Although the system worked well in indoor environments, outdoor scenes appeared to be problematic for the image recognition module. The lighting conditions result in large photometric differences between the images of the database and the images to be recognized, differences that are beyond the robustness of the descriptors used.

Since we now have the possibility to manage very large image collections without additional cost in terms of online retrieval time, the solution is probably to add many more images to the database, images corresponding to the various illumination conditions encountered by the robot (morning, mid-day and evening images, sunny and cloudy images...). This aspect will constitute our contribution to the collaboration in the future.
