Section: Scientific Foundations

Keywords : Low-level Descriptor, Metadata.

Document Description and Metadata

All multimedia documents have the ambivalent characteristic of being, on the one hand, very rich semantically and, on the other hand, very poor, especially when considering the elementary components that constitute them (sets of characters or of pixels). More concise and informative descriptions are needed in order to handle these documents.

Image Description

Keywords : Image Matching, Image Recognition, Image Indexing, Invariants.

The computation of image descriptors has been studied for about thirty years. The aim of such a description is to extract indices, called descriptors, whose mutual distances reflect the dissimilarity of the images they are computed from. This problem can be seen as a coding problem: how should images be coded so that the similarity between the codes reflects the similarity between the original images?

The first difficulty is that image similarity is not a well-defined concept. Images are polysemic, and their perceived similarity depends on the user who judges it, on the problem this user is trying to solve, and on the set of images he or she is using at that moment. As a consequence, no single descriptor can solve every problem.

The problem can be specialized with respect to the different kinds of users, databases and needs. For example, the problems of professional users are usually very specific, whereas domestic users need more generic solutions. The same difference occurs between databases composed of very dissimilar images and those composed only of images of the same kind (e.g., fingerprints or X-ray images). Finally, retrieving one particular image from an excerpt and browsing a database to choose a set of images may require very different descriptors.

To solve these problems, many descriptors have been proposed in the literature. The most frequently considered use case is image retrieval from a large database using the query-by-example paradigm. In this case, the descriptors integrate information over the whole image: color histograms in various color spaces, texture descriptors, and shape descriptors (which have the major drawback of requiring an automatic image segmentation). This field of research is still active: color histograms carry too little information to remain useful as soon as the size of the database increases [102]. Several solutions have been proposed to remedy this problem: correlograms [79], weighted histograms [53], etc.
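The query-by-example scheme with a global color histogram can be sketched as follows. This is a minimal illustration and not the method of any cited work: the function names, the choice of 4 bins per channel, the L1 distance and the toy pixel data are all assumptions made for the example.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized joint RGB histogram for (r, g, b) pixels in 0..255."""
    if not pixels:
        raise ValueError("an image needs at least one pixel")
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Flatten the three per-channel bin indices into one index.
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


def l1_distance(h1, h2):
    """L1 distance between two histograms of equal length (assumed choice)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def query_by_example(query_pixels, database):
    """Rank database image names by increasing descriptor distance to the query."""
    q = color_histogram(query_pixels)
    scored = [(l1_distance(q, color_histogram(pixels)), name)
              for name, pixels in database.items()]
    return [name for _, name in sorted(scored)]


# Hypothetical toy images, given as flat lists of pixels.
red = [(250, 10, 10)] * 16
mostly_red = [(250, 10, 10)] * 12 + [(10, 10, 250)] * 4
blue = [(10, 10, 250)] * 16
database = {"mostly_red": mostly_red, "blue": blue}
ranking = query_by_example(red, database)
```

Because the histogram discards all spatial layout, two images with the same color proportions but different content get identical descriptors, which is one way to see why such descriptors lose discriminating power on large databases.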

Texture histograms are usually useful for one kind of texture, but they fail to describe all possible textures, and no technique exists to decide which category a given texture falls into, and thus which descriptor should be used to describe it properly. Shape descriptors suffer from a lack of robustness.

Much other research has been carried out on specific databases. Face detection and recognition is the most classical and important case, but other studies concern medical images, for example.

In the team, we work with a different paradigm based on local descriptors: one image is described by a set of descriptors. This solution opens up the possibility of partial recognition, such as recognizing an object independently of its background [100].

The main stages of the method are the following. First, simple features are extracted from each image (interest points in our case, but edges and regions can be used too). The most widely used extractor is the Harris point detector [77], which provides points that are not very precise but are "repeatable". Other detectors exist, even for points [86].
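To make the extraction stage concrete, here is a minimal sketch of a Harris-style corner response in plain Python. It is a simplification under stated assumptions: gradients are central differences, the structure tensor is summed over a uniform square window rather than a Gaussian one, and the constant k = 0.04 is a conventional value, not one taken from [77]. The function name and the toy image are hypothetical.

```python
def harris_response(image, k=0.04, radius=1):
    """Return a grid of Harris-style corner responses for a 2-D list of grey levels."""
    h, w = len(image), len(image[0])
    # Central-difference gradients (left at zero on the image border).
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (image[y][x + 1] - image[y][x - 1]) / 2.0
            iy[y][x] = (image[y + 1][x] - image[y - 1][x]) / 2.0
    response = [[0.0] * w for _ in range(h)]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            # Structure tensor summed over a uniform window (assumed simplification).
            sxx = sxy = syy = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    sxy += gx * gy
                    syy += gy * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            # Positive at corners, negative along edges, near zero in flat areas.
            response[y][x] = det - k * trace * trace
    return response


# Toy image: a bright square filling the lower-right part of a dark 10x10 image.
image = [[1.0 if (y >= 5 and x >= 5) else 0.0 for x in range(10)] for y in range(10)]
response = harris_response(image)
corner = response[5][5]   # corner of the square
edge = response[7][5]     # point on the vertical edge of the square
flat = response[2][2]     # point in the uniform background
```

Interest points would then be taken as local maxima of the response above a threshold, a step omitted here.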

The similarity between images is then translated into the concept of invariance: one looks for image measurements that are invariant to some geometric (rotations, translations, scalings) or photometric (intensity variations) transformations. In practice, this concept of invariance is usually replaced by the weaker concept of quasi-invariance [47] or by properties that are only established experimentally [67], [66].

In the case of points, the classical technique consists in characterizing the signal around each point by its convolution with a Gaussian and its first derivatives, and in combining these measurements in order to obtain the invariance properties. Invariance with respect to rotations, scalings and affine transformations was obtained by Florack [68], Dufournaud [61] and Mikolajczyk [92] respectively; photometric invariance was demonstrated for grey levels by Schmid [100] and for color by Gros [74]. The difficulty is that not only do invariant quantities have to be computed, but the feature extractor itself has to be invariant to the same set of transformations.
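As a hedged illustration of how such measurements can be combined, consider the simplest case, rotation, with first derivatives only. The notation below (L for the Gaussian-smoothed image, R for a rotation matrix) is introduced for this example and is not taken from the cited works.

```latex
% L = G_\sigma * I is the image smoothed by an isotropic Gaussian;
% I'(\mathbf{x}) = I(R^{\top}\mathbf{x}) is the same image rotated by R.
\begin{align*}
L'(\mathbf{x}) &= (G_\sigma * I')(\mathbf{x}) = L(R^{\top}\mathbf{x})
  && \text{(the Gaussian is isotropic)} \\
\nabla L'(\mathbf{x}) &= R\,\nabla L(R^{\top}\mathbf{x})
  && \text{(chain rule)} \\
\|\nabla L'(\mathbf{x})\|^{2} &= \|\nabla L(R^{\top}\mathbf{x})\|^{2}
  = \bigl(L_x^{2} + L_y^{2}\bigr)(R^{\top}\mathbf{x})
  && \text{(since } R^{\top}R = \mathrm{Id}\text{)}
\end{align*}
```

The squared gradient magnitude therefore takes the same value at corresponding points of the original and rotated images, whereas the derivatives L_x and L_y taken separately do not.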

One of the main difficulties of the domain is the evaluation and comparison of the methods. Each one corresponds to a slightly different problem. Comparing them is difficult and usually unfair: the results depend on the databases used, especially when these are quite small. In this case, a simple syntactic criterion can give the impression of a good semantic description, but this says nothing about what would happen with a larger database.

Video Description

Keywords : Video Indexing, Structuring, Key-Events.

Professional and domestic video collections are usually much bigger than the corresponding still image collections: a factor of 1000 between them is common. Although video images are often of lower quality than still images (motion, blurred images...), they present a temporal redundancy that can be exploited to gain some robustness.

Video indexing is a broad concept that covers several research topics. Video structuring consists of finding the temporal units of a video (shots, scenes) and is a first step toward computing its table of contents. Key-event detection is oriented more toward creating an index of the video. Finally, all the extracted elements can be characterized with various descriptors: motion descriptors [63], or still image-based descriptors that can exploit the temporal redundancy of the images [76].

Many contributions have been proposed in the literature to compute a temporal segmentation of videos, and especially to detect shot boundaries and transitions [54], [69]. Nevertheless, shots appear to be too low-level a segment for many applications, since a video typically contains more than 3000 shots. Scene segmentation, also called macro-segmentation, is a solution, but it remains an open problem. Combining the different media is probably an important research direction for making progress on this topic.
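For readers unfamiliar with shot boundary detection, a common baseline is to threshold the difference between the grey-level histograms of consecutive frames. The sketch below illustrates that baseline only; it is not the method of [54] or [69], it detects abrupt cuts but not gradual transitions, and the function names, bin count and threshold are assumptions.

```python
def grey_histogram(frame, bins=8):
    """Normalized grey-level histogram of a frame given as a flat list of values in 0..255."""
    step = 256 // bins
    hist = [0.0] * bins
    for value in frame:
        hist[value // step] += 1.0
    return [count / len(frame) for count in hist]


def detect_cuts(frames, threshold=0.5):
    """Return the indices i for which a cut is declared between frame i-1 and frame i."""
    cuts = []
    previous = grey_histogram(frames[0])
    for i in range(1, len(frames)):
        current = grey_histogram(frames[i])
        # Half the L1 distance lies in [0, 1]: 0 for identical histograms,
        # 1 for histograms with no bin in common.
        difference = 0.5 * sum(abs(a - b) for a, b in zip(previous, current))
        if difference > threshold:
            cuts.append(i)
        previous = current
    return cuts


# Hypothetical toy sequence: three dark frames, two bright frames, one dark frame.
dark = [20] * 64
bright = [230] * 64
frames = [dark, dark, dark, bright, bright, dark]
cuts = detect_cuts(frames)
```

Each detected cut starts a new shot, so a sequence with n cuts yields n + 1 shots; grouping those shots into scenes is the open macro-segmentation problem mentioned above.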

Text Description

Keywords : Natural Language Processing, Lexical Semantics, Machine Learning, Corpus-Based Acquisition of Linguistic Resources, Exploratory Data Analysis.

Lexical semantics for Information Retrieval

Automating the indexing of textual documents [99] requires tackling two main problems: first, choosing indexing terms, i.e. simple or complex words automatically extracted from a document, that "represent" its semantic content and make its retrieval possible when the document database is queried; second, dealing with the fact that the representation is word-based and not concept-based. Information retrieval therefore has to overcome two semantic problems: the various ways of formulating the same idea (how to match a concept in a text with a query expressed in different words), and word ambiguity (the same word can cover different concepts). In addition to these difficulties, the meaning of a word, and thus the semantic relations that link it to other words, varies from one domain to another. One solution is to use domain-specific linguistic resources, both to disambiguate words and to propose equivalent formulations. However, these domain-specific resources do not exist beforehand and must be automatically extracted from corpora (collections of texts). Moreover, if one wants to use resources that are really adapted to one's text collection, one has to adopt, prior to acquiring them, a linguistic framework defining the semantic elements that are to be collected from corpora. In this respect, work in lexical semantics provides different theoretical models; let us cite three of them that are used in the TexMex project.

F. Rastier's differential semantic theory [96] is a linguistic theory in which the meaning of a word is defined through the differences it presents with the other meanings in the lexicon. Within a given semantic class (a group of words that can be exchanged in some contexts), words share generic semes (i.e., generic semantic features) that characterize their common points and are used to build the class (e.g. the generic seme /to seat/ is associated with the class {chair, armchair, stool...}), and specific semes that make their differences explicit (/has arms/ differentiates armchair from the two others).

In J. Pustejovsky's Generative Lexicon theory [95], a so-called qualia structure is defined. In this structure, words are described in terms of semantic roles. For example, the telic role indicates the purpose or function of an item (cut for knife), the agentive role indicates its mode of creation (build for house), and so on. The qualia structure of a noun is mainly made up of verbal associations, encoding relational information.

The Meaning-Text Theory is a broad linguistic framework [90], whose lexicology part defines Lexical Functions [91] (LFs). LFs are designed to encode every semantic relation of a word, such as syntagmatic relations (e.g. mouse–to click, shower–to have...) or paradigmatic ones (e.g. professor–student, bee–honey...).

Concerning the corpus-based acquisition of lexical resources, much research has been undertaken over the last decade. While most of it is essentially based on statistical methods, symbolic approaches are also attracting growing interest [41]. Relying on both kinds of methods, machine learning solutions are being developed in TexMex; they aim to be automatic and generic enough to extract, from a corpus, the kind of lexical elements required by a given task (for example, query expansion in an information retrieval system).
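The query expansion task mentioned above can be sketched as follows, to show why a word-based index misses documents that express the same concept with different words. This is an illustration only: the function names, the toy documents and the `medical_resource` dictionary (standing in for a domain-specific lexical resource acquired from a corpus) are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query, related_terms=None):
    """Return ids of documents containing any query word or any of its related terms."""
    related_terms = related_terms or {}
    hits = set()
    for word in query.lower().split():
        # The query word itself, plus the terms the resource links it to.
        for term in [word] + related_terms.get(word, []):
            hits |= index.get(term, set())
    return hits


documents = {
    "d1": "the surgeon scheduled the operation",
    "d2": "the physician examined the patient",
    "d3": "the bank approved the loan",
}
# Hypothetical domain-specific resource for a medical collection.
medical_resource = {"doctor": ["physician", "surgeon"]}

index = build_index(documents)
plain = search(index, "doctor")                       # word-based matching only
expanded = search(index, "doctor", medical_resource)  # with query expansion
```

Without the resource, the query retrieves nothing because no document contains the literal word; with it, the two medical documents are found. The same expansion applied with a resource from another domain could add irrelevant terms, which is why the resources need to be adapted to the collection.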

Characterization of Huge Sets of Thematically Homogeneous Texts

A collection of texts is said to be thematically homogeneous if the texts share some domain of interest. We are concerned with the indexing and analysis of such texts. The search for relevant keywords is not trivial: even in thematically homogeneous sets, there is a high variability in the words used and even in the sub-fields concerned. Apart from indexing the texts, it is valuable to detect thematic evolutions in the underlying corpus.

Generally, textual data are not structured, and we must assume that the files we are concerned with have either a minimal structure or a general common theme. The method we use is factorial correspondence analysis. It yields clusters of documents together with their characteristic words.
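For readers who have not met correspondence analysis, the sketch below computes the first factorial axis of a small documents-by-words count table in plain Python. It is a simplified illustration, not the team's implementation: it extracts a single axis by power iteration instead of a full singular value decomposition, and it assumes a table with no empty row or column and some dependence between rows and columns. All names and the toy table are hypothetical.

```python
import math


def first_ca_axis(table, iterations=100):
    """Return (row_coords, col_coords) on the first correspondence-analysis axis."""
    n_rows, n_cols = len(table), len(table[0])
    total = float(sum(sum(row) for row in table))
    row_mass = [sum(row) / total for row in table]
    col_mass = [sum(table[i][j] for i in range(n_rows)) / total for j in range(n_cols)]
    # Standardized residuals: (observed - expected) / sqrt(expected), in proportions.
    s = [[(table[i][j] / total - row_mass[i] * col_mass[j])
          / math.sqrt(row_mass[i] * col_mass[j])
          for j in range(n_cols)] for i in range(n_rows)]
    # Power iteration on S^T S for the leading right singular vector of S.
    v = [float(j + 1) for j in range(n_cols)]  # arbitrary deterministic start
    for _ in range(iterations):
        sv = [sum(s[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(s[i][j] * sv[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    sv = [sum(s[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in sv))  # leading singular value
    # Principal coordinates of documents (rows) and words (columns).
    row_coords = [sv[i] / math.sqrt(row_mass[i]) for i in range(n_rows)]
    col_coords = [sigma * v[j] / math.sqrt(col_mass[j]) for j in range(n_cols)]
    return row_coords, col_coords


# Toy table: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
table = [
    [4, 3, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 4, 3],
    [0, 0, 3, 4],
]
docs, words = first_ca_axis(table)
```

Documents and words that fall on the same side of the axis are associated with each other, which is how a cluster of documents and its characteristic words can be read off together.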

Retrieval and Description Evaluation

Keywords : Evaluation, Performance, Discriminating Power.

The situation in this area differs greatly depending on the media concerned. Reference test bases exist for text, sound or speech, and regular evaluation campaigns are organized (NIST for sound and speech recognition, TREC for text in English, CLEF for text in various European languages, SENSEVAL or ROMANSEVAL for text disambiguation...).

In the domain of images and videos, the BENCHATLON provides a database to evaluate image retrieval systems, while TREC provides a test database for video indexing. A system to evaluate shot transition algorithms has been developed by G. Quenot and P. Joly [98].

Setting up evaluation protocols to compare different content-based information retrieval (CBIR) systems is a very hard task, especially when considering the relevance feedback from users who submit an image or a video as a query by example to the CBIR system. In this context, our idea is to learn user profiles automatically during the search scenarios and to correlate feedback indicators (collected non-intrusively) with the sets of descriptors used in the query to compute the results. Ultimately, the objective is to adapt the next query execution or the image/video browsing by dynamically taking the latest feedback into account.
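One simple way to picture how feedback could be tied to the descriptors used in a query is a re-weighting scheme, sketched below. This is purely an assumed illustration, not the approach developed in the project: the inverse-mean-distance rule, the function names and the toy numbers are all hypothetical.

```python
def combined_distance(distances, weights):
    """Weighted sum of the per-descriptor distances between a query and one image."""
    return sum(weights[name] * d for name, d in distances.items())


def update_weights(relevant_distances, epsilon=1e-6):
    """Give more weight to descriptors that place the relevant images close to the query.

    `relevant_distances` maps a descriptor name to the distances between the
    query and the images that the collected feedback suggests are relevant.
    """
    raw = {}
    for name, values in relevant_distances.items():
        mean = sum(values) / len(values)
        raw[name] = 1.0 / (mean + epsilon)  # assumed rule: inverse mean distance
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Hypothetical feedback: relevant images are close in color, far in texture.
feedback = {"color": [0.1, 0.2, 0.15], "texture": [0.6, 0.7, 0.8]}
weights = update_weights(feedback)

# The next query then ranks images with the adapted weights.
score = combined_distance({"color": 0.2, "texture": 0.5}, weights)
```

The weights sum to one, so the combined distance stays on the same scale from one query to the next.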

Metadata Management

Keywords : Metadata, Standard, MPEG-7, TV-Anytime, MPEG-21, Integration, Metadata Management, Automatic Selection.

To improve the organization of the data or to define strategies for computing some descriptors, it may be advisable to use additional information, called metadata. Metadata (data about the data) must describe the data well enough to be used as a surrogate for the data when making decisions regarding their description and use. Metadata can give complex information concerning the structure description, semantics and contents of data items, their associated processes and, more widely, the respective domains of these various pieces of information.

Metadata are: i) data describing and documenting data, ii) data about datasets and their usage aspects, iii) the content, quality, constraints, and other characteristics of data.

The documenting role of metadata is fundamental. This information can provide decision elements for choosing the most appropriate dataset or processing techniques, as well as the most appropriate data presentation mode. With large amounts of data, it is difficult to analyze the data content directly. Metadata then provide informative elements for assessing or describing the dataset.

However, the role of metadata is not restricted to documenting information. Metadata must also allow:

V. Kashyap and A. Sheth [81] proposed a first classification of metadata for multimedia documents into two main classes: metadata that contain external information (date, localization, author...) and metadata that contain internal information, either directly dependent on the content (such as low-level descriptors) or describing the content independently (such as keyword annotations) [43], [50], [56], [72], [80]. Many standardized metadata, such as those of MPEG-7 or TV-Anytime, as well as ad hoc content-descriptive metadata, can be included in this classification [87], [89].
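The classification above can be pictured as a small data structure. The sketch below is only an illustration: the class names, the enum labels (shorthand for the three kinds described in the paragraph) and the example values are hypothetical and do not reproduce any MPEG-7 or TV-Anytime schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class MetadataKind(Enum):
    EXTERNAL = "external"                        # date, localization, author...
    CONTENT_DEPENDENT = "content-dependent"      # e.g. low-level descriptors
    CONTENT_DESCRIPTIVE = "content-descriptive"  # e.g. keyword annotations


@dataclass
class MetadataItem:
    name: str
    kind: MetadataKind
    value: object


@dataclass
class Document:
    uri: str
    metadata: list = field(default_factory=list)

    def select(self, kind):
        """Return the metadata items of the given kind."""
        return [item for item in self.metadata if item.kind is kind]


# Hypothetical video document carrying the three kinds of metadata.
doc = Document("video-001", [
    MetadataItem("author", MetadataKind.EXTERNAL, "unknown"),
    MetadataItem("date", MetadataKind.EXTERNAL, "2004-01-01"),
    MetadataItem("color_histogram", MetadataKind.CONTENT_DEPENDENT, [0.75, 0.25]),
    MetadataItem("keywords", MetadataKind.CONTENT_DESCRIPTIVE, ["news", "interview"]),
])
external = doc.select(MetadataKind.EXTERNAL)
```

Keeping the kind alongside each item lets an application choose, for instance, external metadata for cataloguing and content-dependent metadata for similarity search.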

The key elements of the metadata managed by TexMex include (but are not limited to):

The selection and organization of metadata are highly application-dependent. They also depend on the objectives that the use of metadata is meant to serve: data access, data summary, data interoperability, media or content presentation and adaptation, etc.

Metadata are a privileged means of keeping information about a document or its descriptors in order to facilitate future processing. They appear to be a key element in the coherent exploitation of large multimedia databases.

