Inria / Raweb 2003
Project: TEXMEX


Section: Scientific Foundations

Keywords: descriptor, metadata.

Metadata and Document Description

All multimedia documents share an ambivalent characteristic: on the one hand, they are semantically very rich; on the other hand, they are very poor when considering the elementary components that constitute them (sequences of letters or pixels). More concise and informative descriptions are needed in order to handle these documents.

Image Description

Keywords: image matching, image recognition, image indexing, invariants.

Computing image descriptors has been studied for about ten years. The aim of such a description is to extract indices, called descriptors, whose distances reflect those of the images they are computed from. This can be seen as a coding problem: how should images be coded so that the similarity between the codes reflects the similarity between the original images?

The first difficulty is that image similarity is not a well-defined concept. Images are polysemic, and their level of similarity depends on the user who judges it, on the problem this user is trying to solve, and on the set of images he is using. As a consequence, no single descriptor can solve every problem.

The problem can be specialized with respect to the different kinds of users, databases and needs. For example, the needs of professional users are usually very specific, while domestic users require more generic solutions. The same difference holds between databases composed of very dissimilar images and those composed of images of a single kind (fingerprints or X-ray images). Finally, retrieving one particular image from an excerpt, or browsing a database to choose a set of images, may require very different descriptors.

To address these problems, many descriptors have been proposed in the literature. The most frequent framework is image retrieval from a large database of dissimilar images using the query-by-example paradigm. In this case, the descriptors integrate the information of the whole image: color histograms in various color spaces, texture descriptors, shape descriptors (whose major drawback is to require automatic image segmentation). This field of research is still active: color histograms provide information that is too poor to solve any problem as soon as the size of the database increases [76], and several solutions have been proposed to remedy this: correlograms [58], weighted histograms [38]...
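As a minimal illustration of the query-by-example paradigm with global color descriptors, the following sketch computes a normalized joint RGB histogram per image and ranks database images by descriptor distance to a query (the bin count and the L1 distance are illustrative choices, not a specific published method):

```python
import numpy as np

def color_histogram(image, bins=8):
    """Global color descriptor: a normalized joint RGB histogram.
    `image` is an (H, W, 3) uint8 array; returns a vector of bins**3 values."""
    # Quantize each channel into `bins` levels, then count joint occurrences.
    quantized = (image.astype(int) * bins) // 256
    codes = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def l1_distance(h1, h2):
    # Dissimilarity between two descriptors: sum of absolute bin differences.
    return np.abs(h1 - h2).sum()

# Query by example: rank database images by descriptor distance to the query.
rng = np.random.default_rng(0)
query = rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)
database = [query.copy(), rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)]
distances = [l1_distance(color_histogram(query), color_histogram(img))
             for img in database]
```

The histogram discards all spatial layout, which is precisely why it becomes too poor a description as the database grows.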

Texture histograms are usually useful for one kind of texture, but they fail to describe all possible textures, and no technique exists to decide in which category a given texture falls, and thus which descriptor should be used to describe it properly. Shape descriptors suffer from the lack of robustness of shape extractors.

Much other work has been done on specific databases. Face detection and recognition is the most classical and important case, but other work concerns medical images, for example.

In the team, we work with a different paradigm based on local descriptors: one image is described by a set of descriptors. This solution allows partial recognition, such as recognizing an object independently of the background [75].

The main stages of the method are the following. First, simple features are extracted from each image (interest points in our case, but edges and regions can be used as well). The most widely used extractor is the Harris point detector [56], which provides points that are not very precise but are "repeatable". Other detectors exist, even for points [64].
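The idea behind the Harris detector can be sketched as follows: the response is large where the local gradient structure varies in two directions (a corner) and small on edges and flat areas. This toy version uses finite differences instead of Gaussian derivatives and computes one response per patch; the constant k=0.04 is a commonly used illustrative value:

```python
import numpy as np

def harris_response(patch, k=0.04):
    """Harris corner response of a small image patch: det(M) - k*trace(M)^2,
    where M is the structure tensor built from the image gradients."""
    p = patch.astype(float)
    iy, ix = np.gradient(p)                      # finite-difference gradients
    # Structure tensor entries, summed over the patch.
    a, b, c = (ix * ix).sum(), (iy * iy).sum(), (ix * iy).sum()
    det = a * b - c * c
    trace = a + b
    return det - k * trace ** 2

# A corner patch (two edges meeting) scores positive, a flat patch scores 0,
# and a straight edge scores negative.
corner = np.zeros((9, 9)); corner[4:, 4:] = 255.0
flat = np.full((9, 9), 128.0)
```

In the real detector this response is computed at every pixel with Gaussian smoothing, and local maxima above a threshold are kept as interest points; those maxima are what makes the points "repeatable" across views.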

The similarity between images is then translated into the concept of invariance: measurements of the image that are invariant to some geometric (rotations, translations, scalings) or photometric (intensity variations) transformations are sought. In practice, this concept of invariance is usually replaced by the weaker concept of quasi-invariance [36] or by properties established only experimentally [47] [46].

In the case of points, the classical technique consists of characterizing the signal around each point by its convolution with a Gaussian kernel and its first derivatives, and of combining these measurements in order to obtain the invariance properties. Invariance with respect to rotations, scalings and affine transformations was obtained by Florack [48], Dufournaud [42] and Mikolajczyk [66] respectively; photometric invariance was demonstrated for grey levels by Schmid [75] and for color by Gros [53]. The difficult point is that not only must invariant quantities be computed, but the feature extractor itself has to be invariant to the same set of transformations.
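As a toy illustration of such invariant measurements (not the cited constructions themselves), the sketch below computes two classical rotation-invariant differential quantities at the centre of a patch, the gradient magnitude and the Laplacian, from finite-difference derivatives, and checks numerically that a 90-degree rotation of the patch leaves them unchanged:

```python
import numpy as np

def local_invariants(patch):
    """Two rotation-invariant measurements at the centre of a patch:
    the gradient magnitude sqrt(Ix^2 + Iy^2) and the Laplacian Ixx + Iyy.
    Both are unchanged when the patch is rotated around its centre."""
    p = patch.astype(float)
    iy, ix = np.gradient(p)
    iyy = np.gradient(iy, axis=0)
    ixx = np.gradient(ix, axis=1)
    r = p.shape[0] // 2                       # centre pixel index
    grad_mag = np.hypot(ix[r, r], iy[r, r])
    laplacian = ixx[r, r] + iyy[r, r]
    return grad_mag, laplacian

# Sanity check: a 90-degree rotation leaves both measurements unchanged,
# whereas the raw derivatives Ix and Iy individually are swapped/negated.
rng = np.random.default_rng(1)
patch = rng.random((7, 7))
g1, l1 = local_invariants(patch)
g2, l2 = local_invariants(np.rot90(patch))
```

A full descriptor combines several such quantities (the "local jet") into a vector; the difficulty mentioned above is that the point detector feeding these measurements must itself commute with the transformations.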

One of the main difficulties of the domain is the evaluation and comparison of the methods. Each one corresponds to a slightly different problem, and comparing them is difficult and usually unfair: the results depend on the database used, especially when it is quite small. In this case, a simple syntactic criterion can give the impression of a good semantic description, but this says nothing about what would happen with a larger database.

Video Description

Keywords: video indexing, structuring, key-events.

Professional and domestic video collections are usually much bigger than the corresponding still-image collections: a factor of 1000 between the two is common. Although the images often have a lower quality (motion, blurred images...), they present a temporal redundancy which can be exploited in order to gain some robustness.

Video indexing is a broad concept which covers different research topics. Video structuring consists of finding the temporal units of a video (shots, scenes) and is a first step towards computing a table of contents of a video. Key-event detection is more oriented towards the creation of an index of the video. Finally, all the extracted elements can be characterized with various descriptors: motion descriptors [44], or still-image-based descriptors, which can also exploit the temporal redundancy of the images [55].

Many contributions have been proposed in the literature to compute a temporal segmentation of videos, and especially to detect shot boundaries and transitions [39] [49]. Nevertheless, shots appear to be too low-level segments for many applications, since a video can contain more than 3000 of them. Scene segmentation, also called macro-segmentation, is a solution, but it remains an open problem. The combination of media is probably an important axis of research to make progress on this topic.
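The simplest family of shot-boundary detectors can be sketched as follows: a hard cut is declared between consecutive frames whose grey-level histograms differ by more than a threshold (the threshold value here is illustrative; real systems must also handle gradual transitions, which this naive test misses):

```python
import numpy as np

def detect_cuts(frames, threshold=0.5):
    """Detect hard cuts in a list of 2-D uint8 frames: a shot boundary is
    declared before frame i when the L1 distance between the normalized
    grey-level histograms of frames i-1 and i exceeds `threshold`."""
    def hist(f):
        h = np.bincount(f.ravel(), minlength=256).astype(float)
        return h / h.sum()
    cuts = []
    for i in range(1, len(frames)):
        if np.abs(hist(frames[i]) - hist(frames[i - 1])).sum() > threshold:
            cuts.append(i)          # shot boundary just before frame i
    return cuts

# Two synthetic "shots": dark frames then bright frames; one cut expected.
shot1 = [np.full((16, 16), 30, dtype=np.uint8)] * 3
shot2 = [np.full((16, 16), 200, dtype=np.uint8)] * 3
cuts = detect_cuts(shot1 + shot2)
```

Histograms are insensitive to motion within a shot, which is why such frame-difference measures work at all; dissolves and wipes spread the change over many frames and require cumulative measures instead.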

Text Description

Keywords: natural language processing, lexical semantics, machine learning, corpus-based acquisition of linguistic resources, exploratory data analysis.

Automatic indexing of textual documents [74] has to tackle two main problems: first, choosing indexing terms, i.e. simple or complex words automatically extracted from a document, that "represent" its semantic content and make its detection possible within the document database; second, dealing with the fact that this representation is word-based and not conceptual. Information retrieval therefore has to overcome two semantic problems: the various possibilities of formulating the same idea (how to match the same concept in a text and in a query when it is expressed with different words), and word ambiguity (the same word, i.e. the same graphical string, can cover different concepts). In addition to these difficulties, the meaning of a word, and thus the semantic relations that link it to other words, varies from one domain to another. One solution is to make use of domain-specific linguistic resources, both to disambiguate words and to expand user queries with synonyms, hyponyms... These domain-specific resources are however not pre-existing and must be automatically extracted from corpora (collections of texts) using machine learning techniques.
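Query expansion against the vocabulary-mismatch problem can be sketched very simply; the tiny lexicon below is invented for illustration and stands in for the domain-specific resources extracted from corpora:

```python
# A toy domain lexicon (invented for illustration): each entry maps a word
# to terms that may express the same concept in the corpus.
LEXICON = {
    "car": ["automobile", "vehicle"],
    "illness": ["disease", "sickness"],
}

def expand_query(terms, lexicon=LEXICON):
    """Return the query terms plus their known synonyms/hyponyms, so that a
    document using different wording for the same concept can still match."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(lexicon.get(term, []))
    return expanded

expanded = expand_query(["car", "red"])
```

Expansion addresses the first semantic problem (formulation variety); the second one, word ambiguity, pulls in the opposite direction, since expanding an ambiguous term adds synonyms of the wrong sense, which is why disambiguation and domain-specific lexicons matter.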

Much work has been done during the last decade in the domain of automatic corpus-based acquisition of lexical resources, essentially based on statistical methods, even though symbolic approaches are also of growing interest [32]. We focus on these two kinds of methods and aim at developing machine learning solutions that are generic and fully automatic, making it possible to extract from a corpus the kind of lexical elements required by a given application. We specifically extract semantic relations between words (especially noun-noun relations) using hierarchical classification techniques that implement principles of F. Rastier's differential semantics theory [71]. We also acquire, through symbolic machine learning (inductive logic programming [67]), noun-verb relations defined within J. Pustejovsky's Generative Lexicon theory [70]; these particular links give access to interesting reformulations of terms (disk shop - to sell disks) that are so far rarely used in information retrieval systems. Our research concerns both the machine learning algorithms developed to extract lexical elements from corpora, and the linguistic and applicative interest of the learnt elements.

Acquisition of lexicons based on Rastier's differential semantics.

Differential (or interpretative) semantics [71] is a linguistic theory in which the meaning of a word is defined through the differences it presents with the other meanings in the lexicon. A lexicon is thus a network of words, structured in classes, in which differences between meanings are represented by semes (i.e. semantic features). Within a given semantic class (a group of words that can be exchanged in some contexts), words share generic semes that characterize their common points and are used to build the class (e.g. /to seat/ is associated with {chair, armchair, stool...}), and specific semes that make their differences explicit (/has arms/ differentiates armchair from the two others). Following Rastier, two kinds of linguistic contexts are fundamental to characterize relations of lexical meaning: the topic of the text unit in which a word occurrence is found, and its neighborhood. Differential semantics states that valid semantic classes, in which specific semes can be determined, can only be defined within a specific topic. A topic can be recognized within a text by the presence of a semantic isotopy, i.e. the recurrence of some semes across the sets of semes (named sememes) representing some of its words. For example, a war topic can be detected in a text unit that contains the words soldier, offensive, general... by the presence of the same seme /war/ within the sememes of all these words.
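The isotopy test of the /war/ example above can be made concrete with a toy sketch; the sememes below are invented for illustration:

```python
# Toy sememes (sets of semes) for a few words, invented for illustration,
# following the /war/ example above.
SEMEMES = {
    "soldier":   {"/human/", "/war/"},
    "offensive": {"/action/", "/war/"},
    "general":   {"/human/", "/war/", "/rank/"},
    "chair":     {"/object/", "/to seat/"},
}

def recurrent_semes(words, min_count=3, sememes=SEMEMES):
    """Semes shared by at least `min_count` of the given words: their
    recurrence signals a semantic isotopy, hence a topic of the text unit."""
    counts = {}
    for word in words:
        for seme in sememes.get(word, ()):
            counts[seme] = counts.get(seme, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}

topic_semes = recurrent_semes(["soldier", "offensive", "general"])
```

In the real setting the sememes are of course not given in advance; learning them from corpora is precisely the goal of the method described next.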

We have developed a three-level method to extract lexicons based on Rastier's principles. First, with the help of a hierarchical classification method (Likelihood Linkage Analysis, LLA [62]) applied to the distribution of nouns and adjectives among the paragraphs, we automatically learn sets of keywords that characterize the main topics of the studied corpus. These sets are then used to split the corpus into topic-specific corpora, in which semantic classes are built using the LLA technique on shared contexts. Finally, we try to characterize similarity and dissimilarity links between words within each semantic class.
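The first level of the method can be caricatured as follows: group together words whose distributions across paragraphs are similar. This sketch uses a simple cosine-similarity grouping as a much-simplified stand-in for the hierarchical classification (LLA) actually used; matrix, words and threshold are invented:

```python
import numpy as np

def topic_keyword_sets(counts, words, threshold=0.8):
    """Group words whose occurrence profiles across paragraphs have cosine
    similarity above `threshold`. `counts` is a (n_words, n_paragraphs)
    matrix of occurrence counts; returns sets of co-distributed keywords."""
    x = counts / np.linalg.norm(counts, axis=1, keepdims=True)
    sim = x @ x.T                        # pairwise cosine similarities
    groups, assigned = [], set()
    for i in range(len(words)):
        if i in assigned:
            continue
        group = [j for j in range(len(words))
                 if j not in assigned and sim[i, j] >= threshold]
        assigned.update(group)
        groups.append({words[j] for j in group})
    return groups

# Two topics: words 0-1 occur in the first paragraphs, words 2-3 in the last.
counts = np.array([[5, 4, 0, 0], [4, 5, 0, 0],
                   [0, 0, 3, 6], [0, 0, 4, 5]], dtype=float)
groups = topic_keyword_sets(counts, ["army", "soldier", "harvest", "wheat"])
```

Each resulting keyword set characterizes one topic and is then used to carve out the topic-specific sub-corpora in which the finer semantic classes are built.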

Acquisition of elements of Pustejovsky's Generative lexicon.

In one of the components of this lexical model [70], called the qualia structure, words are described in terms of semantic roles. For example, the telic role indicates the purpose or function of an item (cut for knife), and the agentive role its creation mode (build for house)... The qualia structure of a noun is mainly made up of verbal associations, encoding relational information.

We have developed a learning method, using inductive logic programming [67], that enables us to automatically extract, from a morpho-syntactically and semantically tagged corpus, noun-verb pairs whose elements are linked by one of the semantic relations defined in the qualia structure of the Generative Lexicon. The method also infers rules explaining what, in the surrounding context, distinguishes such pairs from others that are also found in sentences of the corpus but are not relevant. In our work, stress is put on the learning efficiency required to deal with all the available contextual information and to produce linguistically meaningful rules. The resulting method and system, named asares, is generic enough to be applied to the extraction of other kinds of semantic lexical information.

Characterization of huge sets of thematically homogeneous texts.

A collection of texts is said to be thematically homogeneous if the texts share some domains of interest. We are concerned with the indexing and analysis of such texts. Finding relevant keywords is not trivial: even in thematically homogeneous sets, there is a high variability in the words used and even in the sub-fields concerned. Beyond the indexing of the texts, it is also valuable to detect thematic evolutions in the underlying corpus.

Generally, textual data are not structured, and we must suppose that the files we deal with have either a minimal structure or a general common theme. The method we use is factorial correspondence analysis: we obtain clusters of documents together with their characteristic words. Recently, R. Priam defended a thesis in which he proposes methods very close to Kohonen maps to visualize local proximities between words and documents.
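Factorial correspondence analysis of a word-by-document table can be sketched in a few lines as an SVD of the matrix of standardized residuals; the toy count table is invented, and this skips refinements (axis selection, supplementary points) of a full analysis:

```python
import numpy as np

def correspondence_analysis(counts, n_axes=2):
    """Factorial correspondence analysis of a word-by-document count table:
    SVD of the standardized residuals. Returns row (word) and column
    (document) principal coordinates on the first `n_axes` factorial axes."""
    p = counts / counts.sum()
    r = p.sum(axis=1)                    # row (word) masses
    c = p.sum(axis=0)                    # column (document) masses
    # Standardized residuals: (observed - expected) / sqrt(expected).
    s = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    u, d, vt = np.linalg.svd(s, full_matrices=False)
    rows = (u[:, :n_axes] * d[:n_axes]) / np.sqrt(r)[:, None]
    cols = (vt.T[:, :n_axes] * d[:n_axes]) / np.sqrt(c)[:, None]
    return rows, cols

# Toy table: two themes, two documents each; the first factorial axis
# separates the two groups of words (and of documents).
counts = np.array([[10, 8, 0, 0], [9, 10, 0, 0],
                   [0, 0, 10, 9], [0, 0, 8, 10]], dtype=float)
rows, cols = correspondence_analysis(counts)
```

Plotting rows and cols on the first two axes gives the joint word/document map from which clusters and their characteristic words are read off.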

Evaluation of Descriptors

Keywords: evaluation, performance, discriminating power.

The situation on this subject differs greatly according to the media concerned. Reference test collections exist for text, sound and speech, and regular evaluation campaigns are organized (NIST for sound and speech recognition, TREC for text in English, AMARYLLIS for text in French, SENSEVAL or ROMANSEVAL for text disambiguation). It should be noted that within TREC, the aim is now not only to retrieve relevant documents, but to find in these documents the parts which provide the answer to a precise query.

In the domain of images and videos, the BENCHATLON provides a database to evaluate image retrieval systems, while TREC provides a test database for video indexing. A system to evaluate shot transition detection algorithms has been developed by G. Quenot and P. Joly [73].


Metadata

Keywords: metadata.

To improve data organization or to define strategies for choosing or computing descriptors, it may be advisable to use contextual information. This information, called metadata (data about the data), can describe how the data were produced, obtained or used; it can provide information about the users or describe the content of a document.

V. Kashyap and A. Sheth [59] proposed a classification of metadata for multimedia into two main classes: metadata which contain external information (date, localization, author...) and metadata which contain internal information linked to the content or to the way this content is represented within the document. The latter are called descriptors:

  1. Some of them are computed directly from the documents;

  2. The others are provided by a human, like keywords.
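This external/internal split can be pictured as a simple record structure; the field names below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DocumentMetadata:
    """Metadata record following the external/internal classification
    described above (field names are illustrative, not a standard)."""
    # External (content-independent) metadata.
    author: str = ""
    date: str = ""
    localization: str = ""
    # Internal metadata (descriptors): either computed from the document
    # (e.g. a color histogram) or provided by a human (keywords).
    computed_descriptors: dict = field(default_factory=dict)
    keywords: list = field(default_factory=list)

record = DocumentMetadata(author="J. Doe", date="2003-05-12",
                          computed_descriptors={"color_histogram": [0.2, 0.8]},
                          keywords=["landscape"])
```

Keeping descriptors inside such a record is what allows them to be reused by later processing instead of being recomputed from the raw document.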

The organization of the metadata can be refined and become matricial when considering the various objectives of the metadata (easy data access, data summarization, interoperability, media or content representation...) or the way the metadata were obtained. Metadata are a privileged way to keep information relative to a document or its descriptors in order to facilitate future processing. They appear to be a key point in the coherent exploitation of large multimedia databases.