Team TEXMEX

Inria / Raweb 2003
Project: TEXMEX

# Project : texmex

## Section: Overall Objectives

### Overall Objectives

The explosion in the quantity of digital documents raises the problem of managing them. Beyond storage, we are interested in problems related to content management: how to exploit large document collections, how to classify them, how to index them so that documents can be searched, and how to visualize their contents. To address these problems, we propose multidisciplinary work that brings together, within a single team, specialists in the various media (image, video, text) and specialists in techniques for exploiting the data and metadata extracted from these media (databases, statistics, information retrieval). Our work lies at the intersection of these fields and focuses on three points: searching in large image databases, adding semantics to search engines, and coupling media for multimedia document description.

Exploiting the contents of large databases of digital multimedia documents is a multifaceted problem, and building a system that exploits such a database calls upon many techniques: study and description of documents, organization of the databases, search algorithms, classification, and visualization, but also suitable management of primary and secondary memory, interfaces, and interaction with the user.

We see five major challenges in the field:

• first of all, it is necessary to handle large sets of documents: it is important to develop techniques that scale gracefully with the quantity of documents they handle, and to evaluate their results in terms of both quality and speed;

• multimedia documents are not a simple juxtaposition of independent media, and it is important to better exploit the links that exist between the various media present in the same document;

• multimedia document databases evolve: the sets of documents change, as do the document description techniques and the query modes, which in turn modifies the way in which the databases are used;

• queries are semantic in nature, whereas description techniques only have access to document syntax; it is thus necessary to find ways of reducing this gap between semantic needs and syntactic description tools;

• user-system interaction is a central point: users must be able to express their needs efficiently and simply, yet with nuance, in order to guide the system or to evaluate the results; they must be the ones who control the system.

We have adopted a matrix organization. On the one hand, we have expertise in two main fields, automatic document description and exploitation of these descriptions; on the other hand, we have defined three cross-cutting research topics. The main idea is to concentrate on the questions where the multidisciplinarity of the team appears to be an asset for obtaining original results. Most graduate students in the team work at the intersection of two domains and are thus supervised by two people.

#### Our first field of competence: document description

Documents generally cannot be exploited directly for search or indexing tasks: it is necessary to use intermediate descriptions, which must carry as much information as possible about document semantics while remaining automatically computable. To the documents and their descriptors, one can add metadata, which we define here as all the information (other than the descriptors) that informs, supplements, or qualifies the data (and the descriptors) with which it is associated.

#### Our second field of competence: description exploitation

The question is to define techniques that make it possible to handle and exploit the large volumes of data, metadata, and descriptors extracted from the documents: organization and management of the databases, logical and temporal consistency, selection of descriptors and metadata and strategies for computing them; statistical techniques for exploring large volumes of data; indexing techniques that aim to confine the exploitation of the data to the smallest possible volume, thus avoiding an exhaustive examination whose cost is predictable but prohibitive; and system problems related to the physical organization of large volumes of data, such as disk access management or cache memory management, which require new techniques adapted to the characteristics of the descriptors and to the way they are used.
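As an illustration of indexing that confines the search to a small part of the data, here is a minimal sketch that is not taken from the team's work. It assumes two-dimensional descriptors and a uniform grid as a very coarse index; the function names, the `cell_size` parameter, and the sample data are all hypothetical. The indexed search inspects only the query's cell and its direct neighbours, so it may miss the true nearest neighbour when that neighbour lies further away.

```python
import math
from collections import defaultdict


def cell_of(vec, cell_size):
    """Map a 2-D descriptor to the grid cell that contains it."""
    return tuple(int(math.floor(x / cell_size)) for x in vec)


def build_index(descriptors, cell_size):
    """Group document ids by grid cell (hypothetical coarse index)."""
    index = defaultdict(list)
    for doc_id, vec in descriptors.items():
        index[cell_of(vec, cell_size)].append(doc_id)
    return index


def exhaustive_search(descriptors, query):
    """Baseline: compare the query with every descriptor."""
    return min(descriptors, key=lambda d: math.dist(descriptors[d], query))


def indexed_search(descriptors, index, query, cell_size):
    """Examine only the query's cell and its eight neighbours.

    Returns (best document id or None, number of descriptors examined).
    """
    cx, cy = cell_of(query, cell_size)
    candidates = [
        doc_id
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for doc_id in index.get((cx + dx, cy + dy), [])
    ]
    if not candidates:
        return None, 0
    best = min(candidates, key=lambda d: math.dist(descriptors[d], query))
    return best, len(candidates)


# Made-up sample data, for illustration only.
descriptors = {
    "a": (0.10, 0.10),
    "b": (0.12, 0.15),
    "c": (0.80, 0.85),
    "d": (0.55, 0.20),
    "e": (0.90, 0.10),
}
index = build_index(descriptors, cell_size=0.25)
best, examined = indexed_search(descriptors, index, (0.11, 0.13), 0.25)
print(best, examined, exhaustive_search(descriptors, (0.11, 0.13)))
```

On this sample, the indexed search compares the query with only the descriptors that fall in nearby cells, while the exhaustive baseline compares it with all of them. Real descriptors are high-dimensional, where a uniform grid no longer works well; the sketch only shows the principle.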

#### First topic of research: searching in large image bases.

Going from corpora of a few thousand images to corpora of a few million remains a research challenge today. The solution cannot come from descriptors alone, nor from new indexing techniques alone: it requires taking into account all the components of the system and how they fit together. We thus propose to work on:

• data description, especially in the case of compressed or watermarked images,

• indexing and search algorithms,

• database organization and use of the metadata,

• system and hardware support,

and on the coupling between these various techniques, in order to improve the performance of current systems in terms of both speed and recognition quality.

#### Second topic of research: towards more semantic search engines.

Search engines are widely used tools, but they often prove disappointing because of their syntactic, keyword-based approach. Natural language processing tools could, however, give them more semantic capabilities, by enabling word sense disambiguation or the recognition of different formulations of the same concept. It is thus advisable to combine these two techniques.

This combination is, however, not so simple. On the one hand, it requires providing search engines with query expansion strategies and then translating these expansions in terms of similarity. On the other hand, natural language processing tools must work in much broader environments than the ones in which they are usually used. The benefit of such a modification of the engines must also be established, which requires a precise evaluation of the results obtained.
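To make the idea of expanding a query and then expressing the expansion as a similarity more concrete, here is a minimal sketch under stated assumptions. The expansion table, its weights, and the function names are hypothetical; the weighting scheme (weight 1.0 for original terms, lower weights for related terms) and the use of cosine similarity over term frequencies are illustrative choices, not the method described in this report.

```python
import math
from collections import Counter

# Hypothetical expansion table: term -> [(related term, weight)].
# In practice such relations would come from language processing tools.
EXPANSIONS = {
    "car": [("automobile", 0.8), ("vehicle", 0.5)],
}


def expand_query(terms, expansions):
    """Build a weighted query: original terms get weight 1.0,
    related terms get their (lower) expansion weight."""
    weights = {}
    for term in terms:
        weights[term] = max(weights.get(term, 0.0), 1.0)
        for related, w in expansions.get(term, []):
            weights[related] = max(weights.get(related, 0.0), w)
    return weights


def cosine_similarity(query_weights, doc_terms):
    """Cosine between the weighted query and a term-frequency vector."""
    doc_vec = Counter(doc_terms)
    dot = sum(w * doc_vec[t] for t, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(c * c for c in doc_vec.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)


doc = "the automobile was repaired".split()
plain = cosine_similarity(expand_query(["car"], {}), doc)
expanded = cosine_similarity(expand_query(["car"], EXPANSIONS), doc)
print(plain, expanded)
```

Without expansion, the keyword query "car" shares no term with the sample document and scores zero; with expansion, the related term "automobile" contributes a non-zero similarity. Whether such expansions actually improve results is exactly what the evaluation mentioned above would have to establish.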

#### Third topic of research: multimedia and coupling between media.

We study media coupling in two ways. In the context of video, we are interested in descriptions that jointly use the sound and image tracks. Such techniques can be applied to automatic video structuring, but also to improving the detection and recognition of people, whether by their face or by their voice.

In addition, we study the coupling between text and image in documents where these two media are strongly linked, a common case in scientific bibliographic databases, on the web, in newspapers, in art books, and in technical documents. The goal is to connect, within the same document, each image and the text that refers to it. This should make it possible to obtain an automatic, semantic description of the images and to connect different documents, either by searching for visually similar images or by searching for texts dealing with the same subject, and thus to improve the description of the images and to resolve possible ambiguities in the understanding of the text.