Team TEXMEX

Section: New Results

Text Retrieval for Large Databases

Natural Language Processing and Machine Learning

Keywords : Natural Language Processing, Machine Learning, Information Retrieval, Corpus-Based Acquisition of Lexical Relations.

Participants : Laurent Amsaleg, Vincent Claveau, Antoine Doucet, Fabienne Moreau, Pascale Sébillot, Laurent Ughetto.

Our research in this field focuses on three points: the development of fully automatic and generic machine learning solutions, using both symbolic and statistical approaches, to extract from textual corpora the linguistic resources needed by a given application; the exploitation of lexical resources in information retrieval systems, with the aim of finding sound solutions to the problem of integrating them into these systems; and the search for systems that remain fast in spite of this integration.

During 2006, our work focused on the following three aspects:

  1. Machine-learning-based acquisition of morphological variants of words.

    Information retrieval systems (IRSs) are usually poor at recognizing the same idea when it is expressed in different forms. One way to improve them is to take morphological variants into account (such as compile/compilation/recompiling...). We proposed a simple yet effective method to recognize these variants, which are then used to enrich queries. It is based on an unsupervised machine learning approach that uses analogies to decide whether two words are morphologically related. In contrast to previously published methods, our system needs no external resources or a priori knowledge, and can therefore be applied to many languages. This new approach has been evaluated on several IR collections covering 6 different languages. The reported results show a significant and systematic improvement in overall IRS effectiveness, in terms of both precision and recall, for every language [18].

  2. Linguistic resources and information retrieval (IR).

    Our aim in this domain is to explore methods that enable information retrieval systems to capture the semantics of natural language texts, and to exploit the semantic information that natural language processing (NLP) techniques can automatically extract from textual documents. Currently, for example, systems based on Salton's vector space model (VSM) represent documents and queries as bags of words, without regard for the relationships (or even the linear order) between them. Fabienne Moreau's Ph.D. thesis [13] aims at adapting current IR models so that information gleaned from NLP can inform IR. This year has been dedicated to the following points:

    • development of a prototype of a linguistically informed IRS, based on the VSM, that integrates in parallel multiple kinds of linguistic knowledge belonging to the morphological, syntactic, or semantic levels of language. This tool has enabled us to demonstrate the actual benefit of several kinds of knowledge (especially morphological and semantic);

    • an original analysis of the correlations between the various linguistic indexes, in terms of their complementarity and redundancy for retrieving documents, which brought to light the relevance of some mono-level and some multi-level index couplings;

    • in order to automatically detect the best way to combine these kinds of information within an IRS, proposal of a supervised machine-learning technique (based on a neural network) that merges the lists of documents produced by each linguistic index and automatically adapts its behavior to the characteristics of the queries. This combination yields better performance than the most effective single index, especially in terms of stability;

    • adaptation to IRSs of our method for the acquisition of morphological variants, described above. Using its output to expand queries has led to high scores on various document collections and in several languages.

  3. Effective yet fast IR systems.

    Alongside this aim of effectiveness, we also study a second aspect of the quest for IRS performance: systems have to be relevant and provide better answers, but they also have to be fast and respond quickly to users, even when they are querying huge textual databases. Antoine Doucet joined TexMex in October 2006 for a one-year Inria post-doctoral position; his research aims at finding a way to merge the models that are best suited to integrating linguistic information with the best search algorithms developed within the field of text databases (TDB). In this domain, very efficient algorithms for IR on huge textual databases have been developed; they rely, however, on over-simplified models of linguistic knowledge.
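
To make the analogy idea of point 1 more concrete, here is a minimal Python sketch. It is not the method of [18]: as a simplifying assumption, an analogy is reduced to a shared suffix rewrite rule (compile is to compilation as derive is to derivation, since both replace "e" with "ation"), and a pair of words is accepted as morphologically related when at least min_support pairs of the vocabulary follow the same rule. All function names and the parameters min_stem and min_support are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def suffix_rule(a, b, min_stem=3):
    """Return the (suffix_a, suffix_b) pair left after stripping the common
    prefix of a and b, or None when that prefix is shorter than min_stem."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    if i < min_stem:
        return None
    return (a[i:], b[i:])


def related_pairs(vocabulary, min_support=2, min_stem=3):
    """Collect word pairs whose suffix rewrite rule is shared by at least
    min_support pairs (a crude stand-in for an analogy test)."""
    rules = defaultdict(list)
    for a, b in combinations(sorted(set(vocabulary)), 2):
        rule = suffix_rule(a, b, min_stem)
        if rule is not None:
            rules[rule].append((a, b))
    return {
        pair
        for pairs in rules.values()
        if len(pairs) >= min_support
        for pair in pairs
    }


def expand_query(query_terms, vocabulary, min_support=2, min_stem=3):
    """Add to the query the direct variants of its original terms."""
    expanded = set(query_terms)
    for a, b in related_pairs(vocabulary, min_support, min_stem):
        if a in query_terms:
            expanded.add(b)
        if b in query_terms:
            expanded.add(a)
    return expanded
```

With the toy vocabulary compile, compilation, compiler, derive, derivation, table, the query term compile is expanded with compilation, because the same rule is also attested by derive/derivation; compiler is left out since its rule is seen only once. No external lexicon is involved, which is the property the approach above relies on to remain language-independent.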

Intensive Use of Factorial Analysis for Text Mining: Indicators and Displays

Keywords : Correspondence Analysis, Visualization.

Participant : Annie Morin.

Textual data can easily be transformed into frequency tables, and any method that works on contingency tables can then be used to process them. Moreover, given the large amount of textual data available, we need convenient ways to process the data and extract valuable information. It appears that factorial correspondence analysis allows us to recover most of the information contained in the data. But even after processing, a large amount of material remains, and we need visualization tools to display it. We study the relevance of different indicators used to cluster the words on the one hand and the documents on the other, and we are concerned with the visualization of the outputs of factorial analysis: we need to help the user navigate the huge amount of information obtained and select the most relevant points. Most of the time, we do not pre-process the texts; in particular, there is no lemmatization.
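
As an illustration of how a frequency table feeds a factorial correspondence analysis, the following sketch uses only the Python standard library. It assumes a small table with words as rows and documents as columns (an arbitrary choice), computes the standardized residuals and the total inertia, and extracts only the first factorial axis by power iteration; a real analysis would rely on a full singular value decomposition from a numerical library. The function names are hypothetical.

```python
import math


def standardized_residuals(table):
    """Return (S, row_mass, col_mass) with S_ij = (p_ij - r_i c_j) / sqrt(r_i c_j).
    Assumes every row and every column has a nonzero total."""
    total = sum(sum(row) for row in table)
    row_mass = [sum(row) / total for row in table]
    col_mass = [sum(col) / total for col in zip(*table)]
    s = [
        [(n / total - r * c) / math.sqrt(r * c) for n, c in zip(row, col_mass)]
        for row, r in zip(table, row_mass)
    ]
    return s, row_mass, col_mass


def total_inertia(table):
    """Total inertia: sum of squared standardized residuals (chi-square / n)."""
    s, _, _ = standardized_residuals(table)
    return sum(x * x for row in s for x in row)


def first_axis(table, iterations=200):
    """Return (principal inertia, row coordinates) of the first axis,
    estimated by power iteration on S^T S."""
    s, row_mass, _ = standardized_residuals(table)
    n_rows, n_cols = len(s), len(s[0])
    v = [1.0 / (j + 1) for j in range(n_cols)]  # arbitrary start vector
    for _ in range(iterations):
        u = [sum(x * vj for x, vj in zip(row, v)) for row in s]  # u = S v
        w = [sum(s[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no association left to decompose
            return 0.0, [0.0] * n_rows
        v = [x / norm for x in w]
    u = [sum(x * vj for x, vj in zip(row, v)) for row in s]  # sigma * left vector
    inertia = sum(x * x for x in u)  # sigma squared
    row_coords = [ui / math.sqrt(r) for ui, r in zip(u, row_mass)]
    return inertia, row_coords
```

Comparing the inertia of the first axis with the total inertia gives one simple indicator of how much of the table a single axis summarizes. The row coordinates are the positions of the words along that axis, which is the kind of output a display tool would then have to present.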

Visualization and Web Mining

Keywords : Search Result Visualization, 3D Metaphors, Self-Organizing Maps, Interface Evaluation, Information Retrieval, Human-Computer Interfaces, Visual Categorization of Web Pages.

Participants : Nicolas Bonnel, Annie Morin.

This work deals with the dynamic generation of interactive 3D presentations of web search results. The issue here is how to effectively represent the results matching a query on a textual search engine. This research was carried out in the context of a thesis in cooperation with the R&D division of France Télécom (CIFRE grant) [10].

Textual information retrieval is one of the main tasks performed on the Web. It generally relies on search engines, which have become an essential Web tool. Indeed, users start at a search engine 88% of the time when they have a new task to complete on the Web. However, given the huge increase in the information available on the Web and the lack of significant evolution in the search process, the number of documents matching a query has become very large. It is therefore difficult for the user to interpret all these results effectively. This problem of representing search results is addressed through information user interfaces (IUI). Much work has been carried out on search result visualization over the past decade, without any real impact on the most popular user interfaces. Our approach focuses on the need to dynamically create interactive 3D presentations based on visualization metaphors adapted not only to the end user, but also to the task to be completed and to the data. Two main steps can be distinguished in our work [24].

The first step consists of effectively organizing the results of a web query. For this purpose, on-the-fly clustering methods are investigated, and only statistical and deterministic approaches are considered. More precisely, we focus on a particular unsupervised clustering method: self-organizing maps. This method makes it possible to cluster the results and to organize the clusters through a non-linear projection onto a predefined map. To adapt this method properly to our context, we examine many points, such as the distances to use or the weighting schemes to apply. Another important point is evaluating the quality of the proposed classification and organization.
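
The clustering step can be sketched with a very small self-organizing map in plain Python. This is a generic textbook variant, not the configuration studied in this work: it assumes Euclidean distance, a Gaussian neighborhood on a rectangular grid, linearly decaying learning rate and radius, and a seeded random initialization so that runs are reproducible. The function names and all parameters are hypothetical; each input vector stands for one search result described by term weights in [0, 1].

```python
import math
import random


def best_unit(weights, x):
    """Grid position of the unit whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda u: sum((wk - xk) ** 2 for wk, xk in zip(weights[u], x)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map and return {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # decaying neighborhood radius
        for x in rng.sample(vectors, len(vectors)):  # shuffled pass over the data
            bmu = best_unit(weights, x)
            for unit, w in weights.items():
                d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])  # pull unit toward x
    return weights


def assign(weights, vectors):
    """Map each input vector to the grid cell of its best matching unit."""
    return [best_unit(weights, x) for x in vectors]
```

After training, assign returns one grid cell per result; results falling in the same cell form a cluster, and neighboring cells tend to hold similar results, which is what allows the map to serve as a layout. Swapping the squared Euclidean distance in best_unit for another measure is one way to experiment with the distance question raised above.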

The second step concerns the visualization of the organized search results. The goal is to define cognitive 3D visualization metaphors that allow a richer spatial representation, one that efficiently and effectively helps users in their tasks. Various interactive and adaptive metaphors are proposed, the main one being based on the concept of a city. This metaphor enables the user to interact with web search results, which are represented in a 3D virtual city and laid out on the city ground according to the self-organizing map. All the proposed interfaces are hybrid (i.e. composed of a 2D part and a 3D scene) and integrated into the SmartWeb prototype, which uses the X-VRML language to design 3D visualization metaphors effectively and to generate interactive 3D content automatically.

A user study of this interface was carried out, and an evaluation framework for IUI is proposed so that this kind of interface can be evaluated properly [22], [21]. Some key points for designing successful 3D metaphors are also discussed [51].

Finally, our approach can be considered a post-processing step for a search engine. Effective indexing and an effective retrieval process are of course necessary, and constitute work complementary to this one [52].

