Inria / Raweb 2003
Project: TEXMEX

Section: New Results

Text Retrieval in Large Databases

Natural Language Processing and Machine Learning

Keywords : natural language processing, machine learning, lexical semantics, corpus-based acquisition of lexical relations, semi-supervised learning, inductive logic programming, hierarchical classification.

Participants : Vincent Claveau, Fabienne Moreau, Mathias Rossignol, Pascale Sébillot.

[8] [22] [23] [15]

The general framework of our work is described in 3.2.3.

In 2003, our work focused on the following three points:

  1. asares: two semi-supervised versions combining symbolic and statistical approaches.

    asares is our system based on inductive logic programming (ILP), using the aleph algorithm. From a set of positive and negative examples of elements in a given relation, it automatically infers morpho-syntactic and semantic patterns that characterize this relation and that can be applied to a corpus to extract new elements in the same relation. One such relation holds between noun-verb (N-V) pairs in which the verb plays, for the noun, one of the roles defined in the qualia structure of Pustejovsky's Generative Lexicon model; we call these N-V qualia pairs, and pairs in which the verb plays no such role N-V non-qualia pairs. [15] fully describes this system and its application to the extraction of N-V qualia pairs; it also presents the refinement operator that we have built, which is well adapted to the hierarchical knowledge we deal with and lets us traverse our hypothesis search space efficiently. However, the supervised nature of ILP limits the automation of the system and its easy application to a new corpus. We have thus proposed two semi-supervised versions of asares that combine, in two different ways, a statistical approach (in which N-V qualia pairs are treated as one special kind of cooccurrence) and the ILP approach. The first, sequential combination is presented in [22], whereas the second, which integrates the two techniques more deeply, is described in [23]. Both solutions lead to the same results as the supervised version of asares while keeping the advantages of the two approaches they mix: the robustness and automation of the statistical method, and the quality of results and expressiveness of the symbolic one.

  2. Acquisition of semantic lexicons based on Rastier's differential semantics: automatic generation of sets of keywords for topic characterization and detection.

    We have completed the development of a sequence of statistical data analysis treatments that refines and enriches the results of an initial LLA (likelihood linkage analysis) classification of the words of a given corpus, based on their distribution over its paragraphs. This sequence yields, fully automatically and without any prior information, sets of keywords that characterize the main topics of the (morpho-syntactically tagged) corpus. Each class can then be used to detect the presence of its topic in any paragraph of the corpus by a simple keyword cooccurrence criterion. The sets obtained enable us to split an initial non-specialized corpus into several topic-specific ones, and thus provide the linguistic material needed for the second step in building semantic lexicons based on Rastier's principles, i.e. the automatic constitution of semantic classes within homogeneous topics. We have begun work in this direction. To achieve this goal, once again without any human intervention or external data, and thus possibly favoring precision over recall, we are considering statistical techniques that group words appearing in similar contexts. We plan to take into account different context lengths for different word categories, and to consider the relative positions of contextual elements. This idea calls for the definition of a non-symmetric similarity measure to build the semantic classes automatically.

  3. Linguistic resources and information retrieval (IR).

    We have evaluated the value of N-V qualia relations for query expansion in an information retrieval system (IRS). More precisely, we used Salton's IRS smart and the data of the Amaryllis IRS evaluation campaign, with asares learning qualia pairs from one Amaryllis corpus. Our experiments have shown [8] that expanding a query with verbs that play a qualia role for the nouns it contains improves the results of smart locally but significantly. In particular, the relevance of the first ten documents increases, a result that is particularly interesting given the way search engines are commonly used. Moreover, Fabienne Moreau began a PhD thesis in October that explores ways of extending Salton's vector space model (VSM) to improve its ability to capture the semantics of natural language texts. Currently, under Salton's theory, documents are represented as sets of features, without regard for the relationships between individual terms. The goal of this thesis is to adapt the VSM so that information gained from natural language processing can inform IR.
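
As an illustration of the keyword cooccurrence criterion mentioned in point 2, the following minimal sketch flags a topic as present in a paragraph when enough distinct keywords of its class occur together. The function name, the `min_hits` threshold of two keywords, and the sample keyword classes are assumptions made for this example; the report does not specify the exact criterion used.

```python
def detect_topics(paragraph_tokens, topic_keywords, min_hits=2):
    """Return the topics whose keyword class has at least `min_hits`
    distinct keywords co-occurring in the paragraph.

    `min_hits` is a hypothetical threshold, not a value from the report.
    """
    present = set(paragraph_tokens)
    return [
        topic
        for topic, keywords in topic_keywords.items()
        if len(present & set(keywords)) >= min_hits
    ]


# Hypothetical keyword classes, for illustration only.
topics = {
    "sport": {"match", "team", "goal"},
    "finance": {"bank", "market", "stock"},
}
paragraph = "the team scored a goal before the market closed".split()
detected = detect_topics(paragraph, topics)
```

Requiring several distinct keywords rather than one is a simple way to trade recall for precision; with `min_hits=1` the paragraph above would also be assigned to the "finance" class.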
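
The query expansion experiment of point 3 can be pictured with the small sketch below. It assumes a hypothetical table of N-V qualia pairs and a plain bag-of-words cosine similarity; it reproduces neither the term weighting of smart nor the pairs actually learnt by asares, and all names and data are illustrative.

```python
import math
from collections import Counter

# Hypothetical N-V qualia pairs (not taken from the Amaryllis data).
QUALIA = {"book": ["read", "write"], "door": ["open", "close"]}


def expand_query(terms, qualia=QUALIA):
    """Append to the query the verbs linked by a qualia relation to its nouns."""
    expanded = list(terms)
    for term in terms:
        for verb in qualia.get(term, []):
            if verb not in expanded:
                expanded.append(verb)
    return expanded


def cosine(a, b):
    """Cosine similarity between two bags of words given as token lists."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document indices sorted by decreasing similarity to the query."""
    return sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)


docs = [["door", "paint"], ["read", "novel"], ["book", "read"]]
plain = rank(["book"], docs)
expanded = rank(expand_query(["book"]), docs)
```

In this toy collection, the document that mentions only "read" shares no term with the original query and becomes retrievable only after expansion, which is the kind of effect on the top of the ranking that the experiment measures.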

Visualization and Web Mining

Participants : Nicolas Bonnel, Annie Morin.

This is joint work with France Telecom R&D (cf. 7.1.2).

Nicolas Bonnel, a second-year PhD student, is currently working on the dynamic generation of interactive 2D and 3D multimedia presentations that represent the results of a database search. N. Bonnel has a Cifre contract with France Telecom (FT), and his thesis is carried out in cooperation with FT. To this end, he uses metaphors developed by France Telecom and works on the relevance of document descriptors and on improving the graphical representation. This requires evaluating the quality of the representations and taking the user profile into account to optimize the results of a query.

Knowledge Extraction and Visualization from Textual Databases

Keywords : exploratory data analysis.

Participant : Annie Morin.

Extracting knowledge from textual databases is not straightforward. The methods used include factorial analysis, neural networks, and Kohonen maps. R. Priam's PhD thesis [10] proposes an adaptation of Kohonen maps to discrete data and develops a new algorithm called CASOM, for Correspondence Analysis and Self-Organizing Maps. CASOM is a non-linear extension of correspondence analysis that allows a graphical representation of words and documents.
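
For readers unfamiliar with correspondence analysis, the sketch below computes the chi-square distance between the profiles of two rows of a document-by-word contingency table, which is the metric that standard correspondence analysis uses to compare documents. It is only background illustration: it does not implement CASOM, and the function name and the sample table are assumptions.

```python
def chi2_distance(table, i, j):
    """Chi-square distance between the profiles of rows i and j of a
    contingency table (rows = documents, columns = words).

    Rows i and j are assumed to have non-zero totals.
    """
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    # Relative weight (mass) of each column in the whole table.
    col_mass = [sum(row[k] for row in table) / total for k in range(n_cols)]
    ri, rj = sum(table[i]), sum(table[j])
    return sum(
        (table[i][k] / ri - table[j][k] / rj) ** 2 / col_mass[k]
        for k in range(n_cols)
        if col_mass[k] > 0
    ) ** 0.5


# Hypothetical word counts for three documents over three words.
table = [
    [2, 0, 1],
    [4, 0, 2],
    [0, 3, 1],
]
same_profile = chi2_distance(table, 0, 1)
different_profile = chi2_distance(table, 0, 2)
```

Because the distance is computed on profiles (counts divided by row totals), the first two documents, which use the same words in the same proportions, are at distance zero even though one is twice as long as the other.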