Section: New Results
Document Mining and Information Retrieval
XML Document Mining
Keywords : Classification, Clustering, Data Mining, Mining Complex Data, XML Document, XML mining, INEX.
Participants : Thierry Despeyroux, Yves Lechevallier, Anne-Marie Vercoustre, Sergiu Chelcea, Florent Masseglia, Brigitte Trousse.
XML documents are becoming ubiquitous because of their rich and flexible format that can be used for a variety of applications. Standard methods have been used to classify XML documents, reducing them to their textual parts. These approaches do not take advantage of the structure of XML documents that also carries important information.
In 2005 we carried out two lines of research on XML document mining, both published this year:
First, we developed a new representation model for clustering XML documents and described an approach that works for both structure-based and structure-and-content clustering. Results from several experiments using the INEX (Initiative for the Evaluation of XML Retrieval) IEEE collections and the INRIA activity reports were published this year.
Second, we proposed an original supervised classification technique for XML documents, based on a linearization of their structural information and on a characterization of each cluster in terms of frequent sequential patterns. Experiments on the MovieDB collection validated the effectiveness of our approach.
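The linearization step can be illustrated with a minimal sketch: an XML tree is flattened into the depth-first sequence of its tags, which sequential-pattern mining can then operate on. This is only an illustrative simplification; the exact encoding used in our technique may differ (e.g. it may also record closing tags or attributes), and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

def linearize(xml_text):
    """Flatten an XML tree into the depth-first sequence of its tag names."""
    root = ET.fromstring(xml_text)
    return [elem.tag for elem in root.iter()]

# Hypothetical MovieDB-like document
doc = "<movie><title>Alien</title><cast><actor/><actor/></cast></movie>"
print(linearize(doc))  # ['movie', 'title', 'cast', 'actor', 'actor']
```

Frequent sub-sequences of such tag sequences can then characterize each class of documents.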
Finally, we contributed a chapter on XML document mining to the book "Data Mining Patterns: New Methods and Applications", which has been accepted for publication.
Moreover, as in 2005, we studied the impact of selecting different parts (sub-structures) of XML documents for specific clustering tasks. Our approach combines techniques for extracting representative words from document elements with the 2-3 HAC algorithm (cf. section 6.3.6) for classifying documents. We evaluated it on the same collection of INRIA XML activity reports (year 2003) used previously. Our first objective was to study how the choice of document sub-structures and of two different distances affects the resulting 2-3 hierarchies, and then to compare these hierarchies with INRIA's research organization. We showed that the quality of the clustering strongly depends on the selected document features, as in our earlier experiments, and also on the distance used. Indeed, with the classical Euclidean distance on word frequencies, many teams appeared atypical, such as the Epidaure team in Figure 9, due for example to the heavy use of the word ``imaging'' in its presentation. The Jaccard distance, by contrast, reduced the influence of overly general frequent words (such as ``applications'' and ``computer''). Our second objective was to compare the gains of the 2-3 AHC algorithm over the classical AHC on this application; for more details, see section 6.3.6.
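The effect of the two distances can be illustrated on toy word-frequency profiles. The data below is hypothetical, and the actual distances used in the experiments may be defined on normalized frequencies; the point is only that Euclidean distance on raw counts is dominated by one heavily used word, while the Jaccard distance, computed on word sets, is not.

```python
import math

def euclidean(u, v):
    """Euclidean distance on raw word-frequency vectors."""
    words = set(u) | set(v)
    return math.sqrt(sum((u.get(w, 0) - v.get(w, 0)) ** 2 for w in words))

def jaccard(u, v):
    """Jaccard distance on word sets: frequencies no longer dominate."""
    a, b = set(u), set(v)
    return 1 - len(a & b) / len(a | b)

# Hypothetical team profiles: one team uses ``imaging'' very heavily.
team_a = {"applications": 10, "computer": 8, "imaging": 50}
team_b = {"applications": 12, "computer": 9, "clustering": 3}

print(euclidean(team_a, team_b))  # dominated by the ``imaging'' count
print(jaccard(team_a, team_b))    # 1 - 2/4 = 0.5
```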
Entity Extraction From XML Documents
Keywords : Entity Extraction, Wrapping, named entities, semantic annotations.
Participants : Anne-Marie Vercoustre, Thierry Despeyroux, Eduardo Fraschini.
In order to improve the reliability and accessibility of document-based information systems, we need to develop tools that exploit both the structure of the documents (beyond the DTD model, or the actual tree structure) and their content, i.e. the textual part. We have contributed to the development of http://Ralyx.inria.fr/ , a system that uses the structure of the INRIA activity reports to provide multiple views on these XML reports and supports the exploitation of this rich information source by both internal and external parties. However, these reports mostly contain text that is difficult to exploit without further document mining. We need to enrich their textual content with specific semantic annotations that can be used both by validation tools and in more targeted queries.
This year we experimented with extracting entity names (organisation names) from these activity reports. In this task the XML structure is used only to select specific parts of the reports (e.g. the "contract" sections) in order to identify and extract our partner organisations.
Our approach is inspired by techniques for extracting information from regular Web pages (wrapper induction) and requires neither natural-language resources, nor large training collections, nor manual tuning. The main idea is to use a small list of known organisations (e.g. extracted from a few documents) as a seed to generate extraction patterns for new names. We generated only very generic patterns, based on the syntagms of the language as identified by standard taggers. The patterns were trained on a subset of documents for which the entities had been manually extracted; half of the training set was used to validate the approach with standard recall and precision measures. Using such generic patterns, we could not expect very high precision, but we were able to extract many new names that were not known in advance. The experiments and their results have been accepted for publication. The next directions to explore are:
pattern generation without training,
weighting the patterns both locally (within one document) and globally,
dynamically improving the initial list of entities by reusing the list produced the previous year.
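The seed-based pattern generation can be sketched minimally as follows. The seed list and corpus are hypothetical, and the real system generalises over syntactic tags produced by a standard tagger rather than raw surface words; the sketch only shows the core idea of turning each seed occurrence into a generic pattern that then extracts previously unknown names.

```python
import re

SEEDS = {"CNRS", "INRIA"}  # small seed list of known organisations (illustrative)

def learn_patterns(texts):
    """Turn each seed occurrence into a generic extraction pattern.

    Here we only generalise the immediate left-context word; the seed
    itself is replaced by a slot matching any capitalised token.
    """
    patterns = set()
    for text in texts:
        for seed in SEEDS:
            for m in re.finditer(r"(\w+)\s+" + re.escape(seed), text):
                patterns.add(re.compile(m.group(1) + r"\s+([A-Z][\w-]+)"))
    return patterns

def extract(texts, patterns):
    """Apply every learned pattern and collect the candidate names."""
    found = set()
    for text in texts:
        for pat in patterns:
            found.update(pat.findall(text))
    return found

train = ["A contract with CNRS was signed.", "Joint work with INRIA continues."]
test_docs = ["A new contract with Alcatel was signed last year."]
pats = learn_patterns(train)
print(extract(test_docs, pats))  # {'Alcatel'}
```

As in the experiments, such generic patterns over-generate (hence the modest precision) but recover names absent from the seed list.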
Document Mining for Scientific and Technical Watch
Keywords : XML Document, mapping, scientific watch, clustering.
Participants : Reda Kabbaj, Mustapha Eddahibi, Bernard Senach, Brigitte Trousse.
Research institutes are more and more involved in applying for grants and other support. As the number of funding opportunities grows, they need more and more resources to watch calls for tender, identify current opportunities, and route them to the relevant research teams. There is currently little tool support for this task, so calls for tender may be routed to the wrong research teams or, conversely, a competent team may not receive an invitation to tender and therefore misses an opportunity. The aim of our "Mapping bidirectionnel AO-ER" system is to provide such a tool. Our methodology relies mainly on a bottom-up approach, based on text mining, to classify documents and qualify research teams. We will explore the use of ontologies to improve this approach. The main contributions are:
the use of text mining to describe research teams and calls for tender;
a mapping method based on classification;
a generic architecture independent of specific data;
The system is based on automatic classification (supervised and unsupervised) of two types of documents: DDAO ("Documents décrivant les Appels d'Offres"), the textual descriptions of the invitations to tender, and DDER ("Documents décrivant les Equipes de Recherche"), the textual descriptions of the research teams. The system, which relies on the K-means algorithm, comprises four components:
a pre-processing module, with an information selection mechanism that captures the specific terms of the domain,
an indexation module, providing a data structure for fast access to the DDAO and DDER documents,
a knowledge module represents the knowledge extracted from the documents,
a mapping module, which associates research teams with calls for tender (in both directions) and supports classification.
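The core of the mapping idea can be sketched with a toy K-means over bag-of-words vectors: calls (DDAO) and team descriptions (DDER) that fall into the same cluster are matched in both directions. The documents below are hypothetical, and the real system's pre-processing and indexation are far richer; this is a minimal sketch of the clustering-based mapping only.

```python
import random
from collections import Counter

def vectorize(text, vocab):
    """Bag-of-words vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return tuple(counts[w] for w in vocab)

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iters=10, seed=0):
    """Plain K-means returning one cluster label per vector."""
    centroids = random.Random(seed).sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sqdist(centroids[c], v))
                  for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return assign

# Hypothetical toy corpus: two calls (DDAO) and two team descriptions (DDER).
docs = {
    "call-genomics": "genome sequencing biology data",
    "call-networks": "wireless network routing protocols",
    "team-bio":      "biology genome analysis data",
    "team-net":      "network protocols wireless routing mobile",
}
vocab = sorted({w for t in docs.values() for w in t.lower().split()})
labels = kmeans([vectorize(t, vocab) for t in docs.values()], k=2)
clusters = dict(zip(docs, labels))
# Documents sharing a cluster label are matched: each call is routed to the
# team(s) in its cluster, and vice versa.
```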
Initially, the system will be applied to ANR's calls for tender and then extended to other calls for grants.
Web HTML Pages Clustering For Ontology Construction
Keywords : ontology construction, clustering, Web pages.
Participant : Marie-Aude Aufaure.
We proposed two approaches for ontology construction from web pages. The first one is based on a contextual and incremental clustering of terms described here, while the second one (cf. section 6.2.4 ) designs a knowledge base for learning ontologies from the web.
Our first approach defines and evaluates a context-based clustering algorithm for ontology learning (the COCE algorithm), included in a global architecture for knowledge discovery for the Semantic Web. The algorithm is based on an incremental use of the K-means partitioning algorithm and is guided by a structural context, derived from the HTML structure and the location of words in the documents; this context is deduced from the analyses performed in the pre-processing step (structural and linguistic analysis). The contextual representation guides the clustering algorithm to delimit the context of each word by improving the word weighting, the word-pair similarity, and the selection of the semantically closest co-occurrents of each word. By performing an incremental process and recursively dividing each cluster, the COCE algorithm refines the context of each word cluster and improves the conceptual quality of the resulting clusters, and consequently of the extracted concepts. The COCE algorithm offers the choice between fully automatic execution and interactive execution with the user. We experimented with the contextual clustering algorithm on an HTML document corpus in the tourism domain (in French) and evaluated the ontological concepts it extracted. The results show that an appropriate context definition and the successive refinement of clusters improve the relevance of the extracted concepts compared with a simple K-means algorithm.
Formal Concept Analysis and Semantics for Contextual Information Retrieval
Keywords : formal concept analysis, semantics, contextual information retrieval.
Participant : Marie-Aude Aufaure.
In this work, we define an information retrieval methodology that uses Formal Concept Analysis in conjunction with semantics to provide contextual answers to Web queries. The conceptual context can be global, i.e. stable, or instantaneous, i.e. bounded by the global context. Our methodology consists of a pre-treatment that builds the global conceptual context, followed by online contextual processing of users' requests, each associated with an instantaneous context. The pre-treatment computes a concept lattice from tourism Web pages in order to build the overall conceptual context; each concept of the lattice corresponds to a cluster of Web pages with common properties. The terms describing each page are matched against a tourism thesaurus in order to label each concept in a standardized way. Whereas the processing of tourism Web pages is done offline, information retrieval is performed in real time: users formulate their query with terms from the thesaurus, this set of terms is compared with the concept labels, and the best-matching concepts are returned. Users may then navigate within the lattice, generalizing or, on the contrary, refining their query.
This method has several advantages:
Results are provided according to both the context of the query and the context of available data. For example, only query refinements corresponding to existing tourism pages are proposed;
The added semantics can be chosen depending on the target user(s);
More powerful semantics can be used, in particular ontologies. This allows enhanced query formulation and provides more relevant results.
Our information retrieval process is illustrated through experimental results in the tourism domain. One benefit of our approach is a more relevant and refined information retrieval, closer to users' expectations.
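The offline lattice construction can be sketched on a toy formal context, where each Web page (object) is described by thesaurus terms (attributes) and a formal concept is a maximal (extent, intent) pair. The pages and terms are hypothetical, and real FCA implementations use dedicated algorithms such as NextClosure rather than the brute force shown here.

```python
from itertools import combinations

# Hypothetical formal context: tourism Web pages described by thesaurus terms.
context = {
    "page1": {"hotel", "beach"},
    "page2": {"hotel", "spa"},
    "page3": {"beach", "diving"},
}
ALL_TERMS = set().union(*context.values())

def common_terms(pages):
    """Intent: terms shared by all the given pages."""
    return set.intersection(*(context[p] for p in pages)) if pages else set(ALL_TERMS)

def pages_with(terms):
    """Extent: pages carrying all the given terms."""
    return {p for p, ts in context.items() if terms <= ts}

def concepts():
    """Enumerate all formal concepts (extent, intent) by brute force."""
    seen, out = set(), []
    for r in range(len(context) + 1):
        for combo in combinations(sorted(context), r):
            intent = common_terms(combo)
            extent = pages_with(intent)
            key = (frozenset(extent), frozenset(intent))
            if key not in seen:
                seen.add(key)
                out.append((extent, intent))
    return out

lattice = concepts()
# e.g. the concept ({'page1', 'page2'}, {'hotel'}) clusters the two hotel
# pages; a query for "hotel" would be matched against such concept labels.
```

Navigation then amounts to moving to super-concepts (generalizing the query) or sub-concepts (refining it) in this lattice.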
Web Pages Mining for Improving Search Engines
Keywords : ranking criteria, search engine, Web, query formulation.
Participants : Thierry Despeyroux, Yves Lechevallier, Florent Masseglia, Bernard Senach, Doru Tanasa, Brigitte Trousse, Anne-Marie Vercoustre.
Motivated by our 2005 work in the context of the e-Mimetic project (confidential status), we pursued research in collaboration with E. Boutin from the LePont laboratory of the University of South Toulon and M. Nanard from the IHMH team of LIRMM. Our goal was to define and evaluate new ranking criteria for Web pages based on page presentation. For this we used a test collection of Web pages returned by different search engines in response to a specific set of queries. The pre-processing and clustering tasks allowed us to make methodological propositions for solving emerging sub-problems such as: