Section: New Results
Data Transformation and Knowledge Management in KDD
Keywords: Feature selection, Mutual information, Entropy, K-nearest neighbor, Spectrometry.
Participant: Fabrice Rossi.
Feature selection is an extremely important part of any data mining process. Selecting relevant features for a predictive task (classification or regression) enables, for instance, specialists of the field to discover dependencies between the target variables and the input variables, which in turn lead to a better understanding of the data and of the problem. Moreover, the performance of predictive models is generally higher on well-chosen feature sets than on the original one, as the selection process tends to filter out irrelevant or noisy variables and reduces the effect of the curse of dimensionality.
We have been working since 2004 on the application of feature selection methods to spectrometric data. This type of data raises specific challenges, as it corresponds to a small number of spectra (a few hundred) described by a very high number of correlated spectral variables (up to several thousand). We have shown how a recently proposed high-dimensional estimator of the mutual information could be used, together with a forward-backward search procedure, to select relevant spectral variables in nonlinear regression problems.
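As a rough illustration of the selection principle, the sketch below ranks candidate variables by their estimated mutual information with the target and greedily adds the best ones. It is a deliberately simplified stand-in: the actual work relies on a k-nearest-neighbour (Kraskov-style) estimator suited to high dimensions and on a full forward-backward search over variable subsets, whereas this toy version uses a simple histogram estimate and scores variables individually. All names here are illustrative, not from the original work.

```python
# Minimal sketch: histogram-based mutual information + greedy forward selection.
# The real procedure uses a k-NN estimator and forward-backward subset search.
import math
from collections import Counter

def binned_mi(xs, ys, bins=8):
    """Crude histogram estimate of the mutual information I(X; Y) in nats."""
    def discretize(vs):
        lo, hi = min(vs), max(vs)
        w = (hi - lo) / bins or 1.0          # avoid zero width for constants
        return [min(int((v - lo) / w), bins - 1) for v in vs]
    dx, dy = discretize(xs), discretize(ys)
    n = len(xs)
    px, py, pxy = Counter(dx), Counter(dy), Counter(zip(dx, dy))
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def forward_select(features, target, k=3):
    """Greedily pick the k variables with highest estimated MI with the target."""
    selected, remaining = [], list(range(len(features)))
    for _ in range(k):
        best = max(remaining, key=lambda j: binned_mi(features[j], target))
        selected.append(best)
        remaining.remove(best)
    return selected
```

An informative variable (here, one functionally related to the target) scores well above a constant or noisy one, which is exactly what lets the selection step discard irrelevant spectral variables.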
We have also started to combine our work on functional data analysis with our feature selection research (see section 6.3.3 for details).
Viewpoint Management for Annotating a KDD Process
Keywords: viewpoint, complex data mining, annotation, metadata.
Participants: Hicham Behja, Brigitte Trousse.
This work was performed in the context of H. Behja's Ph.D. (France-Morocco Cooperation - Software Engineering Network).
Our goal is to make explicit the notion of "viewpoint" held by analysts during their activity, and to propose a new approach integrating this notion into a multi-view Knowledge Discovery from Databases (KDD) analysis. We define a viewpoint in KDD as an analyst's perception of a KDD process, relative to his or her own knowledge. The KDD process involves various kinds of knowledge, which makes it complex. Our purpose is to facilitate both the reusability and the adaptability of a KDD process, and to manage this complexity by storing the viewpoints of past analyses. The KDD process is considered as a view generation and transformation process annotated by metadata related to its semantics.
In 2004 and 2005, we started with an analysis of the state of the art and identified three directions: 1) the use of the viewpoint notion in the knowledge engineering community, including object languages for knowledge representation, 2) modelling a KDD process with a semantic-web-based approach, and 3) annotating a KDD process. We then designed and implemented an object platform (design patterns and UML, using Rational Rose) for KDD integrating the definitions of viewpoints. This platform uses the Weka library and contains our conceptual model integrating the ``viewpoint'' concept, together with an ontology for the KDD process. This ontology is composed of original components that we propose for the pre-processing step and of other components based on the DAMON ontology for the data mining step. For the ontology, we used the Protégé-2000 system.
This year we propose a new metadata format to annotate the KDD process, so that expert analyses, based on the experts' preferences and formalized as analysts' ``viewpoints'', can be reused. Secondly, in order to facilitate the management and the use of our scheme in a complete KDD analysis, we propose an object-oriented framework that integrates specializable ``viewpoints'' and reusable components. The proposed model is based on use cases to annotate the KDD process in terms of viewpoints, and on the systematic use of design patterns to comment on and justify design decisions. Our approach provides object-oriented models for the KDD process, characterized by its complexity, and allows the capitalization of corporate objects for KDD.
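To fix intuitions, one can imagine the viewpoint/annotation pairing as two small object classes: a viewpoint capturing who the analyst is and what they assume, and an annotation attaching that viewpoint to a KDD step together with a rationale. This is a purely hypothetical sketch; the class and field names below are our own guesses for illustration and are not the framework's actual API.

```python
# Hypothetical object model for "viewpoint" annotations of a KDD process.
# Names and fields are illustrative only, not the actual framework design.
from dataclasses import dataclass, field

@dataclass
class Viewpoint:
    analyst: str                              # who perceives the KDD process
    focus: str                                # e.g. "pre-processing", "data mining"
    assumptions: list = field(default_factory=list)

@dataclass
class StepAnnotation:
    step: str                                 # KDD step being annotated
    viewpoint: Viewpoint                      # perspective justifying the choice
    rationale: str                            # why this design decision was taken

# Usage: annotating a cleaning step from a particular analyst's viewpoint.
vp = Viewpoint("H. Behja", "pre-processing", ["outliers are measurement noise"])
note = StepAnnotation("remove-outliers", vp, "robustness of downstream models")
```

Storing such annotations alongside a process is what would let a later analysis retrieve and reuse the decisions made under a compatible viewpoint.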
Cluster Interpretation Process Metamodel based on a Clustering Ontology
Keywords: metadata, XQuery, cluster interpretation, RDF, Dublin Core, PMML.
Participants: Abdourahamane Baldé, Yves Lechevallier, Brigitte Trousse, Marie-Aude Aufaure.
This work was conducted in the context of A. Baldé's Ph.D.
The main goal of this thesis is to help end-users interpret, automatically, the results of their clustering methods. This work thus addresses the last step of the KDD process (post-processing) and its anticipation from the data mining step.
In 2005, we designed this process using a metadata model as one solution to this problem. Then, we implemented this model with the dynamic clustering algorithm (SClust) developed in the AxIS project.
In 2006, we began by applying our approach in the Weka software (Weka is a collection of machine learning algorithms for data mining tasks; it contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization) and we showed that our approach is very helpful in this context.
Based on our previous study, this year we propose a new metamodel for the cluster interpretation process, based on our clustering ontology (cf. Figure 4). A metamodel is an explicit model of the constructs and rules needed to build specific models within a domain of interest. The main interest of this metamodel is to make explicit the main concepts used in the interpretation domain, and the relationships between them.
This metamodel, mainly inspired by and based on the Common Warehouse Metamodel (CWM), allows us to elaborate the automatic interpretation process. We then defined some interpretation scenarios in our tool. Our minimal ontology of the clustering domain helps us define these scenarios. This ontology was constructed using the Protégé-2000 software.
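To give a concrete (and entirely hypothetical) flavour of how an ontology can drive an interpretation scenario, the sketch below encodes a few clustering concepts and the interpretation step attached to each, then assembles a scenario by walking the concept graph. The concept names and steps are our own illustrations; the actual ontology is built in Protégé-2000.

```python
# Hypothetical sketch: a tiny "clustering ontology" as a concept graph, used to
# assemble an interpretation scenario for a clustering result. Concept names
# and interpretation steps are illustrative only.
ontology = {
    "Cluster":     {"described_by": ["Prototype", "Size", "Compactness"]},
    "Prototype":   {"interpreted_with": "compare the prototype to the global mean"},
    "Size":        {"interpreted_with": "report the cluster cardinality"},
    "Compactness": {"interpreted_with": "report the within-cluster inertia"},
}

def interpretation_scenario(concept="Cluster"):
    """Collect the interpretation steps attached to the features of a concept."""
    return [ontology[feature]["interpreted_with"]
            for feature in ontology[concept].get("described_by", [])]
```

The point of routing the scenario through the ontology, rather than hard-coding it, is that adding a concept (with its interpretation step) automatically extends every scenario that covers it.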
We experimentally validated this new approach on Weka and on the SClust algorithm.
Our main contributions can be summarized as follows:
construction of a clustering ontology, supporting the automation of the interpretation process. This ontology is used to define various interpretation scenarios,
creation and implementation of a metamodel based on this ontology,
extension of our metadata architecture based on the Saxon processor.
Knowledge Base For Ontology Learning
Keywords: ontology acquisition, ontology learning, knowledge base.
Participant: Marie-Aude Aufaure.
Many approaches dedicated to ontology extraction have been proposed in recent years. They are based on linguistic techniques (using lexico-syntactic patterns), on clustering techniques, or on hybrid techniques. However, no consensus has emerged for this rather difficult task. This is likely due to the fact that ontology construction relies on many dimensions, such as the usage of the ontology, the expected ontology type, and the actors to whom this ontology is dedicated.
Knowledge extraction from web pages is a complex process, ranging from data cleaning to the evaluation of the extracted knowledge. The main process is web mining.
We proposed two approaches for ontology construction from web pages. The first one (cf. section 6.5.4 ) is based on a contextual and incremental clustering of terms, while the second one designs a knowledge base for learning ontologies from the web.
Indeed, the objective of our second approach is to build a knowledge base for ontology learning from web pages. This knowledge base is specified using a metaontology, which contains the knowledge related to the task of domain knowledge extraction. Our architecture is based on ontological components, defined by the metaontology and related to the content, the structure, and the services of a given domain. In this architecture, we specify three ontologies: the domain ontology, the structure ontology, and the services ontology. These components are interrelated. For example, the relation between the domain ontology and the services ontology is useful to determine the set of concepts and relations identifying each service. Our ontology learning approach is based on a synthesis of the research work in this field. A prototype has been developed and experiments have been carried out in the tourism domain (cf. section 4.4).
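The interrelation between the three components can be sketched very simply: each service is linked to the domain concepts and relations it needs, so resolving a service against the domain ontology yields exactly the concept set mentioned above. The identifiers below (a tourism-flavoured toy, since the experiments were in the tourism domain) are our own and not those of the prototype.

```python
# Illustrative sketch of the three interrelated ontological components
# (domain, structure, services). All identifiers are hypothetical.
domain_ontology = {"Hotel": ["hasPrice", "locatedIn"], "City": ["hasName"]}
structure_ontology = {"page": ["title", "table", "list"]}
services_ontology = {
    # each service is linked to the domain concepts and relations it involves
    "book-room": {"concepts": ["Hotel", "City"], "relations": ["locatedIn"]},
}

def concepts_for_service(service):
    """Use the service/domain link to find the concepts identifying a service."""
    return services_ontology[service]["concepts"]
```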
Comparison of Sanskrit Texts for Critical Edition
Keywords: distance, text comparison, Sanskrit, transliteration, critical edition.
Participants: Marc Csernel, Jean-Nicolas Turlier, Yves Lechevallier.
These results have been obtained in the context of the EuropeAid AAT project (cf. section 8.3.1) and the CNRS ACI action (cf. section 8.2.1). Our objective was to compare around 50 versions of the same text, copied by hand over the centuries. Over that time, numerous changes were introduced into the text by the different scribes, most of the time unintentionally. Our aim is to obtain a critical edition of this text, i.e. an edition where all the differences between the manuscripts are highlighted. One text is arbitrarily chosen as the reference version, and all the manuscripts are compared one by one with this reference text.
The main difficulties in doing this comparison, from an algorithmic point of view, are given below:
The lack of space between the words.
The morpho-syntactic transformations that arise, in Sanskrit, between two consecutive words written without separation between them. These transformations, perfectly defined by the Sanskrit grammar, are called sandhi.
A number of altered manuscripts, partially destroyed by insects, mildew, rodents, etc.
To address these difficulties, we use a completely lemmatized reference version called the pādapāthā (after a special kind of recitation of Sanskrit texts), in which each Sanskrit word is distinctly separated from the others by a blank or another separator. Each manuscript text (called the mātrikāpathā) is compared with this reference version. In the text of the mātrikāpathā, where few blanks occur, words are transformed according to the sandhi.
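A toy sketch can make the sandhi problem concrete: in the mātrikāpathā, words are run together and transformed at their junctions by grammar rules. The single rule shown below (final "a" + initial "i" gives "e") is a real Sanskrit vowel sandhi rule, the one behind the famous "neti" arising from "na" + "iti"; everything else in the sketch is a simplification in Latin transliteration, not the project's actual processing.

```python
# Toy illustration of sandhi: words are joined without spaces, and one real
# vowel sandhi rule (a + i -> e) is applied at junctions. Heavily simplified.
def join_with_sandhi(words):
    out = words[0]
    for w in words[1:]:
        if out.endswith("a") and w.startswith("i"):
            out = out[:-1] + "e" + w[1:]   # a + i -> e, e.g. na + iti -> neti
        else:
            out += w                       # no rule applies: just drop the space
    return out
```

Recovering the word boundaries of the pādapāthā from such a fused, transformed string is precisely what makes the comparison non-trivial.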
The expected results are expressed as an edit distance in terms of words, instead of the usual string diff: the sequence of words that are added, deleted, or replaced from the pādapāthā to obtain the text of the manuscript.
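The shape of such a result can be sketched with a standard dynamic-programming edit distance computed over word tokens rather than characters. This stands in for the project's L.C.S-based method only to illustrate the output format (an edit script in words); the word lists in the usage example are simplified transliterations chosen for illustration.

```python
# Sketch: word-level edit distance with backtraced edit script, illustrating
# the "added / deleted / replaced words" output format described in the text.
def word_edit_script(ref, ms):
    """Return (distance, operations) turning word list `ref` into `ms`."""
    n, m = len(ref), len(ms)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == ms[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete ref[i-1]
                          d[i][j - 1] + 1,          # add ms[j-1]
                          d[i - 1][j - 1] + cost)   # keep or replace
    ops, i, j = [], n, m                             # backtrace the script
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != ms[j - 1]):
            if ref[i - 1] != ms[j - 1]:
                ops.append(("replace", ref[i - 1], ms[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", ref[i - 1]))
            i -= 1
        else:
            ops.append(("add", ms[j - 1]))
            j -= 1
    return d[n][m], list(reversed(ops))

# Usage: one scribal substitution yields a one-word edit script.
dist, ops = word_edit_script(["agnim", "ile", "purohitam"],
                             ["agnim", "ide", "purohitam"])
```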
In 2005, after addressing the graphic problems related to Sanskrit, we developed an HTML interface for a critical edition of Sanskrit texts, and we made the first steps in the processing of Sanskrit manuscripts.
This year we focused mostly on the comparison of Sanskrit texts.
The comparison is done according to the following steps:
A parser makes the two versions homogeneous.
The comparison is made letter by letter, using the Longest Common Subsequence (L.C.S) algorithm, to determine which are the words of the mātrikāpathā. The separations between the words of the pādapāthā are used as a pattern for this determination.
Once the L.C.S is computed, we cannot examine all the possible results it provides, because their number is enormous: 10^10 is quite common and frequently exceeded.
The strategy we developed is a navigation through the L.C.S matrix, associated with some rules based on common sense.
These common-sense rules are quite simple, for example: "two words are not considered as replacing each other if they do not have at least 50% of their letters in common".
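The 50% rule above can be sketched as a simple test on shared letters. The threshold follows the text; the implementation details (letters counted with multiplicity, comparison against the longer word) are our own assumptions.

```python
# Sketch of the common-sense replacement rule: two words may be treated as a
# replacement pair only if at least 50% of their letters are shared.
from collections import Counter

def may_replace(w1, w2):
    """True if w1 and w2 share at least half the letters of the longer word."""
    shared = sum((Counter(w1) & Counter(w2)).values())  # multiset intersection
    return shared >= 0.5 * max(len(w1), len(w2))
```

Such a filter prunes implausible branches when navigating the L.C.S matrix, keeping the number of candidate alignments manageable.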
The results, which have been obtained without specific Sanskrit knowledge, are quite good according to some Sanskrit philologists. They are due to the combination of common-sense rules with specific algorithmic methods.