Section: New Results
Data Transformation, Document Validation and Knowledge Management in KDD
Summarizing Data Streams and Clustering For Reducing the Size of Data
In the data mining setting, data are not collected for the purpose of statistical analysis but are available simply because of the computerized nature of the management of human activities. Goals of statistical analysis are therefore usually not defined before the data are stored. Consequently, there is a strong need to summarize data in order to enable rich future analyses without storing all the available data. Our work, carried out in cooperation with Antonio Ciampi (McGill University, Canada) and Georges Hébrail (ENST, Paris), differs from existing summarization approaches in that our compression technique has a semantic basis.
Summarizing data streams
As in the data mining setting, stream data are not collected for the purpose of statistical analysis but are available simply because of the computerized nature of the management of human activities. Goals of statistical analyses are thus usually not defined before the streams begin, or before the maximum storage capacity is reached. Consequently, there is a strong need to summarize data streams in order to enable rich future analyses without storing all the available data. Several approaches have been proposed to summarize data streams, for instance micro-clustering techniques or sampling techniques.
Our way of coping with the volume of data streams is to restrict the scope of analyses by defining sliding windows on the streams. This approach also eliminates the problem of distribution drift.
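As an illustration of the principle, a sliding window over a numeric stream can be summarized incrementally. The sketch below is a toy example, not the team's actual system: it maintains the mean of the most recent items, expiring old values as the window slides.

```python
from collections import deque

class SlidingWindowSummary:
    """Keep summary statistics (count, mean) over the last `width` stream items.

    A minimal sketch of the sliding-window idea, not the actual implementation.
    """
    def __init__(self, width):
        self.width = width
        self.window = deque()
        self.total = 0.0

    def push(self, x):
        self.window.append(x)
        self.total += x
        if len(self.window) > self.width:  # expire the oldest item
            self.total -= self.window.popleft()

    def mean(self):
        return self.total / len(self.window)

s = SlidingWindowSummary(width=3)
for x in [1.0, 2.0, 3.0, 10.0]:
    s.push(x)
print(s.mean())  # mean over the last 3 items: (2 + 3 + 10) / 3 = 5.0
```

Restricting analyses to such windows bounds both memory use and the time span over which the data distribution is assumed stable.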
We propose to adapt the algorithms developed in Stephan et al. (1999) to the case of data available in the form of data streams instead of databases.
Clustering Approach for Reducing the Size of Data
Our goal is to propose clustering methods that reduce the volume of data in data warehouses while preserving the possibility of performing the needed analyses. Our approach is based on two key ideas:
A preliminary data reduction using a Kohonen Self-Organizing Map (SOM) is performed. As a result, the individual measurements are replaced by the means of the individual measurements over a relatively small number of micro-clusters corresponding to Kohonen neurons. The micro-clusters can then be treated as new 'cases' and the means of the original variables over micro-clusters as new variables. This 'reduced' data set is small enough to be treated by classical clustering algorithms. A further advantage of the Kohonen reduction is that the vector of means over the micro-clusters can safely be treated as multivariate normal, owing to the central limit theorem. This is a key property, in particular because it permits the definition of an appropriate dissimilarity measure between micro-clusters.
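To make the reduction concrete, here is a minimal NumPy sketch of the idea: a toy one-dimensional SOM followed by per-micro-cluster means. The network size, learning schedule and all parameters below are illustrative assumptions, not the report's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, n_units=16, n_epochs=20, lr0=0.5, radius0=2.0):
    """Minimal 1-D Kohonen SOM (illustrative sketch only)."""
    dim = data.shape[1]
    units = rng.normal(size=(n_units, dim))  # code-book vectors
    for epoch in range(n_epochs):
        lr = lr0 * (1 - epoch / n_epochs)
        radius = max(radius0 * (1 - epoch / n_epochs), 0.5)
        for x in rng.permutation(data):
            bmu = np.argmin(((units - x) ** 2).sum(axis=1))  # best-matching unit
            # neighbourhood update: units close to the BMU on the map move too
            dists = np.abs(np.arange(n_units) - bmu)
            h = np.exp(-(dists ** 2) / (2 * radius ** 2))
            units += lr * h[:, None] * (x - units)
    return units

def reduce_to_microclusters(data, units):
    """Replace individuals by the per-micro-cluster means of the variables."""
    labels = np.argmin(((data[:, None, :] - units[None]) ** 2).sum(-1), axis=1)
    means = np.array([data[labels == k].mean(axis=0)
                      for k in np.unique(labels)])
    return means  # the new, much smaller 'cases'

data = rng.normal(size=(1000, 5))
units = train_som(data)
reduced = reduce_to_microclusters(data, units)
print(reduced.shape)  # far fewer rows than the original 1000
```

Each row of `reduced` is a mean over many individuals, which is why the central limit theorem makes the multivariate-normal treatment of micro-clusters plausible.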
Participant : Fabrice Rossi.
Feature selection is an extremely important part of any data mining process. Selecting relevant features for a predictive task (classification or regression) enables, for instance, specialists of the field to discover dependencies between the target variables and the input variables, which in turn lead to a better understanding of the data and of the problem. Moreover, the performance of predictive models is generally higher on well-chosen feature sets than on the original one, as the selection process tends to filter out irrelevant or noisy variables and reduces the effect of the curse of dimensionality.
In 2007, we continued our work on the combination of functional data analysis and feature selection. Our previous work in this direction has been published in extended form in an international journal. We have also started another approach that consists in clustering the spectral variables via a simple correlation measure. The clustering is constrained to produce interval clusters, i.e., to select sub-intervals of the spectral range under analysis (this can be considered a “functional aware” variable clustering method). Each cluster of variables is replaced by a mean variable. Then, the resulting variables are processed by our previously proposed mutual information based feature selection method.
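The interval constraint can be sketched as follows: adjacent spectral variables are merged as long as they remain strongly correlated with the interval's mean variable. The greedy threshold rule below is only an illustration of the constraint; the published algorithm and its parameters may differ.

```python
import numpy as np

def interval_clusters(X, threshold=0.9):
    """Group adjacent (spectral) variables into interval clusters.

    Greedy sketch: extend the current interval while the next variable
    is strongly correlated with the interval's mean variable.
    """
    n_vars = X.shape[1]
    clusters, start = [], 0
    for j in range(1, n_vars):
        mean_var = X[:, start:j].mean(axis=1)
        if abs(np.corrcoef(mean_var, X[:, j])[0, 1]) < threshold:
            clusters.append((start, j))  # close the interval [start, j)
            start = j
    clusters.append((start, n_vars))
    # each interval cluster is summarized by its mean variable
    reduced = np.column_stack([X[:, a:b].mean(axis=1) for a, b in clusters])
    return clusters, reduced

# synthetic spectrum: two blocks of three highly correlated variables
rng = np.random.default_rng(1)
base1, base2 = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack([base1 + 0.05 * rng.normal(size=200) for _ in range(3)] +
                    [base2 + 0.05 * rng.normal(size=200) for _ in range(3)])
clusters, reduced = interval_clusters(X)
print(clusters)  # with this seed, two intervals of three variables each
```

Because clusters are forced to be contiguous sub-intervals, each mean variable still corresponds to a region of the spectrum and remains interpretable.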
We have also studied approaches to automate feature selection. Our main idea is to use resampling methods to investigate the variability of an estimator of a dependency measure between two variables, e.g., the mutual information. This allows the parameters of the estimator to be chosen automatically, by minimizing its variability. The same strategy can be used to stop a forward selection procedure by estimating in a robust way the increase in the dependency measure induced by adding a new variable.
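A hedged sketch of the resampling idea follows, using a simple histogram-based mutual information estimator whose bin count plays the role of the estimator parameter. The estimator and the exact criterion used by the team may differ; this only illustrates choosing a parameter by minimizing bootstrap variability.

```python
import numpy as np

def mi_hist(x, y, bins):
    """Plug-in mutual information estimate from a 2-D histogram (in nats)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()

def stable_bins(x, y, candidates=(4, 8, 16, 32), n_boot=50, seed=0):
    """Choose the bin count whose bootstrap MI estimates vary the least."""
    rng = np.random.default_rng(seed)
    n = len(x)
    best, best_std = None, np.inf
    for b in candidates:
        estimates = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)  # bootstrap resample
            estimates.append(mi_hist(x[idx], y[idx], b))
        s = np.std(estimates)
        if s < best_std:
            best, best_std = b, s
    return best

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + 0.5 * rng.normal(size=500)
print(stable_bins(x, y))
```

The same bootstrap spread can serve as a robustness check when deciding whether adding a variable in a forward selection truly increases the dependency measure.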
XML document validation
Participant : Thierry Despeyroux.
Following previous experiments, we developed a new methodology for XML document verification, offering a rule-based specification language (SeXML). The design of this language and all the parsers used by the system are based on CLF (Computer Language Factory), a framework developed to ease parser generation.
SeXML is based on Structural Operational Semantics (SOS) and Natural Semantics, but environments are hidden from users and managed automatically. Basic objects are XML patterns with logical variables. As SeXML is compiled to Prolog, there is complete access to Prolog predicates and external tools; for example, Treetagger calls from SeXML have been used in other parts of the project.
SeXML has also been used for structure-based extraction from XML files.
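The flavour of matching XML patterns with logical variables can be illustrated in Python. This is a toy re-implementation for exposition only: SeXML itself compiles much richer patterns to Prolog, and all names below are hypothetical.

```python
import xml.etree.ElementTree as ET

class Var:
    """A logical variable in an XML pattern (toy stand-in for SeXML variables)."""
    def __init__(self, name):
        self.name = name

def match(pattern, elem, env):
    """Try to unify a (tag, text-or-Var-or-children) pattern with an element.

    Returns an extended environment on success, None on failure.
    """
    tag, content = pattern
    if elem.tag != tag:
        return None
    if isinstance(content, Var):          # bind the variable to the text
        env = dict(env)
        env[content.name] = (elem.text or "").strip()
        return env
    if isinstance(content, list):         # match children one by one
        children = list(elem)
        if len(children) != len(content):
            return None
        for sub_pat, child in zip(content, children):
            env = match(sub_pat, child, env)
            if env is None:
                return None
        return env
    return env if (elem.text or "").strip() == content else None

doc = ET.fromstring("<section><title>Results</title><body>...</body></section>")
env = match(("section", [("title", Var("T")), ("body", Var("B"))]), doc, {})
print(env["T"])  # prints "Results"
```

A verification rule would then check the bound values (or the success/failure of the match itself) against the constraints the document must satisfy.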
Viewpoint Management for Annotating a KDD Process
This work was performed in the context of H. Behja's Ph.D (France-Morocco Cooperation - Software Engineering Network).
Our goal is to make explicit the notion of "viewpoint" used by analysts during their activity, and to propose a new approach to integrating this notion into a multi-view Knowledge Discovery from Databases (KDD) analysis.
In past years, we designed and started the implementation of an object-oriented platform (design patterns and UML, using Rational Rose) for KDD that integrates the definition of viewpoints. This platform uses the Weka library and contains our conceptual model integrating the “viewpoint” concept, as well as an ontology of the KDD process. This ontology is composed of original components that we propose for the pre-processing step and of other components based on the DAMON ontology for the data mining step. The ontology was built with the Protégé-2000 system.
We proposed a new metadata format to annotate the KDD process. The proposed model is based on use cases to annotate the KDD process in terms of viewpoints, and on the systematic use of design patterns to document and justify design decisions. Our approach proposes object-oriented models for the KDD process, which is characterized by its complexity, and allows the capitalization of corporate objects for KDD. This year we pursued the implementation of the platform under Protégé-2000 and the writing of the thesis document.
Knowledge Base For Ontology Learning
Participant : Marie-Aude Aufaure.
Many approaches dedicated to ontology extraction have been proposed in recent years. They are based on linguistic techniques (using lexico-syntactic patterns), on clustering techniques, or on hybrid techniques. However, no consensus has emerged for this rather difficult task. This is likely due to the fact that ontology construction depends on many dimensions, such as the intended usage of the ontology, the expected ontology type, and the actors to whom the ontology is dedicated.
Knowledge extraction from web pages is a complex process, ranging from data cleaning to the evaluation of the extracted knowledge. The main process is web mining.
Our objective here is to propose a semi-automatic construction of ontologies from web pages. To achieve this objective, we build a knowledge base representing web knowledge, specified by a metaontology containing the knowledge related to the task of domain knowledge extraction. Our architecture is based on ontological components, defined by the metaontology, related to the content, the structure and the services of a given domain. In this architecture, we specify three interrelated ontologies: the domain ontology, the structure ontology and the services ontology. Our metaontology can store the knowledge related to different techniques and methods for ontology construction. We have defined an on-line information retrieval system using this web knowledge architecture. The system enriches the user query with domain concepts and classifies web documents according to the concepts and the services; it also gives the user the opportunity to discover a set of services related to a given concept. A prototype has been developed and experiments have been carried out in the tourism domain (cf. section 4.4).
Comparison of Sanskrit Texts for Critical Edition
These results have been obtained in the context of the EuropeAid AAT project and the CNRS ACI action (cf. section 8.2.2). Our objective is to compare around 50 versions of the same text, copied by hand over the centuries. During that period, numerous changes were introduced into the text by the different scribes, most of the time unintentionally.
Our aim is to obtain a critical edition of this text, i.e. an edition in which all the differences between the manuscripts are highlighted. One text is arbitrarily chosen as the reference version, and all the manuscripts are compared one by one with this reference text.
The main difficulties in doing this comparison, from an algorithmic point of view, are:
The lack of space between the words.
The morpho-syntactic transformations that arise, in Sanskrit, between two consecutive words written with no separation between them. These transformations, precisely defined by Sanskrit grammar, are called sandhi.
The alteration of a number of manuscripts, partially destroyed by insects, mildew, rodents, etc.
To address these difficulties we use a fully lemmatized reference version called the pādapāthā (after a special kind of recitation of Sanskrit texts), in which each Sanskrit word is distinctly separated from the others by a blank or another separator. Each manuscript text (called a mātrikāpathā) is compared with this reference version. In the text of the mātrikāpathā, where few blanks occur, adjacent words are transformed according to the sandhi.
The expected results are expressed as an edit distance, but in terms of words instead of characters as in the usual string diff: which words have been added, deleted or replaced in going from the pādapāthā to the text of the manuscript (the mātrikāpathā).
We first developed an HTML interface for critical edition of Sanskrit texts.
We then focused on the comparison of Sanskrit texts, which presents some difficulties because of the sandhi: the pādapāthā, where sandhi do not apply, and the mātrikāpathā, where sandhi apply, are not homogeneous.
The comparison is made letter by letter, using the Longest Common Subsequence (LCS) algorithm as basic support, in order to determine which words appear in the mātrikāpathā.
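The word-level edit script can be sketched as follows. Sandhi resolution is omitted and the example words are illustrative only; this shows the LCS principle applied to words rather than characters.

```python
def word_diff(reference, manuscript):
    """Word-level edit script via Longest Common Subsequence.

    A sketch of the comparison principle; the real system must first
    resolve sandhi to segment the manuscript into words.
    """
    a, b = reference.split(), manuscript.split()
    n, m = len(a), len(b)
    # classic LCS dynamic-programming table, filled from the end
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            lcs[i][j] = (lcs[i + 1][j + 1] + 1 if a[i] == b[j]
                         else max(lcs[i + 1][j], lcs[i][j + 1]))
    # backtrack: words only in `a` were deleted, words only in `b` added
    i = j = 0
    script = []
    while i < n and j < m:
        if a[i] == b[j]:
            i += 1; j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            script.append(("-", a[i])); i += 1
        else:
            script.append(("+", b[j])); j += 1
    script += [("-", w) for w in a[i:]] + [("+", w) for w in b[j:]]
    return script

print(word_diff("rama gacchati vanam", "rama gacchati nagaram"))
# [('-', 'vanam'), ('+', 'nagaram')]
```

A deletion/insertion pair at the same position is then read as a word replacement when building the critical apparatus.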
This year we completed the whole processing chain. Some new problems arose that were impossible to foresee at the beginning of the project, while, on the other hand, having a complete chain of treatment suggested some obvious improvements.
Some results have been summarized in a paper written in collaboration with François Patte (Université Paris Descartes) and presented at the First International Sanskrit Computational Linguistics Symposium (October 2007).
The software produces results that have been judged satisfactory by Sanskrit philologists, even considering the imperfections discovered at the end of the project.
Another paper, written in collaboration with Patrice Bertrand (ENST Bretagne), was presented at the “Société Francophone de Classification” congress of September 2007, and a further one has been included in a book edited by Springer.