Project : axis
Section: New Results
Keywords : KDD, preprocessing, data transformation, metadata, knowledge management, viewpoint, ontology, annotation, reusability.
Data Transformation and Knowledge Representation
ARFF Format Library
Keywords : ARFF library, database, PCA, Self Organizing Map.
Participants : Luc Baubois, Aicha El Golli, Yves Lechevallier.
During his internship, Luc Baubois developed the ARFF library. This library generates an ARFF file from a database. An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. The ARFF format was developed by the Machine Learning Project at the Department of Computer Science of the University of Waikato for use with the Weka machine learning software (http://www.cs.waikato.ac.nz/~ml/). Weka is a collection of machine learning algorithms for data mining tasks, distributed as open source under the GNU General Public License. The library also provides visualization interfaces for ARFF files using PCA (Principal Component Analysis) projection, as well as for the ARFF files resulting from Kohonen maps (Self-Organizing Maps) extended to mixed data.
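To make the idea concrete, the following is a minimal sketch (not the actual library, whose interface is not described here) of how rows fetched from a database can be serialized into an ARFF file readable by Weka; the table, attribute names, and file path are hypothetical.

```python
# Illustrative sketch: turning database rows into an ARFF file for Weka.
# All names (table "people", attributes, output path) are hypothetical.
import sqlite3

def write_arff(rows, attributes, relation, path):
    """attributes: list of (name, arff_type) pairs, e.g. ('age', 'NUMERIC')."""
    with open(path, "w") as f:
        f.write("@RELATION %s\n\n" % relation)
        for name, arff_type in attributes:
            f.write("@ATTRIBUTE %s %s\n" % (name, arff_type))
        f.write("\n@DATA\n")
        for row in rows:
            f.write(",".join(str(v) for v in row) + "\n")

# Hypothetical usage with an in-memory database:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (age INTEGER, city TEXT)")
con.executemany("INSERT INTO people VALUES (?, ?)",
                [(34, "Nice"), (27, "Paris")])
rows = con.execute("SELECT age, city FROM people").fetchall()
write_arff(rows, [("age", "NUMERIC"), ("city", "{Nice,Paris}")],
           "people", "people.arff")
```

The `@RELATION` / `@ATTRIBUTE` / `@DATA` layout is the standard ARFF structure; nominal attributes enumerate their values in braces, as for `city` above.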
Structure Mining in Preprocessing
Keywords : structure mining, sequential patterns, web usage mining, long patterns.
Participants : Calin Garboni, Florent Masséglia.
In this work, we focus on the preprocessing step of knowledge discovery. Considering the usual representation of the KDD process, given in Figure 4, we are working on the automation of the preprocessing step (within the dashed lines). Indeed, in order to perform a knowledge discovery process on any kind of data, one has to transform the data from its original raw format into a specific format understood by the data mining algorithm. This transformation is usually designed by an expert who has the required knowledge about both the data and the data mining algorithm. Our goal is to help, and even replace, the expert by providing an intelligent tool able first to ``understand'' the structure of the data to prepare, and second to propose a parsing of the data based on the discovered structure. We started with an access log file and tried to discover its structure automatically. As this structure is already well known, we use it as a benchmark for our proposal. The method is based on the sequential pattern mining principle: mining sequential patterns aims at extracting the frequent sequences hidden in the data, and the structure of a file organizing its records line by line can be considered as a frequent sequence repeated in (almost) every record.
Sequential patterns for structure discovery
Our contribution is based on a comparison between sequential patterns and structures. The structure is expected to be common to all records in the dataset. Nevertheless, in numerous types of records, such as log file entries, data can be altered (due to errors while recording the entry, a system crash, etc.). The structure can thus be considered as a sequential pattern having a very high support over the dataset. For an Apache log file, we expect to find a frequent pattern such as:
... - - [/Mar/2003::: +000] "GET / HTTP/1." 0 "" ""
Then, filling this pattern would give the following rule:
[0-9]+''.''[0-9]+''.''[0-9]+''.''[0-9]+ - - [''[0-9]+''/Mar/2003'' (etc.)
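A filled-in rule of this kind is essentially a regular expression: the fixed parts of the structure are kept literally and the variable parts become character classes. As a hedged sketch (this is not the project's rule language, and the log line is invented), such a rule could be applied to an Apache common-log line like this:

```python
import re

# Hypothetical parsing rule in the spirit of the inferred pattern above:
# fixed structure kept literally, variable parts captured as groups.
LOG_RULE = re.compile(
    r'^(\d+\.\d+\.\d+\.\d+) - - '                           # client IP
    r'\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) ([+-]\d+)\] '   # timestamp
    r'"(\w+) (\S+) HTTP/(\d\.\d)" '                         # request line
    r'(\d+) (\S+)'                                          # status, size
)

line = ('192.168.0.1 - - [12/Mar/2003:10:41:02 +0000] '
        '"GET /index.html HTTP/1.1" 200 1043')
m = LOG_RULE.match(line)
if m:
    print(m.group(1), m.group(9), m.group(12))  # IP, method, status
```

Each matched group is a field of the transformed record, which is exactly what the target data mining algorithm needs as input.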
With this type of rule, it is possible to parse the original data file and obtain the transformed data file. So far, we have developed the SINTHES (Structure is IN THE Sequence) method. The main contribution of this method is a top-down generating-pruning principle for frequent sequence extraction. The structure to discover is usually very long, so any existing method based on the apriori principle will fail (because of the large number of candidates to generate and test). SINTHES uses a sample of the data file to process and applies several filters to the sequences in order to delete infrequent combinations of characters in the considered candidates. Once the filters have been applied to the candidates, the most frequent one is used as the seed of the top-down generating-pruning principle: from a candidate of size k, the method generates all its subsequences of size k-1, determines the frequency of each one, and keeps the most frequent subsequence for the next iteration. The algorithm continues until the support of a subsequence exceeds the minimum support; that subsequence is the candidate representing the structure.
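The top-down generating-pruning principle can be sketched as follows. This is a deliberately simplified illustration (it ignores SINTHES's sampling and filtering stages, and the seed, records, and threshold are invented), showing only the core loop: shrink a long seed candidate one item at a time, always keeping the most frequent subsequence, until the candidate becomes frequent.

```python
def is_subsequence(pattern, record):
    """True if all items of pattern appear in record, in order."""
    it = iter(record)
    return all(item in it for item in pattern)

def support(pattern, records):
    """Fraction of records containing pattern as a subsequence."""
    return sum(is_subsequence(pattern, r) for r in records) / len(records)

def top_down_structure(seed, records, minsup=0.9):
    """Simplified top-down generating-pruning: from a candidate of size k,
    generate all subsequences of size k-1, keep the most frequent one,
    and stop as soon as the candidate's support reaches minsup."""
    candidate = list(seed)
    while support(candidate, records) < minsup and len(candidate) > 1:
        subs = [candidate[:i] + candidate[i + 1:]
                for i in range(len(candidate))]
        candidate = max(subs, key=lambda s: support(s, records))
    return "".join(candidate)

# Invented toy records sharing the structure " - [" ... "]":
records = ['a - - [x]', 'b - - [y]', 'c - [z]']
print(top_down_structure('q - - [?]', records, minsup=0.9))
```

On this toy input the infrequent items (`q`, `?`, the second dash and its space) are pruned away one by one, leaving the structure common to all three records.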
Objective: a new schema for KDD
The aim of this work is to provide a new schema for the KDD process. Once the structure is discovered and the parser is generated (thanks to the rules inferred from the data file), it is possible to apply this parser to the original data file (as illustrated in Figure 5). At this time, only the structure discovery is possible (thanks to the SINTHES method). A further step will aim at generating the parsing rules by comparing the structure to the data file and extracting the missing information (the nature of the characters embedded in the structure). Then, based on these rules, the parser will be generated.
Metadata Extraction for Supporting the Interpretation of Clusters
Keywords : metadata, classification, Dublin Core, RDF.
Participants : Abdourahamane Baldé, Yves Lechevallier, Brigitte Trousse.
The huge volume of data produced by different information sources requires tools to retrieve pertinent data. Metadata, defined as information about data, is a promising way to add semantics to data and to manage a considerable volume of data without accessing its content. Nowadays, a huge volume of unstructured data also needs to be managed; such data are usually badly structured and difficult to access. In our context, metadata will be used to share information and resources without any access to their content, in order to respect privacy constraints. Data from different sources can then be compared using this metadata.
We propose a new methodology, in collaboration with M-A Aufaure (Supelec), to extract metadata during the classification process. This metadata gives information about the clusters: their contents, the variables describing them, the classification method, the set of criteria used, and general information. Standards such as Dublin Core and RDF have been used to model the metadata. We applied this structure to one algorithm, called Clustering Algorithm on Symbolic Data Table; our goal is to offer descriptions that facilitate the interpretation of the resulting clusters. Metadata thus offers a true means of capitalizing knowledge and know-how.
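As a purely illustrative fragment (the actual metadata schema used in this work is not reproduced here, and every value below is hypothetical), cluster metadata combining RDF with Dublin Core elements could look like this:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="#cluster-3">
    <dc:title>Cluster 3</dc:title>
    <dc:description>Instances grouped by the symbolic clustering
      algorithm; dominant variables: age, region</dc:description>
    <dc:creator>Clustering Algorithm on Symbolic Data Table</dc:creator>
    <dc:date>2004</dc:date>
  </rdf:Description>
</rdf:RDF>
```

Such a description can be exchanged and queried without ever exposing the underlying instances, which is what allows comparison across sources while respecting privacy.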
Viewpoint Management for Annotating a KDD Process
Keywords : KDD, viewpoint, ontology, knowledge, annotation, metadata, reusability, RDF, weka.
Participants : Hicham Behja, Brigitte Trousse.
This research is mainly related to Behja's doctoral thesis in the context of the France-Morocco Cooperation (Software Engineering Network). Our goal is to make the notion of "viewpoint" held by analysts during their activity more explicit, and to propose a viewpoint-based KDD model 1) for annotating the underlying goals of KDD activities and 2) for encapsulating existing KDD algorithms or methods, thus offering more flexibility and adaptability.
In 2004, in collaboration with Abdelaziz Marzark (Univ. of Casablanca, Morocco), we proposed a new approach for applying the viewpoint notion to multiview analysis in Knowledge Discovery in Databases (KDD). We defined a viewpoint as the perception an expert has of a KDD process, a perception shaped by his/her own knowledge. We propose to structure this knowledge into two types: 1) domain knowledge, which gives information primarily about the attributes and data from the database, and 2) task knowledge from the analyst's field, which relates to the tasks carried out by the analyst during the KDD process. This classification aims, on the one hand, at integrating the role of the processing expert and, on the other hand, at reducing the size of the handled data. The goals are to facilitate both reusability and adaptability of a KDD process, and to reduce its complexity while maintaining a trace of the past analysis viewpoints. The KDD process is regarded as generating and transforming views annotated by metadata that store the discovered knowledge. We started with an analysis of the state of the art and identified three directions: 1) the use of the viewpoint notion in the Knowledge Engineering community, including object languages for knowledge representation, 2) modeling the KDD process with a Semantic Web based approach, and 3) KDD process annotation. We then designed and implemented an object platform for KDD processes including the viewpoint notion (design patterns and UML, using Rational Rose). The current platform is based on the Weka library. We are now applying our model to analyzing the use of web sites, especially the Inria Sophia-Antipolis site (according to the "reliability" and "ergonomic" viewpoints).
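The encapsulation idea can be sketched as follows. The actual platform is a Java/Weka object design; this Python sketch only illustrates the principle, and every class, attribute, and value in it is hypothetical: a viewpoint restricts the data to the attributes it deems relevant (domain knowledge), states the purpose of the step (task knowledge), and each wrapped algorithm run leaves an annotation trace.

```python
# Illustrative sketch only; not the actual viewpoint platform.
class Viewpoint:
    """An analyst's viewpoint: domain knowledge (relevant attributes)
    plus task knowledge (what the analysis step is for)."""
    def __init__(self, name, relevant_attributes, task):
        self.name = name
        self.relevant_attributes = relevant_attributes
        self.task = task

class AnnotatedStep:
    """Wraps a mining algorithm and records, for each run, the viewpoint
    under which it was applied -- keeping a trace of past analyses."""
    def __init__(self, algorithm, viewpoint):
        self.algorithm = algorithm
        self.viewpoint = viewpoint
        self.trace = []

    def run(self, dataset):
        # Restrict the data to the attributes the viewpoint considers
        # relevant, reducing the size of the handled data.
        view = [{a: row[a] for a in self.viewpoint.relevant_attributes}
                for row in dataset]
        result = self.algorithm(view)
        self.trace.append({"viewpoint": self.viewpoint.name,
                           "task": self.viewpoint.task,
                           "n_instances": len(view)})
        return result

# Hypothetical usage: a "reliability" viewpoint on web-usage records.
vp = Viewpoint("reliability", ["status", "url"], "find failing pages")
step = AnnotatedStep(lambda rows: [r for r in rows if r["status"] >= 500], vp)
errors = step.run([{"status": 200, "url": "/"},
                   {"status": 500, "url": "/x"},
                   {"status": 404, "url": "/y"}])
print(errors, step.trace)
```

The same records analyzed under an "ergonomic" viewpoint would select different attributes and leave a different trace entry, which is what makes past analyses reusable and comparable.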