Section: New Results
Data Transformation and Knowledge Management in KDD
Dissimilarities for Web Usage Mining
Keywords : Dissimilarities, Web Usage Mining, Clustering, Validation, Benchmark.
Participants : Fabrice Rossi, Francisco De Carvalho, Yves Lechevallier, Alzennyr Da Silva.
Many Web Usage Mining methods rely on clustering algorithms in order to produce homogeneous classes of documents (when the content of the web site is analyzed) and/or of users (when browsing behaviors are analyzed). The information extracted from web server logs is complex and noisy, but it can be used to define usage-based dissimilarities between users of the site or between pages of the site. There are, however, many ways to define this type of dissimilarity measure.
We have defined a benchmark that allows dissimilarity measures to be compared via the clustering results they produce. The benchmark consists of one year of logs from the web site of the CIn, the laboratory of Francisco De Carvalho. This site is small (91 pages) and very well organized, which makes it possible to define a meaningful semantic structure and to build an expert partition of its content. This expert partition can then be compared to the results of a clustering algorithm applied to the dissimilarity matrix constructed with a specific measure.
The results, to be published at the EGC 2006 conference, show that the Jaccard index and the ``term frequency inverse document frequency'' approach give quite good results, whereas the cosine measure performs badly. It also seems that better results could be obtained by taking the structure of the site into account together with the usage data.
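To make the usage-based comparison concrete, here is a minimal sketch of a Jaccard dissimilarity between pages, under the assumption (an illustration, not the exact encoding used in the study) that each page is described by the set of user sessions in which it was requested.

    # Minimal sketch: usage-based Jaccard dissimilarity between two pages.
    # Assumption (not from the report): a page is described by the set of
    # session identifiers in which it was requested.

    def jaccard_dissimilarity(sessions_a, sessions_b):
        """1 - |A intersection B| / |A union B| for two sets of session ids."""
        union = sessions_a | sessions_b
        if not union:
            return 0.0
        return 1.0 - len(sessions_a & sessions_b) / len(union)

    # Hypothetical usage data: pages identified by their URL.
    page_sessions = {
        "/index.html": {"s1", "s2", "s3"},
        "/teaching.html": {"s2", "s3", "s4"},
    }
    print(jaccard_dissimilarity(page_sessions["/index.html"],
                                page_sessions["/teaching.html"]))  # 0.5

The pairwise values obtained this way can be assembled into the dissimilarity matrix that is fed to the clustering algorithm.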
Distances for Clustering Homogeneous XML Documents
Keywords : distance, text comparison, Sanskrit, transliteration.
Participants : Marc Csernel, Sergiu Chelcea, Yves Lechevallier, Sattisvar Tandabany, Brigitte Trousse.
In the context of a research project with India and some other French partners, about a hundred ancient manuscripts written in Sanskrit, all derived from the same text (the Benares Glose), have to be compared in order to produce a critical edition and to provide a classification of the manuscripts (cf. section 6.2.7).
During his internship, S. Tandabany developed the tools required for clustering homogeneous XML Sanskrit documents. First, a modified longest common substring algorithm is proposed to deal with Sanskrit characters. Then, as inversions of characters are not always meaningful in Sanskrit, a detection of possible inversions is applied. Finally, the Agglomerative 2-3 Hierarchical Classification (cf. section 6.3.4) is used as the classification algorithm. To do this, we proposed a new distance between texts that takes into account some Sanskrit specificities and allows the addition of meta-data (state of wear and shape of the manuscripts, annotations about forgotten words, etc.). The text is split into paragraphs, and ``sub-distances'' are computed between corresponding paragraphs, taking into account additions, deletions, transformations and inversions. Some of the constraints required for a distance (the triangular inequality) are then relaxed, so that the 2-3 AHC works on a dissimilarity instead of a distance. The impact of these modifications on the classification was analysed. Finally, our results are illustrated with experiments and examples [55].
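As an illustration of this paragraph-wise construction, the sketch below aggregates per-paragraph ``sub-distances'' into a document-level dissimilarity; the per-paragraph measure shown here is a plain normalized edit distance and the aggregation is a simple average, both of which are assumptions standing in for the Sanskrit-specific measure of [55].

    # Minimal sketch (not the measure of [55]): combine paragraph-level
    # "sub-distances" into a document-level dissimilarity.  Additions,
    # deletions and transformations are handled by a classical edit
    # distance; inversions and meta-data are not handled here.

    def edit_distance(a, b):
        """Levenshtein distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # addition
                               prev[j - 1] + (ca != cb)))   # transformation
            prev = cur
        return prev[-1]

    def document_dissimilarity(doc_a, doc_b):
        """doc_a, doc_b: lists of corresponding paragraphs."""
        subs = [edit_distance(p, q) / max(len(p), len(q), 1)
                for p, q in zip(doc_a, doc_b)]
        return sum(subs) / len(subs) if subs else 0.0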
Distances for Clustering Downtown Tourist Itineraries
Keywords : Tourist itineraries, Clustering, Distance, 2-3 AHC.
Participants : Rémi Busseuil, Sergiu Chelcea, Brigitte Trousse.
In the context of the MobiVIP project (cf. section 7.1.2), during Rémi Busseuil's internship we studied the possibility of clustering tourist itineraries in a town center (Antibes in this case). A piece of software for tourist itinerary generation and clustering was developed (in Visual .NET), taking into account not only the geographical characteristics of an itinerary, but also its symbolic ones: street type, building types, etc. This new use of semantic data opens new directions for road itinerary recommendation, by addressing new issues such as the purpose of the itinerary or the nature of the crossed areas.
Clustering itineraries has many advantages besides the possibility of choosing the most suitable one: it is also an analysis and comparison tool, with multiple applications such as route or destination prediction, traffic anticipation, etc. As the clustering algorithm, we used the Agglomerative 2-3 Hierarchical Classification (2-3 AHC) algorithm, which has the advantage of being easily visualized compared to a classical clustering method.
In order to compare different itineraries, we basically divided each itinerary into fragments and then computed a distance/dissimilarity value using the Longest Common Subsequence (LCS) algorithm and a spread function developed in [55]. Different ways of defining the dissimilarity and of comparing the fragments were tested.
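A minimal sketch of such an LCS-based dissimilarity is given below; the symbolic encoding of the fragments and the normalization by the longest itinerary are illustrative assumptions, and the spread function of [55] is not reproduced.

    # Minimal sketch: itineraries as sequences of symbolic fragments
    # (hypothetical labels), compared through their longest common
    # subsequence (LCS).

    def lcs_length(xs, ys):
        """Length of the longest common subsequence, by dynamic programming."""
        prev = [0] * (len(ys) + 1)
        for x in xs:
            cur = [0]
            for j, y in enumerate(ys, 1):
                cur.append(prev[j - 1] + 1 if x == y
                           else max(prev[j], cur[j - 1]))
            prev = cur
        return prev[-1]

    def itinerary_dissimilarity(it_a, it_b):
        """1 - |LCS| / max length of the two fragment sequences."""
        longest = max(len(it_a), len(it_b))
        return 1.0 - lcs_length(it_a, it_b) / longest if longest else 0.0

    # Hypothetical itineraries described by street/area types.
    walk_1 = ["pedestrian", "shopping", "seafront", "old_town"]
    walk_2 = ["shopping", "seafront", "museum", "old_town"]
    print(itinerary_dissimilarity(walk_1, walk_2))  # 0.25

The resulting dissimilarity matrix is then given to the 2-3 AHC algorithm.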
An example of clustering 40 itineraries generated from 4 different profession types is presented in Figure 4 below.
Semantic Tools for XML Documents
Participant : Thierry Despeyroux.
The main goal of the Semantic Web is to ease computer-based data mining and discovery by formalizing data that is mostly textual. Our approach is different, as we are concerned with the way Web sites are constructed, taking into account their development and their semantics. In this respect we are closer to what is called content management.
Our formal approach is based on the analogy between Web sites and programs when they are represented as terms, although some differences between Web sites and programs can be pointed out:
-
Web sites may be spread across a great number of files.
-
Information is scattered, with many forward references.
-
We may need to use external resources to define the static semantics (thesaurus, ontologies, taggers, image analysis program, etc.).
We are developing a specification language to express, in an operational way, global constraints on Web sites or on collections of XML documents.
An initial version of this language has been described in [6], together with its application to a real-sized collection of documents: the Inria scientific activity reports for the years 2001 and 2002.
The language and its implementation have been developed and improved in 2005, in particular for efficiency. At the same time, our XML core parser has been extended to allow the parsing of XHTML documents.
The same language has been used to extract information from XML documents, for instance to select and extract words from different parts of XML documents. These words were first passed to a tagger, then used to cluster the documents. A first experiment was carried out in 2004 and presented at the EGC 2005 conference [33], [32] (cf. section 6.5.1).
In a longer-term experiment, we have initiated a regular monitoring of the Inria activity reports to see how the number of bad URLs in these reports evolves. This monitoring, started in December 2004 and performed every two weeks since then, takes into account the activity reports for 2002, 2003 and 2004.
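A minimal sketch of such a link check is given below; it is not the actual monitoring tool, and the file names are hypothetical. It simply tries to fetch every http reference found in a set of XML/XHTML documents and counts the failures.

    # Minimal sketch (not the actual monitoring tool): count bad URLs in a
    # set of XML/XHTML files by trying to fetch every href they contain.

    import urllib.request
    import xml.etree.ElementTree as ET

    def hrefs(xml_file):
        """Yield every http(s) href attribute value found in the document."""
        for _, elem in ET.iterparse(xml_file):
            for name, value in elem.attrib.items():
                if name.endswith("href") and value.startswith("http"):
                    yield value

    def is_bad(url, timeout=10):
        """True if the URL cannot be fetched."""
        try:
            urllib.request.urlopen(url, timeout=timeout)
            return False
        except Exception:
            return True

    # Hypothetical activity report files.
    report_files = ["ra2002.xml", "ra2003.xml", "ra2004.xml"]
    bad_urls = [u for f in report_files for u in hrefs(f) if is_bad(u)]
    print(len(bad_urls), "bad URLs")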
Metadata Extraction for Supporting the Interpretation of Clusters
Keywords : metadata, XQuery, cluster interpretation, RDF, Dublin Core, PMML.
Participants : Abdourahamane Baldé, Yves Lechevallier, Brigitte Trousse.
This work was conducted in the context of the PhD of A. Baldé.
A huge volume of data is produced by many applications. Data mining techniques are part of knowledge discovery methods, whose aim is to discover knowledge in large databases without predetermined information about the application field, a process well known as KDD. But data mining is a complex process for an end-user, and the main difficulty lies in the interpretation of the results. Metadata can help the interpretation process by providing additional information. Our objective is to facilitate the interpretation process and to show that metadata can play a major role for this purpose. Despite the visual representation of the results, the user needs significant experience to be able to interpret the clusters, and data mining tools generally offer visualization modules that are not adapted to this analysis. The original contributions of our work, made in collaboration with Marie-Aude Aufaure (Supelec), concern new approaches for representing clustering metadata and for interpreting clustering results using this metadata.
First, we propose a metadata model that can be automatically exploited [17]. We also propose a tool to help the end-user interpret the obtained clusters. This tool is based upon the architecture described in Figure 5.
This architecture is composed of three layers: the metadata model; the metadata manager, which handles metadata extraction and storage and the manipulations performed on these metadata; and the user query layer, based on XQuery. To implement these queries, we use the Saxon processor, a set of tools dedicated to XML document processing that has established a reputation for fast performance and a high level of conformance to the W3C specifications. This method can be applied to a wide variety of data mining methods.
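As a simple illustration of the kind of query this layer answers, the sketch below interrogates a hypothetical clustering-metadata document; the XML vocabulary is invented for the example, and the query is written with the Python standard library's limited XPath support rather than with XQuery/Saxon.

    # Minimal sketch (the actual layer uses XQuery with the Saxon processor):
    # list the variables recorded in the metadata for a given cluster.

    import xml.etree.ElementTree as ET

    metadata = ET.fromstring("""
    <clustering method="2-3 AHC">
      <cluster id="1" size="42">
        <variable name="page_category" weight="0.8"/>
        <variable name="session_length" weight="0.3"/>
      </cluster>
      <cluster id="2" size="17">
        <variable name="referrer" weight="0.6"/>
      </cluster>
    </clustering>
    """)

    for var in metadata.findall(".//cluster[@id='1']/variable"):
        print(var.get("name"), var.get("weight"))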
Viewpoint Management for Annotating a KDD Process
Keywords : viewpoint, complex data mining, annotation, metadata.
Participants : Hicham Behja, Brigitte Trousse.
This work was performed in the context of the PhD of H. Behja (France-Morocco Cooperation - Software Engineering Network).
Our goal is to make the notion of "viewpoint" held by analysts during their activity more explicit, and to propose a new approach that integrates the viewpoint notion into a multi-view Knowledge Discovery from Databases (KDD) analysis. We define a viewpoint in KDD as an analyst's perception of a KDD process, a perception that refers to his or her own knowledge [12]. Our purpose is to facilitate both the reusability and the adaptability of a KDD process, and to reduce its complexity while keeping a trace of the past analysis viewpoints. The KDD process is considered as a view generation and transformation process annotated by metadata that store the semantics of the process.
In 2004 we started with an analysis of the state of the art and identified three directions: 1) the use of the viewpoint notion in the Knowledge Engineering community, including object languages for knowledge representation, 2) the modelling of KDD processes following a Semantic Web based approach, and 3) the use of annotations on KDD processes. We then designed and implemented an object-oriented platform for KDD processes including the viewpoint notion (via design patterns and UML, using Rational Rose). The current platform is based on the Weka library.
In 2005 we proposed and implemented the knowledge conceptual model [25] integrating the viewpoint concept (cf. Figure 6). It is composed of four models structured into two types of knowledge:
First, for the domain knowledge, we find the domain model, which describes the analyzed domain knowledge in terms of objects, attributes, data, etc., and the analyst domain knowledge, which relates to the tasks carried out by the analyst: choice of methods, variables, etc. We propose a formal representation of the domain model as a data warehouse that allows the business information to be viewed from many viewpoints. For our example of HTTP logs in Web Usage Mining (WUM), the database design used is the star schema.
Second, for the strategic knowledge, we find:
-
the task and method model, which describes the KDD analyst domain knowledge. Here, the domain objects are methods, algorithms, parameters, etc. This model is a semi-formal generic ontology. For its construction we were mainly inspired by the DAMON system ontology for the data mining step, but we address all three KDD steps (preprocessing, data mining and postprocessing). This ontology is developed with the Protégé-2000 system.
-
the viewpoint model, which describes the viewpoint specification in terms of preferences related to the decision-making process in KDD (choice of the attributes, methods and systems, etc.). This viewpoint model, described by an RDF schema, manipulates both the analyzed domain and the analyst domain (a minimal, hypothetical sketch of such a viewpoint description is given after this list):
-
The viewpoint analyzed domain specifies the attributes of the analyzed domain that are significant for the expert. This view allows the analyst, on the one hand, to restrict the analyzed domain and, on the other hand, to guide the goal of the retrieval by defining a schema over the raw data.
-
The viewpoint analyst domain allows a symbolic execution to be defined by choosing the methods and the algorithms for each KDD process step.
-
The viewpoint organizational model describes the organization of the viewpoints in terms of the relations among them (work in progress).
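As an illustration only, the sketch below records such a viewpoint as RDF triples with rdflib; the vocabulary (namespace and property names) is hypothetical and does not reproduce the project's actual RDF schema.

    # Minimal sketch (hypothetical vocabulary): a viewpoint recording an
    # analyst's preferences for one KDD process step, as RDF triples.

    from rdflib import Graph, Literal, Namespace, RDF

    VP = Namespace("http://example.org/kdd-viewpoint#")

    g = Graph()
    viewpoint = VP["wum-analysis-1"]
    g.add((viewpoint, RDF.type, VP.Viewpoint))
    g.add((viewpoint, VP.analyzedDomain, Literal("HTTP log (Web Usage Mining)")))
    g.add((viewpoint, VP.selectedAttribute, Literal("requested_page")))
    g.add((viewpoint, VP.selectedAttribute, Literal("session_duration")))
    g.add((viewpoint, VP.chosenMethod, Literal("2-3 AHC")))
    g.add((viewpoint, VP.kddStep, Literal("data mining")))

    print(g.serialize(format="turtle"))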
This work has been accepted for publication in January 2006 in a special issue on ``Méthodes Avancées de Développement des SI'' of the French journal ISI (D. Rieu and G. Giraudin, editors).
Production and Display of a Critical Edition of Sanskrit Documents
Keywords : Text comparison, Sanskrit, transliteration, Critical Edition, Unicode, XML, electronic display.
Participants : Marc Csernel, Marina Dufresne, Yves Lechevallier, Selma Khebache.
A critical edition is an edition of a well-known text that takes into account all possible versions of this text. Critical editions are particularly needed for texts issued from manuscripts, where the variations from one manuscript to another can be very significant.
Production of a critical edition of Sanskrit text. This is particularly important for the Indian subcontinent, where at least one third of all the manuscripts existing in the world are supposed to be found, most of them written in Sanskrit. Producing critical editions was not an Indian tradition, so very few of them exist at present. The idea is to provide a computer-assisted construction of critical editions. Such tools exist for occidental languages but, due to some Sanskrit specificities, they could not suit our purpose:
-
Sanskrit is written according to a transliterated alphabet.
-
Separations between words are mostly absent in a manuscript, and when present they are not meaningful. The text is generally formed by a sequence of thousands of characters;
-
The writing of two words differs depending on whether there is a blank (or any separation) between them or they follow each other directly. This phenomenon is called sandhi.
In order to avoid the complexity problem induced by sequences of thousands of characters, and to be able to point out to the philologist the exact words where a difference occurs, we need either a lexicon or a text where all the words appear separately. We use the second solution and call such a text a Padapatha, after a certain form of recitation used by Sanskritists. Due to the sandhi, such a text is not directly comparable with the manuscript text. A LEX-based pre-processing must therefore construct all the sandhi related to the Padapatha in order to make a suitable comparison possible. For the comparison we use the Longest Common Subsequence (LCS) algorithm, based on dynamic programming (a sketch of how its output can be exploited is given after the list below). The algorithm output is used as input to the following project steps:
-
Electronic display of critical edition of Sanskrit text
-
Clustering (cf. section 6.2.2) and phylogenetic trees
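As a small illustration of how the comparison output can be exploited, the sketch below reports the word positions where a manuscript diverges from the Padapatha-derived text; the word sequences are hypothetical and difflib is used as a stand-in for the project's dynamic-programming LCS.

    # Minimal sketch: report where two word sequences diverge.  difflib
    # aligns common blocks in a way close to an LCS alignment; the actual
    # project uses its own dynamic-programming LCS.

    import difflib

    # Hypothetical, already pre-processed word sequences.
    padapatha = ["tat", "tvam", "asi", "iti"]
    manuscript = ["tat", "tvam", "eva", "asi"]

    matcher = difflib.SequenceMatcher(None, padapatha, manuscript)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            print(op, i1, padapatha[i1:i2], "->", manuscript[j1:j2])
    # Prints the inserted, deleted or replaced words with their positions.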
Electronic display of a critical edition of Sanskrit text. Traditionally, critical editions are presented as books that are particularly forbidding for the unfamiliar reader, where the text itself is very small and the notes are numerous and enormous. It is a philologist's dream to see only the points they care about; an electronic form of critical edition could be a proper answer.
But there are many Sanskrit-related problems that we have to take care of. Thanks to Unicode, it is now possible to rely on a standard for the display of Sanskrit characters. But because of ligature formation, two Sanskrit characters separated by a blank do not look the same as when they directly follow each other.
During her internship (cf. section 9.2.3), Marina Dufresne developed a software tool that allows an interactive display of a critical edition of Sanskrit text starting from an XML text. This tool is not perfect, but it has been greatly appreciated by the Sanskrit community. This work was done in collaboration with François Patte.