Project : axis
Section: New Results
Keywords : web usage mining, preprocessing, document mining, semantics, adaptive services, XML, HTTP logs, sequential patterns, clustering, data analysis, 2-3 hierarchies, AHC, neural networks, personalization, recommender systems, user behavior, user profile, metrology.
Web Mining and Web applications
Site Semantic Checking
Keywords : semantics, Web sites, adaptive services, Web Semantics, formal approaches, typing, natural semantics, CLF.
Participant : Thierry Despeyroux.
The main goal of the Semantic Web is to ease computer-based data mining and discovery by formalizing data that is mostly textual. Our approach is different: we are concerned with the way Web sites are constructed, taking into account their development and their semantics. In this respect we are closer to what is called content management.
Our formal approach is based on the analogy between Web sites and programs when they are represented as terms, although some differences between Web sites and programs must be pointed out:
Web sites may be spread across a great number of files. This is also the case for programs, but program files are usually all located on the same file system. With Web sites we must take into account that we may need to access different servers. Currently, a tool such as ``make'' can handle only directories, not URLs.
Information is scattered, with many forward references. A forward reference describes an object (or a piece of information) that is used before it has been defined or declared. In programs, forward references exist but are most often limited to single files, so the compiler can compile one file at a time. This is not the case for Web sites, and as it is not possible to load a complete site at once, we need other techniques.
We may need to use external resources to define the static semantics (for example a thesaurus, ontologies or an image analysis program). In one of our examples, we call the wget program to check the validity of URLs in an activity report.
We are developing a specification language to express global constraints on Web sites. The compiler is written in Prolog and produces Prolog code, taking advantage of a fast XML parser developed previously and of the ease of term manipulation in Prolog.
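The URL-validity check mentioned above (calling wget on each external URL of an activity report) can be sketched in a few lines. This is a minimal illustration only; `extract_urls` and `check_urls` are hypothetical helper names, not part of our system, and a real check would follow redirects and handle timeouts.

```python
import re
import urllib.request

def extract_urls(xml_text):
    """Collect the http(s) URLs appearing in an XML document."""
    return re.findall(r'https?://[^\s"<>]+', xml_text)

def check_urls(xml_text, fetch=urllib.request.urlopen):
    """Return the URLs that cannot be fetched.

    `fetch` is injectable so the check can be tested without network
    access (the role played by wget in our actual setup).
    """
    broken = []
    for url in extract_urls(xml_text):
        try:
            fetch(url)
        except Exception:
            broken.append(url)
    return broken
```

The same pattern extends naturally to other external resources (a thesaurus lookup, an image analysis program) invoked during static-semantics checking.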
As a real-sized test application, we have used the scientific part of the activity reports published by Inria for the years 2001 and 2002, which can be found at the following URLs:
http://www.inria.fr/rapportsactivite/RA2001/index.html and
http://www.inria.fr/rapportsactivite/RA2002/index.html.
The XML versions of these documents contain respectively 108 and 125 files, for a total of 215,000 and 240,000 lines, i.e. more than 12.9 and 15.2 Mbytes of data. Our system reported respectively 1372 and 1432 messages.
This work has been presented at the World Wide Web conference in May 2004 [32].
We have started two new applications. The first one concerns XML document mining and is presented in the following section; our specification language is used to select subparts of XML documents and to interface with an external natural-language tagger. The second is to study the way a site or an XML document containing external URLs is corrupted during its life.
XML Document Mining
Keywords : Document mining, XML clustering, XML classification.
Participants : Thierry Despeyroux, Yves Lechevallier, Brigitte Trousse, Anne-Marie Vercoustre, Mihai Jurca.
With the increasing amount of available information, there is a need for more sophisticated tools for supporting users in finding useful information. In addition to tools for retrieving relevant documents, there is a need for tools that synthesise and exhibit information that is not explicitly contained in the document collection, using document mining techniques. Document mining objectives include extracting structured information from rough text, as well as document classification and clustering.
XML documents are becoming ubiquitous because of their rich and flexible format that can be used for a variety of applications. Standard methods have been used to classify XML documents, reducing them to their textual parts. These approaches do not take advantage of the structure of XML documents that also carries important information.
We study the impact of selecting (different) parts of documents for a specific clustering task. The idea is that different parts of XML documents correspond to different dimensions of the collection that may play different roles in the classification task. We carried out some experiments in clustering homogeneous XML documents to validate an existing classification or, more generally, an organisational structure.
Our approach integrates techniques for extracting knowledge from documents with unsupervised classification of documents. The goal of unsupervised classification (or clustering) is to identify emerging classes that are not known in advance. We focus on the feature selection used for clustering and its impact on the emerging classification. This approach differs from others in two respects: first, we mix the selection of structured features with the selection of textual features; second, the latter selection is based on syntactic typing by means of a tagger. We use TreeTagger, a tool for annotating text with part-of-speech and lemma information that has been developed at the Institute for Computational Linguistics of the University of Stuttgart [77].
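The syntactic-typing step can be pictured as follows: TreeTagger emits one line per token in the form token, part-of-speech tag, lemma, and only the lemmas with selected tags are kept as textual features. The sketch below assumes English-style noun tags (`NN`, `NNS`, `NP`); the tag set actually used depends on the tagger's parameter file.

```python
def select_features(tagged_lines, keep_pos=("NN", "NNS", "NP")):
    """Keep the lemmas whose part-of-speech tag is in `keep_pos`.

    Each input line follows the TreeTagger output format:
    token<TAB>POS<TAB>lemma.
    """
    features = []
    for line in tagged_lines:
        token, pos, lemma = line.split("\t")
        if pos in keep_pos:
            features.append(lemma)
    return features
```

Using lemmas rather than surface forms merges inflectional variants ("clusters" and "cluster") into a single feature, which reduces the dimensionality of the document representation.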
Based on the selected features, the documents are then clustered using a dynamical classification algorithm that builds the prototype of each cluster as the union of all the features (words) of the documents belonging to that cluster. Table 6 gives an example of the discriminating keywords generated for each cluster; they can act as summaries for the clusters.
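The prototype-as-union idea can be sketched as a simple alternating scheme: assign each document (a set of features) to the prototype it shares the most features with, then rebuild each prototype as the union of its members' features. This is a simplified illustration of the dynamical (nuées dynamiques) style of algorithm, not the exact procedure used in our experiments; in particular, the seeding and the matching criterion are placeholders.

```python
def cluster(docs, k, n_iter=10):
    """Partition documents (sets of features) around prototype sets.

    The prototype of a cluster is the union of the feature sets of the
    documents assigned to it; each document is (re)assigned to the
    prototype sharing the most features with it.
    """
    protos = [set(docs[i]) for i in range(k)]        # naive seeding: first k docs
    assign = [0] * len(docs)
    for _ in range(n_iter):
        # assignment step: best-matching prototype
        for i, d in enumerate(docs):
            assign[i] = max(range(k), key=lambda c: len(protos[c] & d))
        # representation step: prototype = union of member features
        protos = [set() for _ in range(k)]
        for i, d in enumerate(docs):
            protos[assign[i]] |= d
    return assign, protos
```

The discriminating keywords of Table 6 would correspond, in this sketch, to the features of a prototype that appear in few other prototypes.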
We illustrate and evaluate this approach with a collection of 139 XML activity reports written by Inria research teams for the year 2003. The objective is to cluster projects into larger groups (themes), based on the keywords or on different chapters of these activity reports. We then compare the results of clustering using different feature selections with the official theme structure used by Inria between 1985 and 2003, and with the new one officially proposed in 2004.
The results, to be published at the EGC 2005 conference, show that the quality of the clustering strongly depends on the selected document features. In our collection of research reports, clustering using the foundation sections always outperforms clustering using keywords. This can be read as an indication for the organization that some parts of the activity report do not appropriately describe the research domains, and that the choice of keywords and the research presentation could be improved to carry a stronger message.
Although the analysis is closely related to our specific collection, we believe that the approach can be used in other contexts, for other XML collections (such as the Inex collection of IEEE articles) where some knowledge of the semantics of the DTD is available.
A Complete Methodology for InterSites Web Usage Mining
Keywords : usage mining, pre-processing, HTTP logs.
Participants : Doru Tanasa, Brigitte Trousse.
In recent years, Web Usage Mining (WUM) has emerged as a new field of Data Mining and gained increasing attention from both the business and research communities. Following a survey of the main WUM techniques, we proposed in earlier works a complete methodology for data preprocessing in inter-site WUM, published in [19] and [20]. Our first objective is to reduce, in a significant but pertinent manner, the size of the Web server log files. The second objective is to increase the quality of the data obtained after the classical preprocessing step by means of an original advanced data preprocessing step. To validate the efficiency of our method, we conducted an experiment using the log files of Inria's Web sites: we joined and analyzed together the log files collected from four of Inria's Web servers.
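The classical preprocessing step can be pictured as joining the per-server logs, dropping requests that carry no usage information (graphic files, robot traffic), and reordering by timestamp. The sketch below is illustrative only: the entry layout, the robot patterns and the `preprocess` name are simplifications of a real Common Log Format pipeline, not the AxISLogMiner implementation.

```python
import re

ROBOT_AGENTS = ("bot", "crawler", "spider")            # illustrative patterns only
GRAPHIC_EXT = re.compile(r"\.(gif|jpe?g|png|css)$", re.I)

def preprocess(entries):
    """Drop robot requests and accesses to graphic files, then sort the
    joined multi-server entries by timestamp.

    Each entry is a dict with 'time', 'url' and 'agent' keys, a
    simplified stand-in for a parsed HTTP log record.
    """
    kept = [e for e in entries
            if not GRAPHIC_EXT.search(e["url"])
            and not any(r in e["agent"].lower() for r in ROBOT_AGENTS)]
    return sorted(kept, key=lambda e: e["time"])
```

Filtering graphics and robots alone already removes a large fraction of raw clicks, which is how the log size can shrink significantly without losing pertinent usage data.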
In 2004, we developed AxISLogMiner, which supports our methodology for preprocessing Web logs for inter-site Web Usage Mining and integrates various algorithms for extracting sequential patterns developed in the team.
AxISLogMiner: http://www-sop.inria.fr/axis/axislogminer/
Hybrid Methods for Web Usage Mining: Improvements
Keywords : low support, sequential pattern, neural network, user behaviour, web usage mining.
Participants : Florent Masséglia, Doru Tanasa, Brigitte Trousse.
In 2004, we made some improvements to our two hybrid methods for extracting sequential patterns with a low support (cf. the 2003 AxIS activity report).
Cluster & Discover : automatic mining for sequential patterns
Related to our Cluster & Discover method implemented in the C&D application [18], we added a new feature, ``Mixed Mode'', which allows automatic mining of sequential patterns in Web logs. The mining process can thus run independently (non-interactively): the user only needs to specify the clustering parameters and the minimum support. The minimum support is then recalculated according to the size of each cluster, and the sequential patterns are extracted from that cluster. Once the sequential pattern mining is done for all the clusters, the user can visualize and explore the results.
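One plausible reading of the per-cluster recalculation is as follows: the user-supplied minimum support is a relative threshold over the whole log, and keeping the same absolute number of supporting sessions within a smaller cluster means raising the relative support inside that cluster. The function below is a sketch under that assumption, not the exact formula of the C&D implementation.

```python
def rescale_support(global_minsup, total_size, cluster_size):
    """Recompute a relative minimum support for one cluster.

    `global_minsup` is the user's relative threshold over `total_size`
    sessions; the returned value keeps the same absolute count of
    supporting sessions within `cluster_size` sessions (capped at 1.0).
    """
    absolute = global_minsup * total_size          # absolute support over the log
    return min(1.0, absolute / cluster_size)
```

For example, a 2% support over 1000 sessions corresponds to 20 sessions, i.e. a 20% support within a cluster of 100 sessions.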
Divide & Discover : improved performance in the number of divisions
We observed that the Divide & Discover method ([43],[16]) implemented in the D&D application needed numerous divisions before obtaining significant results. Indeed, the clustering process was based on the extracted sequential patterns. Sequential patterns are not numerous, and the number of unclassified clients was large. When the support was lowered, the number of sequential patterns did rise, but so did their length. The sequential patterns can thus be either too few or too long; in both cases, a clustering based on the extracted sequential patterns is not very efficient at the beginning.
In order to solve this problem we added the ``just items'' feature to the D&D application, which extracts only frequent items for the first division. There can be a large number of frequent items even with a low support, and they are easy to extract. Based on these frequent items we provided a new clustering process which is more efficient, because items are not as specific as sequential patterns (there is no combination between items as there can be within sequential patterns). The clustering is thus more robust and there are fewer unclassified clients.
As a global consequence the number of divisions has been significantly reduced.
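The contrast between the two first-division strategies can be made concrete: frequent items are just the items present in at least a given fraction of the sessions, and almost every client contains at least one of them, whereas few clients support a full sequential pattern. The sketch below illustrates the ``just items'' idea only; the function names are ours and the D&D application's actual division logic is richer.

```python
from collections import Counter

def frequent_items(sessions, minsup):
    """Items appearing in at least a `minsup` fraction of the sessions."""
    counts = Counter(item for s in sessions for item in set(s))
    threshold = minsup * len(sessions)
    return {item for item, c in counts.items() if c >= threshold}

def first_division(sessions, minsup):
    """Split clients by whether their session contains a frequent item.

    Clients sharing frequent items can then be grouped for the next
    divisions; only the `uncovered` clients remain unclassified.
    """
    freq = frequent_items(sessions, minsup)
    covered = [s for s in sessions if freq & set(s)]
    uncovered = [s for s in sessions if not (freq & set(s))]
    return freq, covered, uncovered
```

Because a session only needs to contain one frequent item to be covered, the uncovered set is small, which is why this first division leaves fewer unclassified clients than one based on sequential patterns.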
Applying our Data Mining Methods on Inria Web Data
Keywords : unsupervised clustering, contingency table, dynamic clustering algorithm, WUM data analysis, self-organizing map, dissimilarity, 2-3 hierarchies, Web data, preprocessing, clustering.
Participants : Sergiu Chelcea, Aicha El Golli, Mihai Jurca, Yves Lechevallier, Fabrice Rossi, Brieuc Conan-Guez, Doru Tanasa, Brigitte Trousse, Rosanna Verde.
In 2004, we tested our different Data Mining methods (presented in Section 6.2) on Inria logs, as reported below:
Applying Crossed Clustering Method on Web Data
An application ([40],[41]) on Web log data from the Inria Web servers allowed us to validate the proposed procedure and to suggest it as a useful tool in Web Usage Mining. The log file was processed in order to record the navigations on both URLs: www.inria.fr and www-sop.inria.fr.
This study aims to detect the behavior of the users and, at the same time, to check the efficacy of the structure of the site. Beyond the search for user typologies, we defined a hierarchical structure (taxonomy) over the pages at the different levels of the directories. The analyzed data set concerned the pages viewed by visitors connected to the Inria site from the 1st to the 15th of January, 2003. Globally, the database contained 673,389 clicks (i.e. page views in a user session), which had already been filtered from robot/spider entries and accesses to graphic files.
The data are collected in two tables where each row contains the description of a symbolic object (a navigation), that is, the distribution of the visited topics on the two Web sites. Following our aim to study the behavior of Inria Web users, we performed a crossed clustering analysis to identify a homogeneous typology of users according to the sequence of the visited Web pages or, better, according to the occurrences of visited pages within the several semantic topics.
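The rows of these tables can be pictured as topic-distribution profiles: for each navigation, the fraction of its page views falling in each semantic topic (here taken as the first directory level). This is a simplified sketch of how such symbolic objects are assembled; `contingency_table` is a hypothetical name, and the real preprocessing works on the full taxonomy rather than a flat page-to-topic map.

```python
from collections import Counter

def contingency_table(navigations, topics):
    """Build one topic-distribution row per navigation.

    `navigations` maps a navigation id to the list of visited pages;
    `topics` maps each page to its semantic topic.  Each row gives the
    relative frequency of visits per topic.
    """
    table = {}
    for nav, pages in navigations.items():
        counts = Counter(topics[p] for p in pages)
        total = sum(counts.values())
        table[nav] = {t: c / total for t, c in counts.items()}
    return table
```

Crossed clustering then partitions the rows (navigations) and the columns (topics) of such a table simultaneously.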
The results of the partition of navigations into 12 classes and of topics into 8 classes, given by the two partitions Q1 and Q2, are shown in Table 7. They exemplify an automatic clustering procedure for structuring complex data that simultaneously produces typologies of navigations and groups of topics that are homogeneous from a semantic point of view.
Applying Kohonen maps on Web data
We have applied our adaptation of the Self-Organizing Map (SOM) to complex data, and especially to dissimilarity data [33] (cf. 6.2.1), on real data by clustering data issued from the same two Inria Web servers [34].
The first analysis concerns the users' navigations, which are composed of a set of first-level visited topics. To cluster the navigations we used the affinity coefficient between two navigations; for more details, see [12].
The second analysis concerns the clustering of the first-level syntactic topics in order to find associations between them. We also applied our method to the same data set as the 2-3 AHC described in [27], using the Jaccard index to cluster Inria's visited first-level topics (in particular the research teams).
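Since both the SOM adaptation and the 2-3 AHC work directly on dissimilarity data, the input in these analyses is a matrix of pairwise Jaccard dissimilarities. The sketch below shows the standard Jaccard computation between two topics viewed as sets (for instance, the sets of navigations that visited them); the function names are ours.

```python
def jaccard_dissimilarity(a, b):
    """Jaccard dissimilarity between two sets: 1 - |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def dissimilarity_matrix(sets):
    """Full pairwise matrix, as consumed by methods that operate on
    dissimilarity data rather than on feature vectors."""
    return [[jaccard_dissimilarity(x, y) for y in sets] for x in sets]
```

Two topics visited by largely the same navigations thus get a small dissimilarity and end up in the same cluster or in neighboring neurons.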
In the final map, shown in Figure 9, the neurons have been labeled according to the semantic topic of the individual referent. For the semantic topic "research team" we also represent the theme to which it belongs. The map shows that the research teams of theme 1 were mapped to neighboring neurons, as were the research teams of theme 4 and the scientific events.
The SOM was constructed using the first 196 syntactic topics. Each neuron is labeled with the semantic topic of its referent; for the research teams we also represent the theme to which they belong.
Applying 2-3 AHC on Web data
We have applied our 2-3 AHC algorithm [24] (cf. 6.2.6) on real data by clustering data issued from two Inria Web servers [27]. More concretely, we clustered Inria's visited topics (in particular the research teams) based on the behavior of its Web sites' visitors. Knowing that Inria's scientific organization changed on April 1st, 2004, our goal was to analyze the impact of the Web site structure on users' navigations before and after this change (two 15-day periods).
To cluster the topics from the visited URLs, we used the Jaccard index on users' navigations (sets of URLs) computed during the advanced data preprocessing of the Web data. Our analyses revealed:
The global impact of the Web site on users' navigations: for example, when analyzing the first-level visited topics, 16 out of the 19 formed clusters contained research teams from the same research theme [27] (cf. Table 10).
The impact of the scientific organization: we found that the research team clustering was different for the two analyzed periods and was highly influenced by Inria's former and new scientific organizations into research themes (cf. Figures 11a and 11b).
Personalized Recommendations for Mobility Information Retrieval
Keywords : personalization, user profile, recommender system, trip.
Participants : Sergiu Chelcea, Brigitte Trousse.
As members of the MobiVIP project (PREDIT 3), we studied an emerging research field related to mobility in the transport domain, namely travel information retrieval.
To facilitate such retrieval, in collaboration with Georges Gallais (Visa Action, Inria), we proposed in [26] and [21] the use of recommender systems in a mobility context: these systems facilitate information retrieval and support both the preparation of the user's journey ("pre-trip": choice of the transport mode, schedule, route, time of the trip, ...) and carrying it out ("on-trip": interactive guidance, route visualization, destination planning). Compared to the state of the art, the originality of our approach lies in:
its recommendation computation based on pre-trip and on-line pre-trip logs,
its capacity to adapt the recommendations to the user's behavior during his information retrieval, correlated with his own movement,
the on-line learning capabilities for supporting information retrieval.
Based on such an approach (and on our first recommender, Broadway-Web), in 2004 we specified the Be-TRIP recommender system and started its implementation with the pre-trip mode. Figure 12 shows a recommendation related to the bus schedule from Valbonne to Antibes for a tourist browsing inside the CASA site (which we developed), and particularly in the Valbonne and Antibes sites.
Multi-disciplinary Approach of Internet Measures
Keywords : internauts practices, metrology, internet.
Participants : Eric Guichard, Brigitte Trousse, Florent Masséglia, Doru Tanasa, Yves Lechevallier.
Based on the main contributions of the workshop ``Mesures de l'internet'' organised in Nice in May 2003, E. Guichard supervised a collective book [11] published by ``Les Canadiens en Europe'' in April 2004.
This book describes various approaches, coming from mathematics and computer science (linked to internet metrology), linguistics, geography and the human sciences, related to internet practices. Let us note two AxIS contributions ([14], [17]) in this book.
This work is related to the pragmatic and pluri-disciplinary approach adopted by the team since its creation in order to better understand the practices of internet users: the benefits of such an approach concern the definition of relevant evaluation criteria for Web-based information systems (or relevant usage-analysis variables), as well as relevant specifications for the (re)design of such systems.