Section: New Results
Web Usage and Internet Usage Mining Methods
Dynamic Clustering of Web Usage Data for Characterizing Visitor Groups
Keywords : dynamic clustering algorithm, symbolic data analysis, unsupervised clustering, web usage mining.
Participants : Alzennyr Da Silva, Yves Lechevallier, F.A.T. de Carvalho, Brigitte Trousse.
The analysis of a web site based on its usage data is an important task, as it provides insight into the organization of the site and its adequacy with respect to user needs. Such knowledge is especially valuable for business applications: analyzing usage data can help organizations, among other things, to plan cross-marketing strategies and to assess the effectiveness of promotional campaigns. We thus defined an approach for discovering the profiles of visitor groups. To this end, we map user interests into symbolic objects, each of which represents a user's successful interaction with the site. Symbolic objects constitute the basis of Symbolic Data Analysis (SDA), whose general aim is to extend the processing of classical data types to support more complex data. In conventional datasets, the objects are individualized, whereas in symbolic datasets they are unified by means of relationships. In our proposal, we identify groups of users with similar behaviour by means of a dynamic clustering algorithm that applies a context-dependent dissimilarity measure defined by Francisco De Carvalho. The benchmark data set consists of a one-year log file from the web site of the CIn (Informatics Centre of UFPE, Brazil). Our approach was able to identify the profiles of distinct user typologies based on their navigational preferences. Although the method was applied here to identify visitor groups of an educational web site, it is generic enough to be applied to any other domain. The results of our experiments were published this year in two international conferences [35], [36].
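The alternation at the heart of a dynamic clustering algorithm (allocate each object to its closest prototype, then recompute each prototype) can be sketched as follows. This is only an illustrative sketch under explicit assumptions: each visitor is reduced to the set of page categories visited, the Jaccard dissimilarity stands in for the context-dependent measure mentioned above (which is not reproduced here), and prototypes are taken to be medoids. The function names and the toy data are hypothetical.

```python
def jaccard_dissimilarity(a, b):
    """Stand-in dissimilarity between two sets (NOT the context-dependent measure)."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def dynamic_clustering(objects, prototypes, dissimilarity, max_iter=20):
    """Alternate allocation and representation steps until the prototypes are stable."""
    clusters = [[] for _ in prototypes]
    for _ in range(max_iter):
        # Allocation step: each object joins the cluster of its closest prototype.
        clusters = [[] for _ in prototypes]
        for obj in objects:
            best = min(range(len(prototypes)),
                       key=lambda k: dissimilarity(obj, prototypes[k]))
            clusters[best].append(obj)
        # Representation step: the new prototype is the medoid of each cluster
        # (assumption); an empty cluster keeps its previous prototype.
        new_prototypes = [
            min(members, key=lambda c: sum(dissimilarity(c, m) for m in members))
            if members else proto
            for members, proto in zip(clusters, prototypes)
        ]
        if new_prototypes == prototypes:
            break
        prototypes = new_prototypes
    return clusters, prototypes


# Hypothetical visitors, each described by the set of page categories visited.
visitors = [
    frozenset({"courses", "exams"}),
    frozenset({"courses", "exams", "library"}),
    frozenset({"courses", "library"}),
    frozenset({"research", "publications"}),
    frozenset({"research", "publications", "staff"}),
    frozenset({"research", "staff"}),
]
clusters, prototypes = dynamic_clustering(
    visitors, [visitors[0], visitors[3]], jaccard_dissimilarity
)
```

On this toy data the two groups (teaching-oriented and research-oriented visitors) are separated after two iterations; with real symbolic objects, both the description of a visitor and the dissimilarity would be richer.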
Crossed Clustering in Web Usage Mining
Keywords : contingency table, crossed clustering algorithm, unsupervised clustering, web usage mining.
Participants : Alzennyr Da Silva, Yves Lechevallier, Sergiu Chelcea, Doru Tanasa, Brigitte Trousse.
The emergence of new information technologies such as the World Wide Web has led to an explosion in the amount of available data, and the need to summarize these data has become obvious. In this context, we proposed an approach to automatically build homogeneous classes from these data and to define new statistical units to describe them. The aim of the summarization is to reduce the initial amount of data while retaining a maximum of information. This kind of problem is addressed in the Web Usage Mining framework. Our approach is based on the crossed clustering method, whose objective is to obtain simultaneously a row partition and a column partition from a contingency table. This method is an effective solution both for finding a typology of individuals (represented by the rows of the table) and for building a taxonomy on the variable values (represented by the columns of the table). As a result, we identify dominant groups of users as well as the sets of pages visited by each group. One of the goals of this analysis is to better understand users' behaviour and, consequently, to propose changes in the web site organization in order to better serve the users. We applied our approach to the Web log data provided by the IT centre of UFPE (Recife, Brazil) [66] and also to Web log data recording accesses to seven different e-commerce Web sites from the Czech Republic [33].
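To make the objects manipulated by crossed clustering concrete, the sketch below builds a users-by-pages contingency table from (user, page) records and collapses it according to a row partition and a column partition. It does not implement the crossed clustering optimization itself: the two partitions are supplied by hand, whereas the algorithm would search for them. The function names, record layout and toy data are assumptions made for illustration.

```python
from collections import Counter


def contingency_table(records):
    """Build a users x pages table of visit counts from (user, page) records."""
    counts = Counter(records)
    users = sorted({user for user, _ in counts})
    pages = sorted({page for _, page in counts})
    table = [[counts[(user, page)] for page in pages] for user in users]
    return users, pages, table


def collapse(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the cells of the table inside each (row cluster, column cluster) block."""
    summary = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            summary[row_labels[i]][col_labels[j]] += value
    return summary


# Hypothetical log records: (user, requested page).
records = [
    ("u1", "/courses"), ("u1", "/courses"), ("u1", "/exams"),
    ("u2", "/courses"),
    ("u3", "/research"), ("u3", "/papers"),
    ("u4", "/papers"),
]
users, pages, table = contingency_table(records)

# Hand-chosen partitions (assumption): users {u1, u2} vs {u3, u4},
# pages {/courses, /exams} vs {/papers, /research}.
summary = collapse(table, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
```

A good pair of partitions concentrates the counts in a few blocks of the collapsed table, as in this toy case where each user group visits only its own group of pages.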
Discovering Generalized Usage Patterns: the GWUM method
Keywords : WUM, Generalization, sequential pattern.
Participants : Doru Tanasa, Florent Masseglia, Brigitte Trousse, Yves Lechevallier.
This work [45], [59], [110] proposes an original method for Web usage analysis based on a user-driven generalization of Web pages. The information extracted about these pages, for clustering purposes, concerns the users' accesses to them. It is obtained from the referrer field of the Web access logs when the user employed a search engine to find the page. The main idea is to characterize a Web page by the keywords that were given to a search engine in order to find this page. For instance, if most of the accesses to the Web page about job opportunities in the AxIS team (``/axis/jobs-sop.htm'') come from a search engine with the keywords ``Position'' and ``Internship'', then this page may be generalized by (characterized with) the keywords ``Position,Internship''. This principle of generalization is illustrated in figure 7.
The traditional data mining step is then applied not to the Web pages themselves but to their generalizations. The experiment that we carried out illustrates our methodology and shows some of the benefits of such an approach for the discovery of frequent sequential patterns: the generalized patterns obtained have a higher support and are easier to interpret.
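The generalization principle described above can be sketched in a few lines: extract the search keywords from the referrer of each hit, then characterize each page by its most frequent keywords. This is a minimal sketch and not the GWUM implementation; it assumes that the search engine passes the query in a ``q'' URL parameter, and the host ``search.example'', the function names and the sample hits are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse, parse_qs

# Assumption: the search engine passes the user's query in a "q" parameter.
QUERY_PARAMS = ("q",)


def referrer_keywords(referrer):
    """Return the lower-cased search keywords found in a referrer URL."""
    query = parse_qs(urlparse(referrer).query)
    words = []
    for name in QUERY_PARAMS:
        for value in query.get(name, []):
            words.extend(value.lower().split())
    return words


def generalize_pages(hits, top_n=2):
    """Map each page to its top_n most frequent referrer keywords.

    hits: iterable of (page, referrer) pairs taken from an access log.
    Pages never reached through a search engine are left out.
    """
    counts = defaultdict(Counter)
    for page, referrer in hits:
        counts[page].update(referrer_keywords(referrer))
    return {
        page: tuple(sorted(word for word, _ in counter.most_common(top_n)))
        for page, counter in counts.items()
        if counter
    }


# Hypothetical hits on the job opportunities page.
hits = [
    ("/axis/jobs-sop.htm", "http://search.example/search?q=position+internship"),
    ("/axis/jobs-sop.htm", "http://search.example/search?q=internship+position+inria"),
    ("/axis/jobs-sop.htm", "http://www-sop.inria.fr/axis/"),
]
generalized = generalize_pages(hits)
```

Sequential pattern mining would then run on navigation sequences in which each page is replaced by its keyword tuple, so that different pages sharing the same keywords contribute to the same pattern.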
Mining Interesting Periods from Web Access Logs
Keywords : sequential pattern, Web logs, WUM, periods.
Participants : Alice Marascu, Florent Masseglia.
In this work, done in collaboration with M. Teisseire (LIRMM) and P. Poncelet (Ecole des Mines d'Alès), we focused on a particular problem that has to be considered by Web Usage Mining techniques: the arbitrary division of the data that is common practice today. This problem was introduced in [100]. The division comes either from an arbitrary decision to provide one log per x days (e.g. one log per month), or from a wish to find particular behaviours (e.g. the behaviour of the Web site users from November 15 to December 23, during Christmas purchases). To better understand our goal, let us consider the behaviour of students connected for a working session. Let us assume that these students belong to two different groups of twenty students each. The first group was connected on 31/01/05 while the other was connected on 01/02/05 (i.e. one day later). During the working session, students have to perform the following navigation: first they access the URL ``www-sop.inria.fr/cr/tp_accueil.html'', then ``www-sop.inria.fr/cr/tp1_accueil.html'', followed by ``www-sop.inria.fr/cr/tp1a.html''.
Let us consider, as is usual in traditional approaches, that we analyze access logs per month. During January, we can extract only twenty similar behaviours sharing the working session, among the 200 000 navigations in the log. Furthermore, even when considering a range of one month or of one year, this sequence of navigation does not appear often enough in the logs (20/200 000) to be easily extracted. Let us now consider that we are provided with logs for a very long period (e.g. several years). With our method, we can find that there exists at least one dense period in the range [31/01-01/02]. Furthermore, we know that, during this period, 340 users were connected. We thus obtain the following new knowledge: nearly 12% (i.e. 40 of the 340 connected users) of the users visited consecutively the URLs ``tp_accueil.html'', ``tp1_accueil.html'', and finally ``tp1a.html''.
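The effect of the division on the support of a behaviour can be checked with a small computation. The sketch below uses the figures of the example (20 matching navigations among 200 000 in the January log, 40 among the 340 users of the dense period); the navigations themselves are synthetic, the function names are hypothetical, and a navigation is assumed to match when it contains the pages of the pattern in order, possibly with other pages in between.

```python
def follows_pattern(navigation, pattern):
    """True if the pages of `pattern` occur in `navigation` in the same order."""
    pages = iter(navigation)
    # `page in pages` advances the iterator, so the order is enforced.
    return all(page in pages for page in pattern)


def support(navigations, pattern):
    """Fraction of navigations that contain the pattern."""
    matching = sum(follows_pattern(nav, pattern) for nav in navigations)
    return matching / len(navigations)


pattern = ["tp_accueil.html", "tp1_accueil.html", "tp1a.html"]
working_session = list(pattern)
other_navigation = ["index.html", "tp_accueil.html"]  # synthetic filler

# Monthly division: only the first group of students falls in January.
january_log = [working_session] * 20 + [other_navigation] * (200_000 - 20)
# Dense period [31/01-01/02]: both groups, 340 connected users in total.
dense_period = [working_session] * 40 + [other_navigation] * 300

monthly_support = support(january_log, pattern)
dense_support = support(dense_period, pattern)
```

With a typical minimum support threshold, the behaviour is invisible in the monthly log and clearly frequent in the dense period, which is the motivation for searching over periods rather than fixing them in advance.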
The outline of our method [52] is the following: enumerate the sets of periods in the log to be analyzed, then identify which ones contain frequent sequential patterns. Our method processes the log file by considering millions of periods (each period corresponds to a sub-log), and its principle is to extract frequent sequential patterns from each period. Since our proposal is a heuristic-based miner, our goal is to provide a result having the following characteristics:
For each period p in the history of the log, let realResult be the set of frequent behavioural patterns embedded in the navigation sequences of the users belonging to p. realResult is the result to obtain (i.e. the result that would be exhibited by a sequential pattern mining algorithm exploring the whole set of solutions by working on the clients of Cp). Let us now consider perioResult, the result obtained by running our method. We want to minimize $|\{S_i \in \mathit{perioResult} : S_i \notin \mathit{realResult}\}|$ (with Si standing for a frequent sequence in perioResult), as well as maximize $|\{R_i \in \mathit{realResult} : R_i \in \mathit{perioResult}\}|$ (with Ri standing for a frequent sequence in realResult). In other words, we want to find most of the sequences occurring in realResult while preventing the proposed result from becoming larger than it should (otherwise the set of all client navigations would be considered a good solution, which is obviously wrong).
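The two quantities involved in this trade-off (sequences proposed by the heuristic that are not in realResult, and sequences of realResult that the heuristic recovers) can be computed directly once both results are represented as sets of sequences. The sketch below is illustrative only: the function names and the toy sequences are hypothetical, and sequences are encoded as tuples of page identifiers.

```python
def spurious_count(perio_result, real_result):
    """Number of proposed sequences that are not in the exact result (to minimize)."""
    return len(set(perio_result) - set(real_result))


def recovered_count(perio_result, real_result):
    """Number of exact sequences that were found by the heuristic (to maximize)."""
    return len(set(real_result) & set(perio_result))


# Hypothetical results for one period; each sequence is a tuple of pages.
real_result = {("a", "b"), ("a", "c"), ("b", "c", "d")}
perio_result = {("a", "b"), ("b", "c", "d"), ("d", "e")}

spurious = spurious_count(perio_result, real_result)
recovered = recovered_count(perio_result, real_result)

# A degenerate answer that returns every navigation recovers all of
# real_result, but at the price of a large number of spurious sequences.
everything = real_result | {("x", str(i)) for i in range(100)}
degenerate_spurious = spurious_count(everything, real_result)
degenerate_recovered = recovered_count(everything, real_result)
```

Looking at both counts together is what rules out the degenerate answer: it maximizes the second quantity but performs very poorly on the first.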
We conducted experiments and extracted interesting behaviours. These behaviours show that an analysis based on multiple divisions of the log (as described above) makes it possible to obtain behavioural patterns embedded in short or long periods.
P2P Usage Mining
Keywords : Peer-2-Peer (peer to peer, p2p), sequential patterns, genetic algorithms.
Participant : Florent Masseglia.
With the huge number of information sources available on the Internet, Peer-to-Peer (P2P) systems offer a novel kind of system architecture providing the large-scale community with applications for file sharing, distributed file systems, distributed computing, messaging and real-time communication. P2P applications also provide a good infrastructure for data and compute intensive operations such as data mining.
In [53] we proposed a new approach for improving resource searching in a dynamic and distributed database such as an unstructured P2P system. This approach takes advantage of data mining techniques. By using a genetic-inspired algorithm, we propose to extract patterns or relationships occurring in a large number of nodes. Such knowledge is very useful for suggesting to the user files that are often downloaded or requested, according to the behaviour of a majority of users. It may also help to avoid extra bandwidth consumption. For instance, it may be discovered, in a P2P file sharing network such as Gnutella [91], that the ``Mandriva Linux 2005'' distribution is often downloaded as ``CD1.iso, then CD2.iso and finally CD3.iso''.
We consider that the connected nodes can interact with a special peer (a ``meter peer'') in order to provide the end user with a good approximation of the patterns embedded in this very large distributed database. To evaluate our approach, we implemented a simulator of an unstructured P2P system. Experiments were also conducted on real datasets.
Web Usage Mining for Ontology Evolution
Keywords : ontology management, ontology evolution, web usage mining, tourism.
Participants : Brigitte Trousse, Marie-Aude Aufaure, Yves Lechevallier, Florent Masseglia.
This year we proposed, in collaboration with B. Legrand (LIP6), an original approach for ontology management in the context of Web-based information systems. Our approach relies on the usage analysis of the considered Web site, as a complement to the existing approaches based on content analysis of Web pages. Our methodology is based on knowledge discovery techniques, mainly applied to HTTP Web logs, and aims at comparing the knowledge discovered about usage with the existing ontology in order to propose new relations between concepts.
We illustrate our approach on a Web site provided by French local tourism authorities (for the city of Metz) (cf. section 4.4) with the use of clustering and sequential pattern discovery methods. One major contribution of this work is thus the application of usage analysis to support ontology evolution and/or web site reorganization.
This work has been accepted for publication as a book chapter [112].
Web Site Analysis based on an Ergonomic and Web Usage Mining Approach
Keywords : web usage mining, ergonomics, evaluation, Web site.
Participants : Bernard Senach, Brigitte Trousse.
Web usage analyses are often carried out from different points of view and with mutually exclusive techniques. For instance, the log analysis of a site is rarely related to the ergonomic analysis of this site (and conversely). The MobiVIP project (cf. section 7.1.2) has been an opportunity to set up a new methodology coupling the ergonomic approach with the technical log analysis. The study [70], [71] was conducted on a transportation web site used to consult various information about a bus network (line structure, geographical information, timetables): URL http://www.envibus.fr.
The illustration in Figure 8 sums up the different steps which have been followed:
A "discount usability" technique was first used to point out potential user difficulties, linked for instance to low consistency of the site structure or of the graphical user interface. Each suspected problem then drove a specific log analysis and, in some cases, it was possible to find patterns in the usage data confirming the hypothesis. For instance, to be used efficiently, some decision aids on the Envibus site required topographical knowledge of the area, and it was suspected that this could be a reason for users to give up during a transaction. The log analysis showed that this assumption was correct, as the ratio of interrupted requests was very high on the corresponding pages. An important benefit of the coupling is also that the improvements suggested to the user interface designer are much more powerful, as more information can be provided and the quantitative data reinforce the qualitative analysis.