Section: New Results
Web Usage Mining Methods
Keywords : data visualization, web usage mining, graph visualization, non-linear projection, dissimilarities, self-organizing map.
Participants : Fabrice Rossi, Yves Lechevallier, Aicha El Golli.
The analysis of the content of a web site based on usage data is an important task, as it provides insight into the organization of the site and its adequacy to user needs. The (dis)agreement between the prior structure of the site (in terms of hyperlinks) and the actual trajectories of its users is of particular interest. In many situations, users have to follow complex paths through the site to reach the pages they are looking for, mainly because they are interested in topics that the creators of the site considered unrelated and therefore left unlinked. Conversely, some hyperlinks are rarely used, for instance because they link documents that are accessed by different user groups.
In 2005 we studied two general tools for visualizing the content of a site based on usage data:
in  we used the logical and hierarchical organization of the web site to simplify the representation of user trajectories. The simplified trajectories are used to compute dissimilarities between URL groups defined according to the site hierarchy (groups are also called topics in section 6.4.2 ). The groups, which reflect the prior semantic structure of the site, are represented by the minimum spanning tree induced by the dissimilarity matrix. This makes it possible to explore the relationship between prior categories and user browsing patterns. The method was applied to the INRIA web site and gave satisfactory results;
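The minimum-spanning-tree representation of URL groups can be sketched as follows; this is an illustrative toy example (the group names and dissimilarity values are invented, not the actual INRIA site data):

```python
# Minimal sketch (illustrative, not the authors' implementation): build the
# minimum spanning tree induced by a dissimilarity matrix between URL groups,
# using Prim's algorithm. Group names and dissimilarities are hypothetical.

def minimum_spanning_tree(diss, labels):
    """Return MST edges (label_a, label_b, dissimilarity) via Prim's algorithm."""
    n = len(diss)
    in_tree = {0}                      # grow the tree from the first group
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or diss[i][j] < best[2]):
                    best = (i, j, diss[i][j])
        i, j, d = best
        in_tree.add(j)
        edges.append((labels[i], labels[j], d))
    return edges

groups = ["research", "jobs", "news", "software"]
diss = [[0.0, 0.8, 0.3, 0.9],
        [0.8, 0.0, 0.7, 0.4],
        [0.3, 0.7, 0.0, 0.6],
        [0.9, 0.4, 0.6, 0.0]]
print(minimum_spanning_tree(diss, groups))  # n - 1 edges connecting the groups
```

The tree keeps only the smallest dissimilarities needed to connect all groups, so strongly co-browsed topics end up adjacent even when no hyperlink relates them.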
in  we applied the same general methodology for computing dissimilarities between URL groups, but used an adapted version of the Self Organizing Map (SOM) to visualize the clusters obtained from the dissimilarity matrix (see section 6.3.1 for details on this version of the SOM).
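The idea of a SOM working directly on dissimilarities can be sketched in the spirit of a ``median SOM'', where prototypes are restricted to actual data points; this is an illustrative simplification, not the adapted algorithm referenced above, and the grid, data and parameters are invented:

```python
# Sketch of a SOM for dissimilarity data (``median SOM'' flavour): each map
# unit's prototype is a data point, so only a dissimilarity matrix is needed.
# Illustrative simplification; neighborhood schedule and init are assumptions.
import math

def median_som(diss, n_units, iters=10):
    """diss: symmetric dissimilarity matrix between items.
    Returns the prototype item index of each unit on a 1-D map."""
    n = len(diss)
    protos = list(range(n_units))  # initialize prototypes with the first items
    for t in range(iters):
        sigma = max(1.0, n_units / 2) * (1 - t / iters) + 0.1  # shrinking radius
        # best matching unit of each item, via dissimilarity to prototypes
        bmu = [min(range(n_units), key=lambda u: diss[i][protos[u]])
               for i in range(n)]
        # each prototype becomes the generalized median under neighborhood weights
        for u in range(n_units):
            def cost(p):
                return sum(math.exp(-(bmu[i] - u) ** 2 / (2 * sigma ** 2)) * diss[i][p]
                           for i in range(n))
            protos[u] = min(range(n), key=cost)
    return protos

# two well-separated groups of items: {0, 1, 2} and {3, 4, 5}
diss = [[0.0 if i == j else (0.1 if i // 3 == j // 3 else 1.0)
         for j in range(6)] for i in range(6)]
print(median_som(diss, 2))  # one prototype per group is expected
```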
InterSites Web Usage Mining: preprocessing methodology and crossed clustering
Keywords : Web Usage Mining, Complex data, Preprocessing, Crossed-clustering.
Participants : Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse, Rosanna Verde.
In the context of the ECML/PKDD 2005 Discovery Challenge, we improved  ,  our preprocessing methodology for intersite Web Usage Mining  . A clickstream dataset was proposed in the Discovery Challenge for the first time this year. The dataset consisted of requests for page views on seven different e-commerce Web sites from the Czech Republic. Each request contained a PHP SessionID automatically generated for each new user visit on each server (unique IDs).
Based on Tanasa's preprocessing methodology  , we defined a new methodology to preprocess the provided datasets and store them in a data warehouse. Since a user moving between shops can have multiple SessionIDs during a single visit, one on each shop, we regrouped these PHP SessionIDs into intersite user visits. More precisely, we regrouped the SessionIDs belonging to a single user (same IP) into a Group of SessionIDs, corresponding to the user's actual (intersite) visit. This was done by comparing the Referrer with the previously accessed URLs (within a reasonable time window) each time the user moved to another shop. We thus reduced the number of user visits by 23.88%.
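The regrouping idea can be sketched as follows; field names and the window length are assumptions made for illustration, not the report's actual values:

```python
# Illustrative sketch of the SessionID regrouping step: two PHP sessions of the
# same IP are merged into one intersite visit when the Referrer of the new
# session matches a URL just accessed under a previous session, within a time
# window. Field names and the window length are assumptions.

WINDOW = 600  # seconds; stands in for the ``reasonable time window''

def group_sessions(requests):
    """requests: dicts with keys ip, session_id, url, referrer, ts,
    sorted by timestamp. Returns a mapping session_id -> intersite visit id."""
    group_of = {}    # session_id -> visit id
    last_seen = {}   # ip -> list of (url, ts, session_id) in arrival order
    next_visit = 0
    for r in requests:
        sid = r["session_id"]
        if sid not in group_of:
            visit = None
            # look backwards for a recent URL of the same IP matching the Referrer
            for url, ts, prev_sid in reversed(last_seen.get(r["ip"], [])):
                if r["ts"] - ts <= WINDOW and r["referrer"] == url:
                    visit = group_of[prev_sid]   # same user moving to another shop
                    break
            if visit is None:
                visit = next_visit               # genuinely new intersite visit
                next_visit += 1
            group_of[sid] = visit
        last_seen.setdefault(r["ip"], []).append((r["url"], r["ts"], sid))
    return group_of
```

A session whose first Referrer points to a page just served to the same IP under another SessionID is thus folded into that earlier visit, which is what reduces the visit count.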
To analyze the traffic load on the seven shop sites, we grouped the requests into Time Periods (slices of date and hour). We cross-clustered these time periods against the visited products using a generalized dynamic algorithm  .
The result is a confusion table containing classes of periods and products (see Figure 8 ).
Such analyses allow us to identify the best hours for marketing strategies, such as flash promotions, online advice, banner publishing, etc. Other analyses could be planned in the future, exploiting for example the link between consumer activities and time periods per shop, or focusing on multi-shop user visits.
This work builds on a previous result published in 2003: in  we proposed a crossed clustering algorithm to partition a set of objects into a predefined number of classes and, at the same time, to determine a structure (taxonomy) on the categories of the object descriptors. This procedure is a simultaneous clustering algorithm on contingency tables. The convergence of the algorithm to the best partitions of the objects into r classes and of the descriptor categories into c groups is guaranteed. This algorithm extends the dynamical clustering algorithms, applied here in the context of Web Usage Mining. In particular, we had already applied it to Web log data coming from the HTTP log files of the INRIA Web server  .
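The simultaneous clustering idea can be sketched as a ``double k-means'' on a contingency table, where rows (e.g. time periods) and columns (e.g. products) are alternately reassigned; this is an illustrative simplification, not the published algorithm:

```python
# Double k-means sketch of crossed clustering on a contingency table: rows and
# columns are alternately reassigned so the table is summarized by an r x c
# block structure. Illustrative simplification; init strategy is an assumption.

def crossed_clustering(T, r, c, iters=10):
    """T: contingency table (list of lists). Returns row and column labels."""
    n, m = len(T), len(T[0])
    row_lab = [i * r // n for i in range(n)]  # deterministic init; a real
    col_lab = [j * c // m for j in range(m)]  # implementation would use restarts
    for _ in range(iters):
        # block means given the current crossed partition
        sums = [[0.0] * c for _ in range(r)]
        cnts = [[0] * c for _ in range(r)]
        for i in range(n):
            for j in range(m):
                sums[row_lab[i]][col_lab[j]] += T[i][j]
                cnts[row_lab[i]][col_lab[j]] += 1
        mean = [[sums[a][b] / max(cnts[a][b], 1) for b in range(c)]
                for a in range(r)]
        # reassign each row, then each column, to its best class
        for i in range(n):
            row_lab[i] = min(range(r), key=lambda a: sum(
                (T[i][j] - mean[a][col_lab[j]]) ** 2 for j in range(m)))
        for j in range(m):
            col_lab[j] = min(range(c), key=lambda b: sum(
                (T[i][j] - mean[row_lab[i]][b]) ** 2 for i in range(n)))
    return row_lab, col_lab
```

On a period-by-product count table, the returned row classes play the role of the period classes and the column classes that of the product classes in the confusion table.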
Extracting Dense Periods of Sequential Patterns
Keywords : sequential patterns, web usage mining, period.
Participant : Florent Masséglia.
This work was done in collaboration with the LGI2P and the LIRMM (see 8.2.5 ) and has been published in  . Existing Web Usage Mining techniques are currently based on an arbitrary division of the data ( e.g. ``one log per month'') or guided by presumed results ( e.g. ``what is the customers' behaviour during the Christmas purchase period?''). These approaches have two main drawbacks. First, they depend on this arbitrary organization of the data. Second, they cannot automatically extract ``seasonal peaks'' from the stored data.
The work presented in this section performs a specific data mining process (in particular, the extraction of frequent behaviours) in order to automatically discover the densest periods. Our method extracts, among the whole set of possible combinations, the frequent sequential patterns related to the extracted periods. A period is considered dense if it contains at least one sequential pattern that is frequent for the set of users connected to the Web site during that period.
Our method is based on:
a new representation of the Web log file, designed to retrieve the ``login'' and ``logout'' information associated with each user.
a rewriting of the log in order to build periods based on the information of step 1. A period begins at the arrival of a new user and ends at the departure of a ``connected'' user.
a heuristic designed to extract approximate frequent sequences from each period built at step 2.
The third step is based on Perio , the heuristic we developed for this purpose, which is largely inspired by genetic algorithms.
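Steps 1 and 2 above can be sketched as follows (this is not the Perio heuristic itself; the session data and representation are invented for illustration):

```python
# Illustrative sketch of the period-building step: extract ``login''/``logout''
# events per user and cut the timeline into periods, each delimited by two
# consecutive arrival/departure events. Session data are invented.

def build_periods(sessions):
    """sessions: list of (user, login_ts, logout_ts).
    Returns (start, end, connected_users) for every interval between
    two consecutive arrival/departure events."""
    events = sorted({t for _, a, b in sessions for t in (a, b)})
    periods = []
    for start, end in zip(events, events[1:]):
        connected = {u for u, a, b in sessions if a <= start and b >= end}
        periods.append((start, end, connected))
    return periods

sessions = [("u1", 0, 40), ("u2", 10, 30), ("u3", 20, 50)]
for period in build_periods(sessions):
    print(period)
```

Each resulting period would then be handed to the pattern extraction of step 3; the period is kept as dense only if at least one sequential pattern is frequent among its connected users.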