Team AxIS


Section: Software

Data Mining and Web Usage Mining

Clustering and Classification Toolbox

Participants: Marc Csernel, Yves Lechevallier [co-correspondent], Brigitte Trousse [co-correspondent].

We developed and maintained a collection of clustering and classification software, written in C++ and/or Java:

A Web interface, developed in C++ and running on our internal Apache Web server, is available for the following methods: SCluster, Div, Cdis, CCClust.

Previous versions of the above software have been integrated into the SODAS 2 software [93], which resulted from the European project ASSO (Analysis System of Symbolic Official data, 2001-2004). SODAS 2 supports the analysis of multidimensional complex data (numerical and non-numerical) coming from databases, mainly in statistical offices and administrations, using Symbolic Data Analysis [69]. This software is registered at APP. The latest executable version of the SODAS 2 software, with its user manual, is available for download. See the 2009 AxIS annual report for more details on the main contributions of AxIS to SODAS [79], [105], which have been registered at APP.

Clustering Methods for mining Sequential Patterns in Data Streams

Participants: Maurice Yared, Florent Masseglia, Brigitte Trousse [correspondent], Yves Lechevallier.

As a result of Marascu's thesis (2007-2009) [91], a collection of software tools has been developed for knowledge discovery and security in data streams (cf. our 2009 annual report for more details on WOD, an outlier detection method, and GEAR, an implementation of the history management strategy).

Three clustering methods for mining sequential patterns in data streams were developed in Java by A. Marascu during her thesis [91]. The tools take batches of data in the "Client-Date-Item" format and output clusters of sequences together with their centroids, each centroid being an approximate sequential pattern computed with an alignment technique.
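To make the input format concrete, the following minimal sketch groups a batch of "Client-Date-Item" records into one date-ordered sequence per client, which is the kind of structure a sequence clustering step would consume. It is an illustration only, not the code from the thesis: the tuple representation, the function name `build_sequences` and the sample items are assumptions, and the clustering and alignment steps themselves are not shown.

```python
from collections import defaultdict
from datetime import date


def build_sequences(records):
    """Group (client, date, item) records into one sequence per client.

    A sequence is a list of itemsets ordered by date; items that share
    the same client and date end up in the same itemset.
    """
    by_client = defaultdict(lambda: defaultdict(set))
    for client, day, item in records:
        by_client[client][day].add(item)
    return {
        client: [sorted(items) for _, items in sorted(days.items())]
        for client, days in by_client.items()
    }


# Hypothetical batch; the real tools read "Client-Date-Item" data from a stream.
batch = [
    ("c1", date(2010, 3, 1), "home"),
    ("c1", date(2010, 3, 2), "news"),
    ("c2", date(2010, 3, 1), "home"),
    ("c1", date(2010, 3, 1), "search"),
]
sequences = build_sequences(batch)
```

Here `sequences["c1"]` holds two itemsets, the first containing both items seen on the first date. Keeping itemsets sorted makes sequences directly comparable across clients.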

This year, the Java code of SMDS has been integrated into the MIDAS demonstrator [68] (cf. 8.2.2), and a C++ version [61] has been implemented for the CRE contract with Orange Labs, together with a visualisation module written in Java (cf. 7.1). SMDS has been applied to data from the Orange mobile portal.

AWLH for Pre-processing Web Logs

Participants: Yves Lechevallier [co-correspondent], Brigitte Trousse [co-correspondent].

AWLH derives from the AxISlogMiner preprocessing software, which implements the multi-site log preprocessing methodology developed by D. Tanasa in his thesis [15] for Web Usage Mining (WUM). In the context of the Eiffel project (2008-2009), we isolated and redesigned the core of the AxISlogMiner preprocessing tool as a set of tools for pre-processing web log files, which we called AWLH. AWLH can extract and structure log files from several Web servers using different input formats. As usual, the web log files are cleaned before being passed to data mining methods, because they contain many noisy entries (for example, robots add considerable noise to the analysis of user behaviour, so it is important to identify robot requests). The data are stored in a database whose model has been improved.
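The cleaning step described above can be illustrated with a short sketch that removes requests issued by robots. This is not AWLH's implementation: it assumes log entries have already been parsed into dictionaries with hypothetical `host`, `path` and `user_agent` keys, and it uses two common heuristics (a request for `/robots.txt`, or a robot-like keyword in the user agent) with an assumed keyword list.

```python
# Assumed keyword list; real robot detection usually relies on richer rules.
BOT_MARKERS = ("bot", "crawler", "spider")


def robot_hosts(entries):
    """Return the hosts that look like robots according to two heuristics."""
    hosts = set()
    for entry in entries:
        agent = entry["user_agent"].lower()
        asked_robots_txt = entry["path"] == "/robots.txt"
        if asked_robots_txt or any(marker in agent for marker in BOT_MARKERS):
            hosts.add(entry["host"])
    return hosts


def remove_robot_requests(entries):
    """Drop every request coming from a host flagged as a robot."""
    robots = robot_hosts(entries)
    return [entry for entry in entries if entry["host"] not in robots]


# Hypothetical parsed log entries.
log = [
    {"host": "10.0.0.1", "path": "/index.html", "user_agent": "Mozilla/5.0"},
    {"host": "10.0.0.2", "path": "/robots.txt", "user_agent": "Mozilla/5.0"},
    {"host": "10.0.0.2", "path": "/index.html", "user_agent": "Mozilla/5.0"},
    {"host": "10.0.0.3", "path": "/a.html", "user_agent": "ExampleBot/1.0"},
]
cleaned = remove_robot_requests(log)
```

Flagging hosts first and filtering afterwards removes all requests from a robot, including those that do not match a heuristic on their own.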

Now the current version of our Web log processing offers:

To record a user's click actions in real time, we developed in 2009 a tool for capturing Web user actions, based on an open source project called "OpenSymphony ClickStream". An extended version of AWLH has also been developed for capturing and structuring data from annotated documents inside discussion forums.

Two Methods for Extracting Sequential Patterns with Low Support

Participants: Brigitte Trousse [correspondent], Florent Masseglia.

Two methods for extracting sequential patterns with low support have been developed by D. Tanasa in his thesis [102], in collaboration with F. Masseglia and B. Trousse: Cluster & Divide [102] and Divide & Discover [13], [102].

See Chapter 3 of Tanasa's PhD document for more details on these two methods and on a framework for developing methods for extracting sequential patterns with low support.

ATWUEDA for Analysing Evolving Web Usage Data

Participant: Yves Lechevallier [correspondent].

ATWUEDA [82], a tool for Web Usage Evolving Data Analysis, was developed by A. Da Silva in her thesis [80]. It is available on INRIA's GForge website. A. Da Silva presented part of her work to a research working group at CNAM-Paris [81].

This tool was developed in Java and uses the JRI library to call R functions from the Java environment. R is a programming language and software environment for statistical computing. The ATWUEDA tool can read data from a cross table in a MySQL database, split the data according to the user's specifications (into logical or temporal windows) and then apply the approach proposed in Da Silva's thesis to detect changes in a dynamic environment. The proposed approach characterizes the changes undergone by the usage groups (e.g. appearance, disappearance, fusion and split) at each timestamp. Graphics are generated for each analysed window, showing statistics that characterize change points over time.
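The window-splitting step can be sketched as follows. This is a minimal illustration and not ATWUEDA's code: the row layout `(timestamp, session_id)`, the function names, and the interpretation of a logical window as a fixed number of rows and of a temporal window as a calendar month are all assumptions made for the example. The change detection applied to each window is not shown.

```python
from datetime import datetime
from itertools import groupby


def logical_windows(rows, size):
    """Split rows into consecutive windows of at most `size` rows each."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def temporal_windows(rows):
    """Split rows into one window per calendar month, in time order."""
    ordered = sorted(rows, key=lambda row: row[0])
    month_of = lambda row: (row[0].year, row[0].month)
    # groupby needs its input sorted so that each month forms one run.
    return [list(group) for _, group in groupby(ordered, key=month_of)]


# Hypothetical usage data: (timestamp, session identifier).
rows = [
    (datetime(2009, 1, 5), "s1"),
    (datetime(2009, 2, 10), "s3"),
    (datetime(2009, 1, 20), "s2"),
    (datetime(2009, 2, 11), "s4"),
    (datetime(2009, 3, 1), "s5"),
]
by_count = logical_windows(rows, 2)
by_month = temporal_windows(rows)
```

Each resulting window can then be analysed separately, and the clusters obtained in consecutive windows compared to identify appearance, disappearance, fusion or split of usage groups.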

