Section: Software
Data Mining and Web Usage Mining
Clustering and Classification Toolbox
Participants: Marc Csernel, Yves Lechevallier [co-correspondent], Brigitte Trousse [co-correspondent].
We developed and maintained a collection of clustering and classification software, written in C++ and/or Java:
-
a Java library (Somlib) that provides efficient implementations of several SOM variants [78], [77], [100], [99], [104], especially those that can handle dissimilarity data (available on Inria's Gforge server at http://gforge.inria.fr/projects/somlib/ and developed by AxIS Rocquencourt and Brieuc Conan-Guez from Université de Metz).
-
a functional Multi-Layer Perceptron library, called FNET, that implements in C++ supervised classification of functional data [95] , [98] , [97] , [96] (developed by AxIS Rocquencourt).
-
two partitioning clustering methods for dissimilarity tables, resulting from a collaboration between the AxIS Rocquencourt team and Recife University, Brazil: CDis and CCClust [83]. Both are written in C++ and use the “Symbolic Object Language” (SOL) developed for SODAS.
-
two improved and standalone versions of SODAS modules, SCluster and DIVCLUS-T [74] (AxIS Rocquencourt).
-
a Java implementation of the 2-3 AHC (developed by AxIS Sophia Antipolis). The software is available as a Java applet that runs the hierarchy visualization toolbox HCT (Hierarchical Clustering Toolbox); see [75].
A Web interface, developed in C++ and running on our internal Apache Web server, is available for the following methods: SCluster, Div, CDis, CCClust.
Previous versions of the above software have been integrated into the SODAS 2 software [93], which resulted from the European project ASSO (Analysis System of Symbolic Official data, 2001-2004). SODAS 2 supports the analysis of multidimensional complex data (numerical and non-numerical) coming from databases, mainly in statistical offices and administrations, using Symbolic Data Analysis [69]. This software is registered at APP. The latest executable version of SODAS 2, with its user manual, can be downloaded at http://www.info.fundp.ac.be/asso/ . See the 2009 AxIS annual report for more details on the main contributions of AxIS to SODAS [79], [105], which have been registered at APP.
Clustering Methods for mining Sequential Patterns in Data Streams
Participants: Maurice Yared, Florent Masseglia, Brigitte Trousse [correspondent], Yves Lechevallier.
As a result of Marascu's thesis (2007-2009) [91], a collection of software tools has been developed for knowledge discovery and security in data streams (cf. our 2009 annual report for more details on WOD, the outlier detection method, and GEAR, an implementation of the history management strategy).
Three clustering methods for mining sequential patterns in data streams were developed in Java by A. Marascu during her thesis [91]. The programs take batches of data in the format "Client-Date-Item" and provide clusters of sequences and their centroids, each in the form of an approximate sequential pattern computed with an alignment technique.
-
SMDS compares the sequences with each other, with a complexity of O(n^2).
-
SCDS is an improvement of SMDS, in which the complexity is reduced from O(n^2) to O(n·m), with n the number of navigations and m the number of clusters.
-
ICDS is a modification of SCDS that keeps the cluster centroids from one batch to the next.
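The difference between comparing sequences pairwise and comparing them with centroids can be illustrated with a minimal Python sketch. It is not the code of SMDS, SCDS or ICDS: the similarity measure (`difflib.SequenceMatcher`), the threshold value, the function names and the use of a cluster's first member as its centroid are all assumptions made for illustration, whereas the original methods compute centroids with an alignment technique.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two item sequences (stand-in measure, an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_batch(sequences, centroids=None, threshold=0.6):
    """Assign each sequence to its most similar centroid, or open a new cluster.

    Each sequence is compared with the m current centroids only, so a batch
    of n sequences costs O(n*m) comparisons instead of the O(n^2) needed to
    compare all sequences with each other.
    """
    centroids = list(centroids or [])  # centroids carried over from a previous batch, if any
    clusters = [[] for _ in centroids]
    comparisons = 0
    for seq in sequences:
        best_index, best_score = None, 0.0
        for index, centroid in enumerate(centroids):
            comparisons += 1
            score = similarity(seq, centroid)
            if score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score >= threshold:
            clusters[best_index].append(seq)
        else:
            # Simplification: the first member of a new cluster stands in for its centroid.
            centroids.append(seq)
            clusters.append([seq])
    return centroids, clusters, comparisons
```

Passing the centroids returned for one batch as the `centroids` argument for the next batch mimics the idea of keeping centroids between batches; calling the function without them starts each batch from scratch.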
This year, the Java code of SMDS has been integrated into the MIDAS demonstrator [68] (cf. 8.2.2), and a C++ version [61] has been implemented for the CRE contract with Orange Labs, together with a visualisation module in Java (cf. 7.1). SMDS has been applied to data from the Orange mobile portal.
AWLH for Pre-processing Web Logs
Participants: Yves Lechevallier [co-correspondent], Brigitte Trousse [co-correspondent].
AWLH derives from the AxISlogMiner preprocessing software, which implements the multi-site log preprocessing methodology developed by D. Tanasa in his thesis [15] for Web Usage Mining (WUM). In the context of the Eiffel project (2008-2009), we isolated and redesigned the core of the AxISlogMiner preprocessing tool, which we called AWLH; it is composed of a set of tools for pre-processing Web log files. AWLH can extract and structure log files from several Web servers using different input formats. As usual, the Web log files are cleaned before being used by data mining methods, because they contain many noisy entries (for example, robots introduce a lot of noise into the analysis of user behaviour, so it is important to identify robot requests). The data are stored in a database whose model has been improved.
The current version of our Web log processing tool offers:
-
Processing of several log files from several servers;
-
Support of several input formats (CLF, ECLF, IIS, custom, ...);
-
Incremental pre-processing;
-
Java API to help integrate AWLH into external applications.
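The cleaning step can be illustrated with a minimal Python sketch that parses CLF/ECLF lines and drops robot requests. It is not the AWLH implementation: the regular expression, the keyword list `ROBOT_HINTS`, the function names and the two heuristics (a request for `/robots.txt`, or a user agent containing a robot keyword) are assumptions chosen for illustration.

```python
import re

# CLF fields, optionally followed by the two ECLF fields (referrer and user agent).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

ROBOT_HINTS = ("bot", "crawler", "spider")  # illustrative keyword list (an assumption)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if the line is malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_robot(entry):
    """Heuristic robot test: asks for /robots.txt or announces a robot-like user agent."""
    agent = (entry.get("agent") or "").lower()
    asked_robots_txt = "/robots.txt" in entry["request"]
    return asked_robots_txt or any(hint in agent for hint in ROBOT_HINTS)


def clean(lines):
    """Keep only well-formed entries that do not look like robot requests."""
    entries = (parse_line(line) for line in lines)
    return [entry for entry in entries if entry is not None and not is_robot(entry)]
```

A production tool would typically combine several such heuristics (known robot IP lists, browsing speed) rather than rely on the user agent alone.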
To record a user's click actions in real time, we developed in 2009 a tool based on an open source project called "OpenSymphony ClickStream" for capturing Web user actions. An extended version of AWLH has been developed for capturing and structuring data from annotated documents inside discussion forums.
Two Methods for Extracting Sequential Patterns with Low Support
Participants: Brigitte Trousse [correspondent], Florent Masseglia.
Two methods for extracting sequential patterns with low support were developed by D. Tanasa in his thesis [102], in collaboration with F. Masseglia and B. Trousse: Cluster & Divide [102] and Divide & Discover [13], [102].
See Chapter 3 of Tanasa's PhD document for more details on these two methods and on a framework for developing methods for extracting sequential patterns with low support.
ATWUEDA for Analysing Evolving Web Usage Data
Participant: Yves Lechevallier [correspondent].
ATWUEDA [82], a tool for Web Usage Evolving Data Analysis, was developed by A. Da Silva in her thesis [80]. It is available on Inria's Gforge server: http://gforge.inria.fr/projects/atwueda/ . A. Da Silva presented part of her work to a research working group at CNAM-Paris [81].
This tool was developed in Java and uses the JRI library to allow R functions to be applied in the Java environment. R is a programming language and software environment for statistical computing (http://www.r-project.org/). ATWUEDA can read data from a cross table in a MySQL database, split the data according to the user's specifications (into logical or temporal windows), and then apply the approach proposed in Da Silva's thesis to detect changes in a dynamic environment. The proposed approach characterizes the changes undergone by the usage groups (e.g. appearance, disappearance, fusion and split) at each timestamp. Graphics are generated for each analysed window, showing statistics that characterize change points over time.
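The window-splitting step can be illustrated with a minimal Python sketch. It is not ATWUEDA code: the function names are hypothetical, and the sketch assumes that a logical window holds a fixed number of records while a temporal window covers a fixed duration, which may differ from the definitions used in the tool.

```python
from datetime import datetime, timedelta


def logical_windows(records, size):
    """Split records into consecutive windows of `size` records (the last may be shorter)."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def temporal_windows(records, width):
    """Group (timestamp, payload) records into consecutive windows of duration `width`.

    Windows are aligned on the earliest timestamp; windows that would be
    empty are simply omitted from the result.
    """
    if not records:
        return []
    ordered = sorted(records, key=lambda record: record[0])
    start = ordered[0][0]
    windows = {}
    for timestamp, payload in ordered:
        index = (timestamp - start) // width  # timedelta // timedelta gives an int
        windows.setdefault(index, []).append((timestamp, payload))
    return [windows[index] for index in sorted(windows)]
```

For example, `temporal_windows(records, timedelta(days=7))` would produce weekly windows, each of which could then be clustered and compared with the previous one to look for changes in the usage groups.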