Section: New Results
Keywords : symbolic data analysis, unsupervised clustering, Self Organizing Map, complex data, neural networks, hierarchical clustering, hierarchies.
Data Mining Methods
Adaptive Distances in Clustering Methods
Keywords : unsupervised clustering, distances table, dynamic clustering algorithm.
Participants : Marc Csernel, F.A.T. de Carvalho, Yves Lechevallier.
The adaptive dynamic clustering algorithm optimizes a criterion based on a fitting measure between clusters and their prototypes, but the distances used to compare clusters and their prototypes change at each iteration. These distances are not fixed in advance and may differ from one cluster to another. The advantage of these adaptive distances is that the clustering algorithm can recognize clusters of different shapes and sizes. The main difference between the adaptive and non-adaptive algorithms lies in the representation step, which has two stages in the adaptive case. In the first stage, the partition and the distances are fixed and the prototypes are updated; in the second, the partition and the corresponding prototypes are fixed and the distances are updated.
The idea of dynamic clustering with adaptive distances is to associate with each cluster a distance that is defined according to its intra-class structure. We proposed an approach that easily generalizes the dynamic cluster method to both adaptive and non-adaptive non-Euclidean distances.
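The two-stage scheme above can be sketched as follows, using one weighted Euclidean distance per cluster with a Govaert-style weight update (product of weights normalised to 1). This is an illustrative sketch under our own choices of criterion and helper names, not the authors' implementation:

```python
import math

def adaptive_dynamic_clustering(data, k, n_iter=20):
    """Toy dynamic clustering with one adaptive (weighted Euclidean)
    distance per cluster. Illustration only, not the paper's algorithm."""
    p = len(data[0])
    prototypes = [list(data[i]) for i in range(k)]  # simple deterministic init
    weights = [[1.0] * p for _ in range(k)]         # per-cluster distance weights

    def dist(x, j):
        g = prototypes[j]
        return sum(weights[j][h] * (x[h] - g[h]) ** 2 for h in range(p))

    assign = [min(range(k), key=lambda j: dist(x, j)) for x in data]
    for _ in range(n_iter):
        # representation step, stage 1: partition and distances fixed,
        # prototypes updated (here: cluster means)
        for j in range(k):
            members = [x for x, a in zip(data, assign) if a == j]
            if members:
                prototypes[j] = [sum(col) / len(col) for col in zip(*members)]
        # stage 2: partition and prototypes fixed, distances updated;
        # weights normalised so their product equals 1
        for j in range(k):
            members = [x for x, a in zip(data, assign) if a == j]
            if not members:
                continue
            disp = [sum((x[h] - prototypes[j][h]) ** 2 for x in members) + 1e-12
                    for h in range(p)]
            geo = math.exp(sum(math.log(d) for d in disp) / p)  # geometric mean
            weights[j] = [geo / d for d in disp]
        # allocation step: reassign each point to its closest prototype
        assign = [min(range(k), key=lambda j: dist(x, j)) for x in data]
    return assign, prototypes, weights
```

Because each cluster carries its own weights, an elongated cluster can down-weight the dimension along which it is stretched, which is what allows clusters of different shapes to be recognized.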
Self Organizing Maps on Dissimilarity Matrices
Keywords : dissimilarity, self organizing maps, neural networks, clustering, visualization.
Participant : Fabrice Rossi.
In 2007, we continued our previous work on the adaptation of the Self Organizing Map (SOM) to dissimilarity data. For the SOM based on a generalized median, we improved our software implementation available on INRIA's GForge (cf. section 5.4 ) in two ways. First, we studied the effect of prototype collisions, i.e., situations where two neurons of the SOM share the same prototype. Such collisions generally lead to maps of bad quality (in terms of topology preservation) and should be avoided; we proposed several strategies to prevent them. Second, we used the branch and bound principle to speed up the median SOM. This solution reduces by a factor of up to 2.5 the time needed to obtain the results compared to our previous work. As before, the results of the new algorithm are strictly identical to those obtained by the standard naive implementation.
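The cost structure that makes pruning possible can be illustrated on the generalized median itself: the prototype of a cluster is the observation minimising the sum of dissimilarities to the cluster members, and a running sum can be abandoned as soon as it exceeds the best total found so far. This cut-off is a simplified stand-in for the branch-and-bound scheme of the paper, not a reproduction of it:

```python
def generalized_median(diss, members, candidates=None):
    """Generalized median of a cluster, given a dissimilarity matrix:
    the candidate observation minimising the sum of dissimilarities to
    the cluster members. Early abandon illustrates the pruning idea."""
    if candidates is None:
        candidates = range(len(diss))
    best, best_cost = None, float("inf")
    for c in candidates:
        cost = 0.0
        for m in members:
            cost += diss[c][m]
            if cost >= best_cost:   # partial sum already too large: prune
                break
        else:                        # loop completed: new best candidate
            best, best_cost = c, cost
    return best, best_cost
```

The pruning never changes the result, only the amount of work, which matches the observation in the text that the accelerated algorithm returns strictly identical results.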
We have also started to investigate a quite different approach to SOM analysis of dissimilarity data. This approach, called the “relational approach”, is based on pioneering works by Hathaway, Davenport and Bezdek. The main idea is to extend a dissimilarity so as to compute dissimilarities between virtual linear combinations of the original observations and the observations themselves. We proposed to apply this approach to topographic processing (i.e., to the SOM and to Neural Gas). While this approach does not suffer from the prototype collision problem mentioned above for the median based SOM, the obtained algorithms are quite slow. We have therefore started to optimize them, especially by constraining the virtual linear combinations to have only a limited number of non-zero terms.
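For a matrix D of squared Euclidean-type dissimilarities, the relational trick of Hathaway and Bezdek computes the dissimilarity between observation j and a virtual prototype written as a convex combination sum_i alpha_i x_i using only D, via (D alpha)_j - 0.5 * alpha^T D alpha. The sketch below is a generic illustration of that formula (the function name and plain-Python style are ours):

```python
def relational_distance(D, alpha, j):
    """Squared dissimilarity between observation j and the virtual
    prototype sum_i alpha_i x_i, computed from the dissimilarity
    matrix D alone: (D alpha)_j - 0.5 * alpha^T D alpha."""
    n = len(D)
    d_alpha_j = sum(D[j][i] * alpha[i] for i in range(n))
    quad = sum(alpha[i] * D[i][k] * alpha[k]
               for i in range(n) for k in range(n))
    return d_alpha_j - 0.5 * quad
```

A useful sanity check: when alpha puts all its mass on one observation, the formula reduces to the original dissimilarity. The quadratic term also shows why these algorithms are slow (it costs O(n^2) per prototype) and why constraining alpha to few non-zero terms helps.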
Finally, we have started to work on a kernel version of the SOM, which happens to be very close to the relational version described in the previous paragraph. In particular, our paper comparing the median dissimilarity SOM with a batch version of the kernel SOM for graph analysis received the best paper award at the WSOM 2007 conference.
Functional Data Analysis
Keywords : functional data, neural networks, curves classification, support vector machines, machine learning.
Participant : Fabrice Rossi.
Functional Data Analysis is an extension of traditional data analysis to functional data. In this framework, each individual is described by one or several functions rather than by a vector of R^n. This approach makes it possible to take into account the regularity of the observed functions.
In 2007, we continued our work on combining functional methods with feature selection methods. Details can be found in section 6.2.2 . In summary, our main idea is to reduce the number of features submitted to a feature selection method by leveraging the functional nature of the data, either via a spline representation or with a “functional aware” variable clustering method.
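The reduction idea can be sketched with a deliberately crude functional representation: replacing a densely sampled curve by per-segment averages (a piecewise-constant stand-in for the spline representation mentioned above) before any feature selection runs on the reduced vector. This is an assumption-laden toy, not the method of the paper:

```python
def segment_features(curve, n_segments):
    """Reduce a sampled curve to n_segments per-segment averages,
    exploiting the fact that neighbouring samples of a regular
    function are redundant. Toy stand-in for a spline projection."""
    n = len(curve)
    bounds = [round(i * n / n_segments) for i in range(n_segments + 1)]
    return [sum(curve[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]
```

A feature selection method then searches over n_segments coordinates instead of n sampling points, which is the point of leveraging the functional nature of the data.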
Data Visualization
Keywords : data visualization, graph visualization, non linear projection, machine learning, metric studies.
Participant : Fabrice Rossi.
Our work on the Self Organizing Map for dissimilarity data (see Section 6.3.2 ) is now mature enough to enable visualization of such data. In particular, we have studied hyperbolic SOM visualization of macroarray data and of proteins, based on non-Euclidean metrics.
Sequential Pattern Extraction in Data Streams: Incremental Approach
Keywords : sequential patterns, data streams.
Participants : Alice Marascu, Florent Masseglia, Yves Lechevallier.
This work was conducted in the context of A. Marascu's Ph.D. study.
In recent years, emerging applications have introduced new constraints for data mining methods. These constraints are mostly related to new kinds of data that can be considered as complex data. A typical example is data streams. In data stream processing, memory usage is restricted, new elements are generated continuously and have to be processed as fast as possible, no blocking operator can be performed, and the data can be examined only once. In 2006 we proposed a method called SMDS (Sequence Mining in Data Streams) for extracting sequential patterns from data streams. This year, our main goal was to improve the quality of the results. Most data stream mining methods (including SMDS) are not able to manage the history of the knowledge in terms of content (evolution of the content of patterns). To this end, we have proposed the ICDS (Incremental Clustering in Data Streams) method. ICDS is based on the algorithmic schema of SMDS and improves it by managing the evolution of the content of the patterns. In summary, we cut the data stream into batches of equal size and process the batches one by one. For the first batch, at the very beginning, we add the first sequence s1 to a cluster c1 and set the centroid of c1 to s1. Then, for each subsequent sequence si of the batch, we perform the following steps:
Compare si with all clusters' centroids;
Find the nearest cluster cj ;
Add si to cj ;
Update the centroid of cluster cj .
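The steps above can be sketched on sequences represented as strings, with a longest-common-subsequence dissimilarity and the cluster medoid standing in for the centroid. The distance, the new-cluster threshold and the medoid update are our illustrative assumptions, not the exact choices of ICDS:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences."""
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            m[i + 1][j + 1] = m[i][j] + 1 if x == y else max(m[i][j + 1], m[i + 1][j])
    return m[-1][-1]

def seq_dist(a, b):
    """Dissimilarity in [0, 1]: 1 minus the normalised LCS length."""
    return 1.0 - lcs_len(a, b) / max(len(a), len(b))

def process_batch(batch, centroids, threshold=0.5):
    """One pass over a batch in the spirit of ICDS: each sequence joins
    the cluster with the nearest centroid (or opens a new cluster), then
    centroids are refreshed as cluster medoids. Centroids surviving from
    the previous batch act as the 'guide' mentioned in the text."""
    clusters = {j: [] for j in range(len(centroids))}
    for s in batch:
        if centroids:
            j = min(range(len(centroids)), key=lambda i: seq_dist(s, centroids[i]))
        if not centroids or seq_dist(s, centroids[j]) > threshold:
            centroids.append(s)            # no close enough cluster: open one
            j = len(centroids) - 1
            clusters[j] = []
        clusters[j].append(s)
    for j, members in clusters.items():    # refresh centroids (medoid stand-in)
        if members:
            centroids[j] = min(members, key=lambda c: sum(seq_dist(c, x) for x in members))
    return centroids, clusters
```

Calling process_batch again on the next batch with the returned centroids is what carries the clustering history from one batch to the next.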
The main difference with SMDS is that we do not start from scratch from one batch to another. At the end of the processing of the first batch, we keep the centroid of each cluster and use these centroids when processing the next batch. The steps above are then iterated again, but the clusters from the previous batch serve as a guide for the next batch processing.
ICDS has been tested on both real and synthetic datasets. Experiments showed the efficiency of our approach and the relevance of the patterns extracted from the Web site of Inria Sophia Antipolis. This work is a first step towards a better management of the knowledge extracted from a data stream: it allows managing the history of the content of the clusters and their evolution in time (which was not possible with SMDS). Our goal is now to propose a history management dedicated to the frequency of the extracted patterns, with optimal sensitivity to the strength of variations and awareness of the available resources.
Extracting Temporal Gradual Rules from Sequential Data
Keywords : gradual rules, temporal data, outlier detection.
Participants : Céline Fiot, Florent Masseglia.
This work aims at characterizing atypical behaviours by means of gradual data mining techniques. Our main objective is to take into account the temporal information contained in the sequential nature of the data while mining for knowledge.
First, we studied existing works on the discovery of atypical behaviours, as well as on anomaly or intrusion detection using data mining approaches or gradual data mining methods, aiming at an exhaustive and comprehensive survey. From this state of the art, we note that most approaches intended to detect atypical behaviours are based on outlier discovery through a prior clustering of the data; the clustering results are then used to assess the atypicity of data compared to the whole dataset.
Second, this survey shows that (1) only a few methods use the temporal information that may exist within the data (Rensselaer Polytechnic Institute, New York, USA; University of Pennsylvania, USA), (2) when atypical behaviours are observed, they are only partially explained (University of Alberta, Canada; University of Tübingen, Germany; Mississippi State University), and (3) the explanations are often not very intelligible.
Therefore we are working on outlier detection based on sequential data clustering and on comprehensive description of atypicity using temporal gradual rules.
With this work, our goal is now to provide a definition of temporal gradual rules, i.e., a new kind of gradual rules that include the temporal aspect of sequential data, for characterizing the content of the clusters (at this time, there is no proposition in the literature for handling time in gradual rules). Such a temporal gradual rule may be, for instance, “the higher the number of messages having the same subject at time t, the higher the risk of a mail server crash at time t+x”.
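To make the intended semantics concrete, one plausible support measure for such a rule counts, among the time pairs where the first series increases, the fraction for which the lagged second series increases as well. Since the definition is still open, this is only an illustrative reading, not the definition we will propose:

```python
def temporal_gradual_support(a, b, lag):
    """Toy support for the rule 'the higher a at time t, the higher
    b at time t+lag': among pairs (t1, t2) with a[t1] < a[t2], the
    fraction for which b[t1+lag] < b[t2+lag]. Illustrative only."""
    n = min(len(a), len(b) - lag)
    pairs = [(t1, t2) for t1 in range(n) for t2 in range(n) if a[t1] < a[t2]]
    if not pairs:
        return 0.0
    ok = sum(1 for t1, t2 in pairs if b[t1 + lag] < b[t2 + lag])
    return ok / len(pairs)
```

With a = message counts per time step and b = a server-load indicator, a support near 1 would back the example rule on messages and mail server crashes.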
Mining Solid Itemsets
Keywords : itemsets, temporal itemsets, optimal window size.
Participants : Bashar Saleh, Florent Masseglia.
Association rule mining algorithms aim to extract, from a very large set of records, frequent correlations between the items of the database. However, for many real world applications, this definition of frequent itemsets is not well adapted: potentially interesting itemsets may remain undiscovered despite their very specific characteristics. In fact, interesting itemsets are often related to the moment during which they can be observed. Consider, for instance, the behavior of the users of an on-line store's Web site after a special discount on recordable DVDs and CDs advertised on TV.
We propose to find itemsets that are frequent over a contiguous subset of the database. For instance, navigations on the Web page of recordable CDs and DVDs occur randomly all year, and the correlation between both items is not frequent if we consider the whole year. However, the frequency of this behavior will certainly be higher within the few hours (or days) that follow the TV spot. The challenge is therefore to find the time window that optimizes the support of this behavior. In other words, we want to find B, a contiguous subset of D, such that the support of the behavior on B is above the minimum support and the size of B is optimal. We introduced the definition of solid itemsets, which represent a coherent and compact behavior over a specific period, and we proposed Sim, an algorithm for their extraction.
Sim introduces a new paradigm for the counting step of the generated candidates and extends the Generating-Pruning principle of Apriori in order to generate candidate solid itemsets and count their support. The generating step is provided with a filter on the possible intersection of the candidates (i.e., if two solid itemsets of size k have a common prefix but do not share a common period, they are not considered for generating a new candidate).
However, the counting step (the “pruning” step in Apriori) is not straightforward in our case; our approach is to build “kernels” of the candidate temporal itemsets over their period of possible frequency, and then to merge these kernels in order to find the corresponding solid itemsets.
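The window-optimisation problem itself can be stated with a brute-force sketch: among all contiguous runs of time-ordered transactions, find the largest window in which the itemset's support ratio stays above the minimum support. This naive O(n^2) scan only illustrates the problem Sim solves; Sim itself relies on kernels and the extended Generating-Pruning scheme rather than exhaustive search:

```python
def best_solid_window(transactions, itemset, minsup):
    """Largest contiguous window of transactions in which `itemset`
    has support ratio >= minsup; returns (support, start, end) or
    None. Brute-force illustration of the solid itemset problem."""
    items = set(itemset)
    hits = [items <= set(t) for t in transactions]  # itemset present?
    best = None
    n = len(hits)
    for start in range(n):
        count = 0
        for end in range(start, n):
            count += hits[end]
            size = end - start + 1
            if count / size >= minsup and (best is None or size > best[2] - best[1] + 1):
                best = (count / size, start, end)
    return best
```

On a year of web logs, such a window would capture, e.g., the days following the TV spot during which the CD/DVD correlation is actually frequent.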
Our experiments showed that Sim is able to extract solid itemsets from very large datasets and provides useful and readable results, such as behaviors corresponding to annual events, navigations on conference Web sites, or downloads of a software package after the announcement of a release. This work has been accepted at EGC 2008.