Section: New Results
Keywords : symbolic data analysis, unsupervised clustering, Self Organizing Map, complex data, neural networks, hierarchical clustering, hierarchies.
Data Mining Methods
Self Organizing Maps on dissimilarity matrices
Keywords : dissimilarity, self organizing maps, neural networks, clustering, visualization.
Participants : Aicha El Golli, Yves Lechevallier, Fabrice Rossi, Nicomedes Lopes Cavalcanti Junior.
The standard Self Organizing Map (SOM) is restricted to vector data from $\mathbb{R}^n$. In our previous work [64], [65], we proposed an adapted version of the SOM, called the DSOM (for Dissimilarity SOM), that can be applied to any data for which a dissimilarity can be defined.
In 2005, we improved the DSOM by defining a new algorithm together with an improved implementation. This implementation significantly reduces the execution time of the method without changing its results [29]. The new algorithm is based on a factorization technique applied to the computation of the criterion optimized by the method (the sum of weighted dissimilarities between a prototype candidate and the data to cluster). It is combined with an early stopping scheme and with memorization techniques that leverage the iterative nature of the method.
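To illustrate the factorization, here is a minimal sketch in Python (names and data layout are illustrative assumptions, not taken from the actual implementation [29]): since the neighborhood weight depends only on the cluster a datum is assigned to, the inner sums can be grouped by cluster, computed once per iteration, and reused for every unit and every candidate prototype.

```python
import numpy as np

def factorized_costs(D, clusters, H):
    """D[i, j]: dissimilarity between data i and j (n x n matrix);
    clusters[i]: index of the unit that datum i is assigned to;
    H[u, c]: neighborhood weight between units u and c.
    Returns cost[u, j], the criterion value of candidate prototype j
    for unit u, i.e. sum_i H[u, clusters[i]] * D[j, i]."""
    S = np.zeros((H.shape[0], D.shape[0]))
    for i, c in enumerate(clusters):
        S[c] += D[i]          # per-cluster sums, computed once per iteration
    return H @ S              # shared by all units and all candidates

# The new prototype of each unit is the datum minimizing its cost row:
# prototypes = factorized_costs(D, clusters, H).argmin(axis=1)
```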
We have also applied the DSOM to usage-based Web content clustering and visualization [42]; see Section 6.4.1 for details.
Functional Data Analysis
Keywords : functional data, neural networks, curves classification, support vector machines, machine learning.
Participant : Fabrice Rossi.
Functional Data Analysis is an extension of traditional data analysis to functional data. In this framework, each individual is described by one or several functions, rather than by a vector of $\mathbb{R}^n$. This approach makes it possible to take into account the regularity of the observed functions.
In 2005, we extended our neural-based approach for functional data to the case of Support Vector Machines (SVMs) applied to function classification:
-
in [44], we introduced functional kernels based on derivation operators and on B-spline smoothing. An application to spectrometric curve classification showed an improvement of these kernels over standard non-functional kernels;
-
in [46], we studied the theoretical properties of a projection-based functional kernel, in which functions are projected onto a truncated Hilbert basis in a pre-processing step. The coordinates on this basis are then handled by a standard SVM. We showed that this method, combined with a split-sample procedure for the choice of the truncation level, is consistent (i.e., it can reach the Bayes error rate asymptotically). We also illustrated the method on several real-world data sets (speech recognition problems). A code sketch of the projection step is given after this list.
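As a rough illustration of the projection approach (not the exact experimental setup of [46]): assuming regularly sampled curves and a truncated Fourier basis as the Hilbert basis, the pre-processing reduces each function to a small number of coordinates, which a standard SVM then handles. The truncation level would be selected by the split-sample procedure rather than fixed as here.

```python
import numpy as np
from sklearn.svm import SVC

def basis_coordinates(curves, d):
    """Project regularly sampled curves onto the first d Fourier basis
    functions and return real coordinates."""
    coeffs = np.fft.rfft(curves, axis=1)[:, :d]
    return np.hstack([coeffs.real, coeffs.imag])

# hypothetical data: n curves sampled at m points, with binary labels
rng = np.random.default_rng(0)
curves = rng.standard_normal((100, 256))
labels = rng.integers(0, 2, 100)

d = 10                                  # truncation level: in [46] it is
X = basis_coordinates(curves, d)        # chosen by a split-sample procedure
clf = SVC(kernel="rbf").fit(X, labels)  # standard SVM on the coordinates
```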
In 2005, our earlier work on functional multi-layer perceptrons was published in international journals [22], [23], [24].
Partitioning Method: Adaptive Distances on Interval Data
Keywords : unsupervised clustering, quantitative data, dynamic clustering algorithm.
Participants : F.A.T. de Carvalho, Yves Lechevallier, Renata Souza.
The main contribution of [19] is a new partitional dynamic clustering method for interval data, based on the use of an adaptive Hausdorff distance at each iteration. The idea of dynamical clustering with adaptive distances is to associate a distance with each cluster, defined according to its intra-class structure. The advantage of this approach is that the clustering algorithm recognizes clusters of different shapes and sizes. Here the adaptive distance is a weighted sum of Hausdorff distances. Explicit formulas for the optimal class prototype, as well as for the weights of the adaptive distances, are found. When used for dynamic clustering of interval data, these prototypes and weights ensure that the clustering criterion decreases at each iteration.
Let $\Omega = \{1, \dots, n\}$ be a set of $n$ objects indexed by $i$ and described by $p$ interval variables indexed by $j$. An interval variable $X$ [57] is a correspondence defined from $\Omega$ in $\Im$ such that for each $i \in \Omega$, $X(i) = [a, b] \in \Im$, where $\Im$ is the set of closed intervals defined from $\mathbb{R}$, i.e., $\Im = \{[a, b] : a, b \in \mathbb{R},\ a \le b\}$. Each object $i$ is represented as a vector of intervals $x_i = (x_{i1}, \dots, x_{ip})$, where $x_{ij} = [a_{ij}, b_{ij}]$.
An interval data table $\{x_{ij}\}_{n \times p}$, which is used by our clustering method, is made up of $n$ rows that represent the $n$ objects to be clustered and $p$ columns that represent the $p$ interval variables. Each cell of this table contains an interval $x_{ij} = [a_{ij}, b_{ij}]$. In our approach [19], a prototype $y_k$ of cluster $C_k \in P$ is also represented as a vector of intervals $y_k = (y_{k1}, \dots, y_{kp})$, where $y_{kj} = [\alpha_{kj}, \beta_{kj}]$.
It is now a matter of choosing an adaptive distance between vectors of intervals and properly defining the representation step of the dynamic algorithm with adaptive distances given in the previous section. In other words, we give explicit formulas for the prototype $y_k$ and for the vector of weights $\lambda_k = (\lambda_{k1}, \dots, \lambda_{kp})$ that minimize the adequacy criterion, where the adaptive distance between an object $x_i$ and a prototype $y_k$ is the weighted sum of Hausdorff distances $d(x_i, y_k) = \sum_{j=1}^{p} \lambda_{kj} \max(|a_{ij} - \alpha_{kj}|, |b_{ij} - \beta_{kj}|)$.
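The building blocks can be sketched as follows (a sketch only, not the code of [19]: the median-based prototype follows, for fixed weights, from the identity $\max(|a_1 - a_2|, |b_1 - b_2|) = |m_1 - m_2| + |r_1 - r_2|$, where $m$ and $r$ denote the midpoint and half-length of an interval; the weight formula shown is the classical adaptive-distance solution under a product-equal-to-one constraint, which we assume here).

```python
import numpy as np

def hausdorff(u, v):
    """Hausdorff distance between intervals u = [a1, b1] and v = [a2, b2]."""
    return max(abs(u[0] - v[0]), abs(u[1] - v[1]))

def adaptive_distance(x_i, y_k, lam_k):
    """Adaptive distance: weighted sum of per-variable Hausdorff distances."""
    return sum(l * hausdorff(x, y) for l, x, y in zip(lam_k, x_i, y_k))

def prototype(cluster):
    """Per-variable optimal interval: median midpoint and median half-length."""
    arr = np.asarray(cluster, dtype=float)     # shape (|C_k|, p, 2)
    mid = np.median((arr[..., 0] + arr[..., 1]) / 2, axis=0)
    rad = np.median((arr[..., 1] - arr[..., 0]) / 2, axis=0)
    return np.stack([mid - rad, mid + rad], axis=-1)

def weights(cluster, proto):
    """Adaptive weights under the product-equal-to-one constraint."""
    arr = np.asarray(cluster, dtype=float)
    d = np.maximum(np.abs(arr[..., 0] - proto[:, 0]),
                   np.abs(arr[..., 1] - proto[:, 1])).sum(axis=0)
    return d.prod() ** (1.0 / len(d)) / d
```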
Agglomerative 2-3 Hierarchical Clustering: study and visualization
Keywords : 2-3 AHC, clustering, aggregation index, hierarchies.
Participants : Sergiu Chelcea, Mihai Jurca, Brigitte Trousse.
This work was conducted in the context of the PhD of S. Chelcea.
We continued [28] this year our study of the Agglomerative 2-3 Hierarchical Clustering (2-3 AHC) [60], [59] as part of Sergiu Chelcea's PhD thesis. A study of different aggregation indexes and cluster indexing measures combined with the 2-3 AHC algorithm execution revealed a particular case of cluster merging, which can influence the resulting induced dissimilarity matrix. This case, which we call blind merging, occurs when two clusters are merged while one of them is not maximal. Based on our previous theoretical study [58], the next (intermediate) merging will then merge two clusters, possibly at a high indexing degree. This can be avoided by minimizing the final cluster's indexing degree when choosing the two clusters to merge.
A slightly modified version of the 2-3 AHC algorithm was proposed and implemented in order to avoid such situations. The interest of this new variant is that its resulting induced dissimilarity matrix is "better" than, or equal to, the classical ultrametric.
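The selection rule can be illustrated on a classical AHC skeleton (a rough sketch only: the actual 2-3 AHC algorithm of [60], [59] also handles overlapping clusters and merges involving non-maximal clusters, which are not reproduced here; `aggregation_index` stands for any aggregation index, e.g. complete linkage).

```python
def select_merge(clusters, aggregation_index):
    """Pick the pair of clusters whose merged cluster has the lowest
    indexing degree, so as to avoid the blind-merging situation."""
    best, best_index = None, float("inf")
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            idx = aggregation_index(clusters[i], clusters[j])
            if idx < best_index:       # minimize the final cluster's index
                best, best_index = (i, j), idx
    return best, best_index
```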
We experimentally validated this new 2-3 AHC algorithm variant on different artificial datasets and we also integrated it into our Hierarchies Visualization Toolbox (http://axis.inria.fr/ ).
This new 2-3 AHC algorithm variant was also applied and validated on other types of datasets: Web Usage Data [28], Sanskrit XML documents [55] (see also Section 6.2.2), and tourist itineraries [47].
Sequential Pattern Extraction in Data Streams
Keywords : sequential pattern, data stream, sequence alignment.
Participants : Alice Marascu, Florent Masséglia.
This work was conducted in the context of the master of A. Marascu.
In recent years, emerging applications have introduced new constraints for data mining methods. These constraints are particularly linked to new kinds of data that can be considered complex data. One typical kind of such data is known as data streams. When processing a data stream, memory usage is restricted, new elements are generated continuously and have to be processed as fast as possible, no blocking operator can be performed, and the data can be examined only once.
At this time and to the best of our knowledge, no method has been proposed for mining sequential patterns in data streams. We argue that the main reason is the combinatorial phenomenon related to sequential pattern mining. Indeed, while itemset mining relies on a finite set of possible results (the set of combinations of the items recorded in the data), this is not the case for sequential patterns, where the set of results is infinite. Due to the temporal aspect of sequential patterns, an item can be repeated without limitation (for instance ⟨a⟩, ⟨a a⟩, ⟨a a a⟩, etc. are all distinct candidate sequences), leading to an infinite number of potentially frequent sequences.
The SMDS (Sequence Mining in Data Streams) method, proposed in [37], [36], is designed to extract sequential patterns from a data stream. More precisely, our goal is to extract significant patterns that are representative of Web usage streaming data. To this end, SMDS performs as follows (a code sketch is given at the end of this section):
-
cutting the data stream into batches of fixed size (the following operations are then performed for each batch);
-
clustering the sequences of the batch;
-
for each cluster c, computing the alignment of the sequences embedded in c; the aligned sequence is then considered as a summary of c;
-
filtering the aligned sequences in order to keep 1) frequent items only and 2) only the aligned sequences obtained on clusters of size greater than 2;
-
maintaining a prefix tree structure that keeps the frequency history of each extracted sequence (the operations on this structure may be insertion, update or deletion).
All these steps have to be performed as fast as possible in order to meet the constraints of a data stream environment. Approximation has been recognized as a key feature for this kind of application, which explains our choice of an alignment method for extracting the summaries of clusters. The SMDS method is illustrated in figure 7. SMDS has been tested on both real and synthetic datasets. Experiments showed the efficiency of our approach and the relevance of the patterns extracted from the Web site of Inria Sophia Antipolis.
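A minimal skeleton of the batch pipeline might look as follows (a sketch only: `cluster_sequences` and `align` stand for the clustering and sequence alignment steps of [37], [36], which are not reproduced here, and a plain dictionary stands in for the prefix tree structure).

```python
from collections import defaultdict

BATCH_SIZE = 1000            # fixed batch size (hypothetical value)

def process_stream(stream, cluster_sequences, align, min_support):
    """Batch loop: cut, cluster, align, filter, then record each summary."""
    history = defaultdict(list)        # stand-in for the prefix tree
    batch, batch_id = [], 0
    for sequence in stream:            # the stream is examined only once
        batch.append(sequence)
        if len(batch) < BATCH_SIZE:    # step 1: batches of fixed size
            continue
        for c in cluster_sequences(batch):    # step 2: cluster the batch
            if len(c) <= 2:                   # step 4.2: size > 2 only
                continue
            aligned = align(c)                # step 3: list of (item, freq)
            summary = tuple(item for item, freq in aligned
                            if freq >= min_support)  # step 4.1: frequent items
            history[summary].append(batch_id)        # step 5: frequency history
        batch, batch_id = [], batch_id + 1
    return history
```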