Section: New Results
Keywords : symbolic data analysis, unsupervised clustering, Self Organizing Map, complex data, neural networks, hierarchical clustering, hierarchies.
Data Mining Methods
Partitioning Methods on Interval Data
Keywords : unsupervised clustering, Quantitative Data, dynamic clustering algorithm.
Participants : Marc Csernel, F.A.T. de Carvalho, Yves Lechevallier, Rosanna Verde, Renata Souza.
In a special issue on interval data edited by F. Palumbo  we present a survey of partitioning methods for interval data. These methods use different homogeneity criteria as well as different kinds of cluster representations (prototypes). For the first two methods we introduce some tools to interpret the final partitions. Finally, the methods are compared and corroborated on a real data set.
This year we continued our research on adaptive distances. An article was published  . The main contribution is a new partitional dynamic clustering method for interval data based on the use of an adaptive or a global Hausdorff distance at each iteration. The idea of dynamic clustering with adaptive distances is to associate with each cluster a distance that is defined according to the cluster's intra-class structure. The advantage of this approach is that the clustering algorithm recognizes clusters of different shapes and sizes. Here the adaptive distance is a weighted sum of Hausdorff distances. Explicit formulas are found for the optimum class prototype as well as for the weights of the adaptive distances. When used for dynamic clustering of interval data, these prototypes and weights ensure that the clustering criterion decreases at each iteration.
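As an illustration only, the weighted sum of Hausdorff distances mentioned above can be sketched as follows. The sketch assumes that each individual and each prototype is a list of intervals (one `(lower, upper)` pair per variable) and that the per-variable weights of a cluster are already known; the function names are hypothetical, and the explicit formulas for the optimal prototypes and weights from the published article are not reproduced here.

```python
def hausdorff_interval(u, v):
    """Hausdorff distance between two closed intervals u=(a, b) and v=(c, d).

    For intervals of the real line it reduces to max(|a - c|, |b - d|).
    """
    return max(abs(u[0] - v[0]), abs(u[1] - v[1]))


def adaptive_distance(x, prototype, weights):
    """Weighted sum of per-variable Hausdorff distances.

    x, prototype: lists of (lower, upper) intervals, one per variable.
    weights: one non-negative weight per variable, specific to the cluster
             (assumed to be given; how they are optimized is not shown).
    """
    if not (len(x) == len(prototype) == len(weights)):
        raise ValueError("x, prototype and weights must have the same length")
    return sum(
        w * hausdorff_interval(xj, gj)
        for xj, gj, w in zip(x, prototype, weights)
    )
```

With a distinct weight vector per cluster, the same individual can be close to one prototype and far from another even when the unweighted distances are equal, which is what lets the algorithm adapt to clusters of different shapes and sizes.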
Self Organizing Maps on Dissimilarity Matrices
Keywords : dissimilarity, self organizing maps, neural networks, clustering, visualization.
Participants : Fabrice Rossi, Nicolas Lopes, Yves Lechevallier.
In 2006, we continued our previous work on the adaptation of the Self Organizing Map (SOM) to dissimilarity data (the DSOM). In particular, we improved the quality of our software implementation, which is available on INRIA's GForge (cf. section 5.5 ).
Our previous work on the DSOM and its applications, carried out in collaboration with former members of AxIS (A. El Golli and B. Conan-Guez), has been published in various journals and conferences  ,  ,  .
Functional Data Analysis
Keywords : functional data, neural networks, curves classification, support vector machines, machine learning.
Participant : Fabrice Rossi.
Functional Data Analysis is an extension of traditional data analysis to functional data. In this framework, each individual is described by one or several functions, rather than by a vector of R^n. This approach makes it possible to take into account the regularity of the observed functions.
In 2006, we continued our work on Support Vector Machines (SVMs) for functional data analysis. In particular, we showed how some specific spline-based kernels can be used to define SVMs on the derivatives of the input functions, without explicitly calculating those derivatives. We showed that the SVMs defined this way are consistent (i.e., they can reach the Bayes error rate asymptotically)  ,  .
We have also started to combine our work on feature selection (see section 6.2.1 ) with our work on functional data analysis in collaboration with the DICE laboratory (Belgium, Louvain). One of the limitations of the method proposed in  is its computational cost, related to the high number of original spectral variables. We have investigated in  and  how a B-spline representation of the spectra can be used to reduce the number of features prior to the application of the feature selection method studied in  . We use the locality of the B-spline representation to preserve interpretation possibilities. The new features, i.e. the coordinates of the spectra on a B-spline basis, are obtained from limited ranges of the original spectral band: a spectral interval can be associated with each selected feature and used for interpretation purposes.
In addition, we have started to work on the application of FDA to time series prediction in  . We applied a general idea from  in which a time series is split into sub-series. Each sub-series is considered as a function, which leads to a function-valued time series. The resulting series is predicted via an autoregressive model. In our approach, Radial Basis Function networks are used to represent the functions and a functional Least Squares Support Vector Machine is used to implement the autoregressive model.
In 2006, our earlier work on functional neural methods, carried out in collaboration with N. Villa from the GRIMM-SMASH team (Université Toulouse Le Mirail), was published in international journals  ,  .
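The splitting step described above can be sketched as follows. This is a minimal illustration under stated assumptions: sub-series are consecutive, non-overlapping and of equal length, a trailing incomplete sub-series is dropped, and the function names are hypothetical. The functional representation (Radial Basis Function networks) and the functional Least Squares SVM are not shown; the sketch only builds the (past sub-series, next sub-series) pairs on which an autoregressive model would be trained.

```python
def split_into_subseries(series, length):
    """Cut a scalar time series into consecutive sub-series of equal length.

    Each sub-series plays the role of one sampled function. A trailing
    incomplete sub-series is dropped (an assumption of this sketch).
    """
    if length <= 0:
        raise ValueError("length must be positive")
    n_full = len(series) // length
    return [series[i * length:(i + 1) * length] for i in range(n_full)]


def autoregressive_pairs(subseries, order=1):
    """Build training pairs for an autoregressive model on sub-series.

    Each pair is (list of the `order` previous sub-series, next sub-series).
    """
    if order <= 0:
        raise ValueError("order must be positive")
    return [
        (subseries[t - order:t], subseries[t])
        for t in range(order, len(subseries))
    ]
```

For example, a series of ten values split with `length=3` yields three sub-series and, with `order=1`, two training pairs.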
Keywords : data visualization, graph visualization, non linear projection, machine learning, metric studies.
Participant : Fabrice Rossi.
In 2006, we conducted two surveys on information visualization  ,  . The first one  outlines the important relationships between machine learning and information visualization, while the second survey  is dedicated to the usage of visualization methods for metric studies (such as bibliometrics, for instance). Metric studies provide challenging non vector data, generally large graphs with different types of nodes and links. While we have not applied our work on self organizing map for dissimilarity data  to metric studies, this is a promising research topic.
Sequential Pattern Extraction in Data Streams
Keywords : sequential pattern, data stream.
Participants : Alice Marascu, Florent Masseglia, Yves Lechevallier.
This work was conducted in the context of A. Marascu's Ph.D study.
In recent years, emerging applications have introduced new constraints for data mining methods. These constraints are mostly related to new kinds of data that can be considered as complex data. One typical example of such data is data streams. In data stream processing, memory usage is restricted, new elements are generated continuously and have to be considered as fast as possible, no blocking operator can be performed and the data can be examined only once. In 2005 ( , ) we proposed a method called SMDS (Sequence Mining in Data Streams) for extracting sequential patterns from data streams. This year, our main goal was to improve both the execution time and the quality of the results. To this end, we proposed the SCDS (Sequence Clustering in Data Streams) method  ,  ,  . In summary, we cut the data stream into batches of the same size and process the batches one by one. For each batch, at the very beginning, we place the first sequence s1 in a cluster c1 and set the centroid of c1 to s1 . Then, for each other sequence si of the batch, we perform the following steps:
Compare si with all clusters' centroids;
Find the nearest cluster cj ;
Add si to cj ;
Update the centroid of cluster cj .
The general idea of this method is illustrated in Figure 6 .
We needed to define: 1) a method for computing the centroid of a cluster; 2) a similarity measure between a sequence and a centroid; 3) an update method, applied after a new sequence is added to a cluster.
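The per-batch loop described above can be sketched as follows. This is only an illustration, not the SCDS implementation: the similarity measure (length of the longest common subsequence divided by the longer length), the centroid update (longest common subsequence of the old centroid and the new sequence, a crude stand-in for the alignment-based centroid), and the `min_similarity` threshold used to open a new cluster are all assumptions introduced for this sketch and do not come from the report.

```python
def lcs(a, b):
    """Return one longest common subsequence of two item sequences."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Backtrack through the table to recover the subsequence itself.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def similarity(seq, centroid):
    """Hypothetical similarity in [0, 1] between a sequence and a centroid."""
    if not seq or not centroid:
        return 0.0
    return len(lcs(seq, centroid)) / max(len(seq), len(centroid))


def cluster_batch(batch, min_similarity=0.5):
    """Cluster one batch of sequences in a single pass.

    The first sequence opens the first cluster and becomes its centroid.
    Each later sequence is compared with all centroids and joins the
    nearest cluster, whose centroid is then updated. Opening a new
    cluster below `min_similarity` is an assumption of this sketch.
    """
    clusters = []
    for seq in batch:
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = similarity(seq, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < min_similarity:
            clusters.append({"centroid": list(seq), "members": [seq]})
        else:
            best["members"].append(seq)
            best["centroid"] = lcs(best["centroid"], seq)
    return clusters
```

Each sequence is read once and only the centroids are compared against it, which matches the single-pass constraint of data stream processing described above.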
The centroid of a cluster is computed with an alignment method. Let us consider the following cluster:
As illustrated in Figure 5 , the centroid of a cluster is the result of an alignment method applied to the sequences contained in that cluster. Because this alignment process is applied often, we optimized the method with an incremental alignment based on sorting the sequences in real time. Indeed, the quality of the alignment depends on the order of the sequences, so this order has to be maintained in real time.
All these steps have to be performed as fast as possible in order to meet the constraints of a data stream environment. Approximation has been recognized as a key feature for this kind of application, which explains our choice of an alignment method for extracting the summaries of clusters. The dynamic nature of data streams imposes an execution time constraint but, at the same time, we must ensure a good quality of results. To this end, we performed some quality tests.
SCDS has been tested on both real and synthetic datasets. Experiments showed the efficiency of our approach and the relevance of the patterns extracted on the Web site of Inria Sophia Antipolis.
Agglomerative 2-3 Hierarchical Classification (2-3 AHC)
Keywords : evaluation, AHC, 2-3 AHC, stress.
Participants : Sergiu Chelcea, Brigitte Trousse, Yves Lechevallier.
This work has been done in the context of S. Chelcea's thesis. In past years, we proposed a new general Agglomerative 2-3 Hierarchical Classification algorithm in collaboration with P. Bertrand (ENST B), together with four 2-3 AHC algorithm variants that can create different 2-3 hierarchies on the same dataset depending on whether blind merging is used. A previous theoretical complexity analysis of our 2-3 AHC algorithm proved that the complexity was reduced from O(n^3) in the initial 2-3 AHC algorithm to O(n^2 log(n)) for our algorithm.
This year, in collaboration with J. Lemaire (IUT Menton University of Sophia Antipolis), we pursued our tests on the obtained 2-3 hierarchies and the classical hierarchy on different datasets (Ruspini, urban itineraries, simulated data, Abalone), assessing execution times and structure quality. The observed execution times confirmed our theoretical complexity of O(n^2 log(n)). To assess the quality of the created structures, we chose the Stress coefficient, which compares the initial dissimilarity matrix with the induced one. Using the complete link, we obtained an average gain of 23% (for the Stress), while the maximum gain was around 84% on the Abalone dataset.
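The report does not give the exact definition of the Stress coefficient or of the gain. As a hedged illustration, the sketch below assumes one common normalized form, the square root of the sum of squared differences between initial and induced dissimilarities divided by the sum of squared initial dissimilarities, and a relative gain with respect to the classical AHC. Both formulas and both function names are assumptions of this sketch.

```python
import math


def stress(initial, induced):
    """Normalized Stress between two symmetric dissimilarity matrices.

    initial: dissimilarities computed on the data.
    induced: dissimilarities induced by the hierarchy.
    Assumed form: sqrt(sum (d_ij - u_ij)^2 / sum d_ij^2) over pairs i < j.
    """
    n = len(initial)
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            numerator += (initial[i][j] - induced[i][j]) ** 2
            denominator += initial[i][j] ** 2
    if denominator == 0.0:
        raise ValueError("initial dissimilarities are all zero")
    return math.sqrt(numerator / denominator)


def stress_gain(stress_classical, stress_2_3):
    """Relative Stress gain of a 2-3 hierarchy over the classical one.

    Positive when the 2-3 hierarchy fits the initial data better
    (assumed definition of the gain).
    """
    return (stress_classical - stress_2_3) / stress_classical
```

Under this assumed definition, a gain of 0.23 would mean that the Stress of the 2-3 hierarchy is 23% lower than that of the classical hierarchy.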
Moreover, we finalised our study of the applicability of our 2-3 AHC method in two fields: Web Mining and XML Document Mining. For Web Mining, we found that the 2-3 AHC produced interesting results, richer than those of the classical AHC and better than the AxIS results obtained with another method  . For XML Document Mining, we applied our 2-3 AHC algorithm to the INRIA activity reports. One objective was to compare different 2-3 AHC algorithms using the classical AHC as a reference: we found that the best results are obtained with the 2-3 AHC algorithm that avoids blind merging (V3), which was the only one to always have a positive Stress gain compared to the classical AHC  . For applicative results, see section 6.5.1 .