# Project : axis

## Section: New Results

Keywords : symbolic data analysis, unsupervised clustering, Self Organizing Map, complex data, neural networks, hierarchical clustering, hierarchies.

### Data Mining Methods

#### Symbolic Data Extraction and Self-Organizing Maps

Keywords : symbolic data analysis, relational database, unsupervised clustering, Self Organizing Map, dissimilarity.

Participants : Aicha El Golli, Yves Lechevallier, Brieuc Conan-Guez, Fabrice Rossi.

The aim of symbolic data analysis is to provide a better representation of the variations and imprecision contained in real data. As such data express a higher level of knowledge, the representation must offer a richer formalism than the one provided by classical data analysis.

A generalization process exists that allows data to be synthesized and represented by means of an assertion formalism that was defined in symbolic data analysis. This generalization process is supervised and often sensitive to virtual and atypical individuals. When the data to be generalized is heterogeneous, some assertions include virtual individuals. Faced with this new formalism and the resulting semantic extension that symbolic data analysis offers, a new approach to processing and interpreting data is required.

The original contributions of our work concern new approaches to representing and clustering complex data.

First, we propose a decomposition step, based on a divisive clustering algorithm, that improves the generalization process while offering the symbolic formalism [37]. We also propose an unsupervised generalization process based on the self-organizing map. The advantage of this method is that it reduces the data in an unsupervised way and allows the resulting homogeneous clusters to be represented in the symbolic formalism.

The second contribution of our work is the development of a clustering method to handle complex data [33]. The method is an adaptation of the batch-learning algorithm of the self-organizing map to dissimilarity tables. Only the definition of an adequate dissimilarity is required for the method to operate efficiently. This adaptation can handle both numerical data and complex data. The experiments showed the usefulness of the method and that it can be applied to a wide variety of complex data, provided that a dissimilarity can be defined for those data. This method has also given good results in real applications [12][35] with different types of data (meteorological data of China [13], functional data [47] and Web Usage Mining [34]).
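The batch adaptation described above can be illustrated with a minimal sketch. It is not the implementation of [33]: it assumes a one-dimensional map, a Gaussian neighbourhood whose width shrinks geometrically, and prototypes restricted to single observations. The function name `dissimilarity_som` and all of its parameters are hypothetical. The only input is the dissimilarity table `D`.

```python
import math
import random

def dissimilarity_som(D, n_units, n_iter=20, sigma_start=None, sigma_end=0.5, seed=0):
    """Batch SOM on a dissimilarity table D (list of lists, symmetric).

    Assumptions of this sketch: 1-D map, Gaussian neighbourhood,
    each prototype is the index of one observation.
    """
    n = len(D)
    rng = random.Random(seed)
    protos = rng.sample(range(n), n_units)
    if sigma_start is None:
        sigma_start = max(n_units / 2.0, 1.0)

    def assign(current):
        # Affectation step: best matching unit of each observation.
        return [min(range(n_units), key=lambda u: D[i][current[u]]) for i in range(n)]

    for t in range(n_iter):
        frac = t / max(n_iter - 1, 1)
        sigma = sigma_start * (sigma_end / sigma_start) ** frac
        bmu = assign(protos)
        new_protos = []
        for u in range(n_units):
            # Neighbourhood weight between unit u and the unit of each observation.
            w = [math.exp(-((u - bmu[i]) ** 2) / (2.0 * sigma ** 2)) for i in range(n)]
            # Representation step: observation minimizing the weighted
            # sum of dissimilarities to all observations.
            new_protos.append(min(range(n), key=lambda j: sum(w[i] * D[i][j] for i in range(n))))
        protos = new_protos
    return protos, assign(protos)

# Toy usage: two well-separated groups described only by their dissimilarities.
xs = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
D = [[abs(a - b) for b in xs] for a in xs]
protos, bmu = dissimilarity_som(D, n_units=2)
print(protos, bmu)
```

Because the data enter only through `D`, the same loop would apply to any data type for which a dissimilarity can be computed, which is the point made in the paragraph above.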

#### Functional Data Analysis

Keywords : functional data, neural networks, curves classification.

Participants : Fabrice Rossi, Brieuc Conan-Guez, Aicha El Golli, Yves Lechevallier.

Functional Data Analysis is an extension of traditional data analysis to
functional data. In this framework, each individual is described by one or
several functions, rather than by a vector of R^{n}. This approach makes it
possible to take into account the regularity of the observed functions.

In earlier work, we proposed the extension of MLPs (Multi-Layer Perceptrons) to functional inputs: the Functional Multi-Layer Perceptron (FMLP) [51]. We demonstrated two important properties: this model is a universal approximator, and parameter estimation is consistent even when only a finite number of functions are available, each known at a finite number of evaluation points.

In 2004, we studied the advantages of a functional pre-processing of input functions before processing by functional neural models:

We applied FMLPs to a phoneme recognition problem [28]. The goal is to classify 5 different phonemes (TIMIT database). We used a functional PCA in order to smooth noisy spectra and to reduce the input space dimensionality (each spectrum, described by a vector of 256 components, is projected onto 10 eigenfunctions by the functional PCA). This approach based on functional PCA gives better results than those obtained by previously studied approaches (for example, representation of input functions on B-spline bases).

We showed that FMLPs, and more generally functional approaches, are not very sensitive to missing data when data are functions [48]: we compared the functional approach with traditional approaches (mean value approach, K-NN approach), and the best performance was obtained by functional neural models.

We extended Radial Basis Function Networks to functional inputs (FRBFN). Compared to FMLPs, the learning stage of this new model can be conducted very quickly, as it involves only algebraic calculus. This makes it possible to explore a wide range of learning parameters (number of neurons, pre-processing applied to functional inputs, etc.). We applied FRBFNs to a spectrometric application from the food industry, with satisfactory results [31].

Finally, for unsupervised classification problems (clustering), we extended the self-organizing map (SOM) method to the functional framework [47]. Thanks to the reduction of the input space dimensionality, this new algorithm is efficient. Moreover, the use of functional transformations (derivative calculation, for instance) makes it possible to compare very different clusterings of the same data and therefore provides new exploratory representations of functional data.
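The functional PCA pre-processing used in the phoneme experiment (256 sampled values projected onto 10 eigenfunctions) can be sketched as follows. This is a simplified illustration, not the procedure of [28]: it assumes that all curves are sampled on the same regular grid, so that functional PCA reduces to an ordinary PCA of the sampled values, and it uses synthetic random data in place of the TIMIT spectra. The function name `pca_smooth` is hypothetical.

```python
import numpy as np

def pca_smooth(spectra, n_components=10):
    """Project sampled curves on their first principal directions.

    spectra: array of shape (n_curves, n_points), one curve per row.
    Returns the scores, the basis (discretized eigenfunctions) and the
    smoothed curves rebuilt from the retained components.
    """
    mean = spectra.mean(axis=0)
    centered = spectra - mean
    # Rows of vt are orthonormal principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    scores = centered @ basis.T          # reduced representation
    smoothed = mean + scores @ basis     # low-rank reconstruction
    return scores, basis, smoothed

# Synthetic stand-in for noisy spectra (assumption: 200 curves, 256 points).
rng = np.random.default_rng(0)
spectra = rng.normal(size=(200, 256))
scores, basis, smoothed = pca_smooth(spectra)
print(scores.shape, basis.shape, smoothed.shape)
```

The `scores` array is what a functional neural model would receive as input in such a setting, while `smoothed` shows the denoising effect of truncating the expansion.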

#### Partitioning Method : a Clustering Approach for Reducing the Size of Data

Keywords : unsupervised clustering, Self Organizing Map, Large Data Base.

Participants : Yves Lechevallier, Luc Baudois.

Clustering methods reduce the volume of data in data warehouses while preserving the ability to perform the required analyses [38]. An important issue in databases and data warehouses is that they describe several entities (populations) which are linked together by relationships. In this situation, compressed data have no interpretation and cannot be used without being decompressed. Our work, carried out in cooperation with Antonio Ciampi (McGill University, Canada) and Georges Hébrail (ENST, Paris), differs from such approaches in that our compression technique has a semantic basis.

Our approach is based on two key ideas [39]:

A preliminary data reduction using a Kohonen Self Organizing Map (SOM) is performed. As a result, the individual measurements are replaced by the means of the individual measurements over a relatively small number of micro-clusters corresponding to Kohonen neurons. The micro-clusters can now be treated as new 'cases' and the means of the original variables over micro-clusters as new variables. This 'reduced' data set is now small enough to be treated by classical clustering algorithms. A further advantage of the Kohonen reduction is that the vector of means over the micro-clusters can safely be treated as multivariate normal, owing to the central limit theorem. This is a key property, in particular because it permits the definition of an appropriate dissimilarity measure between micro-clusters.

The multilevel feature of the problem is treated by a statistical model which assumes a mixture of distributions, each distribution representing a cluster or group. Although more complex dependencies can be modeled, we assume, for example, that the group only affects the mixing coefficients and not the parameters of the distributions.

#### Partitioning Method : Clustering of Quantitative Data

Keywords : unsupervised clustering, Quantitative Data, dynamic clustering algorithm.

Participants : Marc Csernel, F.A.T. de Carvalho, Yves Lechevallier, Renata Souza.

We proposed an approach [30] that easily generalizes the dynamic cluster method to adaptive and non-adaptive Lr distances. This approach can be used with numerical data alone, interval data alone, or numerical and interval data together. We carried out a theoretical study for r = 1 and r = 2: in these cases we recover the usual cluster exemplars, namely the median (r = 1) and the mean (r = 2). For r > 2, the difficulty is to find a realistic interpretation for the cluster representatives. In the future, we would like to continue studying the mathematical properties of these distances, to implement the corresponding algorithms, and to set up an empirical framework for their evaluation.
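The link between the exponent r and the cluster exemplar can be checked numerically. The sketch below covers only a single numerical variable and a single cluster, which is a simplification of the setting of [30]. It compares the closed-form exemplar (median for r = 1, mean for r = 2) with a brute-force search over a grid of candidate values. The function names are hypothetical.

```python
import statistics

def lr_criterion(values, g, r):
    """Sum of |x - g|**r over the cluster members."""
    return sum(abs(x - g) ** r for x in values)

def lr_prototype(values, r):
    """Closed-form minimizer of lr_criterion for r = 1 or r = 2."""
    if r == 1:
        return statistics.median(values)
    if r == 2:
        return statistics.mean(values)
    raise ValueError("closed form only available for r = 1 or r = 2")

# One cluster with an outlier: median and mean differ markedly.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
grid = [i / 10.0 for i in range(0, 1001)]   # candidates 0.0, 0.1, ..., 100.0

for r in (1, 2):
    g_star = lr_prototype(values, r)
    g_grid = min(grid, key=lambda g: lr_criterion(values, g, r))
    print(r, g_star, g_grid)
```

The example also shows why the choice of r matters in practice: with r = 1 the exemplar stays near the bulk of the cluster, whereas with r = 2 it is pulled towards the outlier.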

We worked on adaptive and non-adaptive dynamic cluster methods for interval data [49]. These methods generate a partition of the input data and a corresponding prototype (a vector of intervals) for each class by optimizing an adequacy criterion based on Mahalanobis distances between vectors of intervals. In a first approach we used a particular family of Mahalanobis distances, but there are other possibilities, which we intend to explore in the near future. We would also like to implement the corresponding algorithms for these other distances and to set up an empirical framework for their evaluation.

We proposed an approach to cluster constrained symbolic data using the dynamic clustering algorithm applied to a dissimilarity table [29]. The clustering criterion is based on the sum of dissimilarities between the objects belonging to the same class. We introduced a suitable dissimilarity function between symbolic data constrained by rules. To be able to compute dissimilarities between constrained symbolic data in polynomial time, we used a method, called Normal Symbolic Form, which decomposes the data according to the rules in such a way that only the valid parts of the description are represented. In the future, we would like to implement the corresponding algorithms and to set up an empirical framework for their evaluation.
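A dynamic clustering loop working directly on a dissimilarity table can be sketched as follows. This is one possible reading, not the algorithm of [29]: it assumes that each class is represented by the member object whose summed dissimilarity to the other members is smallest, and it reports the sum of dissimilarities between objects and their class representative as the criterion. The computation of dissimilarities between constrained symbolic descriptions (Normal Symbolic Form) is outside the scope of the sketch; `D` is taken as given. The function name `dynamic_clustering` is hypothetical.

```python
import random

def dynamic_clustering(D, k, n_iter=50, seed=0):
    """Alternate allocation and representation steps on a dissimilarity table D."""
    n = len(D)
    rng = random.Random(seed)
    protos = rng.sample(range(n), k)

    def allocate(current):
        # Allocation step: each object joins the class of its closest prototype.
        return [min(range(k), key=lambda c: D[i][current[c]]) for i in range(n)]

    for _ in range(n_iter):
        labels = allocate(protos)
        new_protos = list(protos)
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:  # an empty class keeps its previous prototype
                # Representation step: member with the smallest summed
                # dissimilarity to the other members of the class.
                new_protos[c] = min(members, key=lambda j: sum(D[i][j] for i in members))
        if new_protos == protos:
            break
        protos = new_protos

    labels = allocate(protos)
    criterion = sum(D[i][protos[labels[i]]] for i in range(n))
    return labels, protos, criterion

# Toy usage on a table built from six points on a line.
xs = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
D = [[abs(a - b) for b in xs] for a in xs]
labels, protos, criterion = dynamic_clustering(D, k=2)
print(labels, protos, criterion)
```

Each step can only decrease the criterion or leave it unchanged, which is why the loop stops as soon as the prototypes no longer change.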

#### Partitioning Method : Crossed Clustering Method for WUM

Keywords : unsupervised clustering, contingency table, dynamic clustering algorithm, WUM.

Participants : Yves Lechevallier, Rosanna Verde.

We proposed a crossed clustering algorithm in order to partition a set of objects into a predefined number of classes and to determine, at the same time, a structure (taxonomy) on the categories of the object descriptors [40]. This procedure is a simultaneous clustering algorithm on contingency tables [41]. The algorithm is guaranteed to converge to the best partitions of the objects into r classes and of the categories of the descriptors into c groups, respectively.

In our context we extend the crossed clustering algorithm to look
for the partition P of the set E into r classes of objects and
the partitions Q into c column-groups of V, according to the
^{2} criterion on set-valued variables. In this perspective,
we generalize a crossed clustering algorithm ([68], [69]).

It is worth noting that the criterion optimized by such an algorithm is additive:

where Q^{v} is the partition associated with the modal variable
y_{v} and .

The cells of the crossed tables can be modeled by marginal distributions (or profiles) summarizing the class descriptions of the rows and columns.

#### Agglomerative 2-3 Hierarchical Clustering: study and visualization

Keywords : 2-3 hierarchies, clustering, hierarchies, visualization, aggregation index.

Participants : Sergiu Chelcea, Mihai Jurca, Brigitte Trousse.

**Improvement of the 2-3 AHC algorithm**

In the context of Chelcea's thesis on clustering methods for usage analysis,
and more particularly on agglomerative hierarchical methods,
we continued in 2004 our work on Agglomerative 2-3 Hierarchical Clustering (2-3 AHC).
We proposed a new version
of the 2-3 AHC algorithm [25] with the same
O(n^{2} log n) algorithmic complexity as the classical AHC.
Comparative tests between the classical AHC and our 2-3 AHC algorithm were performed on simulated data
and showed that the 2-3 AHC structures are richer than the classical AHC ones.

We also studied the influence of the aggregation index (single-link and complete-link) on the created 2-3 hierarchy when clusters properly intersect each other, and improved its quality.
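To make the role of the aggregation index concrete, the sketch below implements the classical AHC (not the 2-3 AHC of [25]) with the single-link and complete-link indexes. It is a deliberately naive version that recomputes all inter-cluster dissimilarities at each step, so it does not reach the O(n^{2} log n) complexity mentioned above. The function name `ahc` is hypothetical.

```python
def ahc(D, linkage="single"):
    """Classical agglomerative hierarchical clustering on a dissimilarity table.

    Returns the list of merges as (members_a, members_b, height).
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    # Single-link keeps the smallest pairwise dissimilarity, complete-link the largest.
    agg = min if linkage == "single" else max
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = agg(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges

# Toy usage: four points on a line, same data, two aggregation indexes.
xs = [0, 1, 3, 7]
D = [[abs(a - b) for b in xs] for a in xs]
single = ahc(D, "single")
complete = ahc(D, "complete")
print([h for _, _, h in single])
print([h for _, _, h in complete])
```

On this toy table both indexes produce the same sequence of merges but different heights, which is the kind of difference the comparison of induced dissimilarities is meant to reveal.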

**Hierarchies visualization toolbox**

To better visualize and compare the hierarchies and 2-3 hierarchies created on the same data sets, we developed (in Java) the Hierarchies Visualization Toolbox.

With this toolbox, the input data can be randomly generated, loaded from files (e.g. xml, sds, text) or extracted via SQL queries from a specified database server. Next, different methods (AHC, 2-3 AHC with integrated refinement, 2-3 AHC without integrated refinement) and aggregation indexes (single-link and complete-link) can be chosen and executed successively. The results can then be compared on the basis of the number of created clusters, the induced dissimilarities, the execution time, etc., for quality-based analyses.

The toolbox was also made available via the axis Web server (http://axis.inria.fr:8002) to the other team members for testing purposes. In 2004, it was integrated into our Clustering Toolbox.