Team AxIS

Inria / Raweb 2004
Project: AxIS

Section: New Results

Keywords : symbolic data analysis, unsupervised clustering, Self Organizing Map, complex data, neural networks, hierarchical clustering, hierarchies.

Data Mining Methods

Symbolic Data Extraction and Self-Organizing Maps

Keywords : symbolic data analysis, relational database, unsupervised clustering, Self Organizing Map, dissimilarity.

Participants : Aicha El Golli, Yves Lechevallier, Brieuc Conan-Guez, Fabrice Rossi.

The aim of symbolic data analysis is to provide a better representation of the variations and imprecision contained in real data. As such data express a higher level of knowledge, the representation must offer a richer formalism than the one provided by classical data analysis.

A generalization process exists that allows data to be synthesized and represented by means of an assertion formalism that was defined in symbolic data analysis. This generalization process is supervised and often sensitive to virtual and atypical individuals. When the data to be generalized is heterogeneous, some assertions include virtual individuals. Faced with this new formalism and the resulting semantic extension that symbolic data analysis offers, a new approach to processing and interpreting data is required.

The original contributions of our work concern new approaches to representing and clustering complex data.

First, we propose a decomposition step, based on a divisive clustering algorithm, that improves the generalization process while offering the symbolic formalism [37]. We also propose an unsupervised generalization process based on the self-organizing map. The advantage of this method is that it reduces the data in an unsupervised way and allows the resulting homogeneous clusters to be represented by the symbolic formalism.

The second contribution of our work is the development of a clustering method to handle complex data [33]. The method adapts the batch-learning algorithm of the self-organizing map to dissimilarity tables. Only the definition of an adequate dissimilarity is required for the method to operate efficiently. This adaptation can handle both numerical data and complex data. The experiments showed that the method is useful and that it can be applied to a wide variety of complex data, provided that a dissimilarity can be defined for those data. This method has also given good results in real applications [12][35] with different types of data (meteorological data of China [13], functional data [47] and Web Usage Mining [34]).
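The adaptation of the batch self-organizing map to a dissimilarity table can be illustrated with a minimal sketch. It is not the implementation described in [33]: it assumes a one-dimensional chain of units, a Gaussian neighbourhood whose radius shrinks linearly down to an arbitrary floor of 0.3, prototypes restricted to single observations, and a deterministic initialisation. The names `dissimilarity_som` and `nearest_unit` are hypothetical.

```python
import math


def nearest_unit(diss, protos, i):
    """Index of the unit whose prototype is least dissimilar to observation i."""
    return min(range(len(protos)), key=lambda u: diss[i][protos[u]])


def dissimilarity_som(diss, n_units, n_iter=10):
    """Batch SOM on a 1-D chain of units, driven only by a dissimilarity table.

    diss[i][j] is the dissimilarity between observations i and j.
    Returns (protos, assign): protos[u] is the index of the observation used
    as prototype of unit u, assign[i] is the unit of observation i.
    """
    n = len(diss)
    # Assumption: spread the initial prototypes evenly over the observations.
    protos = [(u * n) // n_units for u in range(n_units)]
    radius0 = max(n_units / 2.0, 1.0)
    for t in range(n_iter):
        # Assumption: linear decay of the neighbourhood radius, floored at 0.3.
        radius = max(radius0 * (1.0 - t / n_iter), 0.3)
        # Affectation step: each observation goes to its closest prototype.
        assign = [nearest_unit(diss, protos, i) for i in range(n)]
        # Representation step: each unit picks the observation that minimises
        # the neighbourhood-weighted sum of dissimilarities to all observations.
        new_protos = []
        for u in range(n_units):
            weights = [
                math.exp(-((u - assign[i]) ** 2) / (2.0 * radius ** 2))
                for i in range(n)
            ]
            cost = lambda c: sum(w * diss[i][c] for i, w in enumerate(weights))
            new_protos.append(min(range(n), key=cost))
        protos = new_protos
    # Final affectation with the last prototypes.
    assign = [nearest_unit(diss, protos, i) for i in range(n)]
    return protos, assign
```

Only the table `diss` is needed, so the same sketch applies to any data type for which a dissimilarity can be computed, for instance `dissimilarity_som(diss, n_units=2)` on a table built from absolute differences between numbers.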

Functional Data Analysis

Keywords : functional data, neural networks, curves classification.

Participants : Fabrice Rossi, Brieuc Conan-Guez, Aicha El Golli, Yves Lechevallier.

Functional Data Analysis is an extension of traditional data analysis to functional data. In this framework, each individual is described by one or several functions, rather than by a vector of $\mathbb{R}^n$. This approach makes it possible to take into account the regularity of the observed functions.

In earlier work, we proposed the extension of MLPs (Multi-Layer Perceptrons) to functional inputs: the Functional Multi-Layer Perceptron (FMLP) [51]. We demonstrated two important properties: this model is a universal approximator, and parameter estimation is consistent even when only a finite number of functions is available, each observed at a finite number of evaluation points.
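As an illustration of how a neuron can take a function as input when that function is only known at a finite number of evaluation points, the sketch below computes tanh(b + ∫ w(x) g(x) dx) with the integral replaced by a trapezoidal quadrature. This is a sketch under assumptions, not the model of [51]: the quadrature rule, the tanh activation and the names `trapezoid` and `functional_neuron` are illustrative choices, and the weight function is passed as a plain Python callable.

```python
import math


def trapezoid(xs, ys):
    """Trapezoidal approximation of the integral of a sampled function."""
    return sum(
        (xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2.0
        for k in range(len(xs) - 1)
    )


def functional_neuron(xs, gs, weight_fn, bias):
    """Output of one neuron whose input is a function sampled at points xs.

    xs: sorted evaluation points; gs: observed values g(x) at those points.
    weight_fn: hypothetical callable playing the role of the weight function.
    """
    products = [weight_fn(x) * g for x, g in zip(xs, gs)]
    return math.tanh(bias + trapezoid(xs, products))
```

For example, `functional_neuron([0.0, 0.5, 1.0], [1.0, 1.0, 1.0], lambda x: 2.0 * x, 0.0)` integrates the weight function 2x against a constant input over [0, 1].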

In 2004, we studied the advantages of a functional pre-processing of input functions before processing by functional neural models:

Partitioning Method : a Clustering Approach for Reducing the Size of Data

Keywords : unsupervised clustering, Self Organizing Map, Large Data Base.

Participants : Yves Lechevallier, Luc Baudois.

Clustering methods reduce the volume of data in data warehouses while preserving the ability to perform the required analyses [38]. An important issue in databases and data warehouses is that they describe several entities (populations) which are linked together by relationships. In this situation, compressed data have no interpretation and cannot be used without first being decompressed. Our work, carried out in cooperation with Antonio Ciampi (McGill University, Canada) and Georges Hébrail (ENST, Paris), differs from such compression approaches in that our compression technique has a semantic basis.

Our approach is based on two key ideas [39]:

Partitioning Method : Clustering of Quantitative Data

Keywords : unsupervised clustering, Quantitative Data, dynamic clustering algorithm.

Participants : Marc Csernel, F.A.T. de Carvalho, Yves Lechevallier, Renata Souza.

We proposed an approach [30] that easily generalizes the dynamic cluster method to the adaptive and non-adaptive $L_r$ distances. This approach can be used with numerical data alone, interval data alone, or numerical and interval data together. We carried out a theoretical study for r = 1 and r = 2: in these cases we recover the usual cluster exemplars, namely the median (r = 1) and the mean (r = 2). For r > 2, the difficulty is to find a realistic interpretation of the cluster representatives. In the future, we would like to continue studying the mathematical properties of these distances, to implement the corresponding algorithms, and to build an empirical framework for their evaluation.
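The r = 1 and r = 2 cases can be illustrated with a minimal sketch of a non-adaptive dynamic cluster loop on numerical data, alternating an allocation step and a representation step. It assumes the criterion is the sum over coordinates of |x_j - g_j|^r, so that the coordinate-wise median (r = 1) or mean (r = 2) is the optimal representative. It is not the algorithm of [30]: interval data and adaptive distances are left out, and the function names are hypothetical.

```python
from statistics import mean, median


def lr_cost(x, g, r):
    """Sum over coordinates of |x_j - g_j|**r."""
    return sum(abs(a - b) ** r for a, b in zip(x, g))


def prototype(points, r):
    """Coordinate-wise median (r = 1) or mean (r = 2) of a non-empty cluster."""
    if r not in (1, 2):
        raise ValueError("this sketch only covers r = 1 and r = 2")
    center = median if r == 1 else mean
    return [center(column) for column in zip(*points)]


def dynamic_clusters(data, init_protos, r, n_iter=20):
    """Alternate allocation and representation until the prototypes are stable."""
    protos = [list(p) for p in init_protos]
    labels = []
    for _ in range(n_iter):
        # Allocation step: assign each point to its closest prototype.
        labels = [
            min(range(len(protos)), key=lambda k: lr_cost(x, protos[k], r))
            for x in data
        ]
        # Representation step: recompute the prototype of each non-empty cluster.
        new_protos = []
        for k in range(len(protos)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            new_protos.append(prototype(members, r) if members else protos[k])
        if new_protos == protos:
            break
        protos = new_protos
    return protos, labels
```

On the same data and initial prototypes, switching `r` from 1 to 2 changes the representatives from medians to means, which makes the median less sensitive to an outlying point inside a cluster.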

We worked on adaptive and non-adaptive dynamic cluster methods for interval data [49]. These methods generate a partition of the input data and a corresponding prototype (a vector of intervals) for each class by optimizing an adequacy criterion based on Mahalanobis distances between vectors of intervals. In a first approach we used a particular family of Mahalanobis distances, but there are other possibilities, which we intend to explore in the near future. We would also like to implement the corresponding algorithms for these other distances and to build an empirical framework for their evaluation.

We proposed an approach to cluster constrained symbolic data using the dynamic clustering algorithm applied to a dissimilarity table [29]. The clustering criterion is based on the sum of dissimilarities between the objects belonging to the same class. We introduced a suitable dissimilarity function between symbolic data constrained by rules. To compute dissimilarities between constrained symbolic data in polynomial time, we used a method called Normal Symbolic Form, which decomposes the data according to the rules in such a way that only the valid parts of the description are represented. In the future, we would like to implement the corresponding algorithms and to build an empirical framework for their evaluation.

Partitioning Method : Crossed Clustering Method for WUM

Keywords : unsupervised clustering, contingency table, dynamic clustering algorithm, WUM.

Participants : Yves Lechevallier, Rosanna Verde.

We proposed a crossed clustering algorithm that partitions a set of objects into a predefined number of classes and determines, at the same time, a structure (taxonomy) on the categories of the object descriptors [40]. This procedure is a simultaneous clustering algorithm on contingency tables [41]. The algorithm is guaranteed to converge to the best partitions of the objects into r classes and of the descriptor categories into c groups, respectively.

In our context we extend the crossed clustering algorithm to look for the partition P of the set E into r classes of objects and the partitions Q into c column-groups of V, according to the $\Phi^2$ criterion on set-valued variables. In this perspective, we generalize a crossed clustering algorithm ([68], [69]).

It is worth noting that the criterion optimized by such an algorithm is additive:

$\Delta(P,(Q^1,\ldots,Q^p)) = \sum_{v=1}^{p} \Phi^2(P, Q^v \mid Q)$

where $Q^v$ is the partition associated with the modal variable $y_v$ and $Q = (Q_1,\ldots,Q_c) = \left(\bigcup_{v=1}^{p} Q_1^v, \ldots, \bigcup_{v=1}^{p} Q_k^v, \ldots, \bigcup_{v=1}^{p} Q_c^v\right)$.

The cells of the crossed tables can be modeled by marginal distributions (or profiles) summarizing the class descriptions of the rows and columns.
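To make the role of a $\Phi^2$ criterion concrete, the sketch below evaluates it for a single contingency table once a row partition and a column partition are given: the table is collapsed into an r × c table of class-by-group counts, and $\Phi^2$ is computed as the chi-square statistic of that table divided by its total. This only illustrates how one pair of partitions could be scored; it is not the crossed clustering algorithm of [40], it handles one table rather than a sum over modal variables, and the names `collapse` and `phi2` are hypothetical.

```python
def collapse(table, row_labels, col_labels, r, c):
    """Aggregate a contingency table into r row classes and c column groups."""
    out = [[0.0] * c for _ in range(r)]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            out[row_labels[i]][col_labels[j]] += count
    return out


def phi2(table):
    """Chi-square statistic of a contingency table divided by its total count."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    value = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                value += (count - expected) ** 2 / expected
    return value / total
```

A pair of partitions that preserves the association between rows and columns yields a larger value of `phi2(collapse(...))` than a pair that mixes the blocks, which is what an alternating optimization over P and Q would exploit.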

Agglomerative 2-3 Hierarchical Clustering: study and visualization

Keywords : 2-3 hierarchies, clustering, hierarchies, visualization, aggregation index.

Participants : Sergiu Chelcea, Mihai Jurca, Brigitte Trousse.

Improvement of the 2-3 AHC algorithm

In the context of Chelcea's thesis on clustering methods for usage analysis, and more particularly on agglomerative hierarchical methods, we continued in 2004 our work on the Agglomerative 2-3 Hierarchical Clustering (2-3 AHC). We proposed a new version of the 2-3 AHC algorithm [25] with the same $\Theta(n^2 \log n)$ algorithmic complexity as the classical AHC. Comparative tests between the classical AHC and our 2-3 AHC algorithm were performed on simulated data and showed that the 2-3 AHC structures are richer than the classical AHC ones.

We also studied the influence of the aggregation index (single-link and complete-link) on the created 2-3 hierarchy when clusters properly intersect one another, and improved its quality.
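For reference, the effect of the two aggregation indexes can be seen on the classical AHC with the naive sketch below. It is deliberately simple: it recomputes all inter-cluster dissimilarities at each step, so it does not reach the $\Theta(n^2 \log n)$ complexity mentioned above, and it builds an ordinary hierarchy, not a 2-3 hierarchy in which clusters may properly intersect. The function name `ahc` and its return format are assumptions made for this illustration.

```python
def ahc(diss, linkage="single"):
    """Naive classical AHC on a dissimilarity table.

    Returns the list of merges as (cluster_a, cluster_b, level) tuples,
    where clusters are sorted lists of observation indices.
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    # Single-link uses the smallest pairwise dissimilarity, complete-link the largest.
    aggregate = min if linkage == "single" else max
    clusters = [[i] for i in range(len(diss))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                level = aggregate(
                    diss[i][j] for i in clusters[a] for j in clusters[b]
                )
                if best is None or level < best[0]:
                    best = (level, a, b)
        level, a, b = best
        merges.append((clusters[a], clusters[b], level))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [cl for k, cl in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges
```

Running it twice on the same table, once with `linkage="single"` and once with `linkage="complete"`, gives the same kind of side-by-side comparison of merge levels that the aggregation index study relies on.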

Hierarchies visualization toolbox

To better visualize and compare the hierarchies and 2-3 hierarchies created on the same data sets, we developed (in Java) the Hierarchies Visualization Toolbox.

With this toolbox, the input data can be randomly generated, loaded from files (e.g. xml, sds, text) or extracted via SQL queries from a specified database server. Next, different methods (AHC, 2-3 AHC with integrated refinement, 2-3 AHC without integrated refinement) and aggregation indexes (single-link and complete-link) can be chosen and executed successively. The results can then be compared on the number of created clusters, the induced dissimilarities, the execution time, etc. for quality-based analyses.

The toolbox was also made available to the other team members via the AxIS Web server for testing purposes. In 2004, it was integrated into our Clustering Toolbox.

