Team AxIS

Section: New Results

Block clustering and Web Content Data Mining

Participants: Malika Charrad, Yves Lechevallier.

Simultaneous clustering, also known as biclustering, co-clustering or block clustering, is an important technique in two-way data analysis. Its goal is to find sub-matrices, that is, subgroups of rows and subgroups of columns that exhibit a high correlation. Our aim is to analyze the textual data of a web site. Our approach [27], [28] consists of three steps: web page classification, preprocessing of web page content, and block clustering. The first step classifies the pages of the web site into two major categories: auxiliary pages and content pages. In the second step, the content of the web pages is preprocessed in order to select the descriptors that represent each page of the web site. The result is a matrix in which each web site page is represented by a vector of descriptors. In the last step, simultaneous clustering is applied to the rows and columns of this matrix to discover biclusters of pages and descriptors.
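To make the last step concrete, the following is a minimal sketch of block clustering on a small page-by-descriptor matrix. It is not the algorithm of [27], [28]: it is a generic alternating scheme, assumed here for illustration, that reassigns rows and then columns to the cluster whose block means fit them best in the squared-error sense. The matrix values, the function names (block_means, block_cluster), the deterministic initial partition and the choice of two row clusters and two column clusters are all hypothetical.

```python
def block_means(matrix, row_labels, col_labels, k, l):
    """Mean value of each (row cluster, column cluster) block; empty blocks get 0."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            sums[row_labels[i]][col_labels[j]] += value
            counts[row_labels[i]][col_labels[j]] += 1
    return [[sums[a][b] / counts[a][b] if counts[a][b] else 0.0
             for b in range(l)] for a in range(k)]


def block_cluster(matrix, k, l, max_iter=20):
    """Alternately reassign rows and columns to the best-fitting block means.

    Illustrative sketch only: the initial partition is a simple deterministic
    split into contiguous chunks (an assumption), and the result is a local
    optimum that depends on that starting point.
    """
    n, m = len(matrix), len(matrix[0])
    row_labels = [i * k // n for i in range(n)]
    col_labels = [j * l // m for j in range(m)]
    for _ in range(max_iter):
        # Row step: columns fixed, each row moves to its closest row cluster.
        means = block_means(matrix, row_labels, col_labels, k, l)
        new_rows = [
            min(range(k), key=lambda a: sum(
                (matrix[i][j] - means[a][col_labels[j]]) ** 2 for j in range(m)))
            for i in range(n)
        ]
        # Column step: rows fixed, each column moves to its closest column cluster.
        means = block_means(matrix, new_rows, col_labels, k, l)
        new_cols = [
            min(range(l), key=lambda b: sum(
                (matrix[i][j] - means[new_rows[i]][b]) ** 2 for i in range(n)))
            for j in range(m)
        ]
        if new_rows == row_labels and new_cols == col_labels:
            break  # neither partition changed: converged
        row_labels, col_labels = new_rows, new_cols
    return row_labels, col_labels


# Hypothetical toy data: rows are content pages, columns are descriptors,
# and an entry is 1 when the descriptor occurs in the page.
pages_by_descriptors = [
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
]

row_labels, col_labels = block_cluster(pages_by_descriptors, k=2, l=2)
print("page clusters:      ", row_labels)
print("descriptor clusters:", col_labels)
```

On this toy matrix, pages with the same descriptor profile end up in the same row cluster and the descriptors they share end up in the same column cluster, so each (page cluster, descriptor cluster) pair forms one bicluster. Note that the numbers of clusters k and l have to be supplied by the caller.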

One of the major problems of simultaneous clustering algorithms, as with simple clustering algorithms, is the need to specify the optimal number of clusters. This problem has been the subject of extensive research, and numerous strategies have been proposed for finding the right number of clusters. However, these strategies can only be applied to one-way clustering algorithms, and there is a lack of approaches for finding the best number of clusters in block clustering algorithms.

Ms Malika Charrad [17] defended her PhD in June 2010 at CNAM.

