Section: New Results
Block clustering and Web Content Data Mining
Participants : Malika Charrad, Yves Lechevallier.
Simultaneous clustering, also known as biclustering, co-clustering or block clustering, is an important technique in two-way data analysis. Its goal is to find sub-matrices, that is, subgroups of rows and subgroups of columns that exhibit a high correlation. Our aim is to analyze the textual data of a web site. Our approach [28] consists of three steps: web page classification, preprocessing of web page content, and block clustering. The first step classifies the pages of the web site into two major categories: auxiliary pages and content pages. In the second step, the content of the web pages is preprocessed in order to select the descriptors that represent each page of the web site. The result is a matrix in which each web site page is represented by a vector of descriptors. In the last step, simultaneous clustering is applied to the rows and columns of this matrix to discover biclusters of pages and descriptors.
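The second and third steps can be illustrated with a minimal Python sketch. It is not the implementation used in [28]: the toy pages, the descriptor list, the function names (build_matrix, block_means, block_error) and the squared-error criterion are all assumptions made for illustration. The sketch builds a page-by-descriptor count matrix and then scores a given pair of row and column partitions by how well each block is summarized by its mean.

```python
from collections import Counter

def build_matrix(pages, descriptors):
    """Rows are pages, columns are descriptors, cells are term counts."""
    matrix = []
    for text in pages:
        counts = Counter(text.lower().split())
        matrix.append([counts[d] for d in descriptors])
    return matrix

def block_means(matrix, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Mean cell value of every (row cluster, column cluster) block."""
    sums = [[0.0] * n_col_clusters for _ in range(n_row_clusters)]
    sizes = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            k, l = row_labels[i], col_labels[j]
            sums[k][l] += value
            sizes[k][l] += 1
    return [[sums[k][l] / sizes[k][l] if sizes[k][l] else 0.0
             for l in range(n_col_clusters)]
            for k in range(n_row_clusters)]

def block_error(matrix, row_labels, col_labels, means):
    """Sum of squared deviations of each cell from its block mean."""
    return sum((value - means[row_labels[i]][col_labels[j]]) ** 2
               for i, row in enumerate(matrix)
               for j, value in enumerate(row))

# Hypothetical content pages and descriptors (illustrative only).
pages = [
    "clustering matrix clustering",
    "matrix clustering",
    "hotel booking hotel",
    "booking hotel",
]
descriptors = ["clustering", "matrix", "hotel", "booking"]
matrix = build_matrix(pages, descriptors)

# A candidate block partition: two page clusters, two descriptor clusters.
row_labels = [0, 0, 1, 1]
col_labels = [0, 0, 1, 1]
means = block_means(matrix, row_labels, col_labels, 2, 2)
error = block_error(matrix, row_labels, col_labels, means)
```

A block clustering algorithm would search over row_labels and col_labels to improve such a criterion. Here the partition is fixed by hand, so the sketch only shows what a bicluster of pages and descriptors looks like in the matrix.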
One of the major problems of simultaneous clustering algorithms, as with simple clustering algorithms, is the need to specify the optimal number of clusters. This problem has been the subject of extensive research, and numerous strategies have been proposed for finding the right number of clusters. However, these strategies can only be applied to one-way clustering algorithms, and approaches for finding the best number of clusters in block clustering algorithms are lacking. Our goal [28] is to extend the use of the indices used by these strategies to block clustering algorithms. We propose in [36] to use the Laplace operator and the differential operator in two dimensions to detect the appropriate number of clusters.
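The idea of a two-dimensional operator can be sketched as follows. This is only one possible reading, not the procedure of [36]: we assume that a criterion (for example a within-block error) has been computed on a grid indexed by the number of row clusters and the number of column clusters, and that the sharpest "elbow" of this surface is where its five-point discrete Laplacian is largest. The function names and the criterion values are hypothetical, and the values are synthetic.

```python
def discrete_laplacian(grid):
    """Five-point discrete Laplacian at the interior points of a 2D grid."""
    n_rows, n_cols = len(grid), len(grid[0])
    lap = {}
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            lap[(i, j)] = (grid[i - 1][j] + grid[i + 1][j]
                           + grid[i][j - 1] + grid[i][j + 1]
                           - 4 * grid[i][j])
    return lap

def pick_cluster_numbers(criterion, row_counts, col_counts):
    """Return the (row, column) cluster counts with the largest Laplacian."""
    lap = discrete_laplacian(criterion)
    i, j = max(lap, key=lap.get)
    return row_counts[i], col_counts[j]

# Candidate numbers of row clusters and column clusters.
row_counts = [2, 3, 4, 5]
col_counts = [2, 3, 4, 5]

# Synthetic criterion values, chosen only to illustrate the computation:
# the criterion drops sharply up to 3 clusters and flattens afterwards.
row_part = [100, 40, 35, 32]
col_part = [100, 40, 35, 32]
criterion = [[r + c for c in col_part] for r in row_part]

best = pick_cluster_numbers(criterion, row_counts, col_counts)
```

Because the Laplacian needs neighbours on every side, only interior grid points can be selected, so the candidate ranges should extend at least one step beyond the plausible numbers of clusters.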