Section: New Results
Using Multiple Dissimilarity Data Tables for Documents Categorization
Participants: Thierry Despeyroux, Yves Lechevallier.
In collaboration with F.A.T De Carvalho and Filipe M de Melo, we have developed a clustering algorithm that partitions objects by taking into account simultaneously their relational descriptions given by multiple dissimilarity matrices. These matrices may be generated using different sets of variables and a fixed dissimilarity function, a fixed set of variables and different dissimilarity functions, or different sets of variables and different dissimilarity functions. The method, which is based on the dynamic hard clustering algorithm for relational data, provides a partition and a prototype for each cluster, and learns a relevance weight for each dissimilarity matrix by optimizing an adequacy criterion that measures the fit between clusters and their representatives. These relevance weights are updated at each iteration of the algorithm and differ from one cluster to another.
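As an illustration, the alternating scheme described above (allocation of objects, choice of a prototype per cluster, update of the relevance weights) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: each prototype is assumed to be a single object (a medoid), the weights of a cluster are assumed to be constrained to have a product equal to one, and the function name `cluster_multi_dissim`, its parameters and the toy data are hypothetical.

```python
import math

def cluster_multi_dissim(D, init_prototypes, n_iter=50, eps=1e-12):
    """Hard clustering on p dissimilarity matrices with per-cluster weights.

    D is a list of p matrices (nested lists, n x n).
    init_prototypes is a list of K object indices used as initial prototypes.
    Assumption: within each cluster, the p weights have a product of one.
    """
    p, n, K = len(D), len(D[0]), len(init_prototypes)
    prototypes = list(init_prototypes)
    weights = [[1.0] * p for _ in range(K)]
    labels = None

    def cost(k, i, g):
        # Weighted dissimilarity between object i and candidate prototype g,
        # using the relevance weights of cluster k.
        return sum(weights[k][j] * D[j][i][g] for j in range(p))

    for _ in range(n_iter):
        # Allocation step: assign each object to its closest prototype.
        new_labels = [
            min(range(K), key=lambda k: cost(k, i, prototypes[k]))
            for i in range(n)
        ]
        if new_labels == labels:
            break  # the partition is stable
        labels = new_labels

        for k in range(K):
            members = [i for i in range(n) if labels[i] == k]
            if not members:
                continue
            # Representation step: best single-object prototype for cluster k.
            prototypes[k] = min(
                range(n), key=lambda g: sum(cost(k, i, g) for i in members)
            )
            # Weighting step: geometric mean of the per-matrix sums divided
            # by each sum, so that the weights of the cluster multiply to one.
            sums = [
                max(sum(D[j][i][prototypes[k]] for i in members), eps)
                for j in range(p)
            ]
            geo_mean = math.exp(sum(math.log(s) for s in sums) / p)
            weights[k] = [geo_mean / s for s in sums]

    return labels, prototypes, weights

# Toy data (hypothetical): two variables, one dissimilarity matrix per variable.
x = [0, 1, 2, 10, 11, 12]
y = [0, 4, 1, 2, 0, 3]
D = [[[abs(a - b) for b in v] for a in v] for v in (x, y)]
labels, prototypes, weights = cluster_multi_dissim(D, init_prototypes=[0, 3])
```

Under this constraint, a matrix on which the members of a cluster are close to their prototype receives a larger weight in that cluster, which is one way to obtain weights that differ from one cluster to another.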
To illustrate the usefulness of the proposed clustering algorithm, we use it to categorize a document database. The document database is a collection of reports produced by every Inria research team in 2007. Research teams are grouped into scientific themes that do not correspond to an organizational structure (such as departments or divisions), but act as a virtual structure for the purpose of presentation, communication and evaluation. The choice of themes and the allocation of teams are mostly related to strategic objectives and scientific closeness between existing teams, but also take into account some geographical constraints, such as the desire for a theme to be representative of most Inria centers. Our aim is to compare the categorization given automatically by the clustering algorithm that we have developed with the a priori expert categorization given by Inria.
To do that, we used the XML parser described in section 5.4.2 to parse the reports and extract the relevant parts. The TreeTagger program then lemmatizes the extracted text, and this intermediate result is passed to our clustering algorithm.
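The extraction step of this pipeline can be sketched as follows. This is only an illustration: it uses the Python standard library rather than the XML parser of section 5.4.2, the element names (`raweb`, `presentation`, `results`) and the sample document are hypothetical, and `placeholder_tokens` is a crude stand-in that lowercases and tokenizes, not a call to TreeTagger.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical report fragment; the real report schema is not given here.
SAMPLE = (
    "<raweb><team>X</team>"
    "<presentation>Clustering of <b>relational</b> data.</presentation>"
    "<results>New clustering algorithms.</results></raweb>"
)

def extract_sections(xml_text, wanted=("presentation", "results")):
    """Return the plain text of each wanted element, with markup removed."""
    root = ET.fromstring(xml_text)
    sections = {}
    for tag in wanted:
        node = root.find(f".//{tag}")
        if node is not None:
            # itertext() also collects the text of nested elements.
            sections[tag] = " ".join("".join(node.itertext()).split())
    return sections

def placeholder_tokens(text):
    """Stand-in for lemmatization: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

sections = extract_sections(SAMPLE)
tokens = {tag: placeholder_tokens(text) for tag, text in sections.items()}
```

Keeping the text of each report part separate is one possible way to derive several dissimilarity matrices from the same set of documents.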
The comparison between the automatic clustering and the a priori expert categorization shows minor divergences that can be explained by political choices of Inria [32], [40].
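One simple way to carry out such a comparison between two partitions is a contingency table together with the Rand index. The sketch below is an assumption about how this could be done, not the measure reported in [32], [40]; the function names and the toy labels are hypothetical and do not reproduce any actual result.

```python
from collections import Counter
from itertools import combinations

def contingency(auto, expert):
    """Count the objects for each (automatic cluster, expert theme) pair."""
    return Counter(zip(auto, expert))

def rand_index(auto, expert):
    """Fraction of object pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(auto)), 2))
    agree = sum(
        (auto[i] == auto[j]) == (expert[i] == expert[j]) for i, j in pairs
    )
    return agree / len(pairs)

# Toy labels (hypothetical): six teams, three clusters, three themes.
auto = [0, 0, 1, 1, 2, 2]
expert = ["A", "A", "B", "B", "B", "C"]
table = contingency(auto, expert)
score = rand_index(auto, expert)
```

Cells of the contingency table that fall outside the dominant theme of a cluster point to the teams whose automatic and expert assignments diverge.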