Section: New Results
XML Document Mining and XML Search
Structure and Content Mining
Keywords : Document mining, XML clustering, XML classification.
Participants : Thierry Despeyroux, Mounir Fegas, Saba Gul, Yves Lechevallier, Anne-Marie Vercoustre.
XML documents are becoming ubiquitous because of their rich and flexible format, which can be used for a variety of applications. Standard methods have been used to classify XML documents by reducing them to their textual parts. These approaches do not take advantage of the structure of XML documents, which also carries important information.
Last year we studied the impact of selecting different parts (sub-structures) of XML documents for specific clustering tasks. Our approach integrated techniques for extracting representative words from document elements with unsupervised classification of documents. We illustrated and evaluated this approach on the collection of XML activity reports written by Inria research teams for the year 2003. The objective was to cluster projects into larger groups (Themes), based on the keywords or on different chapters of these activity reports. We then compared the results of clustering using different feature selections with the official theme structure used by Inria between 1985 and 2003, and with the new one proposed officially in 2004. The results (published this year) show that the quality of clustering strongly depends on the selected document features [33], [32].
This year we developed a new representation model for clustering XML documents. The standard vector model for classification or clustering of documents represents documents by weighted vectors of the words they contain. This model takes into account only the textual content of documents. With XML documents, we want a representation that takes into account either the structure of the documents alone or both the structure and the content. Since XML documents can be seen as trees, we represent documents by the set of their (node) paths of length L, with n ≤ L ≤ m, where n and m are two given values. Paths can be constrained to be root-beginning paths or leaf-ending paths. For dealing with both the structure and the content, we define text paths that extend the node paths with the words contained in the subtree of their final node. Then, by regarding paths as words, we can cluster documents by applying standard clustering methods based on the vector model. There is one difficulty, though, since the vector model assumes independence between the dimensions of the vectors. In our case, when two paths are embedded in each other they are obviously not independent. To deal with this problem of dependency, we partition the paths by their length and treat each set of paths as a different modality in the clustering algorithm.
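As an illustration of this representation, here is a minimal sketch in Python, assuming that paths are written as tag names separated by "/" and restricted to root-beginning paths; the toy document and its element names are hypothetical, and the actual implementation used in our experiments may differ.

```python
# Minimal sketch (assumed conventions): root-beginning tag paths written as
# "a/b/c"; text paths append one word of the final node's subtree to the path.
from xml.etree import ElementTree


def node_paths(elem, n, m, prefix=()):
    """Yield the root-beginning node paths of length L, n <= L <= m."""
    path = prefix + (elem.tag,)
    if n <= len(path) <= m:
        yield "/".join(path)
    if len(path) < m:
        for child in elem:
            yield from node_paths(child, n, m, path)


def text_paths(elem, n, m, prefix=()):
    """Yield text paths: each node path extended with a word of its subtree."""
    path = prefix + (elem.tag,)
    if n <= len(path) <= m:
        for word in " ".join(elem.itertext()).split():
            yield "/".join(path) + "/" + word.lower()
    if len(path) < m:
        for child in elem:
            yield from text_paths(child, n, m, path)


# Hypothetical toy document; real documents would be the activity reports.
doc = ElementTree.fromstring(
    "<report><team>AxIS</team><results><title>XML mining</title></results></report>"
)
print(sorted(set(node_paths(doc, 1, 3))))
# ['report', 'report/results', 'report/results/title', 'report/team']
print(sorted(set(text_paths(doc, 2, 3))))
# includes e.g. 'report/team/axis' and 'report/results/title/xml'
```

Regarding each such path string as a "word", the documents can then be clustered with any vector-model method, the paths of each length being handled as a separate modality.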
We evaluate our approach using four standard metrics, namely the F-measure, the corrected Rand index, the entropy and the purity. That is, for a given clustering task, we compare the resulting clusters with a priori known classes. We made several experiments using the INEX IEEE collections and the Inria activity reports [48]. The results, which will be published at the EGC 2006 conference, show that our approach works both for structure-based clustering and for structure-and-content clustering. However, using leaf-ending paths may significantly increase the clustering time, as the number of paths grows dramatically. We need to find good ways to reduce the number of paths, especially for text paths.
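For reference, two of these measures (purity and entropy) can be computed directly from the contingency between clusters and a priori classes, as in the minimal sketch below; the corrected Rand index is available, for instance, as adjusted_rand_score in scikit-learn. The toy labels are hypothetical and the exact weighting used in our experiments may differ slightly.

```python
# Sketch of two external evaluation measures computed from (cluster, class)
# label pairs; the toy labels below are hypothetical.
import math
from collections import Counter

from sklearn.metrics import adjusted_rand_score  # the corrected Rand index


def purity(clusters, classes):
    """Fraction of documents falling in the majority class of their cluster."""
    majority = 0
    for c in set(clusters):
        members = [cls for k, cls in zip(clusters, classes) if k == c]
        majority += Counter(members).most_common(1)[0][1]
    return majority / len(clusters)


def entropy(clusters, classes):
    """Cluster-size weighted entropy of the class labels inside each cluster."""
    result = 0.0
    for c in set(clusters):
        members = [cls for k, cls in zip(clusters, classes) if k == c]
        for count in Counter(members).values():
            p = count / len(members)
            result -= len(members) / len(clusters) * p * math.log2(p)
    return result


clusters = [0, 0, 0, 1, 1, 2, 2, 2]                   # clustering result
classes = ["a", "a", "b", "b", "b", "c", "c", "a"]    # a priori known classes
print(purity(clusters, classes))                # 0.75 (higher is better)
print(entropy(clusters, classes))               # lower is better
print(adjusted_rand_score(classes, clusters))   # corrected Rand index
```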
We also started to apply this approach to the collections proposed by the INEX XML Document Mining tracks [56].
Sequential Pattern Mining for Structure-based XML Document Classification
Keywords : sequential pattern, structure mining, XML document, classification.
Participants : Calin Garboni, Florent Masséglia, Brigitte Trousse.
The goal of this work is to provide a supervised classification (``classification supervisée'' in French) of a collection of XML documents. For this purpose we assume that we are provided with a set of clusters coming from a previous clustering of a past collection. More formally, let S1 be a first collection of XML documents and C = {c1, c2, ..., cn} the set of clusters defined for the documents of S1. Let S2 be a new collection of XML documents. Our goal is to classify the documents of S2 by taking into account the distribution of documents in C.
To this end, our method proceeds as illustrated in figure 9. It is based on the following three steps:
-
Pre-processing: first of all, we extract the frequent tags embedded in the collection. This step corresponds to step ``1'' in figure 9. The main idea is to remove tags that are irrelevant for the clustering operations. A tag which is very frequent in the whole collection may be considered as irrelevant since it will not help in separating one document from another (the tag is not discriminative).
-
Characterising existing clusters: we then perform a data mining step on each cluster of the previous collection (namely C, introduced at the beginning of this section). This step corresponds to step ``2'' in figure 9. For each cluster, the goal is to transform each XML document into a sequence; during this mapping operation, the frequent tags extracted in step 1 are removed. Then, on each set of sequences corresponding to an original cluster, we perform a data mining step intended to extract sequential patterns. For each cluster Ci we thus obtain SPi, the set of frequent sequences that characterizes Ci.
-
XML document matching: finally, the key step of our method relies on a matching between each document of the new collection and the sequential patterns extracted in the second step. This last step corresponds to step ``3'' in figure 9.
The matching techniques developed in this work are described in [49].
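For illustration only (the actual matching techniques are those of [49]), the sketch below assumes a simple inclusion-based score: a document, mapped to its sequence of tags with the frequent tags removed, is scored against each SPi by the fraction of patterns it contains as subsequences, and is assigned to the best-scoring cluster. The tag names, frequent tags and patterns are hypothetical.

```python
# Hypothetical sketch of the classification pipeline: frequent-tag filtering,
# mapping documents to tag sequences, and matching against cluster patterns.
# The sets SPi of sequential patterns would be produced by a sequential
# pattern mining algorithm run on each cluster of S1; here they are given.
from xml.etree import ElementTree


def tag_sequence(xml_string, frequent_tags):
    """Map a document to its sequence of tags (document order), dropping the
    frequent, non-discriminative tags found during pre-processing."""
    root = ElementTree.fromstring(xml_string)
    return [e.tag for e in root.iter() if e.tag not in frequent_tags]


def contains(sequence, pattern):
    """True if `pattern` occurs as a subsequence of `sequence`."""
    it = iter(sequence)
    return all(tag in it for tag in pattern)


def classify(sequence, cluster_patterns):
    """Assign the document to the cluster whose patterns it matches best."""
    scores = {
        ci: sum(contains(sequence, p) for p in spi) / len(spi)
        for ci, spi in cluster_patterns.items()
    }
    return max(scores, key=scores.get), scores


frequent_tags = {"article", "p"}        # assumed output of step 1
cluster_patterns = {                    # assumed output of step 2 (SP1, SP2)
    "c1": [["sec", "title"], ["sec", "fig"]],
    "c2": [["bib", "entry"], ["entry", "author"]],
}
doc = "<article><sec><title>t</title><fig/></sec><p>text</p></article>"
print(classify(tag_sequence(doc, frequent_tags), cluster_patterns))
# ('c1', {'c1': 1.0, 'c2': 0.0})
```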
Relevance in XML search
Keywords : XML Search, User Relevance.
Participant : Anne-Marie Vercoustre.
When searching for information in structured document collections such as XML collections, the use of the structure is expected to help in two ways:
-
specifying more precise queries
-
identifying specific and relevant parts of documents instead of full documents.
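For instance, a content-and-structure query in the style of NEXI, the query language used at INEX, can constrain both where the query terms must occur and which kind of element should be returned (the element names below are only illustrative):

//article[about(.//bib, XML retrieval)]//sec[about(., relevance assessment)]

which asks for sec elements about relevance assessment, taken from articles whose bibliography is about XML retrieval.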
In the context of INEX (an international initiative for the evaluation of XML search), we are interested in evaluating what granularity of elements users find relevant. We first analysed the relevance assessments to identify the types of highly relevant elements, identified three retrieval scenarios (Original, General and Specific), and compared the performance of three different systems (a native XML database, a full-text retrieval engine and a hybrid system) for these different scenarios [39]. We then developed a novel retrieval heuristic that dynamically determines the preferable units of retrieval, called Coherent Retrieval Elements [21].
In [40], we analyse and compare the assessors' judgements on the relevance of returned document components with the users' behaviour when interacting with components of XML documents. By analysing the level of agreement between the assessor and the users, we show that the highest level of agreement is on highly relevant and on non-relevant document components, suggesting that only the end points of the INEX 10-point relevance scale are perceived in the same way by both the assessor and the users.