Section: New Results
Keywords : Classification, Clustering, Data Mining, Entity Ranking, INEX, Mining Complex data, XML Document, XML mining, ontology construction, Context-Aware Information Retrieval.
Document Mining and Information Retrieval
Entity Ranking
Keywords : Entity ranking, named entities, linkrank, categories.
Participants : Anne-Marie Vercoustre, Jovan Pehcevski, James Thom.
The goal of entity ranking is to retrieve entities as answers to a query. The objective is no longer to tag the names of the entities in documents (in batch mode) but rather to return a list of the relevant entity names, and possibly a page or some description associated with each entity. We have developed a system for entity ranking in Wikipedia that addresses two specific tasks: one where the category of the expected entity answers is provided, and one where a few (two or three) examples of the expected entity answers are provided.
In our approach, candidate pages are ranked by combining three scores: a linkrank score, a category score, and the initial search engine similarity score. The architecture of our system provides a general framework for evaluating entity ranking: individual modules can be replaced by more advanced ones, and alternative score functions or different combinations of them can be evaluated [128]. We also experimented with different measures of category similarity between the categories of the entity examples and those of the potential entity answers. An evaluation module assists in tuning the parameters of the system and in evaluating the entity ranking approach as a whole.
The current system has been developed in the context of the INEX (Initiative for the Evaluation of XML Retrieval) track on Entity Ranking.
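The score combination described above can be illustrated with a minimal sketch. It assumes a weighted linear mix of the three scores, which is only one plausible way to combine them; the weights `alpha` and `beta`, the `Candidate` type and the function names are hypothetical and not taken from the system itself.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate entity page with its three component scores (hypothetical type)."""
    title: str
    linkrank: float    # score derived from links to/from the page
    category: float    # similarity between page categories and target categories
    similarity: float  # initial search engine similarity score


def combined_score(c: Candidate, alpha: float = 0.2, beta: float = 0.6) -> float:
    """Weighted linear mix of the three scores (an assumed combination scheme).

    alpha weights the linkrank score, beta the category score, and the
    remaining mass (1 - alpha - beta) the search engine score.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    return alpha * c.linkrank + beta * c.category + (1 - alpha - beta) * c.similarity


def rank(candidates, alpha: float = 0.2, beta: float = 0.6):
    """Return candidates sorted by decreasing combined score."""
    return sorted(candidates, key=lambda c: combined_score(c, alpha, beta), reverse=True)
```

Tuning then amounts to sweeping `alpha` and `beta` over a grid and keeping the pair that gives the best evaluation measure on training topics.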
Web HTML Pages Clustering For Ontology Construction
Keywords : ontology construction, clustering, Web pages.
Participant : Marie-Aude Aufaure.
We proposed an approach for ontology construction from Web pages that is based on a contextual and incremental clustering of terms. Our approach defines and evaluates a context-based clustering algorithm for ontology learning, included in a global architecture for knowledge discovery for the semantic Web [107]. This algorithm is based on an incremental use of the partitioning K-means algorithm and is guided by a structural context, which is derived from the HTML structure and the location of words in the documents. This contextual representation helps the clustering algorithm delimit the context of each word by improving the word weighting, the similarity of word pairs, and the selection of the semantically closest co-occurring words for each word. Our algorithm refines the context of each word cluster and improves the conceptual quality of the resulting clusters and, consequently, of the extracted concepts. This year, we defined a set of criteria for evaluating the ontological concepts. We also ran the contextual clustering algorithm on a corpus of HTML documents (in French) related to the tourism domain and evaluated the ontological concepts it extracted. The results show that an appropriate context definition and the successive refinements of clusters improve the relevance of the extracted concepts in comparison with a simple K-means algorithm. Our evaluation of ontological concepts can be applied to any domain and provides qualitative and quantitative criteria.
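To make the idea of a structural context more concrete, the sketch below covers only the word-weighting step: each word occurrence is weighted according to the HTML element that encloses it. The tag weights, the class name `ContextWeigher` and the function `term_weights` are illustrative assumptions; the actual weighting scheme and the incremental K-means step are not reproduced here.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: words in headings count more than words in body text.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5, "p": 1.0}
DEFAULT_WEIGHT = 1.0  # assumed weight for any tag not listed above


class ContextWeigher(HTMLParser):
    """Accumulates a weight per word based on the enclosing HTML tags."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags, outermost first
        self.weights = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the strongest weight among all enclosing tags.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
            default=DEFAULT_WEIGHT,
        )
        for word in data.lower().split():
            self.weights[word] += weight


def term_weights(html: str) -> dict:
    """Return a word -> weight mapping for one HTML document."""
    parser = ContextWeigher()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)
```

Weighted term vectors of this kind could then serve as input to a K-means step, with clusters refined over successive runs.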
Semantic and Conceptual Context-Aware Information Retrieval
Keywords : formal concept analysis, semantics, contextual information retrieval.
Participant : Marie-Aude Aufaure.
In this work, we define an information retrieval methodology that uses Formal Concept Analysis in conjunction with semantics to provide contextual answers to Web queries. The conceptual context can be either global (i.e. stable) or instantaneous (i.e. bounded by the global context). Our methodology consists first of a preprocessing step that provides the global conceptual context, and then of an online contextual processing of user requests, each associated with an instantaneous context. The preprocessing step computes a conceptual lattice offline from the data sources in order to build an overall conceptual context. Information retrieval is then performed in real time: users formulate their query with terms from the thesaurus/ontology, and may then navigate within the lattice by generalizing or, on the contrary, refining their query. This year, we defined a similarity measure to find the closest concepts starting from an entry point of the lattice, in order to help the user navigate (master's thesis of Saoussen Sakji, University Paris-Dauphine). Our information retrieval process was illustrated through experiments in the tourism domain. One benefit of our approach is that it yields more relevant and refined information retrieval, closer to the users' expectations. We add a semantic layer to the conceptual and data layers. The similarity measure helps the user navigate through large lattices by ranking the neighbouring concepts. This method is generic and can be applied to any heterogeneous data sources (Web data, personal data, etc.).
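As an illustration of the lattice-based navigation described above, the following sketch computes the formal concepts of a small context (pages described by terms) and ranks the possible refinements of an entry concept. The similarity used here, a weighted Jaccard overlap of extents and intents, is an assumption chosen for simplicity and is not the measure defined in this work; all function names are hypothetical.

```python
def concepts(context):
    """Return all formal concepts (extent, intent) of a small formal context.

    `context` maps each object (e.g. a page) to its set of attributes (terms).
    Intents are obtained as all intersections of object attribute sets.
    """
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    for attrs in context.values():
        intents |= {frozenset(attrs) & i for i in intents}
    return [
        (frozenset(o for o, a in context.items() if i <= a), i)
        for i in intents
    ]


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity(c1, c2, w=0.5):
    """Assumed similarity: weighted Jaccard of extents and of intents."""
    return w * jaccard(c1[0], c2[0]) + (1 - w) * jaccard(c1[1], c2[1])


def ranked_refinements(entry, lattice):
    """Concepts more specific than `entry`, most similar first.

    Concepts with an empty extent are skipped, so that only refinements
    matching at least one existing object are proposed.
    """
    subs = [c for c in lattice if c[1] > entry[1] and c[0]]
    return sorted(subs, key=lambda c: similarity(entry, c), reverse=True)
```

Generalization can be handled symmetrically by selecting concepts whose intent is a strict subset of the entry concept's intent.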
This method has several advantages:
- Results are provided according to both the context of the query and the context of available data. For example, only query refinements corresponding to existing tourism pages are proposed;
- The added semantics can be chosen depending on the target user(s);
- More powerful semantics can be used, in particular ontologies. This allows enhanced query formulation and provides more relevant results.
In this work, we collaborate closely with Bénédicte Le Grand and Michel Soto, from the LIP6 laboratory.