Team ACACIA


Section: New Results

Keywords : Knowledge Acquisition, Knowledge Engineering, Knowledge Management, Corporate Memory, Knowledge Server, Semantic Web, Semantic Web Server, XML, RDF, OWL, Conceptual Graphs, Ontology, Information Retrieval.

Information Retrieval in a Corporate Semantic Web

We study the problems involved in the dissemination of knowledge through a knowledge server via an intranet or the Internet: we consider the Web, and in particular the semantic Web, as a privileged means of assisting the management of knowledge distributed within a firm or between firms. A knowledge server allows searching for information in a heterogeneous corporate memory, the search being intelligently guided by knowledge models or ontologies. It also allows the proactive dissemination of information by intelligent agents. We look further into the case of a memory materialized as a corporate semantic Web, i.e. as a set of resources (such as documents) semantically annotated by RDF statements relative to an ontology.

Corese Semantic Web Engine

Participants : Virginie Bottollier, Olivier Corby [ resp. ] .

Corese relies on the conceptual graph model to implement RDF/S. A new version of the Corese conceptual graph projection algorithm has been designed. It takes advantage of a static graph index to project upon. Both the graph index and the projection algorithm have been extended to n-ary relations, so Corese now processes n-ary relations. An extension of SPARQL for n-ary relations is under design.

The projection algorithm also integrates a smart backtrack. In case of failure on a query edge, the projection backtracks directly to an edge that may resolve the failure, instead of systematically returning to the preceding edge as before. An edge may resolve a failure if it is the latest edge that binds one of the variables of the failing edge.
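A minimal sketch of the idea on a three-edge query (the ex: properties are hypothetical): if the third edge fails, the projection jumps back directly to the first edge, the latest one binding ?x, and skips the second edge, which cannot affect ?x.

prefix ex: <http://www.example.org/ns#>
select * where {
   ?x ex:worksFor ?org .      # edge 1: binds ?x and ?org
   ?org ex:locatedIn ?city .  # edge 2: binds ?city only
   ?x ex:knows ?y .           # edge 3: on failure, backtrack jumps to edge 1
}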

The approximate projection algorithm has been extended to compute and test the similarity score of a subpart of a query. An extension to the SPARQL syntax is available to retrieve the score of a given pattern: score ?s { PATTERN } filter (?s >= .5). Several scores can be computed in one query, and a score of 1 means a perfect match.
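For instance, the following sketch (the foaf: pattern is merely illustrative) retrieves approximate matches of a pattern together with their similarity score, keeping only those scoring at least 0.5:

prefix foaf: <http://xmlns.com/foaf/0.1/>
select ?x ?s where {
   score ?s { ?x foaf:knows ?y . ?y foaf:name ?n }
   filter (?s >= .5)
}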

The query compiler now performs query rewriting: for example, in the case of a ?x = ?y filter, all occurrences of ?y are replaced by ?x in the query edges and filters. A set of such patterns is recognized and rewritten to speed up projection (an example of the ?x = ?y substitution follows the patterns below):

?x <= ?y && ?x >= ?y  →  ?x = ?y

?x = ?y && ?x != ?y  →  false
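As an illustration of the ?x = ?y substitution, a minimal sketch with hypothetical ex: properties: the filter forces ?x and ?y to denote the same node, so all occurrences of ?y are replaced by ?x and the filter is dropped.

# before rewriting
prefix ex: <http://www.example.org/ns#>
select * where {
   ?x ex:author ?a .
   ?y ex:editor ?a .
   filter (?x = ?y)
}

# after rewriting: ?y replaced by ?x, filter removed
select * where {
   ?x ex:author ?a .
   ?x ex:editor ?a
}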

In the case of non-connected edges linked only by a filter testing the equality of function results, as shown below, an optimization is introduced:

select * where {
   ?x rdfs:label ?l1 .
   ?y rdfs:label ?l2 .
   filter (str(?l1) = str(?l2))
}

In that case, the search is optimized by sorting the candidate list and performing a dichotomic (binary) search in the sorted list instead of scanning the whole list of candidates.

A set of further optimizations has been integrated, among which direct access to edges carrying constant values (URIs or literals) and new heuristics to sort the query edges before projection, e.g. preferring edges that have more filters with bound variables.

In addition, Corese has been improved in two main areas: the integration of standards (SPARQL) and the development environment. SPARQL (http://www.w3.org/TR/rdf-sparql-query) is an RDF query language designed by the W3C Data Access Working Group to ease access to RDF stores. Corese is now provided with a new JavaCC parser based on SPARQL, with some additional functionalities (approximate search, path patterns, group by, etc.). SPARQL functionalities such as construct, describe, ask and offset have been included in the core of Corese. The Corese RDF rule language has been adapted accordingly and is now SPARQL compliant.
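As an illustration of these standard forms, a construct query such as the following sketch (the ex: vocabulary is hypothetical) builds a new RDF graph from the matched statements:

prefix dc: <http://purl.org/dc/elements/1.1/>
prefix ex: <http://www.example.org/ns#>
construct { ?doc ex:writtenBy ?person }
where { ?doc dc:creator ?person }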

The development environment has also been improved: it has moved from CVS to Subversion on the Inria forge, and from Java 1.4 to Java 1.5. A user manual is being written, and a major version of the software (containing the new SPARQL parser) is being released and distributed.

The Acacia team received a grant from Inria to hire an engineer to participate in the development of Corese.

A new release of Corese has been prepared and is available at http://www.inria.fr/acacia/soft/corese.

Sewese

Participants : Priscille Durville, Fabien Gandon [ resp. ] .

We are designing and developing a new semantic web portal platform called Sewese. All semantic web applications using a semantic engine (like Corese) share common functionalities that can be factored into a semantic web application development platform. The goal of Sewese is precisely to provide reusable, configurable and extensible components, in order to reduce the time spent developing new semantic web applications and to let these applications focus on their domain specificities. Sewese provides a set of functionalities such as the generation of query interfaces, editing and navigation facilities, and the management of transverse portal functions (presentation, internationalization, security...). An ontology editor, a generic annotation editor and a basic rule editor are part of the Sewese platform.

SweetWiki

Participants : Michel Buffa [ resp. ] , Guillaume Ereteo, Fabien Gandon.

We designed, developed and are testing an innovative wiki engine leveraging semantic web technologies: SweetWiki.

SweetWiki is an example of an application reconciling two trends of the future web: a semantically-augmented web and a social web. It makes heavy use of semantic web concepts and languages to build a semantic wiki, and demonstrates how such paradigms can improve navigation, search and usability. By semantically annotating the resources of the wiki and by reifying the wiki object model itself, SweetWiki provides reasoning and querying capabilities. All the models are defined in OWL schemata capturing the concepts of wikis (wiki word, wiki page, forward and backward links, author, etc.) and the concepts manipulated by the users (user folksonomy, external ontologies). These ontologies are exploited by an embedded semantic search engine (Corese), allowing us to support and ease the lifecycle of the wiki (e.g. restructuring pages), to propose new functionalities (e.g. semantic search, profile-based monitoring) and to allow for extensions (e.g. support for new media in pages, integration of legacy software).
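Since the wiki object model is reified in RDF, the embedded engine can query it directly; the following hedged sketch, assuming a hypothetical sw: namespace for the wiki ontology, retrieves wiki pages with their authors and the pages linking to them (their backward links):

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix sw: <http://www.example.org/sweetwiki#>
select ?page ?author ?linking where {
   ?page rdf:type sw:WikiPage .
   ?page sw:author ?author .
   ?linking sw:linksTo ?page    # pages pointing to ?page: its backward links
}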

Relying on semantic web technologies requires paying attention to usability. In SweetWiki we paid special attention to preserving the essence of wikis: their simplicity and their social dimension. Thus SweetWiki supports all the common wiki features such as easy page linking using WikiWords, versioning, etc., but also innovates by integrating a WYSIWYG editor extended with social tagging functionalities that mask the OWL-based annotation implementation. Users can freely enter tags, and an auto-completion mechanism proposes existing ones by issuing queries to identify existing concepts with compatible labels. Tagging is thus both easy and motivating (the number of related pages is displayed in real time), and concepts are collected as in folksonomies. Wiki pages are served directly in XHTML embedding semantic annotations ready to be reused by other semantic web agents [30], [29], [28], [23].

SweetWiki is one of the tools used in the Palette project and it serves as a collaborative tool for the e-WOK_HUB consortium. An online version is available for public testing at http://argentera.inria.fr:8080/wiki.

Edccaeteras

Participants : Fabien Gandon [ resp. ] , Alain Giboin, Yann-Vigile Hoareau.

Intuitively, we are all inclined to say that the concept of ``car'' is closer to the concept of ``truck'' than to the concept of ``plane''; however, we also think that the concept of ``car'' is closer to the concept of ``plane'' than to the concept of ``book''. In information systems, it is important to be able to simulate these ``intuitive distances'' in order to improve the way search engines select relevant results (e.g. to control the constraint relaxation of a query) and sort them (e.g. to cluster answers).

The current work in the Edccaeteras COLOR action focuses on similarity measurement applied to information search and indexing. The objective so far has been to compare two methods of similarity measurement. The first method is based on distances defined over the graph structure of the ontology in Sewese. The second method is based on the similarity of words and documents represented as vectors in semantic spaces, following Latent Semantic Analysis (LSA) [70]. An experiment providing natural queries is under development to evaluate the ability of each method to propose relevant answers for information retrieval. Using LSA, we have created different semantic spaces from corpora composed of the titles and abstracts of INRIA research reports: two spaces in French and two in English, with a lemmatised version of each, plus one semantic space whose corpus was enriched with information corresponding to the sub-category relations of the ACM98 thesaurus.

In a first experiment, we calculate the similarity between the vector corresponding to the abstract of a report and the vectors corresponding to keyword lists, evaluating the hypothesis that the target keyword list should be more similar to the tested abstract than the other keyword lists.
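In these experiments, the LSA similarity between two texts is classically the cosine of the angle between their vectors in the semantic space; writing $a$ for the abstract vector and $k$ for a keyword-list vector:

$ sim(a,k) = \frac{a \cdot k}{\|a\| \, \|k\|} $

A score of 1 indicates collinear vectors (maximal similarity), while a score close to 0 indicates unrelated texts.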

In a second experiment, we calculate the similarity between a keyword list and all the abstracts composing the database (over 4000), evaluating the hypothesis that the tested keyword list should be more similar to its associated abstract than to the other abstracts.

In a third experiment, we calculate the similarity between an abstract and a set of keyword lists, as well as between an abstract and a set of five hundred individual keywords, evaluating the hypothesis that the keywords that actually correspond to the tested abstract should be more similar to it than the others.

The comparison of the results for the different spaces makes it possible (a) to evaluate the effect of lemmatization, (b) to evaluate the effect of the subsumption information added to the corpora and (c) to optimize the parameters of the spaces in both French and English, so as to make it possible to build a French-English semantic space [35], [27], [18].

Web Mining for Technological and Scientific Watch

Keywords : Multi Agent System, Corporate Memory, Semantic Web, Web Mining, Ontology, Semantic Annotations, Technological Watch, Technological Monitoring.

Participants : Tuan-Dung Cao, Rose Dieng-Kuntz.

This work was performed in the context of the PhD of Tuan-Dung Cao, in cooperation with the CSTB (Scientific and Technical Centre for Building) [31], [21]. The main objective of this thesis was to use Semantic Web technologies, in particular ontologies, to develop a technology monitoring system (OntoWatch). This system is guided by ontologies in order to collect, capture, filter, classify and structure Web content coming from several information sources, in a scenario of assistance to technological and scientific watch.

First, we modeled the CSTB's technological watch process, relying on the generic monitoring model proposed by Lesca. We identified the potential contributions of ontologies at the various stages of the process, then built an ontology dedicated to the technological watch system: O'Watch integrates part of an existing ontology (O'CoMMA) and vocabularies offered by thesauri of the CSTB domain. We analyzed the problems linked to the integration of a thesaurus into an ontology: this integration cannot be automatic, since the hierarchical links in the CSTB thesaurus do not necessarily correspond to ontological subsumption links.

We then proposed several algorithms that use an ontology to improve document search on the Web and to automatically generate semantic annotations in RDF for these documents. The first two algorithms use the branches of the ontology to generate system queries and send them to Google. The third algorithm is based on a balanced choice among the descendants of the original concepts of the user query in order to form a system query [31]. The generated annotations feed the annotation bases of the system, on which the semantic search for information relies. These annotation bases can be organized according to various criteria: type of document, type of information source, etc.
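Once such annotation bases are populated, the semantic search amounts to querying them; a hedged sketch, assuming a hypothetical ow: namespace for the O'Watch ontology, could retrieve documents of a given type coming from a given type of source:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ow: <http://www.example.org/owatch#>
select ?doc where {
   ?doc rdf:type ow:TechnicalReport .    # criterion: type of document
   ?doc ow:source ?src .
   ?src rdf:type ow:ScientificJournal    # criterion: type of information source
}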

Finally, we proposed a multi-agent architecture to implement the OntoWatch system. Among the three agent sub-societies, respectively dedicated to ontologies, to semantic search, and to the search and automatic annotation of documents on the Web, we focused in particular on the design of the last one.

