Section: New Results
Keywords: Corporate Memory, Corporate Semantic Web, Knowledge Acquisition, Knowledge Management, Knowledge Engineering, Ontology, Assistance to the User, Semantic Web, Semantic Annotation, Language Technology, Knowledge Acquisition from Texts, Natural Language Processing, Text-mining, Co-operation, Information Extraction, Evolution.
Annotation of Information Resources
The objective of this research direction is to propose (1) a methodological guide for a collaborative, semantic-Web-based annotation process in a community of practice; and (2) an ontology-based, service-oriented annotation toolkit offering a service for semi-automatic annotation of textual documents, as well as a service for collaborative annotation and for managing the evolution of annotations. The methodological guide and the toolkit will tackle complex contextualization of annotations, various kinds of Web-accessible external resources, reflective annotations, and more complex types of heterogeneous resources and services.
Management of Corporate Semantic Web Evolution
This work is being carried out within the framework of Phuc-Hiep Luong's PhD thesis [Oops!], which aims at solving problems related to the life cycle and evolution of a Corporate Semantic Web (CSW): the evolution of each component of a CSW (resources, ontologies and semantic annotations), as well as the evolution of the relations among these components.
We focused on ontology evolution, on its influence on the semantic annotations expressed with the vocabulary provided by the underlying ontology, and on the evolution of these semantic annotations. We proposed a new approach for the evolution management of a CSW and concentrated on two main scenarios of ontology evolution: (i) with a trace and (ii) without a trace of the ontology changes carried out during its evolution [Oops!].
For these two scenarios, we proposed respectively a procedural approach and a rule-based approach to manage the evolution of semantic annotations, and in particular to detect inconsistent annotations and guide the process of resolving these inconsistencies. We established a set of ontology changes and, for each ontology change operation, the set of possible solutions, allowing users to select an appropriate way to repair inconsistencies in the ontology or in the semantic annotations.
To detect and automatically correct inconsistent semantic annotations, our rule-based approach implements a set of inconsistency detection rules: they make it possible to find different kinds of annotation inconsistencies (i.e. inconsistencies on concepts, properties, domains, ranges or datatypes) with respect to the new ontology. The detected inconsistencies are then solved with the help of correction rules and resolution strategies for semantic annotations. These propositions were implemented and validated in the CoSWEM (Corporate Semantic Web Evolution Management) system. CoSWEM facilitates the evolution management of a CSW and, in particular, manages the propagation of ontology changes to the semantic annotations depending on this ontology [Oops!], [Oops!].
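The detection-then-correction cycle described above can be sketched as follows. This is a minimal, hypothetical illustration: the data structures, rule set and change log below are invented stand-ins for CoSWEM's actual rules.

```python
# Toy sketch of rule-based annotation inconsistency management
# (illustrative only; CoSWEM's real rules operate on RDF/S data).

def detect_inconsistencies(annotations, new_ontology):
    """Flag annotations referencing concepts or properties that are
    no longer present in the new version of the ontology."""
    issues = []
    for ann in annotations:
        if ann["concept"] not in new_ontology["concepts"]:
            issues.append((ann, "inconsistent concept"))
        elif ann.get("property") and ann["property"] not in new_ontology["properties"]:
            issues.append((ann, "inconsistent property"))
    return issues

def correct(issues, change_log):
    """Apply a correction rule: if the change log (the 'trace' scenario)
    records a rename, rewrite the annotation; otherwise leave it for
    manual resolution by the user."""
    corrected, manual = [], []
    for ann, reason in issues:
        renamed = change_log.get(ann["concept"])
        if reason == "inconsistent concept" and renamed:
            corrected.append(dict(ann, concept=renamed))
        else:
            manual.append(ann)
    return corrected, manual

new_onto = {"concepts": {"Person", "Organization"}, "properties": {"worksFor"}}
anns = [{"doc": "d1", "concept": "Employee"},      # concept renamed to Person
        {"doc": "d2", "concept": "Organization"}]  # still consistent
issues = detect_inconsistencies(anns, new_onto)
fixed, manual = correct(issues, {"Employee": "Person"})
```

With a change trace available, the rename is resolved automatically; without one, the annotation is queued for user-guided resolution, mirroring the two scenarios above.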
This system has been developed as a Web-based application integrating the semantic search engine Corese and the Sewese library, which is dedicated to the manipulation of semantic data (e.g. querying, modifying, updating) in ontology and annotation bases. CoSWEM makes it possible to carry out, automatically or semi-automatically, tasks such as the comparison of different ontologies and the detection and correction of inconsistent semantic annotations. The system was also tested within the framework of the IST project Palette and of the ANR RNTL project e-WOK_HUB, on a set of real, evolving data from these projects.
Word Sense Disambiguation
Participant: Khaled Khelif.
To improve our term detection method [Oops!], [Oops!], we proposed a technique to solve the ambiguity problem affecting MeatAnnot results. The main idea is to use the context of an ambiguous word to decide which semantic type to assign to it. This context consists of the set of terms that occur with the ambiguous word in the same sentence or in the same paragraph.
So, if MeatAnnot assigns several semantic types to the same candidate term, the disambiguation module tries to find the right one. It computes similarities between the semantic types assigned to the ambiguous word and the semantic types assigned to its neighbours in the text, and selects the semantic type with the highest similarity. The similarity between semantic types is based on the Corese semantic distance.
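The selection step can be illustrated with a toy taxonomy. The types, hierarchy and path-based distance below are invented stand-ins for UMLS semantic types and the Corese semantic distance; only the overall scheme (pick the candidate type closest to the neighbours' types) follows the description above.

```python
# Illustrative context-based disambiguation; taxonomy and distance are toys.

TAXONOMY = {            # child -> parent
    "Gene": "BioEntity", "Protein": "BioEntity",
    "Disease": "Condition", "BioEntity": "Entity", "Condition": "Entity",
}

def path_to_root(t):
    path = [t]
    while t in TAXONOMY:
        t = TAXONOMY[t]
        path.append(t)
    return path

def distance(a, b):
    """Edges to the closest common ancestor (toy semantic distance)."""
    pa, pb = path_to_root(a), path_to_root(b)
    for i, node in enumerate(pa):
        if node in pb:
            return i + pb.index(node)
    return len(pa) + len(pb)

def disambiguate(candidate_types, context_types):
    """Pick the candidate type with the smallest total distance to the
    types assigned to the neighbouring terms (highest similarity)."""
    return min(candidate_types,
               key=lambda t: sum(distance(t, c) for c in context_types))

# "p53" could be typed Gene or Protein; its neighbours were typed
# Gene and Disease, so Gene wins.
best = disambiguate(["Gene", "Protein"], ["Gene", "Disease"])
```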
The algorithm was tested on the WSD (Word Sense Disambiguation) test corpus, a standard collection for evaluating disambiguation methods, and gave good results.
This work is applied to the Sealife IST project.
Semantic Annotation of Patents
Our approach consists in generating semantic annotations by structuring patent documents and building a semantic representation of them. The generated annotation comprises three parts: a structure annotation, a metadata annotation and a domain-based annotation. These annotations are merged into the so-called Patent Semantic Annotation.
To structure these annotations, we designed and implemented a modular ontology called PatOnto describing the different aspects we took into account: the structure and the content. The domain ontology is an existing biomedical ontology (UMLS), used to analyse the textual content of a patent.
User profile detection
One of the use cases of the IST project SeaLife consists in linking information on biomedical Websites to appropriate secondary knowledge (existing ontologies/terminologies, RSS feeds). This case study will demonstrate how to provide users with additional information on the resources they are viewing on biomedical Websites, using a semantic mapping to appropriate online portals and databases (called targets). For this purpose, the SeaLife browser must recognize the user's profile in order to select the appropriate ontology and targets.
For this use case, we proposed a generic, domain-independent approach to detecting Web user profiles. The generated profiles can be used for Web browsing personalization, document recommendation, discovery of professional activities, etc. We implemented this approach in a system called SUPROD and tested it on the biomedical domain through the NeLI (National Electronic Library of Infection) Web site.
We developed (i) a biomedical profile ontology describing a classification of biologists and doctors, (ii) a generic log analyser, (iii) a profile classifier algorithm and (iv) a profile detector algorithm [Oops!], [Oops!].
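The interplay of the log analyser and the profile classifier can be sketched as below. This is a hypothetical, keyword-based stand-in: the profile names, keyword lists and scoring are invented, not SUPROD's actual algorithm.

```python
# Hypothetical sketch of log-based profile detection (illustrative only).

from collections import Counter

# Invented profile ontology fragment: profiles with indicative terms.
PROFILE_KEYWORDS = {
    "clinician": {"diagnosis", "treatment", "guideline", "patient"},
    "biologist": {"sequence", "protein", "assay", "pathway"},
}

def detect_profile(visited_page_terms):
    """Score each profile by keyword overlap with the terms of the pages
    a user visited (as extracted by a log analyser); return the best match."""
    scores = Counter({p: 0 for p in PROFILE_KEYWORDS})
    for terms in visited_page_terms:
        for profile, kws in PROFILE_KEYWORDS.items():
            scores[profile] += len(kws & set(terms))
    return scores.most_common(1)[0][0]

# Terms extracted from two pages of a user's browsing session.
profile = detect_profile([
    {"patient", "treatment", "dosage"},
    {"guideline", "diagnosis"},
])
```

A detected profile would then drive the choice of ontology and targets in the browser, as described above.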
Semantic Annotation Generation and Use: GeneRif corpus
We used MeatAnnot [Oops!], an automatic system for the generation of ontology-based semantic annotations, to annotate the GeneRIF (Gene Reference Into Function) corpus. GeneRIF documents consist of concise phrases, limited to 255 characters in length, each describing a function related to a specific gene and supported by at least one PubMed ID. Currently, there are 214,354 GeneRIFs describing 36,052 genes. For each GeneRIF, we generated a structured semantic annotation based on three subontologies: (1) UMLS (Unified Medical Language System) to describe the biomedical domain, (2) DocOnto to describe metadata about scientific articles and their structure, and to link documents to UMLS concepts, and (3) GO (Gene Ontology) to add knowledge about human genes.
We developed an interface which allows navigation in the GeneRIF corpus. The search module is based on Corese. Using SPARQL (the standard RDF query language, implemented by Corese), the interface allows users to perform searches and inferences on the annotation base to retrieve relevant information. Users can find different kinds of information about a gene:
(a) Its synonyms.
(b) GO concepts it is attached to (such as carbohydrate binding, ATPase activity...).
(c) Documents about this gene.
(d) All entities it is associated to in the corpus.
We use the Gene Ontology to retrieve (a) and (b), and the annotations generated on the GeneRIF documents with MeatAnnot to extract (c) and (d).
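The kind of lookup the interface performs can be illustrated with a miniature triple store. The gene, predicate names and document identifier below are invented for illustration; the actual system issues SPARQL queries against the Corese annotation base.

```python
# Minimal stand-in for querying the annotation base (illustrative data).

TRIPLES = [
    ("BRCA1", "hasSynonym", "RNF53"),
    ("BRCA1", "attachedTo", "DNA repair"),       # a GO concept
    ("BRCA1", "mentionedIn", "generif:1042"),    # (c) documents about the gene
    ("BRCA1", "associatedWith", "p53"),          # (d) associated entities
]

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern (None acts as a wildcard),
    mimicking a basic SPARQL triple pattern."""
    return [t for t in TRIPLES
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

synonyms = [o for _, _, o in query("BRCA1", "hasSynonym")]       # (a)
documents = [o for _, _, o in query("BRCA1", "mentionedIn")]     # (c)
```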
The information in (d) is presented as a hypergraph, a more interactive way to present it. Users can click on an edge of the hypergraph to see in which documents the relation between the two entities appears. They can also click on an entity of the graph to get more information about it (which generates a new query). Another graph, the "Gene Cart", shows a group of genes the user is interested in and the interactions between these genes of interest; it can be considered as the accumulated result of all the searches the user has performed.
A filter will be added to the graphs to allow users to choose which relations and/or which types of entities they would like to keep in the graphs. For example, if a user wants to know which entities affect the gene g, he or she can explicitly ask to hide all relations different from "e affects g" (using a check box, for example).
This work is applied in the ImmunoSearch project of the P.A.S.S. (Parfums, Arômes, Senteurs, Saveurs) competitiveness cluster.
Annotation Processing for Earth Sciences
Some work has been done with e-WOK_HUB project partners to identify significant terms for the geology and CO2 domains from a set of geological articles. One objective of this work is to specify and implement a system able to extract and generate significant annotations from geological and CO2-related articles in a semi-automatic way.
Document repository annotation
The goal of the text annotation process is twofold:
highlight the important concepts and relations contained in the document text with respect to one or several domain ontologies;
mark up the text according to these ontologies with accurate annotations, so that it can be retrieved by querying a semantic search engine, namely Corese.
We have extended our Document ontology [Oops!] into the so-called DocumentContents ontology [Oops!], dedicated to supporting the text annotation process. The main modification consists in separating the document (considered as an object which may contain text, audio and video materials) from the textual content itself to be analysed. The DocumentContents ontology formalizes both structural and content information. The original Document ontology is extended according to the document genre (norm, assembly instructions, etc.).
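The document/content separation can be sketched as a simple class model. The class and attribute names below are illustrative, not the actual DocumentContents vocabulary.

```python
# Sketch of the separation introduced in DocumentContents: the document
# is a container of heterogeneous materials, while the text content to
# be analysed is modelled separately (names are illustrative).

from dataclasses import dataclass, field

@dataclass
class TextContent:
    text: str
    sections: list = field(default_factory=list)   # structural information

@dataclass
class Document:
    title: str
    genre: str                  # e.g. "norm", "assembly instructions"
    materials: list = field(default_factory=list)  # text, audio, video...

doc = Document(title="Assembly manual", genre="assembly instructions")
doc.materials.append(TextContent(text="Screw the bolt into the frame."))
```

Keeping the genre on the document lets the annotator select a genre-specific analysis (verbs for instructions, named entities for norms) while the NLP tools only ever see the TextContent.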
The Natural Language Processing (NLP) analysis of the documents strongly depends on the available knowledge resources. In order to feed the SevenPro corporate repository, and to retrieve the right document for a given engineering activity, a precise study of the end-users' ontology requirements was carried out. Additional knowledge resources (named entities, black lists, white lists, etc.) are necessary. We have emphasized the need for dedicated concepts/properties within the end-user ontology, in order to successfully map the terms/verbs present in a document onto the concepts/properties of that ontology, and consequently to generate accurate annotations for the document. This task led us to refine the end-user ontologies.
Our annotator is based on GATE, an open-source platform for language technology developed by the University of Sheffield. We developed wrappers for two term extractors, FASTR and ACABIT, in order to integrate them into GATE.
We have tested annotation generation on sentences coming from real-world end-user texts. A number of features were designed to generate correct annotations from different grammatical patterns, including sentences containing subordinate phrases. Rather than aiming for exhaustive coverage, we intend to progressively increase the complexity of the sentences for which the text annotator can extract accurate annotations.
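The idea of generating annotations from grammatical patterns can be sketched with a single subject-verb-object pattern. The pattern, the verb-to-property mapping and the sentence below are invented; the actual annotator relies on GATE pipelines and much richer grammars.

```python
# Toy sketch of pattern-based annotation generation (illustrative only).

import re

# Invented mapping from verbs found in text to ontology properties.
VERB_TO_PROPERTY = {"contains": "hasPart", "supports": "supports"}

def annotate(sentence):
    """Extract a (subject, property, object) annotation from a simple
    'The X <verb> the Y' sentence, if the verb maps to a property."""
    m = re.match(r"The (\w+) (\w+) the (\w+)", sentence)
    if not m:
        return None
    subj, verb, obj = m.groups()
    prop = VERB_TO_PROPERTY.get(verb)
    return (subj, prop, obj) if prop else None

triple = annotate("The chassis contains the gearbox")
```

Handling subordinate phrases amounts to adding further patterns of this kind, each producing additional triples.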
A precise evaluation will follow, together with continued work on handling complex sentences. In particular, a future test phase with real-world end-user texts will show that text annotation capabilities are sensitive to the text genre (i.e., the kind of document). Indeed, depending on the kind of document to annotate, different information elements matter. For assembly and product usage instructions, verbs matter first (e.g., screw the bolt, open passenger car window, start engine). For standard and norm documents, named entities are important. For test reports, contracts and commercial offers, quantities and figures matter. The first release of the text annotator will be tested next year by end-users on real-world documents. Finally, we will explore the possibility of annotating Spanish and Italian texts: the process remains identical, but we have to find efficient NLP parsers for these two languages.
Finally, we proposed a solution based on statistical computations for the automatic generation of semantic annotations from the texts associated with an image.
Extraction and Exploitation of Contextual, Evolving Semantic Annotations for a Virtual Community
This work is carried out within the framework of Noureddine Mokhtari's PhD.
The aim of this work is to propose a system for the automatic extraction and exploitation of contextual semantic annotations from text, based on semantic Web technologies.
We started with a state of the art on knowledge engineering, the semantic Web (RDF/S, ontologies) and the various uses and definitions of context. We consider the context of an annotation to be the set of semantic and structural relationships (spatial, temporal and others) between annotations. We then concentrated on approaches for knowledge extraction from texts, and compared several statistical and linguistic natural language processing (NLP) tools, with the aim of proposing a method for extracting semantic annotations (concepts and relationships of a reference domain ontology) from texts. Next, we proposed and implemented algorithms for the automatic extraction of text structure, in order to identify contextual structural relationships (successor of, proximity, belonging) between annotations. In addition, we propose to exploit contextual semantic relationships: for example, spatial relations (e.g. under, beside), temporal relations (e.g. since, during) and more complex semantic relations such as rhetorical relations (e.g. moreover, in fact, so that, however, except).
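The structural part of this extraction can be sketched from annotation positions alone. The relation names follow the ones listed above; the data, the sentence-index representation and the window parameter are invented for illustration.

```python
# Illustrative extraction of structural contextual relations between
# annotations, from their positions in the text (toy implementation).

def structural_relations(annotations, proximity_window=1):
    """annotations: list of (label, sentence_index), sorted by position.
    Emit 'successor_of' between consecutive annotations, and 'proximity'
    when two consecutive annotations fall within the sentence window."""
    relations = []
    for (a, ia), (b, ib) in zip(annotations, annotations[1:]):
        relations.append((b, "successor_of", a))
        if abs(ib - ia) <= proximity_window:
            relations.append((a, "proximity", b))
    return relations

# Three annotations extracted from a (fictitious) geology abstract.
anns = [("reservoir", 0), ("porosity", 0), ("CO2 storage", 3)]
rels = structural_relations(anns)
```

Semantic contextual relations (spatial, temporal, rhetorical) would be produced by a separate step that matches marker words such as "under", "since" or "however" between the annotated spans.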
A first experiment with our approach on concept extraction from text gave good scores (89.82% precision). In addition, we obtained 75% and 69.71% precision for the identification of contextual semantic relationships and of their arguments, respectively. Precision corresponds to the rate of correctly extracted entities among all extracted entities.
The originality of this work consists in (a) the integration of the context of semantic annotations, which enables new ways of reasoning and provides more information based on both the structure and the semantics of the text, and (b) the use of several technologies, such as NLP tools (GATE, JAPE, TreeTagger), semantic annotations, knowledge representation and semantic Web technologies (RDF/S, OWL, ontologies, SPARQL, XQuery, Corese), to build a system for the automatic extraction and exploitation of contextual annotations from texts.
Semantic Annotation of a Community Mailing List
We proposed a methodology for the semi-automatic creation of an ontology, along with the corresponding annotation base, extracted from the mailing list of a community dedicated to computer assistance. This study raises original issues that are unusual for NLP techniques, because it starts from an e-mail corpus: one such issue is how to deal with texts that are deliberately informal and often grammatically incorrect. The annotation extraction process applied to this mailing list will feed a list of frequently asked questions (FAQ) with typical questions and answers according to the domain knowledge.
In order to build the ontology specific to a community of practice exchanging mails about problems encountered with ICT tools, we applied term extraction tools (ACABIT, FASTR and LIKES) to the corpus of their e-mails. We then designed a modular ontology containing four modules: the OeMail ontology (which describes a MIME message), the Problem ontology, the Component ontology designed from the Webopedia online hierarchy (OntoPedia), and the O'CoP ontology. The Problem ontology was bootstrapped from terms extracted from the mails using linguistic tools, such terms revealing component problems.
We also studied the automatic attachment of new problem terms to the top level of the Problem ontology (Hardware Problem, Software Problem, etc.).
The messages in the e-mail corpus were then automatically annotated with the ontology. To extend this work to future messages of the community members, we developed an application which monitors an e-mail server, detects new messages, and automatically generates the corresponding annotations.
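The annotation step applied to each incoming message can be sketched as a term lookup against the ontology modules. The term lists and concept names below are invented; the deployed application watches a mail server and runs this kind of matching on every new message body.

```python
# Hypothetical sketch of annotating a new mail message against the
# Problem and Component ontologies (terms and concepts are invented).

PROBLEM_TERMS = {"crash": "SoftwareProblem", "overheat": "HardwareProblem"}
COMPONENT_TERMS = {"printer": "Printer", "driver": "Driver"}

def annotate_message(body):
    """Return the ontology concepts detected in a new message body."""
    words = body.lower().split()
    annotation = {"problems": set(), "components": set()}
    for w in words:
        if w in PROBLEM_TERMS:
            annotation["problems"].add(PROBLEM_TERMS[w])
        if w in COMPONENT_TERMS:
            annotation["components"].add(COMPONENT_TERMS[w])
    return annotation

ann = annotate_message("My printer driver makes the application crash")
```

The resulting annotation is what the semantic search engine later queries to match a new question against previously answered mails.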
Finally, we developed a Web-based GUI allowing the members of the community to navigate the Problem and Component ontologies in a hyperbolic view, and to query the semantic search engine Corese in order to retrieve the mails annotated by the chosen concepts, together with their replies [Oops!], [Oops!].
Text and Data Mining for Knowledge Management
Participant: Martine Collard.
This work is carried out in the framework of Martine Collard's visit to INRIA. It builds on her previous work in data mining, devoted to model discovery, model evaluation and interpretation. Traditional data mining algorithms are applied to numeric or categorical structured data, but current trends extend research and applications to semi-structured and unstructured data, such as textual sources and other non-organized sources of explicit or implicit knowledge.
For instance, in biology, the numerical expression levels of genes in a DNA microarray experiment cannot be interpreted without considering heterogeneous sources of information from the biological domain, such as ontologies, scientific publications or the results of similar experiments. Among the research interests of the Edelweiss team, two are closely related to Martine Collard's work:
on the one hand, the objective of providing solutions for knowledge representation, and particularly for ontology management;
on the other hand, the interest in text mining: the participation of Edelweiss in European projects such as Sealife and SevenPro, and in the ImmunoSearch project, is partly related to this topic.
As a consequence, Martine Collard's work in Edelweiss aims at studying the integration of data mining, text mining and knowledge management techniques, organized according to the following objectives:
extending current Edelweiss solutions for text mining (mainly based on linguistic tools) with a data mining point of view;
using Edelweiss knowledge management solutions to extend the data mining methodologies defined in her team at the I3S laboratory.
These objectives are being pursued through collaboration in the Sealife and ImmunoSearch projects.
Semantic Annotation of Usages and Persons
This work is carried out in the framework of the PhD of Freddy Limpens. It is devoted to studying how annotating the varied resources of the Web can help its users interact better. The use of metadata has grown among Web applications, whether in a free manner with tags, or grounded in ontologies. Moreover, the status of the Web has shifted from a global library to a virtual platform for interaction and for the exchange of different kinds of services and resources. In this context, it is possible to acquire valuable information from usage, and to represent it through semantically enriched metadata. These metadata can in turn be integrated and exploited by semantically enabled applications to enhance the overall experience of Web users.
To achieve this objective, we need to investigate how to obtain semantically rich metadata in an unobtrusive way, and then how such metadata can be efficiently exchanged or retrieved by users and other Web applications. A growing number of Web services, labelled Web 2.0, propose that their users attribute arbitrary keywords or tags to their resources in order to annotate them, thus creating folksonomies. This approach has the great benefit of requiring little effort, and of rapidly and enthusiastically injecting human intelligence into the overwhelming flow of data constantly published on the Web. However, it has proved somewhat hard for computers to reason and make valid inferences over these extremely versatile, and often ambiguous, knowledge structures.
To tackle this problem, a number of researchers have tried to detect stable semantic structures within folksonomies, or to use ontologies to semantically constrain tagging. Gruber also launched a call to action and proposed to collaboratively build an ontology of the act of tagging, to account for all its varied aspects. Another approach is to adapt the technological structures that sustain the activity and exchange of data within communities, with original methods to produce semantically rich metadata at the earliest stages of data creation. In this prospect, the preliminary task is to model the interactions between users, or groups of users, and the resources they manipulate. To date, a number of research works are relevant to this aspect. Breslin et al. proposed to model how data are exchanged among online communities: the Semantically Interlinked Online Communities (SIOC) framework provides an ontology that describes concepts and relations about online communities (http://sioc-project.org).
Other approaches have focused on communities of practice, a term coined by Wenger to name groups of persons who share a common interest in a specific matter, are aware of this common tie, and actively collaborate. Research has been conducted to establish appropriate typologies of such communities and to provide them with suitable ontologies to sustain their activity and annotate their resources. To complement this state of the art, we need to further investigate the cognitive and social phenomena bound to interactions with resources on the Web in general. Some contributions in this direction can be noticed, such as Sinha's work on the process of tagging, but more work needs to be done on evaluating how satisfying and helpful collaborative tagging tools are. For this purpose, a survey could be conducted among a targeted community in order to gather feedback about the use of such systems.