Section: New Results
Keywords: Corporate Memory, Corporate Semantic Web, Knowledge Management, Knowledge Engineering, Ontology, Assistance to the User, Semantic Web, Semantic Annotation, Knowledge Acquisition from Texts, Natural Language Processing, Text-mining, Co-operation, Information Extraction, Evolution.
Annotation of Information Resources
The objective of this research direction is to propose (1) a methodological guide for a collaborative, Semantic-Web-based annotation process in a community of practice; (2) an ontology-based, service-oriented annotation toolkit offering both a service of semi-automatic annotation from textual documents and a service of collaborative annotation and management of the evolution of annotations. The methodological guide and the toolkit will tackle complex contextualization of annotations, various kinds of Web-accessible external resources, reflective annotations and more complex types of heterogeneous resources and services.
Extraction and Exploitation of Contextual, Evolving Semantic Annotations for a Virtual Community
To properly exploit the semantics that texts convey, techniques are proposed to extract semantic annotations. These annotations often represent a set of terms, identifying concepts, connected by relations. The aim of this work is to propose a system for the automatic extraction and exploitation of contextual semantic annotations from texts, based on Semantic Web principles. This work is carried out within the framework of Noureddine Mokhtari's PhD.
In this work, the proposed approach takes as input texts and a domain ontology, and gives as output an XML document representing the text's structure, as well as semantic objects represented in the RDF formalism and linked by contextual relations identified by discourse markers. The proposed approach to extracting contextual annotations is summarised as follows: i) identify structure (titles, paragraphs, etc.) and semantics (classes, properties, and candidate values of properties); ii) identify discourse markers and their arguments; iii) reconstruct the structure of the document (titles, paragraphs, sentences, arguments); iv) deduce the contextual scope from the text structure; v) generate "semantic objects" represented by a set of RDF triples using a specific algorithm. We implemented this algorithm and tested it on the SevenPro framework. This experimentation yielded good scores (87.02% precision, 84.44% recall).
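Steps ii), iv) and v) can be illustrated by a minimal sketch: detect a discourse marker in a sentence, split the sentence into its two arguments, and return a small record from which contextual RDF triples could be generated. The marker list and the output layout are invented for illustration and do not reproduce the actual SevenPro algorithm.

```python
# Hypothetical sketch: map a discourse marker to a contextual relation
# and recover the two textual arguments it connects.
DISCOURSE_MARKERS = {"because": "cause", "although": "concession"}

def contextual_annotation(sentence):
    for marker, relation in DISCOURSE_MARKERS.items():
        if f" {marker} " in sentence:
            left, right = sentence.split(f" {marker} ", 1)
            return {
                "relation": relation,  # contextual relation between arguments
                "arg1": left.strip(),  # scope of the first semantic object
                "arg2": right.strip(), # scope of the second semantic object
            }
    return None  # no contextual relation detected in this sentence

ann = contextual_annotation("The valve failed because the seal was worn")
```

In a full pipeline, `arg1` and `arg2` would each be annotated as a set of RDF triples, and the `relation` would link the two resulting named graphs.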
The originality of this work consists of two features: 1) the integration of the semantic annotation context, which gives new ways of reasoning and more information based on both the structure and the semantics of the text  ; 2) the use of several technologies such as NLP (GATE, JAPE, TreeTagger), semantic annotations, knowledge representation and the Semantic Web (RDF/S, OWL, ontologies, SPARQL, XQuery, Corese) to build a system for the automatic extraction and exploitation of contextual annotations from texts.
Semantic Virtual Environment for Engineering Product Design
The SevenPro European project lasted from January 2006 to October 2008. Based on research activities carried out within this project, we presented a study on using semantic annotations extracted from texts, where a so-called context definition limits the scope/validity of a semantic annotation to its genuine text part of origin  . These RDF and SPARQL extensions operate on RDF annotations with named graphs. We also proposed new semantic metrics for approximate search in RDF stores w.r.t. domain ontologies  . Finally, we studied SPARQL query performance in a distributed system consisting of several RDF stores  .
Based on existing NLP tools (GATE  , RASP  , etc.), a semantic search engine (Corese), Java APIs, databases and 3D design software (Catia, SolidWorks), we designed processes that can extract useful information from the various sources available during a design engineering project. The information sources targeted in SevenPro are textual documentation, CAD files, and ERP repositories. We are work-package leader and responsible for the task entitled 'extracting annotations from texts', described hereafter.
Semantic annotation of texts requires the extraction of semantic relations between domain-relevant terms in texts. Several studies address the problem of capturing complex relations from texts - relations more complex than subsumption relations between terms identified as domain concepts. These studies combine statistical and linguistic analysis. Basically, these approaches aim at detecting new relations between domain terms, whereas in semantic annotation generation we aim at identifying existing relations, belonging to the domain ontology, with instances in texts. We also aim at completing the annotations with the description of the domain concepts related by these identified relations. In SevenPro, the text annotator is able to:
Extract the plain text from various document formats (MS-Office Word, Excel and PowerPoint, PDF, and OpenOffice documents), relying on the existing POI (Java API to access proprietary document format files; the Apache foundation: http://poi.apache.org ) and Java Tika (content analysis toolkit; the Apache incubator working group: http://incubator.apache.org/tika ) libraries;
Analyse text sentences using NLP techniques: split the text into sentences, then into words (tokens), then assign a grammatical category to each token;
Identify the grammatical constituents (subject, verb, and object) of the sentence by using the RASP parser;
Map the constituents identified by the NLP tools: we map the identified constituents to the formalised RDFS concepts/properties of the domain ontologies;
And finally, generate the correct RDF triple annotations by identifying the instances of RDF triples (modifiers of the subject/object in the sentence).
The text annotator may also suggest new properties (and their annotations) which are not present in the knowledge resources (ontology and/or grammar relations)  .
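The mapping and generation steps above can be sketched as follows: align the (subject, verb, object) constituents produced by the parser with classes and properties from a toy domain ontology, and emit a candidate RDF triple when all matches succeed. The lexicon, the `ex:` namespace and the property names are invented stand-ins, not the SevenPro ontology.

```python
# Hypothetical toy ontology mapping surface forms to RDFS identifiers.
ONTOLOGY = {
    "classes": {"gear": "ex:Gear", "shaft": "ex:Shaft"},
    "properties": {"drives": "ex:drives"},
}

def annotate(subject, verb, obj):
    """Map SVO constituents to ontology terms; return a triple or None."""
    s = ONTOLOGY["classes"].get(subject)
    p = ONTOLOGY["properties"].get(verb)
    o = ONTOLOGY["classes"].get(obj)
    if s and p and o:
        return (s, p, o)  # a candidate RDF triple annotation
    return None           # some constituent is not covered by the ontology

triple = annotate("gear", "drives", "shaft")
```

A case where the verb has no ontology counterpart is precisely where the annotator could suggest a new property, as described above.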
Virtual Reality Reasoning
The Virtual Reality Reasoning Module (VR-ReasoM) is the bridge between the virtual reality module (VRM) and the SevenPro knowledge base. The semantic reasoning is based on information retrieval techniques and the application of semantic rules. The VR reasoning module is used to display knowledge and to control object behaviour in the virtual reality scenario. In the presentation mode of the VRM, we display knowledge about the current scene status, i.e. labels of VR objects. In the guide mode, we control the user's actions and help him/her reach the next step, for example in an assembly procedure, by means of reasoning. From the VR point of view, the reasoning is a background process which has a noticeable impact on the VR scene.
We implemented a bidirectional version of a Java API between the VRM and the Corese semantic engine. We tested the application of semantic rules whose conclusions can trigger VRM methods (e.g., for VR object selections). Each step of the VR scenario consists of a co-action of two VR objects. The co-actions are controlled by rules which apply different tests to both engineering items (for example size fit, material compatibility, subsumption, etc.). When a test (i.e., a semantic rule condition) succeeds, it triggers a VR event. We have detailed the reasoning task in a deliverable  .
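A co-action test of the kind described above can be sketched minimally: two engineering items pass when their sizes fit within a tolerance and their materials are compatible; a successful test is what would trigger a VR event. The tolerance value and the compatibility table are invented for illustration.

```python
# Hypothetical material-compatibility table (unordered pairs as sorted tuples
# would be more robust; a small fixed set suffices for the sketch).
COMPATIBLE = {("steel", "bronze"), ("steel", "steel")}

def co_action_allowed(item_a, item_b, tolerance=0.1):
    """Rule condition: size fit AND material compatibility."""
    size_fit = abs(item_a["size"] - item_b["size"]) <= tolerance
    materials = (item_a["material"], item_b["material"])
    return size_fit and materials in COMPATIBLE

ok = co_action_allowed({"size": 5.0, "material": "steel"},
                       {"size": 5.05, "material": "bronze"})
```

In the actual module, such conditions are expressed as semantic rules over the knowledge base rather than hard-coded Python, and the rule conclusion invokes the VRM method.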
Moreover, we carried out research activities on the knowledge visualisation issue. When pieces of knowledge intended to inform the user are integrated and rendered in a VR scene, it can be complex to control how to display this information (e.g. the semantic annotation of a given VR object). To simplify this task, we use the Fresnel  language: an RDF vocabulary that was defined to specify the visualization of RDF graph pieces. This vocabulary enables us to define viewpoints on RDF/S data, called lenses. Lenses define which semantic data should be shown for a given type of resource and how it should be displayed. The user can change the viewpoint to visualize knowledge according to his/her needs. To extend this mechanism, we defined user profiles (engineer, marketing agent, etc.) as aggregations of Fresnel lenses.
Engineering Memory Tool
Based on an ontological distance, Corese supports an approximate search process. It distinguishes between exact answers, for which there exists a projection of the query upon their annotations, and approximate answers, for which there exists an approximate projection of the query upon their annotations. For example, an engineer can specify the type of mill and lifter he is working on, and the system comes up with related cases sorted by closeness. Hence, our objective is that these suggestions be relevant enough to foster efficient reuse of engineering knowledge from past cases, i.e. to enable users considering a current case to use the suggested previous cases as a starting point for their new design.
There are two main families of distances: those using information external to the model (e.g. statistics on a corpus) and those relying solely on the structure of the models (e.g. a hierarchy of types). We studied how these two approaches can be applied and extended in an RDF store.
The first extension consists in considering the 'proximity of usage' of two types i.e. the frequency with which these two types are used together in descriptions. It is called distance in extension (or co-occurrence distance). Most of the distances, relying on an ontology, limit their use of the metric space to the hierarchy of classes i.e. only the graph of direct subsumption links is used in defining the metric space.
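The 'proximity of usage' idea can be sketched directly: count how often two types co-occur in resource descriptions and turn frequent co-occurrence into a small distance. The toy corpus and the normalisation `1/(1+n)` are invented for illustration; the actual metric in Corese is more elaborate.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus: each description is the set of types it uses.
descriptions = [
    {"Mill", "Lifter"},
    {"Mill", "Lifter", "Liner"},
    {"Mill", "Liner"},
]

# Count co-occurrences of unordered type pairs across descriptions.
cooc = Counter()
for desc in descriptions:
    for pair in combinations(sorted(desc), 2):
        cooc[pair] += 1

def extension_distance(t1, t2):
    """Distance in extension: frequent co-occurrence -> small distance."""
    n = cooc[tuple(sorted((t1, t2)))]
    return 1.0 / (1 + n)

d = extension_distance("Mill", "Lifter")  # co-occur in 2 descriptions
```

Types that never co-occur get the maximal distance of 1.0 under this toy normalisation.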
The second extension goes beyond by considering property signature and class hierarchies for a new metric space. The Corese engine extends the SPARQL language in order to offer the possibility for computing paths in RDF/S graphs. This extension also allows us to specify constraints on the types of the properties that can be used in a path and we apply it to extract paths using a subtype-aware-signature regular path expression. Finally, we have detailed this task in a deliverable  .
Semantic Grid Browser for the Life Sciences Applied to the Study of Infectious Diseases
This work is done in the context of the SeaLife European targeted research project. The objective of SeaLife is the design and development of a semantic Grid browser for the Life Sciences, which will link the existing Web to the currently emerging eScience infrastructure. The SeaLife browser will allow users to automatically link a host of Web servers and Web/Grid services to the Web content they are visiting. This will be accomplished using eScience's growing number of Web/Grid Services, XML-based standards and ontologies. The browser will identify terms in the pages being browsed through the background knowledge held in ontologies. Through the use of semantic hyperlinks, which link identified ontology terms to servers and services, the SeaLife browser will offer a new dimension of context-based information integration.
This SeaLife browser will be demonstrated within three application scenarios in evidence-based medicine, literature and patent mining, and molecular biology, all relating to the study of infectious diseases. The three applications vertically integrate the molecule/cell, the tissue/organ and the patient/population levels by covering the analysis of high-throughput screening data for endocytosis (the molecular entry pathway into the cell), the expression of proteins in the spatial context of tissue and organs, and a high-level library on infectious diseases designed for clinicians and their patients.
In this project, we take part in six of the seven work packages and we coordinate the Text-mining and Natural Language Processing work package. Our main contributions for this year are:
Word sense disambiguation
To improve our term detection method  , we proposed a technique to solve the ambiguity problem confronting MeatAnnot results. The main idea of this method is to use the ambiguous word's context to decide which semantic type to assign to it. This context consists of the set of terms which occur with the ambiguous word in the same sentence or in the same paragraph. So, if MeatAnnot assigns several semantic types to the same candidate term, the disambiguation module tries to find the right one. This module computes similarities between the semantic types assigned to the ambiguous word and the semantic types assigned to its neighbours in the text. The semantic type with the highest similarity is then selected. The calculation of similarity between semantic types is based on the Corese semantic distance. The algorithm was tested on a standard collection for evaluating disambiguation methods and had good results  ,  ,  .
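The selection step can be sketched as follows: among the semantic types proposed for an ambiguous term, keep the one whose summed similarity to the neighbours' types is highest. The similarity table below is an invented stand-in for the Corese ontology-based distance, which we do not reproduce here.

```python
# Hypothetical pairwise similarities between semantic types.
SIMILARITY = {
    ("Gene", "Protein"): 0.9,
    ("Gene", "Disease"): 0.2,
    ("Cell", "Protein"): 0.5,
    ("Cell", "Disease"): 0.4,
}

def sim(a, b):
    """Symmetric lookup with a default of 0 for unknown pairs."""
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))

def disambiguate(candidate_types, neighbour_types):
    """Pick the candidate type closest to the context's types."""
    def score(t):
        return sum(sim(t, n) for n in neighbour_types)
    return max(candidate_types, key=score)

best = disambiguate(["Gene", "Cell"], ["Protein", "Disease"])
```

Here "Gene" scores 0.9 + 0.2 = 1.1 against the context, beating "Cell" at 0.9, so "Gene" is selected.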
In the SeaLife use case called 'literature and patent mining', we proposed three approaches for semantic patent clustering for biomedical communities  :
The standard approach: in this case, we kept the simple TF-IDF (term frequency-inverse document frequency) vector for each patent's claims section. Clusters were computed on the basis of the standard cosine similarity function, measuring the deviation of angles between the patent vectors. A cosine value of zero means that the two patents are orthogonal.
The hierarchical weight propagation approach: in this case, we introduced semantic concept relationships into the weights. We assume, for instance, that if a patent concerns microbiology with a weight 'n', it also concerns biology with a weight 'm' lower than 'n'. Therefore, we incremented the weight of the ancestor concepts of each concept detected in the claims text. We divided the weight by 2 when passing from the concerned concept to its parent concepts, thus spreading a decreasing weight through the ontology hierarchy.
The semantic distance approach: in this case, we introduced a semantic similarity function between patents without modifying basic TF-IDF weights. The idea is to use the conceptual distance (defined in  ) between concepts annotating patents. This distance relies on the subsumption path in the UMLS metathesaurus. The semantic function is defined to reinforce the similarity between patent claim documents which use close concepts.
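The weight-propagation approach can be sketched concretely: each detected concept contributes half its weight to its parent, a quarter to its grandparent, and so on up the hierarchy. The toy hierarchy below is invented for illustration.

```python
# Hypothetical concept hierarchy: child -> parent.
PARENT = {"Microbiology": "Biology", "Biology": "Science"}

def propagate(weights):
    """Spread a halving weight from each concept up to its ancestors."""
    out = dict(weights)
    for concept, w in weights.items():
        node, contrib = concept, w
        while node in PARENT:
            node = PARENT[node]
            contrib /= 2  # halve the contribution at each step up
            out[node] = out.get(node, 0.0) + contrib
    return out

w = propagate({"Microbiology": 4.0})
```

A patent whose claims mention microbiology with weight 4.0 thus also counts biology with weight 2.0 and science with weight 1.0, which is what makes patents on sibling topics more similar under TF-IDF.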
In this work, we assumed that semantics can improve text clustering, which is confirmed by the results obtained. The approaches rely on standard Semantic Web technologies (RDF, SPARQL, etc.). As further improvements, we are working on the development of a semantic clustering toolbox allowing the interpretation of the obtained results and the combination of several semantic approaches.
In the SeaLife use case called 'evidence-based medicine', we proposed the Corese-NeLI Semantic Web browser  ,  ,  dedicated to navigating resources in the infectious disease domain. This browser supports the navigation of a portal by the use of a structured vocabulary or a domain ontology. It supports two main functionalities:
Semantic search of a Web portal, relying on semantic annotations that are generated from Web pages using a provided knowledge artefact. The search is based on the generated annotations. During the search process, Corese uses the taxonomical relationships of the SKOS (W3C Simple Knowledge Organization System) thesaurus (i.e., narrower, broader, etc.) to retrieve annotated pages related to the user's query.
Semantic browsing of a Web portal: the Corese-based engine offers the possibility of identifying and highlighting terms retrieved from a structured vocabulary on a visited Web page. From the highlighted terms, it can then create dynamic links to related pages within the portal, thereby enabling semantic browsing. Moreover, a query can be built from the highlighted terms in order to query external resources such as Google and PubMed.
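The taxonomy-aware retrieval described above can be sketched minimally: expand the user's query term with its skos:narrower descendants, then match the expanded set against page annotations. The mini-thesaurus and page annotations are invented stand-ins for the SKOS data and the Corese engine.

```python
# Hypothetical skos:narrower links: term -> list of narrower terms.
NARROWER = {
    "InfectiousDisease": ["Influenza", "Hepatitis"],
    "Influenza": ["AvianInfluenza"],
}

def expand(term):
    """Return the term plus all its transitive narrower descendants."""
    found = [term]
    for child in NARROWER.get(term, []):
        found.extend(expand(child))
    return found

# Hypothetical per-page annotations generated from the portal's content.
PAGE_ANNOTATIONS = {
    "page1": {"AvianInfluenza"},
    "page2": {"Hepatitis"},
    "page3": {"Malaria"},
}

def search(term):
    """Retrieve pages annotated with the term or any narrower term."""
    wanted = set(expand(term))
    return sorted(p for p, ann in PAGE_ANNOTATIONS.items() if ann & wanted)

hits = search("InfectiousDisease")
```

A query for "InfectiousDisease" thus reaches pages annotated only with the more specific "AvianInfluenza" or "Hepatitis", which is the point of using the thesaurus during search.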
We participated in several deliverables and we are the editors of the I2D3 deliverable  , which describes the evaluation of techniques developed in the SeaLife project: (i) information extraction from texts and word sense disambiguation, (ii) information extraction from patents, and (iii) information extraction from navigation logs.
Ontology and Annotations for a Discussion Forum of a Community of Practice
In order to facilitate navigation among past e-mails and to find solutions to problems previously discussed, we propose an approach for the automatic creation of semantic annotations on such e-mails, based on an ontology partly created from a linguistic analysis of this corpus of e-mails. The SemanticFAQ portal relies on the generated annotations and on a semantic search engine to offer ontology-guided navigation through the e-mails. The @pretic ontology consists of the following sub-ontologies:
OntoPedia: not all the computer components on which problems may occur are necessarily mentioned in the e-mails, so relying on a linguistic analysis of the e-mail corpus alone would have led to an incomplete ontology. Therefore, we preferred to reuse an existing term hierarchy (WeboPedia). We developed a program that automatically generates an ontology represented in RDFS from the term hierarchy of this online encyclopaedia.
Oemail: it describes metadata on e-mails by defining generic concepts (e.g. E-mailMessage), more specific concepts (e.g. ReplyMessage) and semantic relationships (e.g. author, date, recipient, etc.).
O'CoP: this ontology, detailed in  , comprises concepts for describing a CoP, its actors, their roles and competences, the resources they use, etc. We used the O'CoP ontology to describe the members of the @pretic CoP.
Computer-Problem ontology: this is the main module of the @pretic ontology; it provides concepts and properties for describing the computer problems faced by CoP members. To initiate and enrich this ontology, we applied NLP techniques to the corpus of e-mails.
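The OntoPedia generation step (turning a term hierarchy into RDFS) can be sketched as a small serializer that declares each term as a class and links it to its parent with rdfs:subClassOf. The `op:` namespace and the toy hierarchy are invented for illustration.

```python
# Hypothetical term hierarchy extracted from an encyclopaedia:
# term -> parent term (None for roots).
HIERARCHY = {"Hardware": None, "Printer": "Hardware", "LaserPrinter": "Printer"}

def to_rdfs(hierarchy, ns="op"):
    """Serialize a term hierarchy as N-Triples-style RDFS statements."""
    lines = []
    for term, parent in hierarchy.items():
        lines.append(f"{ns}:{term} rdf:type rdfs:Class .")
        if parent:
            lines.append(f"{ns}:{term} rdfs:subClassOf {ns}:{parent} .")
    return "\n".join(lines)

rdfs = to_rdfs(HIERARCHY)
```

The real generator parses the WeboPedia pages first; only the hierarchy-to-RDFS step is shown here.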
The ontology and the annotations thus obtained are then used in a semantic portal that facilitates ontology-guided and personalized navigation of the CoP members. This work was published in  ,  ,  .
Semi-Automatic Identification of n-ary Relations in Textual Corpus
The objective of this work is to propose a methodology for the identification and extraction of n-ary relations within a text. The use cases are those described by the W3C best practices for the RDF representation of n-ary relations, aiming at solving the identification and extraction issues. We proposed a method based on linguistic approaches. The main idea is that each use case, which determines the type of an n-ary relation, is characterized by a set of grammatical relation patterns identified from the results of sentence syntactic analyses. Basically, the main steps of our approach are:
The identification of an n-ary relation category, by setting up the set of grammatical relations which characterize each use case. There are four use cases, with many sub-cases; the identified categories apply to both simple and complex sentences.
The extraction of the relation arguments. Considering that each sentence can be represented as a directed labelled graph, the arguments of the n-ary relation are extracted by building the graph corresponding to the sentence. We apply a traversal search algorithm in order to explore the generated graphs.
The process takes as input a text and provides as output an XML file which describes n-ary relations found in this text. Our system can detect and extract most of n-ary relations present in simple and complex sentences.
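The argument-extraction step can be sketched as follows: treat the parsed sentence as a directed labelled graph and collect the arguments of an n-ary relation by a breadth-first traversal from the relation's head word. The edge labels mimic grammatical relations; the example graph is invented for illustration.

```python
from collections import deque

# Hypothetical dependency graph: head word -> [(dependent, grammatical relation)].
EDGES = {
    "diagnosed": [("doctor", "subject"), ("disease", "object"),
                  ("Tuesday", "time")],
    "disease": [("rare", "modifier")],
}

def arguments(head):
    """Collect (relation, word) argument pairs reachable from the head."""
    args, queue = [], deque([head])
    while queue:
        node = queue.popleft()
        for dep, rel in EDGES.get(node, []):
            args.append((rel, dep))
            queue.append(dep)
    return args

args = arguments("diagnosed")
```

From "diagnosed" the traversal recovers the subject, object and time arguments, plus the modifier attached to "disease", which together form the arguments of one n-ary relation.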
Semantic Web for Biomarker Experiments
This work is done in the context of the BioMarker project whose objective is to design biomarkers for controlling the harmlessness of molecules used in perfumes, aromatics and cosmetics. The purpose of this research is to conduct comparative studies of in vivo and in vitro test models on the skin (irritation, allergy) and to propose alternative methods defining new norms applicable in this field.
Our role in this project is to provide biologists with methodological tools allowing them (i) to explore the huge amount of heterogeneous data, such as data description vocabularies (e.g. Gene Ontology, http://www.geneontology.org/ ), scientific literature, gene expression data analyses stored in public databases (e.g. GEO, http://www.ncbi.nlm.nih.gov/geo/ ), or biologists' background knowledge, and (ii) to perform meta-analyses on multiple independent microarray data sets, in order to identify gene profiles for specific biological processes.
We propose an approach based on Semantic Web techniques in order to describe and semantically query the huge set of heterogeneous information sources related to gene expression data resulting from microarray experiments. Our main contributions for this year are:
GEOnto: this ontology describes gene expression data experiments and their conditions. Some concepts in GEOnto cover general biology fields (in vivo, inductor, subject, sample...) and others are specific to a particular field. As a first step, we limit it to dermatology (skin, eczema, contact dermatitis...), but GEOnto can be extended to other biological fields. To build GEOnto, we rely on (i) a corpus of experiment descriptions used to pick out candidate terms, (ii) biologists who help us structure the concepts and validate the proposed ontology, and (iii) existing ontologies (UMLS, http://www.nlm.nih.gov/research/umls/ , and OntoDerm, http://www.gulfdoctor.net/ontoderm/ ) from which specific concepts are extracted.
GMineOnto: this ontology provides concepts for the description of statistical analysis and more complex mining processes on expression data (e.g. clustering method, cluster).
(Semi-)automatic Semantic Annotation Generation
GEAnnot: considering the public experiments selected from the public repository GEO, we annotated the MINiML-formatted family file, an XML document relying on the MIAME formalism (http://www.mged.org/Workgroups/MIAME ). The annotation process is semi-automatic. Instances of GEOnto concepts are detected in the document; some of them are directly used to generate the annotation describing the experiment (e.g. contributors, pubmedID, keywords, condition titles), and others are proposed to the biologist, who selects the most relevant instance for each condition (e.g. time point, treatment, subject). An interactive interface is proposed to annotate experiments.
MeatAnnot-V2: this tool uses a declarative method to generate, starting from a scientific paper, a structured annotation based on MeatOnto  that describes interactions between genes/proteins and other concepts. Each sentence of the text is described with (i) an XML document which is an abstract syntax parse tree coming from a transformation of the RASP NLP tool  result and (ii) the instances of MeatOnto relationship and concepts detected in the sentence using MeatAnnot  . We designed SPARQL extensions that include XPath to detect which instances of MeatOnto concepts are linked by a relationship according to the ontology and to generate an annotation describing this interaction.
We proposed an "intelligent" information retrieval approach that uses not only semantic annotations but also data stored in XML documents and/or in classic databases. Indeed, some information, such as information about gene behaviour (expressed, inhibited or stable), is stored in a classic database referenced in the semantic annotations of the experiment. When a user needs to find "experiments that use an inductor x and where the gene g is expressed", it is useful to have one query that finds the relevant data, combining information stored in the annotations and in the database. The idea here is to query the database using SQL embedded in SPARQL through the Corese semantic search engine  . In the same way, we used Corese to query XML documents (containing information about clusters of genes and referenced in the semantic annotations of the experiment) using XPath embedded in SPARQL. This work was published at I-Semantics 2008  .
Ontology for Open Source Development Communities
Participant: Isabelle Mirbel.
Due to the rise of the Web, several online professional communities dealing with software development have emerged. They are communities of people who organize themselves and interact primarily through the Web for work and knowledge sharing. Open source software development may be seen as a particular case of distributed software development with a volatile project structure, without a clearly-defined organization and assigned tasks for all of its members. It requires a long-term commitment as well as a common vision among the participants, and raises new challenges in terms of knowledge management. Indeed, these communities generate huge amounts of information as a result of their interactions. This information is mostly structured for quick reuse (mailing lists, forums, etc.). There are few means (like FAQs, for instance) to capitalize on the information over a longer period of time and to turn it into knowledge (through a semantic FAQ, for instance).
In this context, our current work focuses on means to improve knowledge spreading and sharing in this kind of community. We chose a Semantic Web approach relying on an ontology allowing the annotation of the community's resources, in order to enhance the exploitation of these resources through dedicated knowledge management services.
While building this ontology, our aim was twofold. On the one hand, we modeled the pertinent concepts for annotating open source development community resources from a community-of-practice point of view, and we therefore started from the generic O'CoP ontology provided in the framework of the Palette European project. On the other hand, we reused the ontologies about FLOSS (Free/Libre Open Source Software) provided in the literature (Dhruv  ,  and OSDO  ) as well as the ontology provided in the framework of the SIOC (Semantically-Interlinked Online Communities, http://sioc-project.org ) project. The proposed ontology has been formalized in RDFS/OWL. It has been published in  and its core concepts are available online (http://ns.inria.fr/oflossc ).
We now plan to focus our efforts on knowledge management services, and more particularly on complex and context-dependent search procedures. Such procedures may be seen as sequences of several steps (or sub-goals) dealing with elementary information searches. Howtos are examples of resources highlighting this kind of complex process to be followed to perform a task. As has been highlighted in the literature, dedicated strategies are built by domain experts to search for information, and it may be difficult for novice users to acquire such search procedures. Moreover, these procedures become critical because of the current multiplication of knowledge bases, the growing specialization of information sources and the resulting spreading of information.
In this context, the aim of our work will be to provide means, based on models and techniques of the Semantic Web, to specify search queries and complex search procedures in order to facilitate their reuse, sharing and spreading inside a virtual community of practice.