Team Orpailleur

Section: New Results

The Mining of Complex Data

Participants: Zainab Assaghir, Rokia Bendaoud, Nicolas Jay, Mehdi Kaytoue, Florence Le Ber, Amedeo Napoli, Frédéric Pennerath, Yannick Toussaint.

Formal concept analysis, itemset search, and association rule extraction are symbolic methods well suited to KDDK and usable in real-sized applications. Improvements can still be made to their ease of use, their efficiency, and their ability to fit evolving situations. Accordingly, the team is working on extensions of these symbolic methods that can be applied to complex data such as objects with multi-valued attributes, n-ary relations, graphs, and texts.

FCA, RCA, and Pattern Structures

Recent advances in data and knowledge engineering have emphasized the need for Formal Concept Analysis (FCA) tools that take structured data into account. There are a few extensions of FCA for handling contexts involving complex data formats, e.g. graph-based or relational data. Among them, Relational Concept Analysis (RCA) is a process for analyzing objects described by both binary and relational attributes [93]. The RCA process takes as input a collection of contexts and of inter-context relations, and yields a set of lattices, one per context, whose concepts are linked by relations. RCA plays an important role in KDDK, especially in text mining [71].

Another extension of FCA is based on so-called Pattern Structures (PS) [79], which allow a concept lattice to be built from complex data, e.g. nominal, numerical, and interval data. In (major [6]), pattern structures are used for building a concept lattice from intervals, in full compliance with FCA (thus benefiting from the efficiency of FCA algorithms). The notion of similarity between objects is closely related to these extensions of FCA: two objects are similar as soon as they share the same attributes (binary case), or attributes with similar values, or the same description (at least in part). Research is currently under way on the relations between classification methods based on FCA with an explicit similarity measure (Formal Concept Analysis driven by Similarity, or FCAS [14], [33], [60]) and Pattern Structure classification. The parallel study of FCAS and PS helps to understand how these two methods are interrelated and how they can be applied to complex data for building concept lattices.
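As an illustration of how a pattern structure over intervals can yield concepts, here is a minimal sketch. It assumes the usual construction in which the similarity of two interval descriptions is their componentwise convex hull; the toy data, the function names, and the naive enumeration over all object subsets are hypothetical and are not taken from the cited work.

```python
from functools import reduce
from itertools import combinations

# Hypothetical toy data: one interval per attribute for each object.
# A plain number v is encoded as the degenerate interval (v, v).
DATA = {
    "g1": ((5, 5), (7, 7)),
    "g2": ((6, 6), (8, 8)),
    "g3": ((4, 4), (8, 8)),
}

def meet(d1, d2):
    """Similarity of two descriptions: componentwise convex hull."""
    return tuple(
        (min(a1, a2), max(b1, b2))
        for (a1, b1), (a2, b2) in zip(d1, d2)
    )

def subsumes(general, specific):
    """True if each interval of `specific` lies inside that of `general`."""
    return all(
        a <= c and d <= b
        for (a, b), (c, d) in zip(general, specific)
    )

def common_description(objects, data):
    """Most specific description shared by a non-empty set of objects."""
    return reduce(meet, (data[g] for g in objects))

def extent(description, data):
    """All objects whose description is covered by `description`."""
    return frozenset(g for g, d in data.items() if subsumes(description, d))

def pattern_concepts(data):
    """Naive enumeration: close every non-empty subset of objects.

    Exponential in the number of objects, so only suitable for toy data.
    Returns a dict mapping each extent to its interval description.
    """
    concepts = {}
    names = sorted(data)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            desc = common_description(subset, data)
            concepts[extent(desc, data)] = desc
    return concepts

if __name__ == "__main__":
    for ext, desc in pattern_concepts(DATA).items():
        print(sorted(ext), desc)
```

In this sketch, larger intervals are more general descriptions, so the concept with the largest extent carries the widest intervals. Efficient FCA algorithms would replace the subset enumeration in practice.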

PS classification and FCAS have been applied in the field of decision support in agronomy. There exists a set of agro-ecological indicators aimed at helping farmers to improve their agricultural practices. An indicator estimates the impact of cultivation practices on the “agrosystem” [81]. The modeling and assessment of environmental risk generally require a large number of parameters whose measurement is imprecise. It is therefore important to study how imprecision is propagated through the various steps of decision support, and which types of imprecision are combined in the computation of the value of an indicator [58]. This is not straightforward, but the computation of an indicator value, decision support based on indicators (which can be seen as a special decision tree), and pattern structure classification are all linked and can be studied in the same classification framework.

Still in the context of agronomy, one line of research concerns the design of representation and reasoning models for spatial structures in knowledge-based systems and, in parallel, the design of concept lattices for mining and understanding complex hydrobiological data, which requires specific algorithms [48], [37], [49], [38]. These studies are of general interest because they try to push forward the computational capabilities of standard FCA algorithms by considering complex data with multiple nested modalities.

To complement the work on FCA, there is ongoing work on frequent itemset search, both to improve standard algorithms and to build lattices from very large data. In this case, closed itemsets are searched for first, then generators, i.e. the minimal itemsets in their equivalence classes, and finally the links between equivalence classes, which in fact give the concept ordering of the underlying concept lattice (major [10]).
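The three steps described above (closed itemsets, then generators, then the ordering between equivalence classes) can be sketched on a toy example. This is a naive, exponential illustration under stated assumptions, not the algorithm of (major [10]): the transaction data and all function names are hypothetical, and only non-empty itemsets with non-empty support are considered.

```python
from itertools import combinations

# Hypothetical toy transaction database.
TRANSACTIONS = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("a"),
]

def support_set(itemset, transactions):
    """Indices of the transactions that contain every item of `itemset`."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)

def closure(itemset, transactions):
    """Largest itemset shared by all transactions containing `itemset`.

    Assumes `itemset` occurs in at least one transaction.
    """
    tids = support_set(itemset, transactions)
    return frozenset.intersection(*(transactions[i] for i in tids))

def equivalence_classes(transactions):
    """Group supported non-empty itemsets by their closure.

    Returns a dict: closed itemset -> list of itemsets with that closure.
    """
    items = sorted(set().union(*transactions))
    classes = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            if support_set(itemset, transactions):
                closed = closure(itemset, transactions)
                classes.setdefault(closed, []).append(itemset)
    return classes

def generators(members):
    """Minimal itemsets (with respect to inclusion) of one class."""
    return [m for m in members if not any(o < m for o in members)]

def hasse_edges(closed_sets):
    """Covering pairs (small, large) among closed itemsets, by inclusion."""
    return [
        (small, large)
        for small in closed_sets
        for large in closed_sets
        if small < large
        and not any(small < mid < large for mid in closed_sets)
    ]

if __name__ == "__main__":
    classes = equivalence_classes(TRANSACTIONS)
    for closed, members in classes.items():
        gens = [sorted(g) for g in generators(members)]
        print("closed:", sorted(closed), "generators:", gens)
    for small, large in hasse_edges(list(classes)):
        print(sorted(small), "->", sorted(large))
```

The covering pairs between closed itemsets correspond to the edges of the concept lattice ordered by intent inclusion; ordering by extents reverses the direction.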

KDDK in Medico-Economical Databases

Over the past 30 years, many patient classification systems (PCS) have been developed. These systems aim at classifying care episodes into groups according to different patient characteristics. In France, the so-called “Programme de Médicalisation des Systèmes d'Information” (PMSI) is a nationwide PCS in use in every hospital. It systematically collects data about millions of hospitalizations. Though it is essentially used for funding purposes, it potentially holds very useful knowledge for other public health domains such as epidemiology or health care planning. Our main objective is to extract knowledge units from this database in order to explore “Patient Care Trajectories”. Our approach aims at assisting domain experts with automated classification tools to define or detect particular groups of patients with similar health conditions, treatments, or journeys through the healthcare system. To achieve these tasks, we propose a methodology based on Formal Concept Analysis (FCA). From a theoretical point of view, our research focuses on the ability of FCA to deal with large amounts of data. In particular, we study ways of reducing the complexity of large concept lattices to make visualization and selection of the most interesting results easier. Our methods have been applied to data quality assessment of the PMSI in epidemiology [53] and to the comparison of diagnostic strategies [27].

Another line of research is data-driven ontology building. The idea is to reuse knowledge discovered during the FCA step to provide an ontology of Patient Care Trajectories (PCT) that supports reasoning tasks on patient profiles. Such an ontology could, for example, help to qualify a chronic disease made of a succession of pathological states.

KDDK in Chemical Reaction Databases

The mining of chemical reaction databases is an important task for at least two reasons: (i) the challenge that this task represents for KDDK, and (ii) the industrial needs that can be met whenever substantial results are obtained. Chemical reactions are complex data that may be modeled as undirected labeled graphs. They are the main elements on which synthesis in organic chemistry relies, and synthesis (and thus chemical reaction databases) is of prime importance in chemistry, but also in biology, drug design, and pharmacology. From a problem-solving point of view, synthesis in organic chemistry must be considered at two main levels of abstraction: a strategic level, where general synthesis methods are involved (a kind of meta-knowledge), and a tactical level, where specific chemical reactions are applied. One objective for improving computer-based synthesis in organic chemistry is to discover general synthesis methods from currently available chemical reaction databases, in order to design generic and reusable synthesis plans.

Preliminary research was carried out in the Orpailleur team, based on levelwise frequent itemset search and association rule extraction, and applied to standard chemical reaction databases. Given the results of this work, subsequent research has been carried out, this time involving a graph-mining process that extracts knowledge from chemical reaction databases directly from the molecular structures and the reactions themselves.

This research is currently under way, in collaboration with chemists and in accordance with the needs of the chemical industry. This year, once more, a number of substantial results have been obtained and presented at high-level conferences (major [9]) [30], [15].
