Team Orpailleur

Section: Software

KDD Systems

The Coron Platform

Keywords: data mining, frequent itemsets, frequent closed itemsets, frequent generators, association rule extraction, rare itemsets.

Participants: Mehdi Kaytoue [contact person], Florent Marcuola, Amedeo Napoli, Yannick Toussaint.

The Coron platform [95] is a KDD toolkit organized around three main components: (i) Coron-base, (ii) AssRuleX, and (iii) pre- and post-processing modules. The software has been registered at the “Agence pour la Protection des Programmes” (APP) and is freely available.

The Coron-base component includes a complete collection of data mining algorithms for extracting different kinds of itemsets, e.g. frequent itemsets, frequent closed itemsets, and frequent generators. The algorithms include APriori, APriori-Close, Close, Pascal, Eclat, and Charm, as well as original algorithms such as Pascal+, ZART, Carpathia, Eclat-Z, and Charm-MFI. AssRuleX (Association Rule eXtractor) generates different sets of association rules from itemsets, e.g. minimal non-redundant association rules, the generic basis, and the informative basis. The Coron-base component also contains algorithms for extracting rare itemsets and rare association rules, e.g. APriori-rare, MRG-EXP, ARIMA, and BTB.
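To make the three kinds of itemsets concrete, here is a minimal sketch in Python. It uses brute-force enumeration on a toy dataset and is not one of the Coron algorithms listed above; the data and all function names are assumptions chosen for illustration.

```python
from itertools import combinations

# Toy transaction database (hypothetical data, not taken from Coron).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"b", "c"},
    {"a", "b", "c"},
    {"b"},
]


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Enumerate all non-empty itemsets whose support reaches `min_support`."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


def closed_itemsets(frequent):
    """Keep the itemsets that have no proper superset with the same support."""
    return {
        i: s
        for i, s in frequent.items()
        if not any(i < j and s == sj for j, sj in frequent.items())
    }


def generators(frequent, transactions):
    """Keep the itemsets that have no proper subset with the same support.

    Checking the subsets obtained by removing one item is enough, because
    support can only grow when items are removed. The empty itemset is
    used as a subset of singletons but is not itself reported.
    """
    result = {}
    for itemset, s in frequent.items():
        subsets = (itemset - {x} for x in itemset)
        if all(support(sub, transactions) > s for sub in subsets):
            result[itemset] = s
    return result
```

With a minimum support of 2, all seven non-empty itemsets over {a, b, c} are frequent in this toy dataset; the closed ones are {b}, {a, b}, {b, c}, and {a, b, c}, and the non-empty generators are {a}, {c}, and {a, c}.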

The Coron system supports the whole life cycle of a data mining task and provides modules for cleaning the input dataset and for reducing its size if necessary. The RuleMiner module facilitates the interpretation and the filtering of the extracted rules. The association rules can be filtered by (i) attribute, (ii) support, and/or (iii) confidence. It is also possible to color the most important attributes in the list of rules, which helps in finding the most interesting rules from a given viewpoint.
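The three filtering criteria can be illustrated with a short Python sketch. This is not RuleMiner's actual interface: the `Rule` class, the `filter_rules` function, and the sample rules are all hypothetical and only show how the criteria combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical association rule: antecedent -> consequent."""
    antecedent: frozenset
    consequent: frozenset
    support: float      # fraction of transactions containing both sides
    confidence: float   # support(rule) / support(antecedent)


def filter_rules(rules, attribute=None, min_support=0.0, min_confidence=0.0):
    """Keep rules that mention `attribute` (if given) and meet both thresholds."""
    kept = []
    for rule in rules:
        mentioned = rule.antecedent | rule.consequent
        if attribute is not None and attribute not in mentioned:
            continue
        if rule.support < min_support or rule.confidence < min_confidence:
            continue
        kept.append(rule)
    return kept


# Made-up rules for illustration only.
RULES = [
    Rule(frozenset({"smoker"}), frozenset({"hypertension"}), 0.30, 0.80),
    Rule(frozenset({"age>60"}), frozenset({"hypertension"}), 0.10, 0.90),
    Rule(frozenset({"smoker"}), frozenset({"obesity"}), 0.25, 0.40),
]

selected = filter_rules(RULES, attribute="smoker", min_support=0.2, min_confidence=0.5)
```

Here only the first rule is selected: the second does not mention the attribute and the third falls below the confidence threshold.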

The Coron toolkit is developed entirely in Java, is operational, and has already been used within several research projects, e.g. for mining the Stanislas cohort, or in the CabamakA project (which is part of the Kasimir system, see § 4.2). An extension of the system, named BioCoron, is aimed at taking into account gene expression [82].

The CarottAge system

Keywords: Hidden Markov Models, stochastic process.

Participants: Florence Le Ber, Jean-François Mari [contact person].

CarottAge is a data mining system, freely available (GPL license) and based on second-order Hidden Markov Models. It provides a synthetic representation of temporal and spatial data.
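As a generic illustration of what "second order" means, the following Python sketch computes the likelihood of an observation sequence when each hidden state depends on the two previous states. It is not CarottAge's implementation; the parameter names (`init`, `trans1`, `trans2`, `emit`) and the toy values are assumptions.

```python
def forward_likelihood(obs, init, trans1, trans2, emit):
    """Likelihood of `obs` under a second-order HMM (forward recursion).

    init[i]         = P(s_1 = i)
    trans1[i][j]    = P(s_2 = j | s_1 = i)
    trans2[i][j][k] = P(s_t = k | s_{t-2} = i, s_{t-1} = j), for t >= 3
    emit[i][o]      = P(observation o | state i)
    """
    n = len(init)
    alpha1 = [init[i] * emit[i][obs[0]] for i in range(n)]
    if len(obs) == 1:
        return sum(alpha1)
    # alpha[i][j]: probability of the observations so far, ending in states (i, j).
    alpha = [
        [alpha1[i] * trans1[i][j] * emit[j][obs[1]] for j in range(n)]
        for i in range(n)
    ]
    for o in obs[2:]:
        # Move the state pair (i, j) forward to (j, k), summing out i.
        alpha = [
            [
                sum(alpha[i][j] * trans2[i][j][k] for i in range(n)) * emit[k][o]
                for k in range(n)
            ]
            for j in range(n)
        ]
    return sum(sum(row) for row in alpha)


# Toy parameters: 2 hidden states, 2 observation symbols (made-up values).
INIT = [0.6, 0.4]
TRANS1 = [[0.7, 0.3], [0.2, 0.8]]
TRANS2 = [
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.3, 0.7], [0.1, 0.9]],
]
EMIT = [[0.8, 0.2], [0.25, 0.75]]

likelihood = forward_likelihood([0, 1, 1], INIT, TRANS1, TRANS2, EMIT)
```

The recursion works on pairs of consecutive states, which is the usual way of reducing a second-order chain to a first-order one over a larger state space.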

In applications, the system aims at building a partition, called the hidden partition, in which the inherent noise of the data is removed as much as possible. The CarottAge system takes into account: (i) the various shapes of the territories, which are not represented by square matrices of pixels, (ii) the use of pixels of different sizes with composite attributes representing the agricultural parcels and their attributes, (iii) the irregular neighborhood relation between those pixels, and (iv) the use of shape files to facilitate the interaction with a GIS (geographical information system).

CarottAge is currently used by INRA researchers interested in mining the changes in territories related to the loss of biodiversity (projects ANR BiodivAgrim and ACI Ecoger) and/or water contamination.

GenExp-LandSiTes: KDD and simulation

Keywords: Simulation, Hidden Markov Models.

Participants: Florence Le Ber [contact person], Jean-François Mari.

In the framework of the project “Impact des OGM” initiated by the French ministry of research, we have developed a software tool called GenExp-LandSiTes for simulating two-dimensional random landscapes and then studying the dissemination of plant transgenes. The GenExp-LandSiTes system is linked to the CarottAge system and is based on computational geometry and spatial statistics. The simulated landscapes are given as input to programs such as Mapod-Maïs or GeneSys-Colza for studying transgene diffusion [57] (major [7]). The latest version of GenExp allows interaction with R subroutines and has been released under a GPL license.

This work is now part of an INRA-INRIA project about landscape modeling, PAYOTE (2009-10), that gathers eleven research teams of agronomists, ecologists, statisticians, and computer scientists.

KDD systems in Biology

Participants: Marie-Dominique Devignes [contact person], Nizar Messai, Malika Smaïl-Tabbone.

Participants: Marie-Dominique Devignes [contact person], Birama Ndiaye, Malika Smaïl-Tabbone.

Participants: Marie-Dominique Devignes [contact person], Bernard Maigret, Malika Smaïl-Tabbone.

Automatic extraction of metadata for biological database retrieval and discovery (BioRegistry).

A growing number of biological databases deal with the huge amount of data produced by genomic and post-genomic research. A well-maintained, searchable directory is therefore needed to make full use of these databases. The BioRegistry repository aims at associating content metadata with biological databases for the purpose of retrieval or discovery. It is automatically generated from a publicly available list of biological databases (The Molecular Biology Database Collection published in Nucleic Acids Research). The content metadata are terms belonging to a biomedical thesaurus. Several querying modalities have been implemented, including a search by semantic similarity. A classification method based on extended formal concept analysis allows a user to browse and discover databases through the BioRegistry. A publication on this work has been accepted in the International Journal of Metadata, Semantics and Ontology. The BioRegistry repository is available online.

MOdel-driven Data Integration for Mining (MODIM).

A position of “Ingénieur Jeune Diplômé INRIA” has been granted to the Orpailleur team to develop the MODIM software (MOdel-driven Data Integration for Mining). This data integration software works in three steps: (i) building a data model that takes into account mining requirements and existing resources; (ii) specifying a workflow for collecting data, leading to the specification of wrappers for populating a target database; (iii) defining views on the data model for identified mining scenarios. MODIM was inspired by previous work on an Approach for Candidate Gene Retrieval (ACGR) (major [11]).

Graphical interface for the Virtual Screening platform (Virtual Screening Manager for the computing grid: VSM-G).

The graphical interface for the virtual screening platform VSM-G is currently in use and was declared as an INRIA APP at the beginning of 2009.

