Section: New Results
Current Research and New Perspectives in Life Sciences
Participants : Yasmine Assess, Sid-Ahmed Benabderrahmane, Emmanuel Bresso, Matthieu Chavent, Marie-Dominique Devignes, Léo Gemthio, Anisha Ghoorah, Mehdi Kaytoue, Florence Le Ber, Vincent Leroux, Bernard Maigret, Jean-François Mari, Lazaros Mavridis, Nizar Messai, Amedeo Napoli, Dave Ritchie, Vishwesh Venkatraman, Malika Smaïl-Tabbone.
KDDK in Life Sciences
One of the major challenges of the post-genomic era is the analysis of terabytes of biological data stored in hundreds of heterogeneous databases (DBs). Extracting knowledge units from these large volumes of data would give meaning to the current data production effort in domains such as disease understanding, drug discovery, pharmacogenomics, and systems biology. The research reported here addresses these important issues and shows how KDDK spreads over such domains.
Virtual screening (VS) techniques are now widely recognized as valuable components of early drug discovery strategies since, when successful, they provide an excellent cost-to-efficiency ratio. In a high-throughput screening context (millions of candidates), VS techniques remain under-exploited. In particular, popular molecular docking programs are either too slow or considered insufficiently reliable compared with more expensive experimental protocols. One way to overcome these limitations is to couple several techniques in a funnel-like filtering process. Several filtering strategies can be set up in this context, as in the VSM-G software. VSM-G uses as its large-scale first filtering step a crude geometrical docking algorithm based on spherical harmonics. We have studied a knowledge-oriented approach that could complement this algorithm by reducing the number of false positives. The rationale of this approach is that patterns extracted from data on known active compounds can be used to filter out inactive compounds from chemical libraries. This approach was tested on the Liver X receptor (LXR), the Apelin receptor, and the C-Met receptor, which are all targets of therapeutic interest.
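The funnel principle can be sketched as a chain of increasingly selective predicates, each pruning the candidate set before the next, more expensive stage runs. The following is a purely illustrative sketch: the filter functions, thresholds, and compound fields are invented for the example and are not those of VSM-G.

```python
def shape_filter(compound):
    """Crude geometric pre-filter (stands in for the cheap first-stage
    docking step of a funnel; the threshold is a toy value)."""
    return compound["radius_of_gyration"] < 6.0

def knowledge_filter(compound):
    """Knowledge-based filter: keep compounds matching descriptor
    patterns mined from known actives (toy descriptor test here)."""
    return compound["logp"] <= 5.0 and compound["hbd"] <= 5

def screen(library, filters):
    """Run candidates through successive filters, keeping survivors."""
    candidates = list(library)
    for f in filters:
        candidates = [c for c in candidates if f(c)]
    return candidates

library = [
    {"id": "c1", "radius_of_gyration": 4.2, "logp": 3.1, "hbd": 2},
    {"id": "c2", "radius_of_gyration": 7.8, "logp": 2.0, "hbd": 1},
    {"id": "c3", "radius_of_gyration": 5.0, "logp": 6.5, "hbd": 4},
]
hits = screen(library, [shape_filter, knowledge_filter])
print([c["id"] for c in hits])  # only c1 survives both stages
```

The design point is that cheap filters run first over the whole library, so the expensive docking stages only ever see a small fraction of the initial candidates.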
A KDD approach for designing filters to improve virtual screening
Virtual screening has become an essential step in the early drug discovery process. It consists in using computational techniques to select, from chemical libraries, drug-like molecules acting on a biological target of therapeutic interest. We consider virtual screening as a particular form of KDD process. The knowledge units to be discovered concern the way a compound can be considered as a consistent ligand for a given target. The data from which knowledge has to be discovered derive from diverse sources, such as chemical, structural, and biological data related to ligands and their cognate targets. More precisely, an objective is to extract "filters" from chemical libraries and protein-ligand interactions. Three basic steps of a KDD process have been implemented. First, a model-driven data integration step is applied to appropriate heterogeneous data found in public databases. This facilitates the subsequent extraction of various datasets to be mined. In particular, for specific ligand descriptors, it allows transforming a multiple-instance problem into a single-instance one. In a second step, mining algorithms are applied to the datasets, and finally the most accurate knowledge units are assessed as new virtual screening filters. The experimental results obtained with a set of ligands of the hormone receptor LXR have been published.
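The mining step can be illustrated with a toy frequent-pattern filter: descriptor combinations that recur among known actives are extracted, and a library compound passes the filter if it exhibits at least one mined pattern. Descriptor names, supports, and pattern sizes below are illustrative assumptions, not the actual mined filters.

```python
from itertools import combinations
from collections import Counter

def frequent_patterns(actives, min_support, size=2):
    """Enumerate descriptor pairs occurring in at least `min_support`
    known active compounds (a toy stand-in for the mining step)."""
    counts = Counter()
    for desc in actives:
        for pat in combinations(sorted(desc), size):
            counts[pat] += 1
    return {p for p, c in counts.items() if c >= min_support}

def passes_filter(desc, patterns):
    """A compound passes if it exhibits at least one mined pattern."""
    return any(set(p) <= desc for p in patterns)

# Toy descriptor sets for hypothetical known actives:
actives = [{"aromatic_ring", "hbond_acceptor", "halogen"},
           {"aromatic_ring", "hbond_acceptor", "amide"},
           {"aromatic_ring", "hbond_acceptor"}]
patterns = frequent_patterns(actives, min_support=3)
print(patterns)  # {('aromatic_ring', 'hbond_acceptor')}

library = [{"aromatic_ring", "hbond_acceptor", "ester"},
           {"halogen", "amide"}]
print([passes_filter(d, patterns) for d in library])  # [True, False]
```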
Knowledge Discovery from Transcriptomic Data
This work concerns the interpretation of transcriptomic data from colorectal cancer samples and is the subject of an ongoing PhD thesis funded by INCa (Institut National du Cancer) in collaboration with Olivier Poch (IGBMC, Strasbourg). DNA microarray technologies make it possible to monitor the expression of several thousand genes in different situations. The expression levels measured for a gene across a set of situations define its gene expression profile. Usually, a functional analysis is then applied to genes with similar expression profiles. We have proposed a new approach based on a priori modeling of Differential Expression Profiles (DEPs) that takes into account the relations between the situations. Fuzzy logic is used for assigning genes to DEPs. Results on colorectal cancer data show that this modeling of DEPs makes it possible to relate biological functions to well-defined transcriptional behaviors.
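The fuzzy assignment idea can be sketched as follows: instead of a hard cut-off, each gene receives a membership degree in each profile, computed here from the log expression ratio through a sigmoid. The profiles, membership functions, and parameters are illustrative assumptions, not those of the published method.

```python
import math

def up_membership(log_ratio, slope=2.0, midpoint=1.0):
    """Fuzzy membership in the 'up-regulated' profile: a sigmoid of
    the log expression ratio (slope and midpoint are toy values)."""
    return 1.0 / (1.0 + math.exp(-slope * (log_ratio - midpoint)))

def assign_to_deps(log_ratio):
    """Assign a gene to differential expression profiles with fuzzy
    degrees rather than a crisp yes/no decision."""
    up = up_membership(log_ratio)
    down = up_membership(-log_ratio)
    unchanged = max(0.0, 1.0 - up - down)
    return {"up": up, "down": down, "unchanged": unchanged}

strong = assign_to_deps(2.0)              # strongly over-expressed gene
print(max(strong, key=strong.get))        # 'up'
flat = assign_to_deps(0.1)                # near-constant gene
print(max(flat, key=flat.get))            # 'unchanged'
```

The benefit over a threshold is graceful handling of borderline genes: a gene near the cut-off contributes partially to several profiles instead of being assigned arbitrarily to one.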
Further functional analysis of DEPs requires a flexible gene-gene similarity measure that takes domain knowledge into account as well as possible; here, this knowledge is mostly represented by the annotation vocabulary known as the Gene Ontology (GO). Various semantic similarity measures exist that consider both the semantic relationships between annotation terms and their information content. However, none of them yet takes into account the quality of gene annotations, which is reported as evidence codes in the public databases. We are currently testing a new similarity measure, defined in a vectorial framework inspired by information retrieval, that considers the semantic relationships between terms, their information content, and quality metadata.
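One possible shape for such a measure, sketched under loose assumptions (the evidence-code weights, information contents, and the vector construction below are invented for illustration and are not the measure under test): represent each gene as a GO-term vector weighted by information content and annotation quality, then compare genes by cosine similarity.

```python
import math

# Hypothetical quality weights for GO evidence codes: experimentally
# supported annotations (e.g. IDA) trusted more than electronic (IEA).
EVIDENCE_WEIGHT = {"EXP": 1.0, "IDA": 1.0, "TAS": 0.8, "IEA": 0.4}

def gene_vector(annotations, ic):
    """Build a GO-term vector for a gene: each coordinate is the
    term's information content scaled by its evidence-code weight."""
    vec = {}
    for term, evidence in annotations:
        w = ic[term] * EVIDENCE_WEIGHT.get(evidence, 0.2)
        vec[term] = max(vec.get(term, 0.0), w)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

ic = {"GO:0006915": 3.2, "GO:0008283": 2.1}   # toy information contents
g1 = gene_vector([("GO:0006915", "IDA"), ("GO:0008283", "IEA")], ic)
g2 = gene_vector([("GO:0006915", "IEA")], ic)
print(round(cosine(g1, g2), 3))
```

A full measure would also propagate terms along the GO hierarchy before building the vectors; this sketch only shows where quality metadata can enter the computation.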
Relational data mining applied to 3D protein patches for characterizing and predicting phosphorylation sites
An ongoing study concerns the prediction of phosphorylation sites through the design of models exploiting information on the 3D structure of proteins and logical relational data mining methods based on Inductive Logic Programming (ILP). Indeed, relational data mining appears to be a relevant way to extract knowledge units from 3D structures, and the prediction of phosphorylation sites constitutes a well-documented case study. During the nineties, several ILP success stories were reported on biological problems concerning the prediction of protein 3D structure from the primary sequence. The idea here is to use the same ILP methods to predict further biological phenomena. We are motivated by testing the ability of ILP techniques to provide explicit insights into the biological problem under study. Current results reveal interesting features of what constitutes a phosphorylation site, expressed as predicates describing the 3D patch surrounding the site.
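To make the flavour of such predicate-based output concrete, here is a toy relational view in Python (ILP systems work on logic clauses; the clause, predicates, residues, and distance threshold below are all invented for illustration, not learned results). The hypothetical clause mimics what ILP might induce: `site(S) :- exposed(S), neighbour(S, R, D), positive(R), D < 6.0`.

```python
# Ground facts describing 3D patches, as (predicate, args) tuples.
def holds(facts, pred, *args):
    """Check whether a ground atom is among the facts."""
    return (pred, *args) in facts

def is_phospho_site(site, facts):
    """Evaluate the illustrative learned clause on a candidate site."""
    if not holds(facts, "exposed", site):
        return False
    neighbours = [f for f in facts if f[0] == "neighbour"]
    return any(a == site and d < 6.0 and holds(facts, "positive", b)
               for (_, a, b, d) in neighbours)

facts = {
    ("exposed", "ser42"),
    ("positive", "arg45"),
    ("neighbour", "ser42", "arg45", 4.8),
    ("exposed", "thr10"),
    ("neighbour", "thr10", "leu12", 5.1),
}
print(is_phospho_site("ser42", facts))  # True: exposed, close Arg
print(is_phospho_site("thr10", facts))  # False: no positive neighbour
```

The point of the relational setting is exactly this readability: the learned model is a clause over named predicates, not an opaque score.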
Using FCA for analyzing biological data
FCA is the basic classification method used in two research topics: (i) the classification of biological DBs on the Web, and (ii) the analysis and classification of gene expression data (GED). In the first topic, the BioRegistry project aims at organizing metadata about biological DBs in order to ease classification and retrieval tasks. Metadata do not all have the same importance or the same structure. Accordingly, two main extensions of FCA have been designed. The first allows the introduction of dependencies between attributes, e.g. attribute hierarchies, which are taken into account during concept lattice construction. The second is aimed at handling many-valued contexts: the SimBA algorithm (for "Similarity-Based Complex Data Analysis") builds a many-valued concept lattice using similarity between attribute values. The results of these two extensions of FCA are substantial and have also given birth to new research perspectives on similarity and pattern structures, as explained in § 6.1.1.
Another research work involving FCA focuses on the analysis of gene expression data (GED) for discovering groups of co-expressed genes. Microarray biotechnology makes it possible to measure the expression of a gene (related to its activity) in a given biological situation. A gene expression profile (GEP) is considered as a numerical m-dimensional vector describing the behavior of the gene. A gene expression dataset is a collection of n gene expression profiles and is represented as an n×m numerical table. FCA is applied for analyzing and interpreting these data, knowing that genes with similar expression profiles may participate in the same biological process. Accordingly, formal concepts in the resulting concept lattice represent sets of genes showing similar variations of expression across biological situations. Substantial results have been obtained by applying FCA to a real dataset related to the fungus Laccaria bicolor, for studying the interaction between this fungus and poplar (a very important tree for the wood industry).
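On a binarized context (e.g. gene over-expressed in a situation or not), formal concepts are exactly the maximal gene/situation rectangles. A minimal brute-force sketch (fine for toy contexts; real GED analysis scales the data and uses dedicated lattice-construction algorithms):

```python
from itertools import chain, combinations

def concepts(context):
    """Enumerate all formal concepts (extent, intent) of a binary
    context {object: set of attributes} by closing every attribute
    subset. Exponential; for illustration only."""
    attrs = set().union(*context.values())
    found = set()
    subsets = chain.from_iterable(
        combinations(sorted(attrs), k) for k in range(len(attrs) + 1))
    for candidate in subsets:
        ext = frozenset(g for g, a in context.items()
                        if set(candidate) <= a)
        if ext:  # closure: attributes shared by the whole extent
            itt = frozenset(set.intersection(*(context[g] for g in ext)))
        else:
            itt = frozenset(attrs)
        found.add((ext, itt))
    return found

# Toy gene x situation context: gene is over-expressed in situation
context = {
    "g1": {"s1", "s2"},
    "g2": {"s1", "s2", "s3"},
    "g3": {"s3"},
}
for ext, itt in sorted(concepts(context), key=lambda c: sorted(c[1])):
    print(sorted(ext), sorted(itt))
```

Here the concept ({g1, g2}, {s1, s2}) reads as: genes g1 and g2 share the same over-expression behavior in situations s1 and s2, i.e. a candidate group of co-expressed genes.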
Mining Biological Data with HMMs
In this research direction for KDDK, we have designed a new data mining method combining stochastic analysis (Hidden Markov Models, HMMs) and combinatorial methods for discovering new transcription factor binding sites in bacterial genome sequences. Sigma factor binding sites (SFBSs) were described as patterns corresponding to DNA motifs of bacterial promoters. High-order HMMs, in which the hidden process is a second-order Markov chain, were applied to the genomes of the bacteria Streptomyces coelicolor and Bacillus subtilis. Short DNA sequences were extracted with the HMMs and clustered with a hierarchical classification algorithm. Selected motif consensuses were then combined with over-represented motifs found by a word enumeration algorithm. This new mining methodology, applied to several genomes, was able to retrieve known SFBSs and to suggest new potential transcription factor binding sites.
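The HMM labelling step can be sketched with a minimal first-order Viterbi decoder (the published work uses second-order hidden chains on real genomes; all states, transition and emission probabilities below are toy values chosen only to show how positions get labelled as background vs motif).

```python
import math

STATES = ["bg", "motif"]
TRANS = {"bg": {"bg": 0.8, "motif": 0.2},
         "motif": {"bg": 0.3, "motif": 0.7}}
EMIT = {"bg": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "motif": {"A": 0.02, "C": 0.02, "G": 0.48, "T": 0.48}}
START = {"bg": 0.9, "motif": 0.1}

def viterbi(seq):
    """Most probable hidden state path, in log space."""
    v = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]])
          for s in STATES}]
    back = []
    for x in seq[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            prev, p = max(((r, v[-1][r] + math.log(TRANS[r][s]))
                           for r in STATES), key=lambda t: t[1])
            scores[s] = p + math.log(EMIT[s][x])
            ptr[s] = prev
        v.append(scores)
        back.append(ptr)
    state = max(v[-1], key=v[-1].get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# The G/T-rich run in the middle gets labelled 'motif':
print(viterbi("ACAGTGTTAC"))
```

In the actual pipeline, the segments decoded as motif states are the short DNA sequences that are then clustered and confronted with over-represented words.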
In another field of investigation, namely agricultural landscapes, methods for identifying and describing meaningful landscape patterns play an important role in understanding the interaction between landscape organization and ecological processes. We have proposed an innovative stochastic method for modeling agricultural landscape organization, in which temporal regularities in land use are first identified as recognized land-use successions before these successions are located in the landscape. These time-space regularities are extracted using a data mining method based on HMMs. We applied this method to the Niort Plain (western France). The implications and perspectives of such an approach, which links the temporal and spatial dimensions of agricultural organization, have been investigated by assessing the relationship between agricultural landscape patterns and ecological issues.
Structural Systems Biology
High Performance Algorithms for Structural Systems Biology (HPASSB)
The HPASSB project started in January 2009, following Dave Ritchie's successful application for funding to the ANR Chaires d'Excellence 2008 (Senior Courte Durée) programme. The overall aim of HPASSB is to help build a new Centre of Excellence in France in the emerging discipline of structural systems biology. The project complements existing competencies in the Orpailleur team, represented by M.-D. Devignes (CR CNRS), who coordinates the MBI project (Modelling Biomolecules and their Interactions, http://bioinfo.loria.fr ), Malika Smaïl-Tabbone (MCU Nancy University), who works on data integration and relational data mining approaches, and Bernard Maigret (DR CNRS), who has extensive experience in molecular dynamics and virtual screening. We are currently developing advanced computing techniques for molecular shape representation, protein-protein docking, protein-ligand docking, high-throughput virtual drug screening, and knowledge discovery in databases dedicated to protein-protein interactions.
Accelerating Protein Docking Calculations Using Graphics Processors
In this framework, we have recently adapted the Hex protein docking software to use modern graphics processors (GPUs) to carry out the expensive FFT part of a docking calculation. Compared to a single conventional central processor (CPU), a high-end GPU gives a speed-up of a factor of 45 or more. Furthermore, the Hex code has been re-written to use multi-threading techniques in order to distribute the calculation over as many GPUs and CPUs as are available. Thus, a calculation which formerly took many minutes or several hours can now be performed in a matter of seconds on a modern desktop computer. This advance will facilitate future docking-based studies of large-scale protein interaction networks and the assembly of multi-protein systems. We will present this work as a poster entitled "Fast FFT Protein-Protein Docking on Graphics Processors" at the 4th CAPRI Evaluation Meeting in Barcelona in December 2009.
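The FFT trick behind grid-based docking scores can be sketched as follows: the overlap score for every relative translation of two 3D grids is obtained in one pass as a cross-correlation computed via FFTs, instead of a direct scan over all shifts. (Hex correlates spherical polar Fourier coefficients rather than Cartesian grids, but the acceleration principle is the same, and it is this FFT stage that maps well onto GPUs.) The grids below are random toy data.

```python
import numpy as np

def fft_translation_scores(receptor, ligand):
    """Correlation score of the ligand against the receptor for every
    cyclic translation, via the convolution theorem: the peak marks
    the best-overlapping shift."""
    F_r = np.fft.fftn(receptor)
    F_l = np.fft.fftn(ligand)
    return np.real(np.fft.ifftn(np.conj(F_r) * F_l))

rng = np.random.default_rng(0)
receptor = rng.random((8, 8, 8))
# A "ligand" that is the receptor grid shifted by a known translation:
ligand = np.roll(receptor, shift=(2, 1, 3), axis=(0, 1, 2))

scores = fft_translation_scores(receptor, ligand)
best = np.unravel_index(np.argmax(scores), scores.shape)
print(best)  # recovers the (2, 1, 3) shift that aligns the grids
```

One FFT pass scores all N³ translations at once, which is why re-implementing this stage on a GPU pays off so directly.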
3D-Blast: A New Approach for Protein Structure Alignment and Clustering
We have recently developed a new sequence-independent protein structure alignment approach, which we call 3D-Blast, based on the spherical polar Fourier (SPF) correlation approach used in the Hex protein docking software. The utility of this approach has been demonstrated by clustering subsets of the CATH protein structure classification database for each of the four main CATH fold types, and by searching the entire CATH database of some 12,000 structures using several protein structures as queries. Overall, the automatic SPF clustering approach agrees very well with the expert-curated CATH classification, and ROC-plot analyses of the database searches show that the approach has very high precision and recall. Database query times can be reduced considerably by using a simple rotationally-invariant pre-filter in tandem with a more sensitive rotational search, with little or no reduction in accuracy. Hence it should soon be possible to perform on-line 3D structural searches on interactive time-scales.
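The idea of a rotationally-invariant pre-filter can be sketched with a standard property of spherical-harmonic expansions: for coefficients c(l,m), the per-degree power p(l) = Σ_m |c(l,m)|² is unchanged by rotation, so comparing power spectra is a cheap first test before any rotational search. The coefficient values below are toy numbers, not real SPF shape expansions.

```python
import numpy as np

def power_spectrum(coeffs, l_max):
    """Rotation-invariant descriptor: total energy in each degree l
    of a spherical-harmonic coefficient dict {(l, m): value}."""
    return np.array([
        sum(abs(coeffs.get((l, m), 0.0)) ** 2
            for m in range(-l, l + 1))
        for l in range(l_max + 1)])

def prefilter_distance(c1, c2, l_max):
    """Distance between invariant descriptors; small values mean the
    pair is worth a full (more expensive) rotational comparison."""
    return float(np.linalg.norm(power_spectrum(c1, l_max)
                                - power_spectrum(c2, l_max)))

shape_a = {(0, 0): 1.0, (1, -1): 0.3, (1, 0): 0.4}
# shape_b redistributes degree-1 energy across m, as a rotation would,
# so its descriptor matches shape_a's:
shape_b = {(0, 0): 1.0, (1, 1): 0.5}
shape_c = {(0, 0): 1.0, (1, 0): 0.9}   # genuinely different shape

print(prefilter_distance(shape_a, shape_b, l_max=1))  # ~0: kept
print(prefilter_distance(shape_a, shape_c, l_max=1))  # large: rejected
```

Because the descriptor ignores orientation entirely, most non-matching database entries can be rejected without performing any rotational search at all.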
KDD-Dock: Protein Docking Using Knowledge-Based approaches
Protein docking is the difficult computational task of predicting how a pair of three-dimensional protein structures come together to form a complex. There is considerable interest in developing improved ab initio techniques which can make protein-protein docking predictions using only knowledge of the proteins' three-dimensional structures. The Hex docking program developed by Dave Ritchie is one such example. However, as structural genomics initiatives continue to populate the space of protein 3D structures, and as several on-line databases of protein interactions have recently become available, using structural database systems to perform docking by homology will become an increasingly powerful approach to predicting protein interactions. We recently used the SCOPPI and 3DID protein interaction databases to help make some very good predictions for two of the recent CAPRI target complexes, and we are now working to incorporate additional knowledge from other databases and to automate the overall approach. This work will be presented as a poster at the 4th CAPRI Evaluation Meeting in Barcelona in December 2009.