The aim of the project MODBIO is to develop computational models for molecular and cell biology. We are focusing on two types of problems:
Determining the structure of biological macromolecules,
Discovering and understanding the function of biological systems.
We approach these questions by combining techniques from constraint programming, combinatorial optimization, hybrid systems, and statistical learning theory.
Sequence and structural alignment, phylogeny.
Determination and analysis of macromolecular envelopes.
Protein structure prediction and protein docking.
Modeling alternative splicing regulation.
Metabolic pathway analysis.
Participation in the "Génopole Strasbourg Alsace-Lorraine"
Participation in the Bioinformatics project of the Région Lorraine
Participation in the ACI project GENOTO3D
Participation in the ARC INRIA "Process calculi and molecular networks"
Participation in the "Décrypthon" programme
Various national and international collaborations
Laboratoire «Maturation des ARN et Enzymologie Moléculaire» (MAEM), UMR 7567, Nancy
Laboratoire de Cristallographie, LCM3B, Nancy
Institut de Biologie et Chimie des Protéines, IBCP, Lyon
Institut Supérieur d'Agriculture, ISA, Beauvais, France
Center for Bioinformatics, Saarbrücken, Germany
DFG Research Center Matheon, Berlin, Germany
Institute of Mathematical Problems in Biology, Russian Academy of Sciences
University of California, Irvine, USA
In constraint programming, the user programs with constraints, i.e., he or she describes a problem by a set of constraints, which are connected by combinators such as conjunction, disjunction, or temporal operators (e.g., always). Each constraint gives some partial information about the state of the system to be studied. Constraint programming systems allow one to deduce new constraints from the given ones and to compute solutions, i.e., values for the variables that satisfy all constraints simultaneously.
One of the main goals of constraint programming is to develop programming languages that allow one to express constraint problems in a natural way, and to solve them efficiently.
In our work, we are first interested in constraint problems over finite domains. In this case, the domain of each variable (the set of values it may take) is a finite set of integers. Theory tells us that most constraint problems over finite domains are NP-hard, which means that there is little hope of solving them by algorithms polynomial in the size of the input. In practice, these problems are handled by tree search methods which successively try different valuations of the variables until a solution is found. Because of the exponential number of possible combinations, it is crucial to reduce the search space as much as possible, i.e., to eliminate a priori as many valuations as possible.
There exist two generic methods for solving such problems. The first one is classical integer linear programming (see also Sect. ), which has been studied in mathematical programming and operations research for more than 40 years. Here, constraints are linear equations and inequalities over the integers. In order to reduce the search space, one typically uses the linear relaxation of the constraint set. Equations and inequalities are first solved over the real numbers, which is much easier; then the information obtained is used to prune the search tree.
The second method is finite domain constraint programming, which arose in the last 15 years by combining ideas from declarative programming languages and constraint satisfaction techniques in artificial intelligence. In contrast to integer linear optimization, one uses, in addition to simple arithmetic constraints, more complex constraints, which are called symbolic constraints. For instance, the symbolic constraint alldifferent(x_1, ..., x_n) expresses that the values of the variables x_1, ..., x_n must be pairwise distinct. Such a constraint is difficult to express in a compact way using only linear equations and inequalities. Symbolic constraints are handled individually by specific filtering algorithms that reduce the domains of the variables. This information is propagated to other constraints, which may further reduce the domains.
A state-of-the-art survey of finite domain constraint programming, with special emphasis on its relation to integer linear programming can be found in .
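As a purely illustrative sketch (not part of the team's solvers), the basic filtering-and-propagation idea behind the alldifferent constraint can be rendered in a few lines of Python: whenever a variable's domain shrinks to a single value, that value is removed from the domains of the other variables, and the process is iterated to a fixed point.

```python
def all_different_propagate(domains):
    """Naive filtering for the alldifferent constraint.

    domains: list of sets of candidate integer values, one per variable.
    Each fixed value (singleton domain) is removed from the domains of
    all other variables, until a fixed point is reached.
    Returns False if some domain becomes empty (no solution exists).
    """
    changed = True
    while changed:
        changed = False
        for i, d in enumerate(domains):
            if len(d) == 1:
                v = next(iter(d))
                for j, e in enumerate(domains):
                    if j != i and v in e:
                        e.discard(v)
                        if not e:
                            return False  # inconsistent: empty domain
                        changed = True
    return True

# Example: x1 in {1}, x2 in {1,2}, x3 in {1,2,3}.
doms = [{1}, {1, 2}, {1, 2, 3}]
consistent = all_different_propagate(doms)
# propagation fixes x2 = 2 and then x3 = 3
```

Real filtering algorithms for alldifferent are far stronger (matching-based arc consistency); this sketch only shows how domain reductions trigger further reductions in other constraints.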
In concurrent constraint programming (cc), different computation processes may run concurrently. Interaction is possible via the constraint store. The store contains all the constraints currently known about the system. A process may tell the store a new constraint, or ask the store whether some constraint is entailed by the information currently available, in which case further action is taken.
Hybrid concurrent constraint programming (Hybrid cc) is an extension of concurrent constraint programming which allows one to model and simulate the temporal evolution of hybrid systems, i.e., systems that exhibit both discrete and continuous state changes. Constraints in Hybrid cc may be both algebraic and differential equations. State changes can be specified using the combinators of concurrent constraint programming and default logic. Hybrid cc is well-suited to modeling dynamic biological systems, as shown in .
Statistical learning theory is one of the fields of inferential statistics whose bases were established by V.N. Vapnik in the late 1960s. The goal of this theory is to specify the conditions under which it is possible to «learn» from empirical data obtained by random sampling. Learning amounts to solving a problem of function or model selection. Basically, given a task characterized by a joint probability distribution on pairs made up of observations and labels, and a class of functions, usually of infinite cardinality, the goal is to find in the class a function with optimal performance. Training can thus be reformulated as an optimization problem. In many cases, the objective function is related to the capacity of the class of functions. The learning tasks considered belong to one of the three following areas: pattern recognition (discriminant analysis), function approximation (regression), and density estimation.
This theory considers more specifically two inductive principles. The first one, the empirical risk minimization (ERM) principle, consists in minimizing the training error. If the sample is small, it is replaced by the structural risk minimization (SRM) principle, which consists in minimizing an upper bound on the expected risk (generalization error), a bound sometimes called a guaranteed risk. This latter principle is implemented in the training algorithms of support vector machines (SVMs), which currently constitute the state of the art for numerous pattern recognition problems.
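The ERM principle can be made concrete with a toy sketch (purely illustrative, unrelated to the team's M-SVM software): over a finite class of threshold classifiers, pick the one with the smallest training error on a labeled sample.

```python
import random

def empirical_risk(classifier, sample):
    """Fraction of training examples misclassified (the training error)."""
    return sum(classifier(x) != y for x, y in sample) / len(sample)

def erm(classifiers, sample):
    """Empirical risk minimization over a finite class of functions:
    return the classifier with the smallest training error."""
    return min(classifiers, key=lambda f: empirical_risk(f, sample))

# Toy task: the true label is 1 iff x >= 0.6; the class contains
# the threshold functions x -> int(x >= t) for t = 0.0, 0.1, ..., 1.0.
random.seed(0)
sample = [(x, int(x >= 0.6)) for x in (random.random() for _ in range(100))]
classifiers = [lambda x, t=t / 10: int(x >= t) for t in range(11)]
best = erm(classifiers, sample)
```

SRM differs in that it would add a capacity-dependent penalty to the training error before minimizing, rather than minimizing the training error alone.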
SVMs are connectionist models conceived to compute indicator functions, to perform regression, or to estimate densities. They were introduced during the last decade by Vapnik and co-workers, as nonlinear extensions of the maximal margin hyperplane. Their main advantage is that they can avoid overfitting when the size of the sample is small.
``Combinatorial optimization is a lively field of applied mathematics, combining techniques from combinatorics, linear programming, and the theory of algorithms, to solve optimization problems over discrete structures''. A combinatorial optimization problem can be defined as follows: we are given a ground set N and consider a finite collection of subsets, say S_1, ..., S_m. For each subset S_k there is an objective function value f(S_k), typically a linear function over the elements of S_k. The task is to find the subset S_k that minimizes f(S_k). Typically, the feasible subsets are represented by inclusion or exclusion of members such that they satisfy certain conditions. Well-known examples of combinatorial optimization problems are assignment, covering, cutting stock, knapsack, matching, packing, partitioning, routing, sequencing, scheduling (jobs), shortest path, spanning tree, and traveling salesman problems.
This then becomes a special class of integer programs (IP) whose decision variables are binary valued: x_i = 1 if the i-th element is in the optimal solution; otherwise, x_i = 0. In this case, feasible subsets have to be expressed by linear constraints. IP formulations are not always easy, and often there is more than one formulation, some better than others. Many good formulations have exponential size.
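To fix ideas, here is a deliberately naive sketch of such a 0/1 program on a tiny set-cover instance (the instance and all names are invented for illustration; exhaustive enumeration is used, whereas real solvers prune the search via linear relaxations and branch-and-bound):

```python
from itertools import product

def solve_binary_ip(costs, constraints):
    """Exhaustive solution of a tiny 0/1 program: minimize sum(c_i * x_i)
    subject to user-supplied feasibility predicates on the 0/1 vector x.
    Exponential in the number of variables -- illustration only."""
    best_x, best_val = None, float("inf")
    for x in product((0, 1), repeat=len(costs)):
        if all(ok(x) for ok in constraints):
            val = sum(c * xi for c, xi in zip(costs, x))
            if val < best_val:
                best_x, best_val = x, val
    return best_x, best_val

# Toy set-cover instance: cover elements {a, b, c} with candidate subsets.
subsets = [{"a", "b"}, {"b", "c"}, {"a", "c"}]
costs = [3, 2, 2]

def covers_all(x):
    covered = set()
    for s, xi in zip(subsets, x):
        if xi:
            covered |= s
    return covered == {"a", "b", "c"}

x, val = solve_binary_ip(costs, [covers_all])
# picking the second and third subsets covers everything at cost 4
```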
Molecular biology is concerned with the study of three types of biological macromolecules: DNA, RNA, and proteins. Each of these molecules can initially be viewed as a string on a finite alphabet: DNA and RNA are nucleic acids made up of nucleotides A,C,G,T and A,C,G,U, respectively. Proteins are sequences of amino acids, which may be represented by an alphabet of 20 letters.
Molecular biology studies the information flow from DNA to RNA, and from RNA to proteins. In a first step, called transcription, a DNA string (``gene'') is transcribed into messenger RNA (mRNA). In the second step, called translation, the mRNA is translated into a protein, where each triplet of nucleotides encodes one amino acid (the ``genetic code''). During transcription, an intermediate maturation step can occur, mainly in eukaryotic cells: in the so-called splicing process, introns are removed from the pre-messenger RNA, and the remaining exons are concatenated, yielding the mature RNA molecule.
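The transcription and translation steps just described can be sketched in a few lines of Python (only the four codons needed by the example are included; the full genetic code has 64 entries):

```python
# Minimal fragment of the standard genetic code (only the codons used below).
GENETIC_CODE = {
    "AUG": "M",  # methionine (start codon)
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "UAA": "*",  # stop codon
}

def transcribe(dna):
    """Transcription: the DNA coding strand is copied into mRNA (T -> U)."""
    return dna.replace("T", "U")

def translate(mrna):
    """Translation: read the mRNA in triplets (codons) until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = GENETIC_CODE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGTTTGGCTAA")  # -> "AUGUUUGGCUAA"
protein = translate(mrna)          # -> "MFG"
```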
Biological macromolecules are not just sequences of nucleotides or amino acids. They are complex three-dimensional objects. DNA shows the famous double-helix structure. RNA and proteins fold into complex three-dimensional structures, which depend on the underlying sequence. RNA is a single-stranded chain of nucleotides. However, a nucleotide in one part of the molecule can base-pair with a nucleotide in another part, following the Watson-Crick complementarity rules. This results in a folding of the molecule. The secondary structure of RNA indicates the set of base pairings in the three-dimensional structure of the molecule. This information can be represented by a graph.
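A classical illustration of how Watson-Crick complementarity constrains RNA folding is base-pair maximization by dynamic programming (a Nussinov-style recursion). This sketch is illustrative only: realistic secondary structure prediction relies on thermodynamic energy models rather than simple pair counting.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of Watson-Crick base pairs in an RNA sequence
    (Nussinov-style dynamic programming). min_loop is the minimum number
    of unpaired bases required inside a hairpin loop."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                # base j left unpaired
            for k in range(i, j - min_loop):   # base j paired with base k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0

nussinov("GGGAAAUCCC")  # three G-C pairs close a hairpin
```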
Proteins have several levels of structure. Above the primary sequence is the secondary structure, which involves three basic types: α-helices, β-sheets, and structure elements that are neither helices nor sheets, called loops. A domain of a protein is a combination of secondary structure elements with some specific function. It contains an active site where an interaction with an external molecule may happen. A protein may have one or several domains.
The ultimate goal of molecular biology is to understand the function of biological macromolecules in the life of the cell. Function results from the interaction between different macromolecules, and depends on their structure. The overall challenge is to make the leap from sequence, through structure, to function.
X-ray structure analysis is the main tool to establish the three-dimensional atomic structure of biological macromolecules and their complexes. The determination of a structure in X-ray crystallography passes through several stages:
purification and crystallization of the object under study (a protein, DNA, RNA, virus, or a huge macromolecular complex, such as ribosome or lipoprotein particles);
X-ray experiment (usually at synchrotron accelerators); data collection (up to a million independent observations) and their primary processing;
solving the inverse problem of diffraction theory, i.e., finding the electron density distribution in the studied object and interpreting it in terms of atoms.
A key problem of X-ray structure analysis is the so-called phase problem. In an X-ray experiment, one can measure only the magnitudes of the complex Fourier coefficients of the electron density distribution under study, but not their phases. Half of the necessary information is therefore lost, and must be restored by other means.
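The information loss can be illustrated with a toy one-dimensional numerical experiment (illustrative only, using NumPy): keeping the Fourier magnitudes of a "density" but scrambling its phases yields a completely different density with exactly the same measurable magnitudes.

```python
import numpy as np

# Toy illustration of the phase problem: two very different "densities"
# share exactly the same Fourier magnitudes.
rng = np.random.default_rng(0)
rho = rng.random(64)                      # a 1-D "electron density"
F = np.fft.rfft(rho)                      # complex Fourier coefficients

# Keep the measured magnitudes but scramble the (unmeasured) phases.
random_phases = np.exp(1j * rng.uniform(0, 2 * np.pi, F.shape))
random_phases[0] = 1.0                    # keep the mean term real
random_phases[-1] = 1.0                   # keep the Nyquist term real
F_scrambled = np.abs(F) * random_phases
rho_wrong = np.fft.irfft(F_scrambled, n=rho.size)

# Same magnitudes, very different density:
same_magnitudes = np.allclose(np.abs(np.fft.rfft(rho_wrong)), np.abs(F))
```

Since rho and rho_wrong are indistinguishable from the magnitudes alone, the phases must be restored by other means (direct methods, molecular replacement, experimental phasing, etc.).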
While molecular biology has become the main application area of our work, we continue to study selected problems from other domains, in particular operations research. This year, we have been working on graph and network design problems, as well as on problems from computational geometry and linguistics. The corresponding results are presented in Sect. .
We have extended the functionalities and optimized the code of the application devoted to the standard M-SVM (M-SVM1 in ). The corresponding pieces of software have been registered at the APP under the IDDN number IDDN.FR.001.170014.000.R.P.2005.000.10000.
We have continued our study of the generalization error of large margin multi-class discriminant models, laying emphasis on the use of bounds for model selection. A first model selection algorithm, dedicated to M-SVMs, was based on a bound on the entropy numbers of the evaluation operator. The computation of tighter bounds on those entropy numbers is still work in progress; it takes the form of the derivation of a generalized formulation of the Maurey-Carl theorem. Those bounds will then be compared with those involving extended Sauer's lemmas and generalized VC dimensions. In parallel, the work on the computation of estimates of the risk based on the leave-one-out procedure has given birth to a first theorem, extending Chapelle's "radius-margin bound". All the aforementioned bounds are progressively incorporated into our M-SVM software, where they can be used to select the soft margin parameter C.
In Probabilistic Grammatical Inference, it is supposed that learning data consist of a sequence of words over a finite alphabet, drawn according to a fixed but unknown probability distribution P called a stochastic language. The goal is then to find a model consistent with the data, which can be, for instance, a probabilistic automaton (PA) or a Hidden Markov Model (HMM). Hidden Markov Models and Probabilistic Automata have the same expressivity, and their relationship has been precisely studied in . With Yann Esposito, from the "Laboratoire d'informatique fondamentale de Marseille" (LIF), we have proved in that stochastic languages p generated by probabilistic automata A depend continuously on the parameters of A, for the norm. As a corollary, we prove that probabilistic automata can be identified in the limit and that the identification is exact when the parameters of the target are rational numbers. However, this result is theoretical and does not lead to a practical learning algorithm. The main difficulty is to infer an appropriate structure from the data: this is possible when natural components of the model correspond to intrinsic components of the target language. We defined the notions of residual languages of a stochastic language and of Probabilistic Residual Automata. A PRA is a PA whose states directly correspond to residual languages of the language it generates. When the target stochastic language can be generated by a PRA, an efficient learning algorithm can be defined (see ). Stochastic languages defined from probabilistic automata are rational languages, and we feel it necessary to study Rational Stochastic Languages from a language-theoretical point of view. Main results have been described in . A main publication is in preparation.
Semi-supervised learning algorithms aim to exploit labeled and unlabeled data simultaneously for classification. We have been working for several years on a specific semi-supervised learning problem: binary classification from positive and unlabeled data. Theoretical results, strengthened by experimental results, have proved that many learning algorithms can be adapted to this context (see ). With Christophe Magnan, who is doing a PhD on this subject at the LIF, we are currently studying applications of this paradigm to a biological problem: disulfide bridge prediction . We are also working, with Liva Ralaivola (MdC, Université de Provence), on a more sophisticated model in order to deal with contact maps in proteins.
The function of a single protein is mainly carried out by a domain, which is a subsequence of amino acids within the whole sequence of the protein. During evolution, the sequence of such a domain can be significantly modified while the function is still conserved. Our work deals with functional families whose domains are not well conserved during evolution. Let F be a functional family and let P = {p_1, ..., p_n} be a set of annotated proteins which are known to belong or not to F; our problem is to decide whether a new protein p belongs to F.
In many cases, comparing the sequence of a new protein p with some sequences of the family F is enough for predicting whether p belongs to F. Such a similarity search may be achieved by using either an alignment program such as Blast or a model of the family's sequences, for example stochastic and probabilistic models such as Hidden Markov Models. Unfortunately, none of these methods is satisfactory when the sequences of the domains of the family are not conserved. Our proposal is to use a boosting algorithm associated with Blast to deal with this problem. First results have been published in , . Cécile Capponi, at the LIF, is the leader on this theme.
Shape recognition is the field of computer vision which addresses the problem of finding out whether a query shape lies in a shape database, up to a certain invariance. Most shape recognition methods simply sort the shapes of the database along some (dis-)similarity measure to the query shape. Their Achilles' heel is the decision stage, which should aim at giving a clear-cut answer to the question: ``do these two shapes look alike?'' In , , the proposed solution consists in bounding the number of false correspondences of the query shape among the database shapes, ensuring that the obtained matches are not likely to occur ``by chance''. As an application, one can decide with a parameterless method whether two digital images share some shapes or not. In a paper submitted to VISAPP'06, we propose to apply the above a contrario methodology to shapes described by size functions, in order to design a perceptual matching algorithm.
A further step consists in grouping matching shapes that share the same respective positions in two corresponding images. In , we form spatially coherent groups of shapes. Each pair of matching shape elements indeed leads to a unique transformation (a similarity or an affine map). A unified a contrario detection method is proposed to solve three classical problems in cluster analysis. The first one is to evaluate the validity of a cluster candidate. The second problem is that meaningful clusters can contain or be contained in other meaningful clusters, so a rule is needed to define locally optimal clusters by inclusion. The third problem is the definition of a correct merging rule between meaningful clusters, making it possible to decide whether they should stay separate or be merged. As an application, this theory of the choice of the right clusters is used to group shapes by detecting clusters in the transformation space.
Knowing the three-dimensional structure of a protein can greatly help to infer its function. Predicting this tertiary structure from the sequence of amino acids (or primary structure) remains one of the central open problems in structural biology. This is the subject of the «GENOTO3D» project that we coordinate. This year, our main efforts have been concentrated on the development of a new kernel for our M-SVM dedicated to protein secondary structure prediction, a kernel based on a pair-HMM.
Our collaboration with Nicolas Sapay and Gilbert Deléage, at IBCP, in Lyon, on the prediction of amphipathic in-plane membrane anchors in monotopic proteins, has given birth to a new prediction method, «AmphipaSeek», which is available from the website of the PBIL, at the following address: http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_amphipaseek.html.
Imposing a classification on an otherwise unordered protein fold space aids our understanding of protein evolution and of the relationship between three-dimensional structure and function. We describe a similarity model that provides an objective basis for clustering proteins of similar structure. More specifically, we consider the following variant of the protein-protein similarity problem: we want to find proteins in a large database pdbase that are very similar to a given query protein in terms of geometric shape. We give experimental evidence that the shape similarity model of Osada, Funkhouser, Chazelle and Dobkin can be transferred to the context of protein structure comparison. This model is very simple and leads to algorithms with attractive space requirements and running times. For example, it took 0.39 seconds to retrieve the eight members of the seryl family out of 26,600 domains. Furthermore, a very high agreement with one of the most popular classification schemes proved the significance of our simplified representation of complex protein structures by a distribution of Cα-Cα distances.
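The core of the shape-distribution idea can be sketched as follows (an illustrative NumPy rendering in the spirit of Osada et al., not our actual implementation; all parameter values are invented): sample random point pairs from the Cα coordinates, histogram their distances, and compare the normalized histograms.

```python
import numpy as np

def shape_distribution(points, bins=32, max_dist=60.0, n_samples=5000, seed=0):
    """Shape signature in the spirit of Osada et al.: a normalized histogram
    of distances between randomly sampled point pairs (here, the points
    would be C-alpha coordinates). Parameter values are illustrative."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(points), n_samples)
    j = rng.integers(0, len(points), n_samples)
    d = np.linalg.norm(points[i] - points[j], axis=1)
    hist, _ = np.histogram(d, bins=bins, range=(0.0, max_dist))
    return hist / n_samples

def signature_distance(h1, h2):
    """L1 distance between two normalized distance histograms."""
    return np.abs(h1 - h2).sum()

# Toy usage with random "C-alpha" coordinates:
ca = np.random.default_rng(1).random((50, 3)) * 10.0
h = shape_distribution(ca)
```

Because the signature depends only on pairwise distances, it is invariant under rotation and translation of the structure, which is what makes it attractive for fast database retrieval.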
At the moment, we are working on improving the computational efficiency of our approach. The bottleneck of our implementation is the solution of the (exponentially large) linear relaxation of our integer program. We are trying to develop methods to approximate the value of the linear program efficiently.
The Steiner tree problem is one of the most studied NP-hard optimization problems (probably second after the Traveling Salesman problem). Here we are interested in the variant where U is the set of strings of a certain length d and c is the Hamming distance between two strings.
The main application of this variant of the Steiner tree problem is to compute evolutionary trees in bioinformatics and computational linguistics.
Among all methods for finding such trees, algorithms using variations of a branch-and-bound method developed by Penny and Hendy have been the fastest for more than 20 years. We describe a new pruning approach that is far superior to previous methods and outline its implementation.
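A small illustration of the metric involved (not our branch-and-bound algorithm): the Hamming distance between equal-length strings, and the elementary fact that, for a star tree connecting several strings to one internal Steiner node, the column-wise majority string minimizes the total Hamming cost.

```python
from collections import Counter

def hamming(u, v):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

def star_steiner_point(strings):
    """Column-wise majority vote: for a star tree connecting the given
    equal-length strings to a single internal node, taking the most
    frequent letter in every column minimizes the total Hamming cost,
    since each column contributes independently to the sum."""
    return "".join(
        Counter(col).most_common(1)[0][0] for col in zip(*strings))

leaves = ["ACGT", "ACGA", "TCGA"]
center = star_steiner_point(leaves)            # "ACGA"
cost = sum(hamming(center, s) for s in leaves)  # 2
```

For general tree topologies with several internal nodes the problem is NP-hard, which is where the branch-and-bound machinery becomes necessary.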
Our main result is an algorithm that computes a tree of depth at most k and total expected cost O(log n) times that of a minimum-cost k-hop spanning tree. The result is based upon earlier work on metric space approximation due to Fakcharoenphol et al. and Bartal. In particular, we show that the problem can be solved exactly in polynomial time when the cost metric c is induced by a so-called hierarchically well-separated tree.
We participate in the «Génopole Strasbourg Alsace-Lorraine» together with the laboratory MAEM («Maturation des ARN et Enzymologie Moléculaire»), UMR 7567, in Nancy and the IGBMC in Strasbourg.
In the framework of the CPER Lorraine 2000-2006, we participate in the project «Bioinformatics and Applications to Genomics» of the PRST «Intelligence Logicielle». Our partners here are the Laboratory of Crystallography LCM3B (UMR 7036), the «équipe de Dynamique des Assemblages Membranaires» (eDAM, UMR 7565) and the MAEM (UMR 7567) at the University Henri Poincaré, Nancy 1.
Since February 2002, we have been participating in the cooperative research action ARC CPBIO «Process calculi and Biology of Molecular Networks». Our partners are the project team CONTRAINTES from INRIA Rocquencourt (F. Fages), the Genoscope (V. Schächter) and the laboratory PPS (V. Danos) in Paris.
We have regular contacts with the INRIA project teams HELIX (Rhône-Alpes), SYMBIOSE (Rennes) and COMORE (Sophia-Antipolis). In particular, we have been collaborating with Hidde de Jong (HELIX) in modeling the regulation of alternative splicing.
Since September 2003, we have been coordinating a project called GENOTO3D, funded by the «Action Concertée Incitative» (ACI) «Masses de Données». The aim of this project is to apply machine learning approaches to the prediction of the tertiary structure of globular proteins. Our partners are the IBCP in Lyon, the LIF in Marseille, the project team SYMBIOSE from IRISA, the LIRMM in Montpellier, and the MIG laboratory of INRA in Jouy-en-Josas.
Within the French-Russian Institute Liapunov, we have a joint project with the Institute for Mathematical Problems in Biology (IMPB) of the Russian Academy of Sciences in Pushchino (V. Y. Lunin).
We have been collaborating with researchers from Carnegie-Mellon University (E. Balas, John N. Hooker), the Center of Operations Research CORE in Louvain-la-Neuve (L. Wolsey), the Max Planck Institute for Computer Science in Saarbrücken (working groups of K. Mehlhorn and F. Eisenbrand), SAP AG (T. Kasper), the University of California at Irvine (P. Baldi), IBM at Zurich (A. Elisseeff), and the Wiener laboratories in Rosario (D. Zelus).
François Denis has been the program committee chair of CAP'05, the French national conference on machine learning, which was held in Nice on June 1-3, 2005.
Yann Guermeur has been a member of the program committee of CAP'05.
Ernst Althaus has given the following lectures (unless otherwise indicated, all teaching has been done at the Universität des Saarlandes, Saarbrücken, Germany):
Data Structures and Algorithms, October 2004 - February 2005 (together with Dr. Ulrich Meyer)
Part of the course Pépites Algorithmiques, March 2005 at the École des Mines, Nancy, France
Lecture Optimization, April 2005 - July 2005 (together with Dr. Benjamin Dörr)
Lecture Datenstrukturen und effiziente Algorithmen, October 2005 - March 2006 at the Johannes-Gutenberg Universität, Mainz, Germany
Seminar Online-Algorithmen, October 2005 - March 2006 at the Johannes-Gutenberg Universität, Mainz, Germany (together with Elmar Schömer and Marcel Marquardt)
Lecture Bioinformatik, October 2005 - March 2006 at the Johannes-Gutenberg Universität, Mainz, Germany (together with several lecturers from the department of biology)
Yann Guermeur has been teaching bioinformatics in a Master's programme of the INPL and in the M2P speciality "Génomique et Informatique" of the Master "Sciences de la Vie et de la Santé" (SVS), at the UHP.
Frédéric Sur has been a member of the board of the "Banque PT" entrance examination in mathematics ("Grandes Écoles" entrance examination).