Section: Scientific Foundations
Combinatorics and Enumeration
Participants : Alain Denise, Pierre Nicodème, Yann Ponty, Mireille Régnier, Cédric Saule, Jean-Marc Steyaert.
We aim to enumerate or generate sequences or structures that are admissible, in the sense that they are likely to possess some given biological property. Team members share expertise in the enumeration and random generation of combinatorial structures. They have developed computational tools for probability distributions on combinatorial objects, relying in particular on generating functions and analytic combinatorics. Admissibility criteria can be mainly statistical; they can also rely on the optimisation of some biological parameter, such as an energy function.
The ability to distinguish a significant event from statistical noise is a crucial need in bioinformatics. The first step is to define a suitable probabilistic model (null model) that takes into account the relevant biological properties of the structures of interest. The second step is to develop accurate criteria for assessing whether or not an observation is exceptional. An event observed in biological sequences is considered exceptional, and therefore biologically significant, if the probability that it occurs is very small under the null model. Our approach to computing such a probability consists in enumerating the good structures or combinatorial objects. The third step is to design and implement efficient algorithms to compute these formulae or to generate random data sets. Two typical examples that motivate research on word and motif counting are Transcription Factor Binding Sites (TFBSs) and consensus models of recoding events. The project has made significant contributions to the area of word enumeration. When relevant motifs cannot be described by regular languages, one may still take advantage of combinatorial properties to define functions whose study is amenable to our algebraic tools. Examples include secondary structures and recoding events.
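As an illustration of the enumeration approach, the following minimal Python sketch computes the exact probability that a given word occurs at least once in a random sequence of length n. It assumes the simplest null model (independent, identically distributed letters) and a single exact word rather than a degenerate motif; the function names `kmp_automaton` and `occurrence_probability` are hypothetical and do not refer to the team's software.

```python
from fractions import Fraction


def kmp_automaton(word, alphabet):
    """Transition table of the word-recognising (Knuth-Morris-Pratt) automaton.

    State k (0 <= k < len(word)) means: the longest suffix of the text read
    so far that is a prefix of `word` has length k.
    """
    m = len(word)
    # fail[k] = length of the longest proper border of word[:k]
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k > 0 and word[i] != word[k]:
            k = fail[k]
        if word[i] == word[k]:
            k += 1
        fail[i + 1] = k
    delta = [dict() for _ in range(m)]
    for state in range(m):
        for a in alphabet:
            k = state
            while k > 0 and word[k] != a:
                k = fail[k]
            delta[state][a] = k + 1 if word[k] == a else 0
    return delta


def occurrence_probability(word, n, probs):
    """P(word occurs at least once in an i.i.d. sequence of length n).

    `probs` maps each letter to its probability under the null model.
    """
    m = len(word)
    delta = kmp_automaton(word, list(probs))
    # dist[k] = probability of being in state k without having seen the word
    dist = [Fraction(0)] * m
    dist[0] = Fraction(1)
    for _ in range(n):
        new = [Fraction(0)] * m
        for state, mass in enumerate(dist):
            if mass == 0:
                continue
            for a, p in probs.items():
                target = delta[state][a]
                if target < m:  # state m means the word has been found
                    new[target] += mass * p
        dist = new
    return 1 - sum(dist)


uniform = {a: Fraction(1, 4) for a in "ACGT"}
print(occurrence_probability("AA", 3, uniform))  # 7/64
```

Exact rational arithmetic is used here so that very small p-values are not lost to rounding; a floating-point version would be faster for long sequences.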
A starting project considers an algorithm for the disambiguation of automata that uses the powerful techniques developed by Cyril Nicaud (IGM, Marne-la-Vallée University) to generate random automata.
Another appealing problem is the random walk problem, considered as a model of ranked gene expression that could be used for medical diagnosis. In the mathematical setting, we want to know the probability that a random bridge of length n with increments X_i ∈ {+d, -c} exits a strip -H ≤ y ≤ H. The increments have expectation zero, and it is possible to assume that they are independent and to condition the walk, later on, to come back to zero at time n. If the increments X_i are bounded, the limit of the walk as n tends to infinity is a Brownian bridge, whose statistics are well known; in practice, however, the value of d may be large on the one hand, and on the other hand small p-values fall in the range of large deviations. For these reasons, it is necessary to consider the discrete case. Banderier and Flajolet provided in 2002 an extensive account of discrete random walks, although they do not consider the heights of the walks. A collaboration has begun with Cyril Banderier (LIPN, University Paris-North) on the subject; Nicolas Broutin (Inria-Algorithms) and Thomas Feierl (joining Inria-Algorithms on Dec. 1st) should join this collaboration. The bioinformatics aspects will be considered by Marcel Shulz (Max-Planck Institut Berlin-Dahlem).
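For small instances, the exit probability of the discrete bridge can be obtained by direct counting. The Python sketch below assumes positive integer increments +d and -c and a closed strip -H ≤ y ≤ H (exit meaning that some partial sum has absolute value greater than H). Under the independent, zero-mean model conditioned on returning to zero, every bridge of length n contains the same number of +d steps and is therefore equally likely, so the probability reduces to a ratio of path counts. The function name `bridge_exit_probability` is hypothetical.

```python
from fractions import Fraction
from math import comb


def bridge_exit_probability(n, d, c, H):
    """Probability that a bridge of length n with steps +d / -c leaves [-H, H].

    A bridge returns to 0 at time n, which forces the number of +d steps
    to be n * c / (c + d).
    """
    if (n * c) % (c + d) != 0:
        raise ValueError("no bridge of this length exists for these increments")
    ups = n * c // (c + d)
    total = comb(n, ups)  # number of all bridges of length n
    # counts[y] = number of paths that stayed inside the strip and end at y
    counts = {0: 1}
    for _ in range(n):
        nxt = {}
        for y, ways in counts.items():
            for step in (d, -c):
                z = y + step
                if -H <= z <= H:
                    nxt[z] = nxt.get(z, 0) + ways
        counts = nxt
    inside = counts.get(0, 0)  # bridges that never left the strip
    return Fraction(total - inside, total)


print(bridge_exit_probability(4, 1, 1, 1))  # 1/3
```

The dynamic programme costs on the order of n times H operations on large integers, which is only practical for moderate n; it is meant as a reference against which asymptotic or analytic estimates could be checked.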
Analytical methods fail when both sequential and structural constraints on sequences are to be modelled or, more generally, when molecular structures such as RNA structures have to be handled. For these more complex models, an experimental approach (i.e. a computational generation of random sequences) is still necessary. Typically, context-free grammars can handle certain kinds of long-range interactions, such as base pairings in RNA secondary structures. Stochastic context-free grammars (SCFGs) have long been used to model both structural and statistical properties of genomic sequences, particularly for predicting the structure of sequences or for searching for motifs. They can also be used to generate random sequences. However, they do not allow the user to fix the length of these sequences. We developed algorithms for the random generation of structures that respect a given probability distribution on their components. For this purpose, we first translate the (biological) structures into combinatorial classes, according to the framework developed by Flajolet et al. Our approach is based on the concept of weighted combinatorial classes, in combination with the so-called recursive method for generating combinatorial structures. Putting weights on the atoms makes it possible to bias the probabilities in order to obtain the desired distribution. The main issue is to develop efficient algorithms for finding suitable weights.
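The combination of weighted classes and the recursive method can be sketched on a toy model. The Python code below is an illustration under simplifying assumptions, not the team's implementation: structures are dot-bracket words generated by the grammar S → ε | . S | ( S ) S, with no minimum hairpin loop length, and the weights `w_unpaired` and `w_pair` are supplied by the caller rather than tuned to reach a target distribution. All names are hypothetical.

```python
import random


def weighted_counts(n, w_unpaired, w_pair):
    """W[m] = total weight of all structures of length m (counting step)."""
    W = [0] * (n + 1)
    W[0] = 1
    for m in range(1, n + 1):
        total = w_unpaired * W[m - 1]          # first base unpaired
        for k in range(m - 1):                 # first base paired, k bases inside
            total += w_pair * W[k] * W[m - 2 - k]
        W[m] = total
    return W


def random_structure(n, w_unpaired=1.0, w_pair=1.0, rng=random):
    """Draw a structure of length exactly n with probability
    proportional to w_unpaired**(#dots) * w_pair**(#pairs)."""
    W = weighted_counts(n, w_unpaired, w_pair)
    out = []
    stack = [n]  # pending work: lengths still to generate, or literal symbols
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            out.append(item)
            continue
        m = item
        if m == 0:
            continue
        # Option None: unpaired base; option k: pair enclosing k bases.
        options = [None] + list(range(m - 1))
        weights = [w_unpaired * W[m - 1]]
        weights += [w_pair * W[k] * W[m - 2 - k] for k in range(m - 1)]
        k = rng.choices(options, weights=weights)[0]
        if k is None:
            out.append(".")
            stack.append(m - 1)
        else:
            out.append("(")
            # Pushed in reverse order: inside, then ")", then the remainder.
            stack.extend([m - 2 - k, ")", k])
    return "".join(out)


print(random_structure(40, w_unpaired=1.0, w_pair=3.0, rng=random.Random(1)))
```

With both weights equal to 1 the generation is uniform over all structures of length n; increasing `w_pair` biases the output towards structures with more base pairs while the length stays fixed, which is the property that plain SCFG sampling lacks.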
Knowledge extraction
Participants : Jérôme Azé, Sarah Cohen-Boulakia, Christine Froidevaux, Bastien Rance, Mireille Régnier.
Our main goal is to design semi-automatic methods for annotation. A possible approach is to focus on how relevant motifs can be discovered, in order to establish more precise links between function and sequences of motifs. Indeed, a commonly accepted hypothesis is that function depends on the order of the motifs present in a genomic sequence. Examples of relevant motifs include frameshift motifs, RNA structural motifs, TFBSs and PFAM domains. General tools must then be developed in order to assess the significance of the motifs found. Likewise, we must be able to evaluate the quality of the annotation obtained. This requires an estimate of the reliability of the results that includes a rigorous statement of the validity domain of the algorithms and knowledge of the provenance of the results. We are interested in provenance resulting from workflow management systems, which are important in scientific applications for managing large-scale experiments and can be useful for computing functional annotations. A given workflow may be executed many times, generating huge amounts of information about the data produced and consumed. Given the growing availability of this information, there is an increasing interest in mining it to understand the differences between the results produced by different executions.