## Section: Scientific Foundations

### Combinatorics and Enumeration

Participants : Alain Denise, Pierre Nicodème, Yann Ponty, Mireille Régnier, Cédric Saule, Jean-Marc Steyaert.

We aim at enumerating or generating sequences or structures
that are *admissible*
in the sense that they are likely to possess some given biological property.
Team members share expertise in the enumeration and random generation
of combinatorial structures.
They have developed computational tools for probability distributions on
combinatorial objects, using in particular generating functions and analytic
combinatorics.
Admissibility criteria can be mainly statistical; they can also rely on the
optimisation of some biological parameter, such as an energy function.

The ability to distinguish a
significant event from statistical noise is a crucial need in
bioinformatics.
In a first step, one defines a suitable
probabilistic model (null model) that takes into account the relevant
biological properties on the structures of interest.
A second step is to develop accurate criteria for
assessing whether an observed event is exceptional.
An event observed in biological sequences
is considered exceptional, and
therefore biologically significant, if the probability that it occurs
is very small in the null model.
Our approach to computing such a probability relies on the enumeration of
the admissible structures or combinatorial objects.
A third step is to design and implement efficient algorithms to
compute these formulae or to generate random data sets.
Two typical examples that motivate
research on word and motif counting
are *Transcription Factor Binding Sites*
(TFBSs) and consensus models of recoding events.
The project has made significant contributions to the area of word enumeration.
When relevant motifs cannot be described by regular languages,
one may still take advantage of combinatorial properties to
define functions whose study is amenable to our algebraic tools.
Examples include secondary structures and recoding events.
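
As an illustration of the word-counting step, the following minimal sketch computes the exact probability that a given word occurs at least once in a random text of length n. It assumes the simplest null model (independent, identically distributed letters) and uses a Knuth-Morris-Pratt automaton with an absorbing final state; the function names `kmp_automaton` and `occurrence_probability` are hypothetical and are not taken from the team's software.

```python
from fractions import Fraction


def kmp_automaton(word, alphabet):
    """Transitions delta[state][letter] of the automaton that tracks the
    longest prefix of `word` that is a suffix of the text read so far."""
    m = len(word)
    # fail[k] = length of the longest proper border of word[:k]
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k and word[i] != word[k]:
            k = fail[k]
        if word[i] == word[k]:
            k += 1
        fail[i + 1] = k
    delta = []
    for state in range(m):
        row = {}
        for a in alphabet:
            if a == word[state]:
                row[a] = state + 1
            elif state == 0:
                row[a] = 0
            else:
                row[a] = delta[fail[state]][a]
        delta.append(row)
    return delta


def occurrence_probability(word, n, probs):
    """P(word occurs at least once in a text of length n) when letters are
    drawn independently with probabilities probs[letter] (assumed model)."""
    m = len(word)
    delta = kmp_automaton(word, list(probs))
    dist = [Fraction(0)] * (m + 1)
    dist[0] = Fraction(1)
    for _ in range(n):
        new = [Fraction(0)] * (m + 1)
        new[m] = dist[m]  # state m is absorbing: the word has been seen
        for state in range(m):
            if dist[state]:
                for a, p in probs.items():
                    new[delta[state][a]] += dist[state] * p
        dist = new
    return dist[m]
```

For example, `occurrence_probability("AA", 3, {"A": Fraction(1, 2), "B": Fraction(1, 2)})` counts the three texts AAA, AAB and BAA among eight. A small value of this probability for an observed word is what the null-model criterion above treats as exceptional; Markov null models would need a product of this automaton with the model's states, which the sketch does not cover.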

A starting project considers an algorithm for the disambiguation of automata
that uses the powerful
techniques developed by Cyril Nicaud (IGM, Marne-la-Vallée University) to generate random automata.
Another appealing problem is the random walk problem, considered
as a model of ranked gene expression that could
be used for medical diagnosis. In the mathematical setting, we want to know the probability that
a random bridge of length $n$ with increments $X_i \in \{+d, -c\}$ exits a strip $-H \le y \le H$.
The increments have expectation zero, and it is possible to assume that they are independent,
conditioning the walk afterwards to come back to zero at time $n$. If the increments $X_i$ are
bounded, the limit of the rescaled walk as $n$ tends to infinity is a Brownian bridge, whose statistics
are well known; in practice, however, on the one hand the value of $d$ may be large, and on the
other hand we are in the range of large
deviations for small $p$-values. For these reasons, it is necessary to consider the
discrete case. Banderier and Flajolet provided in 2002 an extensive
account of discrete random walks, although they do not consider the
heights of the walks. A collaboration has begun with Cyril Banderier
(LIPN, University Paris-North) on the subject;
Nicolas Broutin (Inria-Algorithms) and Thomas Feierl (joining Inria-Algorithms
on Dec. 1st) should join this collaboration. The bioinformatics aspects will
be considered by Marcel Shulz (Max-Planck Institut Berlin-Dahlem).
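
For small parameters, the discrete quantity described above can be computed exactly by dynamic programming, which gives a reference point for any asymptotic formula. The sketch below is only an illustration under stated assumptions: steps are +d with probability c/(c+d) and -c with probability d/(c+d) (so that the mean is zero), "exiting the strip" is read as reaching a height y with |y| > H, and the name `bridge_exit_probability` is hypothetical.

```python
from fractions import Fraction


def bridge_exit_probability(n, d, c, H):
    """P(a bridge of length n with steps +d / -c leaves the strip [-H, H])."""
    p_up = Fraction(c, c + d)    # zero mean: d * p_up - c * p_down = 0
    p_down = Fraction(d, c + d)

    def mass_at_zero(confined):
        # dist[y] = probability of being at height y after the steps so far,
        # restricted to walks that stayed in the strip when `confined` is set
        dist = {0: Fraction(1)}
        for _ in range(n):
            new = {}
            for y, w in dist.items():
                for step, p in ((d, p_up), (-c, p_down)):
                    z = y + step
                    if confined and abs(z) > H:
                        continue
                    new[z] = new.get(z, Fraction(0)) + w * p
            dist = new
        return dist.get(0, Fraction(0))

    all_bridges = mass_at_zero(False)
    if all_bridges == 0:
        raise ValueError("no bridge of this length exists")
    staying = mass_at_zero(True)
    # condition on returning to zero at time n
    return 1 - staying / all_bridges
```

With d = 2, c = 1, n = 3 and H = 1, the three possible bridges are up-down-down, down-up-down and down-down-up, and only the second stays in the strip, so the function returns 2/3. Since every bridge of a given length has the same numbers of up and down steps, the conditional law is uniform on bridges, and the choice of step probabilities only matters for the unconditioned walk. The cost grows with n and with the range of reachable heights, so this brute-force computation does not replace the analytic study.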

Analytical methods fail when both sequential
and structural constraints of sequences are to be modelled or, more
generally, when molecular *structures* such as RNA
structures have to be handled. For these more complex models, an
experimental approach (*i.e.* a computational generation of random
sequences) is still necessary.
Typically, context-free grammars can handle certain kinds
of long-range interactions such as base pairings in RNA secondary
structures. Stochastic context-free grammars
(SCFGs) have long been used to model both structural and statistical
properties of genomic sequences, particularly for predicting the
structure of sequences or for searching for motifs.
They can also be used to generate random sequences. However, they do
not allow the user to fix the length of these sequences.
We developed algorithms for the random generation of structures that respect
a given probability distribution on their components. For this
purpose, we first translate the (biological) structures into
combinatorial classes, according to the framework developed by
Flajolet *et al.* Our approach is based on the
concept of *weighted* combinatorial classes, in combination with
the so-called *recursive* method for generating combinatorial
structures. Putting weights on the atoms makes it possible to bias the
probabilities in order to obtain the desired distribution. The main
issue is to develop efficient algorithms for finding suitable
weights.
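
The combination of weights and the recursive method can be sketched on a toy class. The example below assumes a simplified grammar for secondary-structure-like words in dot-bracket notation, S -> ε | "." S | "(" S ")" S, with a single weight attached to each base pair; the names `weighted_counts` and `sample` are hypothetical, and the hard part mentioned above (finding the weights that yield a target distribution) is not addressed.

```python
import random


def weighted_counts(n, pair_weight):
    """W[k] = total weight of the words of length k, where the weight of a
    word is pair_weight raised to its number of base pairs."""
    W = [0] * (n + 1)
    W[0] = 1
    for k in range(1, n + 1):
        total = W[k - 1]                 # first position unpaired
        for i in range(k - 1):           # first position paired, i letters enclosed
            total += pair_weight * W[i] * W[k - 2 - i]
        W[k] = total
    return W


def sample(n, pair_weight, W=None, rng=random):
    """Draw a word of length exactly n with probability proportional to its
    weight (recursive method: choose each rule according to weighted counts)."""
    if W is None:
        W = weighted_counts(n, pair_weight)
    if n == 0:
        return ""
    r = rng.random() * W[n]
    r -= W[n - 1]
    if r < 0:
        return "." + sample(n - 1, pair_weight, W, rng)
    for i in range(n - 1):
        r -= pair_weight * W[i] * W[n - 2 - i]
        if r < 0:
            inner = sample(i, pair_weight, W, rng)
            rest = sample(n - 2 - i, pair_weight, W, rng)
            return "(" + inner + ")" + rest
    # only reachable through floating-point rounding
    return "." + sample(n - 1, pair_weight, W, rng)
```

Unlike sampling from a stochastic grammar, the length is fixed in advance: a call such as `sample(30, 2.0)` always returns a word of length 30, and raising `pair_weight` shifts the distribution towards words with more base pairs. Realistic RNA models would add constraints such as a minimum hairpin loop length and sequence-dependent weights, which this toy class omits.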

#### Knowledge extraction

Participants : Jérôme Azé, Sarah Cohen-Boulakia, Christine Froidevaux, Bastien Rance, Mireille Régnier.

Our main goal is to design semi-automatic methods for annotation. A possible approach is to focus on how relevant motifs can be discovered, in order to establish more precise links between function and the sequence of motifs. Indeed, a commonly accepted hypothesis is that function depends on the order of the motifs present in a genomic sequence. Examples of relevant motifs include frameshift motifs, RNA structural motifs, TFBSs and PFAM domains. General tools must then be developed to assess the significance of the motifs found. Likewise, we must be able to evaluate the quality of the annotation obtained. This requires an estimate of the reliability of the results, which includes a rigorous statement of the validity domain of the algorithms and knowledge of the provenance of the results. We are interested in the provenance information produced by workflow management systems, which are important in scientific applications for managing large-scale experiments and can be useful for computing functional annotations. A given workflow may be executed many times, generating huge amounts of information about the data produced and consumed. Given the growing availability of this information, there is increasing interest in mining it to understand the differences in results produced by different executions.