## Section: New Results

### Annotation

#### Combinatorics

##### Word counting and trie profiles

Cis-regulatory modules (CRMs) of eukaryotic genes often contain clusters of multiple binding sites for transcription factors.
Formally, such sites can be viewed as *words* co-occurring in the DNA sequence.
This gives rise to the problem of computing the statistical significance of the event that multiple sites, recognized by different factors, are found simultaneously in a text of fixed length. The main difficulty comes from overlapping occurrences of motifs. It is partially solved by our previous algorithm, AhoPro. OvGraph [6], developed with our associate team Migec, addresses the remaining memory problems. We introduced a new concept of overlap graphs to count word occurrences and their probabilities. This concept leads to a recursive equation that differs from the classical one based on the finite automaton accepting the proper set of texts. When occurrences are numerous, our approach yields the same order of time and space complexity as the approach based on the minimized automaton.

The OvGraph algorithm relies on traversals of a graph whose vertices are associated with the overlaps of the words in the input set. Its edges define two oriented subgraphs that can be interpreted as equivalence relations on the words of the set. Let P be the set of equivalence classes and S the set of the remaining vertices. Compared to the Bernoulli model, a Markov model of order K incurs an additional space complexity of O(pm|V|^{K}) and an additional time complexity of O(npm|V|^{K}). Our preprocessing uses a variant of the Aho-Corasick automaton. The algorithm is implemented for the Bernoulli model and provides a significant space improvement in practice.
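As a point of comparison, the classical automaton-based counting that the overlap-graph formulation departs from can be sketched as follows. This is a minimal Aho-Corasick counter for overlapping occurrences, not the OvGraph algorithm itself, whose graph traversals are not reproduced here:

```python
from collections import deque

def build_aho(words):
    """Build an Aho-Corasick automaton: goto transitions, failure
    links, and the set of words recognized at each state."""
    goto, fail, out = [{}], [0], [set()]
    for w in words:
        s = 0
        for c in w:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(w)
    # BFS to compute failure links (depth-1 states keep fail = root)
    q = deque(goto[0].values())
    while q:
        r = q.popleft()
        for c, s in goto[r].items():
            q.append(s)
            f = fail[r]
            while f and c not in goto[f]:
                f = fail[f]
            t = goto[f].get(c, 0)
            fail[s] = t if t != s else 0
            out[s] |= out[fail[s]]  # inherit words ending via fail link
    return goto, fail, out

def count_occurrences(text, words):
    """Count all (possibly overlapping) occurrences of each word."""
    goto, fail, out = build_aho(words)
    counts = {w: 0 for w in words}
    s = 0
    for c in text:
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for w in out[s]:
            counts[w] += 1
    return counts
```

For example, `count_occurrences("ATAT", ["AT", "TA"])` counts the overlapping occurrences of both words in a single pass over the text.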

A new problem addressed by MPV, developed with J. Bourdon (LINA-Nantes and Inria-Symbiose) and Migec, is the significance assessment of motif clusters. The classical method for studying a set of motifs (defined, for instance, by their *Position Weight Matrices*, PWMs) computes a significance score for each motif over the sequence set under study, and then chooses an (arbitrary) threshold to select the most significant motifs (the 10 top motifs, motifs with a p-value below 5%, etc.). This kind of choice makes it very difficult to control the number of false positives induced by the selection. We have developed a method, relying on generating functions, that computes a significance criterion for the selection itself and thereby provides the number of false positives. Such information is beyond the scope of other methods that correct p-values for multiple testing (Bonferroni, Benjamini-Hochberg, etc.). A prototype is available online at http://www.lina.sciences.univ-nantes.fr/bioatlanstic/MPV/ .
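The intuition behind controlling the number of false positives can be illustrated with the simplest possible null model, uniform p-values; this is only the baseline that MPV's generating-function computation refines, not the method itself:

```python
def selection_summary(pvalues, alpha):
    """Select motifs with p-value <= alpha and report the expected
    number of false positives among m tests under the null hypothesis
    (p-values uniform on [0, 1]), which is simply m * alpha."""
    selected = [p for p in pvalues if p <= alpha]
    return len(selected), len(pvalues) * alpha
```

For instance, `selection_summary([0.001, 0.2, 0.04, 0.7], 0.05)` selects two motifs while the null model alone already accounts for 0.2 expected false selections.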

Some related theoretical aspects have been considered by P. Nicodème. The non-reduced case of word statistics is considered,
where words of the searched motif may be factors of other words of the motif. This is joint
work with Frédérique Bassino (Lipn, University Paris-North) and Julien Clément (Greyc, University of
Caen); an article on this matter has been submitted to the journal *Transactions on Algorithms*.
Since DNA is a text sequence, the analysis of suffix trees is of central importance;
it is often coupled with the analysis of tries.
A joint work of P. Nicodème with Gahyun Park (University of Wisconsin), Hsien-Kuei Hwang
(Academia Sinica, Taiwan) and Wojciech Szpankowski (Purdue University) on profiles of tries
has been published in the SIAM Journal on Computing [8].

##### Random Generation

The random generation of combinatorial objects is an alternative, yet natural, framework for assessing
the significance of observed phenomena. General and efficient techniques have been developed over the last decades to draw objects uniformly at random from an abstract specification. However, in the context of biological sequences and structures, the uniformity assumption fails, and one has to consider non-uniform distributions in order to obtain relevant estimates. To that purpose we introduced weighted random generation, which we previously implemented within the `GenRGenS` software http://www.lri.fr/~genrgens/ . The weighted distributions induced by our generation scheme generalize both Markov models for genomic sequences
and the Boltzmann distribution used by state-of-the-art methods for RNA folding.
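As a toy illustration of a weighted distribution over sequences, the kind of non-uniform model supported by `GenRGenS` (the letter names and weight values below are arbitrary):

```python
import random

def draw_weighted_word(n, weights, rng=None):
    """Draw a word of length n in which each letter a is emitted with
    probability weights[a] / sum(weights.values()).  With all weights
    equal, this degenerates to the uniform (Bernoulli) model."""
    rng = rng or random.Random(0)   # fixed seed for reproducibility
    letters = sorted(weights)
    probs = [weights[l] for l in letters]
    return "".join(rng.choices(letters, weights=probs)[0] for _ in range(n))
```

Raising the weight of a letter biases every position of the word toward it, which is the one-letter analogue of weighting a distinguished atom in a decomposable class.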

In this collaboration between two of the team members and M. Termier (Igm-University Paris-Sud XI), we introduced and
studied a generalization of the weighted models to general decomposable classes, defined using
different types of atoms.
We addressed the random generation of such structures with
respect to a size n and a targeted distribution over k of their
*distinguished* atoms. We consider two variations on this problem.
In the first alternative, the targeted distribution is given by k
real numbers p_{1}, ..., p_{k} such that
0 < p_{i} < 1 for all i and p_{1} + ... + p_{k} ≤ 1.
We aim to generate
random structures among the whole set of structures of a given size
n, in such a way that the *expected* frequency of the i-th
distinguished atom equals p_{i}. We address this problem by
weighting the atoms with a k-tuple π = (π_{1}, ..., π_{k}) of real-valued
weights, inducing a weighted distribution over the set of structures
of size n. We first adapt the classical recursive random generation
scheme into an algorithm taking O(n^{1 + o(1)} + mn log n)
arithmetic operations to draw m structures from the
π-weighted distribution. Secondly, we address the analytical
computation of weights such that the targeted frequencies
are achieved asymptotically, i.e. for large values of n.
We derive systems of functional equations
whose resolution
gives an explicit relationship between π and the targeted frequencies. Lastly,
we give an algorithm in O(kn^{4}) for the inverse problem,
*i.e.* computing the frequencies associated with a given k-tuple
π of weights, and an optimized version in O(kn^{2}) in
the case of context-free languages. This allows for a heuristic
resolution of the weights/frequencies relationship, suitable for
complex specifications.
In the second alternative, the targeted distribution is given by
k natural numbers n_{1}, ..., n_{k} such that
n_{1} + ... + n_{k} + r_{0} = n, where r_{0} is the number of undistinguished
atoms. The structures must be generated uniformly among the set of
structures of size n that contain *exactly* n_{i} occurrences of the i-th
distinguished atom (1 ≤ i ≤ k). We give
an algorithm for generating m such structures, which simplifies
in the case of regular specifications.
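The weights/frequencies relationship can be made concrete on the simplest decomposable class: binary words with one distinguished letter. In this toy case the expected frequency of the weighted letter has the closed form π/(π+1), so the tuning step is exact; it stands in for the general computations, which the text describes but which are not reproduced here:

```python
def weight_for_frequency(p):
    """Invert the frequency map: over words on {a, b} with weight pi
    on 'a' (and 1 on 'b'), the expected frequency of 'a' is
    pi / (pi + 1), hence pi = p / (1 - p)."""
    assert 0.0 < p < 1.0
    return p / (1.0 - p)

def expected_frequency(pi, n):
    """Expected frequency of 'a' in pi-weighted words of length n,
    computed by dynamic programming over the length (the forward
    direction of the inverse problem, not the closed form)."""
    total, a_mass = 1.0, 0.0          # empty word: weight 1, no 'a'
    for _ in range(n):
        # appending 'a' (weight pi) or 'b' (weight 1) to every word
        total, a_mass = total * (pi + 1.0), a_mass * (pi + 1.0) + total * pi
    return a_mass / (n * total)
```

Tuning the weight with `weight_for_frequency(0.25)` and checking it with `expected_frequency` recovers the targeted frequency, which is the round trip the O(kn^{2})/O(kn^{4}) algorithms perform on arbitrary specifications.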

These results provide new foundations and tools for tackling structural bioinformatics
problems, such as RNA design. They are described in a manuscript [23] submitted
to *Theoretical Computer Science* .

##### Score function for SNK

Recent work by Forslund and Sonnhammer has investigated the extent to which protein function can be predicted from domain architecture alone. They have shown that the functional interplay of domains may not follow directly from the properties of the domains in isolation, and suggested that it could be interesting to take into account the conservation of the sequential order of domains. To this end, we have proposed a new method [3], called Snk (Sequential Nuggets of Knowledge) http://www.lri.fr/~rance/SNK/ , which systematically analyses domain combinations and outlines characteristic patterns potentially associated with targeted properties, such as sets of GO terms or membership in some taxonomic group. We are currently applying this method to discover new associations in some protein families. We are also defining a robust probability model on the variables involved in the sequential association rules in order to highlight their relevance.
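A simplistic illustration of what "conservation of sequential order" means at the counting level; this is not the Snk algorithm (its pattern language and scoring are richer), and the domain names below are invented:

```python
from collections import Counter
from itertools import combinations

def ordered_domain_pairs(architectures):
    """Count, for each ordered pair of domains (d1, d2), the number of
    architectures in which d1 occurs somewhere before d2 (each pair
    counted at most once per architecture)."""
    counts = Counter()
    for arch in architectures:
        seen = set()
        for i, j in combinations(range(len(arch)), 2):
            pair = (arch[i], arch[j])
            if pair not in seen:
                seen.add(pair)
                counts[pair] += 1
    return counts
```

Pairs that recur in the same order across many architectures of a family are the kind of sequential signal that can then be tested for association with a targeted property.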

#### Ontology and provenance

##### Ontology mapping

Identifying correspondences between the concepts of two ontologies has become a crucial task for genome annotation. We have proposed O'Browser [14], a semi-automatic method that addresses this issue in the case of two functional hierarchies. O'Browser is based on a classical ontology-mapping architecture, but relies heavily on expertise in the underlying domain. First, experts are asked to validate obvious correspondences discovered by O'Browser and to identify functional groups of concepts in the ontologies. Then, they are requested to validate the correspondences produced by combining the results of the automatic steps of our system. These steps consist of matchers designed to fit the characteristics of the ontologies. In particular, we have introduced a new instance-based matcher that uses homology relationships between proteins. We have also proposed an original notion of adaptive weighting for combining the different matchers. O'Browser has been used to map concepts of Subtilist to concepts of FunCat, two functional hierarchies.
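A minimal sketch of combining matcher scores with a form of adaptive weighting, assuming weights are renormalized over the matchers that actually fire; both the function and its renormalization rule are illustrative, not O'Browser's exact scheme:

```python
def combine_matchers(scores, weights):
    """scores[i] is a matcher's similarity in [0, 1] for a candidate
    concept pair, or None if that matcher cannot assess the pair;
    weights are renormalized over the matchers that did fire."""
    active = [(s, w) for s, w in zip(scores, weights) if s is not None]
    if not active:
        return 0.0
    total = sum(w for _, w in active)
    return sum(s * w for s, w in active) / total
```

Renormalizing over active matchers keeps the combined score on the same [0, 1] scale whether one matcher fires or all of them do, which is one simple way of adapting the weighting to each concept pair.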

##### Browsing biomedical datasources

One of the most popular ways to access public biological data is through portals, such as Entrez NCBI. Data entries are inspected in turn and cross-references between entries are followed. However, this navigational process is so time-consuming and difficult to reproduce that it does not allow scientists to explore all the alternative paths available (even though these paths may provide new information). BioBrowsing [13] is a tool providing scientists with the data obtained when all the possible paths between NCBI sources have been followed (source-path generation is done by BioGuide). Querying is done on-the-fly (no warehousing). BioBrowsing has a module able to automatically update the schema used by its query engine to take into account the new sources and links that appear in Entrez. Finally, profiles can be defined as a way of focusing the results on a user's specific interests.
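The path generation underlying this kind of browsing can be sketched as simple-path enumeration over a graph whose nodes are data sources and whose edges are cross-references; the source names below are invented, and BioGuide's actual path strategies are richer than this bounded enumeration:

```python
def all_paths(graph, start, end, max_len=4):
    """Enumerate all simple paths from start to end in a source graph
    (adjacency dict), up to max_len nodes per path."""
    paths, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            paths.append(path)
            continue
        if len(path) >= max_len:
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:            # keep paths simple
                stack.append((nxt, path + [nxt]))
    return paths
```

Enumerating every source path up front is what lets the browser present all the alternative routes between two sources instead of the single one a manual navigation would follow.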

##### Differencing two workflows

In this context, we have studied the problem of differencing two workflow runs with the same specification. Our contributions [10] are three-fold: (i) while this problem is NP-hard in general, we have proposed a natural restriction of the graph structures (series-parallel graphs overlaid with well-nested forking and looping) that is general enough to capture workflows encountered in practice; (ii) for this model of workflows, we have presented efficient, polynomial-time algorithms for differencing workflow runs [18],[11]; (iii) we have developed a prototype [4] and conducted experiments demonstrating the scalability of our approach.
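For intuition only: once runs are well-nested they can be represented as nested lists, on which even a naive leaf-level diff is immediate. The published algorithms [10], [18], [11] solve the much harder structural differencing problem, which this sketch does not attempt, and the task names are invented:

```python
def leaves(run):
    """Yield the task names of a run given as a nested-list tree
    (a string is a task; a list is a series/parallel block)."""
    if isinstance(run, str):
        yield run
    else:
        for child in run:
            yield from leaves(child)

def diff_runs(run_a, run_b):
    """Return the task names occurring in one run but not the other,
    ignoring structure entirely."""
    a, b = set(leaves(run_a)), set(leaves(run_b))
    return sorted(a - b), sorted(b - a)
```

The gap between this flat comparison and a diff that respects series-parallel structure is precisely where the NP-hardness of the general problem, and the value of the well-nested restriction, lie.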