Section: New Results
Comparative analysis and Noncoding RNAs
Participants : Mathieu Giraud, Benjamin Grenier-Boley, Antoine de Monte, Azadeh Saffarian, Hélène Touzet.
Finding ncRNAs by comparative analysis
We have devised a new method to find ncRNAs in genomes by comparative analysis. First, sequences are preprocessed to mask known annotated features, redundancy, etc. Then the target sequence is compared to all other sequences to detect similar sequences across species. Pairwise alignments are combined into clusters of conserved regions: the algorithm searches for regions whose conservation is supported by a significantly high number of pairwise alignments. Finally, conserved sequences are investigated by inspection of evolutionary patterns to identify conserved consensus secondary structures. This work has been implemented in the CG-seq software, and should give rise to a publication.
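As a rough illustration of the clustering step, the following Python sketch reports the regions of a target sequence that are covered by pairwise alignments from enough distinct species. It is not the CG-seq implementation: the function name `conserved_regions`, the `(start, end, source)` interval format and the fixed `min_support` threshold (standing in for the significance criterion mentioned above) are all assumptions made for the example.

```python
from collections import Counter
from itertools import groupby


def conserved_regions(alignments, min_support):
    """Return maximal (start, end) regions of the target sequence that are
    covered by alignments from at least `min_support` distinct sources.

    `alignments` is an iterable of (start, end, source) tuples, where
    [start, end) is a half-open interval on the target and `source`
    names the other sequence (hypothetical format, for illustration).
    """
    # Turn every interval into an "open" and a "close" event.
    events = []
    for start, end, source in alignments:
        events.append((start, +1, source))
        events.append((end, -1, source))
    events.sort(key=lambda e: e[0])

    active = Counter()      # source -> number of its alignments currently open
    regions = []
    region_start = None

    # Apply all events at one position before measuring the support there.
    for pos, group in groupby(events, key=lambda e: e[0]):
        for _, delta, source in group:
            active[source] += delta
            if active[source] == 0:
                del active[source]
        support = len(active)   # number of distinct sources covering pos
        if support >= min_support and region_start is None:
            region_start = pos
        elif support < min_support and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    return regions


if __name__ == "__main__":
    hits = [(100, 200, "A"), (150, 250, "B"), (180, 300, "C"), (400, 450, "A")]
    print(conserved_regions(hits, min_support=2))
```

Counting distinct sources rather than raw alignments keeps several overlapping hits from one genome from being mistaken for cross-species conservation; a real significance test would replace the fixed threshold.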
RNA pattern matching
Given a description of an RNA family, the goal is to identify all its potential occurrences in a genomic sequence, in a database or in a large set of small sequences. Stochastic context-free grammars have turned out to be successful models for this task, both in terms of sensitivity and specificity. However, the high computational complexity of the related dynamic programming algorithms limits their practical application. More generally, an exhaustive benchmarking of RNA pattern matching shows that existing methods must compromise between efficiency and sensitivity, and that even the fastest programs are not suitable for genome-scale analysis. We are currently working on filtering strategies that exploit the approximate relative location of structural elements within the RNA motif, as well as conserved motifs within the alignment. This filtering approach is intended to complement exact methods, as a preprocessing of the sequence. In the longer term, it also opens the way to new indexing structures designed to store genomic data and to speed up RNA motif queries on this data.
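The idea of filtering on the relative location of elements can be sketched as follows. This is only a toy illustration under strong assumptions, not the filter under development: the two elements are reduced to exact strings, their spacing to a gap range, and the names `find_all` and `filter_candidates` are hypothetical. Only the windows that pass the filter would then be handed to an exact, more expensive method.

```python
def find_all(sequence, motif):
    """Return the start positions of all (possibly overlapping)
    exact occurrences of `motif` in `sequence`."""
    positions = []
    i = sequence.find(motif)
    while i != -1:
        positions.append(i)
        i = sequence.find(motif, i + 1)
    return positions


def filter_candidates(sequence, left_motif, right_motif, min_gap, max_gap):
    """Return (start, end) windows where `right_motif` starts between
    `min_gap` and `max_gap` positions after the end of `left_motif`.

    Exact string matching is a simplification made for this sketch;
    a realistic filter would tolerate mismatches and use structural
    information as well.
    """
    right_starts = find_all(sequence, right_motif)
    windows = []
    for left_start in find_all(sequence, left_motif):
        left_end = left_start + len(left_motif)
        for right_start in right_starts:
            gap = right_start - left_end
            if min_gap <= gap <= max_gap:
                windows.append((left_start, right_start + len(right_motif)))
    return windows


if __name__ == "__main__":
    seq = "AAGGACUUUUGUCCAA"
    # Candidate windows to pass on to an exact method.
    print(filter_candidates(seq, "GGAC", "GUCC", min_gap=3, max_gap=8))
```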
RNA locally optimal structures
When the structure of a noncoding RNA is not known, it is still possible to enhance the search by considering the set of all plausible secondary structures. This gives rise to a new problem, which we call multi-structure matching. This is the subject of the thesis of A. Saffarian. Her work aims at defining better data models for the set of all secondary structures of a given RNA, including suboptimal and locally optimal ones, together with efficient algorithms for pattern matching.
The RNAspace open-source platform
Besides these theoretical issues, we are part of a new consortium for a national collaborative open-source platform devoted to noncoding RNA analysis, called RNAspace. The project is conducted in collaboration with INRA Toulouse (INRA: French National Institute for Agricultural Research) (Christine Gaspin) and the Institut de Génétique et Microbiologie de l'Université Paris Sud (Daniel Gautheret). Its goal is to develop and integrate functionalities allowing structural and functional noncoding RNA annotation. The platform allows the user to run a set of tools, including the most appropriate noncoding RNA gene finders, to integrate results, and to explore and analyse RNA gene candidates. Sequoia is a main contributor to this project. This is also a stepping stone for other tools developed in the team: carnac, gardenia, Yass and CG-seq are made available in the first release of RNAspace.