Section: New Results
Protein coding sequences
Participants : Arnaud Fontaine, Marta Girdea, Gregory Kucherov, Laurent Noé, Hélène Touzet.
Back-translation is the process of computing the putative DNA sequences that encode a given protein. Although the number of back-translated sequences increases exponentially with the length of the protein, such sequences are useful, especially when dealing with frameshifted proteins.
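To make the exponential growth concrete, here is a minimal Python sketch that counts and enumerates the back-translations of a short peptide. It assumes the standard genetic code; the function names are ours, chosen for illustration, and are not taken from the tool described in this section.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation table is needed for this illustration).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Reverse map: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def count_back_translations(protein):
    """Number of DNA sequences encoding `protein`: product of codon degeneracies."""
    return prod(len(CODONS_FOR[aa]) for aa in protein)


def back_translations(protein):
    """Lazily enumerate every DNA sequence that translates to `protein`."""
    for codons in product(*(CODONS_FOR[aa] for aa in protein)):
        yield "".join(codons)


if __name__ == "__main__":
    for peptide in ("MW", "MLS", "MLSRLS"):
        print(peptide, count_back_translations(peptide))
```

Each residue multiplies the count by its codon degeneracy (between 1 and 6), so explicit enumeration is only practical for very short peptides.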
For the last two years, we have been interested in using this approach to detect remote homologies. We have proposed an efficient dynamic programming algorithm that aligns the complete sets of putative DNA sequences of two proteins and determines the pair of putative DNA sequences that align best under a scoring scheme designed to reflect the most probable evolutionary process. This allows us to uncover evolutionary information that is not captured by traditional alignment methods, as confirmed by biologically significant examples.
The results were published at the WABI workshop this year, and an extended version of this work has been accepted by the journal Algorithms for Molecular Biology. A web interface to the tool developed within this framework is now available at http://bioinfo.lifl.fr/path/
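The problem being solved can be illustrated with a naive baseline: enumerate every back-translation of both proteins, align each pair, and keep the best one. The sketch below is not the efficient algorithm mentioned above, which avoids this exhaustive enumeration. The scoring values (match +1, mismatch -1, gap -2) are placeholder assumptions rather than the evolutionary scoring scheme of the published work, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))


def back_translations(protein):
    """All DNA sequences encoding `protein` (exponential in its length)."""
    return ["".join(c) for c in product(*(CODONS_FOR[aa] for aa in protein))]


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, computed row by row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def best_back_translated_pair(protein_a, protein_b):
    """Brute force: return (score, dna_a, dna_b) for the best-aligning pair."""
    return max(
        (global_score(x, y), x, y)
        for x in back_translations(protein_a)
        for y in back_translations(protein_b)
    )


if __name__ == "__main__":
    print(best_back_translated_pair("MF", "ML"))
```

The number of pairs is the product of the two back-translation counts, so this baseline is only usable on peptides a few residues long, which is why a dynamic programming formulation over the full sets of putative sequences is needed in practice.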
Computational identification of protein-coding sequences
Gene prediction is an essential step in understanding the genome of a species once it has been sequenced. A promising direction in current research on gene finding is the comparative genomics approach. We designed a novel method to identify evolutionarily conserved protein-coding sequences in genomes. The rationale behind the method is that protein-coding sequences should feature mutations that are consistent with the genetic code and that tend to preserve the function of the translated amino acid sequence. The algorithm takes advantage of the specific substitution pattern of coding sequences together with the consistency of reading frames. It has been implemented in a software tool called protea. We have conducted a large-scale analysis on thousands of conserved elements across eighteen eukaryotic genomes, including the human genome. This experiment reveals the existence of new putative protein-coding sequences. Most of them are likely to be involved in alternatively spliced transcripts, or to correspond to unannotated exons of predicted genes. This work appeared in  .
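The rationale above can be illustrated with a toy Python sketch: given two aligned, gap-free homologous DNA fragments, count synonymous and nonsynonymous codon differences in each of the three forward reading frames. A frame in which most differences are synonymous is more consistent with a coding region. This is only an illustration of the idea under the standard genetic code; it is not the protea algorithm, and the function names and the simple score are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_changes(seq_a, seq_b, frame):
    """Return (synonymous, nonsynonymous) codon differences in `frame` (0, 1 or 2).

    Both sequences are assumed to be aligned without gaps; codons containing
    characters outside ACGT are skipped.
    """
    syn = nonsyn = 0
    for i in range(frame, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


def most_coding_like_frame(seq_a, seq_b):
    """Pick the frame maximising (synonymous - nonsynonymous), a toy score."""
    def score(frame):
        syn, nonsyn = codon_changes(seq_a, seq_b, frame)
        return syn - nonsyn
    return max(range(3), key=score)


if __name__ == "__main__":
    a = "ATGCTTGGAAAA"
    b = "ATGCTCGGGAAG"
    for frame in range(3):
        print(frame, codon_changes(a, b, frame))
    print("most coding-like frame:", most_coding_like_frame(a, b))
```

In this example all three differences fall on third codon positions of frame 0 and leave the encoded amino acids unchanged, whereas the other two frames read the same differences as amino acid replacements.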