Section: New Results
Intensive sequence comparison and filtering
The first step in the analysis of a new sequence is to compare it with already known sequences. We work on all aspects of speeding up this time-consuming task: efficient architectures, efficient indexing schemes, and efficient filtering of sequences.
Intensive Comparison on FPGA
We propose PLAST (Parallel Local Alignment Search Tool), a very efficient alternative to BLAST for sequence bank comparison. An implementation of PLAST has been developed on the SGI RASC 100 architecture. This platform, composed of two large high-performance FPGAs (200K logic cells), is linked to an Altix 350 (dual-core Intel Itanium 2, 1.6 GHz) through the SGI NUMAlink bus, providing a theoretical bandwidth of 3.2 GB/s in each direction. The time-consuming part of the PLAST algorithm (sequence comparison) is offloaded to the FPGA, which implements a dedicated parallel sequence comparison operator. It is organized as an array of 192 small dedicated processing elements, each one computing a single alignment. A speed-up of 53x has been achieved over software execution on the Altix 350, and of 19x over the NCBI tBLASTn software  ,  .
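As a rough software illustration of this organization (a toy sketch, not the actual hardware operator), one can emulate an array of processing elements that each score one sequence pair; the match/mismatch score below is a hypothetical stand-in for the alignment computation performed by each hardware element:

```python
def pe_score(a, b):
    """One processing element: a toy ungapped score
    (match +1, mismatch -1), standing in for the
    alignment computed by each hardware element."""
    return sum(1 if x == y else -1 for x, y in zip(a, b))

def pe_array(pairs, n_pe=192):
    """Software analogue of the FPGA array: sequence pairs are
    dealt round-robin to n_pe lanes, which in hardware would be
    scored in parallel; here they are scored sequentially."""
    lanes = [[] for _ in range(n_pe)]
    for j, pair in enumerate(pairs):
        lanes[j % n_pe].append(pair)
    return [pe_score(a, b) for lane in lanes for a, b in lane]
```

In the real design, the 192 elements operate concurrently on the FPGA; the round-robin dispatch above only mimics how work would be spread across them.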
Genome mapping and next-generation sequencing
Next-generation sequencing technologies (NGS) produce large quantities of genomic data that are useful for a wide range of large-scale applications. This triggers the need for new algorithms able to map millions of short sequences accurately and efficiently onto large genomes. We developed GASSST, a Global Alignment Short Sequence Search Tool. It is a new short read aligner which can map reads with gaps and mismatches at very high speed. It uses the standard seed-and-extend strategy. The novelty of our approach lies in a new filtering step that discards candidate hits before the computationally expensive extension step (Needleman-Wunsch algorithm) has to be executed. We developed a series of filters of increasing complexity and efficiency, capable of quickly eliminating most false-positive candidate hits over a wide range of execution configurations, with few or many gaps, low or high error rates.
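The seed-and-filter-and-extend idea can be sketched as follows (a deliberately simplified illustration, not the GASSST implementation: the filter here only bounds mismatches of the ungapped placement, whereas the actual filters also handle gaps):

```python
from collections import defaultdict

def build_seed_index(genome, k):
    """Seed step: index every k-mer of the genome."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def quick_filter(read, genome, pos, max_mismatches):
    """Cheap necessary condition: count mismatches of the ungapped
    placement of the read at pos, aborting as soon as the budget
    is exceeded.  Only survivors would reach the costly
    Needleman-Wunsch extension."""
    if pos + len(read) > len(genome):
        return False
    errors = 0
    for a, b in zip(read, genome[pos:pos + len(read)]):
        if a != b:
            errors += 1
            if errors > max_mismatches:
                return False
    return True

def map_read(read, genome, index, k, max_mismatches):
    """Look up the read's first k-mer, then filter each candidate hit."""
    return [pos for pos in index.get(read[:k], [])
            if quick_filter(read, genome, pos, max_mismatches)]
```

The point of such a filter is that it is linear in the read length with early exit, while the extension step is quadratic; rejecting most false candidates before extension dominates the overall speed.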
Many genetic mutations can be found by re-sequencing an organism and comparing the data with a reference genome. However, next-generation sequencing data may consist of very short DNA fragments (reads), with lengths starting at 30 base pairs. At this read length, it was shown that only 80% of the human genome can be re-sequenced. Recently, sequencers have become able to produce mate-paired reads, i.e., pairs of fragments separated by a known distance. We designed an efficient algorithm  to determine which part of a genome can be re-sequenced using mate-paired reads. Using this algorithm, we showed that mate-paired reads of 20-25 base pairs suffice to re-sequence 95% of the human genome. We have also started to investigate de novo eukaryotic genome assembly from mate-paired reads, an NP-hard problem that remains a bottleneck for the use of NGS data.
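The underlying criterion can be illustrated with a brute-force sketch (our published algorithm is far more efficient; this toy version only states the uniqueness condition): a position is recoverable when its mate-pair signature, i.e. the two reads at a fixed separation, occurs exactly once in the genome.

```python
from collections import Counter

def resequencable_fraction(genome, read_len, gap):
    """Toy uniqueness criterion for mate-paired re-sequencing:
    a position is counted as re-sequenceable if its signature
    (left read, right read at distance gap) is unique in the genome."""
    span = 2 * read_len + gap
    sigs = [(genome[i:i + read_len],
             genome[i + read_len + gap:i + span])
            for i in range(len(genome) - span + 1)]
    counts = Counter(sigs)
    unique = sum(1 for s in sigs if counts[s] == 1)
    return unique / len(sigs)
```

On a genome with no repeated signature the fraction is 1.0; on a perfectly repetitive genome it drops to 0.0, which is why single short reads alone recover only part of the human genome while the known mate-pair distance restores uniqueness.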
Discovery of molecular markers for the efficient identification of living organisms remains a challenge of high interest, given the huge amounts of data that will soon become available in all kingdoms of life. The diversity of species can now be observed in detail with low-cost genomic sequences produced by the new generation of sequencers. We developed a method, called c-GAMMA, which formalizes the design of new markers from such data. It is based on a series of filters on forbidden pairs of words, followed by an optimization step on the discriminative power of the candidate markers. This method was implemented and tested on a set of microbial genomes (Thermococcales)  .
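The filtering idea can be caricatured as follows (a hypothetical simplification, not the c-GAMMA filters themselves, which operate on pairs of words): keep words shared by all target genomes, then discard the forbidden ones that also occur in any non-target genome.

```python
def candidate_markers(targets, others, k):
    """Toy marker filter: k-mers present in every target genome
    and absent from all non-target genomes.  The real method works
    on forbidden pairs of words and then optimizes the
    discriminative power of the survivors."""
    def kmers(g):
        return {g[i:i + k] for i in range(len(g) - k + 1)}
    shared = set.intersection(*(kmers(g) for g in targets))
    forbidden = set.union(*(kmers(g) for g in others)) if others else set()
    return shared - forbidden
```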
Multiple alignment of sequences is an NP-hard problem in which useless long computations can be avoided by focusing on the multiple repeats the sequences contain. The key idea is to remove, before the multiple alignment phase, the portions of the sequences that cannot contain the sought repeats. This preliminary step is thus seen as a filter. Basically, a filter applies a carefully chosen necessary condition to a sequence or a set of sequences. In  ,  we proposed a filter, called Tuiuiu, for speeding up the search for multiple repeats; it improves on previous attempts through a stronger necessary condition, and we proposed an efficient way to apply it to large sets of sequences. It is available on GenOuest.
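The flavour of such a necessary condition can be sketched with a q-gram filter (a simplified illustration, not the exact Tuiuiu condition): a window is kept only if it shares enough q-grams with some other sequence, since a window containing a repeat shared with that sequence necessarily does.

```python
def qgrams(text, q):
    """Set of all length-q substrings of text."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def filter_windows(sequences, L, q, t):
    """Keep (sequence, start) pairs whose length-L window shares at
    least t q-grams with another sequence -- a relaxed necessary
    condition for containing a repeat common to both.  Windows that
    fail can be discarded before multiple alignment."""
    pools = [qgrams(s, q) for s in sequences]
    kept = []
    for si, s in enumerate(sequences):
        for i in range(len(s) - L + 1):
            w = qgrams(s[i:i + L], q)
            if any(len(w & pools[sj]) >= t
                   for sj in range(len(sequences)) if sj != si):
                kept.append((si, i))
    return kept
```

A stronger condition (as in Tuiuiu) requires the shared q-grams to be grouped within windows of the other sequences, which rejects more false candidates at slightly higher cost.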
Average-case analysis of indexes
Participant : Jérémie Bourdon.
Factor and suffix oracles were introduced in 1999 to provide an economical and efficient solution for storing all the factors and suffixes, respectively, of a given text. Whereas good estimates exist for the size of the factor/suffix oracle in the worst case, no average-case analysis had been done until now. In  , we give an estimate of the average size of the factor/suffix oracle of an n-length text when the alphabet size is 2, under a Bernoulli distribution model with parameter 1/2. To reach this goal, a new oracle is defined which shares many of the properties of a factor/suffix oracle but is easier to study, and which provides an upper bound on the average size we are interested in. Our study introduces tools that could be used further in other average-case analyses of factor/suffix oracles, for instance when the alphabet size is arbitrary.
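For reference, the classical online construction of the factor oracle (Allauzen, Crochemore and Raffinot, 1999) can be written compactly; the size being analysed above is essentially the number of transitions, which is known to lie between n and 2n-1 for an n-length text:

```python
def factor_oracle(word):
    """Online factor oracle construction: n+1 states, transitions
    stored as one dict per state, plus supply (suffix) links.
    The oracle accepts at least every factor of word."""
    n = len(word)
    trans = [dict() for _ in range(n + 1)]
    supply = [-1] * (n + 1)
    for i, c in enumerate(word):
        trans[i][c] = i + 1                    # internal transition
        k = supply[i]
        while k != -1 and c not in trans[k]:   # add external transitions
            trans[k][c] = i + 1                # along the supply chain
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][c]
    return trans, supply

def accepts(trans, s):
    """Read s from the initial state; True if every letter has a transition."""
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True
```

Counting the transitions produced by this construction over random binary texts is exactly the kind of quantity the average-case analysis above estimates.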