Team Symbiose

Section: Scientific Foundations

Optimized algorithms on parallel specialized architectures

Mixing parallel computing and genomics is motivated both by the large volume of data to handle and by the complexity of certain algorithms. Today (Dec. 2008), more than 800 genomes, including the human genome, have been completely sequenced, and many more sequencing projects are under way (the 1000 Genomes Project, the Human Microbiome Project, ...; see the Genomes OnLine Database, http://www.genomesonline.org/). Huge databases thus become necessary, whose volume approximately doubles every year. This exponential growth is not expected to slow in the next few years, driven by low-cost sequencing technologies and by new needs such as the isolation of important conserved structures in closely related species or metagenomics for ecological studies.

The problem is to explore these banks efficiently and extract relevant information. A routine activity is to perform content-based searches for unknown DNA or protein sequences: the goal is to detect similar objects in the banks. The basic assumption is that two sequences sharing significant similarity (runs of identical characters) are likely to share related functionality, which warrants further investigation.

The first algorithms for comparing genomic sequences, essentially based on dynamic programming techniques, were developed in the seventies [97], [106]. Then, with the continuing growth of data, faster algorithms were designed to drastically speed up the search. The BLAST software [108] now acts as the reference for rapid searches over large databases. But in spite of its short computation time (compared to the first algorithms), a growing number of genomic studies require much lower computation times. Parallelizing the search over large parallel computers is a first solution, implemented for instance in the LASSAP software (J.-J. Codani, [81]). Other works concern dedicated hardware machines. Several research prototypes, such as SAMBA [83], BISP [70], HSCAN [82], and BioScan [112], have been proposed, leading today to powerful commercial products: BioXL, DeCypher, and GeneMatcher, respectively from Compugen Ltd. (http://www.compugen.co.il/), TimeLogic (http://www.timelogic.com), and Paracel (http://www.paracel.com).
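
To make the starting point concrete, here is a minimal sketch of the dynamic programming recurrence underlying these tools, in the spirit of the local alignment algorithms cited above. The scoring parameters (+2 match, -1 mismatch, -2 gap) and the function name are illustrative, not those of any of the cited tools.

    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Minimal local-alignment score (Smith-Waterman style), O(m*n) time,
    // keeping only two rows of the dynamic programming matrix.
    int local_align_score(const std::string& a, const std::string& b) {
        const int match = 2, mismatch = -1, gap = -2;  // illustrative scores
        std::vector<int> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
        int best = 0;
        for (size_t i = 1; i <= a.size(); ++i) {
            for (size_t j = 1; j <= b.size(); ++j) {
                int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
                curr[j] = std::max({0, diag, prev[j] + gap, curr[j - 1] + gap});
                best = std::max(best, curr[j]);
            }
            std::swap(prev, curr);
        }
        return best;
    }

    int main() {
        std::printf("%d\n", local_align_score("ACGTTGA", "ACGTCGA"));
        return 0;
    }

The speed-up of BLAST over such exhaustive dynamic programming comes from avoiding the full quadratic matrix: it first locates short exact seeds and only extends alignments around them.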

Beyond the standard search process, this huge volume of freely available data naturally promotes new fields of investigation requiring much more computing power, such as comparing sets of complete genomes, classifying all known proteins (the Décrypthon project), or building specialized databases (ProDom). Of course, the solutions discussed above can still be used, although over the last 3-4 years a new alternative has appeared with grid technology: a single computation is distributed over a group of computers geographically scattered and connected through the Internet. Today, a few grid projects focusing on genomics applications are under deployment: the bioinformatics working group (WP 10) of the European DataGrid project; the BioGRID subproject of the EuroGRID project; the GenoGRID project, deploying an experimental grid for genomics applications; and the GriPPS (Grid Protein Pattern Scanning) project.
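
The distribution scheme used by such grids can be sketched in a few lines: the bank is split into independent chunks, each node scans its chunk, and only the best hits are merged. The sketch below uses threads as stand-ins for grid nodes, and the toy score function is hypothetical; it is not the code of any of the projects above.

    #include <algorithm>
    #include <string>
    #include <thread>
    #include <vector>

    // Toy per-sequence score (a stand-in for a real alignment score):
    // number of identical characters at equal positions.
    int score(const std::string& q, const std::string& t) {
        size_t n = std::min(q.size(), t.size());
        int s = 0;
        for (size_t i = 0; i < n; ++i) s += (q[i] == t[i]);
        return s;
    }

    // Split the bank round-robin into P chunks; each worker (standing in
    // for one grid node) scans its chunk independently; merge at the end.
    int distributed_scan(const std::string& query,
                         const std::vector<std::string>& bank, unsigned P) {
        std::vector<int> best(P, 0);
        std::vector<std::thread> workers;
        for (unsigned p = 0; p < P; ++p)
            workers.emplace_back([&, p] {
                for (size_t i = p; i < bank.size(); i += P)
                    best[p] = std::max(best[p], score(query, bank[i]));
            });
        for (auto& w : workers) w.join();
        return *std::max_element(best.begin(), best.end());
    }

Because each bank sequence is scanned independently, the only communication is the final merge, which is what makes this workload well suited to loosely coupled machines on the Internet.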

Note that the large amount of genomic data is not the only motivation for parallelizing computations. The complexity of certain algorithms is another strong motivation, especially for the analysis of structures in sequences [BMW03]. For instance, predicting the 3D structure of a protein from its amino acid sequence is an extremely difficult challenge, both in terms of modeling and of computation time. The problem is investigated in many ways, ranging from de novo folding prediction to protein threading techniques [96]. The underlying problems are NP-complete, and both combinatorial optimization and parallelization are required to compute a solution in a reasonable amount of time.
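
To illustrate why threading is combinatorial, the toy search below places m core blocks at increasing positions along a sequence; the pairwise interaction terms couple non-adjacent blocks, which is precisely what defeats a simple left-to-right dynamic programming decomposition. All names and scores here are invented for illustration and do not reproduce any published threading method.

    #include <algorithm>
    #include <climits>
    #include <vector>

    // Toy threading search: choose increasing positions pos[0..m-1] for the
    // m core blocks of a structural template along a sequence of length n.
    struct Threading {
        int n, m;
        std::vector<std::vector<int>> single;  // single[b][p]: toy block score

        // Hypothetical interaction between two placed blocks; because it
        // couples non-adjacent blocks, no simple left-to-right DP applies.
        int pair_score(int, int p1, int, int p2) const {
            return (p2 - p1) % 7 == 0 ? 1 : 0;
        }

        // Exhaustive enumeration of all placements (combinatorially many).
        int search(std::vector<int>& pos, int b, int acc) const {
            if (b == m) return acc;
            int best = INT_MIN;
            int from = b ? pos[b - 1] + 1 : 0;
            for (int p = from; p <= n - (m - b); ++p) {
                int s = acc + single[b][p];
                for (int k = 0; k < b; ++k) s += pair_score(k, pos[k], b, p);
                pos[b] = p;
                best = std::max(best, search(pos, b + 1, s));
            }
            return best;
        }
    };

In practice such enumeration is intractable, which is why branch-and-bound, integer programming, and parallel search are combined to prune the placement tree.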

Over the last 2-3 years, the computational power of GPU boards (Graphics Processing Units) has increased dramatically. They have become a real alternative for offloading very time-consuming general-purpose computations, an activity referred to as GPGPU (General-Purpose computation on GPUs). Many bioinformatics algorithms present features that make them amenable to efficient parallelization on such hardware. In 2007, we started investigating the potential of this hardware support on several basic bioinformatics algorithms, as sketched below.
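
As a flavour of this line of work, here is a minimal CUDA sketch in which each thread scores the query against one bank sequence, exploiting the mutual independence of the comparisons. Fixed-length sequences and a plain match count keep the example short; the names and parameters are ours, and a real kernel would implement a full alignment recurrence.

    #include <cuda_runtime.h>

    #define LEN 64  // illustrative fixed sequence length

    // One thread per bank sequence: an embarrassingly parallel scan.
    __global__ void scan_kernel(const char* query, const char* bank,
                                int n_seq, int* scores) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_seq) return;
        const char* t = bank + i * LEN;
        int s = 0;
        for (int j = 0; j < LEN; ++j) s += (t[j] == query[j]);
        scores[i] = s;
    }

    int main() {
        const int n_seq = 1024;
        char *d_query, *d_bank;
        int *d_scores;
        cudaMalloc((void**)&d_query, LEN);
        cudaMalloc((void**)&d_bank, n_seq * LEN);
        cudaMalloc((void**)&d_scores, n_seq * sizeof(int));
        // ... fill d_query and d_bank with cudaMemcpy before launching ...
        scan_kernel<<<(n_seq + 255) / 256, 256>>>(d_query, d_bank,
                                                  n_seq, d_scores);
        cudaDeviceSynchronize();
        cudaFree(d_query); cudaFree(d_bank); cudaFree(d_scores);
        return 0;
    }

The one-thread-per-sequence layout matches the GPU execution model well: thousands of lightweight threads hide memory latency, and no synchronization is needed between comparisons.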

