Section: Scientific Foundations
Introduction
From a historical perspective, research in bioinformatics started with string algorithms designed for the comparison of sequences. Bioinformatics then diversified, accompanying the emergence of new high-throughput technologies: DNA chips, mass spectrometry, and others. By analogy with the living cell itself, bioinformatics is now composed of a variety of dynamically interacting components forming a large network of knowledge: systems biology, proteomics, text mining, phylogeny, structural genomics, and more. Sequence analysis remains a central node in this interconnected network, and it is at the heart of the Sequoia project.
It is common knowledge nowadays that the amount of sequence data available in public databanks (such as GenBank and others) grows at an exponential pace. The recent advent of new sequencing technologies, also called Next Generation Sequencing or deep sequencing, has amplified this phenomenon. Sequencing a bacterial genome is now done routinely, at a very moderate cost. Even though the first draft human genome sequence was obtained only eight years ago, obtaining the genome sequence of a eukaryotic organism is also becoming a routine and low-cost operation.
Next Generation Sequencing promises to revolutionize genomics and transcriptomics. It allows for fast and low-cost massive acquisition of short genomic fragments and thus represents a remarkable tool for genome studies. It gives rise to new problems and sheds new light on old ones by revisiting them: accurate and efficient remapping and pre-assembly, fast and accurate search of non-exact (but quality-labelled) reads, and of non-species-specific reads. To illustrate this, the SOLiD technology enables error detection and correction, providing accurate genome-wide SNP detection even at sparse read coverage. In June 2009, Illumina announced a personal genome sequencing service offering the sequencing of an individual genome for $48,000. This also opens the way to a variety of applications: contamination and vector detection, and fast and accurate species detection in metagenomics, which themselves have great potential for the detection of animal and human epidemics. As a result, sequence analysis and sequence processing are now receiving renewed attention [57].
The second incentive for sequence analysis is the progress in the elucidation of mechanisms of genome functioning. Molecular biology is a rapidly evolving science. Originally, sequence analysis was mostly driven by the scheme of the central dogma in its simplest formulation: information is contained in DNA, then it is transcribed into messenger RNA, and finally translated into proteins. New findings that shed new light on the central dogma are now available. First, it is now widely recognized that the role of noncoding RNA genes was largely underestimated until the late 90's. Following miRNAs and snoRNAs, many new families of noncoding RNA genes have been discovered recently, among them piRNAs and tasiRNAs. RNA genes are now known to play an important role in many cellular processes, including protein synthesis and regulation. Furthermore, recent observations derived from tiling arrays or deep-sequencing technologies show that a large part of the transcriptional output of eukaryotic genomes does not appear to encode proteins.
Another biological phenomenon supplementing the central dogma occurs at the protein level. Translation of RNA is not the only way proteins are synthesized in the cell: some peptides (typically in bacteria and fungi) result from nonribosomal synthesis performed by a separate cellular machinery. As the name suggests, it is an alternative pathway that allows for the production of polypeptides that are not encoded in the genome and that are produced without ribosomes, by enzymatic complexes called nonribosomal peptide synthetases (NRPSs). This biosynthesis was first described in the 70's [47]. Over the last decade, interest in nonribosomal peptides and their synthetases has increased considerably, as witnessed by the growing number of publications in this field. These peptides are or could be used in many existing or potential biotechnological and pharmaceutical applications (e.g. anti-tumor agents, antibiotics, immunomodulators).
Lastly, computer hardware is also evolving, with the advent of massively multicore processors. For a few years now, heat dissipation issues have prevented processors from reaching higher clock frequencies; the thermal density of some processors approaches that of the surface of the sun [44]. One answer for sustaining Moore's law is parallel processing. Grid environments provide tools for the effective implementation of coarse-grained parallelization. Recently, another kind of hardware has attracted interest: graphics processing units (GPUs), which are a first step towards massively multicore processors. They allow everyone to have some teraflops of cheap computing power in their personal computer. High-end GPUs, costing less than $500, embed far more arithmetic units than a CPU of the same price, and recent trends blur the line between such GPUs and CPUs. Moreover, frameworks like CUDA (released in 2007) and OpenCL (specified in December 2008) facilitate the use of these units for general-purpose computation. We believe that this new era in hardware architecture will bring new opportunities in large-scale sequence analysis. For example, recent GPU parallelizations of sequence analysis problems achieve speedups between 10× and 100× compared to a serial single-core version.
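To give a flavor of why GPUs fit sequence analysis so well, the sketch below shows a minimal CUDA kernel for a trivially data-parallel subtask of read mapping: each thread counts the mismatches between one short read and the reference window it has been aligned to. This is an illustrative example only, not code from the project; the kernel name count_mismatches, the fixed read length, and the toy data are assumptions made for the sake of the sketch (error checking is also omitted).

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative sketch (not project code): one thread per read counts the
// mismatches between a short read and the reference window it maps to.
__global__ void count_mismatches(const char *reads, const char *ref_windows,
                                 int read_len, int n_reads, int *mismatches)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_reads) return;

    const char *read = reads + (size_t)i * read_len;
    const char *ref  = ref_windows + (size_t)i * read_len;

    int m = 0;
    for (int j = 0; j < read_len; ++j)
        m += (read[j] != ref[j]);
    mismatches[i] = m;
}

int main()
{
    const int read_len = 36, n_reads = 4;
    // Toy data: 4 reads of length 36 and the reference windows they map to.
    char h_reads[n_reads * read_len], h_ref[n_reads * read_len];
    for (int i = 0; i < n_reads * read_len; ++i) {
        h_reads[i] = "ACGT"[i % 4];
        h_ref[i]   = (i % 17 == 0) ? 'A' : "ACGT"[i % 4];  // sprinkle a few mismatches
    }

    char *d_reads, *d_ref;
    int *d_mis, h_mis[n_reads];
    cudaMalloc((void **)&d_reads, sizeof(h_reads));
    cudaMalloc((void **)&d_ref,   sizeof(h_ref));
    cudaMalloc((void **)&d_mis,   sizeof(h_mis));
    cudaMemcpy(d_reads, h_reads, sizeof(h_reads), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ref,   h_ref,   sizeof(h_ref),   cudaMemcpyHostToDevice);

    // Launch enough threads so that each read gets its own thread.
    int threads = 128, blocks = (n_reads + threads - 1) / threads;
    count_mismatches<<<blocks, threads>>>(d_reads, d_ref, read_len, n_reads, d_mis);
    cudaMemcpy(h_mis, d_mis, sizeof(h_mis), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n_reads; ++i)
        std::printf("read %d: %d mismatches\n", i, h_mis[i]);

    cudaFree(d_reads); cudaFree(d_ref); cudaFree(d_mis);
    return 0;
}

One thread per read keeps memory accesses simple and scales naturally to millions of reads; actual read-mapping tools would replace the plain mismatch count with banded dynamic programming and take base qualities into account, but the parallelization pattern remains the same.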
All the above-mentioned biological phenomena, together with the large volumes of new sequence data and the new hardware, raise a number of new challenges for bioinformatics, both in modeling the underlying biological mechanisms and in processing the data efficiently.