Section: New Results
Computational analysis of the evolution of species, genomes and gene families
The classification of a new sequence into an existing collection is needed in several contexts: species classification, comparison of a new sequence against a database, and the update of sequence databases of homologous gene families (genes descending from a common ancestor). This classification identifies the family to which the sequence belongs and contributes to the assessment of its evolutionary relationships. Today, massive sequencing techniques are routinely used and the number of newly available sequences grows quickly. Furthermore, the identification task requires chaining different programs (for similarity search, alignment and phylogenetic tree computation) that are sometimes complex to handle, and some results must be checked manually. Performing these tasks sequentially makes sequence identification tedious and time-consuming, so automated bioinformatic methods are needed to carry out these operations accurately and quickly. As part of her PhD, Anne-Muriel Arigon has developed a method that automatically assigns sequences to homologous gene families from a set of databases. After identification of the gene family most similar to the query sequence, this sequence is added to the family alignment, and the phylogenetic tree of the family is rebuilt. The phylogenetic position of the query sequence in its gene family can then be easily identified.
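The family-assignment step can be sketched with a toy similarity measure. The code below is purely illustrative and is not A.-M. Arigon's method: the function names and the shared-k-mer Jaccard score are assumptions. It ranks gene families by k-mer similarity to a query; in the real pipeline, the best family's alignment and tree would then be updated.

```python
# Hypothetical sketch of family assignment: score each family by the
# Jaccard similarity between the query's k-mer set and the union of the
# family members' k-mer sets, and return the best-scoring family.

def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_family(query, families, k=3):
    """families: dict mapping family name -> list of member sequences."""
    q = kmers(query, k)

    def score(seqs):
        fam = set().union(*(kmers(s, k) for s in seqs))
        return len(q & fam) / len(q | fam)

    return max(families, key=lambda name: score(families[name]))
```

A real implementation would rely on proper similarity search and alignment tools rather than raw k-mer overlap, but the selection logic is the same.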
Recent heuristic advances have made maximum likelihood phylogenetic tree estimation tractable for hundreds of sequences. Notably, these algorithms are currently limited to reversible models of evolution. Reversibility is a technical property that leads to more tractable models but is clearly not satisfied by real evolutionary processes. As part of his PhD, Bastien Boussau has shown that by reorganising the way the likelihood is computed, one can efficiently compute the likelihood of a tree from any of its nodes under a nonreversible model of sequence evolution, and hence benefit from cutting-edge heuristics. This computational trick can be used with reversible models of evolution at no extra cost. Bastien then introduced nhPhyML, an adaptation of the nonhomogeneous, nonstationary model of Galtier and Gouy (1998; Mol. Biol. Evol. 15:871-879) to the structure of PhyML (2003; Syst. Biol. 52(5):696-704), as well as an approximation of the model in which the set of equilibrium frequencies is limited. This new version shows good results both in terms of exploration of the space of tree topologies and of ancestral G+C content estimation. Moreover, the approach supports a hypothesis that members of the HELIX team have defended in the past: the last common ancestor of all of today's living organisms did not live at high temperatures. nhPhyML was applied to the slowly evolving sites of rRNA sequences; the model and a wider taxonomic sampling still do not plead in favour of a hyperthermophilic last universal common ancestor.
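The quantity being reorganised here is the standard pruning recursion for tree likelihoods. The sketch below is illustrative only (a toy two-state symmetric model, not the nhPhyML code or its nonhomogeneous model): conditional likelihoods are propagated from the leaves towards a chosen root, and it is these per-node conditional vectors that make it possible to evaluate the likelihood from any node.

```python
# Felsenstein-style pruning for one site under a toy 2-state model.
# L[v][x] = probability of the data below node v given state x at v.
import math

def p_transition(x, y, t, rate=1.0):
    """2-state symmetric model: probability of y after time t given x."""
    same = 0.5 + 0.5 * math.exp(-2 * rate * t)
    return same if x == y else 1.0 - same

def prune(node, leaves, children, branch_len):
    """Return [L(state 0), L(state 1)] for the subtree rooted at node."""
    if node in leaves:
        obs = leaves[node]
        return [1.0 if x == obs else 0.0 for x in (0, 1)]
    like = [1.0, 1.0]
    for child in children[node]:
        cl = prune(child, leaves, children, branch_len)
        for x in (0, 1):
            like[x] *= sum(p_transition(x, y, branch_len[child]) * cl[y]
                           for y in (0, 1))
    return like

# Tiny example: root R with two leaves, A in state 0 and B in state 1
leaves = {"A": 0, "B": 1}
children = {"R": ["A", "B"]}
branch_len = {"A": 0.1, "B": 0.1}
cond = prune("R", leaves, children, branch_len)
site_likelihood = 0.5 * cond[0] + 0.5 * cond[1]  # uniform root frequencies
```

Under a reversible model the same value is obtained whichever node is taken as root; Boussau's observation is that, with a suitable reorganisation, nonreversible models can also be evaluated efficiently from any node.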
Besides research in phylogeny, Helix also conducts e-learning activities in the context of the ISee platform (see Section 5.17). In 2006, A. Chamontin finished the development of an ISee course dedicated to the computation of phylogenetic trees from a set of genomic sequences. In the first part of the course, the user follows the execution of the UPGMA algorithm step by step. In the second part, he/she has to make the right choices for the algorithm to progress correctly; for instance, at each iteration, he/she must choose the next term of the distance matrix and recompute the remaining terms. The course also provides three smaller programs that illustrate the limits of the method. They essentially explain why the distance between two sequences, in terms of differences, may differ from the evolutionary distance between the organisms. This course will be used by the CCSTI in Grenoble for its ``École de l'ADN'' aimed at graduate students.
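The iteration the course walks through can be sketched compactly. The following is a minimal UPGMA implementation written for this text (it is not the course code): at each step the two closest clusters are merged and the distances to the remaining clusters are recomputed as size-weighted averages, exactly the recomputation the student is asked to perform.

```python
# Minimal UPGMA sketch: returns the tree topology as a nested-parenthesis
# string (branch lengths omitted for brevity).

def upgma(labels, dist):
    """labels: taxon names; dist: dict mapping (a, b) pairs to distances."""
    d = {frozenset(p): v for p, v in dist.items()}
    clusters = {l: 1 for l in labels}   # cluster label -> number of leaves
    while len(clusters) > 1:
        # choose the pair of clusters at minimal distance
        (a, b), _ = min(
            (((x, y), d[frozenset((x, y))])
             for x in clusters for y in clusters if x < y),
            key=lambda item: item[1])
        merged = f"({a},{b})"
        # distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the distances from its two parts
        for c in clusters:
            if c not in (a, b):
                na, nb = clusters[a], clusters[b]
                d[frozenset((merged, c))] = (
                    na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
                ) / (na + nb)
        clusters[merged] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))
```

As the course's smaller programs emphasise, the input distances are observed differences, which need not equal evolutionary distances, so the resulting tree can be wrong even when every step is executed correctly.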
The arrival of Daniel Kahn in HELIX has brought to the team an internationally recognised expertise in the modular evolution of protein sequences. This expertise is currently stored in a database, ProDom, which has been transferred to the PRABI and is maintained and regularly updated by Daniel Kahn and members of the PRABI (see Section 5.27). The analysis of evolutionary scenarios of protein domain families from ProDom (2005; Nucleic Acids Res. 33(Database issue):212-215) showed that only a small minority of domain families are truly ancestral. Far from being static, the protein domain repertoire undergoes a continuous innovation process. The tremendous diversity of modular proteins therefore results both from the combinatorial assortment of protein domains and from an ongoing process of protein domain innovation.
In mammals, females carry two X chromosomes and males only one. To avoid higher gene expression in females, a mechanism inactivates one of the X chromosomes in each cell. This inactivation is controlled by a non-coding gene called Xist. Laurent Duret has shown that Xist evolved from a coding gene after the separation between eutherians (mammals that have a placenta) and marsupials (mammals, such as the kangaroo, in which the female typically has a pouch where her young are reared through early infancy). This result is an important step towards understanding sex determination in mammals. It confirms that, although sex is one of the most universal properties among eukaryotes, the way it is determined evolves very rapidly.
Statistical analysis of the global composition of genomes and its link with environmental or metabolic characteristics has been the focus of considerable interest. As part of Leonor Palmeira's PhD, we investigated the hypothesis that pyrimidine dinucleotides (dinucleotides composed of Cs and Ts) are avoided in light-exposed genomes as the result of a selective pressure due to high ultraviolet (UV) exposure. The main damage to DNA produced by UV radiation is known to be the formation of pyrimidine dimers, the product of a photochemical reaction between adjacent pyrimidines. All available complete prokaryotic genomes, and in particular the model organism Prochlorococcus marinus, were statistically analysed, and it was found that pyrimidine dinucleotides are not systematically avoided. This suggests that prokaryotes must have sufficiently effective protection and repair systems for UV exposure not to affect their dinucleotide composition.
The evolution of nucleic acid sequences is usually modelled by point substitutions, under the hypothesis that sites evolve independently of each other. This hypothesis is kept mainly for mathematical convenience and has no biological foundation, as it is now clear that molecular substitution mechanisms frequently involve adjacent bases. The most typical example is the highly frequent spontaneous chemical transformation of CpG dinucleotides observed in some sequences (the ``CpG'' notation distinguishes a cytosine C followed by a guanine G from a cytosine base-paired with a guanine). Berard, Gouere and Piau (2006; private communication) have shown that in some special cases, which include the CpG transformation, neighbour-dependent models are solvable: their equilibrium distributions can be determined. Still as part of Leonor Palmeira's PhD, the system was solved for a number of specific biological models. Under the assumption of stationarity, it was further shown that the substitution rates acting on any given sequence are easy to compute.
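For comparison, in the independent-sites setting that these neighbour-dependent results generalise, the equilibrium distribution is simply the stationary vector of the per-site rate matrix. The sketch below illustrates that baseline computation on an arbitrary toy rate matrix (the values are invented and are not fitted to any data; the neighbour-dependent case solved in the cited work is substantially harder).

```python
# Stationary distribution of a substitution rate matrix Q, obtained by
# iterating the discrete-time approximation P = I + dt*Q to a fixed point.

def stationary(Q, dt=0.01, tol=1e-12):
    """Return the equilibrium distribution pi such that pi * Q ~ 0."""
    n = len(Q)
    P = [[(1.0 if i == j else 0.0) + dt * Q[i][j] for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    while True:
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new

# Toy 4-state rate matrix over A, C, G, T (arbitrary values, rows sum to 0)
Q = [[-0.9, 0.3, 0.4, 0.2],
     [0.2, -0.6, 0.3, 0.1],
     [0.3, 0.4, -0.9, 0.2],
     [0.2, 0.3, 0.4, -0.9]]
pi = stationary(Q)
```

With neighbour dependence, the state of a site no longer evolves autonomously, which is precisely why solvability of the CpG-type models is a nontrivial result.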
The relationship between codon usage in prokaryotes and their ability to grow at extreme temperatures has been given much attention over the past years. Previous studies have suggested that the difference in synonymous codon usage between (hyper)thermophiles and mesophiles is a consequence of a selective pressure linked to growth temperature. A hyperthermophile is an organism that thrives in extremely hot environments, while a mesophile grows best at moderate temperatures. As part of her PhD, Anamaria Necsulea performed an updated analysis. The conclusion is that the difference in synonymous codon usage between (hyper)thermophilic and non-thermophilic species cannot be clearly attributed to a selective pressure linked to growth at high temperatures. Strong efforts are currently under way to determine the genomes of major eukaryotic human parasites. Anamaria Necsulea has started a bioinformatics analysis of some of these genomes, mainly in the genus Leishmania. She has shown a new and surprising usage of synonymous codons (codons that encode the same amino acid) in these organisms. Several biological interpretations are possible: this may either be explained by a better adaptation to the translational process, or be the result of mutational bias. New genomes are being sequenced, which gives hope that it will be possible to discriminate between the two hypotheses in the near future.
Modelling and analysis of the spatial organisation and dynamics of genomes
Genomes are organised as a succession of regions with different functional roles: introns, exons, intergenic regions, etc. Each of these regions has different statistical properties that are required by its functional role. However, other regional organisations exist at broader scales, isochores being the most studied. Isochores occur mainly in mammalian and avian genomes. An isochore is a large region (more than 300 kb) with a relative homogeneity in base frequencies, particularly in G+C. An analysis of the isochore organisation of a genome therefore needs to separate structures that occur at different scales. Before the PhD work of Christelle Melodelima, either the local structure was ignored or the analysis was restricted to exons. C. Melodelima proposed an HMM approach that simultaneously models the local organisation and allows (through a Bayesian approach) a model selection that leads to the segmentation of the genome into its isochores. Moreover, this original approach has led to new biological results, for instance on the relationships between the isochore organisation and the sequences coming just before and after each gene (5'UTR and 3'UTR).
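The basic mechanics of HMM-based segmentation can be shown on a deliberately tiny model. The code below is not C. Melodelima's model (her approach couples local structure, Bayesian model selection and isochore-scale states); it is a two-state Viterbi decoder, with invented transition and emission probabilities, that labels each position of a DNA string as G+C-rich or G+C-poor.

```python
# Two-state HMM segmentation of a DNA string with the Viterbi algorithm,
# computed in log space to avoid underflow.
import math

STATES = ("GCpoor", "GCrich")
TRANS = {"GCpoor": {"GCpoor": 0.95, "GCrich": 0.05},
         "GCrich": {"GCpoor": 0.05, "GCrich": 0.95}}
EMIT = {"GCpoor": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
        "GCrich": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}}

def viterbi(seq):
    """Most probable state path for seq under the toy model above."""
    v = [{s: math.log(0.5) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for c in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            row[s] = (v[-1][prev] + math.log(TRANS[prev][s])
                      + math.log(EMIT[s][c]))
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    # traceback from the best final state
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

The sticky transition probabilities (0.95 on the diagonal) are what produce long homogeneous segments rather than position-by-position labels, which is the same principle, at a much smaller scale, as isochore segmentation.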
Hidden Markov models are one possible way of segmenting a biological sequence. Another, developed in HELIX by Laurent Gueguen, is maximal predictive partitioning (MPP) (2001; LNCS vol. 2066, pages 32-44). Given a set of models (for instance, related to sequence composition), MPP builds the best partition of a sequence into k segments, scoring the adequacy of each model to each segment. During the partitioning, several models are compared in order to optimally segment a sequence of letters into homogeneous parts; such a model may, for instance, be Markovian. Using dynamic programming, MPP computes the k-partitions for successive values of k in time linear in k and in the length of the sequence. Hence, MPP gives a multi-scale representation of the sequence. Another way to compute a multi-scale segmentation of sequences is to use a hierarchical process, in which segments are recursively divided.
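The dynamic programme can be sketched for the simplest case of per-letter composition models. This is an illustration written for this text, not L. Gueguen's implementation: because segment scores are sums of per-position log-likelihoods, the table for k segments is obtained from the table for k-1 segments in a single left-to-right pass, which is what makes the overall cost linear in k and in the sequence length.

```python
# Maximal predictive partitioning sketch: best log-likelihood of a
# partition of seq into exactly k segments, each explained by one model,
# for every k = 1 .. kmax.
import math

def mpp_scores(seq, models, kmax):
    """models: list of dicts mapping letter -> probability."""
    n, m = len(seq), len(models)
    s = [[math.log(mod[c]) for mod in models] for c in seq]
    NEG = float("-inf")
    best, prev = [], None
    for k in range(1, kmax + 1):
        # cur[i][q]: best score of seq[0..i] in exactly k segments,
        # with position i explained by model q
        cur = [[NEG] * m for _ in range(n)]
        for i in range(n):
            for q in range(m):
                if i == 0:
                    base = 0.0 if k == 1 else NEG
                else:
                    base = cur[i - 1][q]                  # extend segment
                    if prev is not None:
                        base = max(base, max(prev[i - 1]))  # open a new one
                cur[i][q] = base + s[i][q]
        best.append(max(cur[n - 1]))
        prev = cur
    return best
```

Reading off `best` for successive k gives the multi-scale view mentioned above: the score improvement from k to k+1 segments indicates how much structure remains unexplained at the coarser scale.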
A package of Python modules, called Sarment, was developed for easy building and manipulation of sequence segmentations (2005; Bioinformatics 21(16):3427-3428). Its first aim is to provide an efficient implementation of the HMM segmentation algorithms (Viterbi and Forward-Backward) and of the MPP method; its second aim is to allow easy manipulation of the models used in HMMs and in MPP. An algorithm for computing the probability of a given segmentation has also been submitted for publication.
Gene order is not random with regard to gene expression in mammals: coexpressed genes, and in particular housekeeping genes (i.e. genes that are transcribed at a relatively constant level), are clustered along chromosomes more often than expected by chance. To understand the origin of these clusters, and to quantify the impact of this phenomenon on genome organisation, Laurent Duret and his former PhD student, Marie Semon, analysed clusters of coexpressed genes in the human and mouse genomes. They showed that neighbouring genes undergo continuous concerted expression changes during evolution, which leads to the formation of coexpressed gene clusters. The pattern of expression within these clusters evolves more slowly than the genomic average. Moreover, by studying gene order evolution, it was shown that some clusters are maintained by natural selection and therefore have a functional meaning. However, it was also demonstrated that some coexpressed gene clusters result from neutral coevolution effects, as illustrated by the clustering of genes escaping inactivation on the X chromosome.
Genomes undergo large-scale changes through evolution, called rearrangements. As part of the PhD of Claire Lemaitre, we are interested in the specific sequences where a rearrangement took place; more precisely, we seek to identify characteristics of these breakpoint regions that are specific to rearrangements. Breakpoints are detected by a comparative analysis of two genomes using their annotated orthologous genes; the breakpoint region is then refined by aligning intergenic regions. This method has been applied to the human and mouse genomes, and it makes it possible to analyse precisely the sequences around the breakpoints and to compare them with other sequences in the genome. The aim is to find sequence characteristics linked with genome dynamics, in order to understand the molecular mechanisms underlying rearrangements.
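The gene-level detection step can be illustrated on toy data. The sketch below is only the coarse first phase described above (the refinement by intergenic-sequence alignment is not shown): two ordered lists of orthologous genes are compared, and every adjacency present in one species but broken in the other marks a candidate breakpoint region.

```python
# Toy breakpoint detection between two gene orders (same gene set,
# orientation ignored for simplicity).

def breakpoints(order_a, order_b):
    """Return pairs of genes adjacent in species A but not in species B."""
    pos = {g: i for i, g in enumerate(order_b)}
    bps = []
    for x, y in zip(order_a, order_a[1:]):
        if abs(pos[x] - pos[y]) != 1:   # adjacency broken in B
            bps.append((x, y))
    return bps
```

For instance, if species B carries an inversion of the tail of species A's chromosome, the broken adjacency at the inversion boundary is reported, and the actual breakpoint then lies somewhere in the intergenic interval between the two genes.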
The mammalian X and Y chromosomes have evolved from an identical autosome pair. This process is at the origin of sexual differentiation: the female XX and the male XY pairs. Due to the recombination mechanism (recombination is the exchange of genetic material within and between chromosomes), the female organisation favours X chromosome conservation. On the other hand, the evolution of the male XY pair causes Y chromosome degeneration, as this chromosome gradually loses the capacity to recombine with its X partner.
Current theories suggest that the rearrangement process followed by these two chromosomes mainly consisted of a few large reversals, which happened in an ordered way, from the end to the beginning of the Y chromosome. Nevertheless, these theories remain partly controversial. As part of the PhD of Marilia Braga, we are trying to elucidate the question by reconstructing the rearrangement process from the information available in public databases. In addition, we are developing a new algorithmic model for the genome rearrangement problem, which might be better adapted to these ordered large reversal events.
In comparative genomics, algorithms that sort permutations by reversals are often used to propose evolutionary scenarios of large-scale genomic mutations between species. One of the main problems of such methods is that they return a single solution while the number of optimal solutions is huge, with no criterion to discriminate among them. A previous study by Bergeron and colleagues (2006; IEEE/ACM TCBB, in press) tried to give some structure to the set of optimal solutions, in order to present results in a more usable form than a single solution or a complete list of all solutions. The structure is a set of partially ordered sets (posets), all of whose linear extensions are solutions. However, no algorithm existed so far to compute this set of posets except through the enumeration of all solutions, which takes too much time even for small permutations. With an Italian master student, Celine Scornavacca, and with Marilia Braga, we devised such an algorithm, which produces all the posets and counts the number of solutions, with a better theoretical and practical complexity than a complete enumeration (paper in preparation). Several biological examples are provided where this result is more relevant than a unique optimal solution or the list of all solutions, the latter often being impossible to compute.
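The explosion of the solution space is easy to demonstrate by brute force. The sketch below is illustrative only, and is the naive enumeration the poset algorithm is designed to avoid: it counts, by breadth-first search over signed permutations, all minimum-length reversal scenarios sorting a (very small) permutation into the identity.

```python
# Exhaustive count of optimal reversal-sorting scenarios; only feasible
# for tiny signed permutations, which is exactly the point.

def reversal(perm, i, j):
    """Apply a signed reversal to perm[i..j]: reverse order, flip signs."""
    mid = tuple(-x for x in reversed(perm[i:j + 1]))
    return perm[:i] + mid + perm[j + 1:]

def count_optimal_scenarios(perm):
    """Return (reversal distance to the identity, number of distinct
    minimum-length scenarios)."""
    target = tuple(range(1, len(perm) + 1))
    frontier = {perm: 1}          # permutation -> number of shortest paths
    dist = 0
    while target not in frontier:
        nxt = {}
        for p, ways in frontier.items():
            for i in range(len(p)):
                for j in range(i, len(p)):
                    nq = reversal(p, i, j)
                    nxt[nq] = nxt.get(nq, 0) + ways
        frontier, dist = nxt, dist + 1
    return dist, frontier[target]
```

Even the 2-element permutation (2, 1) already admits several optimal scenarios, and the count grows very rapidly with permutation size, which is why a compact poset representation of the solution set is valuable.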
Another way to deal with the huge number of solutions provided by rearrangement reconstruction algorithms is to add biological constraints, such as favouring small inversions or inversions that do not cut clusters of co-localised genes. This approach is called ``perfect'' sorting. Together with a German master student, Yoan Diekmann, Eric Tannier and Marie-France Sagot devised an algorithm that tests whether there is a solution that respects the gene-cluster constraints, and returns one if such a solution exists. This was tested on gene-order data from several species, and statistics were provided for the cases where a solution exists or not. The algorithm was then extended to provide solutions that minimise the number of gene clusters that have to be broken by a rearrangement scenario.
Evolutionary scenarios of genomic rearrangements reconstructed in computational biology sometimes differ markedly from those obtained in cytogenetics, where different kinds of data are analysed (cytogenetics is the study of the structure of chromosome material). With the PhD students Marilia Braga and Claire Lemaitre, and in collaboration with Thomas Faraut from INRA Toulouse, we started a thorough analysis of the differences between the methods of the two domains. This work is done in collaboration with Bernard Dutrillaux and Florence Richard from the Museum National d'Histoire Naturelle in Paris, cytogeneticists who have been doing research on rearrangement scenarios for many years and possess what is probably the richest collection of expertly assessed cytogenetic data in the world. We intend to bring together the methods, data and results of both research domains, in order to propose better algorithms and a unifying theory of the modes of speciation.
Rearrangements and accelerated mutation rates are also observed in the simplest organisms, such as bacteria, when they are subjected to environmental changes. In the context of a collaboration with the group of Roger Frutos (CIRAD Montpellier), we have performed the complete annotation of two strains of Ehrlichia ruminantium, an obligate pathogen and the causative agent of heartwater, a major tick-borne disease of livestock in Africa and the Caribbean. The most specific feature of these genomes is their exceptionally large intergenic regions (actually the largest among bacteria) and the presence of long-period tandem repeats associated with the expansion/contraction of these intergenic regions. Following the publication of the complete genome, we performed a comparative genomic analysis of these strains as well as of additional Ehrlichia species. This revealed the presence of an active and specific mechanism of genomic plasticity, probably resulting from exposure to a diverse environment (different hosts), which could explain the limited field efficiency of vaccines against E. ruminantium. Most of these studies were conducted using the GenoStar platform and have therefore represented the first real-size test bed for this platform.
Gene duplication has different outcomes: pseudogenization (death of one of the two copies), gene amplification (both copies remain the same), sub-functionalization (both copies are required to perform the ancestral function) and neo-functionalization (one copy acquires a new function). Asymmetric evolution (one copy evolving faster than the other) is usually seen as a signature of neo-functionalization. However, it has been proposed that sub-functionalization could also generate asymmetric evolution among duplicate genes when they experience different local recombination rates. Gabriel Marais and Raquel Tavares, together with an L2 student, Yves Clément, tested this idea on about 100 pairs of young duplicates from the Drosophila melanogaster genome. They found that dispersed pairs tend to evolve more asymmetrically than tandem ones. Among dispersed copies, the low-recombination copy tends to be the fast-evolving one. They also tested the possibility that all this was explained by a confounding factor (expression level) but found no evidence for it. In conclusion, their results support the idea that asymmetric evolution among duplicates is enhanced by restricted recombination. However, further work is needed to clearly distinguish between sub-functionalization and neo-functionalization for the asymmetrically evolving duplicate pairs that they found.
The duplication of entire genomes has long been recognised as having great potential for evolutionary novelties, but the mechanisms underlying their resolution through gene loss are poorly understood. In collaboration with the groups of Jean Cohen (CGM, Gif), Eric Meyer (ENS, Paris) and the Genoscope (P. Wincker, Evry), Laurent Duret, Vincent Daubin and Jean-François Gout from HELIX are involved in the analysis of the genome of the unicellular eukaryote Paramecium tetraurelia. In this ciliate, most of the nearly 40,000 genes arose through at least three successive whole-genome duplications. Phylogenetic analysis indicates that the most recent duplication coincides with an explosion of speciation events that gave rise to the P. aurelia complex of 15 sibling species. We observed that gene loss occurs over a long timescale, not as an initial massive event. Genes from the same metabolic pathway or protein complex have common patterns of gene loss, and highly expressed genes are over-retained after all duplications. The conclusion of this analysis is that many genes are maintained after whole-genome duplication not because of functional innovation but because of constraints on the number of copies of a given gene present in a cell or nucleus.
Motif search and inference
A large part of the algorithmic work in the HELIX project concerns the search for regularities at many levels of living systems. Regularities may be seen as motifs in DNA sequences, RNA structures or protein structures, as well as motifs in metabolic networks.
Concerning motifs in DNA sequences, Pierre Peterlongo, a PhD student co-supervised by HELIX and the University of Marne-la-Vallee who defended in September 2006, designed two algorithms, called Nimbus (2005; LNCS, vol. 3772, pages 179-190) and Ed'Nimbus, for filtering sequences prior to finding repetitions occurring more than twice in a sequence, or in more than two sequences. Nimbus and Ed'Nimbus use gapped seeds indexed with a new data structure, the bi-factor array. Experimental results show that the filter can be very efficient. This work is carried out in collaboration with Nadia Pisanti from the University of Pisa, Italy, and with Alair Pereira do Lago from the University of São Paulo, Brazil.
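The filtering principle behind gapped seeds can be shown in a few lines. The sketch below is not the Nimbus implementation (the seed shape and hashing scheme are invented for the example, and the real tool uses a dedicated index): positions that share the same gapped-seed key are kept as candidate repeat locations, and everything else is filtered out before any expensive comparison.

```python
# Gapped-seed filtering sketch: a seed shape marks which positions must
# match ('1') and which are don't-cares ('0'); positions hashing to the
# same key are candidate occurrences of a repeat.

SEED = (1, 1, 0, 1)   # example shape: match, match, don't care, match

def seed_key(s, i, seed=SEED):
    """Key of the window starting at i: the letters at the '1' positions."""
    return tuple(s[i + j] for j, keep in enumerate(seed) if keep)

def candidate_pairs(s, seed=SEED):
    """Group window positions by seed key; keep keys seen more than once."""
    buckets = {}
    for i in range(len(s) - len(seed) + 1):
        buckets.setdefault(seed_key(s, i, seed), []).append(i)
    return {k: v for k, v in buckets.items() if len(v) > 1}
```

Because the don't-care positions absorb mismatches, approximate repeats survive the filter while most of the sequence is discarded, which is what makes such filters lossless yet efficient for the subsequent exact search.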
In the same vein, Frédéric Boyer and Eric Coissac, in collaboration with Guillaume Achaz at the Université Paris VI, have developed a new, space-efficient implementation of the Karp-Miller-Rosenberg algorithm to look for exact repeats in very large DNA sequences (such as complete human chromosomes). This implementation forms the basis of the Repseek program, which looks for approximate repeats, i.e. allowing for deletions or substitutions in the score. Within the statistical framework of extreme value distributions, the parameters of the score distribution have been determined empirically as a function of the G+C content of the DNA.
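The core Karp-Miller-Rosenberg idea is simple to sketch, although the sketch below is not the space-efficient Repseek implementation: equivalence classes of substrings of length 2^k are built by combining two classes of length 2^(k-1), so exact repeats of any power-of-two length are found by comparing small integer labels rather than the substrings themselves.

```python
# KMR-style doubling sketch: positions with the same class id start
# identical substrings of the given (power-of-two) length.

def kmr_classes(s, length):
    """Class id per start position for substrings of the given length."""
    ids = {c: i for i, c in enumerate(sorted(set(s)))}
    cls = [ids[c] for c in s]           # level 0: single characters
    k = 1
    while k < length:
        pairs, nxt = {}, []
        for i in range(len(s) - 2 * k + 1):
            key = (cls[i], cls[i + k])  # combine two length-k classes
            pairs.setdefault(key, len(pairs))
            nxt.append(pairs[key])
        cls = nxt
        k *= 2
    return cls

def exact_repeats(s, length):
    """Grouped start positions of length-`length` substrings occurring
    at least twice (length must be a power of two for this sketch)."""
    cls = kmr_classes(s, length)
    groups = {}
    for i, c in enumerate(cls):
        groups.setdefault(c, []).append(i)
    return [g for g in groups.values() if len(g) > 1]
```

In Repseek, such exact repeats serve as seeds that are then extended into approximate repeats scored with substitutions and deletions.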
Concerning RNA secondary structures, Julien Allali, a former PhD student of Marie-France Sagot and now Associate Professor at the LaBRI, University of Bordeaux, introduced a new data structure, called MiGaL for ``Multiple Graph Layers'', composed of various graphs linked together by relations of abstraction/refinement. The new structure is useful for representing information that can be described at different levels of abstraction, each level corresponding to a graph. We proposed an algorithm for comparing two MiGaLs. MiGaLs are a very natural model for comparing RNA secondary structures, which may be seen at different levels of detail, going from the sequence of nucleotides (single, or paired with another to participate in a helix) to the network of multiple loops that is believed to represent the most conserved part of RNAs having similar function.
As part of the PhD of Nuno Mendes, in co-supervision with Ana Teresa Freitas from the Instituto Superior Tecnico of Lisbon, Portugal, work has just started on the development of new algorithms and models for predicting small functional RNA motifs. In particular, several models for an RNA sequence will be studied, allowing a flexible and general approach to RNA interference problems (problems of interference of RNAs with the regulation of gene expression). This work will concern more specifically microRNAs (denoted miRNAs). Such RNAs are 20 to 24 nucleotides long, single-stranded, and predominantly derived from intergenic regions. Due to the difficulty of systematically detecting miRNAs by existing experimental techniques, researchers increasingly turn to computational methods to identify new miRNAs. Important computational tools exist but are limited in practice, since they are based on comparative sequence analyses that can fail when sequence conservation is too low (leading to poor alignments) or too high (leading to a lack of sequence covariation, i.e. variation of bases at a given position that is correlated with variation at another position). Some of them perform simultaneous multiple sequence alignment and folding, which is computationally expensive. A very recent approach proposes a probabilistic algorithm that uses covariance models for motif description to predict miRNA motifs in unaligned sequences. However, this approach is not guaranteed to find the best solution and cannot identify miRNA motifs that are present in only a subset of the input sequences. To push future developments in this field forward, new algorithms must be developed to identify miRNA motifs unrelated to previously known ones.
Finally, as part of the long-term visit of Paulo G. S. Fonseca (a PhD student of Katia Guimaraes at the Federal University of Pernambuco, Brazil, who came to France with a ``sandwich'' scholarship and whose second year with us is funded by HELIX), we are working on the problem of integrating various sources of information (sequence motifs, gene expression profiles and evolution) to infer genetic network modules.
Knowledge representation for genomics
Genome annotation can be viewed as an incremental, cooperative, data-driven, knowledge-based process that involves multiple methods to predict gene locations and structures. This process may have to be executed more than once, and may undergo several revisions as the biological knowledge (new data) or the methodological knowledge (new methods) evolves. In this context, although many annotation platforms already exist, there is still a strong need for computer systems that take care not only of the primary annotation, but also of the update and advancement of the associated knowledge. We propose to adopt a blackboard architecture for the design of such a system. In his PhD, defended in July, S. Descorps-Declère developed a prototype, called Genepi, which validates these conceptual and technical options. Specific adaptations of the classical blackboard architecture were required, such as describing the activation patterns of the knowledge sources with an extended set of Allen's temporal relations.
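The blackboard pattern itself can be sketched generically. The code below illustrates only the pattern, not the Genepi prototype (the facts, the knowledge sources and the simple `run` driver are invented for the example; Genepi's control, with Allen-style temporal activation patterns, is far richer): each knowledge source fires whenever its activation condition holds on the shared blackboard, until no source can contribute any more.

```python
# Generic blackboard sketch: knowledge sources are (condition, action)
# pairs over a shared store of facts; the driver loops to quiescence.

class Blackboard:
    def __init__(self):
        self.facts = {}

def run(board, sources):
    """Fire every applicable source until a full pass changes nothing."""
    progressed = True
    while progressed:
        progressed = False
        for condition, action in sources:
            if condition(board):
                action(board)
                progressed = True
    return board.facts

# Toy annotation chain: a dummy ORF predictor fires once a sequence is
# posted, then a dummy gene annotator fires once ORFs are available.
board = Blackboard()
board.facts["sequence"] = "ATGAAATAG"
sources = [
    (lambda b: "sequence" in b.facts and "orfs" not in b.facts,
     lambda b: b.facts.update(orfs=[(0, 9)])),
    (lambda b: "orfs" in b.facts and "genes" not in b.facts,
     lambda b: b.facts.update(genes=["gene1"])),
]
result = run(board, sources)
```

The data-driven control is the point: new facts (or revised methods) simply re-enable the relevant sources, which matches the incremental, revisable nature of annotation described above.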
Recent work around the AROM platform (see 5.1), in collaboration with Jérôme Gensel (LSR-IMAG) and Cécile Capponi (LIF Marseille), concerns 1. the evolution of the AROM knowledge representation meta-model (to integrate whole-part relationships on a basis close to the one proposed in the UML 2 specification), and 2. the integration into the AROM2 platform of an Algebraic Modelling Language (AML), which allows writing equations involving variables of classes and associations. These equations are part of an AROM model and can be used to infer variable values.
Finally, Helix has welcomed Dr. José Luis Aguirre, Professor at the Tecnológico de Monterrey (Mexico), for a one-year visit starting in August 2006. Dr. Aguirre's research interests are directed toward multi-agent systems and their application to information search and exchange. One of the expected outcomes of this visit is the study of a multi-agent system facilitating access to heterogeneous and distributed biological data according to user-specific profiles known by the system.