Section: Scientific Foundations
Fundamental questions in the life sciences can now be addressed at an unprecedented scale through the combination of high-throughput experimental techniques and advanced computational methods from computer science. The new field of computational biology, or bioinformatics, has grown around intense collaboration between biologists and computer scientists working towards understanding living organisms as systems. One of the key challenges in systems biology is understanding how the static information recorded in the genome is interpreted to become dynamic systems of cooperating and competing biomolecules.
Magnome addresses this challenge through the development of informatic techniques for multi-scale modeling and large-scale comparative genomics: data models for knowledge representation, stochastic hierarchical models for behavior of complex systems, algorithms for genome analysis, and data mining and classification. Our research program builds on our experience in comparative genomics, data-mining and classification, and formal methods for multi-scale stochastic modeling of complex systems.
The first overall goal for Magnome is to develop methods for understanding the structure and history of eukaryote genomes, in order to identify their differences and the link between these differences and the dynamic behavior of these organisms. A central tenet of evolutionary biology is that contemporary genomes evolved from a common ancestral genome, but the large-scale study of their evolutionary relationships is hindered by the fact that these ancestral organisms have long since disappeared. However, this common inheritance allows us to discover these relationships through comparison, identifying the traits that are shared and those that are novel inventions since the divergence of different lineages.
We develop novel techniques to address fundamental questions of mechanisms of gene dynamics, and the ways that genes and their products are organized at different scales. These results are then combined into integrated models through the organization of these objects into networks and pathways that can be used to predict the dynamic behavior of cells. Through combinatorial optimization we can construct plausible hypotheses about the structure of ancestral genome architectures, which may provide deep insight both into the past histories of particular genomes and the general mechanisms of their formation.
The methods designed by Magnome for comparative genome annotation, structured genome comparison, and construction of integrated models are applied on a large scale to yeasts from the hemiascomycete class, which provide a unique tool for studying eukaryotic genome evolution over a broad range of distances. With their relatively small and compact genomes, yeasts lend themselves well to the comparative analysis of several species. Yeasts are widely used as cell factories for the production of beer, wine and bread, and more recently of various metabolic products such as vitamins, ethanol, citric acid, and lipids. Yeasts can assimilate hydrocarbons, depolymerise tannin extracts, and produce hormones and vaccines in industrial quantities through heterologous gene expression. Several yeast species are pathogenic for humans. The hemiascomycetous yeasts represent a homogeneous phylogenetic group of eukaryotes with a relatively large physiological and ecological diversity.
The second overall goal for Magnome is to use theoretical results from formal methods to define a mathematical framework in which discrete and continuous models can communicate with clear semantics. We exploit this to develop the BioRica platform, a modeling middleware in which hierarchical models can be assembled from existing models. Such models are translated into their execution semantics and then simulated at multiple resolutions through multi-scale stochastic simulation.
A general goal of systems biology is to acquire a detailed quantitative understanding of the dynamics of living systems. Different formalisms and simulation techniques are currently used to construct numerical representations of biological systems, and a wealth of models has been proposed using specific and ad hoc methods. A recurring challenge is that hand-tuned, accurate models tend to be so focused in scope that it is difficult to repurpose them. Instead of modeling each process de novo, we claim that a sustainable effort in building efficient behavioral models must proceed incrementally. Hierarchical modeling (R. Alur et al. Generating embedded software from hierarchical hybrid models. In Proceedings of LCTES, pp. 171–82, 2003.) is one way of combining specific models into networks. Effective use of hierarchical models requires both a formal definition of the semantics of such composition and efficient simulation tools for exploring the large space of complex behaviors.
Hierarchical modeling that integrates both genome-scale models of metabolism and fine-grained models of particular processes of interest in a given application is recognized as a major challenge in systems biology by the European Union (see “Systems biology: a grand challenge for Europe,” ESF Grand Challenges, Sept. 2007). Furthermore, the NSF in the United States has recognized since 2004 that multi-scale modeling, integrating all scales from the molecular through the population level, is the way for modeling to impact the understanding of biological processes (see, for example, NSF 04-607).
The Magnome BioRica system is a high-level modeling framework integrating discrete and continuous multi-scale dynamics within the same semantic domain, while offering an easy-to-use and computationally efficient numerical simulator. It is based on a generic approach that captures a range of discrete and continuous formalisms and admits a precise operational semantics. On the practical level, BioRica models are compiled into a discrete event formalism capable of capturing discrete, continuous, stochastic, non-deterministic and timed behaviors in an integrated and unambiguous way.
Our long-term goal is to develop a methodology in which we can assemble a model for a species of interest using a library of reusable models and an organism-level “schematic” determined by comparative genomics.
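To give a concrete feel for what a stochastic discrete event execution looks like, the following is a minimal sketch of the classic Gillespie direct method applied to a birth-death process. It is not BioRica's compiler, syntax, or semantics; the function name `simulate_birth_death`, its parameters, and the choice of a birth-death process are assumptions made purely for illustration.

```python
import random


def simulate_birth_death(birth_rate, death_rate, x0, t_end, seed=0):
    """Gillespie direct method for a birth-death process (illustrative only).

    Events: birth at constant rate `birth_rate`, death at rate `death_rate * x`.
    Returns the list of (time, population) pairs recorded at each event.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    t, x = 0.0, x0
    trace = [(t, x)]
    while True:
        a_birth = birth_rate
        a_death = death_rate * x
        total = a_birth + a_death
        if total <= 0:
            break  # no event can fire any more
        # The waiting time to the next event is exponentially distributed.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Pick which event fires, with probability proportional to its rate.
        if rng.random() * total < a_birth:
            x += 1
        else:
            x -= 1
        trace.append((t, x))
    return trace


if __name__ == "__main__":
    for time, population in simulate_birth_death(2.0, 0.1, 5, 5.0, seed=1)[:5]:
        print(f"{time:.3f}\t{population}")
```

Each loop iteration is one discrete event: the state changes only at event times, and the stochastic timing comes from the exponential waiting time. A hybrid formalism would additionally evolve continuous variables between events, which this sketch deliberately omits.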
MAGNOME's short- and mid-term objectives can be described as follows:
Comparative genome annotation
We develop efficient methodologies and a software platform for associating biological information with complete genome sequences, in the particular case where several phylogenetically related eukaryote genomes are studied simultaneously.
Phylogenetic protein families establish relations of conservation and lineage-specific gain and loss that permit the detailed study of adaptation and functional specialization. Algorithmic techniques must be developed to improve precision across million-year phylogenetic ranges. Two challenges must be addressed in the classification methods: better definition of inclusion relations, and incorporation of gene fusion and fission events, which induce reticulate relations between family classifications.
Rather than compare flat sets of genes grouped into functional classes, structured comparison explores the topological structure of the graph of relations between genes. Biomolecular networks are one way to perform such comparisons. Using graph theoretic techniques we can assess the relative conservation of networks from one species to another, with the aim of identifying functional differences between the species.
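One simple way to quantify the relative conservation of a network between two species is sketched below: edges of species A are translated through an ortholog map and compared with the edges of species B using a Jaccard index. This is a generic illustration, not the method used by Magnome; the function names, the one-to-one ortholog map, and the toy gene identifiers are all assumptions.

```python
def map_edges(edges, orthologs):
    """Translate undirected species-A edges into species-B gene names.

    Edges with an endpoint lacking an ortholog are dropped.
    """
    mapped = set()
    for u, v in edges:
        if u in orthologs and v in orthologs:
            mapped.add(frozenset((orthologs[u], orthologs[v])))
    return mapped


def edge_conservation(edges_a, edges_b, orthologs):
    """Jaccard index between the mapped edges of A and the edges of B."""
    mapped_a = map_edges(edges_a, orthologs)
    set_b = {frozenset(edge) for edge in edges_b}
    union = mapped_a | set_b
    if not union:
        return 1.0  # two empty networks are trivially identical
    return len(mapped_a & set_b) / len(union)


if __name__ == "__main__":
    # Hypothetical toy networks and ortholog assignments.
    edges_a = [("a1", "a2"), ("a2", "a3"), ("a1", "a3")]
    edges_b = [("b1", "b2"), ("b2", "b3"), ("b3", "b4")]
    orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}
    print(edge_conservation(edges_a, edges_b, orthologs))
```

Edges that appear in only one of the two species lower the score, and they are natural starting points when looking for functional differences. Real comparisons would also have to handle many-to-many orthology and incomplete network data, which this sketch ignores.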
The computational and storage needs of large-scale global comparison of genomes require a dedicated integrated platform for knowledge representation, high-performance computing and software development. A complete analysis chain for new genomes must start from a genome sequence and produce a preliminary annotation, including prediction of genes, putative assignment to protein families, and application of coherency rules. These tools must take into account specificities of fungal genomes such as clade-specific gene architectures, lineage-specific protein families and pathways, and known phylogenetic relationships.
We validate these methodological advances through application to sets of species of biotechnological interest, in collaboration with our biological partners. Magnome manages a comprehensive comparison of eighteen yeast genomes, annotated by the Génolevures consortium. This annotation effort by 40 scientists in France and Belgium has resulted in a complete catalogue of protein-coding genes and other genetic elements, and work by the Magnome team has classified these elements into phylogenetic, structural, and functional categories. These analyses must be extended to systematically cover the range of relations defined above, and will constitute a fundamental resource for the development of dynamic models.
Genome dynamics and evolutionary mechanisms
We develop algorithms for detecting historical relations between genomes and exploring the concrete events and general mechanisms of molecular evolution, in particular mechanisms of rearrangement and duplication that reshape genomes.
Genome rearrangements on two scales contribute to this systematic comparison. Using a complete analysis of gene fusion events across the yeasts and fungi, we identify small-scale events that lead to the birth of new genes and the acquisition of new or improved functions. On a larger scale, rearrangements of large segments are investigated through a combination of conserved segment identification (in silico chromosomal painting using chromosomal homology established using conserved protein families) and combinatorial techniques we have developed for median genome and rearrangement scenario computation.
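The idea of conserved segments can be illustrated with a small sketch that compares two signed gene orders on a single linear chromosome: an adjacency of genome A is conserved if it appears in genome B in either reading direction, and each non-conserved adjacency is a breakpoint. This is the standard breakpoint notion, not the team's median genome or rearrangement scenario algorithms; the function names and the toy gene orders are assumptions, and chromosome ends are not counted.

```python
def adjacencies(order):
    """Set of adjacencies of a signed gene order, in both reading directions."""
    adj = set()
    for left, right in zip(order, order[1:]):
        adj.add((left, right))
        adj.add((-right, -left))  # same adjacency read on the opposite strand
    return adj


def count_breakpoints(order_a, order_b):
    """Number of adjacencies of A that are not conserved in B."""
    adj_b = adjacencies(order_b)
    return sum(1 for pair in zip(order_a, order_a[1:]) if pair not in adj_b)


def conserved_segments(order_a, order_b):
    """Split A into maximal runs of genes whose adjacencies are conserved in B."""
    if not order_a:
        return []
    adj_b = adjacencies(order_b)
    segments = [[order_a[0]]]
    for left, right in zip(order_a, order_a[1:]):
        if (left, right) in adj_b:
            segments[-1].append(right)  # adjacency conserved: extend the run
        else:
            segments.append([right])    # breakpoint: start a new segment
    return segments


if __name__ == "__main__":
    # Hypothetical gene orders: B is A with the segment 2..3 inverted.
    genome_a = [1, 2, 3, 4, 5]
    genome_b = [1, -3, -2, 4, 5]
    print(count_breakpoints(genome_a, genome_b))
    print(conserved_segments(genome_a, genome_b))
```

In this toy case the single inversion produces two breakpoints and three conserved segments. In practice genes would first be replaced by protein family identifiers, which is where the conserved families mentioned above come in.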
The expected results are a comprehensive view of yeast genome organization and evolution, described at multiple scales.
We develop practical and semantically rigorous formalisms for constructing hybrid hierarchical models of dynamic, stochastic biological processes, with a particular focus on model reuse, and build software tools for simulation and analysis of these models. BioRica is a formalism for hybrid hierarchical modeling developed by Magnome and instantiated in a software platform.
Formal analysis of biological models usually faces two major challenges. On the one hand, these models exhibit complex behaviors, since they may contain both hybrid and stochastic modeling features, which leads to theoretical limitations (undecidability in general). On the other hand, precise models tend to be very large, with thousands of discrete or continuous variables and multiple time scales. This leads in practice to the well-known combinatorial explosion problem. We improve the state of the art by adapting strategies that have led to significant successes in modeling human-engineered systems, in particular by extending the reach of abstraction-based formal analysis techniques to these models. Both trace-based abstraction and qualitative abstraction of hybrid stochastic systems will be developed.
Validation of this approach is based on applications in dynamic modeling of fermenting and oleaginous yeasts. We will develop a modeling methodology that will, first, advance the state of the art of modular modeling in systems biology and, second, enable mixing phenomena described with different precision within the same framework of stochastic hybrid hierarchical models.