## Section: New Results

### Functional genomics

#### Computational proteomics and transcriptomics

This year has seen the thorough rewriting of the PepLine software,
in collaboration with users at the Laboratoire de Chimie des Protéines (LCP, CEA Grenoble),
in order to improve both the accuracy and the overall performance of the pipeline. The PST generation
algorithm has been modified in order to incorporate a *de novo* step and a new rank-based scoring scheme.
This resulted in a large improvement of the accuracy (70 % of correctly predicted PSTs). The performance
of the chromosome mapping step has also been improved; the complete analysis of the chromosomes of *A. thaliana*
can now be performed in a few minutes. These new algorithms have been integrated
into the GenoProteo module of GenoStar by Jérémie Turbet and Marianne Tardif at the LCP with the help of the
Genostar development team. This provides the end user with an intuitive, easy-to-use interface to PepLine
embedded in the GenoStar environment. A paper describing the approach as well as some applications to the
chloroplastic membrane of *A. thaliana* is in preparation.

Transposons, also called transposable elements, are sequences of DNA that can move around to
different positions within the genome of a single cell, a process called transposition.
Transposons that are still active are transcribed, and we know
that such transcription is regulated, but
genome-scale studies of their expression profile have rarely been attempted.
As part of the Master of Florence Cavalli (currently doing a PhD at the EBI and
the University of Cambridge), we started addressing this question, using the
genome of *Drosophila melanogaster* as our model. This work was done in collaboration
with Cristina Vieira, from the "Genome and Populations" team at the LBBE. For this purpose,
we used data from ESTs (Expressed Sequence Tags: short sub-strings of a transcribed
protein-coding or non-protein-coding nucleotide sequence, originally intended as a way to
identify gene transcripts) available in various public databases. The difficulty of the problem
in the case of transposons lies in correctly assigning
each EST to its transposon in the genome sequence.
Indeed, while the sequencing of ESTs is not error-free, the transposons in the same family are almost
exact copies of one another, much more so than duplicated genes, and the families are in general bigger than gene families.
This kind of problem is similar to the one faced when assembling a genome from its sequenced fragments.
The initial results obtained by Florence Cavalli seem to indicate a different profile of expression
of transposons in the X chromosome, and a correlation between the number of transposons
in a family to which an EST maps (and that one may therefore assume is expressed)
and the number of copies in that family. The second result in particular is in
contradiction with previous ones. Both will
continue being investigated during the PhD of Marc Deloger that started in
October of this year in co-supervision with Cristina Vieira.

The main goal of pharmacogenomics is to predict the effect drugs
may have based on the genomic information of the patient. Using microarray technologies,
this may help improve both diagnosis and the posology policy to be adopted. A few
research structures are now ready to simultaneously screen the
transcriptome and the genome of patients in order to reveal possible correlations
with drug effects, and to consider adapting the medical protocols
with this new information. However, it is not easy to extract
knowledge from the amount of data this entails. To try to address this
issue, we are
developing new methods that use Bayesian networks. Bayesian networks
make it possible to represent causal relations (through a directed acyclic graph, or DAG) and then to estimate
state probabilities. This method has already
been applied to microarray data (2000; *J. Comp. Biol.* 7:601-620). Our contribution will be to
append clinical information to the network and to specify a model that takes
pharmacogenomics constraints into account. This should enable more accurate
predictions. The pharmacogenomics structure of Lyon will keep all the
data treated in several projects (each corresponding to a different cancer
disease), with several samples for a given tumor at
different stages of the disease. This gives us the possibility to model
the time dimension using dynamic Bayesian networks (2003; *Brief Bioinform.* 4:228-235). It will also be
interesting to compare the information on the different diseases
(by comparing the corresponding Bayesian
networks), which
may reflect some cancer mechanisms. This represents an original
approach, which could then be applied to other biological systems. This work will
be done by Emmanuel Prestat and Christian Gautier.
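
To make the idea of estimating state probabilities over a DAG concrete, the following is a minimal sketch and not the model under development. The three-node chain (variant, expression, response), the variable names, and all probability values are hypothetical and chosen only for illustration; inference is done by plain enumeration of the joint distribution.

```python
from itertools import product

# Hypothetical DAG: variant -> expression -> response.
# All probabilities below are illustrative assumptions, not measured values.
p_variant = {True: 0.3, False: 0.7}      # P(variant)
p_high_expr = {True: 0.8, False: 0.2}    # P(expression is high | variant)
p_response = {True: 0.9, False: 0.4}     # P(drug response | expression is high)


def joint(variant, high_expr, response):
    """Joint probability, factorised along the edges of the DAG."""
    pv = p_variant[variant]
    pe = p_high_expr[variant] if high_expr else 1 - p_high_expr[variant]
    pr = p_response[high_expr] if response else 1 - p_response[high_expr]
    return pv * pe * pr


def posterior_variant_given_response(response=True):
    """P(variant = True | response), obtained by summing out expression."""
    numerator = sum(joint(True, e, response) for e in (True, False))
    evidence = sum(
        joint(v, e, response) for v, e in product((True, False), repeat=2)
    )
    return numerator / evidence


if __name__ == "__main__":
    print(posterior_variant_given_response(True))
```

Enumeration is only practical for very small networks; with many genes and clinical variables, approximate or structured inference would be needed.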

Molecular data on biodiversity, in both health and ecology, represent new challenges for data analysis. In particular, the large number of probes present on the DNA chips used in this context invalidates all discriminant analyses. HELIX has concentrated its effort on addressing this issue. Two main results have been obtained. The first, by Jean Thioulouse, is that the association between DNA chips devoted to biodiversity analysis and environmental data cannot be established by a classical maximization of correlation (canonical correlation analysis); however, co-inertia analysis (CIA), which maximizes covariance, has proven its efficiency in several studies of soil microbial biodiversity [28] [29] [45] [54] [55]. The second result was obtained by Caroline Truntzer, a PhD student of the "Biostatistics-Health" team of the LBBE co-supervised by Christian Gautier. Based on both simulated and "real" results, C. Truntzer compared different multivariate analyses that had been performed to discriminate between clinical states using human pangenomic chips. Both works used the R software, and more particularly the ADE4 package developed in HELIX.

#### Modelling and analysis of metabolism: molecular components, regulation, and pathways

Topological motifs have been extensively studied in the context of genetic and protein interaction networks, but they do not seem well suited to capturing the functional information of metabolic networks. Therefore, as part of the PhD of Vincent Lacroix, we have defined a new type of motif, called a coloured motif, for which the topology of the subgraph is not given and only the labels of the nodes are known. We have worked on the problem of searching for all the occurrences of such motifs in a graph. We now have an exact algorithm for solving this problem as well as a proof that the problem is NP-complete [42].
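
The search problem can be illustrated with a brute-force sketch. This is not the exact algorithm mentioned above: it simply tests every node subset of the right size, and it assumes that an occurrence is a connected set of nodes whose labels match the motif as a multiset. The graph, the node names, and the labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def is_connected(nodes, adj):
    """Check that `nodes` induces a connected subgraph of `adj`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in nodes and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes


def coloured_motif_occurrences(adj, colour, motif):
    """Return all connected node sets whose colours equal the motif multiset.

    Exhaustive enumeration: exponential in the motif size, illustration only.
    """
    target = Counter(motif)
    hits = []
    for subset in combinations(sorted(adj), len(motif)):
        same_colours = Counter(colour[v] for v in subset) == target
        if same_colours and is_connected(subset, adj):
            hits.append(subset)
    return hits


# Hypothetical reaction graph: a path r1 - r2 - r3 - r4, with made-up labels.
adj = {"r1": {"r2"}, "r2": {"r1", "r3"}, "r3": {"r2", "r4"}, "r4": {"r3"}}
colour = {"r1": "1.1", "r2": "2.7", "r3": "1.1", "r4": "4.2"}

if __name__ == "__main__":
    print(coloured_motif_occurrences(adj, colour, ["1.1", "2.7"]))
```

Because only labels and connectivity are checked, the two occurrences found on this path may have different internal topologies in a richer graph, which is the point of leaving the topology unspecified.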

To define ways of assessing the over- and under-representation of such motifs in metabolic networks, we then collaborated with Sophie Schbath (INRA), Stephane Robin (InaPG) and their group. From this initial goal, we worked in two directions. The first concerns the design of realistic random graph models (which model well the distribution of the degrees of the nodes, as well as the modularity of metabolic networks). The main achievement of this part is the extension of the Erdős-Rényi random graph model to a mixture model (ERMG), for which general properties such as the clustering coefficient have been studied [66] [69]. The second direction is to search for an analytic formula (to avoid simulations) for the expectation and the variance of the number of occurrences of a motif in an Erdős-Rényi random graph. This work has been successfully applied to topological motifs and we are now working on coloured motifs (work in preparation).
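
As an illustration of what an analytic formula buys over simulation, the sketch below handles the simplest topological motif, the triangle, and only its expectation. In an Erdős-Rényi graph with n nodes and edge probability p, linearity of expectation gives C(n, 3) · p³ expected triangles. The second function is an assumption-free check for tiny graphs: it enumerates every possible graph and weights it by its probability. The function names are hypothetical, and the variance and the coloured case are not covered.

```python
from itertools import combinations, product
from math import comb


def expected_triangles(n, p):
    """Closed form: each of the C(n, 3) node triples is a triangle w.p. p^3."""
    return comb(n, 3) * p**3


def exact_expected_triangles(n, p):
    """Same expectation by exhaustive enumeration of all graphs on n nodes.

    Cost grows as 2^(n choose 2), so this is usable for tiny n only.
    """
    pairs = list(combinations(range(n), 2))
    triples = list(combinations(range(n), 3))
    total = 0.0
    for present in product((0, 1), repeat=len(pairs)):
        edges = {pair for pair, bit in zip(pairs, present) if bit}
        prob = p ** len(edges) * (1 - p) ** (len(pairs) - len(edges))
        count = sum(
            1 for a, b, c in triples if {(a, b), (a, c), (b, c)} <= edges
        )
        total += prob * count
    return total


if __name__ == "__main__":
    print(expected_triangles(4, 0.3), exact_expected_triangles(4, 0.3))
```

The closed form is evaluated in constant time, whereas the enumeration already visits 64 graphs for n = 4.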

As part of the postdoc of Patricia Thebault, the relation between metabolic motifs in
general, and coloured motifs in particular on one hand, and
gene expression on the other is also being investigated using various statistical approaches.
The organism chosen for this study is *Saccharomyces cerevisiae*, with gene expression
data taken from the *Saccharomyces* Genome Database (SGD) and gene regulation data from
Yeastract (2006; *Nucleic Acids Res.* 34:446-451, http://db.yeastgenome.org/cgi-bin/expression/expressionConnection.pl ), which is maintained by our Portuguese collaborators, Ana Teresa Freitas and
Arlindo Oliveira from the Instituto Superior Tecnico in Lisbon. This work is one aspect of a more general study of
the links between metabolic and genetic networks. The main aim is to provide a framework for modelling the relations between genotype and phenotype. This work should also lead to new models of
evolutionary and functional modularity in biological networks.

Metabolic networks can be decomposed into pathways. The notion of pathway is usually unclearly defined. Yet, there exists a formal definition of pathway as an elementary mode (denoted by EM). This is a set of enzymes that operate together at steady state. The computation of the elementary modes of a network has been extensively studied in the past years due to the number of applications related to this notion. Yet, all methods rely on linear programming to solve the problem whereas this problem seems to be combinatorial in nature. The goal we wish to achieve is to find a combinatorial algorithm for the calculation of elementary modes. While working in this direction, we believe that a reformulation of related concepts like minimal cut sets in terms of hypergraph problems would be of great help to improve the algorithms that are used for their calculation. Finally, a major issue in the computation of elementary modes and related concepts is the very large size of the output. Enumerating all EMs might not be of great help, but finding a way of grouping them would be very useful. We believe that using a combinatorial framework should facilitate this task. This is work done by Vincent Lacroix and Marie-France Sagot in collaboration with Alberto Marchetti-Spaccamela (University of Rome) and Leen Stougie (Eindhoven University of Technology). A first paper is in preparation.
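
The steady-state condition behind elementary modes can be sketched as follows, using the standard formulation in which a flux vector v is at steady state when S·v = 0 for the stoichiometric matrix S. The toy network (one metabolite A that is imported, then either converted to B and exported, or exported directly) and all names are hypothetical. The sketch only checks the condition on given vectors; it does not compute elementary modes.

```python
def is_steady_state(S, v, tol=1e-9):
    """True if every internal metabolite is balanced, i.e. S . v = 0."""
    return all(
        abs(sum(s_ij * v_j for s_ij, v_j in zip(row, v))) <= tol for row in S
    )


def support(v, tol=1e-9):
    """Indices of the reactions carrying a non-zero flux."""
    return {j for j, v_j in enumerate(v) if abs(v_j) > tol}


# Hypothetical network. Columns: R1 (-> A), R2 (A -> B), R3 (B ->), R4 (A ->).
# Rows: metabolites A and B.
S = [
    [1, -1, 0, -1],
    [0, 1, -1, 0],
]

mode_via_b = [1, 1, 1, 0]    # import A, convert to B, export B
mode_direct = [1, 0, 0, 1]   # import A, export A directly
combined = [a + b for a, b in zip(mode_via_b, mode_direct)]

if __name__ == "__main__":
    for v in (mode_via_b, mode_direct, combined):
        print(v, is_steady_state(S, v), sorted(support(v)))
```

The combined vector is also at steady state, but its support strictly contains the supports of the two other vectors, so it is not minimal. Elementary modes are the steady-state vectors with minimal support, which is what gives the enumeration problem its combinatorial flavour.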

The tools available to draw metabolism, and to manipulate such drawings, are usually restricted to individual metabolic pathways. This limitation becomes problematic when studying processes that span several pathways. In collaboration with Fabien Jourdan (INRA Toulouse), Romain Bourqui and David Auber (LABRI, University of Bordeaux), Vincent Lacroix and Ludovic Cottret are participating in the development of a method that makes it possible to draw the entire metabolic network while also taking into account its organization into pathways [64] [12].

Anne Morgat, from the Swiss-Prot group at the Swiss Institute for Bioinformatics, has continued her work on the UniPathway project in the framework of the BioSapiens NOE and the UniProt grants. The project aims at providing a standardized representation of metabolic data in the UniProtKB/Swiss-Prot database. These metabolic data are explicitly represented and stored in a relational database (UniPathwayDB). They are hierarchically decomposed into super-pathways, pathways, linear sub-pathways and reactions (steps). The development of UniPathwayDB (using PostgreSQL) was carried out in collaboration with Eric Coissac at the Université Joseph Fourier. The database is populated with manually curated metabolic data (from the Swiss-Prot group) and public data (UniProtKB/Swiss-Prot, complete proteomes (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL), GenomeReview complete genomes, Enzyme). By the end of 2006, more than 260 pathways had been manually curated, representing about 450 distinct biochemical reactions. This covers more than 30 000 Swiss-Prot entries (about 70% of the total number of entries related to metabolism). The database will be made available in early 2007 through a web site hosted at INRIA Rhône-Alpes. The server, as well as one full-time engineer (Sophie Huet), who was hired in October 2006, have been provided by the PRABI (Génopole Rhône-Alpes) for this purpose.

#### Modelling and simulation of genetic regulatory networks

The group of Hidde de Jong has continued its efforts on the application of the qualitative simulation tool Genetic Network Analyzer (GNA, section 5.10) to the modeling of actual genetic regulatory networks. In particular, we study the nutritional stress response in the bacterium *Escherichia coli* in collaboration with experimental biologists in the laboratory of Johannes Geiselmann (Université Joseph Fourier, Grenoble, on leave in HELIX since October 2006). The original model developed by Delphine Ropers, published in a special issue of *BioSystems* [53], has been extended with additional genes and proteins in order to account for observed discrepancies between the model predictions and published data. Moreover, Delphine Ropers has compared, by means of a Monte-Carlo simulation study, the detailed nonlinear differential equation model of the stress response network with the reduced piecewise-linear differential equation model used in GNA. The project EC-MOAN, funded in the framework of the FP6 NEST programme of the European Commission (2006-2009), and the project MetaGenoReg, funded by the ANR in the framework of the BioSys programme (2006-2009), will allow us to maintain and extend these modeling activities.

The *E. coli* stress response model has given rise to predictions that cannot be tested by currently available experimental data. This has motivated an experimental programme carried out in the laboratory of Johannes Geiselmann, using fluorescent and luminescent gene reporter systems to obtain precise measurements with a high sampling density. Several members of HELIX have contributed to the design of the experiments, while Bruno Besson has (re)developed the program WellReader for the analysis of the gene reporter measurements (section 5.36). The systematic comparison of the experimental results and the model predictions is currently under way. Other experiments are being carried out in collaboration with Irina Mihalcescu of the Laboratoire de Spectrométrie Physique (Université Joseph Fourier, Grenoble).

In addition to HELIX, various other groups are using GNA in their modeling projects. In a number of cases, we have been actively involved in the formulation of the biological problem and the actual application of the tool. The current version 5.6 of GNA has been deposited at the APP and is distributed by the company Genostar. It has also been integrated in the Iogma platform for exploratory genomics developed by the company Genostar. The European project Cobios (FP6 NEST), which is due to start early 2007, will provide additional support to achieve this integration, notably by providing modules for the formulation of simulation models and facilitating the exchange of models with other modeling and simulation tools.

As the size and complexity of the genetic regulatory networks under study increase, it becomes more difficult to use GNA. For large and complex models, the state transition graph generated by the program, summarising the qualitative dynamics of the system, may consist of thousands of states and is therefore difficult to analyse by visual inspection alone. In order to cope with this problem, we have followed two approaches.

First of all, instead of generating the entire state transition graph, it is often sufficient to compute the steady states of the system and to analyze the neighbouring states in order to determine the stability of the steady states. Based on the mathematical characterisation of equilibria of piecewise-linear differential equation models and their stability, carried out in collaboration with Jean-Luc Gouzé (INRIA Sophia-Antipolis) and Tewfik Sari (Université Haute Alsace, Mulhouse) [14], Michel Page and Hidde de Jong have developed an attractor search module for GNA. This module transforms the search for steady states into a SAT problem and exploits existing, efficient SAT solvers to find all steady states of networks of more than a thousand genes. The work on the attractor search module has been submitted for publication.
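
The idea of treating steady states as satisfying assignments can be illustrated on a deliberately simplified example. The sketch below is not the GNA attractor search module: it assumes a Boolean abstraction in which each gene is on or off (the piecewise-linear models used in GNA are richer), and it replaces the SAT solver with exhaustive enumeration, which is only feasible for a handful of genes. The two-gene mutual-inhibition network and all names are hypothetical.

```python
from itertools import product


def steady_states(update_rules):
    """Return every assignment in which each gene equals its own update rule.

    A steady state satisfies the conjunction over all genes g of
    (state[g] == rule_g(state)); a SAT solver would search for models of
    that formula, whereas here all 2^n assignments are simply tested.
    """
    genes = sorted(update_rules)
    found = []
    for values in product((False, True), repeat=len(genes)):
        state = dict(zip(genes, values))
        if all(update_rules[g](state) == state[g] for g in genes):
            found.append(state)
    return found


# Hypothetical toggle switch: each gene is expressed iff the other is not.
toggle = {
    "a": lambda s: not s["b"],
    "b": lambda s: not s["a"],
}

if __name__ == "__main__":
    for state in steady_states(toggle):
        print(state)
```

For this network the two steady states are the ones in which exactly one gene is on. Enumeration doubles in cost with every added gene, which is why delegating the search to an efficient SAT solver matters for networks of realistic size.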

A second solution for the upscaling problems consists in the use of model-checking techniques for the automated verification of properties of state transition graphs. In the framework of his PhD thesis, Grégory Batt has pursued this approach in collaboration with Radu Mateescu and his colleagues of the VASY project. This has resulted in another version of GNA, currently only available as a prototype for internal use, which connects the simulation tool to state-of-the-art model checkers. In order to achieve this, a refined simulation method has been developed that exploits the concept of discrete abstraction developed in the hybrid systems community. The work initiated by Grégory Batt is now being carried on in several directions. Estelle Dumas recently joined the HELIX and VASY projects, on an INRIA associate engineer contract, to develop a user-friendly web interface between GNA and the model checker CADP. In the framework of his PhD thesis, Pedro Monteiro has started to study appropriate temporal logics and high-level specification languages for helping the user to formulate biological properties the model has to satisfy. Adrien Richard, in collaboration with Gregor Goessler (POP-ART), is currently investigating the use of modular approaches to verify larger networks.
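
The kind of question a model checker answers on a state transition graph can be sketched with a plain reachability test. This is only an illustration under stated assumptions: it does not use CADP or any temporal logic, the graph and state names are hypothetical, and it covers a single property type (some path from the start state eventually reaches a goal state).

```python
from collections import deque


def can_reach(transitions, start, goal_states):
    """Breadth-first search: is some goal state reachable from `start`?"""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in goal_states:
            return True
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical qualitative state transition graph.
graph = {
    "s0": ["s1", "s2"],
    "s1": ["s3"],
    "s2": ["s2"],   # self-loop: the system can stay in s2 indefinitely
    "s3": ["s3"],
}

if __name__ == "__main__":
    print(can_reach(graph, "s0", {"s3"}))
    print(can_reach(graph, "s2", {"s3"}))
```

Dedicated model checkers handle far richer properties, such as ordering of events or inevitability, and do so automatically on graphs with thousands of states, which is what makes them attractive when visual inspection is no longer possible.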

The above-mentioned work has focused on the analysis of models obtained through literature study and human expertise. The PhD of Samuel Drulhe, supervised by Hidde de Jong and Giancarlo Ferrari-Trecate (University of Pavia) within the framework of the European project HYGEIA, takes a different direction. It concerns the development of methods for the identification of piecewise-linear differential equation models of genetic regulatory networks from gene expression data, adapting existing methods for the identification of hybrid systems. A paper summarizing the first results on simulated data has been presented at the major annual hybrid systems conference [27], while a longer version describing the method has been submitted for journal publication. The method will shortly be applied to gene reporter data on the *E. coli* nutritional stress response.