Section: New Results
Modeling motifs and structures on sequences
Several lines of research use pattern matching, formal languages and combinatorial analysis techniques to identify structural models on sequences. Biologists may want either to design and test hypothetical models or to infer such models from a set of sequences sharing a functional or structural property. RNA and protein folding studies use general models that require heavy computations. The goal is always to get an explicit view of the organization of the sequences, and possibly to find new candidates with a similar organization in new sequences or to validate hypothetical mechanisms.
Finding modules in sequences
J. Nicolas has coordinated the national ANR project Modulome, which aims at modeling the structure of genomes as an assembly of «modules» that may be copied and moved inside or between genomes. This is supported by three applications on genomic mobile elements in cooperation with URGI/Inra Versailles, LME/Ifremer Brest and LEPG/CNRS Tours. This past year, CRISPI, the most complete database to date on CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats), has been built. It gathers the analysis of all complete microbial genomes available to date (more than 1100 archaeal and bacterial genomes) and is available through the web. CRISPR are formed by a repetitive skeleton that includes genetic material imported from viruses and plasmids. The theoretical part of this research has been submitted for publication, together with a method and tool, ModuleOrganizer, for the analysis of modules in transposon families.
We propose the specification of a new modelling language, called Logol, intended to express structure-based models for biological sequences and based on a particular form of Definite Clause Grammars. For this purpose, we thoroughly revisited String Variable Grammars (SVG), in line with Searls' work. This year, we have refined the Logol implementation and started to apply it to the search for MITE elements, a transposon family largely present in the human genome, in collaboration with GICC Tours (Y. Bigot). We have also studied the ascovirus DpAV4a (family Ascoviridae) with this laboratory.
RNA and Protein Folding
Computational problems related to spatial structures are inherently much more complex than those considering only the sequence level. A theoretical basis that could support a rigorous analysis and understanding of structure prediction models is almost non-existent, as the problems are a blend of continuous and discrete mathematics. In our group we focus notably on creating efficient algorithms for solving the combinatorial optimization problems that arise in secondary structure prediction, sequence/structure alignment (Protein Threading Problem, PTP), and structure/structure comparison (CMO problem). The first problem is polynomial, and the other two have been proved to be NP-complete.
RNA Secondary structure
The importance of the world of non-coding RNA has fostered interest in efficient prediction programs for RNA secondary structures. The associated algorithms have a time complexity in O(n^3), which becomes prohibitive for large sequences or large data sets. We have parallelized these algorithms on GPU, which provides better performance/price and performance/energy ratios than CPU. The main computation is a dynamic programming algorithm. In addition, by exploiting parallelism at a coarse-grain level among several sequences, we were able to provide the GPU with enough independent tasks. Our implementation faced two major GPU-specific issues: computation divergence and complex memory access patterns, which can lead to inefficient use of computational resources and of memory bandwidth, respectively. We tackled those issues by off-loading part of the divergence to the CPU and through careful use of the GPU memory spaces: shared, constant and texture memory. This led to a 17x speedup over the reference sequential CPU code. Ongoing work on a tiled approach that allows greater reuse of data shows significant improvement for both the GPU code and the sequential CPU code.
We propose a new local alignment method for the protein threading problem (aligning part of a protein structure onto a protein sequence). Local sequence-sequence alignments are widely used to find functionally important regions in families of proteins. However, as far as we know, no local sequence-structure alignment algorithm has been implemented. We developed five Mixed Integer Programming (MIP) models that can perform local alignments between sequences and structures, and compared their performance.
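To make the O(n^3) cost concrete, here is a minimal sketch of a dynamic programming recurrence for secondary structure prediction. It is a simplified stand-in, not the algorithm parallelized on GPU: it uses Nussinov-style base-pair maximization instead of an energy model, and the function name `nussinov`, the `min_loop` parameter and the set of allowed pairs are assumptions made for illustration.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style recurrence).

    Illustrative assumption: Watson-Crick and G-U wobble pairs are allowed,
    and a hairpin loop must contain at least min_loop unpaired bases.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best score for the subsequence seq[i..j]; cells with i > j stay 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):      # O(n) subsequence lengths
        for i in range(n - span):            # O(n) start positions
            j = i + span
            best = dp[i][j - 1]              # case 1: position j is unpaired
            for k in range(i, j - min_loop): # O(n) pairing partners k for j
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


# Each sequence is an independent task, which is the coarse-grain parallelism
# mentioned above (shown here as a plain sequential loop).
scores = [nussinov(s) for s in ["GGGAAAUCC", "GAAAC", "AAAA"]]
```

The three nested loops give the cubic cost; all cells on the same anti-diagonal (same `span`) are independent of each other, which is the fine-grain parallelism a GPU implementation would typically exploit.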
Protein structure comparison
Many protein structure comparison methods can be modeled as maximum clique problems in specific k-partite graphs, referred to here as alignment graphs. We proposed a new protein structure comparison method based on internal distances (DAST), which was posed as a maximum clique problem in an alignment graph. We also designed a dedicated algorithm (ACF) for solving such maximum clique problems. ACF was first applied in the context of VAST, a software package widely used at the National Center for Biotechnology Information, and then in the context of DAST. The results obtained on real protein alignment instances show that our algorithm is more than 37000 times faster than the original VAST clique solver, which is based on the Bron-Kerbosch algorithm. We furthermore compared ACF with one of the fastest clique finders, recently conceived by Östergård. On a popular benchmark (the Skolnick set), we observe that ACF is about 20 times faster on average than Östergård's algorithm.
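The following sketch illustrates the general formulation only; it is not ACF. It builds a toy alignment graph from internal distances and solves it with a Bron-Kerbosch search with pivoting and a simple size bound. The compatibility rule (order preservation and a distance tolerance `tau`), the function names and the toy coordinates are all assumptions made for illustration.

```python
from itertools import combinations


def alignment_graph(coords_a, coords_b, tau):
    """Toy alignment graph: vertex (i, k) matches residue i of A with residue k of B.

    Assumed rule: two matches are compatible if they preserve residue order and
    the internal distances d_A(i, j) and d_B(k, l) differ by at most tau.
    """
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    verts = [(i, k) for i in range(len(coords_a)) for k in range(len(coords_b))]
    adj = {v: set() for v in verts}
    for (i, k), (j, l) in combinations(verts, 2):
        same_order = (i < j and k < l) or (i > j and k > l)
        if same_order and abs(dist(coords_a[i], coords_a[j])
                              - dist(coords_b[k], coords_b[l])) <= tau:
            adj[(i, k)].add((j, l))
            adj[(j, l)].add((i, k))
    return adj


def max_clique(adj):
    """Maximum clique via Bron-Kerbosch with pivoting; adj maps vertex -> neighbour set."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:              # r is a maximal clique
            if len(r) > len(best):
                best = set(r)
            return
        if len(r) + len(p) <= len(best): # cannot beat the current best
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):   # only non-neighbours of the pivot
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


# Two identical three-residue chains, one translated in space (hypothetical data).
chain_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
chain_b = [(5, 5, 5), (6, 5, 5), (7, 5, 5)]
alignment = max_clique(alignment_graph(chain_a, chain_b, tau=0.1))
```

On this toy input the maximum clique matches residue i of the first chain with residue i of the second, since that is the only order-preserving assignment that keeps all internal distances equal.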
Learning automata and grammars on biological sequences
We use the inference of automata from samples of (unaligned) sequences as a general learning technique for the characterization of protein families. Automata are graphical models that are more expressive than standard sequence patterns (such as PSSM, Profile HMM, or Prosite Patterns) and enable modelling heterogeneous sequence families. We are also studying how to learn more expressive grammars such as context-free grammars.
Discovery of a new protein
Last year, a new candidate protein in the family involved in cell apoptosis was discovered thanks to Protomata-Learner. In collaboration with T. Guillaudeux from the team Microenvironnement et Cancer (MICA), we have defined more precisely the localization of the complete gene and studied it by comparison with other species, collecting in silico evidence that the protein is actually expressed.
Characterizing protein fold with automata
The Protomata-Learner software has been generalized and is now able to characterize a set of protein structures instead of a set of sequences. First experiments on building structural cores for the protein threading program FROSTO are encouraging. Regarding the newly introduced partial local alignments, we have worked on improving their definition and we have designed a C++ library handling these objects, as a first step towards the implementation of the alignment algorithms.
Integrating scores on automata
Since learned automata are used to predict new members of a family, it is important to associate scores with their transitions. We have worked on the introduction of pseudo-counts based on Dirichlet mixtures in the automata and on the significance of the score on new sequences. Tests on classical benchmarks and on particular families of proteins, in collaboration with biologists of Genouest, are planned.
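As an illustration of pseudo-counts based on Dirichlet mixtures, the sketch below computes the standard posterior mean estimate of symbol probabilities from a vector of observed counts, for instance the symbols seen on one transition. How the estimate is integrated into Protomata-Learner is not described here; the function names, the two-letter alphabet and the mixture parameters are hypothetical.

```python
import math


def log_beta(a):
    """Log of the multivariate Beta function: sum(lgamma(a_i)) - lgamma(sum(a))."""
    return sum(math.lgamma(x) for x in a) - math.lgamma(sum(a))


def dirichlet_mixture_estimate(counts, weights, alphas):
    """Posterior mean symbol probabilities under a Dirichlet mixture prior.

    counts  : observed count for each symbol
    weights : prior weight q_j of each mixture component (sums to 1)
    alphas  : one Dirichlet parameter vector per component
    """
    # Posterior weight of component j is proportional to
    # q_j * B(counts + alpha_j) / B(alpha_j), computed in log space.
    log_post = [
        math.log(q) + log_beta([c + a for c, a in zip(counts, alpha)]) - log_beta(alpha)
        for q, alpha in zip(weights, alphas)
    ]
    top = max(log_post)
    post = [math.exp(lp - top) for lp in log_post]
    norm = sum(post)
    post = [w / norm for w in post]

    # Mix the per-component estimates (counts + pseudo-counts, normalized).
    total = sum(counts)
    probs = [0.0] * len(counts)
    for w, alpha in zip(post, alphas):
        denom = total + sum(alpha)
        for i, (c, a) in enumerate(zip(counts, alpha)):
            probs[i] += w * (c + a) / denom
    return probs


# Hypothetical two-symbol alphabet with two components, each favouring one symbol.
probs = dirichlet_mixture_estimate([5, 0], [0.5, 0.5], [[10, 1], [1, 10]])
```

With a single component the estimate reduces to ordinary additive pseudo-counts; with several components, the observed counts select the components that explain them best, so unseen symbols still receive a non-zero probability that reflects the prior.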
Integrating motif discovery methods
Numerous motif discovery tools are now available for the identification of transcription factors, a crucial task for constructing regulatory networks. Efficiently combining their results, by comparing and clustering the motifs, appears useful to reduce redundancies and to identify the corresponding transcription factors. We are developing a pipeline that produces, compares and clusters a set of motifs and identifies close motifs in public databases such as JASPAR and Transfac. Unlike common comparison methods, where each matrix column is compared independently, we have developed a global approach that helps to reduce false positives. We also proposed an original graph motif model that generalizes the classical position-specific pattern matrices. Finally, we present an application of our method to the study of ChIP-chip data sets in the context of a eukaryotic organism.
Learning context-free grammars (CFG)
Focusing on learning the structure rather than the language, we are developing an approach to CFG learning based on recoding repeated words. To handle large sequences such as genomes, efficient data structures and algorithms have to be used. To detect and score repeats, we use suffix arrays, which need to be updated after each rewriting of a repeated word. We proposed an incremental algorithm that updates a suffix array after some (possibly all) occurrences of a given word in the indexed text have been substituted by a new character. Our algorithm uses the specific internal order of suffix arrays to update groups of entries simultaneously, and ensures that only the entries to be modified are visited. Our implementation exhibits a significant speed-up compared to construction from scratch at each step. In collaboration with the Natural Language Processing Group from Universidad Nacional de Córdoba, we have applied the algorithm to the smallest coding grammar problem, studying new scores for choosing the words and the occurrences to be rewritten.
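For reference, one rewriting step can be sketched with the naive baseline, that is, rebuilding the suffix array from scratch after the substitution. This is not the incremental update algorithm described above, only the behaviour it must reproduce more efficiently. The function names are hypothetical, and the sketch replaces all non-overlapping occurrences from left to right.

```python
def suffix_array(text):
    """Naive suffix array: start positions of the suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, word):
    """Start positions of word in text, found by binary search on the suffix array."""
    m = len(word)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose m-prefix is >= word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < word:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose m-prefix is > word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def rewrite_step(text, word, new_char):
    """Replace the occurrences of word by new_char and rebuild the index from scratch."""
    new_text = text.replace(word, new_char)  # all non-overlapping occurrences
    return new_text, suffix_array(new_text)


text = "abracadabra"
sa = suffix_array(text)
hits = occurrences(text, sa, "abra")         # the repeated word to recode
new_text, new_sa = rewrite_step(text, "abra", "X")
```

In the new suffix array, only the suffixes that contained or overlapped an occurrence of the word change their relative order or disappear, which is the property an incremental update can exploit instead of re-sorting everything.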