Section: Scientific Foundations
Modeling sequences and structures
Formal Languages and biological sequences
Biological sequences may be abstracted as words over an alphabet of nucleic or amino acids. Structural and functional constraints on families of sequences give rise to genuine languages, and knowing these languages would make it possible to predict the properties of the families. Formal language theory offers an ideal framework for the in-depth study of such languages, both formal and practical:
Formal: the goal is to define and study the classes of formal languages best suited to describing observed natural phenomena: crossing over (Head's splicing systems), Watson-Crick complementarity (sticker systems), inversion, transposition, copy, deletion... Language theorists such as A. Salomaa and Gh. Paun have explored standard questions (complexity, decidability) for natural operations on biological sequences. The current consensus is that the required expressivity is that of the class of "mildly context-sensitive" languages, well known in natural language analysis;
Practical: the goal is to give biologists the means to formalize their models as grammars; a grammar submitted to a parser then makes it possible to extract from public data banks the sequences that are relevant with respect to the model. J. Collado Vides was one of the first to take an interest in this framework, for the study of gene regulation. D. Searls proposed a more systematic approach based on logic grammars and a parser, Genlang. Genlang still required advanced expertise in formal languages and no longer seems to be in use. We started our own work from this solution, keeping in mind the need to make the model more accessible to biologists.
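As an illustration of the practical side, the following minimal sketch scans a DNA sequence for a stem-loop motif, a pattern that relies on Watson-Crick complementarity and that ordinary regular expressions cannot express for arbitrary stems. It is a hand-written recognizer for a single rule, not Genlang or any parser mentioned above; the function name `find_hairpins` and its parameters (`stem_len`, `min_loop`, `max_loop`) are hypothetical.

```python
# Hypothetical sketch: recognize "stem, loop, reverse-complement of stem".
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def find_hairpins(seq, stem_len=4, min_loop=3, max_loop=6):
    """Return (start, stem, loop) for each stem-loop occurrence in seq."""
    hits = []
    n = len(seq)
    for start in range(n - 2 * stem_len - min_loop + 1):
        stem = seq[start:start + stem_len]
        if any(base not in COMPLEMENT for base in stem):
            continue  # skip ambiguous letters such as N
        # The closing arm must be the reverse complement of the opening arm.
        closing = "".join(COMPLEMENT[base] for base in reversed(stem))
        loop_start = start + stem_len
        for loop_len in range(min_loop, max_loop + 1):
            arm_start = loop_start + loop_len
            if seq[arm_start:arm_start + stem_len] == closing:
                hits.append((start, stem, seq[loop_start:arm_start]))
    return hits
```

For example, `find_hairpins("AAGGCATTTTGCCAA")` returns `[(2, 'GGCA', 'TTT')]`. A grammar-based tool would let the biologist state such a rule declaratively instead of coding the scan by hand.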
Machine Learning: from Pattern Discovery to Grammatical Inference
In practice, building relevant models is hard and frequently requires the assistance of Machine Learning techniques. Machine Learning addresses both theoretical issues (learnable classes) and practical ones (algorithms and their performance). Recent techniques combine both points of view, such as boosting (which achieves good performance starting from a weak learner) or support vector machines (which apply the structural risk minimization principle from statistical learning theory). Statistical tools are everywhere: reinforcement learning, classification, statistical physics, neural networks and hidden Markov models (HMMs). An HMM has the mathematical structure of a (hidden) Markov chain in which each state is associated with a distinct independent and identically distributed (IID) or stationary random process. Estimation of the parameters by maximum likelihood or related principles has been extensively studied, and good algorithms relying on dynamic programming are now available in bioinformatics. When available, domain knowledge may help to design the HMM structure, but in practice this structure is often very simple (profile HMMs) and the discriminative power of the model relies mostly on the choice of its parameters.
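The dynamic programming mentioned above can be sketched with the forward algorithm, which computes the likelihood of an observed sequence under an HMM whose structure and parameters are given. This is a minimal illustration, assuming a discrete-emission HMM described by plain dictionaries; the function name and argument layout are assumptions, and the sketch omits the scaling or log-space arithmetic that long sequences require.

```python
def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """Probability of a non-empty observed sequence under the HMM."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Toy two-state model (all values are illustrative assumptions).
STATES = ("at_rich", "gc_rich")
START = {"at_rich": 0.5, "gc_rich": 0.5}
TRANS = {
    "at_rich": {"at_rich": 0.9, "gc_rich": 0.1},
    "gc_rich": {"at_rich": 0.2, "gc_rich": 0.8},
}
EMIT = {
    "at_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "gc_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}
```

Calling `forward_likelihood("GGCA", STATES, START, TRANS, EMIT)` sums over all hidden state paths in time linear in the sequence length, instead of enumerating the paths explicitly.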
Because of the practical importance of pattern discovery in genomic sequence analysis, a large number of methods have been proposed. A language can be represented primarily either within a probabilistic framework, by a distribution over the set of possible words, or within a formal language framework, by a production system for the set of accepted words. At the frontier between the two, hidden Markov models and stochastic automata perform very well, but their structure is generally fixed and learning is restricted to the parameters of the distribution. Distributional representations take various forms: consensus matrices (probability of occurrence of each letter at each position), profiles (adding gaps), weight matrices (quantity of information). A typical algorithmic approach scans for short words in the sequences and produces alignments by dynamic programming around these "anchoring" points. The most powerful programs in this field use Bayesian procedures, Gibbs sampling and Expectation-Maximization. The linguistic representation, which corresponds to our own work, generally rests on regular expressions. Algorithms use combinatorial enumeration in a partially ordered space. Another line of work explores variations on the search for cliques in a graph.
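The distributional representations listed above can be illustrated with a small sketch that builds a consensus matrix from already aligned sites, measures the information content of each column, and scores a candidate word by log-odds against a uniform background. The function names, the pseudocount and the uniform background are assumptions made for the example, not choices taken from any particular program.

```python
import math
from collections import Counter


def consensus_matrix(sites, alphabet="ACGT", pseudocount=1.0):
    """Per-position letter probabilities estimated from aligned sites."""
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        matrix.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return matrix


def information_content(column):
    """Information of one column, in bits, relative to a uniform background."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(len(column)) - entropy


def log_odds(word, matrix, background=0.25):
    """Weight-matrix score of a word of the same length as the matrix."""
    return sum(math.log2(matrix[i][c] / background) for i, c in enumerate(word))
```

Each position contributes to the score independently of the others, which is precisely the restriction that makes these models simple to estimate.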
Most of these studies share a fundamental limitation: the prediction rests primarily on the presence of some class of letters at a given position. Purely statistical learning reaches its limit when relations between distant sites, which are frequent in biology, need to be taken into account, because many parameters then have to be estimated. The theoretical framework of formal languages, in which one can seek to optimize the complexity of the representation (parsimony principle), seems to us better suited. We are studying this problem in the general framework of Grammatical Inference.
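The growth in the number of parameters can be made concrete with a short count. The sketch below compares a model that treats sites independently with one that keeps a full joint distribution over a group of interacting sites; both function names are hypothetical, and the full joint table is only one (worst-case) way of modeling dependencies.

```python
def independent_site_parameters(alphabet_size, n_sites):
    """Free parameters when each site has its own letter distribution."""
    return n_sites * (alphabet_size - 1)


def joint_site_parameters(alphabet_size, n_sites):
    """Free parameters for one joint distribution over all n_sites letters."""
    return alphabet_size ** n_sites - 1
```

For amino acids (20 letters), two independent sites need 38 free parameters, whereas a joint table over the same two sites needs 399, and three coupled sites already need 7999. Typical protein families offer far fewer sequences than that.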
A grammatical inference problem is an optimization problem involving the choice of a) a relevant alphabet and a class of languages; b) a class of representations for the languages and a definition of the hypothesis space; c) a search algorithm using the properties of the hypothesis space and the available bias (domain knowledge) to find the “best” solution in the search space. The state of the art in grammatical inference mostly concerns learning the class of regular languages (the same level of complexity as HMM structures), for which positive theoretical results and practical algorithms have been obtained. Some results have also been obtained on (sub-)classes of context-free languages. In the Symbiose project, we are studying more specifically how grammatical inference algorithms may be applied to bioinformatics, focusing on how to introduce biological bias and on how to obtain explicit representations. Our main focus is on the inference of automata from samples of (unaligned) sequences belonging to a structural or functional family of proteins. Automata can be used to gain new insights into the family when classical multiple sequence alignments are insufficient, or to search for new family members in sequence data banks, with the advantage of a finer level of expressivity than classical sequence patterns, which makes it possible to model heterogeneous sequence families.
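To give a feel for automaton inference from positive samples, the sketch below builds a prefix tree acceptor, the most specific automaton consistent with a sample and the usual starting point of state-merging algorithms for regular languages. It is a generic illustration and is not claimed to be the algorithm used in the project; the generalization step (merging states under some bias) is deliberately left out, and all names are hypothetical.

```python
def build_pta(samples):
    """Prefix tree acceptor: a trie-shaped DFA accepting exactly the samples."""
    transitions = [{}]  # transitions[state][symbol] -> next state; 0 is the root
    accepting = set()
    for word in samples:
        state = 0
        for symbol in word:
            if symbol not in transitions[state]:
                transitions.append({})
                transitions[state][symbol] = len(transitions) - 1
            state = transitions[state][symbol]
        accepting.add(state)
    return transitions, accepting


def accepts(automaton, word):
    """Run the automaton on word and report whether it ends in a final state."""
    transitions, accepting = automaton
    state = 0
    for symbol in word:
        state = transitions[state].get(symbol)
        if state is None:  # no transition: the word is rejected
            return False
    return state in accepting
```

On the sample `["CAC", "CAH", "CC"]` the acceptor has six states and recognizes only the three given words. An inference algorithm would then merge states, guided by the chosen bias, so that the automaton also accepts unseen members of the family.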