Project : axis
Section: New Results
Other applications were developed mostly for validating our algorithms (as for , , ).
Comparison of Sanskrit Documents
Keywords : text comparison, Sanskrit, transliteration.
Participants : Marc Csernel, Yves Lechevallier.
This research is carried out in the context of the CNRS Action ``Histoire des savoirs'' (History of Knowledge).
The goal of the project is to produce software tools that support the construction of critical editions of Sanskrit texts. A critical edition is a document that shows all the different versions of a text found in different manuscripts. In general, the critical edition of a text is the basis for all further studies.
Sanskrit texts have some unique features that make standard tools designed for Indo-European languages inefficient. First, Sanskrit uses a specific 46-letter alphabet. Sanskrit scripts must be transliterated into Roman script to be processed on a computer; the usual software tools compare the Roman characters of the transliteration and do not use the Sanskrit alphabet directly.
Second, Sanskrit texts (especially in ancient manuscripts) can be written without spaces between words. This makes comparisons between texts quite complex, since it is hard to separate words. To avoid this difficulty we use two kinds of text: a lemmatised master text (called the padapatha) and the text to be compared. This greatly reduces the algorithmic complexity, but introduces some new difficulties. Indeed, when the space between two words is suppressed, the two words are not simply glued together: they are modified according to special rules called Sandhi. Taking Sandhi into account is not a trivial task.
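The effect of Sandhi on word boundaries can be illustrated with a minimal Python sketch. It assumes a tiny table of well-known vowel rules for a final short "a", written in Velthuis-style transliteration; the table `VOWEL_SANDHI` and the function `sandhi_join` are hypothetical names, and the real rule set is far larger.

```python
# Toy vowel-Sandhi table (assumption: a tiny subset of the real rules).
# Key: (final letter of left word, initial letter of right word) -> fused result.
VOWEL_SANDHI = {
    ("a", "a"): "aa",  # a + a -> long a
    ("a", "i"): "e",   # a + i -> e
    ("a", "u"): "o",   # a + u -> o
    ("a", "e"): "ai",  # a + e -> ai
}

# Long vowels and diphthongs are spelled with two Roman characters;
# this sketch leaves them untouched instead of handling them properly.
TWO_CHAR_VOWELS = ("aa", "ii", "uu", "ai", "au")


def sandhi_join(left, right):
    """Glue two words, applying a toy vowel-Sandhi rule at the junction."""
    if left.endswith(TWO_CHAR_VOWELS) or right.startswith(TWO_CHAR_VOWELS):
        return left + right  # not covered by this sketch
    fused = VOWEL_SANDHI.get((left[-1:], right[:1]))
    if fused is None:
        # No listed rule: plain concatenation (often wrong in real Sanskrit).
        return left + right
    return left[:-1] + fused + right[1:]


print(sandhi_join("ca", "iti"))     # ceti
print(sandhi_join("tatra", "eva"))  # tatraiva
```

Even in this reduced form, the junction letters of both words change, which is why a plain string comparison cannot locate the original word boundary.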
During 2004, we developed the comparison software. The first step was to adapt the Velthuis transliteration system so that words are compared according to the Sanskrit alphabet rather than the Roman alphabet. The implementation relies mostly on Lex.
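Comparing by Sanskrit letter rather than by Roman character can be sketched as follows. This is only an illustration in Python, not the Lex implementation: it assumes a partial table of multi-character Velthuis codes and a greedy longest-match scan, and the names `MULTI_CHAR_LETTERS` and `tokenize` are hypothetical.

```python
# Partial, illustrative table of Velthuis codes that spell ONE Sanskrit letter
# with several Roman characters (assumption: not the authors' Lex table).
MULTI_CHAR_LETTERS = {
    "aa", "ii", "uu", "ai", "au",                    # long vowels, diphthongs
    ".r", ".m", ".h",                                # vocalic r, anusvara, visarga
    "kh", "gh", "ch", "jh", "th", "dh", "ph", "bh",  # aspirated stops
    ".t", ".th", ".d", ".dh", ".n",                  # retroflex stops and nasal
    '"n', "~n", '"s', ".s",                          # other nasals and sibilants
}
MAX_LEN = max(len(code) for code in MULTI_CHAR_LETTERS)


def tokenize(word):
    """Split a transliterated word into Sanskrit letters (greedy longest match)."""
    letters, pos = [], 0
    while pos < len(word):
        for size in range(MAX_LEN, 0, -1):
            chunk = word[pos:pos + size]
            # A single character is always accepted as a letter of its own.
            if size == 1 or chunk in MULTI_CHAR_LETTERS:
                letters.append(chunk)
                pos += len(chunk)
                break
    return letters


print(tokenize("dharma"))    # ['dh', 'a', 'r', 'm', 'a']
print(tokenize("k.r.s.na"))  # ['k', '.r', '.s', '.n', 'a']
```

With this representation, "katha" and "kata" differ by exactly one letter ("th" versus "t"), whereas a Roman-character comparison would report an inserted "h".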
The second step was to allow comparison between texts of different structures, the "normal text" and the lemmatised text: before comparison, the lemmatised text is transformed according to the Sandhi rules. This transformation needs to be done for each comparison, because only the lemmatised text can indicate in which word a difference occurs. We have implemented most of the Sandhi rules, although some work remains. Once the implementation of the Sandhi rules is completed, we will test them on real examples.
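One way to keep track of the word in which a difference occurs is to record, for every character of the transformed text, the index of the lemmatised word it came from. The Python sketch below is an assumption about how this could be done, not a description of the actual software; it applies a single toy rule (final short "a" + initial short "i" gives "e"), and `join_with_origins` and `first_difference` are hypothetical names.

```python
def join_with_origins(words):
    """Glue lemmatised words into one string.

    Returns (text, origins) where origins[i] is the index of the word
    that produced text[i].
    """
    chars, origins = [], []
    for idx, word in enumerate(words):
        ends_in_short_a = chars[-1:] == ["a"] and chars[-2:] != ["a", "a"]
        starts_with_short_i = word.startswith("i") and not word.startswith("ii")
        if ends_in_short_a and starts_with_short_i:
            # Toy Sandhi rule (assumption): a + i -> e.
            # The fused vowel stays attributed to the previous word.
            chars[-1] = "e"
            word = word[1:]
        chars.extend(word)
        origins.extend([idx] * len(word))
    return "".join(chars), origins


def first_difference(words, witness):
    """Return (position, word) for the first differing character, else None.

    Only the common prefix length is scanned; length differences are ignored.
    """
    text, origins = join_with_origins(words)
    for pos, (expected, found) in enumerate(zip(text, witness)):
        if expected != found:
            return pos, words[origins[pos]]
    return None


print(first_difference(["ca", "iti", "vadati"], "cetivedati"))  # (5, 'vadati')
```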
The software is now able to compare the "master lemmatised text" with another text, taking into account the Sanskrit alphabet as well as the Sandhi rules, but the algorithm needs some optimisation. We plan to improve its performance by adapting the diff algorithm to deal with Sanskrit characteristics.
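As a rough idea of what a diff over Sanskrit letters might look like, the sketch below runs Python's standard `difflib.SequenceMatcher` on lists of letters instead of Roman characters. This is a stand-in chosen for illustration: `SequenceMatcher` is not the Unix diff algorithm, and the planned adaptation may differ. The function name `letter_diff` is hypothetical, and the inputs are assumed to be already split into letters.

```python
from difflib import SequenceMatcher


def letter_diff(master_letters, witness_letters):
    """List the non-matching regions between two sequences of Sanskrit letters.

    Each entry is (tag, letters_in_master, letters_in_witness) where tag is
    'replace', 'delete' or 'insert', as reported by SequenceMatcher.
    """
    matcher = SequenceMatcher(None, master_letters, witness_letters, autojunk=False)
    return [
        (tag, master_letters[i1:i2], witness_letters[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# "katha" versus "kata", already split into letters.
print(letter_diff(["k", "a", "th", "a"], ["k", "a", "t", "a"]))
# [('replace', ['th'], ['t'])]
```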
Using GREPminer on Gene Regulatory Expression Profiles
Keywords : sequential pattern mining, suffix tree, Apriori, Affymetrix GeneChip, differential expression, DNA chip, DNA microarray, expression analysis, gene expression.
Participants : Doru Tanasa, Brigitte Trousse.
With the advent of microarray technology, it is now possible to analyze the expression of a large number of genes simultaneously. Microarray experiments can be classified according to the nature of the samples, i.e. time of collection, location, type of tissue, class of tumor, etc. In the paper  we explore our computational methods when applied to time-series microarray experiments. In particular, we report results on gene expression time series associated with mouse cerebellum development.
Biological motivation and gene expression data generation
The time-series gene expression data was generated by Kagami et al. The data is publicly available through GEO (Gene Expression Omnibus), http://www.ncbi.nlm.nih.gov/geo/. In this study, Kagami et al. investigated differentially expressed genes during the development of the mouse cerebellum. Their biological interest was in furthering the understanding of the molecular basis of mouse cerebellum development. The mouse cerebellum is not entirely developed until post-natal day 21; their experiment was therefore an ideal framework for understanding the genetic foundations and mechanisms of neural development.
Sequential Pattern Discovery in Microarray Data
We propose APRIORI-GST, an APRIORI-like algorithm that uses a Generalized Suffix Tree (GST) index to discover sequential patterns in microarray data. The microarray data is transformed into sequences over three possible levels of exposure (e+, e0 or e-). These sequences are indexed using a GST. A microarray sequential pattern may be seen, in this case, as a sub-sequence of exposure levels that occurs frequently.
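The notion of a frequent pattern over the three levels can be made concrete with a small Python sketch. It is not APRIORI-GST: it counts support by brute-force enumeration in a dictionary instead of using a Generalized Suffix Tree, it assumes that patterns are contiguous runs of levels, and the discretisation rule (a symmetric threshold on each value) is a hypothetical choice that the text does not specify.

```python
from collections import defaultdict


def discretize(values, threshold=1.0):
    """Map numeric values to levels (hypothetical rule: symmetric threshold)."""
    return tuple(
        "e+" if v > threshold else "e-" if v < -threshold else "e0"
        for v in values
    )


def frequent_patterns(sequences, min_support):
    """Return {pattern: support} for contiguous patterns with support > min_support.

    The support of a pattern is the number of sequences that contain it.
    """
    support = defaultdict(int)
    for seq in sequences:
        # Collect each distinct contiguous run once per sequence.
        runs = {
            tuple(seq[i:j])
            for i in range(len(seq))
            for j in range(i + 1, len(seq) + 1)
        }
        for run in runs:
            support[run] += 1
    return {p: s for p, s in support.items() if s > min_support}


sequences = [
    discretize([2.0, 0.3, -1.5]),   # ('e+', 'e0', 'e-')
    discretize([1.4, 0.0, 0.2]),    # ('e+', 'e0', 'e0')
    discretize([0.1, -2.0, -1.1]),  # ('e0', 'e-', 'e-')
]
patterns = frequent_patterns(sequences, min_support=1)
print(patterns[("e+", "e0")])  # 2
```

Enumerating every run is quadratic in the sequence length, which is acceptable for short time series; an index such as a suffix tree avoids rescanning the sequences for each candidate.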
From the extracted patterns we formulated the hypothesis that there is considerable gene activity between the prenatal stage E18 and the postnatal stage P7, which needs to be investigated further.
GREPminer Software Tool
To support our methodology, we designed and implemented in Java the GREPminer (Gene Regulatory Expression Profiles Miner) tool, presented in Fig. 13. The user chooses a dataset file and extracts the sequential patterns whose support exceeds a specified threshold. The extracted frequent sequential patterns are listed on the left side, and the details (list of genes) for the selected pattern are displayed on the right side.
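The detail view for a selected pattern amounts to listing the genes whose level sequence contains that pattern. The Python sketch below illustrates this lookup only; it is not GREPminer code (the tool itself is written in Java). The gene names, the function names and the assumption that patterns are contiguous are all hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern occurs as a contiguous run inside sequence."""
    n = len(pattern)
    return any(
        tuple(sequence[i:i + n]) == tuple(pattern)
        for i in range(len(sequence) - n + 1)
    )


def genes_for_pattern(profiles, pattern):
    """Return the sorted names of the genes whose profile contains pattern."""
    return sorted(gene for gene, seq in profiles.items() if contains(seq, pattern))


# Hypothetical gene names and level sequences, for illustration only.
profiles = {
    "geneA": ("e+", "e0", "e-"),
    "geneB": ("e+", "e0", "e0"),
    "geneC": ("e0", "e-", "e-"),
}
print(genes_for_pattern(profiles, ("e+", "e0")))  # ['geneA', 'geneB']
```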