Project : texmex
Section: Application Domains
Textual Database Management
Searching in large textual corpora has long been an active research topic. The current stakes are the management of very large volumes of data, the ability to answer queries relying on concepts rather than on the simple presence of words in the texts, and the characterization of sets of texts.
We work on the exploitation of scientific bibliographic databases. The explosion of the number of scientific publications makes the retrieval of relevant data a very difficult task for a researcher, and the generalization of document indexing in data banks has not solved the problem. The main difficulty is to choose the keywords that delimit a domain of interest. The statistical method we use, factorial correspondence analysis, makes it possible to index a document or a whole set of documents and provides the list of the most discriminating keywords for these documents. The indexing is validated by searching databases more general than the one used to build the index and by studying the documents returned; this usually makes it possible to further reduce the subset of words characterizing a field.
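Factorial correspondence analysis decomposes a term-by-document contingency table and places terms and documents in a common low-dimensional space, where terms with extreme coordinates on the leading axes are the most discriminating. The following is a minimal NumPy sketch of the technique, not the project's implementation; the toy corpus and term list are invented for illustration.

```python
import numpy as np

def correspondence_analysis(N):
    """Correspondence analysis of a term-by-document contingency table.

    Returns term coordinates, document coordinates, and singular values.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row (term) masses
    c = P.sum(axis=0)                    # column (document) masses
    # matrix of standardized residuals
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # principal coordinates for terms and documents
    rows = (U * sv) / np.sqrt(r)[:, None]
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]
    return rows, cols, sv

# toy corpus: 4 terms x 3 documents (invented counts)
terms = ["gene", "protein", "archive", "index"]
N = [[8, 7, 0],
     [6, 9, 1],
     [0, 1, 7],
     [1, 0, 8]]
rows, cols, sv = correspondence_analysis(N)
# terms with the largest coordinates on the first axis discriminate best
order = np.argsort(-np.abs(rows[:, 0]))
print([terms[i] for i in order])
```

On this toy table, the first axis separates the two thematic groups of documents, and the terms at its extremes are exactly the candidate discriminating keywords.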
Another difficulty is to find, within a given document, the parts that address a given subject. Working on bioinformatics texts from databases such as Medline, we study the automatic extraction of the text zones describing interactions between genes and the modeling of the described interactions. Since modeling requires a fine-grained and expensive analysis of sentences, it should be carried out only on the text zones actually likely to contain an interaction. Our methods for learning semantic links between words are used to determine these relevant zones. On a corpus of abstracts extracted from Medline, we apply inductive logic programming (ILP) to learn what distinguishes sentences containing interaction descriptions from the others.
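The zone-selection step can be illustrated with a much simpler surface filter than the ILP approach described above: keep only the sentences that look like they describe an interaction, and send those alone to the expensive analysis. This is a hand-written stand-in, not the learned system; the cue lexicon, the crude gene-symbol pattern, and the sample abstract are all invented.

```python
import re

# hypothetical lexicon of interaction verbs; the real system learns such cues
INTERACTION_CUES = re.compile(
    r"\b(activat\w+|inhibit\w+|repress\w+|regulat\w+|bind\w*|induc\w+)\b",
    re.IGNORECASE)
# crude pattern for gene-symbol-like tokens ending in a capital or digit
GENE_NAME = re.compile(r"\b[A-Za-z]{2,6}[A-Z0-9]\b")

def candidate_zones(abstract):
    """Return sentences worth sending to the expensive interaction parser:
    those with an interaction cue and at least two gene-like tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract)
    return [s for s in sentences
            if INTERACTION_CUES.search(s) and len(GENE_NAME.findall(s)) >= 2]

abstract = ("GerE inhibits cotA transcription. "
            "The cultures were grown overnight. "
            "SigK activates cotD transcription.")
print(candidate_zones(abstract))
```

Only the two sentences naming two gene-like symbols around an interaction verb survive the filter; the irrelevant sentence is discarded before any costly parsing.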
We also explore scientific document corpora to solve two different problems: indexing the publications by means of meta-keys, and identifying the relevant publications in a large textual database. For that, we use factorial data analysis, which allows us to find the minimal sets of relevant words that we call meta-keys and to free the bibliographic search from the problems of noise (irrelevant documents retrieved) and silence (relevant documents missed). The performance of factorial correspondence analysis is markedly better than that of classical search by Boolean query.
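Noise and silence correspond to one minus precision and one minus recall, respectively. A small sketch of how such a comparison can be scored, with invented document identifiers rather than real evaluation data:

```python
def noise_and_silence(retrieved, relevant):
    """Noise = fraction of retrieved documents that are irrelevant
    (1 - precision); silence = fraction of relevant documents missed
    (1 - recall)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    noise = 1 - len(hits) / len(retrieved) if retrieved else 0.0
    silence = 1 - len(hits) / len(relevant) if relevant else 0.0
    return noise, silence

# hypothetical result sets for the same information need
relevant = {1, 2, 3, 4, 5}
boolean_hits = {1, 2, 6, 7}      # Boolean query: 2 of 5 found, 2 spurious
metakey_hits = {1, 2, 3, 4, 8}   # meta-key search: 4 of 5 found, 1 spurious
print("boolean:", noise_and_silence(boolean_hits, relevant))
print("meta-key:", noise_and_silence(metakey_hits, relevant))
```

A search method dominates another when it lowers both figures at once, as the meta-key result set does here.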