Team TEXMEX


Section: New Results

New Techniques for Linguistic Information Acquisition and Use

NLP for Document Description

Semantic annotation of multimedia documents based on textual data

Participants : Ali Reza Ebadat, Vincent Claveau, Pascale Sébillot.

This work is done in the framework of the Quaero project (see below) and started in October with the thesis of Ali Reza Ebadat.

On this subject, TexMex is involved in three tasks of the Quaero project.

The first task concerns the extraction of terminology from documents. The objective is to study the development and adaptation of methods that automate the acquisition of terminologies. In this context, we focused on the term normalisation problem: we are developing various methods to generate the variant forms of an initial list of terms (acronyms, graphical, morphological and morpho-syntactic variants), based on previous work on analogical learning.

The second task aims at extracting semantic and ontological relations from documents. Detecting such relations in texts is key to describing a domain and thus to handling documents intelligently. These relations can differ greatly from one domain to another and from one final application to another, so it is important to develop generic methods to detect them. This year we began to study the use of machine learning techniques to extract relations. In particular, we began to develop a system that infers extraction patterns from examples, based on previous work using Inductive Logic Programming. We also studied the use of probabilistic approaches such as Conditional Random Fields to detect relations.

The last task deals directly with the semantic annotation of multimedia documents based on textual data: very often, many textual or language-related data can be found in multimedia documents or come along with them. For example, a TV broadcast contains speech that can be transcribed, Electronic Program Guide and standard program guide information, closed captions, associated websites, etc. All these sources offer complementary information that can be used to semantically annotate multimedia documents. This year, TexMex and other Quaero partners were especially involved in the definition, building and annotation of such a multimedia corpus, composed of football matches.

Oral and Textual Information Retrieval

Phonetization

Participant : Vincent Claveau.

Phonetization is a crucial step for oral document processing. In 2009, we proposed a new letter-to-phoneme conversion approach that is automatic, simple, portable and efficient. It relies on a machine learning technique initially developed for transliteration and translation: the system infers rewriting rules from examples of words with their phonetic representations [29], [19]. This approach was evaluated in the framework of the Pronalsyl Pascal challenge, which includes several datasets in different languages. The results obtained equal or outperform those of the best known systems. Moreover, thanks to the simplicity of our technique, the inference time of our approach is much lower than that of the best performing state-of-the-art systems.
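As a rough illustration of rule-based letter-to-phoneme conversion, the following Python sketch learns chunk-level rewriting rules from examples and applies them by greedy longest match. It is only a toy under strong assumptions: the report does not describe the rule format or the inference procedure, so the pre-aligned training examples, the ASCII phoneme symbols and the function names `infer_rules` and `phonetize` are all hypothetical.

```python
from collections import Counter, defaultdict

def infer_rules(aligned_examples):
    """Learn one rewriting rule per letter chunk from aligned examples.

    Each example is a list of (letter_chunk, phoneme) pairs. Assuming the
    alignment is given is a simplification made for this sketch only.
    """
    counts = defaultdict(Counter)
    for word in aligned_examples:
        for chunk, phoneme in word:
            counts[chunk][phoneme] += 1
    # Keep the most frequent phoneme observed for each letter chunk.
    return {chunk: c.most_common(1)[0][0] for chunk, c in counts.items()}

def phonetize(word, rules):
    """Rewrite a word left to right, preferring the longest matching chunk."""
    max_len = max(map(len, rules), default=1)
    output, i = [], 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                output.append(rules[chunk])
                i += size
                break
        else:
            output.append("?")  # no rule covers this letter
            i += 1
    return " ".join(output)

# Toy training data (hypothetical, with made-up ASCII phoneme symbols).
examples = [
    [("sh", "S"), ("i", "I"), ("p", "p")],               # ship
    [("f", "f"), ("i", "I"), ("sh", "S")],               # fish
    [("g", "g"), ("r", "r"), ("a", "a"), ("ph", "f")],   # graph
]
rules = infer_rules(examples)
print(phonetize("parish", rules))
```

Greedy longest match keeps the application step linear in the word length for a bounded chunk size; a real system would also need context-sensitive rules to handle letters whose pronunciation depends on their neighbours.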

Information Retrieval in the TV context

Participants : Julien Fayolle, Patrick Gros, Fabienne Moreau, Christian Raymond.

The work on this topic is done in close collaboration with Guillaume Gravier from the METISS project-team.

The main focus of our research is to design a new generation of information retrieval (IR) systems capable of retrieving information from TV data. Directly indexing automatic transcripts is error-prone because of transcription errors, in particular in the TV context where error rates can be high for some programs. The main challenge is therefore to develop IR approaches able to retrieve information from degraded text.

In October 2009, Julien Fayolle (supervised by F. Moreau, C. Raymond and P. Gros) started a Ph.D. whose aim is to investigate IR approaches that are robust to transcription errors. To this end, we focus on three main points: (i) detecting the portions of transcripts most likely to contain errors, (ii) designing innovative representations that offer more flexibility than purely textual or phonetic ones, and (iii) adapting IR mechanisms to the new representations and to the TV context, where the notion of document is not clearly defined. As an initial step, we are studying a hybrid lexico-phonetic representation that emphasizes named entities, which are highly problematic for automatic speech recognition systems.

Graded-Inclusion-Based Information Retrieval Systems

Participants : Vincent Claveau, Laurent Ughetto.

Our work on this topic is done in close collaboration with Olivier Pivert and Patrick Bosc from the Pilgrim team of IRISA Lannion.

Database (DB) querying mechanisms, and more particularly the division of relations, were at the origin of the Boolean model for Information Retrieval Systems (IRSs). This model rapidly showed its limitations and is no longer used in information retrieval (IR). Among other reasons, the Boolean approach cannot represent or use the relative importance of the terms indexing the documents or representing the queries. However, this notion of importance can be captured by the division of fuzzy relations. This division, modeled by fuzzy implications, corresponds to graded inclusions. Theoretical work conducted by the Pilgrim team has shown the interest of this operator in IR.

Our first work was to investigate the use of graded inclusions to model the information retrieval process. In this framework, documents and queries are represented by fuzzy sets, which are combined with operations such as fuzzy implications and T-norms. Through different experiments, we have shown that only some of the wide range of fuzzy operations are relevant for information retrieval. When appropriate settings are chosen, it is possible to mimic classical systems, thus yielding results rivaling those of state-of-the-art systems [17]. These positive results have validated the proposed approach, while the negative ones have given some insights into the properties needed by such a model [18].
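To make the notion of graded inclusion concrete, here is a minimal Python sketch that scores a document by the degree to which the fuzzy set of query terms is included in the fuzzy set of document terms. It is an illustration under assumptions, not the model evaluated in [17] and [18]: the three implications shown (Gödel, Łukasiewicz, Kleene-Dienes) are textbook choices, the minimum is used as the aggregating T-norm, and the names `graded_inclusion`, `query`, `doc_a` and `doc_b` as well as the membership degrees are hypothetical.

```python
from typing import Callable, Dict

FuzzySet = Dict[str, float]  # term -> membership degree in [0, 1]

def goedel(a: float, b: float) -> float:
    """Goedel implication: fully satisfied when a <= b, otherwise b."""
    return 1.0 if a <= b else b

def lukasiewicz(a: float, b: float) -> float:
    """Lukasiewicz implication: min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)

def kleene_dienes(a: float, b: float) -> float:
    """Kleene-Dienes implication: max(1 - a, b)."""
    return max(1.0 - a, b)

def graded_inclusion(query: FuzzySet, doc: FuzzySet,
                     implication: Callable[[float, float], float]) -> float:
    """Degree to which `query` is included in `doc`.

    For each query term, the implication compares its importance in the
    query with its weight in the document; the per-term degrees are then
    aggregated with min (a T-norm). Terms absent from the document get
    weight 0. An empty query is included in any document (degree 1).
    """
    degrees = [implication(w, doc.get(term, 0.0)) for term, w in query.items()]
    return min(degrees) if degrees else 1.0

# Hypothetical weights, e.g. normalised term importances.
query = {"fuzzy": 1.0, "retrieval": 0.5}
doc_a = {"fuzzy": 0.8, "retrieval": 0.9}
doc_b = {"fuzzy": 0.3, "logic": 0.9}

for name, imp in [("Goedel", goedel), ("Lukasiewicz", lukasiewicz),
                  ("Kleene-Dienes", kleene_dienes)]:
    print(name, graded_inclusion(query, doc_a, imp),
          graded_inclusion(query, doc_b, imp))
```

With these toy weights, the Gödel implication gives a zero score to `doc_b` because one query term is missing, whereas the other two implications only lower its score; this kind of behavioural difference is one reason the choice of operator matters.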

From these encouraging results, several perspectives have been derived and will be addressed. The links between our fuzzy model and logical models in IR, and with language models in IR, are currently being studied. Among other perspectives, this graded-inclusion-based model gives new and theoretically grounded ways for users to easily weight their query terms, to include negative information in their queries, or to expand them with related terms.

