Section: New Results
Speech-to-Speech Translation and Language Modeling
Participants : Kamel Smaïli, David Langlois, Caroline Lavecchia, Sylvain Raybaud.
The objective of our team is to provide a complete speech-to-speech translation system. The results presented so far concern text-to-text translation:
-
Phrase-based machine translation. We continued our work on a phrase-based machine translation system. Last year, we retrieved the best phrases in one language, translated them using the concept of inter-lingual triggers, and selected the best ones with a simulated annealing algorithm [56]. This year, we systematically added n-grams of length 2, 3, and 4 together with their best inter-lingual triggers of length 2 or 3. This led to better results in terms of BLEU.
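As an illustration of the idea of inter-lingual triggers, the following minimal sketch ranks target n-grams for each source n-gram over a sentence-aligned corpus. The scoring formula (a mutual-information-style quantity estimated from sentence-level co-occurrence), the function names, and the toy corpus are assumptions made for this example; the exact formulation used in [56] is not given in this section.

```python
import math
from collections import Counter
from itertools import product


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def interlingual_triggers(pairs, src_n=2, tgt_n=2, top_k=3):
    """Rank the target n-grams triggered by each source n-gram.

    pairs: list of (source_tokens, target_tokens) aligned sentence pairs.
    Assumed score: P(s, t) * log(P(s, t) / (P(s) * P(t))), where each
    probability is the fraction of sentence pairs containing the n-gram(s).
    """
    total = len(pairs)
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_grams, t_grams = ngrams(src, src_n), ngrams(tgt, tgt_n)
        src_count.update(s_grams)
        tgt_count.update(t_grams)
        joint.update(product(s_grams, t_grams))

    scored = {}
    for (s, t), c in joint.items():
        p_st = c / total
        p_s, p_t = src_count[s] / total, tgt_count[t] / total
        scored.setdefault(s, []).append((p_st * math.log(p_st / (p_s * p_t)), t))

    # Keep the top_k target n-grams per source n-gram (ties broken alphabetically).
    return {
        s: [t for _, t in sorted(cands, key=lambda x: (-x[0], x[1]))[:top_k]]
        for s, cands in scored.items()
    }


# Hypothetical toy corpus, for illustration only.
PAIRS = [
    ("the white house".split(), "la maison blanche".split()),
    ("the white house is big".split(), "la maison blanche est grande".split()),
    ("the car is big".split(), "la voiture est grande".split()),
    ("a white car".split(), "une voiture blanche".split()),
]
TRIGGERS = interlingual_triggers(PAIRS, src_n=2, tgt_n=2, top_k=2)
```

In this sketch, `TRIGGERS[("white", "house")]` holds the two best-scoring target bigrams for that source bigram; a selection step such as simulated annealing would then operate on candidate lists of this kind.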
-
Confidence measures for machine translation. Machine translation inevitably produces errors. To estimate how much confidence can be placed in a given translation, we began last year to develop several confidence measures based on mutual information, n-gram language models, and lexical-feature language models. This year we continued in this direction, which led to three new publications and the following results: (i) the combination of our measures yields a classification error rate as low as 25.1% with an F-measure of 0.708 [29]; (ii) the introduction of another standard confidence measure (backward n-gram) [30]; (iii) the design of a method to automatically build corpora containing realistic errors [28] (errors are introduced into reference translations under the supervision of WordNet); (iv) the use of SVMs to combine the confidence measures: the combination outperforms our best single word-level confidence measure by 14% (absolute).
We are currently writing a journal paper on this work. Since the confidence measures show interesting discriminating power, we will next integrate them into a more general discriminative training process.
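The two figures of merit quoted above, classification error rate and F-measure, can be computed for word-level confidence scores as in the sketch below. It assumes that a word is classified as correct when its confidence reaches a threshold, and that the F-measure is computed on the "correct" class; both choices, as well as the function name and the sample values, are assumptions made for illustration and are not taken from [29].

```python
def evaluate_confidence(scores, labels, threshold):
    """Classify each word as correct when score >= threshold, then evaluate.

    scores: confidence values, one per translated word.
    labels: True when the word is actually a correct translation.
    Returns (classification error rate, F-measure on the "correct" class).
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and of equal length")

    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum(not p and l for p, l in zip(predicted, labels))

    cer = (fp + fn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return cer, f_measure


# Hypothetical scores and reference labels for five words.
CER, F = evaluate_confidence(
    scores=[0.9, 0.8, 0.4, 0.3, 0.7],
    labels=[True, True, False, True, False],
    threshold=0.5,
)
```

Sweeping the threshold over a development set and keeping the value with the lowest error rate is one simple way to calibrate a single measure before combining several of them.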
-
A decoder for machine translation. We developed a first version of a machine translation decoder based on genetic algorithms. Genetic algorithms make it possible to search a space composed of whole sentences. In the future, it will therefore be possible to use sentence-level confidence measures, or sentence-level evaluations such as syntactic correctness, to guide the search algorithm. This decoder now needs to be systematically tuned and used with state-of-the-art models (translation models, distortion models, etc.). This work started during a research Master 2 internship supported by the CPER/TALC/TATI operation (http://wikitalc.loria.fr/dokuwiki/doku.php?id=operations:tati).
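To make the notion of a search space of whole sentences concrete, here is a deliberately simplified sketch of a genetic search over complete candidate translations. It is not the decoder described above: it translates word for word without reordering, and the fitness function, the translation table, and the bigram bonuses are hypothetical stand-ins for real translation, distortion, and language models.

```python
import random


def fitness(sentence, source, trans_table, bigram_bonus):
    """Toy sentence-level score: lexical scores plus bonuses for known bigrams."""
    lexical = sum(trans_table[s][t] for s, t in zip(source, sentence))
    fluency = sum(bigram_bonus.get(pair, 0.0) for pair in zip(sentence, sentence[1:]))
    return lexical + fluency


def mutate(sentence, source, trans_table, rng):
    """Replace the word at one random position by a candidate translation."""
    child = list(sentence)
    i = rng.randrange(len(child))
    child[i] = rng.choice(list(trans_table[source[i]]))
    return child


def crossover(a, b, rng):
    """Single-point crossover between two sentences of equal length."""
    if len(a) < 2:
        return list(a)
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def ga_decode(source, trans_table, bigram_bonus, pop_size=20, generations=30, seed=0):
    """Evolve a population of whole candidate sentences and return the fittest."""
    rng = random.Random(seed)

    def score(sentence):
        return fitness(sentence, source, trans_table, bigram_bonus)

    population = [
        [rng.choice(list(trans_table[w])) for w in source] for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Elitism: the better half survives unchanged.
        population.sort(key=score, reverse=True)
        elite = population[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, rng), source, trans_table, rng))
        population = elite + children
    return max(population, key=score)


# Hypothetical models for a three-word source sentence.
SOURCE = ["le", "chat", "noir"]
TRANS_TABLE = {
    "le": {"the": 1.0, "it": 0.2},
    "chat": {"cat": 1.0, "chat": 0.3},
    "noir": {"black": 1.0, "dark": 0.6},
}
BIGRAM_BONUS = {("the", "cat"): 0.5, ("cat", "black"): 0.1}
BEST = ga_decode(SOURCE, TRANS_TABLE, BIGRAM_BONUS)
```

Because every individual is a full sentence, `fitness` could be swapped for any sentence-level criterion, such as a sentence-level confidence measure, without changing the search loop.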
-
Language modelling for Arabic. Still within the multilingual scope, but focusing on language modelling, we carried out work on Arabic. First, we used Multi-Category Support Vector Machines for topic identification [19]. Second, we studied how modelling choices (smoothing methods, n-gram order) differ between French and Arabic [26].
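The two modelling choices mentioned above, smoothing method and n-gram order, can be varied in an experiment along the lines of the sketch below, which measures perplexity for a given order. Additive (add-k) smoothing is used here only because it is short to write; the smoothing methods actually compared in [26] are not named in this section, and all function names are hypothetical.

```python
import math
from collections import Counter


def pad(tokens, order):
    """Add start symbols for the context and one end-of-sentence symbol."""
    return ["<s>"] * (order - 1) + list(tokens) + ["</s>"]


def train_ngram(sentences, order):
    """Count n-grams and their contexts; collect the predicted vocabulary."""
    ngram_counts, context_counts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = pad(tokens, order)
        vocab.update(list(tokens) + ["</s>"])
        for i in range(order - 1, len(padded)):
            context = tuple(padded[i - order + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts, vocab


def perplexity(sentences, model, order, k=1.0):
    """Perplexity of the sentences under an add-k smoothed n-gram model."""
    ngram_counts, context_counts, vocab = model
    v = len(vocab) + 1  # one extra outcome stands for all unseen words
    log_prob, n_tokens = 0.0, 0
    for tokens in sentences:
        padded = pad(tokens, order)
        for i in range(order - 1, len(padded)):
            context = tuple(padded[i - order + 1:i])
            count = ngram_counts[context + (padded[i],)]
            log_prob += math.log((count + k) / (context_counts[context] + k * v))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Hypothetical toy data: compare orders 1 to 3 on held-out text.
TRAIN = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["a", "cat", "eats"]]
HELD_OUT = [["the", "cat", "eats"]]
RESULTS = {n: perplexity(HELD_OUT, train_ngram(TRAIN, n), n) for n in (1, 2, 3)}
```

Running the same loop on French and Arabic corpora, with the smoothing method as a second parameter, would give one perplexity figure per language, order, and smoothing method.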
-
Multi-lingual summarization. In this work, we aim to provide a translation of the content of a document. To do so, we address both summarization and machine translation within the same framework. To avoid the difficulty of producing a genuinely syntactically correct summary of a document, we propose a graph representation of the document and a method that translates the nodes of the graph by taking their neighbours into account. Our first results encourage us to continue in this direction by defining a measure of the correctness of a graph with respect to the initial document. This work started during a research Master 2 internship.
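One possible reading of "translating the nodes by taking the neighbours into account" is sketched below: each node starts with its best context-free translation, and is then re-translated to maximise its own score plus an affinity with the current translations of its neighbours. The data structures, the affinity table, and the iterative update are all assumptions made for illustration; this section does not specify the actual method.

```python
def translate_graph(neighbours, candidates, affinity, max_rounds=10):
    """Pick one translation per node, favouring agreement with neighbours.

    neighbours: node -> iterable of adjacent nodes.
    candidates: node -> {translation: context-free score}.
    affinity:   {(translation_a, translation_b): bonus}, treated as symmetric.
    """
    def pair_bonus(a, b):
        return affinity.get((a, b), affinity.get((b, a), 0.0))

    # Start from the best translation of each node taken in isolation.
    current = {node: max(cands, key=cands.get) for node, cands in candidates.items()}

    for _ in range(max_rounds):
        changed = False
        for node, cands in candidates.items():
            def score(t):
                context = sum(pair_bonus(t, current[m]) for m in neighbours.get(node, ()))
                return cands[t] + context

            best = max(cands, key=score)
            if best != current[node]:
                current[node] = best
                changed = True
        if not changed:  # stable assignment reached
            break
    return current


# Hypothetical two-node graph: "avocat" is ambiguous, "tribunal" disambiguates it.
NEIGHBOURS = {"avocat": {"tribunal"}, "tribunal": {"avocat"}}
CANDIDATES = {
    "avocat": {"avocado": 0.6, "lawyer": 0.4},
    "tribunal": {"court": 0.9, "tribunal": 0.1},
}
AFFINITY = {("lawyer", "court"): 1.0}
TRANSLATED = translate_graph(NEIGHBOURS, CANDIDATES, AFFINITY)
```

In this toy case the isolated choice for "avocat" would be "avocado", and the neighbouring node "tribunal" shifts it to "lawyer".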