Section: Contracts and Grants with Industry
The RAPSODIS project is an INRIA “Action de Recherche Concertée” that started in 2008 and will end in 2009. It is led by the Parole team and further involves three other INRIA teams (Talaris in Nancy, and TexMex and Metiss in Rennes) as well as the CEA-LIST research team in Paris.
The main objective of this project is to study and analyze solutions for integrating syntactic and semantic information into speech recognition systems. The project thus defines a multidisciplinary framework that combines research from two main areas: automatic speech recognition and natural language processing, including syntactic parsing, semantic lexicons, and distributional semantics.
The members of this project have chosen to address two main challenges:
Design and computation of syntactic and semantic features that may prove useful for speech recognizers;
Integration of these features into state-of-the-art speech recognition systems.
The work carried out so far has mainly focused on two directions: exploring several types of syntactic and semantic information, such as dependency graphs derived from a rule-based syntactic parser and thematic information computed from Random Indexing frequency matrices gathered from the Web; and investigating the possibility of rescoring the n-best outputs of the speech recognizer with such information. One of the main difficulties is the lack of a syntactic parser for spoken French, as opposed to written French. The current research objectives concern the exploitation of stochastic syntactic parsers that can compute parsing probabilities for different sentences. A post-doctoral researcher was hired in October 2009 to improve the training of such parsers with active learning.
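As an illustration of the rescoring idea, the following sketch combines the recognizer score of each n-best hypothesis with syntactic and semantic feature scores in a log-linear fashion. The scores, weights, and example hypotheses are hypothetical and serve only to show the mechanism, not the project's actual system.

```python
# Illustrative sketch (hypothetical scores and weights): rescoring an
# ASR n-best list by log-linearly combining the recognizer score with
# syntactic and semantic feature scores.

def rescore_nbest(nbest, syn_weight=0.5, sem_weight=0.5):
    """nbest: list of (hypothesis, asr_score, syn_score, sem_score),
    all scores in log domain (higher is better)."""
    rescored = [
        (hyp, asr + syn_weight * syn + sem_weight * sem)
        for hyp, asr, syn, sem in nbest
    ]
    # Return the hypothesis with the best combined score.
    return max(rescored, key=lambda pair: pair[1])

# Hypothetical 2-best list: the second hypothesis has a slightly better
# acoustic score but is penalized by the semantic feature.
nbest = [
    ("il mange une pomme", -12.0, -1.0, -0.5),
    ("il mange une paume", -11.5, -3.0, -4.0),
]
best_hyp, best_score = rescore_nbest(nbest)
```

With these toy scores, the semantically plausible hypothesis wins despite its lower acoustic score, which is the intended effect of the rescoring step.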
More details can be found at http://rapsodis.loria.fr
Just as NIST in the USA organizes an annual evaluation of systems performing automatic transcription of radio and television broadcast news, the French association AFCP (Association Francophone de la Communication Parlée) has initiated such an evaluation for the French language, in collaboration with ELRA (European Language Resources Association) and DGA (Délégation Générale pour l'Armement). The ESTER (Évaluation des Systèmes de Transcriptions Enrichies des émissions Radiophoniques) project evaluates several tasks: segmentation (such as speech/music segmentation), speaker tracking, and orthographic transcription.
We have developed a fully automatic transcription system (Automatic News Transcription System: ANTS) containing a segmentation module (speech/music, broad/narrow band, male/female) and a large-vocabulary recognition engine (see section 5.1.7). The first evaluation was conducted in January 2005. The next one took place from November 2008 until March 2009. We participated in this Ester 2 evaluation campaign, organised by DGA and AFCP, for the broadcast news transcription task. The aim of the campaign was to evaluate automatic rich transcription systems for French radio broadcasts.
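The segmentation stage described above can be seen as a cascade of binary decisions applied to each audio segment before recognition. The sketch below is a simplified illustration of that structure; the classifiers are placeholders, not the models actually used in ANTS.

```python
# Illustrative sketch (hypothetical classifiers): the ANTS segmentation
# module as a cascade of binary decisions: speech/music, then
# broad/narrow band, then male/female for speech segments.

def segment_labels(segment, is_speech, is_narrowband, is_female):
    """Each is_* argument is a hypothetical binary classifier
    taking a segment and returning True/False."""
    if not is_speech(segment):
        return ("music",)
    band = "narrowband" if is_narrowband(segment) else "wideband"
    gender = "female" if is_female(segment) else "male"
    return ("speech", band, gender)

# Toy usage with constant stand-in classifiers:
labels = segment_labels("segment-01",
                        is_speech=lambda s: True,
                        is_narrowband=lambda s: False,
                        is_female=lambda s: True)
```

Each speech segment thus receives band and gender labels, which allows the recognition engine to select acoustic models matched to the segment's conditions.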
We presented the new version of our Automatic News Transcription System (ANTS), including HLDA and SAT implementations, at the ESTER 2009 workshop in Paris.
This contract, coordinated by Prof. Rudolph Sock from the Phonetic Institute of Strasbourg (IPS), addresses the exploitation of X-ray films recorded in Strasbourg in the eighties. Our contribution is the development of tools to process X-ray images in order to build an articulatory model.
This contract started in January 2009, in collaboration with LTCI (Paris), Gipsa-Lab (Grenoble), and IRIT (Toulouse). Its main purpose is the acoustic-to-articulatory inversion of speech signals. Unlike the European project ASPI, the approach followed in our group focuses on the use of standard spectral input data, i.e., cepstral vectors. The objective of the project is to develop a demonstrator enabling the inversion of speech signals in the domain of second-language learning.
This year the work has focused on the development of more appropriate articulatory models of the tongue, the development of a liftered distance for cepstral data, and the synthesis of speech signals from articulatory contours extracted from X-ray films.
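A liftered distance between cepstral vectors can be sketched as below. The sinusoidal lifter used here is a common choice in speech processing and is our assumption for illustration; the exact formulation developed in the project may differ.

```python
import math

# Illustrative sketch (assumed formulation): a liftered Euclidean
# distance between two cepstral vectors c1 and c2, weighting each
# coefficient n by a sinusoidal lifter
#   w[n] = 1 + (L/2) * sin(pi * n / L),
# a common choice in speech processing (the project's actual lifter
# may differ).

def liftered_cepstral_distance(c1, c2, L=22):
    assert len(c1) == len(c2)
    total = 0.0
    for n, (a, b) in enumerate(zip(c1, c2), start=1):
        w = 1.0 + (L / 2.0) * math.sin(math.pi * n / L)
        total += (w * (a - b)) ** 2
    return math.sqrt(total)
```

The lifter de-emphasizes the lowest and highest cepstral coefficients, so the distance is dominated by the mid-order coefficients that carry most of the spectral-envelope information.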
This ANR Jeunes Chercheurs project started in 2009, in collaboration with the Magrit group. The main purpose of ViSAC (Acoustic-Visual Speech Synthesis by Bimodal Unit Concatenation) is to propose a new approach to text-to-acoustic-visual speech synthesis that is able to animate a 3D talking head and to provide the associated acoustic speech. The main originality of this work is to consider the speech signal as bimodal, composed of two channels, acoustic and visual, that can be "viewed" from either facet. The key advantage is to guarantee that the redundancy between the two facets of speech, acknowledged as a determining perceptual factor, is preserved. An important expected result is a large bimodal speech corpus with high linguistic coverage, which will be used to build the acoustic-visual speech synthesis system and will allow an in-depth study of coarticulation.
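One way to picture bimodal unit concatenation is through a join cost that combines acoustic and visual discontinuities at unit boundaries, so that neither channel is optimized at the expense of the other. The sketch below is hypothetical: the feature representation, distance, and weighting are our assumptions, not ViSAC's actual cost functions.

```python
# Illustrative sketch (hypothetical features and weights): a bimodal
# join cost for unit concatenation, combining the acoustic and visual
# discontinuities at the boundary between two candidate units.

def bimodal_join_cost(unit_a, unit_b, alpha=0.5):
    """unit_a, unit_b: dicts with scalar 'acoustic' and 'visual'
    boundary features (placeholders for real feature vectors).
    alpha balances the two channels (0 = visual only, 1 = acoustic only)."""
    acoustic_gap = abs(unit_a["acoustic"] - unit_b["acoustic"])
    visual_gap = abs(unit_a["visual"] - unit_b["visual"])
    return alpha * acoustic_gap + (1.0 - alpha) * visual_gap

# Toy usage: two candidate units with mismatched visual boundaries.
cost = bimodal_join_cost({"acoustic": 1.0, "visual": 3.0},
                         {"acoustic": 2.0, "visual": 1.0})
```

Selecting units by minimizing such a joint cost is what keeps the two facets of speech coherent after concatenation.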
We have begun a collaboration with The Picture Factory on the indexing of rushes. The automatic transcription of the French dialogs contained in the rushes would enable automatic indexing of the rushes.
During the first phase of the project (until September 2009), we evaluated the contribution of automatic speech recognition to rush indexing in order to identify interesting research topics. More precisely, we used our Automatic News Transcription System, ANTS, to highlight the specific problems posed by the automatic transcription of rushes. The main problems are speech/non-speech segmentation, language identification, and spontaneous speech recognition. In the second phase, a PhD thesis funded by the project will begin, focusing on spontaneous speech recognition.