Section: New Results
Automatic Speech Recognition
Participants : Jun Cai, Christophe Cerisara, Dominique Fohr, Jean-Paul Haton, Irina Illina, Pavel Kral, David Langlois, Odile Mella, Kamel Smaïli, Frederick Stouten, Sébastien Demange, Frédéric Tantini, Christian Gillot.
Robustness of speech recognition
Robustness of speech recognition to multiple sources of speech variability is one of the most difficult challenges limiting the development of speech recognition technologies. We are actively contributing to this area through the development of the following advanced approaches:
Missing data recognition
The objective of Missing Data Recognition (MDR) is to handle “highly” non-stationary noises, such as musical noise or a background speaker. These kinds of noise can hardly be tackled by traditional adaptation techniques, like PMC. Two problems have to be solved: (i) find out which spectro-temporal coefficients are dominated by noise, and (ii) decode the speech sentence while taking into account this information about noise.
We published a journal paper  that summarizes our work on context-dependency modeling of missing data masks. The context considered here is the whole frequency band along with the preceding mask. The paper presents an extensive evaluation of our model on the noisy Aurora2 connected-digits and Aurora4 large-vocabulary speech recognition tasks. Furthermore, additional experimental results are given for concurrent speech, which is a very difficult task that has received specific attention from the missing data speech recognition community. The proposed models are analyzed both in terms of strengths and weaknesses, such as the dependency of mask models on the environment and the robustness of the mask clustering process.
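The marginalization idea at the core of missing data recognition can be sketched as follows. In this hypothetical numpy example (toy data, not the actual ANTS models), spectral channels flagged as noise-dominated by the mask are simply dropped from a diagonal-covariance GMM likelihood:

```python
import numpy as np

# Hypothetical sketch of missing-data marginalization for a
# diagonal-covariance GMM observation model: spectral channels flagged as
# noise-dominated by the mask are dropped from the likelihood computation.
def mdr_log_likelihood(frame, mask, means, variances, weights):
    """Log-likelihood of one spectral frame under a GMM, marginalizing
    out the channels whose mask entry is 0 (noise-dominated)."""
    reliable = mask.astype(bool)          # True where speech dominates
    x = frame[reliable]
    comp_ll = []
    for w, mu, var in zip(weights, means, variances):
        m, v = mu[reliable], var[reliable]
        ll = -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
        comp_ll.append(np.log(w) + ll)
    return np.logaddexp.reduce(comp_ll)

# Toy example: 4 spectral channels, 2-component GMM, channel 2 masked out.
frame = np.array([1.0, 0.5, 9.0, 0.2])    # channel 2 corrupted by noise
mask  = np.array([1, 1, 0, 1])            # 0 = noise-dominated
means = [np.zeros(4), np.ones(4)]
variances = [np.ones(4), np.ones(4)]
weights = [0.5, 0.5]
print(mdr_log_likelihood(frame, mask, means, variances, weights))
```

Dropping the corrupted channel keeps the clean-speech model plausible, whereas a full-band likelihood would be dominated by the noisy coefficient.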
Detection of Out-Of-Vocabulary words
One of the key problems for large vocabulary continuous speech recognition is the occurrence of speech segments that are not modeled by the knowledge sources of the system. An important type of such segments is the so-called Out-Of-Vocabulary (OOV) words (words that are not included in the lexicon of the recognizer). OOV words mostly yield more than one error in the transcription result because the error can propagate through the language model.
We have investigated, with Frederik Stouten, to what extent OOV words can be detected. For this purpose, we used a classifier that decides for each speech frame whether it belongs to an OOV word or not. The acoustic features for this classifier are derived from three recognition systems. The first one is a word recognizer constrained by the lexicon. This recognizer builds a word lattice which is used to calculate frame-based word posterior probabilities. The second system is a phone recognizer constrained by a grammar. This system was used to calculate approximations to the phoneme posteriors. The third system is a phoneme recognizer (a free phone loop) from which we extracted frame-based phoneme posterior probabilities. The difference between these probabilities is assumed to indicate speech frames that belong to words not included in the lexicon of the word recognizer.
On top of the acoustic features, we also used four language model features: the n-gram probability, the order of the n-gram used to compute the language model probability, the unigram probability of the current word, and a binary indicator that equals one if the word is preceded by a first name.
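The acoustic cue described above can be sketched as follows: when a frame belongs to an in-vocabulary word, the posteriors of the lexicon-constrained recognizer and of the free phone loop should roughly agree, while a large divergence hints at an OOV word. The posterior values and threshold below are toy numbers, not real recognizer output:

```python
import numpy as np

# Hypothetical sketch of a frame-level OOV cue: the KL divergence between
# the free phone-loop posteriors and the lexicon-constrained posteriors.
def oov_score(constrained_post, free_post, eps=1e-10):
    """Per-frame KL divergence between free-loop and constrained posteriors."""
    p = free_post + eps
    q = constrained_post + eps
    return np.sum(p * np.log(p / q), axis=-1)

def detect_oov_frames(constrained, free, threshold=0.5):
    """Flag frames whose posterior divergence exceeds a threshold."""
    return oov_score(constrained, free) > threshold

# Toy posteriors over 3 phonemes for 2 frames: frame 0 agrees, frame 1 diverges.
constrained = np.array([[0.8, 0.1, 0.1], [0.9, 0.05, 0.05]])
free        = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
print(detect_oov_frames(constrained, free))   # frame 1 should be flagged
```

In practice, the divergence score would be one feature among the acoustic and language model features feeding the actual classifier.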
The detection experiments were carried out on the ESTER corpus using the segmentation and transcription tool ANTS developed in our team. We evaluated the detection at the segment level. The detection results were represented as precision vs. recall curves (EER of 35%) .
Core recognition platform
Broadcast News Transcription
In the framework of the Technolangue project ESTER, we have developed a complete system, named ANTS, for French broadcast news transcription (see section 5.1.7 ).
Two versions of ANTS were implemented: the first one gives better accuracy but is slower (10 times real time), the second one is real-time (1 hour of processing for 1 hour of audio file).
This year, we included a tool for automatic speaker diarization (speaker segmentation and clustering): SELORIA (cf. 5.1.6 ). For acoustic features, we did not use first and second derivatives; instead, we concatenated the MFCC parameters of 9 consecutive frames and reduced the resulting vector to 40 dimensions using HLDA (Heteroscedastic Linear Discriminant Analysis). For acoustic models, in order to be more robust to speaker variability, we used SAT (Speaker Adaptive Training). We also increased the size of the lexicon, and for the language model we moved from a 3-gram to a 4-gram.
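The frame-splicing step of this feature pipeline can be sketched as follows. In this toy numpy example, 9 consecutive 13-dimensional MFCC frames are stacked and then projected down to 40 dimensions; the actual HLDA estimation is not reproduced, so a random matrix stands in for the learned projection:

```python
import numpy as np

# Hypothetical sketch of the splice-then-project feature pipeline: 9
# consecutive MFCC frames are concatenated, then reduced to 40 dimensions.
# A random matrix stands in for the HLDA projection learned on real data.
def splice_frames(mfcc, context=4):
    """Stack each frame with its +/- `context` neighbours (9 frames total)."""
    n_frames, _ = mfcc.shape
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + n_frames] for i in range(2 * context + 1)])

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 13))          # 100 frames of 13 MFCCs
spliced = splice_frames(mfcc)              # -> (100, 117)
hlda = rng.normal(size=(117, 40))          # placeholder for the HLDA matrix
features = spliced @ hlda                  # -> (100, 40)
print(spliced.shape, features.shape)
```

Splicing captures temporal context directly, which is why the explicit first and second derivatives become redundant.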
We integrated the “liaison” phenomenon into the recognition engine. We evaluated the effect of the number of acoustic models for phonemes or allophones which are acoustically close.
Moreover, we tried to integrate linguistic knowledge using the random indexing technique for computing “semantic” distances between words.
Furthermore, we rewrote the training scripts in order to take advantage of the new TALC cluster. The training of the acoustic models was sped up from one week to less than 8 hours.
We presented our new version of ANTS (with HLDA and SAT) at the ESTER 2009 workshop in Paris.
We addressed the speech/music segmentation problem using a new parameterization based on wavelets. We studied different wavelet decompositions of the audio signal (Daubechies, Coiflets, symlets), which allow a better analysis of non-stationary signals such as speech or music. We computed different types of energy in each frequency band. Results on an audio broadcast corpus showed a significant improvement over classical MFCC features for music/non-music segmentation .
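The per-band energy computation can be sketched as follows. For simplicity, this numpy example uses a single Haar filter pair in place of the Daubechies/Coiflet/symlet filter banks used in the actual experiments, and a synthetic tone in place of real audio:

```python
import numpy as np

# Minimal sketch of wavelet band energies for speech/music discrimination.
# A Haar decomposition stands in for the Daubechies/Coiflet/symlet wavelets
# of the actual experiments; the input is a toy synthetic frame.
def haar_step(x):
    """One level of the Haar transform: approximation and detail bands."""
    x = x[: len(x) // 2 * 2]
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def band_energies(signal, levels=3):
    """Energy of each detail band plus the final approximation band."""
    energies = []
    approx = signal
    for _ in range(levels):
        approx, detail = haar_step(approx)
        energies.append(np.sum(detail ** 2))
    energies.append(np.sum(approx ** 2))
    return np.array(energies)

t = np.linspace(0, 1, 512, endpoint=False)
frame = np.sin(2 * np.pi * 440 * t)        # toy 440 Hz tone frame
print(band_energies(frame))
```

The resulting band-energy vector, computed frame by frame, would feed the speech/music classifier in place of (or alongside) MFCCs.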
Speech and text alignment is an old research area that can be considered as solved in constrained situations (relatively clean speech, limited-size audio streams). However, we started the ALIGNE project in 2008 (see section 7.2.2 ) to answer a request from linguist researchers, who need to align long and noisy speech corpora with independent manual transcriptions. In contrast with recent state-of-the-art solutions to this problem, which basically compute distant anchors automatically with a large vocabulary speech transcription system, we have focused our work on the interactive control of the automatic algorithms by the user. Our objective is thus to help the user work with semi-automatic algorithms rather than completely unsupervised batch processing. A Master internship (Josselin Pierre) contributed in 2008 to the implementation of the first release of the JTrans software (see section 5.1.9 ). A first set of evaluations was performed in 2009 by linguist researchers from the ATILF laboratory. The results of this user evaluation showed some usability and related speed issues. We have then designed and proposed a new interaction paradigm, which is largely inspired by the reference Transcriber software  , but which extends the functionalities proposed by Transcriber to a great extent, thanks to the smooth integration of semi-automatic alignment algorithms. The novel interaction model largely improves the alignment accuracy while further reducing the processing time and cognitive effort as compared to Transcriber  . In addition, another Master internship (Jean-René Courtault) greatly improved the phonetizer of JTrans in 2009 by implementing a classifier-based grapheme-to-phoneme converter for out-of-vocabulary words.
More generally, the proposed JTrans software compares favorably to the other existing software for text and speech alignment, as it is the only one that integrates semi-automatic algorithms within an application GUI and proposes a smooth integration paradigm to reliably align corpora faster than real time.
Integration of linguistic information in speech recognition
One of the most striking weaknesses of today's speech recognition systems is their total lack of understanding capability, whereas everybody agrees that human processing of speech is largely guided by the semantic content of speech and by what can be understood from it. A promising research area is then to investigate new research directions in the integration of higher-level information, typically related to syntax and semantics, into the speech decoding process.
The first type of information we have been interested in is dialog acts. Dialog acts represent the meaning of an utterance at the level of illocutionary force. This can be interpreted as the role of an utterance (or part of it) in the course of a dialog, such as a statement or a question. The objective of our work is to automatically identify dialog acts from the user's speech signal. This is realized by considering both prosodic and lexical cues, and by training discriminant models that exploit these cues. This work, which began with Pavel Kral's PhD thesis, has recently been summarized in  .
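The combination of prosodic and lexical cues can be illustrated with a deliberately simplified sketch. The cue values, question-word list, and decision rule below are illustrative placeholders, not the trained discriminant models used in the actual work:

```python
import numpy as np

# Toy sketch of combining a prosodic cue (final pitch slope) with a lexical
# cue (an initial question word) to label an utterance as question vs.
# statement. All values and the word list are illustrative assumptions.
QUESTION_WORDS = {"qui", "que", "quoi", "comment", "pourquoi", "est-ce"}

def dialog_act(words, f0_contour):
    """Return 'question' or 'statement' from two hedged cues."""
    # Prosodic cue: slope of the pitch contour over the last few frames.
    tail = np.asarray(f0_contour[-5:], dtype=float)
    slope = np.polyfit(np.arange(len(tail)), tail, 1)[0]
    prosodic_vote = slope > 0              # rising pitch suggests a question
    # Lexical cue: does the utterance start with a question word?
    lexical_vote = words[0].lower() in QUESTION_WORDS
    return "question" if (prosodic_vote or lexical_vote) else "statement"

print(dialog_act(["comment", "vas", "tu"], [180, 185, 190, 200, 210]))
print(dialog_act(["il", "fait", "beau"], [200, 195, 190, 185, 180]))
```

A real system replaces the hand-written rule with a discriminant model trained on both cue streams.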
The second type of information we have investigated is the topic of the speech, which is a coarse semantic information related to the discourse thematics. In the past, research on thematic recognition had already been carried out in the team, for instance in Armelle Brun's Ph.D. thesis  . However, the current work differs from these previous studies because it addresses the specific case of speech input without any explicit linguistic or textual knowledge. The main advantage of this approach is its portability and its independence from the language. The basic principle proposed here consists first in extracting acoustic repetitions from the speech stream: the most frequent of these repetitions are then associated with a lexical entry. Then, the distributional hypothesis is applied to cluster the lexical entries into a hierarchy of clusters that are associated with the main thematics discussed in the corpus, leading to the building of a semantic lexicon. A system implementing this approach has been evaluated on two very different tasks, without any adaptation to the task, in order to show the robustness of the system that results from the lack of initial constraints. The first task is spontaneous telephone speech from the OGI corpus, while the second task is French broadcast news transcription. These experiments are described in detail in  .
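The distributional step can be sketched as follows. In this toy numpy example (hypothetical documents and entries, not the actual acoustic repetitions), each lexical entry is represented by its document co-occurrence vector, and cosine similarity provides the distance on which the hierarchical clustering into thematics can be built:

```python
import numpy as np

# Toy sketch of the distributional hypothesis: entries that occur in the
# same contexts get similar co-occurrence vectors, so a similarity measure
# on those vectors groups them into topic-like clusters. Data is invented.
def cooccurrence_vectors(docs, vocab):
    """One row per vocabulary entry, counting occurrences per document."""
    vecs = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for i, word in enumerate(vocab):
            vecs[i, j] = doc.count(word)
    return vecs

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

docs = [["match", "goal", "team"], ["goal", "team", "score"],
        ["vote", "election", "party"], ["party", "vote", "law"]]
vocab = ["goal", "team", "vote", "party"]
vecs = cooccurrence_vectors(docs, vocab)
# "goal"/"team" share sports contexts; "vote"/"party" share politics contexts.
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

An agglomerative clustering over these similarities would then yield the hierarchy of thematic clusters mentioned above.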
The third work regarding the computation of syntactic and semantic information for speech recognition has taken place within the context of the INRIA ARC RAPSODIS project, which began in February 2008. This project is described in section 7.3.1 : it is a place of collaboration between specialists of different domains (lexical semantics, computational linguistics, speech recognition, ...). In the PAROLE team, we have in particular investigated two aspects, respectively concerning lexical semantics and syntactic parsing.
For lexical semantics, we have based our work on the distributional hypothesis, which assumes that the meaning of a word can be deduced from its usage in context. We have collaborated on these aspects with the CEA-LIST team in Paris, which is specialized in the related aspects of Latent Semantic Analysis. We have hence exploited Random Indexing approaches, which are incremental dimensionality reduction techniques, based on the Johnson-Lindenstrauss theorem  , that support very large corpora. We have used this approach to process the “Le Monde” corpus and derive semantic distances between words, which have an interesting potential for several future research works that may need to generalize lexical models without falling back on the (too) broad part-of-speech tags. Language models or syntactic analysis are such potential applications.
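The Random Indexing scheme can be sketched as follows. In this minimal numpy example (toy corpus and parameters, not the “Le Monde” setup), each word gets a fixed sparse random index vector, and a word's semantic vector is accumulated incrementally as the sum of the index vectors of its context words, so the dimensionality never grows with the corpus:

```python
import numpy as np

# Minimal Random Indexing sketch: sparse ternary index vectors are assigned
# once per word; context vectors are accumulated incrementally. Dimension
# and corpus are toy values chosen for illustration.
DIM, NONZERO = 64, 4
rng = np.random.default_rng(42)

def index_vector():
    """Sparse ternary random vector with a few +1/-1 entries."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

corpus = [["cat", "purrs"], ["dog", "barks"], ["cat", "meows"],
          ["kitten", "purrs"], ["kitten", "meows"]]
index, context = {}, {}
for sentence in corpus:
    for w in sentence:
        index.setdefault(w, index_vector())
        context.setdefault(w, np.zeros(DIM))
    for w in sentence:
        for c in sentence:
            if c != w:
                context[w] += index[c]   # incremental accumulation

def similarity(a, b):
    u, v = context[a], context[b]
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# "cat" and "kitten" share contexts (purrs, meows), so they end up close.
print(similarity("cat", "kitten"), similarity("cat", "dog"))
```

Because the index vectors are near-orthogonal in high dimension (the Johnson-Lindenstrauss argument), similarities between context vectors approximate the co-occurrence-based semantic distances without ever building the full co-occurrence matrix.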
Regarding syntactic parsing, we decided in 2009 to invest a large amount of effort in the development of a new syntactic parser dedicated to transcribed speech. This objective was motivated by the lack of existing parsing solutions for erroneously transcribed speech, and by the very important requirement of exploiting such a parser in order, in the long term, to be able to compute semantic information beyond lexical semantics. Before taking this decision, we had spent more than one year trying to use existing parsers such as Syntex for this purpose, but none of them was efficient and adaptable enough. We have then strengthened our collaboration with the TALARIS team, which is specialized in computational linguistics, to design and develop such a parser. The first step, achieved in 2009, mainly consisted in focusing on the problem of parsing oral speech with stochastic dependency parsers, in order to more easily adapt and integrate the parsing model within our own stochastic framework. We further decided to participate in the French syntactic parsing PASSAGE evaluation campaign organized in November 2009 (http://atoll.inria.fr/passage/eval2.en.html ). The joint paper  describes the resulting JSynATS parser. Other details can also be found in section 7.3.1 .
All these works are described in detail in the 2009 RAPSODIS project report available on the project web site (http://rapsodis.loria.fr ).