Section: New Results
Keywords: audio segmentation, statistical hypothesis testing, speech recognition, audio and multimodal structuring, broadcast news indexing, rich transcription, audiovisual integration, multimedia.
Audio analysis and structuring for multimedia indexing and information extraction
Automatic speech recognition with broad phonetic landmarks
HMM-based automatic speech recognition can hardly accommodate prior knowledge about the signal, apart from the definition of the topology of the phone-based elementary HMMs. We proposed a framework to incorporate such knowledge as constraints on the best path in a Viterbi-based decoder. As HMM-based speech recognition can be viewed as the search for the best path in a trellis, knowledge of the broad phonetic content of the signal can be used to prune (or at least penalize) those paths which are inconsistent with the available prior knowledge. We refer to those places where prior information on the phonetic content of the signal is available as landmarks. From a theoretical point of view, this can be seen as decoding with non-stationary Markov models where the transition probabilities vary over time depending on the presence or absence of landmarks. In practice, a confidence measure associated with automatically detected landmarks can be used to penalize transitions according to the confidence in the landmark: the lower the confidence, the lower the penalty.
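As an illustration, the confidence-weighted penalization can be sketched as a small modification of Viterbi decoding. The function below, its landmark representation (frame index mapped to allowed states and a confidence) and the penalty weight are illustrative assumptions, not the actual system implementation:

```python
import numpy as np

def landmark_viterbi(log_obs, log_trans, log_init, landmarks, penalty_weight=5.0):
    """Viterbi decoding with landmark-based transition penalties.

    log_obs: (T, N) log observation likelihoods for N states.
    log_trans: (N, N) log transition probabilities.
    landmarks: dict mapping frame t -> (allowed_states, confidence in [0, 1]).
    States inconsistent with a landmark are penalized in proportion to the
    landmark confidence (soft pruning rather than hard rejection).
    """
    T, N = log_obs.shape
    delta = np.full((T, N), -np.inf)
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
        if t in landmarks:
            allowed, conf = landmarks[t]
            mask = np.ones(N, dtype=bool)
            mask[list(allowed)] = False
            # The lower the confidence, the lower the penalty.
            delta[t, mask] -= penalty_weight * conf
    # Backtrack the best path.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

With an empty landmark dictionary this reduces to standard Viterbi decoding; a high-confidence landmark pulls the best path through the allowed states at that frame.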
We carried out a preliminary study to determine whether local knowledge of the broad phonetic content of the signal can benefit the transcription system. The aim of the study was twofold: (i) validate the proposed approach to integrating landmarks in a statistical decoder and (ii) measure the benefit of broad phonetic landmarks. Broad phonetic landmarks of different types (vowels, fricatives, glides, etc.) were extracted from a reference phonetic alignment of the waveform. Experimental results on broadcast news transcription show that each type of landmark brings a small improvement to the system. Using all the landmarks simultaneously yielded a considerable improvement of the transcription system along with faster decoding, even when landmarks actually cover only a small portion of the actual phone. This last result indicates that precisely detecting the landmark boundaries is not required. Finally, simulated missed-detection errors showed that the performance gain scales linearly with the amount of detected landmarks  .
These encouraging results motivate future work on the automatic detection of broad phonetic landmarks in order to validate these ideas in a realistic application framework. Application of this decision fusion scheme to audiovisual speech recognition is also foreseen, where the visual modality can provide knowledge about the phonetic content to the audio modality.
Speech transcription with part-of-speech tagging
Automatic speech recognition (ASR) systems aim at generating a textual transcription of a spoken document, usually for further analysis of the transcription with natural language processing (NLP) techniques. However, most current ASR systems solely rely on statistical methods and seldom use linguistic knowledge. The thesis of Stéphane Huet, in collaboration with the Tex-Mex project, aims at using NLP techniques to improve the transcription of spoken documents.
In 2006, the work of Stéphane Huet focused on incorporating linguistic knowledge for the rescoring of N-best sentence hypothesis lists. After a survey of the use of linguistic knowledge in ASR systems  , we investigated N-best list rescoring using part-of-speech (PoS) information. We first demonstrated that N-class based PoS taggers are robust to the specificities of spoken document transcriptions (lack of punctuation, no case in our ASR output, breath groups instead of sentences, ASR errors). In particular, PoS taggers are robust to transcription errors, since they mostly rely on local decisions and many words are unambiguous. We then showed that PoS information can be used to detect and correct transcription errors  . Together, these two results enable the use of PoS taggers to rescore a list of sentence hypotheses based on a score combining acoustic, language and syntactic (PoS) information. The combined score was used in conjunction with several rescoring schemes, namely maximum a posteriori, minimum expected word error rate and consensus decoding, to rerank lists of 100 sentence hypotheses, with a decrease of about 1 % of the word error rate in all cases  . Moreover, the resulting transcription exhibits, in most cases, a better grammatical structure, which is reflected by a decrease of the sentence error rate.
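A minimal sketch of the maximum a posteriori variant of this rescoring, assuming each hypothesis carries log-domain acoustic, language-model and PoS-tagger scores; the field names and weights below are illustrative placeholders, in practice tuned on held-out data:

```python
def rescore_nbest(hypotheses, lm_weight=10.0, pos_weight=5.0):
    """MAP rescoring of an N-best list with a combined score.

    Each hypothesis is a dict with log-domain scores:
      'acoustic', 'lm', and 'pos' (e.g. the log-probability of the best
      PoS tag sequence produced by an N-class tagger).
    Returns the hypothesis maximizing the weighted combination.
    """
    def combined(h):
        return h['acoustic'] + lm_weight * h['lm'] + pos_weight * h['pos']
    return max(hypotheses, key=combined)
```

Setting `pos_weight` to zero recovers the usual acoustic-plus-language-model ranking, which makes the contribution of the syntactic score easy to isolate.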
The corresponding algorithms were implemented in our Sirocco software and incorporated in our spoken document analysis platform Irene.
Future work on this topic includes exploiting the enhanced transcription along with PoS tags to segment the text into topically coherent stories characterized by automatically extracted keywords (which can in turn be used to adapt the vocabulary and the language model). The use of syntactic information for confidence measures is also a foreseen continuation of this work.
Multimodal segment models for video analysis
Participant : Guillaume Gravier.
This section and the next one describe joint work with the Tex-Mex project, carried out in the framework of the Ph. D. thesis of E. Delakis  under the supervision of Guillaume Gravier and Patrick Gros (Tex-Mex).
In a previous work  , we investigated the use of hidden Markov models (HMM) for the integration of audio and visual cues, applied to tennis video structuring. This work clearly demonstrated the potential of an HMM approach but also outlined the two main limitations of HMMs for such a task: the synchronization at the shot level of the descriptors of each media and the same underlying model for both modalities.
Motivated by this need for more efficient multimodal representations, we previously proposed the use of segmental features in the framework of Segment Models (SM)  ,  , instead of the frame-based features of hidden Markov models. Considering each scene of the video as a segment, the synchronization points between the modalities are extended to the scene boundaries and a scene duration model is added. Conditionally on a scene, the sequences of visual and auditory features are considered independent, and different models can be used for each. This year, we studied various models for the auditory feature sequences, including a model based on cepstral coefficients that avoids the error-prone step of tracking events such as "ball hits" and "applause" in the soundtrack. Segment models yielded better performance than HMMs, mainly due to the scene duration model. Asynchronous audio-visual fusion at the scene level yielded no improvement over synchronous fusion with SMs. This result is most probably due to the fact that strong correlations between visual and audio features at the scene level are disregarded in the asynchronous fusion scheme. Combining asynchronous and synchronous fusion resulted in a small performance gain  .
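The segment-level scoring described above can be sketched as follows; the per-scene bundle of duration, visual and audio scorers is a hypothetical simplification of the actual system, shown only to make the conditional-independence assumption concrete:

```python
def segment_log_score(scene_label, visual_feats, audio_feats, models):
    """Score one candidate scene segment under a segment model.

    Conditionally on the scene label, the visual and auditory feature
    sequences are treated as independent, so their log-likelihoods simply
    add, together with an explicit scene duration model.
    models[scene_label] bundles three illustrative scoring functions.
    """
    m = models[scene_label]
    duration = len(visual_feats)            # segment duration (e.g. in shots)
    return (m['duration'](duration)         # explicit scene duration model
            + m['visual'](visual_feats)     # e.g. an HMM forward score
            + m['audio'](audio_feats))      # e.g. a cepstral GMM score
```

In a full SM decoder this score is evaluated inside a dynamic program over candidate scene boundaries, so the two streams only resynchronize at those boundaries.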
Finally, we also explored a hybrid SM-ANN approach using recurrent neural networks (RNN) as segmental scorers. To this end, the recently introduced Long Short-Term Memory (LSTM) topology was favorably compared to BPTT-trained RNNs and used in the hybrid model. The hybrid model performed visibly worse than the standard SM, though its performance remains promising. The difference stems from the fact that HMM-based segmental scorers can embed prior knowledge of the task directly in their topology, while the LSTM scorers were built from scratch  .
These results illustrate the increased flexibility of SMs with respect to HMMs. However, the hypothesis of independence between the information streams is a clear limitation of the SM approach for multimodal integration. Exploiting the dynamic Bayesian network framework to overcome this limitation and to relax the synchronization constraints is part of new work following the Ph. D. of Emmanouil Delakis.
Score-oriented Viterbi search for sports audio and video analysis
Participant : Guillaume Gravier.
In sports videos, score announcements are often displayed on screen, giving valuable information on the high-level semantics of the video. However, current models can hardly accommodate such an information stream, as it is highly synchronous and sparse (i.e., not always displayed). In tennis videos, using the presence of a displayed score as an extra feature, at the shot level in hidden Markov models or at the scene level in segment models, results in only a marginal performance improvement. The reason is that score announcements may appear with a long delay, or not at all, after a game event. The probability distributions of this feature thus become almost uniform, i.e., they carry no useful information.
Instead, we studied a new decoding algorithm, the score-oriented Viterbi search, in order to fully exploit the semantic content of the score announcements. This algorithm finds the most likely path consistent with the score announcements, at the expense of a computational overhead slightly higher than that of standard Viterbi decoding, for both HMMs and SMs. The key idea is to perform a cascade of local optimizations, penalizing local paths that are inconsistent with the number of points scored between two score announcements. Experimental results on our tennis video corpus demonstrated a significant performance improvement with both HMMs and SMs  .
The scope of the proposed algorithm is not limited to tennis: it extends to any constraint that can be formulated as "there are n events of a given kind between two instants a and b". In the particular case of tennis, the events are the points scored and the instants are the consecutive instants at which a score is displayed.
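Such count constraints can be illustrated as a soft penalty on candidate paths, a simplified post-hoc view of the cascade of local optimizations; the function and its weight below are hypothetical, not the published algorithm:

```python
def constraint_penalty(path, event_state, checkpoints, weight=10.0):
    """Penalty of a decoded state path against count constraints.

    checkpoints: list of (a, b, n) triples meaning "exactly n occurrences
    of event_state between instants a and b" (e.g. points scored between
    two consecutive score announcements). Violating paths are penalized
    rather than pruned, in the spirit of the score-oriented search.
    """
    penalty = 0.0
    for a, b, n in checkpoints:
        count = sum(1 for s in path[a:b] if s == event_state)
        penalty += weight * abs(count - n)
    return penalty
```

A path fully consistent with all announcements incurs zero penalty, so adding this term to the path score leaves consistent paths unchanged.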
Statistical models of music
By analogy with speech recognition, which benefits greatly from statistical language models, we hypothesize that music description, recognition and transcription can strongly benefit from music models that express dependencies between notes within a music piece, arising from melodic patterns and harmonic rules.
To this end, we are investigating the approximate modeling of syntactic and paradigmatic properties of music, through the use of n-gram models of notes, successions of notes and combinations of notes.
In practice, we consider a corpus of MIDI files from which we learn co-occurrences of concurrent and consecutive notes, and we use these statistics to cluster music pieces into classes of models and to measure the predictability of notes within a class. Preliminary experiments have shown promising results, which are currently being consolidated. Bayesian networks will also be investigated.
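A minimal sketch of such a note model, here a bigram over MIDI pitch numbers with add-one smoothing; the class and its smoothing scheme are illustrative assumptions, not the actual models under study:

```python
from collections import Counter
import math

class NoteBigramModel:
    """Bigram model over note sequences with add-one smoothing.

    Trained on sequences of MIDI pitch numbers; predictability of a
    sequence under the model is measured by its perplexity.
    """
    def __init__(self):
        self.bigrams = Counter()
        self.unigrams = Counter()
        self.vocab = set()

    def train(self, sequences):
        for seq in sequences:
            for prev, cur in zip(seq, seq[1:]):
                self.bigrams[(prev, cur)] += 1
                self.unigrams[prev] += 1
                self.vocab.update((prev, cur))

    def log_prob(self, prev, cur):
        # Add-one smoothing so unseen successions keep non-zero probability.
        v = len(self.vocab)
        return math.log((self.bigrams[(prev, cur)] + 1) /
                        (self.unigrams[prev] + v))

    def perplexity(self, seq):
        lp = sum(self.log_prob(p, c) for p, c in zip(seq, seq[1:]))
        return math.exp(-lp / (len(seq) - 1))
```

Per-class perplexity can then serve both as a clustering criterion (assign a piece to the class whose model predicts it best) and as a predictability measure within a class.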
In the longer term, the model is intended to be used as a complement to source separation and acoustic decoding, forming a consistent framework that embeds signal processing techniques, acoustic knowledge sources and music rule modeling.