Section: New Results
Keywords : audio segmentation, statistical hypothesis testing, speech recognition, audio and multimodal structuring, broadcast news indexing, rich transcription, audiovisual integration, multimedia.
Audio indexation and information extraction
Speaker tracking and turn segmentation
In various applications, there is a need for both speaker turn detection and speaker tracking. So far, these tasks have been implemented separately, though they are clearly related. Indeed, grouping together segments from the same speaker before speaker tracking should help make better decisions, based on a larger amount of data. Conversely, prior information on speakers can benefit speaker turn segmentation, for example by grouping together segments identified as belonging to the same known speaker.
We investigated the possible interactions between speaker tracking and speaker turn detection in the framework of broadcast news indexing. We observed that speaker tracking did not benefit from a prior clustering step, mostly because the speaker verification system is robust enough to the limited amount of data per segment. On the other hand, the performance of the speaker turn detection algorithm can be greatly improved by knowledge of some of the speakers, obtained by applying a speaker tracking algorithm before clustering segments. We also observed that performance is unaffected when none of the known speakers are present.
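Speaker turn detection is commonly cast as a statistical hypothesis test: for each candidate change point, one compares a single-speaker model of a window of acoustic frames against a two-speaker model, typically with the Bayesian Information Criterion (BIC). The following sketch illustrates this general technique, not the exact detector used in the experiments; the function name, the Gaussian modeling of frames, and the penalty weight are all illustrative assumptions.

```python
import numpy as np

def bic_change_score(X, t, penalty=1.0):
    """Delta-BIC for a speaker change at frame t within window X (frames x dims).

    Compares H0 (one full-covariance Gaussian for the whole window) against
    H1 (one Gaussian per side of t).  A positive score supports a turn at t.
    The penalty weight is an illustrative tuning parameter.
    """
    n, d = X.shape

    def logdet_cov(Y):
        # Log-determinant of the sample covariance, regularised for stability.
        cov = np.cov(Y, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    ll_full = n * logdet_cov(X)
    ll_split = t * logdet_cov(X[:t]) + (n - t) * logdet_cov(X[t:])
    # Complexity penalty: the extra mean and covariance parameters of H1.
    p = penalty * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (ll_full - ll_split) - p
```

In a full detector, this score would be computed over a sliding window and a turn hypothesised wherever it peaks above zero.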
Future work includes the use of transcripts for speaker turn detection and tracking. For example, word usage and knowledge of the phonetic content of a segment are valuable pieces of information for speaker-related tasks that we plan to explore.
Part of speech tagging for multiple hypothesis speech transcription rescoring
In the framework of the Ph.D. of Stéphane Huet (joint PhD with the Tex-Mex team), we are studying the relations and interactions between natural language processing (NLP) techniques and automatic speech recognition (ASR) techniques.
The first step of NLP processing often consists of part-of-speech (POS) tagging. The aim of a POS tagger is to tag each word of a text with morphological and syntactic information concerning gender, number and grammatical function (article, noun, verb, etc.). POS taggers are primarily designed to work on written texts free of grammatical errors and with punctuation marks. We therefore investigated the robustness of various POS taggers on ASR transcripts, which typically contain errors and lack punctuation. A qualitative analysis of various taggers, either rule-based or statistical, showed that POS taggers are robust to transcription errors and missing punctuation, since they mostly rely on local decisions.
This robustness makes the use of POS taggers practical for improving transcription. Indeed, one of the most common sources of errors in the automatic transcription of French speech is agreement in gender and number, mostly due to the mute 'e' and 's' in French. To deal with this problem, we investigated the use of POS taggers to rescore N-best lists. A statistical POS tagger was used to score the sequence of tags in a sentence, and the N-best transcriptions were then rescored based on the POS score combined with the acoustic and linguistic ones. This first approach proved effective in correcting some errors but introduced new ones, actually increasing the error rate. We are currently investigating other ways to use POS tags to improve speech transcription.
This work was carried out in collaboration with Pascale Sébillot (IRISA, Tex-Mex).
Audio and audio-visual structuring of sports programmes
The problem of sports video analysis has so far been addressed mainly from the image point of view. Based on our previous work on the extraction of audio information, we investigated how the latter can be combined with visual information in order to automatically structure sports videos.
Previous work by Ewa Kijak, based on HMMs, demonstrated the potential of the Markovian formalism to integrate multimodal (sound and image) information as well as prior structural knowledge. However, this work also demonstrated the limits of a formalism where a single observation is associated with each state. Due to this constraint, the analysis of the different media must be synchronised so that the sequences of descriptors are sampled at exactly the same rate for each media stream.
To overcome this problem, we investigated segment models (SMs), whose principle is to associate a sequence of observations, called a segment, with each state of the Markov process. Such models were originally proposed for speech modeling. In this framework, a state corresponds to a semantic event with its own duration, modeled at the state level, and with an associated model used to compute the probability of a sequence of observations.
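The decoding problem for a segment model can be sketched as a Viterbi-style search over segmentations: each state consumes a variable-length segment of observations, scored by its own emission and duration models. The sketch below assumes a fixed left-to-right state sequence (a structural prior, as for a match grammar); the function names and the toy scoring functions in the test are illustrative, not the models used in the study.

```python
import math

def best_segmentation(obs, states, seg_logprob, dur_logprob, max_dur=50):
    """Viterbi-style dynamic programming for a segment model.

    obs: list of observations; states: the ordered state sequence to align.
    seg_logprob(state, segment) and dur_logprob(state, length) are
    user-supplied models (e.g. an HMM per state, a duration model per state).
    Returns (best log-probability, list of segment end positions).
    """
    T, K = len(obs), len(states)
    NEG = -math.inf
    # dp[k][t]: best log-prob of aligning states[:k] with obs[:t].
    dp = [[NEG] * (T + 1) for _ in range(K + 1)]
    back = [[0] * (T + 1) for _ in range(K + 1)]
    dp[0][0] = 0.0
    for k in range(1, K + 1):
        for t in range(k, T + 1):
            for d in range(1, min(max_dur, t) + 1):  # candidate durations
                prev = dp[k - 1][t - d]
                if prev == NEG:
                    continue
                s = (prev + dur_logprob(states[k - 1], d)
                          + seg_logprob(states[k - 1], obs[t - d:t]))
                if s > dp[k][t]:
                    dp[k][t] = s
                    back[k][t] = t - d
    # Recover the segment boundaries by backtracking.
    bounds, t = [], T
    for k in range(K, 0, -1):
        bounds.append(t)
        t = back[k][t]
    return dp[K][T], bounds[::-1]
```

Because each state scores a whole segment, audio and visual descriptor streams need not be sampled at the same rate: each stream can contribute its own sequence-level score per state.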
This framework was exploited for multimodal tennis video structuring with several, possibly asynchronous, sequences of observations per state. The state-conditional probability of a sequence of visual descriptors is given by an HMM, as in our previous work. We investigated various ways of modeling the audio stream in this framework. In the HMM framework, audio information resulted from the segmentation of the audio track into broad sound classes (speech, applause, ball hits, etc.) and was incorporated as a shot descriptor, thus losing the dynamics of this information.
With segment models, we were able to model the audio events independently and to take their dynamics into account. Since broad class segmentation is error prone, we also investigated direct modeling of low-level audio features to get rid of the audio segmentation step. We also explored the incorporation of match scores (extracted from on-screen overlays) in the segment model framework.
All the segment model-based approaches yielded significant improvements over the conventional HMM approach. Experimental results also showed that SMs on top of the broad sound class segmentation outperform SMs with low-level cepstral audio features, the latter relying on data that are too variable. However, we demonstrated that a simple audio frame pre-classifier can achieve the same level of performance as a full broad sound class segmentation.
The work on tennis videos is a joint work with Tex-Mex and is the prime focus of the PhD of Manolis Delakis, co-directed by Patrick Gros (Tex-Mex), Pascale Sébillot (Tex-Mex) and Guillaume Gravier (Metiss).
Statistical models of music
Keywords : musical description, statistical models.
By analogy with speech recognition, which benefits greatly from statistical language models, we hypothesize that music description, recognition and transcription can strongly benefit from music models that express dependencies between notes within a music piece, arising from melodic patterns and harmonic rules.
To this end, we are investigating the approximate modeling of syntactic and paradigmatic properties of music, through the use of n-gram models of notes, successions of notes and combinations of notes.
In practice, we consider a corpus of MIDI files from which we learn co-occurrences of concurrent and consecutive notes, and we use these statistics to cluster music pieces into classes of models and to measure the predictability of notes within a class. Preliminary experiments have shown promising results that are currently being consolidated. Bayesian networks will also be investigated.
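A minimal version of such a note model is a bigram over consecutive MIDI note numbers, with the predictability of a sequence measured by its average surprisal under the model. The sketch below is illustrative: the add-one style smoothing, the function names and the 128-note alphabet are assumptions, not the choices made in the study.

```python
import math
from collections import Counter

def note_bigram_model(pieces, smoothing=1.0, alphabet_size=128):
    """Estimate a bigram model over MIDI note numbers.

    pieces: list of note sequences (lists of integers 0-127).
    Uses additive smoothing (an illustrative choice) so that unseen
    successions keep a small non-zero probability.
    Returns a function logprob(a, b) = log P(b | a).
    """
    bi, uni = Counter(), Counter()
    for notes in pieces:
        for a, b in zip(notes, notes[1:]):
            bi[(a, b)] += 1
            uni[a] += 1

    def logprob(a, b):
        return math.log((bi[(a, b)] + smoothing) /
                        (uni[a] + smoothing * alphabet_size))

    return logprob

def mean_neg_logprob(logprob, notes):
    """Average surprisal of a note sequence: lower means more predictable."""
    lps = [logprob(a, b) for a, b in zip(notes, notes[1:])]
    return -sum(lps) / len(lps)
```

Under such a model, a melody following the learned patterns scores as more predictable than an arbitrary note sequence, which is the kind of signal intended for clustering pieces into model classes.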
In the longer term, the model is intended to be used as a complement to source separation and acoustic decoding, forming a consistent framework embedding signal processing techniques, acoustic knowledge sources and music rule modeling.