Section: New Results
Keywords: audio detection, audio segmentation, statistical hypothesis testing, speech recognition, audio and multimodal structuring, broadcast news indexing, rich transcription, audiovisual integration, multimedia.
Audio analysis and structuring for multimedia indexing and information extraction
Audio content processing and information extraction in sports events
This work has been done in the context of the ITEA PELOPS project, in close cooperation with Thomson Multimedia.
Extracting relevant information from sports programmes (such as soccer matches) is a challenge closely linked to applicative considerations, such as automatic content summarization and fast post-production and repurposing. In this context, the activities of the Metiss group in the PELOPS project focused on two tasks:
Generation of acoustic and semantic descriptors from audio soundtracks.
Audio-visual information fusion and integration, for the classification of highlights in a sports event (collaboration with Thomson Multimedia, which provided the video descriptors).
A set of low-level audio descriptors has been set up, using statistical and pattern recognition techniques. Blind source separation (BSS) techniques are used as a preprocessing phase to separate the commentator track from the crowd and field ambience. This preprocessing step improved the robustness of several audio descriptors, such as commentator pitch tracking, commentator speech rate and cheering level.
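As an illustration of the kind of low-level descriptor involved, the following Python sketch estimates the pitch of a single frame by autocorrelation peak picking. The signal is synthetic and the implementation is purely illustrative, not the descriptor actually used in PELOPS.

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate the pitch of one frame by picking the autocorrelation peak
    within the plausible lag range for speech."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(int(0.04 * sr)) / sr
frame = np.sin(2 * np.pi * 120.0 * t)  # synthetic 120 Hz "commentator" tone
print(round(autocorr_pitch(frame, sr)))  # 120
```

In a real descriptor, the estimate would be computed frame by frame on the separated commentator track and smoothed over time.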
The fusion and integration of audio and visual information addresses the problem of combining heterogeneous descriptors computed on asynchronous streams. Events are modelled by means of contextual relations between time intervals, using different statistics on the descriptors (minimum, maximum, standard deviation). Support Vector Machine classifiers have been used to train the models and to score the test matches, as described in [Oops!].
The resulting event classification is synchronized with the video shot segmentation and each shot is assigned a score for each of the considered event classes (goals, cards, goal attempts, other). The classification is evaluated using a precision-recall curve. In our experiments on a corpus of 12 soccer matches, 100 % of the goals were found among the shots with the highest estimated goal probability.
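The interval-statistics representation and the SVM scoring step can be sketched as follows, using scikit-learn. The descriptor tracks (cheering level, commentator pitch) and their distributions are invented for the example; the actual feature set and training data differ.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def interval_stats(track):
    """Summarize a descriptor track over a time interval (min, max, std)."""
    return [track.min(), track.max(), track.std()]

# Synthetic intervals: "goal" intervals have a higher cheering level
# and a larger commentator pitch variance (illustrative assumption).
def make_interval(is_goal):
    cheer = rng.normal(3.0 if is_goal else 1.0, 0.5, size=50)
    pitch = rng.normal(150, 40 if is_goal else 10, size=50)
    return interval_stats(cheer) + interval_stats(pitch)

y = np.array([1] * 50 + [0] * 50)
X = np.array([make_interval(label) for label in y])

clf = SVC(probability=True).fit(X, y)
# Each shot can then be ranked by its estimated event probability:
probs = clf.predict_proba(X)[:, 1]
print(probs[:3].round(2))
```

Ranking shots by `probs` and thresholding yields the precision-recall trade-off mentioned above.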
Integrating natural language processing and speech recognition
This work has been done in close collaboration with the Texmex project-team at IRISA and has led to a growing collaboration with the NLP group at the Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE, Puebla, Mexico).
Automatic speech recognition (ASR) systems aim at generating a textual transcription of a spoken document, usually for further analysis of the transcription with natural language processing (NLP) techniques. However, most current ASR systems solely rely on statistical methods and seldom use linguistic knowledge. In collaboration with the NLP group in the Texmex project-team of IRISA, we investigated several directions toward a better use of linguistic knowledge such as morphology, syntax, semantics and pragmatics in ASR.
The work described hereafter was implemented in our Sirocco software and incorporated into our spoken document analysis platform Irene. The proposed approaches were benchmarked on the Ester French broadcast news corpus, which constitutes a reference in ASR for the French language.
Using morpho-syntactic knowledge in speech recognition
In 2006, we had demonstrated the interest of a score combining acoustic, language and morpho-syntactic information to rescore N-best sentence hypothesis lists. This year, we consolidated these results with various configurations of our ASR system and studied the impact of morpho-syntactic information on confidence measure computation. In particular, we demonstrated that confidence measures can be improved based on our combined score function [Oops!].
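The combination principle can be sketched as a log-linear score over the three information sources, followed by a softmax that yields a posterior-style confidence for each hypothesis. All scores and weights below are invented for illustration; the actual combination function and its tuning are those of the cited work.

```python
import numpy as np

# Hypothetical per-hypothesis log-scores for a 3-entry N-best list.
log_acoustic = np.array([-120.5, -121.0, -119.8])
log_lm       = np.array([-35.2, -33.9, -36.4])
log_morpho   = np.array([-10.1, -9.2, -12.7])  # morpho-syntactic score

# Weights would normally be tuned on held-out data.
w_ac, w_lm, w_ms = 1.0, 8.0, 4.0
combined = w_ac * log_acoustic + w_lm * log_lm + w_ms * log_morpho

best = int(np.argmax(combined))  # rescored 1-best hypothesis
# A softmax over combined scores gives a simple confidence measure.
conf = np.exp(combined - combined.max())
conf /= conf.sum()
print(best, conf.round(3))
```

With these toy numbers the morpho-syntactic term shifts the 1-best choice away from the acoustically best hypothesis, which is exactly the effect exploited in rescoring.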
Spoken document segmentation
Spoken document segmentation is a crucial step for the analysis of multimedia documents which requires the combination of linguistic and acoustic cues. To this end, we extended a statistical method based on lexical cohesion for topic shift detection to take into account additional knowledge such as semantic relations between words, syntactic coherence and acoustic cues. Our technique enables us to improve segmentation, although a few parts, particularly those corresponding to the news headlines, still need to be refined.
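The lexical cohesion principle can be illustrated with a minimal TextTiling-style sketch: cosine similarity between the word counts of adjacent sentence blocks, where low similarity suggests a topic shift. The sentences are toy data, and the actual method additionally uses semantic relations, syntactic coherence and acoustic cues.

```python
import numpy as np
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    num = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    den = np.sqrt(sum(v * v for v in c1.values())) \
        * np.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def cohesion_scores(sentences, width=2):
    """Similarity between adjacent sentence blocks; local minima
    are candidate topic boundaries."""
    scores = []
    for i in range(width, len(sentences) - width + 1):
        left = Counter(w for s in sentences[i - width:i] for w in s.split())
        right = Counter(w for s in sentences[i:i + width] for w in s.split())
        scores.append(cosine(left, right))
    return scores

doc = ["the match started late", "the referee stopped the match",
       "stock markets fell sharply", "markets closed lower today"]
print(cohesion_scores(doc, width=1))
```

The minimum of the score curve falls between the sports sentences and the finance sentences, i.e. at the topic shift.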
Using information retrieval for language model adaptation
We proposed a method to adapt the language model of an ASR system for each segment resulting from the segmentation step described above. The method is completely unsupervised and uses neither a priori knowledge about topics nor a static collection of texts. The idea is to gather textual adaptation data for each segment, based on information retrieval (IR) methods to extract keywords which are used to retrieve documents from the Web. IR techniques, used both for keyword extraction and for document selection, have been adapted to tackle the specificities of automatic transcriptions (e.g. misrecognized words, named entities). Results indicate a large improvement of the language model, which finally yields a small improvement of the word error rate [Oops!] .
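The keyword extraction step can be sketched with a plain TF-IDF ranking against a small background collection. The segment and background texts are toy data; the actual system uses IR techniques adapted to transcription errors and named entities.

```python
import math
from collections import Counter

def tfidf_keywords(segment, background, k=3):
    """Rank segment words by TF-IDF against a background collection;
    the top words form the Web query used to gather adaptation data."""
    tf = Counter(segment.split())
    n_docs = len(background)

    def idf(w):
        df = sum(w in doc.split() for doc in background)
        return math.log((1 + n_docs) / (1 + df))

    ranked = sorted(tf, key=lambda w: tf[w] * idf(w), reverse=True)
    return ranked[:k]

segment = "the volcano eruption forced the evacuation of the island volcano"
background = ["the cabinet met on tuesday", "the island hosts a festival",
              "rain stopped the match"]
print(tfidf_keywords(segment, background))
```

Frequent function words ("the") get a zero IDF and are filtered out automatically, while topical words such as "volcano" rank first.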
This preliminary work has demonstrated the potential of our approach to efficiently transcribe speech streams and suggests further work on language model and vocabulary adaptation based on IR methods to gather adaptation data from the Internet. The thesis of Gwénolé Lecorvé, which started in September 2007 in collaboration with the Texmex project-team, will be dedicated to language model and vocabulary adaptation for the robust transcription of multimedia streams.
Speech recognition based on phonetic landmarks
Participant : Guillaume Gravier.
HMM-based automatic speech recognition can hardly accommodate prior knowledge about the signal, apart from the definition of the topology of the phone-based elementary HMMs. In previous years, we have shown that such knowledge can be efficiently used during decoding with the Viterbi algorithm, as constraints on the best path.
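A minimal sketch of landmark-constrained Viterbi decoding: at frames where a landmark has been detected, only states compatible with that broad phonetic class are allowed on the best path. The two-state model and all probabilities below are invented for illustration.

```python
import numpy as np

def viterbi(log_emit, log_trans, allowed):
    """Viterbi decoding where allowed[t] is a boolean mask of states
    compatible with a detected landmark at frame t (None = no constraint)."""
    T, S = log_emit.shape
    NEG = -1e9
    dp = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=int)
    mask0 = allowed[0] if allowed[0] is not None else np.ones(S, bool)
    dp[0, mask0] = log_emit[0, mask0]
    for t in range(1, T):
        for s in range(S):
            if allowed[t] is not None and not allowed[t][s]:
                continue  # state incompatible with the landmark: pruned
            prev = dp[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(prev))
            dp[t, s] = prev[back[t, s]] + log_emit[t, s]
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two broad classes (0 = vowel, 1 = stop), ambiguous emissions,
# and a vowel landmark detected at frame 1.
log_emit = np.log(np.array([[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]))
log_trans = np.log(np.full((2, 2), 0.5))
vowel_only = np.array([True, False])
print(viterbi(log_emit, log_trans, [None, vowel_only, None]))
```

The constraint simply prunes incompatible states at the landmark frames, so the decoder's complexity is unchanged.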
Preliminary experiments have shown that accurately detecting broad phonetic landmarks, such as vowels or stops, can greatly benefit ASR. Hence, this year we focused on the actual detection of such landmarks. Experiments on HMM-based landmark detection demonstrated that, while HMMs can be used to provide a segmentation into broad phonetic events, the classification rate is not high enough to benefit speech recognition. This is partly due to the fact that the same paradigm (features and models) is used for both landmark detection and speech recognition. We therefore focused on the use of support vector machines to classify feature vectors into broad phonetic classes, achieving classification rates around 95 % for vowels, fricatives and nasals [Oops!].
In the future, we plan to improve SVM-based landmark detection using different features and to demonstrate the actual feasibility of broad phonetic landmark-driven speech recognition.
Motif discovery in audio documents
Discovering repeating motifs, such as advertisements, jingles or even words, in audio streams or databases is a crucial task for the unsupervised structuring of audio data collections and a necessary step toward the lightly supervised design of audio event recognition systems. Research in this field is organized along two main axes, namely efficient search for a motif (query) and efficient representation of a motif to deal with variability.
In 2007, our activity in the field of audio motif discovery mainly focused on the study of sequence models for fast retrieval of audio sequences, in collaboration with the Texmex project-team at IRISA. Extending existing multidimensional indexing techniques is not possible, as these were designed for description schemes which lack the concept of sequence. A solution is to summarize each sequence in a model before indexing, and to compare models rather than sequences. To this end, we investigated the use of support vector machines as a prediction model and compared the SVM-based comparison of sequences with the more traditional feature-based dynamic time warping alignment method. Overall, we have shown that relying on models (instead of relying on descriptors) provides better robustness to severe modifications of sequences, such as temporal distortions [Oops!], [Oops!].
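For reference, the feature-based baseline can be sketched as a classical dynamic time warping alignment between two feature sequences. The one-dimensional sequences below are toy data; note that the time-stretched copy aligns at zero cost, which is exactly the invariance DTW provides.

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping cost between two feature sequences
    (Euclidean local cost, standard three-move recursion)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

ref = np.array([[0.], [1.], [2.], [3.]])
slow = np.array([[0.], [0.], [1.], [1.], [2.], [2.], [3.]])  # time-stretched copy
other = np.array([[5.], [4.], [3.], [2.]])
print(dtw(ref, slow), dtw(ref, other))
```

The model-based approach replaces this quadratic-time pairwise alignment with a comparison of compact sequence models, which is what makes indexing feasible.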
These encouraging results motivate further investigation on SVM-based models of audio sequences. In parallel, the thesis of Armando Muscariello, which started in October 2007, will focus on the practical application of sequence models for motif discovery in audio streams, aiming at the discovery of variable motifs.
Participant : Guillaume Gravier.
The work described in this section is carried out in the framework of the Ph. D. thesis of Siwar Baghdadi, in collaboration with the Texmex project-team of IRISA and Thomson Multimedia Research.
Bayesian networks provide an interesting framework for the joint modeling of multimodal information. Moreover, unlike HMMs and segment models, it is possible to learn the structure of a Bayesian network, i.e. the relations between the variables describing the problem, from data.
We investigated the use of dynamic Bayesian networks and the potential of structure learning algorithms such as K2 for multimodal integration in a commercial detection application. A video stream is considered as a succession of shots, where each shot is represented by a set of visual and audio features and can be labeled either as commercial or non-commercial. We have shown that structure learning algorithms can efficiently learn the relations between the variables describing a shot. We also investigated different approaches to model temporal relations between shots, in particular using an explicit duration model as in segment models.
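K2-style structure learning scores candidate parent sets for each variable with the Cooper-Herskovits metric and greedily keeps the best. A minimal sketch of the scoring step on invented shot data (the `logo_on_screen` and `silence` features are hypothetical, chosen so that the first one tracks the commercial label):

```python
import math
from itertools import product

def k2_log_score(data, child, parents, arity=2):
    """Cooper-Herskovits K2 log-score of `child` given a candidate parent
    set, for discrete data given as tuples (one position per variable)."""
    score = 0.0
    for config in product(range(arity), repeat=len(parents)):
        counts = [0] * arity
        for row in data:
            if all(row[p] == v for p, v in zip(parents, config)):
                counts[row[child]] += 1
        n = sum(counts)
        # log[(r-1)! / (n+r-1)!] + sum_k log(N_k!)  with r = arity
        score += math.lgamma(arity) - math.lgamma(n + arity)
        score += sum(math.lgamma(c + 1) for c in counts)
    return score

# Toy shots: (logo_on_screen, silence, commercial).
data = [(1, 0, 1), (1, 1, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0), (0, 0, 0)] * 5
print(k2_log_score(data, child=2, parents=[0]),   # informative parent
      k2_log_score(data, child=2, parents=[1]))   # uninformative parent
```

The informative parent gets a much higher score, which is how the algorithm recovers the dependency structure from data.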
Future work involves the extension of this approach to event detection in soccer games with a focus on structure learning, either static or temporal, in order to provide a framework for the lightly supervised development of new applications.
Polyphonic music transcription and coding
Participant : Emmanuel Vincent.
Music signals can be described by a score consisting of several notes, each defined by its onset time, duration, pitch and instrument class. The task of estimating the notes underlying a given signal is termed polyphonic music transcription. It involves two subtasks, namely pitch transcription and instrument identification. This task can also form the core of an "object-based" coder, encoding the signal in terms of resynthesis parameters for each note and allowing high-level manipulation of the signal.
Our previous work [Oops!] focused on the modeling of music signals via Bayesian harmonic models. This year we proposed an improved inference method for such models allowing faster computation of the posterior probability of a set of notes on a given time frame.
We also investigated alternative methods addressing this task in the framework of sparse representations. The first method represents the signal in each time frame as a linear combination of harmonic atoms learnt on isolated notes from various instruments. The relevant atoms are selected by Matching Pursuit and additional structural constraints are used to extract sequences of atoms modeling individual notes. The second method represents the short-term magnitude spectrum as a linear combination of magnitude spectra corresponding to different pitches. These spectra are adapted from the signal alone by minimizing the loudness of the residual under harmonicity constraints. This method provided pitch transcription accuracy similar to state-of-the-art methods, while allowing better generalization to unknown instruments.
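The greedy atom selection of the first method can be sketched as a plain Matching Pursuit over a small dictionary of harmonic atoms. The atoms and the two-note signal below are synthetic; the actual dictionary is learnt on isolated notes from real instruments.

```python
import numpy as np

def harmonic_atom(f0, sr, n, n_harm=3):
    """Unit-norm sum of the first few harmonics of f0 (a crude harmonic atom)."""
    t = np.arange(n) / sr
    a = sum(np.sin(2 * np.pi * f0 * h * t) / h for h in range(1, n_harm + 1))
    return a / np.linalg.norm(a)

def matching_pursuit(signal, atoms, n_iter=2):
    """Greedy selection: at each step, pick the atom most correlated
    with the current residual and subtract its contribution."""
    residual = signal.copy()
    picked = []
    for _ in range(n_iter):
        i = int(np.argmax([abs(np.dot(residual, a)) for a in atoms]))
        residual = residual - np.dot(residual, atoms[i]) * atoms[i]
        picked.append(i)
    return picked, residual

sr, n = 8000, 1024
pitches = [220.0, 262.0, 330.0, 392.0]
atoms = [harmonic_atom(f, sr, n) for f in pitches]
# A two-note "chord" built from the 220 Hz and 330 Hz atoms.
signal = 1.0 * atoms[0] + 0.7 * atoms[2]
picked, residual = matching_pursuit(signal, atoms, n_iter=2)
print(sorted(picked))
```

Note that the harmonic atoms are not orthogonal (220 Hz and 330 Hz share a partial at 660 Hz), which is precisely why greedy selection and the structural constraints mentioned above are needed.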
Finally, we investigated the use of such note-based representations for bandwidth extension and "resolution-free" audio coding.
This work was conducted in collaboration with Mark D. Plumbley and Steve Welburn (Queen Mary, University of London), Pierre Leveau and Laurent Daudet (Université Paris 6) and Nancy Bertin and Roland Badeau (GET - Télécom Paris). Previous results have been published as journal articles [Oops!], [Oops!]. New results have been submitted to a journal [Oops!] and published in the proceedings of a conference [Oops!] and an evaluation campaign [Oops!].
Statistical models of music
Speech recognition is very advantageously guided by statistical language models: we hypothesize that music description, recognition and transcription can strongly benefit from music models that express dependencies between notes within a music piece, due to melodic patterns and harmonic rules.
To this end, we have investigated the approximate modeling of syntactic and paradigmatic properties of music, through the use of n-gram models of notes, successions of notes and combinations of notes.
In practice, we consider a corpus of MIDI files on which we learn co-occurrences of concurrent and consecutive notes, and we use these statistics to cluster music pieces into classes of models and to measure the predictability of notes within a class of models.
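A minimal sketch of the bigram case: maximum-likelihood note bigrams learnt from toy sequences of MIDI pitches, then used to measure the predictability of the next note. The corpus is invented for the example; the actual model also covers concurrent notes and note combinations.

```python
from collections import Counter, defaultdict

def train_bigrams(pieces):
    """Maximum-likelihood note bigram counts from sequences of MIDI pitches."""
    counts = defaultdict(Counter)
    for piece in pieces:
        for a, b in zip(piece, piece[1:]):
            counts[a][b] += 1
    return counts

def next_note_prob(counts, prev, note):
    """P(note | prev) under the unsmoothed bigram model."""
    total = sum(counts[prev].values())
    return counts[prev][note] / total if total else 0.0

# Toy corpus: C major noodling (MIDI pitches 60=C4, 62=D4, 64=E4).
pieces = [[60, 62, 64, 62, 60], [60, 64, 62, 60], [64, 62, 60, 62, 64]]
model = train_bigrams(pieces)
print(next_note_prob(model, 62, 60), next_note_prob(model, 62, 61))
```

In the intended framework, such predictability scores would weight the hypotheses produced by source separation and acoustic decoding, much as a language model weights ASR hypotheses.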
The model is intended to be used as a complement to source separation and acoustic decoding, to form a consistent framework embedding signal processing techniques, acoustic knowledge sources and music rule modeling. A publication is in preparation.