Section: New Results
Speech recognition for multimedia structuring and indexing
Speech based structuring and indexing of audio-visual documents
Participant: Guillaume Gravier.
Work done in close collaboration with the Texmex project-team of IRISA.
Speech can be used to structure and index large collections of spoken documents (videos, audio streams, etc.) based on semantics. This is typically achieved by first transforming speech into text using automatic speech recognition (ASR), before applying natural language processing (NLP) techniques to the transcriptions. Our research focuses on the integration of ASR and NLP techniques in the framework of large-scale analysis of multimedia document collections.
In 2009, several aspects were considered, namely topic segmentation using semantic relations, unsupervised topic adaptation and semantic verification of TV programs.
We improved our transcript-based topic segmentation method, an extension of the initial work of Utiyama and Isahara that accounts for additional knowledge sources such as acoustic cues or semantic relations between words. In particular, we further investigated the use of semantic relations, implementing a mathematically rigorous framework to account for such relations and comparing several methods for their automatic corpus-based acquisition. We demonstrated on a TV news corpus that directly using automatically generated semantic relations increases precision on topic boundaries at the expense of lower recall. This result highlights the need for a careful selection of the relations to be considered.
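The baseline criterion behind this family of methods can be sketched as follows. This is a minimal illustration of a Utiyama–Isahara-style segmentation, not the team's actual implementation: each candidate segment is scored by an incremental Laplace-smoothed unigram model estimated from the segment itself (so lexically cohesive segments are cheap), and dynamic programming over sentence boundaries trades segment cost against a per-segment penalty acting as a prior on the number of topics. Function names and the penalty choice are illustrative.

```python
import math
from collections import Counter

def segment_cost(words, vocab_size):
    """Negative log-likelihood of a segment under an incremental
    Laplace-smoothed unigram model: each word is scored against the
    counts of words already seen in the segment, so repetition
    (lexical cohesion) lowers the cost."""
    counts = Counter()
    cost = 0.0
    for i, w in enumerate(words):
        cost -= math.log((counts[w] + 1.0) / (i + vocab_size))
        counts[w] += 1
    return cost

def segment(sentences, penalty=None):
    """Dynamic programme over sentence boundaries minimising total
    segment cost plus a per-segment penalty (the model prior).
    Returns the list of boundary positions (in sentences)."""
    words_flat = [w for s in sentences for w in s]
    vocab_size = len(set(words_flat))
    if penalty is None:
        # Illustrative prior: one log(N) penalty per segment.
        penalty = math.log(len(words_flat))
    n = len(sentences)
    best = [0.0] + [math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            words = [w for s in sentences[i:j] for w in s]
            c = best[i] + segment_cost(words, vocab_size) + penalty
            if c < best[j]:
                best[j], back[j] = c, i
    # Recover boundaries by backtracking.
    bounds, j = [], n
    while j > 0:
        bounds.append(j)
        j = back[j]
    return sorted(bounds)
```

On a toy stream whose first two sentences share one vocabulary and whose last two share another, the programme places the boundary between them; the semantic relations studied above would enter such a criterion by letting related (not just identical) words reinforce cohesion.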
Given thematically homogeneous segments, we pursued our work on unsupervised topic adaptation of the ASR system language model. Elaborating on our previous work based on the automatic acquisition of adaptation data from the Web, we investigated constraint selection in MDI adaptation. Experiments showed that considering only a small number of terms in the MDI constraints, i.e., topic-specific words, is sufficient for efficient adaptation. In addition, these terms can be automatically extracted from a small topic-specific corpus without any prior knowledge.
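The effect of restricting MDI constraints to a few topic-specific terms can be sketched at the unigram level. In this simplified illustration (not the team's actual system), only the selected terms have their background probability rescaled toward the topic distribution, by the standard MDI scaling factor (P_topic(w)/P_background(w))**beta, after which the model is renormalised; all parameter names and the toy distributions are assumptions.

```python
def mdi_adapt(background, topic, terms, beta=0.5):
    """Unigram-level MDI adaptation with constraints restricted to a
    small set of topic-specific terms: only those words are rescaled
    by (P_topic(w)/P_background(w))**beta; all other words keep their
    background probability up to renormalisation."""
    scaled = {}
    for w, p in background.items():
        if w in terms and w in topic:
            scaled[w] = p * (topic[w] / p) ** beta
        else:
            scaled[w] = p
    z = sum(scaled.values())  # renormalise to a proper distribution
    return {w: p / z for w, p in scaled.items()}
```

Boosting a single topic word raises its probability while renormalisation slightly deflates the rest, which is why a handful of well-chosen constraint terms can already shift the model toward the topic.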
Finally, we extended our method for the semantic validation of automatic alignments of TV streams with an electronic program guide (EPG). The method validates an alignment by comparing the speech transcripts with the short program descriptions provided by the EPG. The comparison combines lexical and phonetic information retrieval techniques to define a distance between transcripts and descriptions. In 2009, we validated the method on a large dataset, introducing time-based constraints to limit computation.
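The lexical side of such a transcript/description comparison can be sketched with a plain bag-of-words cosine similarity. This is a crude stand-in, under assumed names, for the combined lexical and phonetic distance described above: an alignment is accepted when the EPG description most similar to the transcript is indeed the aligned one and its score clears a threshold.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words; a simple proxy for
    the lexical component of the transcript/description distance."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def validate_alignment(transcript, descriptions, aligned_idx, threshold=0.2):
    """Accept the EPG alignment if the best-matching description is the
    aligned one and its similarity exceeds the threshold."""
    scores = [cosine(transcript, d) for d in descriptions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best == aligned_idx and scores[best] >= threshold
```

In the real setting, a phonetic match (e.g. on proper names that ASR may misspell but transcribe phonetically close) would complement this lexical score, and the time-based constraints mentioned above restrict which (transcript, description) pairs are compared at all.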