Section: New Results
New processing tools for audiovisual documents
TV Stream Structuring
TV stream macro-segmentation
Participants : Patrick Gros, Gaël Manson.
This work is done in the framework of the thesis of Gaël Manson at Orange Labs (formerly France Télécom R&D).
TV stream macro-segmentation aims at precisely determining the start and the end of each broadcast program and inter-program.
A first step in automatic TV macro-segmentation is to segment the TV stream into program and inter-program segments. The most promising automatic approach consists in detecting inter-programs as near-identical repeated sequences. However, such a detection can, by construction, only find sequences that actually repeat in the stream: some of these repeated sequences are indeed inter-programs, but others belong to long programs, and some inter-programs do not repeat at all.
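As an illustration, the sketch below shows a naive way to detect such near-identical repeated sequences, assuming a precomputed hashable fingerprint per frame (e.g., a quantized color signature); the function and the fingerprinting scheme are illustrative, not those of the actual system.

from collections import defaultdict

def find_repeated_sequences(fingerprints, min_len=25):
    """Detect repeated sequences in a stream given one hashable
    fingerprint per frame. Returns (first_pos, second_pos, length)
    triples; overlapping runs are re-reported (naive version)."""
    # Index every stream position by its frame fingerprint.
    index = defaultdict(list)
    for pos, fp in enumerate(fingerprints):
        index[fp].append(pos)
    # Extend each pair of matching positions into a run of
    # identical fingerprints and keep the sufficiently long ones.
    matches, n = set(), len(fingerprints)
    for positions in index.values():
        for i in positions:
            for j in positions:
                if j <= i:
                    continue
                length = 0
                while (i + length < n and j + length < n
                       and fingerprints[i + length] == fingerprints[j + length]):
                    length += 1
                if length >= min_len:
                    matches.add((i, j, length))
    return matches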
It is therefore necessary to classify the segments resulting from the repeated-sequence detection phase (the occurrences of repeated sequences and the rest of the stream) into program and inter-program segments. Our solution is based on Inductive Logic Programming. In addition to intrinsic features of each segment (e.g., duration, number of repetitions), our technique makes use of the relational and contextual information of the segments in the stream (e.g., the class of the following segments, the class of other occurrences of the same repeated sequence).
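The sketch below illustrates, in a flattened form, the kind of intrinsic and contextual information involved; the actual system expresses this relational information as logical facts for an ILP learner, and all key names here are hypothetical.

def segment_features(segments, i):
    """Intrinsic and contextual features for the i-th segment of the
    stream. `segments` is a chronological list of dicts with keys
    such as 'duration', 'n_repeats' and, once predicted, 'label'."""
    seg = segments[i]
    prev_label = segments[i - 1].get('label') if i > 0 else None
    next_label = segments[i + 1].get('label') if i + 1 < len(segments) else None
    return {
        'duration': seg['duration'],      # intrinsic
        'n_repeats': seg['n_repeats'],    # intrinsic
        'prev_is_inter_program': prev_label == 'inter-program',  # contextual
        'next_is_inter_program': next_label == 'inter-program',  # contextual
    }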
The last step in automatic TV macro-segmentation is TV program extraction. One TV program can be split into several parts spread over a set of consecutive program segments. Consecutive program segments belonging to the same TV program must therefore be reunified, or fused, in order to retrieve the entire program. When EPG or EIT metadata are available, we align the program segments with them to perform the reunification. When no metadata is available, our approach relies on analyzing the visual content and characteristics of each pair of consecutive segments in order to decide whether they have to be merged. It uses, among others, content-based descriptors such as the color distribution, the number of faces in each segment and the number of near-identical shots shared by the two segments. These descriptors are then fed to an SVM classifier that makes the final decision.
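A minimal sketch of this fusion decision, assuming the descriptors mentioned above are precomputed and using scikit-learn's SVC; the feature layout and key names are illustrative.

import numpy as np
from sklearn.svm import SVC

def pair_features(seg_a, seg_b, n_shared_shots):
    """Feature vector for two consecutive program segments: color
    histogram distance, face counts, and the number of near-identical
    shots the two segments share."""
    color_distance = float(np.linalg.norm(seg_a['color_hist'] - seg_b['color_hist']))
    return [color_distance, seg_a['n_faces'], seg_b['n_faces'], n_shared_shots]

def train_fusion_classifier(pairs, labels):
    """Fit an SVM deciding whether two consecutive segments belong to
    the same TV program (label 1) or to different programs (label 0).
    `pairs` is a list of (seg_a, seg_b, n_shared_shots) tuples."""
    X = [pair_features(a, b, s) for a, b, s in pairs]
    clf = SVC(kernel='rbf')
    clf.fit(X, labels)
    return clf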
Repetition Detection-based TV Structuring
Participants : Patrick Gros, Zein Al-Abidin Ibrahim, Emmanuelle Martienne, Sébastien Campion.
This work is done under the Semim@ges project in collaboration with the CAIRN team.
In the framework of the ANR Semim@ges project, we have developed an approach slightly different from the one developed with Gaël Manson. It is based on the same fundamentals: detection of repeated sequences, classification of these segments, and alignment with an Electronic Program Guide. The difference is that the classification step is based on the repetition patterns themselves, and not only on the local environment of each segment that is repeated somewhere else. The classification tries to differentiate the repetition schemes of advertisements, programs, and other segments based on their number of occurrences and on various descriptors of each occurrence. The method provided promising first results, but it has strong experimental constraints: the repetition patterns must be learned on long video sequences (several weeks), and they have variable sizes, whereas many learning techniques require fixed-size data. On the other hand, this method allows long-term information to be taken into account.
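As an example of how variable-size repetition patterns can be fed to fixed-size learning techniques, the sketch below summarizes all the occurrences of one repeated sequence into a small statistics vector; the chosen statistics are illustrative, not the exact descriptors of the method.

import numpy as np

def repetition_pattern_features(occurrence_times, occurrence_durations):
    """Fixed-size summary of a variable-length repetition pattern
    (timestamps and durations, e.g. in seconds, of all occurrences of
    one repeated sequence over several weeks of stream). Bursty, short
    patterns are typical of advertisements; sparse, long ones of
    programs."""
    t = np.sort(np.asarray(occurrence_times, dtype=float))
    d = np.asarray(occurrence_durations, dtype=float)
    gaps = np.diff(t) if len(t) > 1 else np.array([0.0])
    return np.array([
        len(t),             # number of occurrences
        d.mean(),           # mean occurrence duration
        gaps.mean(),        # mean time between occurrences
        gaps.std(),         # burstiness of the repetition scheme
        t[-1] - t[0],       # time span covered by the pattern
    ])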
Semantic TV Program Verification
Participants : Camille Guinaudeau, Pascale Sébillot.
Our work on this topic is done in close collaboration with Guillaume Gravier from the Metiss project-team.
Over previous years, we have developed an approach for the automatic labeling of program segments from the electronic program guide (EPG) [59]. In 2008, we developed a method to validate this labeling from a semantic point of view. In this method, program descriptions from a TV guide are associated with each segment of the TV stream, using information retrieval methods on the phonetic and lexical transcripts of the speech material. As a result, each segment is associated with a program name, which is used for the validation of the EPG alignment. In 2009, we improved this technique with a time anchorage system that allows a segment and a description to be associated only if they are contained in the same 2 h time slot [32], [21]. The new method was evaluated on a larger corpus (650 segments, with transcript lengths varying from 7 to 27,150 words, 2,643 on average). Time anchorage improved recall from 64% to 86% and precision from 47% to 63%.
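A toy sketch of the time-anchored association, using TF-IDF cosine similarity as the IR component and a symmetric 2 h window as an approximation of the time-slot constraint; the data structures are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_description(segment_text, segment_start, descriptions):
    """Associate a TV-guide description with a stream segment.
    `descriptions` maps program names to (text, start_time) pairs,
    times in hours. Time anchorage: only descriptions within the
    same 2 h window as the segment are candidates; the description
    most similar to the segment transcript wins."""
    candidates = {name: text for name, (text, start) in descriptions.items()
                  if abs(start - segment_start) < 2.0}
    if not candidates:
        return None
    names = list(candidates)
    tfidf = TfidfVectorizer().fit_transform(
        [segment_text] + [candidates[n] for n in names])
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    return names[sims.argmax()]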
Program Structuring
Stochastic Models for Video Description
Participants : Siwar Baghdadi, Patrick Gros.
Our work on this topic is done in close collaboration with Guillaume Gravier from the Metiss project-team and Thomson as external partner.
Bayesian Networks are an elegant and powerful semantic analysis tool. They combine an intuitive graphical representation with efficient algorithms for inference and learning. They also allow the interactions between the variables of a system to be represented in a comprehensible manner, and external knowledge to be integrated. Unlike HMMs and segment models, the structure of a Bayesian Network is very flexible and can be learned from data. We explored the idea of using Bayesian Networks and their temporal extension, Dynamic Bayesian Networks, to perform event detection in video streams. As a first application we chose commercial detection. We modeled the video stream by a Dynamic Bayesian Network: the stream is a sequence of observations (sets of multimodal features), each observation being generated according to the state of the system (program or commercial). The model is fed with knowledge about commercial segment durations; detecting commercial segments then amounts to inferring the optimal sequence of hidden states with the appropriate durations. Structure learning allowed us to learn the optimal interactions between variables. Future work involves extending our model to event detection in soccer games, the challenge being to take all kinds of feature interactions into consideration (spatial, and temporal at short or long term).
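The duration-aware decoding can be illustrated by the hand-rolled sketch below, which replaces the actual DBN inference with a simple explicit-duration (semi-Markov) Viterbi over the two states; observation log-likelihoods and duration log-probabilities are assumed precomputed.

import numpy as np

def segment_stream(log_obs, log_dur, max_dur):
    """Explicit-duration decoding for a two-state model (0 = program,
    1 = commercial). `log_obs[t, s]` is the log-likelihood of the
    multimodal observation at time t under state s; `log_dur[s][d]`
    is the log-probability that state s lasts d frames (index 0
    unused). Returns (state, start, end) segments."""
    T = log_obs.shape[0]
    # cum[t, s] = sum of log_obs[0:t, s], so a segment's observation
    # score is a difference of two cumulative sums.
    cum = np.vstack([np.zeros((1, 2)), np.cumsum(log_obs, axis=0)])
    best = np.full(T + 1, -np.inf)
    best[0], back = 0.0, [None] * (T + 1)
    for t in range(1, T + 1):
        for s in (0, 1):
            for d in range(1, min(max_dur, t) + 1):
                score = best[t - d] + log_dur[s][d] + cum[t, s] - cum[t - d, s]
                if score > best[t]:
                    best[t], back[t] = score, (t - d, s)
    segments, t = [], T
    while t > 0:  # backtrack the optimal segmentation
        start, s = back[t]
        segments.append((s, start, t))
        t = start
    return segments[::-1]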
Automatic discovery of audiovisual structuring events in videos
Participants : Mathieu Ben, Sébastien Campion.
Our work on this topic is done in close collaboration with Guillaume Gravier from the Metiss project-team.
Extraction of structuring events in video programs is an important pre-processing step for higher-level video analysis tasks. However, most current techniques rely on supervised approaches specifically dedicated to a given target event, for example the detection of anchorperson shots in TV news programs.
To overcome this genericity issue, we have developed a cross-modal technique for the automatic discovery of audiovisual structuring events in TV programs, using very little prior knowledge about the targeted events. The algorithm is based on two separate hierarchical clustering processes, one for audio segments and one for video shots. The two resulting clustering trees are then crossed by measuring the mutual information between each pair of audio/video (A/V) clusters. The A/V cluster pairs are then ordered in an N-best list, from which a confidence score is obtained for each segment in the video. By applying a varying threshold on these scores, one can select more or fewer segments, thus setting a trade-off between missed structuring events and falsely detected ones.
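A minimal sketch of the crossing step: the mutual information between the binary memberships of an audio cluster and a video cluster, computed over a common time base, is used to rank all A/V pairs; this ignores the hierarchical structure of the two trees and is purely illustrative.

import numpy as np

def pair_mutual_information(audio_labels, video_labels, a, v):
    """Mutual information (in bits) between the events 'frame belongs
    to audio cluster a' and 'frame belongs to video cluster v', given
    per-frame cluster indices from the two clusterings."""
    x = np.asarray(audio_labels) == a
    y = np.asarray(video_labels) == v
    mi = 0.0
    for xv in (True, False):
        for yv in (True, False):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def rank_av_pairs(audio_labels, video_labels):
    """N-best list of (mutual information, audio cluster, video
    cluster) triples, most informative pairs first."""
    return sorted(((pair_mutual_information(audio_labels, video_labels, a, v), a, v)
                   for a in set(audio_labels) for v in set(video_labels)),
                  reverse=True)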
Experiments on several kinds of TV programs have shown that the technique is able to extract the most relevant parts of the video, from a structuring point of view: anchorperson shots for TV news and report programs, audio/video jingles separating the reports for flash news programs. On a manually labeled database of 59 TV news programs, we were able to obtain more than 80% precision with 80% recall for the detection of anchorperson shots.
We are currently investigating how to efficiently use the results from this structuring event extraction module to guide the “best path” decoding process of our topic segmentation system.
Using Speech to Describe and Structure Video
Participants : Julien Fayolle, Camille Guinaudeau, Gwénolé Lecorvé, Christian Raymond, Pascale Sébillot.
Our work on this topic is done in close collaboration with Guillaume Gravier and Pierre Cauchy from the Metiss project-team.
Speech can be used to structure and organize large collections of spoken documents (videos, audio streams, etc.) based on semantics. This is typically achieved by first transforming speech into text using automatic speech recognition (ASR), before applying natural language processing (NLP) techniques to the transcripts. Our research focuses on the adaptation of NLP methods designed for regular texts to account for the specificities of automatic transcripts; in particular, we investigate a deeper integration between ASR and NLP, i.e., between the transcription phase and the semantic analysis phase. We also study the effective use of semantic analysis for video structuring.
In 2009, we mostly focused on transcript-based topic segmentation and on unsupervised topic adaptation of the ASR system.
In order to achieve robust topic segmentation of TV streams, we worked on improving our transcript-based topic segmentation method, an extension of the initial work of Utiyama and Isahara [70] that accounts for additional knowledge sources such as acoustic cues or semantic relations between words [55]. Firstly, we further investigated the use of semantic relations, implementing a mathematically rigorous framework to account for such relations and comparing several methods for their automatic corpus-based acquisition. We demonstrated that directly using automatically generated semantic relations increases precision on topic boundaries at the expense of a lower recall. Secondly, we combined transcript-based topic segmentation with the automatic discovery of structural elements in videos (see 7.2.1), using the latter as prior information on the location of topic boundaries. Preliminary results showed that the detection of structural elements is not accurate enough, in particular regarding the boundaries of such elements, to improve topic segmentation. Future work includes the selection of relevant semantic relations and a better consideration of structural elements.
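For illustration, here is a stripped-down version of the underlying Utiyama-and-Isahara-style criterion, without the acoustic and semantic extensions: each candidate segment is scored by a Laplace-smoothed unigram model of its own words, and dynamic programming selects the best segmentation, the length prior being reduced to a constant per-boundary penalty.

import math
from collections import Counter

def segment_score(words, vocab_size):
    """Log-probability of a candidate segment under its own
    Laplace-smoothed unigram model (simplified Utiyama-Isahara
    cohesion criterion)."""
    counts, n = Counter(words), len(words)
    return sum(c * math.log((c + 1) / (n + vocab_size)) for c in counts.values())

def segment_transcript(sentences, vocab_size, penalty=5.0):
    """Choose topic boundaries at sentence breaks by dynamic
    programming, maximising the sum of segment scores minus a
    per-boundary penalty. `sentences` is a list of token lists;
    returns the sentence indices where boundaries fall."""
    n = len(sentences)
    best, back = [-math.inf] * (n + 1), [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            words = [w for sent in sentences[i:j] for w in sent]
            score = best[i] + segment_score(words, vocab_size) - penalty
            if score > best[j]:
                best[j], back[j] = score, i
    bounds, j = [], n
    while j > 0:
        bounds.append(j)
        j = back[j]
    return sorted(bounds)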
Since the quality of transcriptions is a central factor for many NLP tasks in the semantic analysis step, we aimed at adapting the language model of the ASR system to thematically coherent segments, using related documents retrieved from the Internet [22]. More specifically, we sought to better understand which factors of the MDI-based language model adaptation step most affect the recognition rate. Experiments reported in [22] have shown that considering only a small number of terms in the MDI constraints, i.e., topic-specific words, is sufficient to perform an efficient adaptation. In addition, it has been shown that these terms can be automatically extracted from a small topic-specific corpus without any prior knowledge.
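A unigram toy version of the MDI-style adaptation, restricted to the selected topic-specific terms: each constrained background probability is rescaled by the topic-to-background ratio raised to an exponent, then the model is renormalized. This is a simplification of the adaptation actually applied to the full language model.

def mdi_adapt_unigram(background, topic, terms, beta=0.5):
    """`background` and `topic` map words to probabilities; only the
    words in `terms` (the topic-specific words) are constrained.
    Returns the renormalized adapted unigram model."""
    scaled = {w: p * ((topic.get(w, p) / p) ** beta if w in terms else 1.0)
              for w, p in background.items()}
    z = sum(scaled.values())
    return {w: p / z for w, p in scaled.items()}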
With the arrival of Christian Raymond in September 2009, we initiated work on entity detection in spoken documents, as named entities are key elements for speech-based segmentation, structuring and indexing of videos. We developed and compared two baseline methods for named entity detection in speech transcripts, respectively based on support vector machines and conditional random fields.
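As a sketch of the CRF baseline, here is a linear-chain CRF trained on BIO tags with the sklearn-crfsuite package (assumed available); the feature set is illustrative.

import sklearn_crfsuite

def token_features(tokens, i):
    """Simple per-token features for named entity detection in speech
    transcripts, which typically lack case and punctuation cues."""
    w = tokens[i]
    return {
        'word': w,
        'suffix3': w[-3:],
        'prev': tokens[i - 1] if i > 0 else '<s>',
        'next': tokens[i + 1] if i + 1 < len(tokens) else '</s>',
    }

def train_ner(sentences, labels):
    """`sentences`: list of token lists; `labels`: matching list of
    BIO tag lists. Returns a fitted linear-chain CRF."""
    X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1)
    crf.fit(X, labels)
    return crf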
As mentioned in the introduction, using speech as a description of multimedia segments makes it possible to structure and organize collections of documents from a semantic viewpoint. Our work on topic segmentation and keyword extraction over the last few years is at the core of a demonstration of multimedia navigation in a collection of TV news documents that we implemented in 2009 (see 5.2.1).