Section: New Results
Keywords : Multimedia Description.
Multimedia Document Description
The term multimedia document is broadly used and in fact covers most documents. It is increasingly appropriate, since documents are now truly multimedia and contain several media: sound, image, video, text. The description of these documents, videos for example, remains quite difficult. Research groups are often monodisciplinary, specializing in only one of these media, so the interaction between the different media of a single document is not taken into account. Nevertheless, this interaction is clearly a very rich source of information: it allows us to overcome the limitations of techniques devoted to a single medium, since those limitations vary according to the medium concerned.
Combining all the media remains a difficult objective. Each pair of media presents specific difficulties. Moreover, as few studies have been published in this domain, many of the difficulties are still to be discovered. We therefore decided to study particular combinations of media on problems of reduced size. The four ongoing studies concern the audio-video combination for both fine-grain and coarse-grain segmentation of videos, the description of documents containing text and images, and the collaboration between speech transcription and natural language processing.
This work is done in close collaboration with the METISS group of IRISA, which brings its expertise in the sound and speech media as well as in stochastic modeling.
Stochastic Models for Video Description
Our work on this topic is done in close collaboration with the METISS team of IRISA and Thomson as external partner.
Our first axis of research on multimedia description concerns the structuring of sport videos. M. Delakis's thesis was devoted to the use of Segment Models (SMs) to extract the dense structure of tennis broadcasts. Three years ago, E. Kijak's thesis showed the interest of multimodal integration for this kind of video, but the Hidden Markov Models (HMMs) used at that time lacked the needed flexibility. The aim of our work was thus to test a more general model. In order to extract individual events from a video, we investigate the use of Dynamic Bayesian Networks in the framework of S. Baghdadi's thesis.
Segment models for dense tennis broadcast structuring
The aim of this study is the automatic construction of the table of contents of a tennis broadcast. In this type of video, game rules as well as production rules define a type of tennis video syntax that can be modeled by Markovian models and parsed with dynamic programming techniques. The video is thus segmented into scenes, which are video portions that share a unique semantic content and serve as the basic building blocks of the table of contents of the video.
Motivated by the need for more efficient multimodal representations, the use of segmental features in the framework of Segment Models was proposed, instead of the frame-based features of Hidden Markov Models. Considering each scene of the video as a segment, the synchronization points between the different modalities are extended to the scene boundaries. Visual features coming from the produced broadcast video and auditory features recorded in the court view are processed before fusion in their own segments, with their own sampling rates and models. On video-only data or with synchronous audiovisual fusion, SMs demonstrated a performance improvement over HMMs, at a negligible extra computational cost. Possibilities of asynchronous fusion of auditory models with SMs were also examined, although performance did not improve in that case.
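The segment-level dynamic programming behind this approach can be sketched as follows. This is a minimal illustration, not the actual system: each candidate scene label scores a whole variable-length block of frames at once, instead of frame-by-frame as in HMM Viterbi decoding. All names and the toy scoring function are assumptions.

```python
import math

def segment_viterbi(frames, states, seg_score, init, trans, max_len):
    """Segment-model decoding: best[t][s] is the best log-score over
    segmentations of frames[:t] whose last segment is labeled s."""
    T = len(frames)
    best = [dict() for _ in range(T + 1)]
    back = [dict() for _ in range(T + 1)]
    for t in range(1, T + 1):
        for s in states:
            best[t][s], back[t][s] = -math.inf, None
            # try every admissible length l for the last segment
            for l in range(1, min(max_len, t) + 1):
                emit = seg_score(s, frames[t - l:t])
                if t - l == 0:
                    cands = [(math.log(init[s]) + emit, None)]
                else:
                    cands = [(best[t - l][p] + math.log(trans[p][s]) + emit, p)
                             for p in states]
                score, prev = max(cands, key=lambda c: c[0])
                if score > best[t][s]:
                    best[t][s], back[t][s] = score, (t - l, prev)
    # backtrack from the best final state to recover (label, start, end)
    t, s = T, max(states, key=lambda x: best[T][x])
    segments = []
    while t > 0:
        start, prev = back[t][s]
        segments.append((s, start, t))
        t, s = start, prev
    return segments[::-1]
```

Because the inner loop ranges over segment lengths, the cost is roughly `max_len` times that of frame-based Viterbi, which matches the "negligible extra computation" observed when `max_len` is small relative to the video length.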
Besides the audiovisual integration, we are also concerned with the fusion of high-level information coming from the game structure and rules, and from the score announcements. The former dictates that the solution provided by Viterbi decoding should be consistent with the tennis rules; the latter, that it must be consistent with the actual game evolution as recorded in the score announcements. The incorporation of tennis rules is easily performed by extending the flat topologies of both HMMs and SMs to hierarchical ones, in order to follow the hierarchical decomposition of a tennis game. Hierarchical topologies also allow for the detection of high-level segments in the video, such as a set or a game. Unfortunately, the number of transition probabilities in a hierarchical topology is very high, and they therefore had to be manually set to arbitrary values. For this reason, the hierarchical topologies did not outperform the flat ones.
Regarding the score announcements, their fusion as an extra feature (or observation) at the shot level (HMMs) or at the scene level (SMs) was first considered. In both cases, only a marginal performance improvement was noticed. The reason is that a score announcement may appear long after a game event, or not at all. The probability distributions of this feature thus become almost uniform, carrying no useful information. The Score-Oriented Viterbi Search was therefore introduced and used instead, in order to fully exploit the semantic content of the score announcements as well as their positions. The algorithm finds the most likely path that is consistent with the score announcements, at the expense of a computational overhead only slightly higher than standard Viterbi decoding, for both HMMs and SMs. The fusion of the score announcements in this way yielded a clear performance improvement for both HMMs and SMs, with both flat and hierarchical topologies.
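The principle of decoding under announcement constraints can be illustrated with a deliberately simplified, frame-level variant: a standard Viterbi pass in which, at frames where an announcement is observed, only the states consistent with it survive. This is only a sketch of the idea under invented states and probabilities; the actual Score-Oriented Viterbi Search is more elaborate.

```python
import math

def constrained_viterbi(obs, states, init, trans, emit, constraints):
    """Viterbi decoding where `constraints` maps a frame index to the
    list of states allowed there (e.g. states consistent with a score
    announcement heard at that moment)."""
    allowed = lambda t: constraints.get(t, states)
    V = [{s: math.log(init[s]) + math.log(emit[s][obs[0]])
          for s in allowed(0)}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in allowed(t):
            # best predecessor among the states that survived at t-1
            prev, score = max(((p, V[t - 1][p] + math.log(trans[p][s]))
                               for p in V[t - 1]), key=lambda x: x[1])
            V[t][s] = score + math.log(emit[s][obs[t]])
            back[t][s] = prev
    s = max(V[-1], key=V[-1].get)
    path = [s]
    for t in range(len(obs) - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]
```

The overhead relative to plain Viterbi is essentially the per-frame lookup of the allowed state set, consistent with the small extra cost reported above.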
Finally, we explored the idea of an SM-RNN hybrid, i.e., the use of Recurrent Neural Networks as a segmental scorer. To this end, the recently introduced LSTM topology was favorably compared to BPTT-trained RNNs and used in the hybrid. The hybrid performed visibly worse than the standard SM, but its performance remains promising. In fact, what makes the difference is that the HMM-based segmental scorers can encode prior knowledge of the task directly in their topology, while the LSTM scorers were built from scratch.
Future work first involves adding useful information to the feature set, such as the output of a player tracking process. Tennis rules dictate that the players have to change position between successive exchanges, so tracking players across exchanges can convey helpful information. We also plan to extend the SM framework to other genres of video where structure analysis is required. News broadcasts are such a genre, with a segment defined as a news story: information sources from multiple modalities, like image, sound and text, may be asynchronous with one another but, by definition, are synchronous within the boundaries of a story unit.
Dynamic Bayesian Networks for video event detection
The aim of this work is to study the additional flexibility brought by DBNs with respect to what is possible with Segment Models. The first application considered is advertisement detection. Like Segment Models, DBNs should allow us to preserve the dynamics of each modality while better taking into account the dependence between these modalities. They also make it possible to learn the correlations between random variables automatically, and thus to learn the structure of the network; however, the practical applicability of such structure learning has still to be investigated. Our first work consists in testing how the temporal aspect of our problem can be handled using DBNs.
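As a first illustration of how a DBN handles the temporal aspect, the sketch below performs exact forward inference in a minimal two-slice network for advertisement detection: one hidden state ("ad" vs. "program") with two observation streams, audio and video, modeled as conditionally independent given the hidden state. All states, symbols and probabilities are invented for illustration; the networks actually studied are richer than this HMM-like special case.

```python
def forward(obs, states, init, trans, emit_audio, emit_video):
    """Forward pass over a DBN unrolled in time: obs is a list of
    (audio_symbol, video_symbol) pairs; returns the posterior over the
    hidden state after the last time slice."""
    a0, v0 = obs[0]
    alpha = {s: init[s] * emit_audio[s][a0] * emit_video[s][v0]
             for s in states}
    for a, v in obs[1:]:
        # one time slice: transition, then factored emission
        alpha = {s: emit_audio[s][a] * emit_video[s][v]
                    * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    z = sum(alpha.values())
    return {s: alpha[s] / z for s in states}
```

The factored emission term is where the DBN pays off: each modality keeps its own model, yet the dependence between them is captured through the shared hidden state.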
Automatic Structuring and Labeling of Television Streams
Our work on this topic is done in collaboration with INA and the Metiss team (IRISA).
This topic investigates the possibility of automatically structuring large television streams. Structuring means recovering the organization of the TV stream at a relatively coarse level, e.g., extracting the succession of programs in a corpus of weeks or months of TV. This is of major interest for television archives, which are usually neither structured nor documented, making it very difficult to retrieve any information. Other applications of structuring include monitoring or consumer-oriented TV services. Commercial monitoring is already a mature subject; industrial solutions based on watermarking or fingerprinting do exist. Beyond this classical commercial monitoring application, our solution makes it possible to monitor a channel in its entirety: computing the proportion of commercials vs. programs, checking whether programs are broadcast on time, etc. This is of interest for institutions like the CSA (Conseil supérieur de l'audiovisuel), which in France monitors channels to detect potential infringements of French law.
Our previous work enabled us to segment a TV stream into programs and to tag them with labels coming from a TV program guide. The program guide provides a rough schedule of the programs to come and, despite its errors, contains very useful information for tagging the segmented stream. The link between the extracted segments and the program guide is established using a Dynamic Time Warping (DTW) algorithm, which computes a global alignment between the segmentation and the program guide.
Extensions of this simple algorithm have been proposed, in particular to take into account information coming from a shot recognition process. This recognition, which aims at detecting repeated video segments, provides ``synchronization points'' called anchors that constrain the path of the DTW algorithm, and thus offer a convenient way to include side information. A new post-processing algorithm has been designed to resolve conflicts when the labels proposed by the DTW and by the recognition process differ. It is based on a simple yet effective Bayesian decision criterion.
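The anchor-constrained alignment can be sketched as a standard DTW in which cells violating an anchor are simply forbidden. The cost function, segment labels and anchor encoding below are illustrative assumptions, not the actual system:

```python
import math

def align(segments, guide, cost, anchors=None):
    """Global DTW alignment between detected segments and program-guide
    entries; `anchors` maps a segment index to the guide index it must
    be matched with (a synchronization point from repeated-shot detection)."""
    anchors = anchors or {}
    n, m = len(segments), len(guide)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    back = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if i - 1 in anchors and anchors[i - 1] != j - 1:
                continue  # cell violates an anchor constraint: stays infeasible
            c = cost(segments[i - 1], guide[j - 1])
            best_prev = min(((D[i - 1][j - 1], (i - 1, j - 1)),
                             (D[i - 1][j],     (i - 1, j)),
                             (D[i][j - 1],     (i, j - 1))),
                            key=lambda x: x[0])
            D[i][j] = c + best_prev[0]
            back[(i, j)] = best_prev[1]
    # backtrack from (n, m) to recover the matched (segment, guide) pairs
    path, cell = [], (n, m)
    while cell != (0, 0):
        i, j = cell
        path.append((i - 1, j - 1))
        cell = back[cell]
    return path[::-1]
```

Forbidding cells rather than reweighting them keeps the alignment globally optimal among the paths that respect every anchor, which is how side information constrains rather than merely biases the result.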
The problem of updating the video database has also been investigated. Indeed, the database used in the recognition process needs to be updated in order to keep the labeling correct over time. We have shown that a naïve approach actually degrades the labeling results, due to over-segmentation caused by trailers. A more subtle approach was then proposed, which structures and, where possible, labels the new segments before including them in the database. As a result, with carefully selected segments, an automatic update of the database is possible and the results remain consistent over a three-week period.
Image and Text Joint Description
Keywords : Image-Text Interaction.
In text retrieval engines, images are not taken into account; conversely, image retrieval engines treat images independently of the surrounding text. It would clearly be better to couple these two engines or, at least, to combine the information that both media can provide.
A first way to reach this goal is to determine which parts of the text are related to the images. This could lead to textual descriptions of images, and thus to the possibility of textual queries to retrieve images, in a much richer way than currently offered by systems using simple keywords associated with the images. Moreover, it is possible to find two documents containing the same image and to use both texts to disambiguate or improve the understanding of the text.
Our first work in this framework was to build a huge corpus of news articles containing both text and images. In this corpus, each image is associated with a brief description of its content written by news agency documentalists. This corpus allowed us to test, under real-world conditions, the relevance of simple approaches that combine visual and textual evidence to propose a meaningful description of images. In our first experiments, we jointly used a face detection system and a named entity recognition system to annotate images representing a person with that person's name. Very good results were obtained, in terms of both precision and recall.
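In its most conservative form, the fusion of the two detectors can be sketched as the rule below. The inputs (a face count and a list of person entities extracted from the caption) and the function name are hypothetical; the actual system is of course richer, but this unambiguous case is what drives high precision:

```python
def annotate_image(num_faces, person_entities):
    """Propose a name for the image only when the evidence is
    unambiguous: exactly one detected face in the image and exactly
    one person named entity in the accompanying text."""
    if num_faces == 1 and len(person_entities) == 1:
        return person_entities[0]
    return None  # ambiguous evidence: abstain rather than guess
```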
Text and Speech Joint Description
Keywords : Text-Speech Interaction.
Many sound documents contain speech. Speech recognition is used to index and exploit them automatically, particularly when the data to analyze are voluminous. Automatic speech recognition (ASR) systems produce a textual transcription from sound documents, thanks to acoustic and linguistic clues. However, most current ASR systems rely more on purely statistical methods (the study of n-grams in corpora) than on linguistics. Stéphane Huet's Ph.D. thesis (under the joint supervision of Guillaume Gravier, from the METISS Project-Team, and Pascale Sébillot) aims at using Natural Language Processing (NLP) techniques to help the transcription of sound documents.
This year, after publishing a state of the art of the use of linguistic knowledge in ASR systems as an internal report, we focused on the integration of parts of speech (PoS), i.e., categorial tags identifying a word as a noun, a verb, an adjective, etc., together with morphological features such as number or conjugation, to improve the quality of the output of ASR systems. Our choice is mostly motivated by the following reasons:
PoS can be reliably obtained with automatic NLP tools. The main difficulties faced when NLP techniques are applied to spoken documents are the lack of structure, since there are no paragraphs or even sentences, and the noise of automatic transcriptions, as the hypotheses made by ASR systems may include many erroneous words. These phenomena complicate the use of high-level syntactic analysis but only slightly disturb PoS tagging. With only minor adjustments to the specificities of spoken documents, our experiments showed that more than 91% of the words of automatic transcriptions are correctly tagged, while typical results for written documents are about 95%;
PoS tagging brings interesting information to correct errors commonly made by ASR systems. For instance, in French, the plural and singular forms of many words, as well as their masculine and feminine forms, are homophones, and PoS tagging brings new information to discriminate between them;
disambiguating words according to their PoS is usually necessary before extracting other information from a text like terms or named entities.
We showed experimentally the interest of PoS for correcting transcription errors. We proposed and evaluated several approaches using PoS tag information in a post-processing stage to rescore and reorder the 100-best hypotheses produced by the METISS Project-Team's ASR system; we observed an absolute decrease of 0.9% in the word error rate when PoS knowledge is included in the score computed for each hypothesis.
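One such post-processing approach can be sketched as a simple log-linear interpolation of the ASR score with a score derived from the PoS tag sequence of each hypothesis. The data layout, the scorer and the interpolation weight below are illustrative assumptions, not the evaluated system:

```python
def rescore(nbest, pos_logprob, weight=0.5):
    """Rerank N-best ASR hypotheses. `nbest` is a list of
    (words, asr_logscore, pos_tags) triples; `pos_logprob` scores a
    PoS tag sequence (e.g. with a tag n-gram model)."""
    scored = [(asr + weight * pos_logprob(tags), words)
              for words, asr, tags in nbest]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [words for _, words in scored]
```

The point of the PoS term is precisely the homophone case mentioned above: two hypotheses may be acoustically indistinguishable, yet one of them carries a tag sequence that the tag model finds far less likely.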
We plan to further use PoS tagging to help segment the outputs of ASR systems into topically cohesive stories, which will require finding the most relevant keywords to discriminate between the different parts of a given broadcast.