Project: texmex
Section: New Results
Keywords: multimedia.
Multimedia Document Description
The term multimedia document is broadly used and in fact covers most documents. It is more and more appropriate, since documents are now truly multimedia and contain several media: sound, image, video, text. The description of such documents, videos for example, remains quite difficult. Research groups are often monodisciplinary, specialized in only one of these media, and the interaction between the different media of a same document is not taken into account. Nevertheless, this interaction is clearly a very rich source of information: it makes it possible to overcome the limitations of techniques devoted to a single medium, since these limits vary according to the medium concerned.
We propose to investigate this new aspect of multimedia document description in conjunction with other teams such as METISS and VISTA. We will follow two directions. The first concerns video, where image and sound are closely related and provide complementary information. The sound track also opens the possibility of speech recognition, which requires natural language processing in order to exploit this new modality; in this case, one of the problems is to handle the dynamics proper to each medium. The second direction concerns documents that mix text and still images, such as journals, technical manuals, and most web pages.
      
Our work on this topic is done in close collaboration with the METISS and VISTA teams of IRISA and with the Thomson company, where E. Kijak carried out most of her thesis work (cf. 7.1.1).
The aim of this work is to define a general method for describing all the media of a video, as well as their interactions. A further constraint is that the method should allow a user to formulate a task or a query concerning videos. This problem was first studied during the thesis of E. Kijak in a limited setting: the structuring of sport (and particularly tennis) reports. In such documents, there are four main sources of information: the tennis rules, which explain how a tennis game is organized; the production rules, which explain how the producer works, what tools he uses, and how he tries to reflect what is going on through formal techniques; the image track; and the sound track.
It is clear that none of these sources can explain the video alone. The goal is thus to integrate them so that their complementarity can be exploited to obtain as complete a description as possible. Three integration schemes are possible. In the late integration scheme, each medium is processed independently and the results are merged afterwards. The main difficulty of such a method lies in the merging operation: there is no guarantee of coherence between the results provided by each medium, and usually no satisfactory way to resolve this.
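As a minimal illustration of the late integration scheme, the following sketch merges per-shot label scores produced by two independent analysers; all labels, scores, and weights are hypothetical, not values from the actual system.

```python
# Late integration sketch: each medium is analysed independently and the
# resulting per-label scores are merged afterwards. Labels, scores and
# weights below are purely illustrative.

def late_fusion(image_scores, audio_scores, w_image=0.6, w_audio=0.4):
    """Merge per-label scores from two independently run analysers."""
    fused = {}
    for label in set(image_scores) | set(audio_scores):
        fused[label] = (w_image * image_scores.get(label, 0.0)
                        + w_audio * audio_scores.get(label, 0.0))
    return max(fused, key=fused.get)

# The incoherence problem: for the same shot, each medium may favour a
# different label, and the merge must arbitrate without any shared model.
image_scores = {"rally": 0.7, "replay": 0.3}
audio_scores = {"break": 0.8, "rally": 0.2}
print(late_fusion(image_scores, audio_scores))
```

The weighted sum is only one possible merging rule; the difficulty described above is precisely that no such rule restores the coherence lost by processing each medium in isolation.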
The second scheme gives a leading role to one of the modalities. We experimented with such a solution using image as the leading medium. To help characterize the shots of a tennis report, each shot was described by the presence, during the shot, of speech, applause, or ball noise. Sound is thus seen as a complementary source of information and improves the results previously obtained with images alone. This solution is not fully satisfactory, since only a small portion of the information carried by the sound track can be taken into account.
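The leading-media scheme can be sketched as follows; the labels and refinement rules are hypothetical stand-ins, shown only to illustrate how binary audio cues complement an image-based decision.

```python
# Leading-media sketch: image analysis proposes a shot label, and binary
# audio presence cues (speech, applause, ball noise) refine it. All labels
# and rules here are illustrative, not the project's actual ones.

def refine_shot_label(image_label, has_speech, has_applause, has_ball_noise):
    """Use audio presence flags to confirm or override the image label."""
    if image_label == "rally" and not has_ball_noise and has_speech:
        return "commentary"   # no ball sound but speech: likely commentary
    if image_label == "break" and has_applause:
        return "point_end"    # applause right after play
    return image_label        # otherwise keep the image-based decision

print(refine_shot_label("rally", has_speech=True,
                        has_applause=False, has_ball_noise=False))
```

The sketch makes the limitation visible: the sound track is reduced to a few presence flags, so most of the information it carries never reaches the decision.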
The third scheme we plan to study is early integration, where the different media are mixed from the beginning and all decisions are taken based on the whole stream of information.
Such integration schemes must be supported by an underlying technique able to handle them. Hidden Markov Models were chosen first, due to their good properties for representing temporal streams and their ability to encode a priori knowledge about tennis and production rules. A hierarchical model was used to represent the complete structure of a tennis report, and the Viterbi algorithm was used to identify this structure from the video stream. A problem with this model is that its time model is not flexible enough to properly handle the different time granularities of the various media.
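The Viterbi decoding step can be sketched on a toy HMM; the states, shot symbols, and probabilities below are hypothetical stand-ins for the actual tennis model, chosen only to show how the algorithm recovers a structure from a stream of shot observations.

```python
import math

# Toy HMM for illustration: three hypothetical shot states and three
# hypothetical shot symbols; probabilities are invented for the example.
states = ["rally", "replay", "break"]
trans = {"rally":  {"rally": 0.7, "replay": 0.2, "break": 0.1},
         "replay": {"rally": 0.3, "replay": 0.4, "break": 0.3},
         "break":  {"rally": 0.2, "replay": 0.3, "break": 0.5}}
emit = {"rally":  {"global_view": 0.8, "close_up": 0.1, "crowd": 0.1},
        "replay": {"global_view": 0.2, "close_up": 0.7, "crowd": 0.1},
        "break":  {"global_view": 0.1, "close_up": 0.3, "crowd": 0.6}}
init = {"rally": 0.6, "replay": 0.2, "break": 0.2}

def viterbi(obs):
    """Return the most probable state sequence for a list of shot symbols."""
    delta = {s: math.log(init[s]) + math.log(emit[s][obs[0]]) for s in states}
    psi = []                       # back-pointers, one dict per time step
    for o in obs[1:]:
        back, new_delta = {}, {}
        for s in states:
            prev = max(states, key=lambda p: delta[p] + math.log(trans[p][s]))
            back[s] = prev
            new_delta[s] = (delta[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][o]))
        psi.append(back)
        delta = new_delta
    path = [max(states, key=delta.get)]
    for back in reversed(psi):     # follow back-pointers to the start
        path.append(back[path[-1]])
    return list(reversed(path))

shots = ["global_view", "global_view", "close_up", "crowd", "global_view"]
print(viterbi(shots))
```

Note that each state emits exactly one observation per time step; this rigid one-to-one time model is precisely what limits the handling of media with different time granularities.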
We propose to use another kind of model, segment models, to circumvent this problem. In these models, each state does not correspond to a single observation, as is the case in HMMs, but can correspond to a variable number of observations. It is also possible to use several streams of observations. On the other hand, the models are more complex and their use is more costly. The use of these models is the subject of the thesis of M. Delakis.
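The key difference from an HMM can be sketched with a toy segment-model decoder, where dynamic programming runs over segment end positions rather than single observations; the states, symbols, and scoring functions below are invented for the illustration.

```python
# Toy segment-model decoding sketch: unlike an HMM state, a segment-model
# state may account for a variable-length run of observations, scored as a
# whole. All states, symbols and scores here are illustrative.

def seg_score(state, seg):
    """Score a whole segment for one state (toy: count matching symbols)."""
    sym = "a" if state == "A" else "b"
    return sum(1.0 if o == sym else -1.0 for o in seg)

def trans_cost(prev_state, state):
    """Small hypothetical cost paid at each segment boundary."""
    return -0.5

def segment_viterbi(obs, states, max_len=3):
    """Dynamic programming over segment end positions 0..len(obs)."""
    best = {0: (0.0, None, None)}   # position -> (score, prev_pos, state)
    for t in range(1, len(obs) + 1):
        cands = []
        for length in range(1, min(max_len, t) + 1):
            prev = best[t - length]
            for s in states:
                score = prev[0] + seg_score(s, obs[t - length:t])
                if prev[2] is not None:
                    score += trans_cost(prev[2], s)
                cands.append((score, t - length, s))
        best[t] = max(cands)
    segments, t = [], len(obs)
    while t > 0:                    # backtrack the chosen segmentation
        _, prev_pos, s = best[t]
        segments.append((s, prev_pos, t))
        t = prev_pos
    return list(reversed(segments))

print(segment_viterbi(list("aaabbb"), ["A", "B"]))
```

The inner loop over segment lengths is also what makes decoding more costly than in an HMM: each position considers up to `max_len` candidate segments per state instead of a single observation.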
Image and Text Joint Description
Keywords: image-text interaction.
In text retrieval engines, images are not taken into account. When an image retrieval engine exists alongside such an engine, it treats the images independently of the text surrounding them. It would of course be better to couple these two engines or, at least, the information that both media can provide.
A first way to do this is to determine which parts of the text are related to the images. This should make it possible to obtain a textual description of the images, and to formulate textual queries that retrieve images in a much richer context than that of systems using simple keywords attached to the images.
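A very simple version of this idea, associating each image in an HTML page with the surrounding text, can be sketched as follows; the window-based heuristic and all names are hypothetical, far simpler than an actual text-image relation analysis.

```python
import re

# Hypothetical sketch: pair each <img> of an HTML page with the nearby
# text, to serve as a crude textual description of the image.

def text_near_images(html, window=80):
    """Return {image src: surrounding plain text} for each <img> tag."""
    results = {}
    for m in re.finditer(r'<img[^>]*src="([^"]+)"[^>]*>', html):
        start = max(0, m.start() - window)
        end = min(len(html), m.end() + window)
        context = re.sub(r"<[^>]+>", " ", html[start:end])  # strip tags
        results[m.group(1)] = " ".join(context.split())
    return results

page = ('<p>Federer serves during the final.</p>'
        '<img src="serve.jpg"><p>The crowd applauds.</p>')
print(text_near_images(page))
```

Even this crude pairing yields a richer description than isolated keywords, since the surrounding sentences carry context that a keyword list cannot.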
Conversely, it is possible to find documents containing the same image and to use both texts to disambiguate or improve the understanding of the text. These two points constitute the thesis subject of H. Renault.