Inria / Raweb 2004
Project: METISS

Section: New Results

Keywords : audio information extraction, HMM, statistical hypothesis tests, multimedia, audiovisual integration.

Audio information extraction

Detecting simultaneous events in audio tracks

Participants : Guillaume Gravier, Michaël Betser.

One common problem in sound event detection is the presence of simultaneous, superposed events in complex auditory scenes.

To tackle this problem, we had previously proposed extending a baseline HMM-based system by adding states for all possible combinations of superposed events. As insufficient data are available to reliably estimate the state conditional probability distributions of the states that correspond to multiple events, we proposed several methods to combine models of isolated events into a model for the superimposed events [40].
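
As a rough illustration of why the extended state space is hard to train, the sketch below enumerates one composite state per non-empty combination of events. The event names and the function `composite_states` are hypothetical and only serve the example; the actual state inventory of the system is not described here.

```python
from itertools import combinations

def composite_states(events):
    """Return every non-empty combination of `events` as a sorted tuple.

    Each tuple stands for one state of the extended model, e.g.
    ("applause", "speech") is the state where both events overlap.
    """
    ordered = sorted(events)
    states = []
    for size in range(1, len(ordered) + 1):
        states.extend(combinations(ordered, size))
    return states

# Hypothetical event inventory, used only for illustration.
states = composite_states(["speech", "applause", "ball_hit"])
for state in states:
    print("+".join(state))
```

With N isolated events there are 2^N - 1 such states, most of which are rarely observed in training data, which is consistent with the need to build their models from the models of isolated events.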

In 2004, we experimented with a new approach that outperformed the previous HMM approach [23][24]. The new approach is based on a maximum a posteriori (MAP) criterion to detect the events present in a portion of the document. The sound track is first segmented into homogeneous parts, and detection is then carried out in each segment for each event of interest. The proposed MAP criterion is strongly related to statistical hypothesis tests but enables the use of a single decision threshold for all the events considered. This approach was validated on tennis broadcast sound tracks to detect events such as speech, applause or ball hits, and on broadcast news material for speech and music detection.
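
The per-segment decision can be sketched as follows. This is a minimal illustration, not the published criterion: it assumes scalar features, one Gaussian model for "event present" and one for "event absent" per event, and known priors. All names (`gaussian_loglik`, `detect_events`, `models`, `priors`) are hypothetical.

```python
import math

def gaussian_loglik(frames, mean, var):
    """Log-likelihood of scalar frames under a Gaussian (assumed model)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )

def detect_events(segment, models, priors, threshold=0.0):
    """Return the events whose posterior log-odds exceed `threshold`.

    models[event] = ((mean, var) if present, (mean, var) if absent)
    priors[event] = prior probability that the event is present
    The same threshold is applied to every event.
    """
    detected = []
    for event, (present, absent) in models.items():
        score_present = gaussian_loglik(segment, *present) + math.log(priors[event])
        score_absent = gaussian_loglik(segment, *absent) + math.log(1.0 - priors[event])
        if score_present - score_absent > threshold:
            detected.append(event)
    return detected

# Toy models and data, invented for the example.
models = {
    "speech": ((1.0, 1.0), (0.0, 1.0)),
    "music": ((5.0, 1.0), (0.0, 1.0)),
}
priors = {"speech": 0.5, "music": 0.5}
detected = detect_events([1.1, 0.9, 1.0], models, priors)
print(detected)
```

Because the priors are folded into the score, a threshold of zero corresponds to picking the more probable hypothesis for each event, which is one way to read the remark that a single threshold serves all events.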

Though effective, this approach highlighted the limitations of classical segmentation algorithms, such as the one based on the Bayesian Information Criterion, in detecting changes in complex audio scenes (e.g. changes from speech to speech+music). An approach combining hypothesis testing and HMMs [23] was studied to solve this problem but achieved the same performance as the MAP criterion.
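
For reference, a classical BIC change-point test can be sketched as below. This is the textbook formulation under simplifying assumptions (scalar features, one Gaussian per segment), not the segmentation code used in the experiments; `delta_bic`, `best_change_point` and the `penalty_weight` parameter are illustrative names.

```python
import math

def variance(xs):
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-6)

def delta_bic(window, t, penalty_weight=1.0):
    """BIC gain of splitting `window` at index t into two Gaussians.

    A positive value favours a change point at t.
    """
    n = len(window)
    left, right = window[:t], window[t:]
    gain = 0.5 * (
        n * math.log(variance(window))
        - len(left) * math.log(variance(left))
        - len(right) * math.log(variance(right))
    )
    # The split adds two free parameters (one mean, one variance).
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty

def best_change_point(window, margin=2, penalty_weight=1.0):
    """Return (index, score) of the best split, or (None, score) if none is positive."""
    candidates = range(margin, len(window) - margin + 1)
    best_t = max(candidates, key=lambda t: delta_bic(window, t, penalty_weight))
    score = delta_bic(window, best_t, penalty_weight)
    return (best_t, score) if score > 0 else (None, score)

# Toy signal with an abrupt change after the fifth frame.
signal = [0.0, 0.1, -0.1, 0.05, -0.05, 5.0, 5.1, 4.9, 5.05, 4.95]
change, score = best_change_point(signal)
print(change, round(score, 2))
```

The test compares a single-Gaussian model of the window with a two-Gaussian model. When an event is merely added on top of another (speech to speech+music), the two sides may remain statistically close, which is one plausible reason for the limitation noted above.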

The results of event detection in tennis videos are exploited for video abstracting [26] in collaboration with VISTA and for video structuring in collaboration with Tex-Mex (see below).

Using audio cues for video structuring

Participants : Guillaume Gravier, Michaël Betser, Ewa Kijak, Robert Forthofer, Stéphane Huet.

The problem of detecting highlights in (sport) videos has so far been addressed mainly from the image point of view, with some authors using audio cues to select relevant portions of the video. Based on our work on the extraction of audio information (see above), we investigated how audio information can be combined with visual information in order to structure tennis videos in terms of games, sets and points.

Previous work based on HMMs demonstrated the potential of the Markovian formalism to integrate multimodal (sound and image) information [57][20] as well as prior structural knowledge. However, this work also demonstrated the limits of this formalism, in which a single observation is associated with one state. Because of this constraint, the analysis of the different media must be synchronised so that the sequences of descriptors are sampled at exactly the same rate for each media stream. In the work of Ewa Kijak, this constraint leads to an analysis strongly driven by a shot segmentation, even though this segmentation has no meaning from the sound track point of view.

To overcome this problem, we investigate segment models, whose principle is to associate a sequence of observations, or segment, with a state of the Markov process. Such models were originally proposed for speech modeling. In this case, a state corresponds to a semantic event with its own duration, modeled at the state level, and with which a model is associated in order to compute the probability of a sequence of observations.
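
To make the principle concrete, the sketch below decodes a sequence with a generic segment model: each state emits a whole segment, scored by a duration term plus a segment-level likelihood. It is an illustrative dynamic program under simple assumptions (uniform initial state, user-supplied scoring functions), not the system described here; every function and parameter name is hypothetical.

```python
import math

def segmental_viterbi(observations, states, transition_logp,
                      duration_logp, segment_logp, max_duration):
    """Return the best segmentation as a list of (state, start, end)."""
    n = len(observations)
    # best[t][s]: best log score of observations[:t] with last segment in state s
    best = [{s: -math.inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_duration, t) + 1):
                start = t - d
                local = duration_logp(s, d) + segment_logp(s, observations[start:t])
                if start == 0:
                    candidates = [(local, None)]  # first segment, no transition
                else:
                    candidates = [
                        (best[start][p] + transition_logp(p, s) + local, p)
                        for p in states
                    ]
                score, prev = max(candidates, key=lambda c: c[0])
                if score > best[t][s]:
                    best[t][s] = score
                    back[t][s] = (start, prev)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[n][s])
    t, segments = n, []
    while t > 0:
        start, prev = back[t][state]
        segments.append((state, start, t))
        t, state = start, prev
    segments.reverse()
    return segments

# Toy setup: two states with Gaussian segment scores (unit variance).
means = {"low": 0.0, "high": 5.0}

def toy_segment_logp(state, seg):
    return sum(-0.5 * (x - means[state]) ** 2 - 0.5 * math.log(2 * math.pi)
               for x in seg)

segments = segmental_viterbi(
    [0.1, -0.1, 0.0, 5.0, 5.1],
    ["low", "high"],
    transition_logp=lambda p, s: -1.0,  # flat cost for opening a new segment
    duration_logp=lambda s, d: 0.0,     # uninformative duration model
    segment_logp=toy_segment_logp,
    max_duration=4,
)
print(segments)
```

Since `segment_logp` receives the whole segment, it can be any sequence model, and nothing forces the different observation streams of a segment to share a sampling rate.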

This framework was exploited for multimodal tennis video structuring with several, possibly asynchronous, sequences of observations per state. The state conditional probability of a sequence of visual descriptors is given by an HMM as in our previous work. However, the state conditional probability of a sequence of audio events is given by a bigram model, which makes it possible to take into account the dynamics of audio events. Preliminary results showed significant improvements over the previous HMM approach. (This is joint work with TEXMEX and is the prime focus of the PhD of Manolis Delakis, co-directed by Patrick Gros (TEXMEX), Pascale Sébillot (TEXMEX) and Guillaume Gravier (METISS).)
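
A bigram model over audio event labels can be sketched as follows. The add-one smoothing, the start symbol `<s>`, the class name `BigramModel` and the toy training sequences are assumptions made for the example; the report does not specify how the bigram probabilities are estimated.

```python
import math
from collections import defaultdict

class BigramModel:
    """Bigram model over event labels with additive smoothing (assumed)."""

    def __init__(self, vocabulary, smoothing=1.0):
        self.vocabulary = list(vocabulary)
        self.smoothing = smoothing
        self.counts = defaultdict(lambda: defaultdict(float))

    def fit(self, sequences):
        """Count label pairs, using "<s>" as the start-of-sequence context."""
        for seq in sequences:
            prev = "<s>"
            for label in seq:
                self.counts[prev][label] += 1.0
                prev = label
        return self

    def log_prob(self, seq):
        """Log-probability of a label sequence under the smoothed bigram."""
        total, prev = 0.0, "<s>"
        for label in seq:
            row = self.counts.get(prev, {})
            numerator = row.get(label, 0.0) + self.smoothing
            denominator = sum(row.values()) + self.smoothing * len(self.vocabulary)
            total += math.log(numerator / denominator)
            prev = label
        return total

# Hypothetical audio event sequences for a "rally" state.
rally = BigramModel(["ball_hit", "applause", "speech"]).fit([
    ["ball_hit", "ball_hit", "applause"],
    ["ball_hit", "ball_hit", "ball_hit", "applause"],
])
print(rally.log_prob(["ball_hit", "applause"]))
print(rally.log_prob(["applause", "ball_hit"]))
```

One such model per state would score the order in which audio events occur within a segment, so that, in this toy example, ball hits followed by applause is more likely than the reverse.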

More general data structures and more elaborate modeling strategies are currently being studied in the framework of two PhDs in their early stages.

