Project : metiss
Section: New Results
Audio information extraction
Detecting simultaneous events in sport broadcast sound tracks
A common problem in sound event detection is the presence of simultaneous, superposed events in complex auditory scenes.
To tackle this problem, we extended our baseline system, based on ergodic hidden Markov modeling, by adding states for all possible combinations of superposed events. As insufficient data is available to reliably estimate the state-conditional probability distributions of these multiple-event states, we proposed several model combination techniques, namely convolution and concatenation, to derive a model of the superposed events from the models of the isolated events.
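As an illustration, the convolution-based combination can be sketched for the simple case of Gaussian state distributions: the density of the sum of two independent Gaussian sources is again Gaussian, with means and variances adding, so a superposed-event state model can be derived with no extra training data. The event names and parameter values below are purely hypothetical:

```python
import itertools

def convolve_gaussians(m1, v1, m2, v2):
    # Convolution of two independent Gaussians: means and variances add.
    return m1 + m2, v1 + v2

# Hypothetical single-event state models: (mean, variance) of a scalar feature.
event_models = {
    "applause": (2.0, 0.5),
    "ball_hit": (5.0, 1.0),
    "speech": (1.0, 0.3),
}

# Derive a model for every pair of superposed events from the isolated-event models.
superposed_models = {}
for (a, (ma, va)), (b, (mb, vb)) in itertools.combinations(event_models.items(), 2):
    superposed_models[f"{a}+{b}"] = convolve_gaussians(ma, va, mb, vb)
```

Real systems would apply this per mixture component and per feature dimension, but the principle of composing multiple-event states from isolated-event models is the same.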
More recently, a new approach was developed that outperforms the previously used Viterbi-based approach. This method relies on statistical hypothesis tests to detect the presence or absence of an event, similarly to what is done in speaker tracking and verification. The sound track is first segmented into homogeneous chunks, and detection is carried out in each segment for each event of interest. It was shown experimentally that using two models, i.e. an event model and a non-event model, can lead to better results than the maximum-likelihood Viterbi-based approach. However, the segmentation of complex audio scenes into homogeneous chunks was observed to be poor, causing errors in event detection; detection is near perfect when a perfect manual segmentation is used instead. We are currently studying solutions that combine the advantages of model-based segmentation (as in the HMM approach) with statistical hypothesis tests.
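The per-segment test can be sketched as a log-likelihood ratio between the event and non-event models, accepted when it exceeds a threshold. This is a minimal illustration with toy single-Gaussian models over scalar features; all model parameters and segment values are hypothetical:

```python
import math

def log_likelihood(segment, model):
    # Gaussian log-likelihood summed over the frames of a segment (toy model).
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in segment)

def detect_event(segment, event_model, non_event_model, threshold=0.0):
    # Accept the event hypothesis when the log-likelihood ratio exceeds the threshold.
    llr = log_likelihood(segment, event_model) - log_likelihood(segment, non_event_model)
    return llr > threshold

applause_model = (3.0, 1.0)    # hypothetical trained event model
background_model = (0.0, 1.0)  # hypothetical non-event model
segment = [2.8, 3.1, 2.9, 3.3]
```

In practice both models would be Gaussian mixtures over cepstral features and the threshold tuned on development data, but the decision rule is the same.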
Using audio cues for video structuring
The problem of detecting highlights in (sport) videos has so far been addressed mainly from the image point of view, with some authors using audio cues to select relevant portions of the video. Based on our work on the extraction of audio information (see above), we investigated how the latter can be combined with visual information in order to detect highlights in a tennis video or to structure the video in terms of games, sets and points. Our approach is based on tracking broad sound classes in the video sound track, such as applause, ball hits (i.e. tennis noise), music or speech.
In the case of highlight detection, potentially relevant shots in the video are first selected based on movement information. These candidate shots are then filtered, based on the audio characterization of each shot, to retain only the most interesting ones. The filtering heuristics, of the type 'the shot contains ball hits and is followed by applause', were derived from human knowledge.
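A rule of this type can be expressed directly over shot descriptors. The sketch below assumes a hypothetical representation in which each candidate shot carries the sound classes detected in it and in the following shot; the rule itself is the one quoted above:

```python
# Hypothetical candidate-shot descriptors after audio characterization.
shots = [
    {"id": 1, "classes": {"speech"}, "next_classes": {"speech"}},
    {"id": 2, "classes": {"ball_hit"}, "next_classes": {"applause"}},
    {"id": 3, "classes": {"music"}, "next_classes": {"speech"}},
]

def is_highlight(shot):
    # "The shot contains ball hits and is followed by applause."
    return "ball_hit" in shot["classes"] and "applause" in shot["next_classes"]

highlights = [s["id"] for s in shots if is_highlight(s)]
```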
In the case of tennis video structuring, audio cues are integrated as an additional observation stream in a parser based on hidden Markov models. The HMM parser models the syntax of a tennis game in terms of points and sets and classifies shots accordingly. When using video attributes only, shots are represented by their similarity to a reference key frame and by their duration. Audio information was incorporated into this framework by adding a third observation stream consisting of a vector describing which sound classes are present in the shot. At training time, the observation probability of each sound class is estimated for each state. This strategy allowed a significant increase of the shot classification rate, provided that a good detection of the audio events is available. However, the performance increase is less significant with automatic audio event detection, and audiovisual integration methods more robust to audio event detection errors must be devised in the future.
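The audio observation stream can be sketched as a vector of binary sound-class presence flags, scored per HMM state as a product of per-class Bernoulli probabilities estimated at training time. The state name and probability values below are hypothetical:

```python
def audio_stream_prob(presence, bernoulli_params):
    # P(audio vector | state): product of per-class Bernoulli probabilities,
    # one parameter per sound class, estimated from labelled training shots.
    p = 1.0
    for cls, present in presence.items():
        q = bernoulli_params[cls]
        p *= q if present else (1.0 - q)
    return p

# Hypothetical parameters for a "point" state: ball hits and applause are
# likely during a point, music is not.
point_state = {"applause": 0.8, "ball_hit": 0.9, "music": 0.05}
obs = {"applause": True, "ball_hit": True, "music": False}
p = audio_stream_prob(obs, point_state)
```

In the full parser this probability would be multiplied with the key-frame-similarity and duration stream probabilities to form the state emission probability used by the HMM.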
This work was done in collaboration with Thomson Multimedia as well as with other teams at IRISA, namely VISTA and TEXMEX, in the framework of the Domus Videum project financed by the Réseau National de la Recherche en Télécommunication.