Team MISTIS


Section: New Results

Mixture models

Taking into account the curse of dimensionality.

Participant : Stéphane Girard.

Joint work with: Bouveyron, C. (Université Paris 1) and Celeux, G. (Select, INRIA).

In the PhD work of Charles Bouveyron (co-advised by Cordelia Schmid from the INRIA team LEAR) [43], we propose new Gaussian models of high-dimensional data for classification purposes. We assume that the data live in several groups located in subspaces of lower dimension. From this assumption, two different strategies arise.

This modelling yields a new supervised classification method called HDDA, for High Dimensional Discriminant Analysis [3]. Some versions of this method have been tested on the supervised classification of objects in images. The approach has also been adapted to the unsupervised classification framework, leading to a method named HDDC, for High Dimensional Data Clustering [2]. In collaboration with Gilles Celeux and Charles Bouveyron, we are currently working on the automatic selection of the discrete parameters of the model. Another part of the work of Charles Bouveyron and Stéphane Girard consists in extending this approach to the semi-supervised context and to the presence of label noise.
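To illustrate the underlying idea (a simplified sketch, not the exact HDDA parametrization of [3]), the following Python/NumPy code models each class by a Gaussian whose covariance keeps its top-d eigenvalues and replaces the remaining ones by their average, i.e. isotropic noise outside the class-specific subspace. The function names and the subspace dimension d are illustrative assumptions.

```python
import numpy as np

def fit_hdda_like(X, y, d=2):
    """Fit a simplified HDDA-style model: per class, keep the top-d
    eigenvalues of the empirical covariance and replace the remaining
    ones by their average (isotropic noise in the orthogonal complement)."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)
        w, V = np.linalg.eigh(cov)            # ascending eigenvalues
        w, V = w[::-1], V[:, ::-1]            # reorder: descending
        b = w[d:].mean()                      # shared noise level outside the subspace
        w_reg = np.concatenate([w[:d], np.full(len(w) - d, b)])
        models[c] = (mu, V, w_reg, len(Xc) / len(X))
    return models

def predict(models, X):
    """Classify by the largest Gaussian log-posterior under each class model."""
    labels = sorted(models)
    scores = []
    for c in labels:
        mu, V, w, prior = models[c]
        Z = (X - mu) @ V                      # coordinates in the eigenbasis
        logp = -0.5 * ((Z**2 / w).sum(axis=1) + np.log(w).sum()) + np.log(prior)
        scores.append(logp)
    return np.array(labels)[np.argmax(scores, axis=0)]
```

Regularizing the spectrum this way is what makes the model usable when the ambient dimension is large relative to the number of samples per class: only d directions per class are estimated freely, the rest share a single noise parameter.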

Audio-visual object localization using binaural and binocular cues

Participants : Florence Forbes, Vasil Khalidov.

Joint work with: Arnaud, E., Hansard, M., Horaud, R. and Narasimha, R. from the INRIA team Perception.

This work takes place in the context of the POP European project (see Section 8.3) and includes further collaborations with researchers from the University of Sheffield, UK. The context is that of multi-modal sensory signal integration, and we focus on audio-visual integration. Fusing information from audio and video sources has improved performance in applications such as tracking. However, cross-modal integration is not trivial and requires some cognitive modelling, because at a lower level there is no obvious way to associate depth and sound sources. Combining expertise from team Perception and the University of Sheffield, we address the difficult problem of integrating spatial and temporal audio-visual stimuli in a geometric and probabilistic framework, and attack the problem of associating sensory descriptions with representations of prior knowledge.

Geometric and probabilistic fusion of spatial visual and auditory cues. We first explain how spatial visual and auditory cues can be combined in a geometric and probabilistic framework, in order to detect and localize objects in a scene that are both seen and heard. To do so, we use binaural and binocular sensors to gather auditory and visual observations. We show that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. The proposed probabilistic generative model captures the relations between audio and visual observations and maps the data into a common audio-visual 3D representation via a pair of mixture models. The statistical method of choice for solving this problem is cluster analysis. We rely on low-level audio and video features, which makes our model more general and less dependent on supervised learning techniques such as face and speech detectors.

The input data consist of M visual observations f = {f_1, ..., f_m, ..., f_M} and K auditory observations g = {g_1, ..., g_k, ..., g_K}. These data are recorded over a time interval [t_1, t_2] that is short enough to ensure that the audio-visual (AV) objects responsible for f and g are effectively stationary in space. We then address the estimation of the AV object sites S = {s_1, ..., s_n, ..., s_N}, where each s_n is described by its 3D coordinates (x_n, y_n, z_n)^T; note that in general N is unknown. A visual observation f_m is a 3D binocular coordinate (u_m, v_m, d_m)^T, where u and v denote the 2D location in the Cyclopean image and the scalar d denotes the binocular disparity at (u, v)^T. Hence, Cyclopean coordinates (u, v, d)^T are associated with each point s = (x, y, z)^T in the visible scene, and we define a function F: R^3 → R^3 that maps S onto f. An auditory observation g_k is represented by an auditory disparity, namely the interaural time difference (ITD). To relate a location to an ITD value, we define a function G: R^3 → R that maps S onto g. Given an observed ITD, we can deduce the surface that should contain the source.
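The two mappings F and G can be sketched as follows, assuming a rectified pinhole stereo pair with unit focal length and a simple two-microphone model. The baseline, microphone positions, and speed of sound used here are illustrative assumptions, not the calibration actually used in the project.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed nominal value

def cyclopean_map(s, baseline=0.1):
    """F: R^3 -> R^3. Map a 3D point s = (x, y, z) to Cyclopean
    coordinates (u, v, d) under a rectified pinhole model with
    unit focal length (an illustrative simplification)."""
    x, y, z = s
    u, v = x / z, y / z           # 2D location in the Cyclopean image
    d = baseline / z              # binocular disparity
    return np.array([u, v, d])

def itd_map(s, mic_left, mic_right):
    """G: R^3 -> R. Interaural time difference: difference of the
    travel times from the source s to the two microphones."""
    s = np.asarray(s, dtype=float)
    return (np.linalg.norm(s - mic_left)
            - np.linalg.norm(s - mic_right)) / SPEED_OF_SOUND
```

With this definition of G, a fixed ITD value constrains the source to one sheet of a hyperboloid whose foci are the two microphones, which is the surface containing the source mentioned above.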

We address the problem of AV localization in the framework of unsupervised clustering. The rationale is that observations form groups corresponding to the different AV objects in the scene, so the problem is recast as a clustering task: each observation must be assigned to one of the clusters, and the cluster parameters must be estimated, including the N 3D positions s_n of the AV objects. To account for observations that are not related to any AV object, we introduce an additional background (outlier) class. Because of the different nature of the observations, clustering is performed via two mixture models, respectively in the audio (1D) and video (3D) observation spaces, subject to the common parametrization provided by the positions s_n. The next step is to devise a procedure that finds the best values for the assignments and for the parameters. One possibility is to use a version of the EM algorithm, as explained below.

Development of statistical methods for cross-modal integration. Given the probabilistic model defined above, we wish to determine the AV objects that generated the visual and auditory observations, that is, to derive the values of the assignment vectors together with the AV object position vectors S (which are part of our model's unknown parameters). Direct maximum likelihood estimation of mixture models is usually difficult due to the missing assignments. The Expectation Maximization (EM) algorithm is a general and now standard approach to likelihood maximization in missing-data problems. In our specific context, difficulties arise from the need to perform simultaneous optimization in two different observation spaces, auditory and visual: the maximization step involves solving a system of non-linear equations with no closed-form solution, so the traditional EM algorithm cannot be applied. As an alternative, we considered instances of the Generalized EM (GEM) algorithm, which is more flexible and provided good results in our experiments. This work has been published in the ICMI'08 conference [36], where more details as well as experiments can be found.
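One GEM iteration under such a model can be sketched as follows. This is a minimal illustration, with hypothetical noise levels and a generic numerical optimizer standing in for the actual update rule of [36]: the E-step computes responsibilities in both observation spaces (plus a uniform outlier class), and the generalized M-step merely improves each position s_n numerically instead of maximizing in closed form.

```python
import numpy as np
from scipy.optimize import minimize

def gem_step(S, f, g, F, G, sig_f=0.05, sig_g=1e-4, outlier=1e-3):
    """One Generalized EM iteration for audio-visual clustering.
    S: (N, 3) current object positions; f: (M, 3) visual observations;
    g: (K,) auditory (ITD) observations; F, G: the forward maps.
    sig_f, sig_g, outlier are illustrative noise/outlier constants."""
    N = len(S)
    # E-step, visual space: responsibilities (M x N+1), last column = outliers
    rf = np.empty((len(f), N + 1))
    for n in range(N):
        rf[:, n] = np.exp(-0.5 * ((f - F(S[n]))**2).sum(axis=1) / sig_f**2)
    rf[:, N] = outlier
    rf /= rf.sum(axis=1, keepdims=True)
    # E-step, auditory space: responsibilities (K x N+1)
    rg = np.empty((len(g), N + 1))
    for n in range(N):
        rg[:, n] = np.exp(-0.5 * (g - G(S[n]))**2 / sig_g**2)
    rg[:, N] = outlier
    rg /= rg.sum(axis=1, keepdims=True)
    # Generalized M-step: improve (not fully maximize) each s_n, since the
    # non-linear maps F and G rule out a closed-form update
    def neg_q(s, n):
        return (0.5 * rf[:, n] @ ((f - F(s))**2).sum(axis=1) / sig_f**2
                + 0.5 * rg[:, n] @ (g - G(s))**2 / sig_g**2)
    S_new = np.array([minimize(neg_q, S[n], args=(n,),
                               method="Nelder-Mead",
                               options={"maxiter": 50}).x
                      for n in range(N)])
    return S_new, rf, rg
```

Because each generalized M-step only needs to increase the expected complete-data log-likelihood, a few iterations of a derivative-free optimizer per position are enough to retain EM's monotonicity property while avoiding the intractable joint system of equations.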

