Section: New Results
Mixture models
Taking into account the curse of dimensionality.
Participant : Stéphane Girard.
Joint work with: Bouveyron, C. (Université Paris 1) and Celeux, G. (Select, INRIA).
In the PhD work of Charles Bouveyron (co-advised by Cordelia Schmid from the INRIA team LEAR) [43], we propose new Gaussian models of high-dimensional data for classification purposes. We assume that the data live in several groups located in subspaces of lower dimension. Two different strategies arise:

the introduction into the model of a dimension-reduction constraint specific to each group,

the use of parsimonious models obtained by requiring different groups to share the same values of some parameters.
This modelling yields a new supervised classification method called HDDA, for High Dimensional Discriminant Analysis [3]. Several versions of this method have been tested on the supervised classification of objects in images. The approach has also been adapted to the unsupervised classification framework; the related method is named HDDC, for High Dimensional Data Clustering [2]. In collaboration with Gilles Celeux and Charles Bouveyron, we are currently working on the automatic selection of the discrete parameters of the model. Another part of the work of Charles Bouveyron and Stéphane Girard consists in extending this approach to the semi-supervised context and to the presence of label noise.
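The class-conditional model underlying this family of methods can be illustrated by a minimal sketch (an illustrative simplification, not the full parametrization of [3]): each class is fitted by a Gaussian whose variance is free along d principal axes and pooled into a single noise level in the orthogonal complement, and a new point is assigned to the class with the smallest resulting cost. All function names below are hypothetical.

```python
import numpy as np

def fit_hdda_class(X, d):
    """Fit one class of an HDDA-like model (hedged sketch): the class is
    assumed to live near a d-dimensional subspace, with variance a_j along
    the j-th principal axis and a pooled noise level b in the orthogonal
    complement."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / len(X)                 # empirical covariance
    vals, vecs = np.linalg.eigh(cov)         # eigh returns ascending order
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    a = vals[:d]                             # signal variances
    p = X.shape[1]
    b = vals[d:].mean() if d < p else 1e-8   # pooled noise variance
    return mu, vecs[:, :d], a, b

def hdda_cost(x, mu, Q, a, b, log_prior):
    """Per-class cost (negative log-likelihood up to an additive constant);
    the class minimizing this cost is chosen."""
    dev = x - mu
    proj = Q.T @ dev                         # coordinates in the subspace
    resid = dev - Q @ proj                   # orthogonal residual
    p, d = len(x), len(a)
    return (np.sum(proj**2 / a) + np.sum(resid**2) / b
            + np.sum(np.log(a)) + (p - d) * np.log(b) - 2 * log_prior)
```

Classification then amounts to fitting one such model per class and comparing costs; the same class-conditional model, embedded in an EM loop, gives the clustering counterpart.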
Audiovisual object localization using binaural and binocular cues
Participants : Florence Forbes, Vasil Khalidov.
Joint work with: Arnaud, E., Hansard, M., Horaud, R. and Narasimha, R. from the INRIA team Perception.
This work takes place in the context of the POP European project (see Section 8.3) and includes further collaborations with researchers from the University of Sheffield, UK. The context is that of multimodal sensory signal integration, and we focus on audiovisual integration. Fusing information from audio and video sources has resulted in improved performance in applications such as tracking. However, cross-modal integration is not trivial and requires some cognitive modelling, because at a low level there is no obvious way to associate depth and sound sources. Combining expertise from team Perception and the University of Sheffield, we address the difficult problem of integrating spatial and temporal audiovisual stimuli using a geometric and probabilistic framework, and attack the problem of associating sensorial descriptions with representations of prior knowledge.
Geometric and probabilistic fusion of spatial visual and auditory cues. We first explain how spatial visual and auditory cues can be combined within a geometric and probabilistic framework, in order to detect and localize objects in a scene that are both seen and heard. To this end, we use binaural and binocular sensors to gather auditory and visual observations. We show that the detection and localization problem can be recast as the task of clustering the audiovisual observations into coherent groups; the statistical method of choice is therefore cluster analysis. The proposed probabilistic generative model captures the relations between audio and visual observations and maps the data into a common audiovisual 3D representation via a pair of mixture models. We rely on low-level audio and video features, which makes our model more general and less dependent on supervised learning techniques such as face and speech detectors.

The input data consist of M visual observations f = { f_{1}, ..., f_{m}, ..., f_{M}} and K auditory observations g = { g_{1}, ..., g_{k}, ..., g_{K}} . These data are recorded over a time interval [ t_{1}, t_{2}] that is short enough to ensure that the audiovisual (AV) objects responsible for f and g are effectively stationary in space. We then address the estimation of the AV object sites S = { s_{1}, ..., s_{n}, ..., s_{N}} , where each s_{n} is described by its 3D coordinates ( x_{n}, y_{n}, z_{n}) ^{T} . Note that in general N is unknown. A visual observation f_{m} is a 3D binocular coordinate ( u_{m}, v_{m}, d_{m}) ^{T} , where u and v denote the 2D location in the Cyclopean image and the scalar d denotes the binocular disparity at ( u, v) ^{T} . Hence, Cyclopean coordinates ( u, v, d) ^{T} are associated with each point s = ( x, y, z) ^{T} in the visible scene, and we define a function F: R^{3} → R^{3} that maps S onto f .
An auditory observation g_{k} is represented by an auditory disparity, namely the interaural time difference (ITD). To relate a 3D location to an ITD value, we define a function G: R^{3} → R that maps S onto g . Given an observed ITD, we can deduce the surface that should contain the sound source.
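The two mappings F and G can be sketched as follows. This is a hedged illustration, not the calibrated model used in the project: the rectified unit-focal-length camera pair, the baseline value, the microphone positions and the speed of sound are all assumptions introduced for the example.

```python
import numpy as np

def F(s, baseline=0.15):
    """Map a 3D point s = (x, y, z) to Cyclopean coordinates (u, v, d).
    Sketch for a rectified camera pair with unit focal length; the
    baseline value (meters) is an illustrative assumption."""
    x, y, z = s
    u, v = x / z, y / z          # 2D location in the Cyclopean image
    d = baseline / z             # binocular disparity at (u, v)
    return np.array([u, v, d])

def G(s, mic_left, mic_right, c=343.0):
    """Map a 3D point to an interaural time difference (ITD, in seconds):
    the difference of propagation times to the two microphones, with
    c the speed of sound in air (m/s)."""
    return (np.linalg.norm(s - mic_left) - np.linalg.norm(s - mic_right)) / c
```

Under this sketch, a point on the median plane between the microphones yields ITD = 0, and any fixed ITD value only constrains the source to a surface, which is why a single auditory cue cannot localize a source on its own.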
We address the problem of AV localization in the framework of unsupervised clustering. The rationale is that observations form groups corresponding to the different AV objects in the scene. The problem is thus recast as a clustering task: each observation must be assigned to one of the clusters, and the cluster parameters, which include the N 3D positions s_{n} of the AV objects, must be estimated. To account for observations that are not related to any AV object, we introduce an additional background (outlier) class. Because of the different nature of the observations, clustering is performed via two mixture models, in the audio (1D) and video (3D) observation spaces respectively, subject to the common parametrization provided by the positions s_{n} . The next step is to devise a procedure that finds the best values for the assignments and for the parameters. One possibility is to use a version of the EM algorithm, as explained below.
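The assignment step with a background class can be sketched in a few lines. This is a hedged illustration under simplifying assumptions (isotropic Gaussian components, a uniform outlier density whose value the report does not specify); the function name is hypothetical.

```python
import numpy as np

def responsibilities_with_outlier(obs, means, sigma, outlier_density, priors=None):
    """E-step responsibilities for a mixture of N isotropic Gaussian
    clusters plus one uniform background (outlier) class.
    `outlier_density` is the constant density of the uniform class over
    the observation volume (an assumption made for this sketch)."""
    M, D = obs.shape
    N = means.shape[0]
    if priors is None:
        priors = np.full(N + 1, 1.0 / (N + 1))   # equal weights, incl. outlier
    d2 = ((obs[:, None, :] - means[None, :, :])**2).sum(-1)          # (M, N)
    gauss = np.exp(-0.5 * d2 / sigma**2) / (2*np.pi*sigma**2)**(D/2)
    dens = np.concatenate([gauss * priors[:N],
                           np.full((M, 1), priors[N] * outlier_density)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)                    # (M, N+1)
```

Observations far from every predicted cluster mean receive most of their responsibility from the last (outlier) column, so they barely influence the estimated positions.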
Development of statistical methods for cross-modal integration. Given the probabilistic model defined above, we wish to determine the AV objects that generated the visual and auditory observations, that is, to derive the values of the assignment vectors together with the AV object position vectors S (which are part of our model's unknown parameters). Direct maximum likelihood estimation of mixture models is usually difficult, due to the missing assignments. The Expectation Maximization (EM) algorithm is a general and now standard approach to maximizing the likelihood in missing-data problems. In our specific context, difficulties arise from the need to perform simultaneous optimization in two different observation spaces, auditory and visual. The maximization step involves solving a system of nonlinear equations with no closed-form solution, so the traditional EM algorithm cannot be applied directly. As an alternative, we considered instances of the Generalized EM (GEM) algorithm, which is more flexible and provided good results in our experiments. This work has been published in the ICMI'08 conference [36], where more details as well as experiments can be found.
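The GEM idea can be sketched as follows: the E-steps are exact in each observation space, while the M-step is replaced by any update that merely increases the expected complete-data log-likelihood. Here we use a single numerical-gradient ascent step on the positions, which is one possible GEM instance, not the specific update of [36]; the noise widths, step size and trivial test mappings are assumptions.

```python
import numpy as np

def e_step(obs, means, sigma):
    """Responsibilities of each cluster under isotropic Gaussians of
    common width sigma and equal priors (outlier class omitted here)."""
    d2 = ((obs[:, None, :] - means[None, :, :])**2).sum(-1)
    logp = -0.5 * d2 / sigma**2
    logp -= logp.max(axis=1, keepdims=True)        # numerical stability
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)

def gem_iteration(f, g, S, Fmap, Gmap, sig_f, sig_g, lr=1e-4, eps=1e-6):
    """One Generalized-EM iteration: exact E-steps in the visual (3D) and
    auditory (1D) observation spaces, then a single gradient-ascent step
    on the positions S in place of the intractable closed-form M-step.
    The step size lr is a tuning assumption for this sketch."""
    Fm = np.array([Fmap(s) for s in S])            # predicted visual means
    Gm = np.array([[Gmap(s)] for s in S])          # predicted ITD means
    rf = e_step(f, Fm, sig_f)                      # (M, N) visual resp.
    rg = e_step(g[:, None], Gm, sig_g)             # (K, N) auditory resp.

    def q(S_flat):
        """Expected complete-data log-likelihood, up to constants."""
        S_ = S_flat.reshape(S.shape)
        Fm_ = np.array([Fmap(s) for s in S_])
        Gm_ = np.array([Gmap(s) for s in S_])
        qf = -(rf * ((f[:, None, :] - Fm_[None])**2).sum(-1)).sum() / (2 * sig_f**2)
        qg = -(rg * (g[:, None] - Gm_[None, :])**2).sum() / (2 * sig_g**2)
        return qf + qg

    S_flat = S.ravel().astype(float)
    base = q(S_flat)
    grad = np.zeros_like(S_flat)
    for i in range(len(S_flat)):                   # forward-difference gradient
        S_p = S_flat.copy()
        S_p[i] += eps
        grad[i] = (q(S_p) - base) / eps
    return (S_flat + lr * grad).reshape(S.shape), base
```

Because the positions s_{n} parametrize both mixtures through the nonlinear maps, any update that increases q is admissible; iterating such partial M-steps retains the monotone-likelihood property that motivates GEM over plain EM here.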