## Section: New Results

### Source separation

#### Source separation using multichannel Matching Pursuit

Keywords: underdetermined blind source separation, multichannel, linear instantaneous, Matching Pursuit, sparse decomposition.

Participants: Sylvain Lesage, Sacha Krstulovic, Rémi Gribonval.

The source separation problem consists in retrieving unknown signals (the sources) from the sole knowledge of one or more mixtures of these signals (the channels coming from each sensor). In the case we study, each channel is a linear combination of the sources, there are more sources than channels, and there are at least two channels. Due to the underdeterminacy of the problem, knowing all the parameters of the mixing process is not sufficient to retrieve the sources. Focusing on the estimation of the sources, assuming the mixing process is known, we have studied separation methods based on sparse decomposition of the mixture with Matching Pursuit. Methods for estimating the mixing parameters are developed separately (see next section).

Last year we concentrated [62] on methods based on differences in spatial direction between sources, assuming the source signals can be sparsely decomposed on a joint dictionary. This year, we explored the possibility of simultaneously exploiting spatial and “morphological” differences, by choosing a distinct dictionary to sparsely model each source signal, in the spirit of [55]. For sources which can be modeled sparsely in sufficiently distinct domains (e.g., drums and electric guitar), our experiments showed that this approach can drastically improve separation performance. While learning appropriate dictionaries for each source from training data is straightforward, training adapted dictionaries from the sole knowledge of the mixture remains a challenge.
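The greedy sparse decomposition step underlying these methods can be sketched as follows. This is a minimal single-channel Matching Pursuit over an arbitrary dictionary, not our multichannel implementation; the dictionary is simply a matrix of unit-norm columns, and all names are illustrative:

```python
import numpy as np

def matching_pursuit(x, D, n_iter=100):
    """Greedy sparse decomposition of signal x over dictionary D.

    D: (dim, n_atoms) matrix with unit-norm columns (the atoms).
    Returns the coefficient vector and the final residual, so that
    D @ coeffs + residual == x.
    """
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_iter):
        correlations = D.T @ residual          # inner products with all atoms
        k = np.argmax(np.abs(correlations))    # best-matching atom
        coeffs[k] += correlations[k]           # accumulate its weight
        residual -= correlations[k] * D[:, k]  # remove its contribution
    return coeffs, residual
```

Because each iteration subtracts the best-correlated atom, the residual energy decreases monotonically; for a signal that is a combination of a few atoms, the decay is fast in practice.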

This work has been presented in a workshop.

#### DEMIX anechoic: a robust algorithm to estimate the number of sources in a spatial anechoic mixture

Keywords: underdetermined source separation, multichannel, linear instantaneous, clustering, source localisation.

Participants: Simon Arberet, Rémi Gribonval, Frédéric Bimbot.

An important step for audio source separation consists in finding both the number of mixed sources and their directions in a multisensor mixture.

In complement to the separation methods based on Matching Pursuit, which we developed and evaluated assuming the mixing matrix is known, we proposed last year a robust technique to address this problem for linear instantaneous mixtures [1], even with more sources than sensors. This year, we extended the approach to the more realistic setting of linear anechoic mixtures (where the mixture involves not only intensity differences but also time delays between channels).

The method relies on the assumption that, in the neighborhood of some time-frequency points, only one source contributes to the mixture. Such time-frequency points, located with a local confidence measure, provide estimates of the attenuation of the corresponding source, as well as of its phase difference at each frequency. By combining the phase differences across frequencies, the time-delay parameters are estimated with a method similar to GCC-PHAT, applied to points having similar intensity differences. As a result, unlike DUET-type methods, our method makes it possible to estimate time delays longer than one sample.
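On a toy example, attenuation and time-delay estimation for an anechoic two-channel setting can be sketched as follows. This is a simplified illustration, not the DEMIX implementation: the single-source assumption is made trivially true by mixing only one source, the delay is circular and integer-valued, and the whitened cross-spectrum peak plays the role of the GCC-PHAT step:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024
s = rng.standard_normal(n)      # one active source (white noise for simplicity)
a, d = 0.6, 3                   # attenuation and integer time delay (samples)

x1 = s                          # channel 1
x2 = a * np.roll(s, d)          # channel 2: scaled, (circularly) delayed copy

X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)

# Attenuation: per-frequency magnitude ratio between the two channels
att = np.median(np.abs(X2 / X1))

# Time delay: whiten the cross-spectrum and pick the peak (GCC-PHAT style)
cross = X2 * np.conj(X1)
gcc = np.fft.irfft(cross / np.abs(cross), n)
delay = int(np.argmax(gcc))
```

With a single source, the magnitude ratio equals the attenuation at every frequency, and the whitened cross-spectrum reduces to a pure phase ramp whose inverse FFT peaks at the delay, even when that delay exceeds one sample.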

Experiments show that, in more than 65% of the cases, DEMIX Anechoic correctly estimates the number of directions for up to 6 sources. Moreover, it outperforms DUET in estimation accuracy by a factor of ten.

This work has been submitted for publication.

#### Single channel source separation

Keywords: single-channel source separation, Gaussian mixture model, Wiener filter, model adaptation.

Participants: Alexey Ozerov, Rémi Gribonval, Frédéric Bimbot.

Probabilistic approaches can offer satisfactory solutions to single-channel source separation, provided that the source models accurately match the statistical properties of the mixed signals. However, it is not always possible in practice to construct and use such models.

To overcome this problem, we propose an adaptation scheme that adjusts the source models to the actual properties of the signals observed in the mix. We develop a general formalism for source model adaptation. As is done, for instance, in speaker (or channel) adaptation for speech recognition, we cast this formalism as a Bayesian Maximum A Posteriori (MAP) adaptation criterion. We then show how to optimize this criterion with the EM (Expectation-Maximization) algorithm at different levels of generality.

Formulated in such a general way, this adaptation formalism can be applied to different models (GMM, HMM, etc.) and with different types of priors (probabilistic laws, structural priors, etc.). We also extend this formalism by explaining how to integrate into the adaptation scheme any auxiliary information available in addition to the mix, for example visual information, time segmentation into sound classes, or some forms of incomplete separation.
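In its simplest instance, MAP adaptation with EM can be sketched as a single iteration of mean adaptation for a one-dimensional GMM, with a Gaussian prior centred on the old means. The relevance factor `tau` and the 1-D setting are illustrative choices for this sketch, not those of our formalism:

```python
import numpy as np

def map_adapt_means(x, weights, means, stds, tau=10.0):
    """One EM iteration of MAP mean adaptation for a 1-D GMM.

    A large relevance factor tau keeps the means close to the prior;
    a small tau lets the observed data dominate.
    """
    # E-step: posterior responsibility of each component for each sample
    lik = weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) / stds
    gamma = lik / lik.sum(axis=1, keepdims=True)
    # M-step: MAP update interpolating between prior means and data means
    n_k = gamma.sum(axis=0)
    return (tau * means + gamma.T @ x) / (tau + n_k)
```

Each adapted mean is a weighted average of the prior mean and the responsibility-weighted sample mean, which is the hallmark of MAP adaptation as used in speaker adaptation.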

To demonstrate model adaptation in practice, we apply this formalism to the problem of separating voice from music in popular songs. In 2005 we proposed adaptation techniques based on a segmentation of the processed song into vocal and non-vocal parts. These techniques include learning a music model from the non-vocal parts and adapting a voice model filter from the vocal parts [66], [65].

We show that these adaptation techniques are particular forms of our general adaptation formalism. Furthermore, we introduce a new Power Spectral Density (PSD) gain adaptation technique, and we explain how to perform joint filter and PSD gain adaptation for the voice model, which leads to better performance than filter adaptation alone. Finally, in addition to [66], [65], where a manual vocal/non-vocal segmentation was used, we have developed an automatic segmentation module.
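The Wiener filtering step in which such model PSDs are used can be sketched as follows. The function assumes, hypothetically, that adapted voice and music PSD estimates are already available for each time-frequency bin; it is a minimal sketch, not our system's implementation:

```python
import numpy as np

def wiener_separate(mix_stft, psd_voice, psd_music):
    """Split a single-channel mixture STFT into voice and music estimates.

    The Wiener gain for each bin is the fraction of the total estimated
    power attributed to the voice model; the two estimates always sum
    back to the mixture.
    """
    gain_voice = psd_voice / (psd_voice + psd_music)
    return gain_voice * mix_stft, (1.0 - gain_voice) * mix_stft
```

Because the gains lie between 0 and 1 and sum to one per bin, the quality of the separation is governed entirely by how well the (adapted) PSDs match the actual signals, which is what the adaptation scheme improves.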

Thus, we have developed a single-microphone voice/music separation system based on adapted models. This system runs completely automatically, i.e., without any human intervention, and its computational load is quite reasonable (no more than 10 times real time). The results show that, for this task, an adaptation scheme can significantly improve the separation performance (by at least 5 dB) compared with non-adapted models.

This work has been accepted for publication [Oops!] and is thoroughly detailed in Alexey Ozerov's Ph.D. manuscript [16]. It was done in close collaboration with FTR&D (Pierrick Philippe).

#### Source separation via sparse adaptive representations

Keywords: source separation, sparse representation, adaptive basis.

Participants: Rémi Gribonval, Emmanuel Vincent.

Source separation is the task of retrieving the source signals underlying a multichannel mixture signal, where each channel is the sum of scaled versions of the sources (instantaneous case) or filtered versions thereof (convolutive case). A popular approach is to assume that the sources admit a sparse representation in some (possibly overcomplete) basis. Separation can then be achieved by sparse decomposition of the mixture signal. Previous work in the group focused on fixed time-frequency bases and on source-adapted bases trained on isolated samples of each source.

This year we proposed two methods to adapt the bases directly from the mixture signal. The first method aims to find a time-frequency basis such that the source signals overlap as little as possible in this basis, so that separation can be performed by binary masking, i.e. associating each time-frequency bin with a single source. Such a basis is estimated by minimizing a quadratic overlap criterion, given the spatial directions of the sources. Experiments with Cosine Packet (CP) bases showed that this method outperformed binary masking on a fixed MDCT basis for the separation of stereo instantaneous mixtures of three sources.

The second method assumes that each time frame of the mixture signal can be represented as a sparse linear combination of multichannel atoms forming a complete basis, where each atom belongs to a single source. The best basis is found for all time frames by minimizing the lp norm of the combination weights. The spatial direction associated with each atom is then estimated using the GCC-PHAT estimator, and the set of atoms corresponding to each source is obtained by clustering these directions. This method outperformed both convolutive ICA and DUET approaches on low-reverberation convolutive mixtures.
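The direction-clustering step can be illustrated on synthetic data. In the sketch below, the angles stand in for the per-atom spatial directions already estimated in the previous steps, and a plain 1-D k-means is a simplified stand-in for the clustering actually used; the two source directions (30° and 70°) are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
# Spatial angles of 100 atoms, drawn from two source directions (30° and 70°)
true = np.where(rng.random(100) < 0.5, np.pi / 6, 7 * np.pi / 18)
angles = true + rng.normal(0.0, 0.02, 100)   # estimation noise per atom

# Simple 1-D k-means on the angles recovers the two source directions
centers = np.array([0.0, np.pi / 2])         # rough initial guesses
for _ in range(20):
    labels = np.argmin(np.abs(angles[:, None] - centers), axis=1)
    centers = np.array([angles[labels == k].mean() for k in (0, 1)])
```

Once each atom is assigned to a cluster, the atoms labelled with a given direction are grouped to reconstruct the corresponding source.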

We also studied the minimization of the lp norm of the combination weights for complex-valued overcomplete bases. This optimization problem is difficult since it is nonconvex and theoretical results for real-valued data do not apply for complex-valued data. We characterized the local minima of the lp norm in a simple case and derived a fast algorithm for the estimation of the global minimum. This algorithm has been applied to the separation of stereo instantaneous and convolutive mixtures of three sources.

This work was conducted in collaboration with Maria G. Jafari and Mark D. Plumbley (Queen Mary, University of London) and Mike E. Davies (University of Edinburgh). The results have been published in the form of a journal article [Oops!], a book chapter [Oops!] and two conference papers [Oops!], [Oops!].

#### Evaluation of source separation algorithms

Keywords: blind source separation, evaluation, performance measure, benchmark.

Participants: Rémi Gribonval, Emmanuel Vincent.

Source separation of under-determined and/or convolutive mixtures is a difficult problem that has been tackled by many algorithms based on different source models. Their performance is usually limited by badly designed source models or local maxima of the function to be optimized. Moreover, it may be limited by algorithmic constraints, such as the length of the demixing filters or the number of frequency bins of the time-frequency masks. The best possible source signal that can be estimated under these constraints (in the ideal case where source models and optimization algorithms are perfect) is called an oracle estimator of the source. We have expressed and implemented oracle estimators for four classes of algorithms (time-invariant beamforming, single-channel time-frequency masking, multichannel time-frequency masking and best basis masking) and studied their performance on realistic speech and music mixtures. The results have led to interesting conclusions concerning the performance bounds of blind algorithms, the choice of the best class of algorithms and the assessment of the separation difficulty.
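As a toy illustration of an oracle estimator, the sketch below computes an ideal (oracle) binary mask in a single FFT basis from the true source spectra, then measures the resulting distortion. This is a deliberately simple stand-in for the oracle estimators above, which operate on time-frequency representations of realistic mixtures; the two sinusoidal sources are chosen to be nearly disjoint in frequency:

```python
import numpy as np

def sdr(ref, est):
    """Signal-to-distortion ratio in dB."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

n = 2048
t = np.arange(n)
s1 = np.sin(2 * np.pi * 100 * t / n)    # source 1: low tone (on the FFT grid)
s2 = np.sin(2 * np.pi * 600 * t / n)    # source 2: high tone
mix = s1 + s2

S1, S2, M = np.fft.rfft(s1), np.fft.rfft(s2), np.fft.rfft(mix)
mask = np.abs(S1) > np.abs(S2)           # oracle: true per-bin dominance
est1 = np.fft.irfft(np.where(mask, M, 0.0), n)
est2 = np.fft.irfft(np.where(mask, 0.0, M), n)
```

Because the mask is computed from the true sources, its performance upper-bounds what any blind masking algorithm in this basis could achieve; when the sources overlap in time-frequency, even the oracle degrades, which is what makes such bounds informative about separation difficulty.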

This work, which builds on our previous contribution published in [68], was done in collaboration with Emmanuel Vincent and Mark D. Plumbley (Queen Mary, University of London). For more detail, please refer to [Oops!] and [Oops!].