# Project : metiss

## Section: New Results

Keywords : sparse decomposition, dictionary construction, source separation, granular models.

### Advanced audio signal processing

#### Nonlinear approximation and sparse decompositions

Keywords : redundant dictionaries, sparsity, Matching Pursuit, Basis Pursuit, linear programming.

Participant : Rémi Gribonval.

Research on nonlinear approximation of signals and images with redundant dictionaries has been carried out over the past few years in collaboration with Morten Nielsen, from the University of Aalborg in Denmark, and more recently with Pierre Vandergheynst, from the Swiss Federal Institute of Technology in Lausanne (EPFL).

A problem closely related to m-term approximation of a signal/function from an overcomplete dictionary is the computation of sparse representations of the signal in the dictionary. For the family of *localized frames* (which includes most Gabor and wavelet type systems) it is known [55] that the canonical frame expansion provides a near-sparsest representation of any signal in the ℓ^τ sense, 1 ≤ τ ≤ 2. Last year, we showed [53] that this property remains valid for ℓ^τ with τ < 1, where the admissible range of τ depends on the degree s of localization/decay of the frame, and by combining it with our previous results [17] we showed that thresholding the canonical representation in a localized frame provides a predictable rate of m-term approximation. However, in [18] we disproved a conjecture of Gröchenig about the existence of a general *Bernstein inequality* for localized frames, by building a simple counter-example. In simpler terms, we proved that for some localized frames it is possible to find signals whose ideal m-term approximation rate is infinitely better than what can be predicted from their sparsest representation (which turns out to be essentially the canonical frame expansion). This year, we proved that for *blockwise incoherent* dictionaries a better behaviour can be expected, namely the rate of best m-term approximation never exceeds *twice* the rate predicted from the sparsest representation.

Many simple and yet interesting frames, such as the union of a wavelet basis and a Wilson basis, are not localized frames, and one cannot rely on the frame coefficients to obtain a near-sparsest representation with respect to various sparsity measures. Last year, in [54], [53], [30] we proposed several extensions of results by Donoho, Huo, Elad and Bruckstein on sparse representations of signals in a union of two orthonormal bases, by (1) relaxing the hypotheses on the structure of the dictionary and (2) replacing the ℓ⁰ and ℓ¹ sparsity measures with a larger family of *admissible sparsity measures* (which includes all ℓ^τ norms, 0 < τ ≤ 1), and we gave sufficient conditions for the uniqueness of a sparse representation of a signal from the dictionary with respect to such a sparsity measure. This year, we obtained results on sparse *approximations* (which include sparse *representations* as a special case). We provided a simple test [33] that can be applied to the output of a sparse approximation algorithm to check whether it is nearly optimal, in the sense that no significantly different linear expansion from the dictionary can provide both a smaller approximation error and a better sparsity (in the sense of any *admissible* sparsity measure). As a by-product, we obtained results on the identifiability of sparse overcomplete models in the presence of noise, for the class of admissible sparse priors.
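For readers unfamiliar with the notion, the following is one common formalization of an *admissible sparsity measure*; the precise hypotheses vary slightly across the cited papers, so this should be read as an indicative sketch rather than the exact definition used in [33].

```latex
% Cost assigned to a coefficient sequence (c_k) by a sparsity measure f:
\|c\|_f \;=\; \sum_k f\bigl(|c_k|\bigr),
\qquad
f:[0,\infty)\to[0,\infty),\quad
f(0)=0,\quad
f \text{ non-decreasing},\quad
t \mapsto f(t)/t \text{ non-increasing}.
% Taking f(t) = t^\tau recovers the \ell^\tau measures, 0 < \tau \le 1;
% the limit \tau \to 0 counts the nonzero coefficients (the \ell^0 measure).
```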

In a joint work with Pierre Vandergheynst from EPFL [34], we extended to the case of Pure Matching Pursuit recent results by Gilbert *et al.* [46], [47] and Tropp [63] about exact recovery with Orthogonal Matching Pursuit. In particular, for incoherent dictionaries, our result extends a result by Villemoes [64] about Matching Pursuit in the Haar-Walsh wavepacket dictionary: if we start with a linear combination of sufficiently few atoms from an incoherent dictionary, Matching Pursuit picks a "correct" atom at each step and the residual converges exponentially fast to zero. The rate of exponential convergence is controlled by the number of atoms in the initial expansion. We also obtained stability results for Matching Pursuit when the analyzed signal is only well approximated by such a linear combination of few atoms.
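As an illustration of the algorithm under discussion, here is a minimal NumPy sketch of plain (pure) Matching Pursuit run on a toy dictionary; the random dictionary, sizes and fixed iteration count are illustrative choices, not those of the cited results.

```python
import numpy as np

def matching_pursuit(x, D, n_iter=30):
    """Plain (pure) Matching Pursuit: at each step, pick the atom most
    correlated with the residual and subtract its projection."""
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_iter):
        correlations = D.T @ residual          # inner products <r, d_j>
        j = np.argmax(np.abs(correlations))    # best-matching atom
        coeffs[j] += correlations[j]
        residual -= correlations[j] * D[:, j]  # subtract its projection
    return coeffs, residual

# Toy experiment: a combination of 3 atoms from a random (hence, with
# high probability, fairly incoherent) dictionary of 128 atoms in R^64.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
x = D[:, [5, 40, 100]] @ np.array([2.0, -1.5, 1.0])

coeffs, residual = matching_pursuit(x, D)
# For sparse enough inputs the residual shrinks exponentially fast.
print(np.linalg.norm(residual) / np.linalg.norm(x))
```

With only 3 active atoms, the relative residual after 30 iterations is typically far below 1.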

#### Dictionary design for source separation

Keywords : sparse coding, redundant dictionaries, sparsity.

Participants : Sylvain Lesage, Rémi Gribonval, Frédéric Bimbot.

Recent theoretical work has shown that Basis Pursuit or Matching Pursuit techniques can recover highly sparse representations of signals from *incoherent* redundant dictionaries, or structured (rather than sparse) representations from unions of orthonormal bases. To exploit these results, we started last year a research project dedicated to the design of dictionaries structured as unions of orthonormal bases. We proposed a new method, based on the SVD and thresholding, to build dictionaries which are a union of orthonormal bases. The interest of such a structure is manifold. First, it seems that many signals or images can be modeled as the superimposition of several layers, each with a sparse decomposition in one of the bases. Moreover, in such dictionaries, the efficient Block Coordinate Relaxation (BCR) algorithm can be used to compute sparse decompositions. We showed that it is possible to design an iterative learning algorithm that produces a dictionary with the required structure. Each iteration consists of a coefficient estimation step, using a variant of BCR, followed by the update of one chosen basis, using the Singular Value Decomposition. We assessed experimentally how well the learning algorithm recovers dictionaries that may or may not have the required structure, and to what extent the noise level is a disturbing factor. Besides its promising results, the method is flexible in that the sparsity measure being optimized can easily be replaced with some other criterion.
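The alternating scheme described above can be sketched as follows. The simple hard-thresholding rule standing in for BCR, the random initialization and all sizes are simplifying assumptions for illustration, not the exact algorithm of the paper.

```python
import numpy as np

def learn_union_of_bases(X, n_bases=2, n_iter=20, threshold=0.5, seed=0):
    """Alternate between (1) sparse coding by blockwise hard thresholding
    (a crude stand-in for Block Coordinate Relaxation) and (2) updating
    one orthonormal basis at a time via an orthogonal Procrustes step,
    i.e. an SVD."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    # initialize with random orthonormal bases (QR of Gaussian matrices)
    bases = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(n_bases)]
    coeffs = [np.zeros((d, n)) for _ in range(n_bases)]
    for _ in range(n_iter):
        # --- coefficient step: threshold the analysis of the residual ---
        for k in range(n_bases):
            residual = X - sum(bases[j] @ coeffs[j]
                               for j in range(n_bases) if j != k)
            c = bases[k].T @ residual                 # analysis in basis k
            coeffs[k] = c * (np.abs(c) > threshold)   # hard thresholding
        # --- basis step: closest orthogonal matrix fitting the residual ---
        for k in range(n_bases):
            residual = X - sum(bases[j] @ coeffs[j]
                               for j in range(n_bases) if j != k)
            U, _, Vt = np.linalg.svd(residual @ coeffs[k].T)
            bases[k] = U @ Vt                         # Procrustes update
    return bases, coeffs

# Toy run on random data, just to exercise the code.
X = np.random.default_rng(1).standard_normal((8, 50))
bases, coeffs = learn_union_of_bases(X)
approx = sum(B @ C for B, C in zip(bases, coeffs))
```

The Procrustes step guarantees that each updated basis stays exactly orthonormal, which is the structural constraint the method is designed to preserve.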

#### Statistical models of music

Keywords : musical description, statistical models.

Participants : Amadou Sall, Frédéric Bimbot.

By analogy with speech recognition, which is very advantageously guided by statistical language models, we hypothesize that music description, recognition and transcription can strongly benefit from music models that express dependencies between notes within a music piece, arising from melodic patterns and harmonic rules.

To this end, we have started a study, in the context of a PhD, on the approximate modeling of syntactic and paradigmatic properties of music, through the use of n-gram models of notes, successions of notes and combinations of notes.

In practice, we consider a corpus of MIDI files from which we learn co-occurrences of concurrent and consecutive notes, and we use these statistics to cluster music pieces into classes of models and to measure the predictability of notes within a class of models. Preliminary experiments have shown promising results, which are currently being consolidated.
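A minimal sketch of the kind of statistics involved: bigram (2-gram) counts of note successions over a toy corpus, with add-alpha smoothing. The tiny corpus, the smoothing scheme and the MIDI-number encoding are illustrative assumptions, not the actual experimental setup.

```python
from collections import Counter

def train_bigrams(pieces):
    """Count notes and successions of notes (bigrams) over a corpus;
    `pieces` is a list of note sequences, e.g. MIDI note numbers."""
    unigrams, bigrams = Counter(), Counter()
    for notes in pieces:
        unigrams.update(notes)
        bigrams.update(zip(notes, notes[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, note, alpha=1.0):
    """Estimate of P(note | prev) with add-alpha smoothing over the
    observed note alphabet (a rough measure of note predictability)."""
    vocab = len(unigrams)
    return (bigrams[(prev, note)] + alpha) / (unigrams[prev] + alpha * vocab)

# Toy corpus: two "pieces" as MIDI note numbers (60 = middle C).
corpus = [[60, 62, 64, 65, 67, 65, 64, 62, 60],
          [60, 64, 67, 64, 60]]
uni, bi = train_bigrams(corpus)
p = bigram_prob(uni, bi, 60, 62)
```

Comparing such probabilities across classes of pieces is one simple way to quantify how predictable notes are within a class.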

Once simple n-gram models have been investigated, we will evaluate more elaborate models such as Markov random fields. In the longer term, the model is intended to be used as a complement to source separation and acoustic decoding, forming a consistent framework that embeds signal processing techniques, acoustic knowledge sources and music rule modeling.

#### Underdetermined audio source separation

Keywords : degenerate blind source separation, denoising, Wiener filter, masking, clustering, Gaussian Mixture Models, Hidden Markov Models, Kalman filtering.

Participants : Alexey Ozerov, Frédéric Bimbot, Rémi Gribonval.

The problem of separating several audio sources mixed on one or more channels is now well understood and tackled in the determined case, where the number of sources does not exceed the number of channels. Based on our work on statistical modeling and sparse decompositions of audio signals in redundant dictionaries (see above), we have proposed over the past years techniques to deal with the degenerate case (monophonic and stereophonic), where it is not possible to merely estimate and apply a demixing matrix.

Last year we proposed [39], [37] a series of methods to perform the separation of two sound sources from a single sensor. The methods were based on Gaussian mixture models of the nonstationary data, and they involved a learning phase, where the parameters of the models were estimated, and a separation phase, where a generalization of Wiener filtering was applied to estimate the sources. This year, in [27] we applied these methods to the separation of music from speech in broadcast news for robust speech recognition, and we compared them to more classical denoising methods. Moreover, we considered several new parametric models of nonstationary signals based on graphical models and mixtures of Gaussians, either in the spectral or in the log-spectral domain. We are now beginning to understand experimentally the interplay between the choice of the modeling domain (spectral or log-spectral), the estimation criteria used in the learning and separation phases (e.g., which (average) distortion is minimized) and the quality of the results in terms of a measured distortion.
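The separation phase can be sketched, in heavily simplified form, as follows: each source is represented by a few spectral power templates (a degenerate stand-in for a full Gaussian mixture model), the best pair of states is selected by maximum likelihood, and a Wiener-like gain is applied per frequency bin. All names, sizes and the single-frame setting are illustrative assumptions.

```python
import numpy as np

def separate_frame(mix_power, states_a, states_b):
    """Single-sensor separation of one spectral frame. `states_*` are
    arrays of nonnegative spectral power templates, shape
    (n_states, n_freq). Returns power estimates for the two sources."""
    best, best_ll = None, -np.inf
    for pa in states_a:
        for pb in states_b:
            total = pa + pb + 1e-12
            # log-likelihood of the mixture under independent Gaussian
            # spectra with variances pa and pb (up to additive constants)
            ll = -np.sum(np.log(total) + mix_power / total)
            if ll > best_ll:
                best_ll, best = ll, (pa, pb)
    pa, pb = best
    gain_a = pa / (pa + pb + 1e-12)      # Wiener-like gain for source A
    return gain_a * mix_power, (1 - gain_a) * mix_power

# Toy frame: source A is low-frequency, source B is high-frequency.
fa = np.array([4.0, 4.0, 0.1, 0.1])
fb = np.array([0.1, 0.1, 4.0, 4.0])
est_a, est_b = separate_frame(fa + fb, np.array([fa]), np.array([fb]))
```

The two gains sum to one in each frequency bin, so the estimated source powers always add up to the mixture power, which is the conservative behaviour expected of a Wiener-style mask.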

#### Evaluation of audio source separation methods

Keywords : Audio source separation, source to distortion ratio, source to interference ratio, source to noise ratio, source to artefacts ratio.

Participant : Rémi Gribonval.

Because the success or failure of an algorithm for a practical task such as blind source separation (BSS) cannot be assessed without agreed-upon, pre-specified objective criteria, METISS took part in 2002-2003 in a GDR-ISIS (CNRS) workgroup [35] whose goal was to "identify common denominators specific to the different problems related to audio source separation, in order to propose a toolbox of numerical criteria and test signals of calibrated difficulty suited for assessing the performance of existing and future algorithms". The workgroup released an online prototype of a database of test signals together with an evaluation toolbox. This year, we proposed a larger set of performance measures and an updated toolbox to deal with the fact that, depending on the exact application, different distortions can be allowed between an estimated source and the true target source. We considered four different sets of such allowed distortions, from time-invariant gains to time-varying filters. In each case, we proposed to decompose the estimated source into a true source part plus error terms corresponding to interferences, additive noise and algorithmic artifacts. Then we derived a global performance measure using an energy ratio, plus a separate performance measure for each error term. These measures were computed and discussed on the results of several BSS problems with various difficulty levels. These proposals are the subject of a paper currently submitted to IEEE Transactions on Speech and Audio Processing.
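The decomposition and energy-ratio measures described above can be sketched, for the simplest family of allowed distortions (a single time-invariant gain), as follows; this is an illustrative reimplementation of the idea, not the toolbox code itself, and it omits the additive-noise term for brevity.

```python
import numpy as np

def decompose(estimate, target, interferers):
    """Decompose an estimated source into a target part, an interference
    part and an artifact part, using orthogonal (least-squares)
    projections onto the true sources."""
    def project(x, basis):
        # least-squares projection of x onto the span of the rows of `basis`
        coefs, *_ = np.linalg.lstsq(basis.T, x, rcond=None)
        return basis.T @ coefs
    s_target = project(estimate, target[None, :])
    all_sources = np.vstack([target] + list(interferers))
    e_interf = project(estimate, all_sources) - s_target
    e_artif = estimate - s_target - e_interf
    return s_target, e_interf, e_artif

def energy_ratio_db(num, den):
    """Energy ratio in decibels, the common form of the measures above."""
    return 10 * np.log10(np.sum(num**2) / (np.sum(den**2) + 1e-30))

# Global and per-error-term measures as energy ratios:
#   source-to-distortion:   energy_ratio_db(s_target, e_interf + e_artif)
#   source-to-interference: energy_ratio_db(s_target, e_interf)
#   source-to-artifacts:    energy_ratio_db(s_target + e_interf, e_artif)
rng = np.random.default_rng(2)
target = rng.standard_normal(1000)
interf = rng.standard_normal(1000)
estimate = target + 0.1 * interf          # an estimate with mild leakage
s_t, e_i, e_a = decompose(estimate, target, [interf])
```

By construction the three parts sum back to the estimate, so each error term can be judged separately without losing energy in the bookkeeping.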