# Project: METISS

## Section: New Results

Keywords: sparse decomposition, dictionary construction, source separation, granular models.

### Advanced audio signal processing

#### Nonlinear approximation and sparse decompositions

Keywords: redundant dictionaries, sparsity, Matching Pursuit, Basis Pursuit, linear programming.

Participant: Rémi Gribonval.

Research on nonlinear approximation of signals and images with redundant dictionaries has been carried out over the past few years in collaboration with Morten Nielsen, from the University of Aalborg in Denmark. The goal is to understand what classes of functions/signals can be approximated at a given rate by *m*-term expansions using various families of practical or theoretical approximation algorithms. Much is known when the dictionary is an orthonormal wavelet basis, and we focus on finding the right extension of the wavelet results to structured redundant dictionaries.
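For reference, the quantities studied here can be written as follows (standard notation, not taken from the report): the best *m*-term approximation error of a function *f* with respect to a dictionary measures how well *f* can be expanded on *m* atoms, and the approximation classes collect the functions reaching a prescribed rate.

```latex
% Best m-term approximation error of f in L_p with respect to a dictionary D
\sigma_m(f)_{L_p} = \inf_{\substack{g_1,\dots,g_m \in \mathcal{D} \\ c_1,\dots,c_m}}
    \Big\| f - \sum_{k=1}^{m} c_k\, g_k \Big\|_{L_p}

% Approximation class: functions approximated at rate m^{-s}
\mathcal{A}^s = \big\{ f \,:\, \sigma_m(f)_{L_p} \le C\, m^{-s}
    \ \text{for some } C \text{ and all } m \ge 1 \big\}
```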

Last year, we completely characterized (in terms of Besov spaces) the best *m*-term approximation classes with spline-based redundant framelet systems in ${L}_{p}({\mathbb{R}}^{d})$ [12]. This year, we have shown that the characterization extends to more general framelet systems [41] [40] in ${L}_{p}({\mathbb{R}}^{d})$, and that Sobolev spaces are also characterized in terms of framelet coefficients [34]. The range of Besov smoothness for which the characterization holds with the frame expansion is limited by the number of vanishing moments of the functions in the dual frame. However, we proved in [33] that, for twice oversampled MRA-based framelets, the same results hold true with no restriction on the number of vanishing moments, but now the canonical frame expansion is replaced with another linear expansion. The trick is to prove a Jackson inequality by building a ``nice'' wavelet which has a highly sparse linear expansion in the twice oversampled framelet system.

A problem closely related to *m*-term approximation is the computation of sparse representations of a function in a redundant dictionary. For the family of *localized frames* (which includes most Gabor and wavelet type systems) it is known [52] that the canonical frame expansion provides a near-sparsest representation of any signal in the ${\ell}^{\tau}$ sense, $1\le \tau \le 2$. In [35] we have shown that this property is also valid for $r<\tau <1$, where *r* depends on the degree *s* of localization/decay of the frame. However, we have disproved in [36] a conjecture of Gröchenig about the existence of a general Bernstein inequality for localized frames, by building a simple counter-example. Many simple and yet interesting frames, such as the union of a wavelet basis and a Wilson basis, are not localized frames, and one cannot rely on the frame coefficients to obtain a near-sparsest representation for various ${\ell}^{\tau}$ measures. In [13] [35] we extended a result by Donoho, Huo, Elad and Bruckstein on sparse representations of signals in a union of two orthonormal bases. In [13], we considered general (redundant) dictionaries in finite dimension, and derived sufficient conditions on a signal for having unique sparse representations (in the ${\ell}^{0}$ and ${\ell}^{1}$ sense) in such dictionaries. The special case where the dictionary is given by a union of several orthonormal bases was studied in more detail. In [35] we introduced a large class of admissible sparseness measures (which includes all ${\ell}^{\tau}$ norms, $0\le \tau \le 1$), and we gave sufficient conditions for having a unique sparse representation of a signal from the dictionary w.r.t. such a sparseness measure, in finite or infinite dimension. Moreover, we gave sufficient conditions on the ${\ell}^{0}$ sparseness of a signal such that the simple solution of a linear programming problem simultaneously solves all the non-convex (and generally hard combinatorial) problems of sparsest representation of the signal w.r.t. arbitrary admissible sparseness measures. In a joint work with Pierre Vandergheynst from EPFL (see [24]) we extended to the case of the Pure Matching Pursuit recent results by Gilbert *et al.* [45] [46] and Tropp [61] about exact recovery with Orthogonal Matching Pursuit. In particular, in incoherent dictionaries, our result extends a result by Villemoes [62] about Matching Pursuit in the Haar-Walsh wavepacket dictionary: if we start with a linear combination of sufficiently few atoms of an incoherent dictionary, Matching Pursuit will pick up a ``correct'' atom at each step and the residue will converge exponentially fast to zero. The rate of exponential convergence is controlled by the number of atoms in the initial expansion.
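The recovery behaviour described above can be illustrated with a small numerical sketch (Python/NumPy; this is our own toy setup with a random dictionary, not the experiments of the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: random unit-norm atoms are typically mutually incoherent
# when the ambient dimension is large compared to the sparsity level.
n, n_atoms, k = 128, 256, 4
D = rng.standard_normal((n, n_atoms))
D /= np.linalg.norm(D, axis=0)                    # unit-norm atoms

true_support = rng.choice(n_atoms, size=k, replace=False)
x = D[:, true_support] @ rng.standard_normal(k)   # k-sparse signal

def matching_pursuit(x, D, n_iter):
    """Plain (non-orthogonal) Matching Pursuit: at each step, subtract the
    projection of the residual onto its most correlated atom."""
    r = x.copy()
    picked, norms = [], []
    for _ in range(n_iter):
        corr = D.T @ r
        i = np.argmax(np.abs(corr))               # best-matching atom
        r = r - corr[i] * D[:, i]                 # update the residual
        picked.append(i)
        norms.append(np.linalg.norm(r))
    return picked, norms

picked, norms = matching_pursuit(x, D, n_iter=30)
# For sufficiently incoherent dictionaries and few enough atoms, MP keeps
# selecting atoms from the true support and the residual norm typically
# decays geometrically with the iterations.
print(sorted(set(picked)), norms[-1])
```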

#### Dictionary design for source separation

Keywords: redundant dictionaries, sparsity.

Participants: Sylvain Lesage, Laurent Benaroya, Frédéric Bimbot, Rémi Gribonval.

Recent theoretical work has shown that Basis Pursuit or Matching Pursuit
techniques can recover highly sparse representations of signals from *incoherent* redundant dictionaries. To exploit these results we have
started a research project dedicated to the design of incoherent
dictionaries with the aim of performing source separation. First, we
compared the sparsity of the decomposition in various orthonormal bases,
both theoretically and experimentally. We observed that the cosine basis and
the data-dependent Karhunen-Loève basis provide the sparsest decompositions
among the tested orthonormal bases. In the cosine basis, we noticed that
the choice of the size of the signal frames which leads to the maximum gain
in sparsity corresponds to the largest duration on which the signal can be
considered stationary. Then we proposed five methods to ``learn'' a
dictionary from training data so as to maximize the mean sparsity. We
proposed a new method based on the SVD and thresholding to build
dictionaries which are a union of orthonormal bases. Besides its promising
results, the method is flexible in that the sparsity measure which is
optimized can easily be replaced with some other criterion.
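The SVD-and-thresholding idea can be sketched for a single orthonormal basis (a full dictionary would stack several such bases). This is a hypothetical simplification in Python/NumPy, not the exact algorithm of the project: it alternates a hard-thresholding step that enforces sparsity of the coefficients with an orthogonal Procrustes update of the basis via the SVD.

```python
import numpy as np

rng = np.random.default_rng(1)

def learn_orthonormal_basis(X, n_iter=20, keep=4):
    """Alternate (1) hard-thresholding of the analysis coefficients and
    (2) an SVD-based orthogonal Procrustes update of the basis."""
    n = X.shape[0]
    B = np.linalg.qr(rng.standard_normal((n, n)))[0]  # random orthonormal init
    for _ in range(n_iter):
        A = B.T @ X                                   # analysis coefficients
        # keep only the `keep` largest coefficients in each column
        idx = np.argsort(np.abs(A), axis=0)[:-keep, :]
        A[idx, np.arange(A.shape[1])] = 0.0
        U, _, Vt = np.linalg.svd(X @ A.T)             # Procrustes update:
        B = U @ Vt                                    # nearest orthonormal basis
    return B

# Toy data: signals that are sparse in an unknown orthonormal basis
n, n_sig = 16, 200
B_true = np.linalg.qr(rng.standard_normal((n, n)))[0]
codes = rng.standard_normal((n, n_sig)) * (rng.random((n, n_sig)) < 0.2)
X = B_true @ codes

B = learn_orthonormal_basis(X)
print(np.allclose(B.T @ B, np.eye(n)))   # the update preserves orthonormality
```

The Procrustes step is what makes the method flexible: only the thresholding step encodes the sparsity criterion, so it can be swapped for another measure without touching the basis update.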

#### Granular models of audio signals

Keywords: musical signal analysis, granular synthesis, clustering.

Participants: Lorcan Mc Donagh, Frédéric Bimbot, Rémi Gribonval.

The theoretical framework which is the foundation of our work on granular
signal models is now established [30]
[19].
The model
$s\left[n\right]=F\left(\gamma \left[{k}_{n}\right],{\theta}_{n}\right)$
is frame-based and of ``hybrid'' nature, in that it combines two different
approaches. The non-parametric part consists of a dictionary element or
*prototype*
$\gamma \left[{k}_{n}\right]$
(plain waveforms, Fourier
spectra, LPC excitation signals, etc.), where each prototype corresponds
to one possible state of the model. The parametric part
$F\left(\cdot ,\theta \right)$
attempts to model how
frame signals
$s\left[n\right]$
deviate from a prototype.
Clustering is then used to compute the dictionary and assign a state index
${k}_{n}$
to each frame
$s\left[n\right]$
. The clustering algorithm attempts
to balance two antagonistic criteria, namely the quality and efficiency of
the representation, respectively measured by the SNR and the symbol rate.
Clustering algorithms have been specifically developed for our purpose.
Comparative tests conducted on lossy compression of real-world signals
showed that these performed better than a number of state-of-the-art
algorithms. Various models, such as LPC adaptive-codebook and various
spectrum-based models, have been investigated. Our current efforts
concentrate on modelling local dependencies of amplitude and phase discrete
spectra [59], both in the time/state and frequency domains.
In this ``spectral'' model, prototypes are spectrum templates; we attempt
to model amplitude and phase differences between the frame spectra of a
cluster and the corresponding prototype, only around energy-spectrum peaks.
The aim is to build an efficient granular, object-based, time-frequency
model of audio signals. Although most of our work is directed towards
compression, segmentation [43], content-based indexing,
summary generation and musical resynthesis [63] are some of
many other possible applications.
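The clustering step of the granular model can be sketched as follows (Python/NumPy). This is a deliberately simplified stand-in, a plain k-means on signal frames where each centroid plays the role of a prototype $\gamma \left[{k}_{n}\right]$; the project's own algorithms additionally trade representation quality against symbol rate.

```python
import numpy as np

rng = np.random.default_rng(2)

def cluster_frames(frames, n_states, n_iter=15):
    """Plain k-means on frames: centroids act as prototypes, and each frame
    receives the state index k of its nearest prototype."""
    protos = frames[rng.choice(len(frames), n_states, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(frames[:, None, :] - protos[None, :, :], axis=2)
        states = np.argmin(d, axis=1)             # state index k_n per frame
        for k in range(n_states):
            members = frames[states == k]
            if len(members):                      # keep old proto if empty
                protos[k] = members.mean(axis=0)
    return protos, states

def snr_db(frames, protos, states):
    """Representation quality: SNR of approximating each frame by its prototype."""
    err = frames - protos[states]
    return 10 * np.log10(np.sum(frames**2) / np.sum(err**2))

# Toy corpus: noisy copies of a few underlying waveforms
base = rng.standard_normal((4, 32))
frames = base[rng.integers(0, 4, 500)] + 0.1 * rng.standard_normal((500, 32))

protos, states = cluster_frames(frames, n_states=4)
print(snr_db(frames, protos, states))
```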

#### Underdetermined audio source separation

Keywords: degenerate blind source separation, piecewise linear separation, sparse decomposition, nonlinear approximation, Best Basis, Matching Pursuit, denoising, Wiener filter, masking, clustering, Gaussian Mixture Models, Hidden Markov Models.

Participants: Laurent Benaroya, Frédéric Bimbot, Rémi Gribonval.

The problem of separating several audio sources mixed on one or more channels is now well understood and tackled in the determined case, where the number of sources does not exceed the number of channels. Based on our work on statistical modeling and sparse decompositions of audio signals in redundant dictionaries (see above), we have proposed techniques to deal with the degenerate case (monophonic and stereophonic), where it is not possible to merely estimate and apply a demixing matrix.

In [19] [17] we proposed two new methods to perform the separation of two sound sources from a single sensor. The first method [19] generalizes Wiener filtering with locally stationary, non-Gaussian, parametric source models. The method involves a learning phase, for which we proposed three different algorithms. In the separation phase, we used a sparse non-negative decomposition algorithm of our own. The second method [17] also generalizes Wiener filtering, but with Gaussian Mixture distributions and Hidden Markov Models. The method involves a training phase for the model parameters, which is done with the classical EM algorithm. We derived a new algorithm for the re-estimation of the sources with these mixture models during the separation phase. In [18] we applied these methods to the separation of music from speech in broadcast news for robust speech recognition.
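The idea of generalizing Wiener filtering with Gaussian mixture source models can be sketched in a few lines (Python/NumPy). This is a hypothetical, heavily simplified version: each source is reduced to a small set of spectral-variance templates, and for every frame we pick the most likely pair of states given the mixture spectrum, then apply the classical Wiener gain built from that pair (the actual method of [17] uses full GMM/HMM posteriors trained by EM).

```python
import numpy as np

rng = np.random.default_rng(3)

def gmm_wiener_separate(mix_psd, vars1, vars2, w1, w2):
    """mix_psd: (n_frames, n_freq) power spectra of the one-channel mixture.
    vars1, vars2: (n_states, n_freq) spectral-variance templates per source.
    w1, w2: prior weights of the states. Returns the estimated PSD of source 1."""
    est1 = np.empty_like(mix_psd)
    for t, p in enumerate(mix_psd):
        best, best_ll = None, -np.inf
        for i, v1 in enumerate(vars1):
            for j, v2 in enumerate(vars2):
                v = v1 + v2                     # mixture variance under (i, j)
                # log-likelihood of the frame under a zero-mean Gaussian model
                ll = -np.sum(np.log(v) + p / v) + np.log(w1[i]) + np.log(w2[j])
                if ll > best_ll:
                    best_ll, best = ll, (v1, v2)
        v1, v2 = best
        est1[t] = p * v1 / (v1 + v2)            # frame-adaptive Wiener mask
    return est1

# Toy example: two states per source, with assumed (random) spectral templates
n_freq = 8
vars1 = np.abs(rng.standard_normal((2, n_freq))) + 0.1
vars2 = np.abs(rng.standard_normal((2, n_freq))) + 0.1
w1 = w2 = np.array([0.5, 0.5])
mix_psd = np.abs(rng.standard_normal((5, n_freq)))**2

est1 = gmm_wiener_separate(mix_psd, vars1, vars2, w1, w2)
print(np.all((est1 >= 0) & (est1 <= mix_psd)))   # the mask stays in [0, 1]
```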

Following our work of last year [50] on Matching Pursuit based audio source separation in the stereophonic case, and building upon our new approaches for single channel separation (see above), we proposed a new framework [23], called piecewise linear separation, for blind source separation of possibly degenerate mixtures, including the extreme case of a single mixture of several sources. Its basic principle is to: 1) decompose the observations into ``components'' using some sparse decomposition/nonlinear approximation technique; 2) perform separation on each component using a ``local'' separation matrix. It covers many recently proposed techniques for degenerate BSS, as well as several new algorithms that we propose. We discussed two particular methods of multichannel decomposition, based on the Best Basis and Matching Pursuit algorithms, as well as several methods to compute the local separation matrices (assuming the mixing matrix is known). Numerical experiments were used to compare the performance of various combinations of the decomposition and local separation methods. On the dataset used for the experiments, Best Basis with either cosine packets or wavelet packets (Beylkin, Vaidyanathan, Battle3 or Battle5 filters) appeared to be the best choice in terms of overall performance, because it introduces a relatively low level of artefacts in the estimation of the sources; Matching Pursuit introduces slightly more artefacts, but can improve the rejection of the unwanted sources.
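The two-step principle can be sketched on a degenerate 2-channel / 3-source mixture with a known mixing matrix (Python/NumPy). For brevity this hypothetical sketch uses plain time samples as the ``components'', standing in for a Best Basis or Matching Pursuit decomposition, and the simplest local separation rule: assume one dominant source per component and invert its mixing direction by least squares.

```python
import numpy as np

rng = np.random.default_rng(4)

def piecewise_linear_separation(X, A):
    """X: (2, n) observations; A: (2, 3) known mixing matrix.
    1) decompose into components (here: individual time samples);
    2) separate each component with a local rank-one inversion."""
    n_chan, n_src = A.shape
    cols = A / np.linalg.norm(A, axis=0)        # unit mixing directions
    S = np.zeros((n_src, X.shape[1]))
    for t in range(X.shape[1]):
        x = X[:, t]
        j = np.argmax(np.abs(cols.T @ x))       # most plausible active source
        # local "separation matrix": least-squares inversion restricted to a_j
        S[j, t] = A[:, j] @ x / (A[:, j] @ A[:, j])
    return S

# Toy degenerate mixture: 3 sources active on disjoint time supports,
# so the one-dominant-source-per-component assumption holds exactly
n = 300
S_true = np.zeros((3, n))
for j in range(3):
    S_true[j, j*100:(j+1)*100] = rng.standard_normal(100)
A = rng.standard_normal((2, 3))
X = A @ S_true

S_est = piecewise_linear_separation(X, A)
err = np.linalg.norm(S_est - S_true) / np.linalg.norm(S_true)
print(err < 1e-6)
```

When the sources overlap in the chosen representation, the single-source assumption breaks per component, which is precisely why a decomposition that sparsifies the sources (Best Basis, Matching Pursuit) matters in the framework.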

#### Evaluation of blind audio source separation methods

Keywords: blind audio source separation, source to distortion ratio, source to interference ratio, source to noise ratio, source to artefacts ratio.

Participants: Laurent Benaroya, Frédéric Bimbot, Rémi Gribonval.

Because the success or failure of an algorithm on a practical task such as BSS cannot be assessed without agreed upon, pre-specified objective criteria, METISS took part in 2002-2003 in a GDR-ISIS (CNRS) workgroup [37] whose goal was to ``identify common denominators specific to the different problems related to audio source separation, in order to propose a toolbox of numerical criteria and test signals of calibrated difficulty suited for assessing the performance of existing and future algorithms''. The workgroup released an online prototype of a database of test signals together with an evaluation toolbox.

In [32] [31], we proposed a preliminary step towards the construction of a global evaluation framework for Blind Audio Source Separation (BASS) algorithms. BASS covers many potential applications, each of which involves a more restricted number of tasks. An algorithm may perform well on some tasks and poorly on others, and various factors affect the difficulty of each task and the criteria that should be used to assess the performance of algorithms that try to address it. Thus a typology of BASS tasks greatly helps the building of an evaluation framework. We described some typical BASS applications and proposed some qualitative criteria to evaluate separation in each case. We then listed some of the tasks to be accomplished and presented a possible classification scheme.

In [22], we introduced several measures of distortion that take into account the gain indeterminacies of BSS algorithms. The total distortion includes interference from the other sources as well as noise and algorithmic artifacts, and we defined performance criteria that measure these contributions separately. The criteria are valid even in the case of correlated sources. When the sources are estimated from a degenerate set of mixtures by applying a demixing matrix, we proved that there are upper bounds on the achievable Source to Interference Ratio. We proposed these bounds as benchmarks to assess how well a (linear or nonlinear) BSS algorithm performs on a set of degenerate mixtures. We demonstrated on an example how to use these figures of merit to evaluate and compare the performance of BSS algorithms.
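The decomposition of the total distortion into separate contributions can be sketched numerically (Python/NumPy). The formulas below are our own simplified versions in the spirit of the criteria of [22], not necessarily the exact definitions of the paper: the estimate is projected onto the target source, then onto the span of all sources, which splits the error into an interference term (inside the span) and an artifact term (outside it).

```python
import numpy as np

rng = np.random.default_rng(5)

def bss_criteria(s_est, s_target, s_others):
    """Gain-invariant performance measures: returns (SDR, SIR, SAR) in dB.
    s_est, s_target: (n,) signals; s_others: (n_src-1, n) other true sources."""
    sources = np.vstack([s_target, s_others])               # (n_src, n)
    # projection of the estimate onto the span of all sources (least squares)
    c = np.linalg.lstsq(sources.T, s_est, rcond=None)[0]
    spatial = c @ sources
    # projection onto the target source alone
    target = (s_est @ s_target) / (s_target @ s_target) * s_target
    interf = spatial - target        # error inside the span: interference
    artif = s_est - spatial          # error outside the span: artifacts/noise
    sdr = 10 * np.log10(np.sum(target**2) / np.sum((interf + artif)**2))
    sir = 10 * np.log10(np.sum(target**2) / np.sum(interf**2))
    sar = 10 * np.log10(np.sum(spatial**2) / np.sum(artif**2))
    return sdr, sir, sar

# Toy check: an estimate made of a scaled target, some interference and noise
n = 1000
s1, s2 = rng.standard_normal((2, n))
s_est = 2.0 * s1 + 0.1 * s2 + 0.01 * rng.standard_normal(n)
sdr, sir, sar = bss_criteria(s_est, s1, s2[None, :])
print(sdr, sir, sar)
```

The arbitrary gain 2.0 on the target does not penalize the scores, which is the point of defining the criteria up to gain indeterminacies.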