Project : metiss
Section: New Results
Advanced audio signal processing
Nonlinear approximation and sparse decompositions
Participant : Rémi Gribonval.
Research on nonlinear approximation of signals and images with redundant dictionaries has been carried out over the past few years in collaboration with Morten Nielsen, from the University of Aalborg in Denmark. The goal is to understand which classes of functions/signals can be approximated at a given rate by m-term expansions using various families of practical or theoretical approximation algorithms. Much is known when the dictionary is an orthonormal wavelet basis, and we focus on finding the right extension of the wavelet results to structured redundant dictionaries.
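For reference, the approximation classes in question are built from the best m-term approximation error with respect to a dictionary; a sketch of the standard definitions, with $\mathcal{D} = \{g_k\}$ a dictionary in a space $H$:

```latex
% best m-term approximation error of f with respect to the dictionary D
\sigma_m(f) := \inf_{\#\Lambda \le m}\ \inf_{(c_k)}
  \Big\| f - \sum_{k \in \Lambda} c_k g_k \Big\|_H ,
% approximation class of rate s: functions approximated at rate m^{-s}
\mathcal{A}^s := \bigl\{ f \in H :\ \sigma_m(f) = O(m^{-s}) \bigr\}.
```

Characterizing $\mathcal{A}^s$ for framelet dictionaries in terms of classical smoothness (Besov) spaces is the type of result described below.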
Last year, we completely characterized (in terms of Besov spaces) the best m-term approximation classes of spline-based redundant framelet systems. This year, we have shown that the characterization extends to more general framelet systems, and that Sobolev spaces are also characterized in terms of framelet coefficients. The range of Besov smoothness for which the characterization holds with the frame expansion is limited by the number of vanishing moments of the functions in the dual frame. However, we proved that, for twice oversampled MRA-based framelets, the same results hold true with no restriction on the number of vanishing moments, provided the canonical frame expansion is replaced with another linear expansion. The trick is to prove a Jackson inequality by building a ``nice'' wavelet which has a highly sparse linear expansion in the twice oversampled framelet system.
A problem closely related to m-term approximation is the computation of sparse representations of a function in a redundant dictionary. For the family of localized frames (which includes most Gabor and wavelet type systems) it is known that the canonical frame expansion provides a near-sparsest representation of any signal. We have shown that this property remains valid for a whole range of sparseness measures, with an exponent r that depends on the degree s of localization/decay of the frame. However, we have disproved a conjecture of Gröchenig about the existence of a general Bernstein inequality for localized frames, by building a simple counter-example. Many simple and yet interesting frames –such as the union of a wavelet basis and a Wilson basis– are not localized frames, and one cannot rely on the frame coefficients to obtain a near-sparsest representation for various measures. We extended a result by Donoho, Huo, Elad and Bruckstein on sparse representations of signals in a union of two orthonormal bases: considering general (redundant) dictionaries in finite dimension, we derived sufficient conditions on a signal for it to have a unique sparse representation in such dictionaries. The special case where the dictionary is a union of several orthonormal bases was studied in more detail. We also introduced a large class of admissible sparseness measures, and gave sufficient conditions for a signal to have a unique sparse representation from the dictionary w.r.t. such a sparseness measure, in finite or infinite dimension. Moreover, we gave sufficient conditions on the sparseness of a signal such that the solution of a simple linear programming problem simultaneously solves all the non-convex (and generally hard combinatorial) problems of sparsest representation of the signal w.r.t. arbitrary admissible sparseness measures.
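The uniqueness conditions alluded to above are typically expressed through the mutual coherence of the dictionary, i.e. the largest inner product between distinct unit-norm atoms. A minimal numerical sketch (not the paper's exact statements; the union of the identity and an orthonormal DCT basis is chosen purely for illustration):

```python
import numpy as np

def mutual_coherence(D):
    """Largest absolute inner product between distinct unit-norm atoms."""
    D = D / np.linalg.norm(D, axis=0)
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

def dct_matrix(n):
    # orthonormal DCT-II matrix (rows are the basis vectors)
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

n = 64
D = np.hstack([np.eye(n), dct_matrix(n).T])   # union of two orthonormal bases
mu = mutual_coherence(D)
# classical coherence-based sufficient condition: a representation with
# fewer than (1 + 1/mu)/2 nonzero terms is the unique sparsest one
bound = 0.5 * (1.0 + 1.0 / mu)
```

For this 2x-overcomplete dictionary the bound guarantees uniqueness for 3-term representations; tighter conditions of the kind derived in the work above go beyond this simple coherence argument.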
In joint work with Pierre Vandergheynst from EPFL, we extended to the case of Pure Matching Pursuit recent results by Gilbert et al. and Tropp about exact recovery with Orthogonal Matching Pursuit. In particular, in incoherent dictionaries, our result extends a result by Villemoes about Matching Pursuit in the Haar-Walsh wavepacket dictionary: if we start with a linear combination of sufficiently few atoms of an incoherent dictionary, Matching Pursuit will pick a ``correct'' atom at each step and the residual will converge exponentially fast to zero. The rate of exponential convergence is controlled by the number of atoms in the initial expansion.
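The Pure Matching Pursuit iteration being discussed is easy to sketch; the toy dictionary below (random unit-norm atoms, a 3-term initial expansion) is an illustrative assumption, not the dictionaries studied in the work itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def matching_pursuit(D, x, n_iter):
    """Pure Matching Pursuit: greedily subtract the most correlated atom."""
    r = x.copy()
    coef = np.zeros(D.shape[1])
    res_norms = []
    for _ in range(n_iter):
        c = D.T @ r                       # correlations with all atoms
        k = int(np.argmax(np.abs(c)))     # best matching atom
        coef[k] += c[k]
        r = r - c[k] * D[:, k]            # residual update
        res_norms.append(np.linalg.norm(r))
    return coef, r, res_norms

n, K = 64, 128
D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
x = 2.0 * D[:, 5] - 1.5 * D[:, 40] + 1.0 * D[:, 100]   # 3-term expansion
coef, r, res_norms = matching_pursuit(D, x, 50)
```

On such sparse inputs the residual norm decays geometrically, which is the behaviour the recovery results quantify.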
Dictionary design for source separation
Recent theoretical work has shown that Basis Pursuit or Matching Pursuit techniques can recover highly sparse representations of signals from incoherent redundant dictionaries. To exploit these results we have started a research project dedicated to the design of incoherent dictionaries with the aim of performing source separation. First, we compared the sparsity of the decomposition in various orthonormal bases, both theoretically and experimentally. We observed that the cosine basis and the data-dependent Karhunen-Loève basis provide the sparsest decompositions among the tested orthonormal bases. In the cosine basis, we noticed that the frame size leading to the maximum gain in sparsity corresponds to the largest duration over which the signal can be considered stationary. Then we proposed five methods to ``learn'' a dictionary from training data so as to maximize the mean sparsity. In particular, we proposed a new method based on the SVD and thresholding to build dictionaries which are unions of orthonormal bases. Besides its promising results, the method is flexible in that the sparsity measure which is optimized can easily be replaced with some other criterion.
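The kind of sparsity comparison mentioned above can be illustrated numerically: count how many of the largest coefficients are needed to capture a given fraction of the signal energy, in the cosine basis versus the canonical (Dirac) basis. The quasi-stationary test frame and the 99% energy threshold are illustrative choices, not those of the actual experiments:

```python
import numpy as np

def dct_matrix(n):
    # orthonormal DCT-II matrix (rows are the basis vectors)
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

def coeffs_for_energy(c, frac=0.99):
    """Number of largest coefficients needed to capture `frac` of the energy."""
    e = np.sort(np.asarray(c) ** 2)[::-1]
    return int(np.searchsorted(np.cumsum(e) / e.sum(), frac) + 1)

n = 256
t = np.arange(n)
# a quasi-stationary frame: two sinusoidal partials
x = np.cos(2 * np.pi * 9.5 * t / n) + 0.5 * np.cos(2 * np.pi * 30.5 * t / n)
k_dct = coeffs_for_energy(dct_matrix(n) @ x)   # cosine basis
k_dirac = coeffs_for_energy(x)                 # canonical basis
```

For such frames the cosine basis concentrates the energy in far fewer coefficients, which is the sparsity gain exploited for separation.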
Granular models of audio signals
The theoretical framework which is the foundation of our work on granular signal models is now established. The model is frame-based and of ``hybrid'' nature, in that it combines two different approaches. The non-parametric part consists of a dictionary element or prototype (plain waveforms, Fourier spectra, LPC excitation signals, etc.), where each prototype corresponds to one possible state of the model. The parametric part attempts to model how frame signals deviate from a prototype. Clustering is then used to compute the dictionary and assign a state index to each frame. The clustering algorithm attempts to balance two antagonistic criteria, namely the quality and efficiency of the representation, respectively measured by the SNR and the symbol rate. Clustering algorithms have been developed specifically for this purpose. Comparative tests conducted on lossy compression of real-world signals showed that they performed better than a number of state-of-the-art algorithms. Various models such as LPC adaptive-codebook and various spectrum-based models have been investigated. Our current efforts concentrate on modelling local dependencies of amplitude and phase discrete spectra, both in the time/state and frequency domain. In this ``spectral'' model, prototypes are spectrum templates; we attempt to model amplitude and phase differences between the frame spectra of a cluster and the corresponding prototype, only around energy-spectrum peaks. The aim is to build an efficient granular, object-based, time-frequency model of audio signals. Although most of our work is directed towards compression, many other applications are possible, such as segmentation, content-based indexing, summary generation and musical resynthesis.
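The dictionary-computation step can be sketched as a plain clustering of signal frames: prototypes are cluster centroids and each frame receives the state index of its nearest prototype. This is a minimal illustration (ordinary k-means on synthetic frames), not the rate-distortion-aware clustering algorithms actually developed:

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans_frames(frames, n_proto, n_iter=20):
    """Cluster frames; return prototypes and a state index per frame."""
    # farthest-point initialisation, then plain Lloyd iterations
    protos = [frames[0]]
    for _ in range(n_proto - 1):
        d = np.min([((frames - p) ** 2).sum(1) for p in protos], axis=0)
        protos.append(frames[int(d.argmax())])
    protos = np.array(protos)
    for _ in range(n_iter):
        d = ((frames[:, None, :] - protos[None]) ** 2).sum(-1)
        states = d.argmin(1)                  # state index of each frame
        for k in range(n_proto):
            if (states == k).any():
                protos[k] = frames[states == k].mean(0)
    return protos, states

L = 32
w1 = np.sin(2 * np.pi * np.arange(L) / L)     # smooth waveform prototype
w2 = np.sign(w1)                              # square-wave prototype
frames = np.vstack([w1 + 0.05 * rng.standard_normal((50, L)),
                    w2 + 0.05 * rng.standard_normal((50, L))])
protos, states = kmeans_frames(frames, 2)
```

In the hybrid model, the parametric part would then describe how each frame deviates from its assigned prototype.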
Underdetermined audio source separation
Keywords : degenerate blind source separation , piecewise linear separation , sparse decomposition , nonlinear approximation , Best Basis , Matching Pursuit , denoising , Wiener filter , masking , clustering , Gaussian Mixture Models , Hidden Markov Models .
The problem of separating several audio sources mixed on one or more channels is now well understood and tackled in the determined case, where the number of sources does not exceed the number of channels. Based on our work on statistical modeling and sparse decompositions of audio signals in redundant dictionaries (see above), we have proposed techniques to deal with the degenerate case (monophonic and stereophonic), where it is not possible to merely estimate and apply a demixing matrix.
We proposed two new methods to perform the separation of two sound sources from a single sensor. The first method generalizes Wiener filtering with locally stationary, non-Gaussian, parametric source models. The method involves a learning phase for which we proposed three different algorithms. In the separation phase, we used a sparse non-negative decomposition algorithm of our own. The second method also generalizes Wiener filtering, but with Gaussian Mixture distributions and Hidden Markov Models. The method involves a training phase of the model parameters, which is done with the classical EM algorithm. We derived a new algorithm for the re-estimation of the sources with these mixture models during the separation phase. We applied these methods to the separation of music from speech in broadcast news for robust speech recognition.
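The Wiener-filtering principle underlying both methods can be sketched in its simplest form, with a single fixed Gaussian spectral model per source instead of trained GMM/HMM state models (the spectral shapes below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

n_bins = 257
f = np.linspace(0.0, 0.5, n_bins)
p1 = np.exp(-((f - 0.05) / 0.02) ** 2)   # power spectrum model, source 1
p2 = np.exp(-((f - 0.30) / 0.02) ** 2)   # power spectrum model, source 2

# draw one spectral frame per source, shaped by its model
S1 = np.sqrt(p1) * (rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins))
S2 = np.sqrt(p2) * (rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins))
X = S1 + S2                               # single-channel mixture spectrum

# Wiener gains derived from the source power spectra
g1 = p1 / (p1 + p2 + 1e-12)
S1_hat, S2_hat = g1 * X, (1.0 - g1) * X   # estimated source spectra
```

With GMM/HMM models, the gains additionally depend on the (hidden) state pair of the two sources in each frame, which is what the EM training and re-estimation algorithms handle.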
Following our work of last year on Matching Pursuit based audio source separation in the stereophonic case, and building upon our new approaches for single channel separation (see above), we proposed a new framework, called piecewise linear separation, for blind source separation of possibly degenerate mixtures, including the extreme case of a single mixture of several sources. Its basic principle is to: 1/ decompose the observations into ``components'' using some sparse decomposition/nonlinear approximation technique; 2/ perform separation on each component using a ``local'' separation matrix. It covers many recently proposed techniques for degenerate BSS, as well as several new algorithms that we propose. We discussed two particular methods of multichannel decomposition based on the Best Basis and Matching Pursuit algorithms, as well as several methods to compute the local separation matrices (assuming the mixing matrix is known). Numerical experiments were used to compare the performance of various combinations of the decomposition and local separation methods. On the dataset used for the experiments, Best Basis with either cosine packets or wavelet packets (Beylkin, Vaidyanathan, Battle3 or Battle5 filter) appeared to be the best choice in terms of overall performance, because it introduces a relatively low level of artefacts in the estimation of the sources; Matching Pursuit introduces slightly more artefacts, but can improve the rejection of the unwanted sources.
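The two-step principle can be sketched in a deliberately simplified setting: here each DFT bin plays the role of a ``component'' (instead of a Best Basis or Matching Pursuit decomposition), the mixing matrix is known, and the local separation matrix reduces to a least-squares fit on the best-matching source direction. All numerical values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# two sources with disjoint supports in the "component" (DFT bin) domain
n = 128
S = np.zeros((2, n), dtype=complex)
S[0, 5:20] = rng.standard_normal(15) + 1j * rng.standard_normal(15)
S[1, 60:90] = rng.standard_normal(30) + 1j * rng.standard_normal(30)
A = np.array([[1.0, 0.6], [0.4, 1.0]])    # known stereo mixing matrix
X = A @ S                                  # two-channel mixture components

est = np.zeros_like(S)
cols = A / np.linalg.norm(A, axis=0)       # unit-norm mixing directions
for k in range(n):
    x = X[:, k]
    if np.abs(x).sum() < 1e-12:
        continue                           # empty component, nothing to do
    j = int(np.argmax(np.abs(cols.T @ x))) # dominant source direction
    a = A[:, j]
    est[j, k] = (a @ x) / (a @ a)          # local single-source least squares
```

When the components isolate the sources well, as here, the local separation is exact; the quality of the decomposition step is precisely what governs artefacts and interference rejection in practice.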
Evaluation of blind audio source separation methods
Because the success or failure of an algorithm for a practical task such as BSS cannot be assessed without agreed upon, pre-specified objective criteria, METISS took part in 2002-2003 in a GDR-ISIS (CNRS) workgroup whose goal was to ``identify common denominators specific to the different problems related to audio source separation, in order to propose a toolbox of numerical criteria and test signals of calibrated difficulty suited for assessing the performance of existing and future algorithms''. The workgroup released an online prototype of a database of test signals together with an evaluation toolbox.
We proposed a preliminary step towards the construction of a global evaluation framework for Blind Audio Source Separation (BASS) algorithms. BASS covers many potential applications that involve a more restricted number of tasks. An algorithm may perform well on some tasks and poorly on others. Various factors affect the difficulty of each task and the criteria that should be used to assess the performance of algorithms that try to address it. Thus a typology of BASS tasks greatly helps the building of an evaluation framework. We described some typical BASS applications and proposed qualitative criteria to evaluate separation in each case. We then listed some of the tasks to be accomplished and presented a possible classification scheme.
We introduced several measures of distortion that take into account the gain indeterminacies of BSS algorithms. The total distortion includes interference from the other sources as well as noise and algorithmic artifacts, and we defined performance criteria that measure these contributions separately. The criteria are valid even in the case of correlated sources. When the sources are estimated from a degenerate set of mixtures by applying a demixing matrix, we proved that there are upper bounds on the achievable Source to Interference Ratio. We proposed these bounds as benchmarks to assess how well a (linear or nonlinear) BSS algorithm performs on a set of degenerate mixtures. We demonstrated on an example how to use these figures of merit to evaluate and compare the performance of BSS algorithms.
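The spirit of these criteria can be sketched as follows: project the estimated source onto the true source (target part), then onto the span of all true sources (target + interference), and call the remainder artifacts; energy ratios between these parts give gain-invariant figures of merit. This is a simplified variant for illustration, not the exact definitions of the paper:

```python
import numpy as np

rng = np.random.default_rng(4)

def sir_sdr(s_hat, s_true, others):
    """Decompose an estimate into target / interference / artifact parts."""
    # orthogonal projection on the true source: the "target" part
    s_target = (s_hat @ s_true) / (s_true @ s_true) * s_true
    S = np.vstack([s_true, others])                # all true sources
    coef, *_ = np.linalg.lstsq(S.T, s_hat, rcond=None)
    proj = S.T @ coef                              # projection on source span
    e_interf = proj - s_target                     # interference part
    e_artif = s_hat - proj                         # artifact part
    sir = 10 * np.log10((s_target @ s_target) / (e_interf @ e_interf))
    sdr = 10 * np.log10((s_target @ s_target)
                        / ((e_interf + e_artif) @ (e_interf + e_artif)))
    return sir, sdr

n = 10000
s1, s2 = rng.standard_normal(n), rng.standard_normal(n)
s_hat = s1 + 0.1 * s2 + 0.05 * rng.standard_normal(n)  # imperfect estimate
sir, sdr = sir_sdr(s_hat, s1, s2)
```

Because every term scales linearly with the estimate, the ratios are invariant to an arbitrary gain on the estimated source, which is the indeterminacy the criteria are designed to absorb.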