Team METISS

Inria / Raweb 2003

# Project : metiss

## Section: Scientific Foundations

The large family of audio signals includes a wide variety of temporal and spectral structures and objects of variable duration, ranging from almost stationary regimes (for instance, a violin note) to short transients (as in percussion). The spectral structure can be mainly harmonic (vowels) or noise-like (fricative consonants). More generally, the diversity of timbres results in a large variety of fine structures for the signal and its spectrum, as well as for its temporal and spectral envelopes.

In addition, a majority of audio signals are composite, i.e. they result from the mixture of several sources (voice and music, mixing of several tracks, useful signal and background noise). Audio signals may also have undergone various types of distortion: recording conditions, media degradation, coding and transmission errors, etc.

To account for these factors of diversity, our approach focuses on techniques for decomposing signals on redundant systems (or dictionaries). The elementary atoms of the dictionary correspond to the various structures that are expected to occur in the signal.

#### Redundant systems and adaptive representations

Traditional methods for signal decomposition are generally based on the description of the signal in a given basis (i.e. a linearly independent, generating system that is fixed for the whole signal). In such a basis, the representation of the signal is unique (for example, a Fourier basis, a Dirac basis, orthogonal wavelets, ...). In contrast, an adaptive representation in a redundant system consists of finding an optimal decomposition of the signal (in the sense of a criterion to be defined) in a generating system (or dictionary) whose number of elements is (much) higher than the dimension of the signal.

Let $y$ be a one-dimensional signal of length $T$ and $D$ a redundant dictionary composed of $N>T$ vectors ${g}_{i}$ of dimension $T$.

$y=\left[y\left(t\right)\right]_{1\le t\le T} \qquad D=\left\{{g}_{i}\right\}_{1\le i\le N} \quad \text{with} \quad {g}_{i}=\left[{g}_{i}\left(t\right)\right]_{1\le t\le T}$

If $D$ is a generating system of $\mathbb{R}^{T}$, there are infinitely many exact representations of $y$ in the redundant system $D$, of the form:

$y\left(t\right)={\sum }_{1\le i\le N}{\alpha }_{i}{g}_{i}\left(t\right)$

We denote by $\alpha ={\left\{{\alpha }_{i}\right\}}_{1\le i\le N}$ the $N$ coefficients of the decomposition.
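The non-uniqueness of exact representations can be checked on a very small case. The sketch below assumes a toy dictionary with $T=2$ and $N=3$ unit-norm atoms (two Dirac atoms and one diagonal atom); the dictionary, the signal and the helper name `synthesize` are illustrative choices, not taken from the project.

```python
import math

def synthesize(alpha, atoms):
    """Return y(t) = sum_i alpha_i * g_i(t), with each atom a list of length T."""
    T = len(atoms[0])
    return [sum(a * g[t] for a, g in zip(alpha, atoms)) for t in range(T)]

# Toy dictionary (an assumption for illustration): T = 2, N = 3 unit-norm atoms.
r = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0], [0.0, 1.0], [r, r]]

y = [1.0, 1.0]
alpha_a = [1.0, 1.0, 0.0]              # uses the two Dirac atoms
alpha_b = [0.0, 0.0, math.sqrt(2.0)]   # uses the diagonal atom alone
# Any mixture w * alpha_a + (1 - w) * alpha_b is also an exact representation.
w = 0.25
alpha_c = [w * a + (1.0 - w) * b for a, b in zip(alpha_a, alpha_b)]

for alpha in (alpha_a, alpha_b, alpha_c):
    y_hat = synthesize(alpha, atoms)
    print(alpha, "->", [round(v, 12) for v in y_hat])
```

All three coefficient vectors reconstruct the same signal, and varying `w` produces a continuum of further exact representations.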

The principle of adaptive decomposition then consists in selecting, among all possible decompositions, the best one, i.e. the one which satisfies a given criterion (for example a sparsity criterion) for the signal under consideration, hence the concept of adaptive decomposition (or representation). In some cases, at most $T$ coefficients are non-zero in the optimal decomposition, and the subset of vectors of $D$ thus selected is referred to as the basis adapted to $y$. This approach can be extended to approximate representations of the form:

$y\left(t\right)={\sum }_{1\le i\le M}{\alpha }_{\varphi \left(i\right)}{g}_{\varphi \left(i\right)}\left(t\right)+e\left(t\right)$

with $M<T$, where $\varphi$ is an injective function from $\left[1,M\right]$ into $\left[1,N\right]$ and where $e\left(t\right)$ is the error of the $M$-term approximation of $y\left(t\right)$. In this case, the optimality criterion for the decomposition also takes the approximation error into account.

#### Sparsity criteria

Obtaining a unique solution to the exact decomposition equation above requires the introduction of a constraint on the coefficients ${\alpha }_{i}$. This constraint is generally expressed in the following form:

${\alpha }^{*}=\arg\min_{\alpha }\,F\left(\alpha \right)$

Among the most commonly used functions are the various functions ${L}_{\gamma }$:

${L}_{\gamma }\left(\alpha \right)={\left[\sum_{1\le i\le N}{\left|{\alpha }_{i}\right|}^{\gamma }\right]}^{1/\gamma }$

Recall that for $0<\gamma <1$, the quantity ${L}_{\gamma }{\left(\alpha \right)}^{\gamma }$ is a sum of concave functions of the magnitudes $\left|{\alpha }_{i}\right|$. By convention, ${L}_{0}$ denotes the number of non-zero coefficients in the decomposition.
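To see how these criteria rank decompositions, the sketch below evaluates ${L}_{\gamma }$ on two coefficient vectors with the same ${L}_{2}$ norm, one dense and one sparse. The function name `l_gamma` and both test vectors are assumptions made for this illustration.

```python
def l_gamma(alpha, gamma):
    """Sparsity measure L_gamma; gamma = 0 counts the non-zero coefficients."""
    if gamma == 0:
        return sum(1 for a in alpha if a != 0)
    return sum(abs(a) ** gamma for a in alpha) ** (1.0 / gamma)

alpha_dense = [0.5, 0.5, 0.5, 0.5]    # energy spread over four coefficients
alpha_sparse = [1.0, 0.0, 0.0, 0.0]   # same energy, a single coefficient

for gamma in (2, 1, 0.5, 0):
    print(gamma, l_gamma(alpha_dense, gamma), l_gamma(alpha_sparse, gamma))
```

The two vectors are indistinguishable for $\gamma =2$, whereas every $\gamma \le 1$ assigns the lower cost to the sparse vector, with a gap that widens as $\gamma$ decreases.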

Minimizing the quadratic norm ${L}_{2}$ of the coefficients ${\alpha }_{i}$ (which can be done exactly by solving a linear system) tends to spread the coefficients over the whole collection of vectors in the dictionary. On the other hand, minimizing ${L}_{0}$ yields a maximally parsimonious adaptive representation, as the solution obtained comprises a minimum number of non-zero terms. However, the exact minimization of ${L}_{0}$ is, in general, an intractable NP-hard problem.
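The spreading effect of the ${L}_{2}$ criterion can be observed on a toy case. The sketch below assumes a dictionary formed by the union of two orthonormal bases of $\mathbb{R}^{2}$; for such a tight frame the minimum-${L}_{2}$ exact solution reduces to the inner products with the atoms divided by the number of bases, which avoids a general linear solver. This shortcut is valid only under that assumption, and all names are illustrative.

```python
import math

r = 1.0 / math.sqrt(2.0)
# Union of the Dirac basis and a rotated orthonormal basis of R^2 (toy assumption).
atoms = [[1.0, 0.0], [0.0, 1.0], [r, r], [r, -r]]
num_bases = 2  # D D^T = num_bases * I for a union of orthonormal bases

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_l2_coefficients(y):
    """Minimum-L2 exact solution alpha = D^T y / num_bases (tight-frame case only)."""
    return [inner(g, y) / num_bases for g in atoms]

def synthesize(alpha):
    return [sum(a * g[t] for a, g in zip(alpha, atoms)) for t in range(len(atoms[0]))]

def l2_norm(alpha):
    return math.sqrt(sum(a * a for a in alpha))

def count_nonzeros(alpha, eps=1e-12):
    return sum(1 for a in alpha if abs(a) > eps)

y = [1.0, 0.0]                      # y is exactly the first atom
alpha_l2 = min_l2_coefficients(y)   # minimum-L2 solution
alpha_l0 = [1.0, 0.0, 0.0, 0.0]     # sparsest exact representation

print(alpha_l2, l2_norm(alpha_l2), count_nonzeros(alpha_l2))
print(alpha_l0, l2_norm(alpha_l0), count_nonzeros(alpha_l0))
```

Although the signal is a single atom of the dictionary, the minimum-${L}_{2}$ solution activates three atoms: it has a smaller quadratic norm than the one-term representation but is less sparse.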

An intermediate approach consists in minimizing the ${L}_{1}$ norm, i.e. the sum of the absolute values of the coefficients of the decomposition. This can be achieved by linear programming techniques, and it can be shown that, under some (strong) assumptions, the solution coincides with the one obtained by minimizing ${L}_{0}$. In a majority of concrete cases, this solution has good sparsity properties, without however reaching the level of performance of ${L}_{0}$.
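One common way to obtain the linear program is to split each coefficient into a positive and a negative part. In the sketch below, the auxiliary variables ${u}_{i}$ and ${v}_{i}$ are introduced only for this illustration:

$\alpha_i = u_i - v_i, \qquad u_i \ge 0, \quad v_i \ge 0, \qquad 1\le i\le N$

$\min_{u,v} \sum_{1\le i\le N} \left(u_i + v_i\right) \quad \text{subject to} \quad \sum_{1\le i\le N} \left(u_i - v_i\right) g_i\left(t\right) = y\left(t\right), \quad 1\le t\le T$

At an optimum, ${u}_{i}$ and ${v}_{i}$ cannot both be non-zero (subtracting the smaller of the two from both would lower the objective without changing the constraint), so ${u}_{i}+{v}_{i}=\left|{\alpha }_{i}\right|$ and the objective equals the ${L}_{1}$ norm of $\alpha$. Both the objective and the constraints are linear in $\left(u,v\right)$.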

Other criteria can be taken into account and, as long as the function $F$ is a sum of concave functions of the coefficient magnitudes $\left|{\alpha }_{i}\right|$, the solution obtained has good sparsity properties. In this respect, the entropy of the decomposition is a particularly interesting function, given its links with information theory.

Finally, note that the theory of non-linear approximation offers a framework in which links can be established between the sparsity of exact decompositions and the quality of approximate representations with $M$ terms. This is still an open problem for arbitrary redundant dictionaries.

#### Decomposition algorithms

Three families of approaches are conventionally used to obtain an (optimal or sub-optimal) decomposition of a signal in a redundant system.

The "Best Basis" approach consists in constructing the dictionary $D$ as the union of $B$ distinct bases and then seeking (exhaustively or not), among all these bases, the one which yields the optimal decomposition (in the sense of the selected criterion). For dictionaries with a tree structure (wavelet packets, local cosines), the complexity of the algorithm is much lower than the number of bases $B$, but the result is generally not the optimal one that would be obtained if the dictionary $D$ were taken as a whole.
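As an illustration of the selection step, the sketch below performs an exhaustive best-basis search over a hypothetical library of $B=2$ orthonormal bases of $\mathbb{R}^{2}$, using the ${L}_{1}$ norm of the coefficients as the criterion. The library, the cost and the function names are assumptions for this example; it does not implement the fast tree-structured search mentioned above.

```python
import math

r = 1.0 / math.sqrt(2.0)
# Hypothetical library of B = 2 orthonormal bases of R^2.
bases = {
    "dirac": [[1.0, 0.0], [0.0, 1.0]],
    "rotated": [[r, r], [r, -r]],
}

def analyze(y, basis):
    """Coefficients of y in an orthonormal basis (inner products with its vectors)."""
    return [sum(g_t * y_t for g_t, y_t in zip(g, y)) for g in basis]

def l1_cost(alpha):
    return sum(abs(a) for a in alpha)

def best_basis(y):
    """Exhaustive search: return (name, coefficients) minimizing the L1 cost."""
    scored = {name: analyze(y, basis) for name, basis in bases.items()}
    name = min(scored, key=lambda n: l1_cost(scored[n]))
    return name, scored[name]

print(best_basis([1.0, 1.0]))   # a signal aligned with the rotated basis
print(best_basis([2.0, 0.0]))   # a signal aligned with the Dirac basis
```

Each signal is assigned the basis in which it concentrates on a single vector; no decomposition mixing vectors from both bases is ever considered, which is the limitation noted above.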

The "Basis Pursuit" approach minimizes the ${L}_{1}$ norm of the decomposition using linear programming techniques. The approach has a higher complexity, but the solution obtained generally has good sparsity properties, without however reaching the optimal solution that would be obtained by minimizing ${L}_{0}$.

The "Matching Pursuit" approach consists in optimizing the decomposition of the signal incrementally, by searching at each stage for the element of the dictionary that is most strongly correlated with the signal to be decomposed, and then subtracting the contribution of this element from the signal. This procedure is repeated on the residual thus obtained, until the number of (linearly independent) components is equal to the dimension of the signal. The coefficients $\alpha$ can then be re-evaluated on the basis thus obtained. This greedy algorithm is sub-optimal, but it has good properties regarding the decay of the error and the flexibility of its implementation.
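The greedy loop can be sketched in a few lines. The version below assumes unit-norm atoms and stops either after `max_iter` iterations or when the residual energy falls below `tol`; these stopping rules, the toy dictionary and the function name are assumptions for illustration, and the final re-evaluation of the coefficients is omitted.

```python
import math

def matching_pursuit(y, atoms, max_iter=10, tol=1e-12):
    """Greedy decomposition of y on unit-norm atoms; returns (alpha, residual)."""
    residual = list(y)
    alpha = [0.0] * len(atoms)
    for _ in range(max_iter):
        if sum(r * r for r in residual) <= tol:
            break
        # Correlation of every atom with the current residual.
        corrs = [sum(g_t * r_t for g_t, r_t in zip(g, residual)) for g in atoms]
        best = max(range(len(atoms)), key=lambda i: abs(corrs[i]))
        # Accumulate the coefficient and subtract the atom's contribution.
        alpha[best] += corrs[best]
        residual = [r_t - corrs[best] * g_t
                    for r_t, g_t in zip(residual, atoms[best])]
    return alpha, residual

# Toy dictionary (assumption): union of two orthonormal bases of R^2.
r = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0], [0.0, 1.0], [r, r], [r, -r]]

alpha, residual = matching_pursuit([3.0, 1.0], atoms)
print(alpha, residual)
```

Because each update removes the projection of the residual onto the selected unit-norm atom, the residual energy never increases from one iteration to the next.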

Intermediate approaches can also be considered, using hybrid algorithms which seek a compromise between computational complexity, quality of sparsity and simplicity of implementation.

#### Dictionary construction

The choice of the dictionary $D$ naturally has a strong influence on the properties of the adaptive decomposition: if the dictionary contains only a few elements adapted to the structure of the signal, the results may be neither satisfactory nor usable.

The choice of the dictionary can rely on a priori considerations. For instance, some redundant systems may require less computation than others to evaluate the projections of the signal on the elements of the dictionary. For this reason, Gabor atoms, wavelet packets and local cosines have interesting properties. Moreover, general hints about the signal structure can contribute to the design of the dictionary elements: any knowledge of the distribution and frequency variation of the energy of the signals, or of the position and typical duration of the sound objects, can help guide the choice of the dictionary (harmonic molecules, chirplets, atoms with predetermined positions, ...).

Conversely, in other contexts, it can be desirable to build the dictionary with data-driven approaches, i.e. from training examples of signals belonging to the same class (for example, the same speaker or the same musical instrument). In this respect, Principal Component Analysis (PCA) offers interesting properties, but other approaches can be considered (in particular the direct optimization of the sparsity of the decomposition, or of properties of the approximation error with $M$ terms) depending on the targeted application.

In some cases, the training of the dictionary can require stochastic optimization, but one can also be interested in EM-like approaches when it is possible to formulate the redundant representation approach within a probabilistic framework.

Extensions of adaptive representation techniques can also be envisaged by generalizing the approach to probabilistic dictionaries, i.e. dictionaries comprising vectors which are random variables rather than deterministic signals. Within this framework, the signal $y\left(t\right)$ is modeled as a linear combination of observations emitted by each element of the dictionary, which makes it possible to gather in the same model several variants of the same sound (for example various waveforms for a noise, if they are perceptually equivalent). Progress in this direction depends on the definition of a realistic generative model for the elements of the dictionary and on the development of effective techniques for estimating the model parameters.

#### Signal separation

METISS is especially interested in source and signal separation in the underdetermined case, i.e. when the number of sources is strictly greater than the number of sensors.

In the particular case of two sources and one sensor, the mixed (one-dimensional) signal can be written as:

$y={s}_{1}+{s}_{2}+\epsilon$

where ${s}_{1}$ and ${s}_{2}$ denote the sources and $\epsilon$ an additive noise.

Under a probabilistic framework, we can denote by ${\theta }_{1}$, ${\theta }_{2}$ and $\eta$ the model parameters of the sources and of the noise. The problem of source separation then becomes:

$\left({\hat{s}}_{1},{\hat{s}}_{2}\right)=\arg\max_{\left({s}_{1},{s}_{2}\right)}\left[P\left({s}_{1},{s}_{2}\mid y,{\theta }_{1},{\theta }_{2}\right)\right]$

By applying Bayes' rule and assuming statistical independence between the two sources, the desired result can be obtained by solving:

$\left({\hat{s}}_{1},{\hat{s}}_{2}\right)=\arg\max_{\left({s}_{1},{s}_{2}\right)}\left[P\left(y\mid {s}_{1},{s}_{2}\right)P\left({s}_{1}\mid {\theta }_{1}\right)P\left({s}_{2}\mid {\theta }_{2}\right)\right]$

The first of the three terms in the argmax can be obtained via the noise model:

$P\left(y\mid {s}_{1},{s}_{2}\right)\propto P\left(y-\left({s}_{1}+{s}_{2}\right)\mid \eta \right)=P\left(\epsilon \mid \eta \right)$

The other two terms are obtained via likelihood functions corresponding to source models that are either trained from examples or designed from prior knowledge. Commonly used models include the Laplacian model, the Gaussian Mixture Model and the Hidden Markov Model.
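For intuition, the criterion has a closed form in the simplest setting. The sketch below assumes a single scalar sample, zero-mean Gaussian sources with known variances and zero-mean Gaussian noise; this is a deliberate simplification of the richer models listed above, and the function and parameter names are hypothetical. Setting the gradient of the log-posterior to zero gives $\left(y-{s}_{1}-{s}_{2}\right)/{\sigma }_{\epsilon }^{2}={s}_{1}/{\sigma }_{1}^{2}={s}_{2}/{\sigma }_{2}^{2}$, hence each estimate is the observation weighted by the relative variance of its source.

```python
def map_separate(y, var1, var2, var_noise):
    """MAP estimates of (s1, s2) from y = s1 + s2 + noise.

    Assumes zero-mean Gaussian sources and noise with known variances
    (an illustrative simplification, not the models used in the text).
    """
    total = var1 + var2 + var_noise
    return var1 * y / total, var2 * y / total

s1_hat, s2_hat = map_separate(4.0, 2.0, 1.0, 1.0)
print(s1_hat, s2_hat, 4.0 - s1_hat - s2_hat)
```

The source with the larger prior variance receives the larger share of the observation, and the remainder is attributed to the noise.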

These models can be linked to the distribution of the representation coefficients in a redundant system that pools together several bases, each adapted to one of the sources present in the mixture.