METISS is a joint research group between CNRS, INRIA, Rennes 1 University and INSA.
The research objectives of the METISS research group are dedicated to audio signal and speech processing and are organised along three main axes: speaker characterization, information detection and tracking in audio streams and "advanced" processing of audio signals (in particular, source separation). Some aspects of speech recognition (modeling and decoding) are also addressed so as to reinforce these three principal topics. All these objectives contribute to the more general area of audio scene analysis.
The main industrial sectors related to the topics of the METISS research group are the telecommunication sector (with voice authentication), the Internet and multimedia sector (with audio indexing), the musical and audiovisual production sector (with audio signal processing) and, marginally, the sector of educational software, games and toys.
In addition to the dissemination of our work through publications in conferences and journals, our scientific activity is accompanied by a permanent concern for evaluating and assessing our progress within the framework of evaluation campaigns. We also widely disseminate software resources corresponding to our latest developments.
On a regular basis, METISS is involved in bilateral or multilateral partnerships, within the framework of consortia, networks, thematic groups, national research projects, European projects and industrial contracts with various local companies.
Probabilistic approaches offer a general theoretical framework which has yielded considerable progress in various fields of pattern recognition. In speech processing in particular, the probabilistic framework provides a solid formalism in which various problems of segmentation, detection and classification can be formulated. Coupled with statistical approaches, the probabilistic paradigm makes it possible to easily adapt relatively generic tools to various applicative contexts, thanks to estimation techniques for training from examples.
A particularly productive family of probabilistic models is the Hidden Markov Model, either in its general form or under some degenerate variants. The stochastic framework makes it possible to rely on well-known algorithms for the estimation of the model parameters (EM algorithms, ML criteria, MAP techniques, etc.) and for the search for the best model in the sense of exact or approximate maximum likelihood (Viterbi decoding or beam search, for example). More recently, Bayesian networks have emerged as a powerful framework for the modeling of musical signals.
In practice, however, the use of probabilistic models must be accompanied by a number of adjustments to take into account problems occurring in real contexts of use, such as model inaccuracy, the insufficiency (or even the absence) of training data, their poor statistical coverage, etc.
Another focus of the activities of the METISS research group is dedicated to sparse representations of signals in redundant systems. The use of sparsity or entropy criteria (in place of the least-squares criterion) to force the uniqueness of the solution of an underdetermined system of equations makes it possible to seek an economical representation (exact or approximate) of a signal in a redundant system, which is better able to account for the diversity of structures within an audio signal.
This topic opens a vast field of scientific investigation: sparse decomposition, sparsity criteria, pursuit algorithms, construction of efficient redundant dictionaries, links with nonlinear approximation theory, probabilistic extensions, etc. The potential applicative outcomes are numerous.
This section briefly presents these various theoretical elements, which constitute the fundamentals of our activities.
For several decades, probabilistic approaches have been used successfully for various tasks in pattern recognition, and more particularly in speech recognition, whether for the recognition of isolated words, the transcription of continuous speech, speaker recognition tasks or language identification. Probabilistic models indeed make it possible to effectively account for various factors of variability occurring in the signal, while easily lending themselves to the definition of metrics between an observation and the model of a sound class (phoneme, word, speaker, etc.).
The probabilistic approach for the representation of an (audio) class X relies on the assumption that this class can be described by a probability density function (PDF) P(·|X) which associates a probability P(Y|X) to any observation Y. In the field of speech processing, the class X can represent a phoneme, a sequence of phonemes, a word from a vocabulary, a particular speaker, a type of speaker, a language, etc. Class X can also correspond to other types of sound objects, for example a family of sounds (word, music, applause), a sound event (a particular noise, a jingle), a sound segment with stationary statistics (on both sides of a rupture), etc.
In the case of audio signals, the observations Y are of an acoustical nature, for example vectors resulting from the analysis of the short-term spectrum of the signal (filterbank coefficients, cepstrum coefficients, time-frequency principal components, etc.) or any other representation accounting for the information that is required for an efficient separation of the various audio classes considered.
In practice, the PDF P is not accessible to measurement. It is therefore necessary to resort to an approximation of this function, which is usually referred to as the likelihood function. This function can be expressed in the form of a parametric model; the models most used in the field of speech and audio processing are the Gaussian Model (GM), the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM). In the rest of this text, we will denote as θ the set of parameters which define the model under consideration. θ_X will denote the vector of parameters for class X, and in this case the likelihood of an observation Y will be written P(Y | θ_X).
Choosing a particular family of models is based on a set of considerations: the general structure of the data, prior knowledge on the audio class making it possible to size the model, the speed of calculation of the likelihood function, the number of degrees of freedom of the model compared to the volume of training data available, etc.
The determination of the model parameters for a given class X is generally based on a statistical estimation step consisting in determining the optimal value of the parameter vector θ_X, i.e. the parameters that maximize a modeling criterion on a training set {Y}_tr comprising observations corresponding to class X.

In some cases, the Maximum Likelihood (ML) criterion can be used:

θ_X^ML = argmax_θ P({Y}_tr | θ)
This approach is generally satisfactory when the number of parameters to be estimated is small w.r.t. the number of training observations. However, in many applicative contexts, other estimation criteria are necessary to guarantee more robustness of the learning process with small quantities of training data. Let us mention in particular the Maximum a Posteriori (MAP) criterion:

θ_X^MAP = argmax_θ P({Y}_tr | θ) p(θ)

which relies on a prior probability p(θ) of the parameter vector θ, expressing possible knowledge on the estimated parameter distribution for the class considered. Discriminative training is another alternative to these two criteria, though definitely more complex to implement than the ML and MAP criteria.
In addition to the fact that the ML criterion is only a particular case of the MAP criterion (under the assumption of a uniform prior probability for θ), the MAP criterion happens to be experimentally better adapted to small volumes of training data and offers better generalization capabilities for the estimated models (as measured, for example, by the improvement of classification and recognition performance on new data). Moreover, the same scheme can be used in the framework of incremental adaptation, i.e. for the refinement of the parameters of a model using new data observed, for instance, in the course of use of the recognition system. In this case, the prior p(θ) is given by the model before adaptation and the MAP estimate uses the new data to update the model parameters.
Whatever criterion is considered (ML or MAP), the estimate of the parameters is obtained with the EM (Expectation-Maximization) algorithm, which provides a solution corresponding to a local maximum of the training criterion.
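As an illustration, the contrast between the ML and MAP estimates can be sketched on a toy example: estimating the mean of a one-dimensional Gaussian class model from a handful of observations, with a conjugate Gaussian prior centred on a hypothetical "world" mean. All numerical values below are invented for the sketch; real systems apply the same shrinkage to GMM components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior knowledge (e.g. a speaker-independent "world" model): mean mu0.
mu0 = 0.0
r = 16.0          # relevance factor controlling the weight of the prior

# Tiny training set drawn from the true class distribution (mean 2.0).
data = rng.normal(loc=2.0, scale=1.0, size=5)
n = len(data)

# ML estimate: empirical mean of the training data only.
mu_ml = data.mean()

# MAP estimate (conjugate Gaussian prior): shrinkage of the ML estimate
# towards the prior mean, with a weight controlled by n and r.
alpha = n / (n + r)
mu_map = alpha * mu_ml + (1.0 - alpha) * mu0
```

With few observations (small n), the MAP estimate stays close to the prior mean; as n grows, alpha tends to 1 and the MAP estimate converges to the ML estimate, which matches the "uniform prior" remark above.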
During the recognition phase, it is necessary to evaluate the likelihood function for the various class hypotheses X_k. When the complexity of the model is high, i.e. when the number of classes is large and the observations to be recognized are multidimensional, it is generally necessary to implement fast calculation algorithms to approximate the likelihood function.
In addition, when the class models are HMMs, the evaluation of the likelihood requires a decoding step to find the most probable sequence of hidden states. This is done by implementing the Viterbi algorithm, a traditional tool in the field of speech recognition.
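A minimal sketch of Viterbi decoding, here for a toy 2-state HMM with hand-picked (purely illustrative) transition and observation log-probabilities:

```python
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """Most probable hidden-state sequence of an HMM.

    log_A : (S, S) log transition matrix, log_pi : (S,) log initial probs,
    log_B : (T, S) per-frame log-likelihoods of the observations.
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]           # best log-score ending in each state
    psi = np.zeros((T, S), dtype=int)   # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A        # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())

# Toy example: two sticky states, frame likelihoods favouring 0, 0, 1, 1.
A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
pi = np.log(np.array([0.5, 0.5]))
B = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]))
best_path, best_score = viterbi(A, pi, B)
# best_path is [0, 0, 1, 1] for this toy example
```

Working in the log domain avoids numerical underflow, which is the standard practice in speech decoders.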
If, moreover, the observations consist of segments belonging to different classes, chained by transition probabilities between successive classes and without a priori knowledge of the boundaries between segments (which is for instance the case in a continuous speech utterance), it is necessary to resort to beam-search techniques to decode a (quasi-)optimal sequence of states at the level of the whole utterance.
When the task to solve is the classification of an observation into one class among several closed-set possibilities, the decision usually relies on the maximum a posteriori rule:

X* = argmax_{X ∈ C} P(Y | θ_X) P(X)

where C denotes the set of possible classes.
In other contexts (for instance, in speaker verification, word-spotting or sound class detection), the problem of classification can be formulated as a binary hypothesis testing problem, consisting in deciding whether the tested observation is more likely to pertain to the class X (hypothesis X) or not to pertain to it (i.e. to pertain to the "non-class", denoted as hypothesis X̄). In this case, the decision consists in the acceptance or the rejection of the class hypothesis.
This latter problem can theoretically be solved within the framework of Bayesian decision by calculating the ratio S_X of the PDFs for the class and the non-class distributions, and comparing this ratio to a decision threshold:

S_X(Y) = P(Y | X) / P(Y | X̄), accept if S_X(Y) > R

where the optimal threshold R does not depend on the distribution of class X, but only on the operating conditions of the system, via the ratio of the prior probabilities of the two hypotheses and the ratio of the costs of false acceptance and false rejection.
In practice, however, the Bayesian theory cannot be applied straightforwardly, because the quantities provided by the probabilistic models are not the true PDFs, but only likelihood functions which approximate the true PDFs more or less accurately, depending on the quality of the model of the class.
The optimal decision rule must then be rewritten in terms of the likelihood functions, and the corresponding threshold Θ_X(R) must be adjusted for each class X, by modeling the behaviour of the likelihood ratio on external (development) data.

The issue of how to estimate the optimal threshold Θ_X(R) in the case of the likelihood ratio test can be formulated equivalently as finding a normalisation of the likelihood ratio which brings the optimal decision threshold back to its theoretical value. Several such transformations are now well known within the framework of speaker verification, in particular the Z-norm and T-norm methods.
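The likelihood ratio test and the Z-norm idea can be sketched with one-dimensional Gaussian class and non-class models. All model parameters below are invented for the illustration; real systems use GMMs on cepstral features, and the impostor statistics come from a development set.

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_loglik(x, mu, var):
    # Average per-frame log-likelihood under a 1-D Gaussian model.
    return np.mean(-0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var))

# Hypothetical target-speaker (class) and background (non-class) models.
mu_tar, var_tar = 1.0, 1.0
mu_bkg, var_bkg = 0.0, 2.0

def llr(x):
    # Log-likelihood ratio between the class and non-class hypotheses.
    return gauss_loglik(x, mu_tar, var_tar) - gauss_loglik(x, mu_bkg, var_bkg)

# Z-norm: centre and scale the raw score with impostor statistics estimated
# on external development data, so that a single global threshold applies.
impostor_scores = np.array([llr(rng.normal(mu_bkg, np.sqrt(var_bkg), 50))
                            for _ in range(200)])
z_mu, z_sigma = impostor_scores.mean(), impostor_scores.std()

def znorm_score(x):
    return (llr(x) - z_mu) / z_sigma

target_trial = rng.normal(mu_tar, 1.0, 50)
impostor_trial = rng.normal(mu_bkg, np.sqrt(var_bkg), 50)
```

After Z-normalisation, impostor scores are approximately zero-mean with unit variance, which is what makes a class-independent decision threshold meaningful.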
In recent years, increasing interest has focused on Bayesian models for multi-source signals, such as polyphonic music signals. These models are particularly interesting, since they enable the formulation of music information retrieval in a probabilistic modelling framework, together with the exploitation of various priors on the model parameters.
A first issue is that of model design, i.e. the choice of the variables parameterizing the signal, their priors and their conditional dependency structure. The second problem, called the inference problem, consists in estimating the activity states of the model for a given signal in the maximum a posteriori sense. A number of techniques are available to achieve this goal, the challenge being to reach a good compromise between tractability and accuracy.
The large family of audio signals includes a wide variety of temporal and frequential structures, objects of variable durations, ranging from almost stationary regimes (for instance, the note of a violin) to short transients (as in percussion). The spectral structure can be mainly harmonic (vowels) or noise-like (fricative consonants). More generally, the diversity of timbres results in a large variety of fine structures for the signal and its spectrum, as well as for its temporal and frequential envelopes.
In addition, a majority of audio signals are composite, i.e. they result from the mixture of several sources (voice and music, mixing of several tracks, useful signal and background noise). Audio signals may also have undergone various types of distortion: adverse recording conditions, media degradation, coding and transmission errors, etc.
To account for these factors of diversity, our approach is to focus on techniques for decomposing signals on redundant systems (or dictionaries). The elementary atoms in the dictionary correspond to the various structures that are expected to be met in the signal.
Traditional methods for signal decomposition are generally based on the description of the signal in a given basis (i.e. a linearly independent, generating and constant representation system for the whole signal). In such a basis, the representation of the signal is unique (for example, a Fourier basis, a Dirac basis, orthogonal wavelets, etc.). On the contrary, an adaptive representation in a redundant system consists of finding an optimal decomposition of the signal (in the sense of a criterion to be defined) in a generating system (or dictionary) whose number of elements is (much) higher than the dimension of the signal.
Let y be a one-dimensional signal of length T and D a redundant dictionary composed of N > T vectors g_i of dimension T. If D is a generating system of R^T, there is an infinity of exact representations of y in the redundant system D, of the type:

y(t) = Σ_{i=1}^{N} α_i g_i(t)

We will denote as α = (α_1, ..., α_N) the N coefficients of the decomposition.

The principle of adaptive decomposition then consists in selecting, among all possible decompositions, the best one, i.e. the one which satisfies a given criterion (for example a sparsity criterion) for the signal under consideration, hence the concept of adaptive decomposition (or representation). In some cases, at most T coefficients are non-zero in the optimal decomposition, and the subset of vectors of D thus selected is referred to as the basis adapted to y. This approach can be extended to approximate representations of the type:

y(t) = Σ_{m=1}^{M} α_{φ(m)} g_{φ(m)}(t) + e(t)

with M < T, where φ is an injective function from [1, M] into [1, N] and where e(t) corresponds to the M-term approximation error of y(t). In this case, the optimality criterion for the decomposition also integrates the approximation error.
Obtaining a single solution for the equation above requires the introduction of a constraint on the coefficients α_i. This constraint is generally expressed in the following form:

min_α F(α_1, ..., α_N) subject to y(t) = Σ_{i=1}^{N} α_i g_i(t)

Among the most commonly used functions, let us quote the various L_p cost functions:

L_p : F(α) = Σ_{i=1}^{N} |α_i|^p

Let us recall that for 0 < p < 1, the function F is a sum of concave functions of the coefficients α_i. The function L_0 corresponds to the number of non-zero coefficients in the decomposition.
The minimization of the quadratic norm L_2 of the coefficients α_i (which can be solved exactly by a linear equation) tends to spread the coefficients over the whole collection of vectors in the dictionary. On the other hand, the minimization of L_0 yields a maximally parsimonious adaptive representation, as the obtained solution comprises a minimum number of non-zero terms. However, the exact minimization of L_0 is an intractable NP-complete problem.

An intermediate approach consists in minimizing the norm L_1, i.e. the sum of the absolute values of the coefficients of the decomposition. This can be achieved by linear programming techniques, and it can be shown that, under some (strong) assumptions, the solution converges towards the same result as the minimization of L_0. In a majority of concrete cases, this solution has good sparsity properties, without however reaching the level of performance of L_0.
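The spreading effect of the L_2 criterion can be observed on a small synthetic example: a signal built from only 2 atoms of a random dictionary admits a minimum-L_2 exact representation (computed with the pseudo-inverse) whose energy is spread over essentially all atoms. The dictionary and sparse coefficients below are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

T, N = 8, 24                         # signal length T, dictionary size N > T
D = rng.normal(size=(T, N))
D /= np.linalg.norm(D, axis=0)       # unit-norm atoms g_i as columns

# Build a signal that is exactly 2-sparse in D.
alpha_true = np.zeros(N)
alpha_true[[3, 17]] = [1.5, -2.0]
y = D @ alpha_true

# Minimum L2-norm exact representation: alpha = pinv(D) @ y.
alpha_l2 = np.linalg.pinv(D) @ y

n_l0_true = np.count_nonzero(alpha_true)                 # 2 non-zero terms
n_l0_l2 = np.count_nonzero(np.abs(alpha_l2) > 1e-8)      # almost all non-zero
```

Both representations reproduce y exactly, and the L_2 solution indeed has the smaller quadratic norm, but it is far from sparse: this is precisely the behaviour that motivates L_1 and L_0 criteria.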
Other criteria can be taken into account and, as long as the function F is a sum of concave functions of the coefficients α_i, the solution obtained has good sparsity properties. In this respect, the entropy of the decomposition is a particularly interesting function, given its links with information theory.
Finally, let us note that nonlinear approximation theory offers a framework in which links can be established between the sparsity of exact decompositions and the quality of M-term approximate representations. This is still an open problem for arbitrary redundant dictionaries.
Three families of approaches are conventionally used to obtain an (optimal or suboptimal) decomposition of a signal in a redundant system.
The “Best Basis” approach consists in constructing the dictionary D as the union of B distinct bases and then seeking (exhaustively or not) among all these bases the one which yields the optimal decomposition (in the sense of the selected criterion). For dictionaries with a tree structure (wavelet packets, local cosines), the complexity of the algorithm is much lower than the number of bases B, but the result obtained is generally not the optimum that would be obtained if the dictionary D were taken as a whole.
The “Basis Pursuit” approach minimizes the L_1 norm of the decomposition by resorting to linear programming techniques. The approach has a higher complexity, but the solution obtained generally has good sparsity properties, without however reaching the optimal solution which would be obtained by minimizing L_0.
The “Matching Pursuit” approach incrementally optimizes the decomposition of the signal, by searching at each stage for the element of the dictionary which has the best correlation with the signal to be decomposed, and then subtracting from the signal the contribution of this element. This procedure is repeated on the residual thus obtained, until the number of (linearly independent) components equals the dimension of the signal. The coefficients can then be re-evaluated on the basis thus obtained. This greedy algorithm is suboptimal, but it has good properties regarding the decrease of the error and the flexibility of its implementation.
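A minimal sketch of the plain Matching Pursuit iteration on a synthetic 2-sparse signal (random unit-norm dictionary; the final re-evaluation of the coefficients on the selected basis is omitted here):

```python
import numpy as np

def matching_pursuit(y, D, n_iter):
    """Greedy M-term decomposition of y on the columns (atoms) of D.

    At each step, select the unit-norm atom best correlated with the
    current residual and subtract its contribution from the residual.
    """
    residual = y.copy()
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        corr = D.T @ residual            # correlation with every atom
        i = int(np.argmax(np.abs(corr)))
        alpha[i] += corr[i]              # accumulate the coefficient
        residual -= corr[i] * D[:, i]    # subtract the atom's contribution
    return alpha, residual

rng = np.random.default_rng(3)
T, N = 16, 64
D = rng.normal(size=(T, N))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms

y = 2.0 * D[:, 5] - 1.0 * D[:, 40]       # 2-sparse synthetic signal
alpha, residual = matching_pursuit(y, D, n_iter=10)
```

By construction, D @ alpha + residual equals y at every iteration, and the residual energy is non-increasing, which illustrates the error-decrease property mentioned above.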
Intermediate approaches can also be considered, using hybrid algorithms which try to seek a compromise between computational complexity, quality of sparsity and simplicity of implementation.
The choice of the dictionary D naturally has a strong influence on the properties of the adaptive decomposition: if the dictionary contains only a few elements adapted to the structure of the signal, the results may be neither satisfactory nor exploitable.
The choice of the dictionary can rely on a priori considerations. For instance, some redundant systems require less computation than others to evaluate the projections of the signal on the elements of the dictionary. For this reason, Gabor atoms, wavelet packets and local cosines have interesting properties. Moreover, general hints on the signal structure can contribute to the design of the dictionary elements: any knowledge on the distribution and frequential variation of the energy of the signals, or on the position and typical duration of the sound objects, can help guide the choice of the dictionary (harmonic molecules, chirplets, atoms with predetermined positions, etc.).
Conversely, in other contexts, it can be desirable to build the dictionary with data-driven approaches, i.e. from training examples of signals belonging to the same class (for example, the same speaker or the same musical instrument). In this respect, Principal Component Analysis (PCA) offers interesting properties, but other approaches can be considered (in particular the direct optimization of the sparsity of the decomposition, or of properties of the M-term approximation error), depending on the targeted application.
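The PCA route to a data-driven dictionary can be sketched as follows: the principal directions of a set of training signals from one class are taken as atoms. The training set below (noisy mixtures of two fixed waveforms) is entirely synthetic and only meant to illustrate the mechanism.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical training set: 200 length-32 signals from one "class",
# here random mixtures of two fixed underlying waveforms plus noise.
t = np.arange(32)
w1, w2 = np.sin(2 * np.pi * t / 32), np.sin(4 * np.pi * t / 32)
X = (rng.normal(size=(200, 1)) * w1 + rng.normal(size=(200, 1)) * w2
     + 0.05 * rng.normal(size=(200, 32)))

# PCA via SVD of the centred data: principal directions become atoms.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
atoms = Vt[:2]                    # retain the first 2 principal components

# Fraction of the training variance captured by the retained atoms.
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
```

Since the class is driven by two underlying waveforms, two PCA atoms capture nearly all the variance; for real audio classes the number of retained components becomes a design parameter.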
In some cases, the training of the dictionary can require stochastic optimization, but one can also consider EM-like approaches when the redundant representation can be formulated within a probabilistic framework.
An extension of adaptive representation techniques can also be envisaged by generalizing the approach to probabilistic dictionaries, i.e. dictionaries comprising vectors which are random variables rather than deterministic signals. Within this framework, the signal y(t) is modeled as a linear combination of observations emitted by each element of the dictionary, which makes it possible to gather in the same model several variants of the same sound (for example various waveforms for a noise, if they are equivalent for the ear). Progress in this direction is conditioned on the definition of a realistic generative model for the elements of the dictionary and on the development of effective techniques for estimating the model parameters.
METISS is especially interested in source and signal separation in the underdetermined case, i.e. in the presence of a number of sources strictly higher than the number of sensors.
In the particular case of two sources and one sensor, the mixed (one-dimensional) signal writes:

y = s_1 + s_2 + n

where s_1 and s_2 denote the sources and n an additive noise.

Under a probabilistic framework, we can denote by θ_1, θ_2 and θ_n the model parameters of the sources and of the noise. The problem of source separation then becomes:

(ŝ_1, ŝ_2) = argmax_{s_1, s_2} P(s_1, s_2 | y)

By applying the Bayes rule and by assuming statistical independence between the two sources, the desired result can be obtained by solving:

(ŝ_1, ŝ_2) = argmax_{s_1, s_2} P(y | s_1, s_2) P(s_1 | θ_1) P(s_2 | θ_2)

The first of the three terms in the argmax can be obtained via the noise model:

P(y | s_1, s_2) = P(y − (s_1 + s_2) | θ_n) = P(n | θ_n)
The two other terms are obtained via likelihood functions corresponding to source models trained from examples, or designed from knowledge sources. For example, commonly used models are the Laplacian model, the Gaussian Mixture Model or the Hidden Markov Model.
These models can be linked to the distribution of the representation coefficients in a redundant system in which several bases, each adapted to one of the sources present in the mixture, are pooled together.
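As an illustration of this Bayesian formulation, when both sources and the noise are modeled by zero-mean Gaussians with known per-sample variances (a deliberately simplistic assumption, not one of the models used in practice), the MAP source estimates reduce to Wiener-style weights applied to the mixture:

```python
import numpy as np

rng = np.random.default_rng(5)

# Per-sample Gaussian source and noise models (illustrative variances).
var1, var2, varn = 4.0, 1.0, 0.01

s1 = rng.normal(0, np.sqrt(var1), 1000)
s2 = rng.normal(0, np.sqrt(var2), 1000)
y = s1 + s2 + rng.normal(0, np.sqrt(varn), 1000)   # single-sensor mixture

# MAP estimates under independent Gaussian priors: Wiener-style weights.
total = var1 + var2 + varn
s1_hat = (var1 / total) * y
s2_hat = (var2 / total) * y

err_s1 = np.mean((s1_hat - s1) ** 2)
baseline = np.mean((y - s1) ** 2)      # taking the mixture itself as estimate
```

With richer priors (Laplacian, GMM or HMM source models), the argmax no longer has a closed form, which is where the inference techniques discussed above come into play.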
This section reviews a number of application domains in which the METISS project-team has been particularly active: speaker characterisation, audio description and indexing (including speech recognition) and advanced audio processing (in particular, source separation).
A number of audio signals contain speech, which conveys important information concerning the document origin, content and semantics. The field of speaker characterisation and verification covers a variety of tasks that consist in using a speech signal to determine some information concerning the identity of the speaker who uttered it. Indeed, even though the voice characteristics of a person are not unique, many factors (morphological, physiological, psychological, sociological, ...) have an influence on a person's voice. One focus of the METISS group in this domain is speaker verification, i.e. the task of accepting or rejecting an identity claim made by the user of a service with access control. We also dedicate some effort to the more general problem of speaker characterisation. In parallel, METISS maintains some know-how and develops new research in the area of acoustic modeling of speech signals and automatic speech transcription, mainly in the framework of the semantic analysis of audio and multimedia documents.
Speaker recognition and verification have made significant progress with the systematic use of probabilistic models, in particular Hidden Markov Models (for text-dependent applications) and Gaussian Mixture Models (for text-independent applications). As presented in the fundamentals of this report, the current state-of-the-art approaches rely on Bayesian decision theory.
However, robustness issues are still pending : when speaker characteristics are learned on small quantities of data, the trained model has very poor performance, because it lacks generalisation capabilities. This problem can partly be overcome by adaptation techniques (following the MAP viewpoint), using either a speakerindependent model as general knowledge, or some structural information, for instance a dependency model between local distributions.
A key issue, in many practical applications, is the uncontrollable deviation of speaker models from the exact probability density functions. This requires a normalisation step before comparing the verification score to a decision threshold. This issue has been a particular focus of our recent efforts in the domain of speaker verification and has led to the design and evaluation of various strategies for model and test normalisation.
METISS also addresses a number of other topics related to speaker characterisation, in particular speaker selection (i.e. how to select a representative subset of speakers from a larger population), speaker representation (namely how to represent a new speaker with reference to a given speaker population), speaker adaptation for speech recognition and, more recently, speaker emotion detection.
In order to address needs related to the implementation of speaker verification technology on personal devices, specific algorithmic approaches have to be developed to ensure scalability, reduce complexity and distribute processing. In this context, speaker modelling approaches and classification procedures need to be designed, simulated and tested.
Speech modeling and recognition is complementary to other speech-related activities in the group, in particular speaker recognition and audio description. In the first case, detecting speech segments in a continuous audio stream and segmenting the speech portions into pseudo-sentences is a preliminary step to automatic transcription. Detecting speaker changes and grouping together segments from the same speaker is also a crucial step for segmentation as well as for speaker adaptation, and can rely on acoustic as well as lexical and linguistic features. Last, in speaker recognition for secured transactions over the telephone, recognizing the linguistic content of the message might be useful, for example to hypothesize an identity, to recognize a spoken password or to extract linguistic parameters that can benefit the speaker models.
Automatic tools to locate events in audio documents, structure them and browse through them as in textual documents are key issues in order to fully exploit most of the available audio documents (radio and television programmes and broadcasts, conference recordings, etc). In this respect, defining and extracting meaningful characteristics from an audio stream aim at obtaining a structured representation of the document, thus facilitating contentbased access or search by similarity. Activities in METISS focus on sound class and event characterisation and tracking in audio documents for a wide variety of features and documents.
Speaker characteristics, such as gender, approximate age, accent or identity, are key indices for the indexing of spoken documents. So is information concerning the presence or absence of a given speaker in a document, speaker changes, the presence of speech from multiple speakers, etc.
More precisely, the above mentioned tasks can be divided into three main categories: detecting the presence of a speaker in a document (classification problem); tracking the portions of a document corresponding to a speaker (temporal segmentation problem); segmenting a document into speaker turns (change detection problem).
These three problems are clearly closely related to the field of speaker characterisation, sharing many theoretical and practical aspects with the latter. In particular, all these application areas rely on the use of statistical tests, whether using the model of a speaker known to the system (speaker presence detection, speaker tracking) or using a model estimated on the fly (speaker segmentation). However, the specificities of the speaker detection task require the implementation of adequate solutions to adapt to the situations and factors inherent to this task.
Locating various sounds or broad classes of sounds, such as silence, music or specific events like ball hits or a jingle, in an audio document is a key issue as far as automatic annotation of sound tracks is concerned. Indeed, specific audio events are crucial landmarks in a broadcast. Thus, locating such events automatically makes it possible to answer a query by focusing on the portion of interest in the document, or to structure a document for further processing. Typical sound tracks come from radio or TV broadcasts, or even movies.
In the continuity of research carried out at IRISA for many years (especially by Benveniste, Basseville, André-Obrecht, Delyon, Seck, ...), the statistical test approach can be applied to abrupt change detection and sound class tracking, provided that a statistical model for each class to be detected or tracked has previously been estimated. For example, detecting speech segments in the signal can be carried out by comparing the segment likelihoods under a speech and a “non-speech” statistical model respectively. The statistical models commonly used typically represent the distribution of the power spectral density, possibly including some temporal constraints if the audio events to look for show a specific time structure, as is the case with jingles or words. As an alternative to statistical tests, hidden Markov models can be used to simultaneously segment and classify an audio stream. In this case, each state (or group of states) of the automaton represents one of the audio events to be detected. As for the statistical test approach, the hidden Markov model approach requires that models, typically Gaussian mixture models, be estimated for each type of event to be tracked.
In the area of automatic detection and tracking of audio events, there are three main bottlenecks. The first one is the detection of simultaneous events, typically speech with music in a speech/music/noise segmentation problem since it is nearly impossible to estimate a model for each event combination. The second one is the not so uncommon problem of detecting very short events for which only a small amount of training data is available. In this case, the traditional 100 Hz frame analysis of the waveform and Gaussian mixture modeling suffer serious limitations. Finally, typical approaches require a preliminary step of manual annotation of a training corpus in order to estimate some model parameters. There is therefore a need for efficient machine learning and statistical parameter estimation techniques to avoid this tedious and costly annotation step.
Applied to the sound track of a video, detecting and tracking audio events, as mentioned in the previous section, can provide useful information about the video structure. Such information is by definition only partial and can seldom be exploited by itself for multimedia document structuring or abstracting. To achieve these goals, partial information from the various media must be combined. By nature, pieces of information extracted from different media or modalities are heterogeneous (text, topic, symbolic audio events, shot change, dominant color, etc.) thus making their integration difficult. Only recently approaches to combine audio and visual information in a generic framework for video structuring have appeared, most of them using very basic audio information.
Combining multimedia information can be performed at various levels of abstraction. Currently, most approaches in video structuring rely on the combination of structuring events detected independently in each medium. A popular way to combine information is the hierarchical approach, which consists in using the results of event detection in one medium to provide cues for event detection in another. Application-specific heuristics for decision fusion are also widely employed. Bayes detection theory provides a powerful theoretical framework for a more integrated processing of heterogeneous information, in particular because this framework is already extensively exploited to detect structuring events in each medium. Hidden Markov models with multiple observation streams have been used in various studies on video analysis over the last three years.
The main research topics in this field are the definition of the structuring events that should be detected on the one hand, and the definition of statistical models to combine or to jointly model low-level heterogeneous information on the other hand. In particular, defining statistical models on low-level features is a promising idea as it avoids defining and detecting structuring elements independently for each medium and enables an early integration of all the possible sources of information in the structuring process.
Music pieces constitute a large part of the vast family of audio data for which the design of description and search techniques remains a challenge. But while there exist some well-established formats for synthetic music (such as MIDI), there is still no efficient approach that provides a compact, searchable representation of music recordings.
In this context, the METISS research group dedicates some investigative effort to high-level modeling of music content along several tracks. The first one is the acoustic modeling of music recordings by deformable probabilistic sound objects, so as to represent variants of a same note as several realisations of a common underlying process. The second track is music language modeling, i.e. the symbolic modeling of combinations and sequences of notes by statistical models such as n-grams.
New search and retrieval technologies focused on music recordings are of great interest to amateur and professional applications in different kinds of audio data repositories, like online music stores or personal music collections.
The METISS research group is devoting increasing effort to the fine modeling of multi-instrument / multi-track music recordings. In this context we are developing new methods of automatic metadata generation from music recordings, based on Bayesian modeling of the signal for multi-level representations of its content. We also investigate uncertainty representation and multiple alternative hypotheses inference.
Speech signals are commonly found surrounded by, or superimposed on, other types of audio signals in many application areas: they are often mixed with musical signals or background noise. Moreover, audio signals frequently exhibit a composite nature, in the sense that they were originally obtained by combining several audio tracks with an audio mixing device. Audio signals are also prone to suffer from all kinds of degradations –ranging from non-ideal recording conditions to transmission errors– after having travelled through a complete signal processing chain.
Recent breakthrough developments in the field of voice technology (speech and speaker recognition) are a strong motivation for studying how to adapt and apply this technology to a broader class of signals such as musical signals.
The main themes discussed here are therefore those of source separation and audio signal representation.
The general problem of “source separation” consists in recovering a set of unknown sources from the observation of one or several of their mixtures, which may correspond to as many microphones. In the special case of speaker separation, the problem is to recover two speech signals contributed by two separate speakers that are recorded on the same medium. This issue can be extended to channel separation, which deals with the problem of isolating various simultaneous components in an audio recording (speech, music, singing voice, individual instruments, etc.). In the case of noise removal, one tries to isolate the “meaningful” signal, holding relevant information, from parasite noise. It can even be appropriate to view audio compression as a special case of source separation, one source being the compressed signal, the other being the residue of the compression process. These examples illustrate how the general source separation problem spans many different problems and implies many foreseeable applications.
While in some cases –such as multichannel audio recording and processing– the source separation problem arises with a number of mixtures which is at least equal to the number of unknown sources, the research on audio source separation within the METISS project-team rather focusses on the so-called under-determined case. More precisely, we consider the cases of one sensor (mono recording) for two or more sources, or two sensors (stereo recording) for n > 2 sources.
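A toy numpy example illustrates why the under-determined case is fundamentally harder: even when the mixing matrix is perfectly known, it cannot be inverted, so the sources are not recoverable by plain linear algebra (the matrix values and source statistics below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three unknown sources observed through a stereo mixture: x = A @ s.
n_sources, n_samples = 3, 1000
sources = rng.laplace(size=(n_sources, n_samples))   # sparse-ish toy sources
A = np.array([[1.0, 0.6, 0.2],
              [0.2, 0.6, 1.0]])                      # known 2 x 3 mixing matrix
mixture = A @ sources                                # (2, n_samples)

# A has more columns than rows: the pseudo-inverse only yields the
# minimum-norm solution consistent with the mixture, not the true sources.
s_pinv = np.linalg.pinv(A) @ mixture
```

The pseudo-inverse solution reproduces the mixture exactly, yet differs from the true sources, which is precisely why additional assumptions (such as sparsity) are needed in the under-determined setting.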
The standards within the MPEG family, notably MPEG-4, introduce several sound description and transmission formats, with the notion of a “score”, i.e. a high-level MIDI-like description, and an “orchestra”, i.e. a set of “instruments” describing sonic textures. These formats promise to deliver very low bitrate coding, together with indexing and navigation facilities. However, it remains a challenge to design methods for transforming an arbitrary existing audio recording into a representation in such formats.
Atomic decomposition methods are attracting rising interest in the field of sound representation, compression and synthesis. They attempt to represent audio signals as linear sums of elementary signals (or “atoms”) from a “dictionary”. In the classical model, “sonic grains” are deterministic functions (modulated sinusoids, chirps, harmonic molecules, or even arbitrary waveforms stored in a wavetable, etc.). The reconstructed signal y(t) is then the M-term adaptive approximation of the original signal from the dictionary D. Non-linear approximation theory and decomposition methods such as Matching Pursuit and its derivatives respectively provide a mathematical framework and powerful tools to tackle this kind of problem.
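The greedy principle of Matching Pursuit can be sketched in a few lines. This is a simplified textbook illustration, not the MPTK implementation; the dictionary is assumed to be a matrix whose unit-norm columns are the atoms:

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy M-term approximation of `signal` over unit-norm dictionary columns."""
    residual = signal.astype(float).copy()
    weights = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        correlations = dictionary.T @ residual        # inner products with all atoms
        k = np.argmax(np.abs(correlations))           # best-matching atom
        weights[k] += correlations[k]
        residual -= correlations[k] * dictionary[:, k]
    return weights, residual
```

Each iteration subtracts the contribution of the best-correlated atom, so the energy of the residual decreases monotonically; after M iterations, `weights` encodes the M-term approximation y(t).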
Audio object coding is an extension of the notion of parametric coding, where the signal is decomposed into meaningful sound objects such as notes, chords and instruments, described using high-level attributes.
As well as offering the potential for very low bitrate compression, this coding paradigm leads to many other potential applications, including browsing by content, source separation and interactive signal manipulation.
The SPro toolkit provides standard front-end analysis algorithms for speech signal processing. It is systematically used in the METISS group for activities in speech and speaker recognition as well as in audio indexing. The toolkit is developed for Unix environments and is distributed as free software under the GPL license. It is used by several other French laboratories working in the field of speech processing.
In the framework of our activities on audio indexing and speaker recognition, AudioSeg, a toolkit for the segmentation of audio streams, has been developed and is distributed for Unix platforms under the GPL agreement. This toolkit provides generic tools for the segmentation and indexing of audio streams, such as audio activity detection, abrupt change detection, segment clustering, Gaussian mixture modeling, and joint segmentation and detection using hidden Markov models. The toolkit relies on the SPro software for feature extraction.
Contact : guillaume.gravier@irisa.fr
URL: http://
In collaboration with the computer science department at ENST, METISS actively participates in the development of the freely available Sirocco large vocabulary speech recognition software. The Sirocco project started as an INRIA Concerted Research Action and now works on the basis of voluntary contributions.
We use the Sirocco speech recognition software as the heart of the transcription modules within our spoken document analysis platform IRENE. In particular, it has been extensively used in our research on ASR and NLP as well as in our work on phonetic landmarks in statistical speech recognition.
Contact : guillaume.gravier@irisa.fr
The Matching Pursuit ToolKit (MPTK) is a fast and flexible implementation of the Matching Pursuit algorithm for sparse decomposition of monophonic as well as multichannel (audio) signals. MPTK is written in C++ and runs on Windows, MacOS and Unix platforms. It is distributed under a free software license model (GNU General Public License) and comprises a library, some standalone command line utilities and scripts to plot the results under Matlab.
MPTK has been entirely developed within the METISS group, mainly to overcome the limitations of existing Matching Pursuit implementations in terms of maintainability, memory footprint and computation speed. One of the aims is to be able to process large audio files in reasonable time, so as to explore the new possibilities which Matching Pursuit can offer in speech signal processing. With the new implementation, it is now possible to process a one-hour audio signal in as little as twenty minutes.
Thanks to an INRIA software development operation (Opération de Développement Logiciel, ODL) started in September 2006, METISS efforts this year have been targeted at easing the distribution of MPTK by improving its portability to different platforms and simplifying its developers' API. Besides pure software engineering improvements, this implied setting up a new website with an FAQ, developing new interfaces between MPTK and Matlab and Python, writing a portable Graphical User Interface to complement the command line utilities, strengthening the robustness of the input/output using XML where possible, and most importantly setting up a whole new plugin API to decouple the core of the library from possible third party contributions.
Collaboration : Laboratoire d'Acoustique Musicale (University of Paris VII, Jussieu).
Contact : remi.gribonval@irisa.fr
URL: http://
BSS_ORACLE is a MATLAB toolbox to compute the best performance achievable by a class of source separation algorithms in an evaluation framework where the true sources are known. Version 2.1 has been released this year. The toolbox provides oracle estimators for four classes of algorithms (time-invariant multichannel filtering, single-channel time-frequency masking, multichannel time-frequency masking and best basis masking), each with several variants (time-domain vs. frequency-domain, MDCT vs. STFT, etc.).
Contact : emmanuel.vincent@irisa.fr
Acoustic model adaptation techniques have become in recent years an important element of speech recognition systems, used to tune the system to the user's voice. Moreover, in some application contexts, speaker adaptation must take place online and rapidly.
We have designed a novel algorithm for fast speaker adaptation using small amounts of adaptation data. The approach is based on a set of representative speakers which can provide a priori knowledge to guide the estimation of a new speaker's model in the speaker space.
The proposed scenario is based on an a posteriori selection of reference models, as opposed to conventional techniques (such as eigenvoices) which use a fixed set of reference speakers. It calls for a user-dependent linear interpolation of the parameters of the reference speaker models.
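The interpolation idea can be sketched as follows. This is a deliberately simplified numpy illustration in which each reference speaker is summarized by a single mean vector and the interpolation weights are fitted by least squares on the adaptation data; the actual method operates on full model parameters:

```python
import numpy as np

def adapt_speaker_model(reference_means, adaptation_frames):
    """Estimate a new speaker model as a linear interpolation of reference
    speaker models, with weights fitted on a small amount of adaptation data.
    reference_means: (R, D), one mean vector per reference speaker.
    adaptation_frames: (T, D), frames from the new speaker."""
    target = adaptation_frames.mean(axis=0)                       # (D,)
    weights, *_ = np.linalg.lstsq(reference_means.T, target, rcond=None)
    weights = np.clip(weights, 0.0, None)
    weights /= weights.sum()                                      # convex combination
    return weights, weights @ reference_means                     # adapted mean (D,)
```

Because the weights depend on the new speaker's data, the set of references that actually contribute is selected a posteriori, as in the scenario described above.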
Comparisons of the proposed approach on the IDIOLOGOS and PAIDIALOGOS corpora yielded slightly better performance than eigenvoices on a phoneme recognition task, especially for atypical speakers such as children .
This work is taking place in the context of an industrial PhD just starting with Orange FTR&D Labs.
Increasing interest is noticeable in the field of speaker characterisation for approaches able to describe and classify voice expressions such as emotion, cognitive state and, more generally, any type of information conveyed by the voice of a speaker and indicative of his/her state of mind.
Joint work within the METISS group is just starting to investigate descriptors and models for representing this type of speaker characteristics at several linguistic and paralinguistic levels, together with training algorithms and decision strategies which enable the fusion of multiple sources of information.
This work has been done in the context of the ITEA PELOPS project, in close cooperation with Thomson Multimedia.
Extracting relevant information from sports programmes (such as soccer matches) is a challenge which is closely linked to applicative considerations, such as automatic content summarization and fast post-production and repurposing. In this context, the activities of the METISS group in the PELOPS project were focused on two tasks:
Generation of acoustic and semantic descriptors from audio soundtracks.
Audiovisual information fusion and integration, for the classification of highlights in a sport event (collaboration with Thomson Multimedia who provides video descriptors).
A set of low-level audio descriptors has been set up, using statistical and pattern recognition techniques. BSS techniques are used as a preprocessing phase to separate the commentator track from the crowd and field ambience. This preprocessing step improved the robustness of several audio descriptors, such as commentator pitch tracking, commentator speech rate and cheering level.
The fusion and integration of audio and visual information addresses the problem of combining heterogeneous descriptors with asynchronous streams. The events are modelled by means of contextual relations between time intervals, using different statistics on the descriptors (max, min, standard deviation). Support Vector Machine classifiers have been used to train the models and to score the test matches, as described in .
The resulting event classification is synchronized with the video shot segmentation and each shot is assigned a score for the considered events (goals, cards, goal attempts, other). The classification is evaluated using a precision-recall curve. In our experiments on a corpus of 12 soccer matches, 100 % of the goals were found among the shots with the highest estimated goal probability.
This work has been done in close collaboration with the Texmex project-team at IRISA and has led to a growing collaboration with the NLP group at the Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE, Puebla, Mexico).
Automatic speech recognition (ASR) systems aim at generating a textual transcription of a spoken document, usually for further analysis of the transcription with natural language processing (NLP) techniques. However, most current ASR systems solely rely on statistical methods and seldom use linguistic knowledge. In collaboration with the NLP group in the Texmex project-team of IRISA, we investigated several directions toward a better use of linguistic knowledge such as morphology, syntax, semantics and pragmatics in ASR.
The work described below was implemented in our Sirocco software and incorporated into our spoken document analysis platform Irene. The proposed approaches were benchmarked on the ESTER French broadcast news corpus, which constitutes a reference in ASR for the French language.
In 2006, we had demonstrated the interest of a score combining acoustic, language and morpho-syntactic information to rescore N-best sentence hypothesis lists. This year, we consolidated these results with various configurations of our ASR system and studied the impact of morpho-syntactic information on confidence measure computation. In particular, we demonstrated that confidence measures can be improved based on our combined score function .
Spoken document segmentation is a crucial step in the analysis of multimedia documents which requires the combination of linguistic and acoustic cues. To this end, we extended a statistical method based on lexical cohesion for topic shift detection to take into account additional knowledge such as semantic relations between words, syntactic coherence and acoustic cues. Our technique enables us to improve segmentation, although a few parts (particularly those corresponding to the news headlines) have still to be refined.
We proposed a method to adapt the language model of an ASR system for each segment resulting from the segmentation step described above. The method is completely unsupervised and uses neither a priori knowledge about topics nor a static collection of texts. The idea is to gather textual adaptation data for each segment, based on information retrieval (IR) methods: keywords are extracted from the transcript and used to retrieve documents from the Web. IR techniques, used both for keyword extraction and for document selection, have been adapted to tackle the specificities of automatic transcriptions (e.g. misrecognized words, named entities). Results indicate a large improvement of the language model, which finally yields a small improvement of the word error rate .
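The keyword extraction step can be sketched with a standard tf-idf ranking against a background collection. This is a generic illustration of the IR principle, not the adapted technique used in the experiments; the function name and the smoothing are our own:

```python
import math
from collections import Counter

def extract_keywords(segment_words, background_docs, n_keywords=3):
    """Rank the words of a transcript segment by tf-idf against a background
    collection; the top-ranked words can serve as a Web query for adaptation texts."""
    tf = Counter(segment_words)
    n_docs = len(background_docs)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in background_docs if word in doc)
        idf = math.log((1 + n_docs) / (1 + df))       # smoothed inverse document frequency
        scores[word] = (count / len(segment_words)) * idf
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n_keywords]]
```

Words frequent in the segment but rare in the background collection (topic-specific terms) are ranked first, which is the behaviour wanted for query construction.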
This preliminary work has demonstrated the potential of our approach to efficiently transcribe speech streams and suggests further work on language model and vocabulary adaptation based on IR methods to gather adaptation data from the Internet. The thesis of Gwénolé Lecorvé, which started in September 2007 in collaboration with the Texmex project-team, will be dedicated to language model and vocabulary adaptation for the robust transcription of multimedia streams.
HMM-based automatic speech recognition can hardly accommodate prior knowledge about the signal, apart from the definition of the topology of the phone-based elementary HMMs. In previous years, we have shown that such knowledge can be efficiently used during decoding with the Viterbi algorithm, as constraints on the best path.
Preliminary experiments have shown that accurately detecting broad phonetic landmarks, such as vowels or stops, can greatly benefit ASR. Hence, we focused this year on the actual detection of such landmarks. Experiments on HMM-based landmark detection demonstrated that, while HMMs can be used to provide a segmentation into broad phonetic events, the classification rate is not high enough to benefit speech recognition. This is partly due to the fact that the same paradigm (features and models) is used for both landmark detection and speech recognition. We therefore focused on the use of support vector machines to classify feature vectors into broad phonetic classes, achieving classification rates around 95 % for vowels, fricatives and nasals .
In the future, we plan to improve SVM-based landmark detection using different features and to demonstrate the actual feasibility of broad phonetic landmark-driven speech recognition.
Discovering repeating motifs, such as advertisements, jingles or even words, in audio streams or databases is a crucial task for the unsupervised structuring of audio data collections and a necessary step toward the lightly supervised design of audio event recognition systems. Research in this field is oriented along two main axes, namely the efficient search for a motif (query) and the efficient representation of a motif to deal with variability.
In 2007, our activity in the field of audio motif discovery mainly focused on the study of sequence models for fast retrieval of audio sequences, in collaboration with the Texmex project-team at IRISA. Extending existing multidimensional indexing techniques is not possible, as these were designed for description schemes which lack the concept of sequence. A solution is to summarize the sequence in a model before indexing, and to compare models rather than sequences. To this end, we investigated the use of support vector machines as a prediction model and compared the SVM-based comparison of sequences with the more traditional feature-based dynamic time warping alignment method. Overall, we have shown that relying on models (instead of relying on descriptors) provides better robustness to severe modifications of sequences, such as temporal distortions , .
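For reference, the feature-based dynamic time warping baseline mentioned above can be sketched as follows (textbook DTW recursion on 1-D features; not the exact configuration used in the experiments):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences:
    minimal cumulative frame-to-frame distance over all monotonic alignments."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
```

DTW absorbs moderate temporal distortions by allowing frames to be repeated or skipped along the alignment path, but its cost grows with sequence length, which is one motivation for the model-based comparison studied here.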
These encouraging results motivate further investigation on SVMbased models of audio sequences. In parallel, the thesis of Armando Muscariello, which started in October 2007, will focus on the practical application of sequence models for motif discovery in audio streams, aiming at the discovery of variable motifs.
The work described in this section is carried out in the framework of the Ph.D. thesis of Siwar Baghdadi, in collaboration with the Texmex project-team of IRISA and Thomson Multimedia Research.
Bayesian networks provide an interesting framework for the joint modeling of multimodal information. Moreover, unlike HMMs and segment models, it is possible to learn the structure of a Bayesian network, i.e. the relations between the variables describing the problem, from data.
We investigated the use of dynamic Bayesian networks and the potential of structure learning algorithms such as K2 for multimodal integration in a commercial detection application. A video stream is considered as a succession of shots, where a shot is represented by a set of visual and audio features, which can be labeled either as commercial or not. We have shown that structure learning algorithms can efficiently learn the relations between the variables describing a shot. We investigated different approaches to model temporal relations between shots, in particular using an explicit duration model as in segment models.
Future work involves the extension of this approach to event detection in soccer games with a focus on structure learning, either static or temporal, in order to provide a framework for the lightly supervised development of new applications.
Music signals can be described by a score consisting of several notes defined by their onset time, duration, pitch and instrument class. The task of estimating the notes underlying a given signal is termed polyphonic music transcription. It involves two subtasks, namely pitch transcription and instrument identification. This task can also form the core of an "object-based" coder, encoding the signal in terms of resynthesis parameters for each note and allowing high-level manipulation of the signal.
We also investigated alternative methods addressing this task in the framework of sparse representations. The first method represents the signal in each time frame as a linear combination of harmonic atoms learnt on isolated notes from various instruments. The relevant atoms are selected by Matching Pursuit and additional structural constraints are used to extract sequences of atoms modeling individual notes. The second method represents the short-term magnitude spectrum as a linear combination of magnitude spectra corresponding to different pitches. These spectra are adapted from the signal alone by minimizing the loudness of the residual under harmonicity constraints. This method provided pitch transcription accuracy similar to that of state-of-the-art methods, while allowing better generalization to unknown instruments.
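The core of the second method, representing a magnitude spectrum as a nonnegative combination of per-pitch template spectra, can be sketched with standard multiplicative updates. This is an illustration only: the actual method additionally adapts the templates from the signal under harmonicity constraints and minimizes loudness rather than a plain Euclidean cost:

```python
import numpy as np

def pitch_activations(spectrum, templates, n_iter=200):
    """Nonnegative weights h such that templates @ h approximates a short-term
    magnitude spectrum (multiplicative updates with fixed pitch templates)."""
    W = templates                                  # (F, P): one column per pitch
    h = np.ones(W.shape[1])
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost; keeps h nonnegative.
        h *= (W.T @ spectrum) / (W.T @ (W @ h) + 1e-12)
    return h
```

Pitches absent from the frame receive activations that decay toward zero, so thresholding `h` yields a frame-level pitch transcription.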
Finally, we investigated the use of such notebased representations for bandwidth extension and "resolutionfree" audio coding.
This work was conducted in collaboration with Mark D. Plumbley and Steve Welburn (Queen Mary, University of London), Pierre Leveau and Laurent Daudet (Université Paris 6) and Nancy Bertin and Roland Badeau (GET  Télécom Paris). Previous results have been published as journal articles , . New results have been submitted to a journal and published in the proceedings of a conference and an evaluation campaign .
Speech recognition is very advantageously guided by statistical language models: we hypothesise that music description, recognition and transcription can strongly benefit from music models that express dependencies between notes within a music piece, due to melodic patterns and harmonic rules.
To this end, we have investigated the approximate modeling of syntactic and paradigmatic properties of music, through the use of n-gram models of notes, successions of notes and combinations of notes.
In practice, we consider a corpus of MIDI files on which we learn co-occurrences of concurrent and consecutive notes, and we use these statistics to cluster music pieces into classes of models and to measure the predictability of notes within a class of models.
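The counting of consecutive-note statistics can be sketched as plain n-gram estimation over a symbolic note sequence (a generic illustration of the n-gram principle, not the exact estimator used on the MIDI corpus):

```python
from collections import Counter

def note_ngrams(note_sequence, n=2):
    """Count n-grams of consecutive notes in a symbolic (MIDI-like) sequence."""
    return Counter(tuple(note_sequence[i:i + n])
                   for i in range(len(note_sequence) - n + 1))

def bigram_probability(bigrams, unigrams, prev_note, note):
    """Maximum-likelihood estimate of P(note | prev_note) from the counts."""
    return bigrams[(prev_note, note)] / unigrams[(prev_note,)]
```

Conditional probabilities of this kind directly quantify the predictability of a note given its context, which is the quantity used to compare classes of models.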
The model is intended to be used in complement to source separation and acoustic decoding, to form a consistent framework embedding signal processing techniques, acoustic knowledge sources and music rules modeling. A publication is in preparation.
The source separation problem consists in retrieving unknown signals (the sources) from the sole knowledge of one or more mixtures of these signals (the channels coming from each sensor). In the case we study, each channel is a linear combination of the sources, there are more sources than channels, and there are at least two channels. Due to the under-determinacy of the problem, knowing all the parameters of the mixing process is not sufficient to retrieve the sources. Focussing on the estimation of the sources –assuming the mixing process is known– we have studied methods to perform the separation based on sparse decomposition of the mixture with Matching Pursuit. Methods for the estimation of the mixing parameters are developed separately (see next section).
Last year we concentrated on methods based on the difference in spatial direction between sources, assuming the source signals can be sparsely decomposed on a joint dictionary. This year, we explored the possibility of simultaneously exploiting spatial differences and “morphological” differences, by choosing a distinct dictionary to sparsely model each source signal, in the spirit of . For sources which can be modeled sparsely in sufficiently distinct domains (e.g., drums and electric guitar), our experiments showed that this approach can drastically improve separation performance. While learning appropriate dictionaries for each source based on training data is straightforward, the problem of training adapted dictionaries based on the sole knowledge of the mixture remains a challenge.
This work has been presented in a workshop.
An important step for audio source separation consists in finding both the number of mixed sources and their directions in a multisensor mixture.
In complement to the separation methods based on Matching Pursuit, which we developed and evaluated assuming the mixing matrix is known, we proposed last year a robust technique to address this problem in the case of linear instantaneous mixtures, even with more sources than sensors. This year, we extended the approach to the more realistic setting of linear anechoic mixtures (where the mixture involves not only intensity differences but also time delays between channels).
The method relies on the assumption that, in the neighborhood of some time-frequency points, only one source contributes to the mixture. Such time-frequency points, located with a local confidence measure, provide estimates of the attenuation, as well as of the phase difference at some frequency, of the corresponding source. Combining the phase differences at different frequencies, the time delay parameters are estimated, by a method similar to GCC-PHAT, on points having similar intensity differences. As a result, unlike DUET-type methods, our method makes it possible to estimate time delays larger than a single sample.
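The single-source-point idea can be illustrated in the simpler instantaneous (no delay) stereo case: at high-energy time-frequency points, the intensity ratio between channels clusters around the source directions. The sketch below is our own simplification; DEMIX additionally uses a local confidence measure and, in the anechoic case, exploits phase differences to estimate time delays:

```python
import numpy as np

def direction_angles(left, right, frame=256):
    """At high-energy time-frequency points of a stereo instantaneous mixture
    (where a single source is assumed to dominate), the left/right intensity
    ratio reveals that source's direction as an angle in [0, pi/2)."""
    n = (len(left) // frame) * frame
    L = np.fft.rfft(left[:n].reshape(-1, frame), axis=1)   # crude non-overlapping STFT
    R = np.fft.rfft(right[:n].reshape(-1, frame), axis=1)
    energy = np.abs(L) ** 2 + np.abs(R) ** 2
    mask = energy > np.percentile(energy, 90)              # keep confident points only
    return np.arctan(np.abs(L[mask]) / (np.abs(R[mask]) + 1e-12))
```

Clustering or histogramming the returned angles yields both the number of directions and their values, which is the quantity the method estimates.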
Experiments show that, in more than 65 % of the cases, DEMIX Anechoic correctly estimates the number of directions for up to 6 sources. Moreover, it outperforms DUET in estimation accuracy by a factor of ten.
This work is currently submitted for publication.
Probabilistic approaches can offer satisfactory solutions to source separation with a single channel, provided that the source models accurately match the statistical properties of the mixed signals. However, it is not always possible in practice to construct and use such models.
To overcome this problem, we propose to resort to an adaptation scheme for adjusting the source models with respect to the actual properties of the signals observed in the mix. We develop a general formalism for source model adaptation. In a similar way as is done, for instance, in speaker (or channel) adaptation for speech recognition, we introduce this formalism in terms of a Bayesian Maximum A Posteriori (MAP) adaptation criterion. We then show how to optimize this criterion using the Expectation-Maximization (EM) algorithm at different levels of generality.
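One EM step of MAP adaptation of GMM means can be sketched as follows, in the spirit of what is done in speaker adaptation. This is a generic textbook sketch with 1-D features and a relevance factor tau, not the source-model formalism itself, which also handles other parameters and priors:

```python
import numpy as np

def map_adapt_means(frames, means, variances, weights, tau=10.0):
    """One EM step of MAP adaptation of GMM means (relevance factor tau):
    the prior means are pulled toward the data in proportion to the soft counts."""
    diff = frames[:, None] - means[None, :]                       # (T, K)
    log_post = (-0.5 * diff**2 / variances
                - 0.5 * np.log(2 * np.pi * variances) + np.log(weights))
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    gamma = np.exp(log_post)                                      # responsibilities (T, K)
    counts = gamma.sum(axis=0)                                    # soft counts (K,)
    first_moment = gamma.T @ frames                               # (K,)
    return (tau * means + first_moment) / (tau + counts)
```

Components that the adaptation data does not touch keep their prior means, while well-observed components move toward the data; this is the behaviour that makes MAP adaptation robust to small amounts of data.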
Formulated in such a general way, this adaptation formalism can be applied to different models (GMM, HMM, etc.) and with different types of priors (probabilistic laws, structural priors, etc.). We also extend this formalism by explaining how to integrate into the adaptation scheme any auxiliary information available in addition to the mix. This can be, for example, visual information, a time segmentation into sound classes, some form of incomplete separation, etc.
To show the use of model adaptation in practice, we apply this adaptation formalism to the problem of separating voice from music in popular songs. In 2005 we proposed some adaptation techniques based on a segmentation of the processed song into vocal and non-vocal parts. These techniques include learning a music model from the non-vocal parts and adapting a voice model filter from the vocal parts , .
We show that these adaptation techniques are just particular forms of our general adaptation formalism. Furthermore, we introduce a new Power Spectral Density (PSD) gain adaptation technique, and we explain how to perform joint filter and PSD gain adaptation for the voice model, which leads to better performance than filter adaptation alone. Finally, in addition to what was done in , , where a manual vocal / non-vocal segmentation was used, we have developed an automatic segmentation module.
Thus, we have developed a one-microphone voice / music separation system based on adapted models. This system runs in a completely automatic manner, i.e. without any human intervention, and the computation load is quite reasonable (no more than 10 times real time). The obtained results show that, for this task, an adaptation scheme can significantly improve the separation performance (by at least 5 dB) in comparison with non-adapted models.
This work has been accepted for publication and is thoroughly detailed in Alexey Ozerov's Ph.D. manuscript . It was done in close collaboration with FTR&D (Pierrick Philippe).
Source separation is the task of retrieving the source signals underlying a multichannel mixture signal, where each channel is the sum of scaled versions of the sources (instantaneous case) or filtered versions thereof (convolutive case). A popular approach is to assume that the sources admit a sparse representation in some (possibly overcomplete) basis. Separation can then be achieved by sparse decomposition of the mixture signal. Previous work in the group focussed on fixed time-frequency bases and source-adapted bases trained on isolated samples of each source.
This year we proposed two methods to adapt the bases directly from the mixture signal. The first method aims to find a time-frequency basis such that the source signals overlap as little as possible in this basis, so that separation can be performed by binary masking, i.e. associating each time-frequency bin with a single source. Such a basis is estimated by minimizing a quadratic overlap criterion, given the spatial directions of the sources. Experiments with Cosine Packet (CP) bases showed that this method outperformed binary masking on a fixed MDCT basis for the separation of stereo instantaneous mixtures of three sources.
The second method assumes that each time frame of the mixture signal can be represented as a sparse linear combination of multichannel atoms forming a complete basis, where each atom belongs to a single source. The best basis is found for all time frames by minimizing the ℓp norm of the combination weights. The spatial direction associated with each atom is then estimated using the GCC-PHAT estimator, and the set of atoms corresponding to each source is obtained by clustering the directions. This method outperformed both convolutive ICA and DUET approaches on low-reverberation convolutive mixtures.
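The final clustering step can be sketched as a simple one-dimensional k-means on the estimated atom directions; this is our own illustrative stand-in (function name included), and the GCC-PHAT direction estimation itself is not reproduced here:

```python
import numpy as np

def cluster_directions(thetas, n_src, n_iter=50):
    """Group per-atom direction estimates into n_src sources by 1-D k-means.

    thetas: (n_atoms,) array of direction estimates (e.g. angles).
    Returns the cluster centers (one per source) and the atom labels.
    """
    # spread the initial centers over the range of observed directions
    centers = np.quantile(thetas, np.linspace(0.1, 0.9, n_src))
    labels = np.zeros(len(thetas), dtype=int)
    for _ in range(n_iter):
        # assign each atom to the nearest center
        labels = np.argmin(np.abs(thetas[:, None] - centers[None, :]), axis=1)
        # move each center to the mean of its assigned atoms
        for j in range(n_src):
            if np.any(labels == j):
                centers[j] = thetas[labels == j].mean()
    return centers, labels
```

Each resulting cluster gathers the atoms attributed to one source, whose contribution can then be resynthesized from those atoms alone.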
We also studied the minimization of the ℓp norm of the combination weights for complex-valued overcomplete bases. This optimization problem is difficult since it is non-convex and the theoretical results available for real-valued data do not carry over to complex-valued data. We characterized the local minima of the ℓp norm in a simple case and derived a fast algorithm for estimating the global minimum. This algorithm has been applied to the separation of stereo instantaneous and convolutive mixtures of three sources.
This work was conducted in collaboration with Maria G. Jafari and Mark D. Plumbley (Queen Mary, University of London) and Mike E. Davies (University of Edinburgh). The results have been published in the form of a journal article , a book chapter and two conference papers , .
Source separation of underdetermined and/or convolutive mixtures is a difficult problem that has been tackled by many algorithms based on different source models. Their performance is usually limited by poorly designed source models or by local maxima of the function to be optimized. Moreover, it may be limited by algorithmic constraints, such as the length of the demixing filters or the number of frequency bins of the time-frequency masks. The best possible source signal that can be estimated under these constraints (in the ideal case where source models and optimization algorithms are perfect) is called an oracle estimator of the source. We have expressed and implemented oracle estimators for four classes of algorithms (time-invariant beamforming, single-channel time-frequency masking, multichannel time-frequency masking and best-basis masking) and studied their performance on realistic speech and music mixtures. The results have led to interesting conclusions concerning the performance bounds of blind algorithms, the choice of the best class of algorithms and the assessment of the separation difficulty.
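For the single-channel time-frequency masking class, the oracle is the ideal binary mask computed from the true sources; a minimal sketch (function name is ours) could look as follows:

```python
import numpy as np

def oracle_binary_mask(S, x_tf):
    """Oracle single-channel time-frequency binary masking.

    S: (n_src, n_bins) true source coefficients in some TF basis.
    x_tf: (n_bins,) coefficients of the mixture (sum of the sources).

    Knowing the true sources, each bin is assigned to the dominant
    source; this is the best achievable binary mask, i.e. an upper
    bound on what any blind masking algorithm could achieve under
    the same constraint.
    """
    labels = np.argmax(np.abs(S), axis=0)   # dominant source per bin
    est = np.zeros_like(S)
    for j in range(S.shape[0]):
        est[j, labels == j] = x_tf[labels == j]
    return est
```

Comparing the distortion of such oracle estimates with that of blind algorithms quantifies how much of the residual error is due to the masking constraint itself rather than to the blind estimation.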
This work, which builds on our previous contribution published in , was done in collaboration with Emmanuel Vincent and Mark D. Plumbley (Queen Mary, University of London). For more details, please refer to and .
Sparse approximation using redundant dictionaries is an efficient tool for many applications in the field of signal processing. The performance largely depends on how well the dictionary is adapted to the signal to be decomposed. As statistical dependencies are most of the time not obvious in natural high-dimensional data, learning fundamental patterns is an alternative to the analytical design of bases and has become an active research field. Most of the time, several different observed patterns can be viewed as different deformations of a single generating function. For example, the underlying patterns of a class of signals can occur at any time position, so this shift-invariance property should be built into the dictionary design. We developed a new algorithm for learning short generating functions, each of which generates a set of atoms corresponding to all its translations. The resulting dictionary is highly redundant and shift-invariant.
This algorithm learns the set of generating functions iteratively from a set of training signals. Each iteration alternates between two steps: we first compute a sparse decomposition of the training signals on the dictionary generated by the current generating functions, using Matching Pursuit, mostly because of the availability of a fast implementation . Then, for each generating function, we collect one signal patch per occurrence of this function found by the decomposition, and we update the function to a least-squares approximation of these patches. Depending on whether the decomposition coefficients are allowed to be updated during this step, the new function is given either by the first principal component or by the centroid of the corresponding patches. The first method gives a better approximation of the patches, while the second has a lower algorithmic complexity. The process is then iterated.
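The alternation above can be sketched for a toy case: a single generating function, one occurrence per training signal, and the centroid update. All names are ours and the Matching Pursuit step is reduced to picking the best translation by correlation, so this is only an illustration of the structure of one iteration, not the actual algorithm:

```python
import numpy as np

def learn_generating_function(signals, length, n_iter=5):
    """Toy shift-invariant learning of one generating function.

    signals: list of 1-D arrays.  length: size of the generating function.
    Alternates a one-atom sparse step (best translation by correlation)
    with a centroid update of the aligned patches.
    """
    # initialise from the highest-energy window of the first signal
    x0 = signals[0]
    energies = [x0[t:t + length] @ x0[t:t + length]
                for t in range(len(x0) - length + 1)]
    g = x0[int(np.argmax(energies)):][:length].copy()
    g /= np.linalg.norm(g)
    for _ in range(n_iter):
        patches = []
        for x in signals:
            # sparse step: best translation = max absolute correlation
            corr = np.correlate(x, g, mode='valid')
            t = int(np.argmax(np.abs(corr)))
            # sign-align the extracted patch with the current function
            patches.append(np.sign(corr[t]) * x[t:t + length])
        # update step: centroid of the aligned patches, renormalized
        g = np.mean(patches, axis=0)
        g /= np.linalg.norm(g)
    return g
```

On signals that are shifted copies of one waveform, this procedure recovers (a translate of) that waveform up to sign.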
On natural images, the learnt atoms are similar to those generally found in the literature. On other data, such as ECG or EEG signals, typical waveforms are retrieved. We also report results on audio data, where approximations using the learnt atoms are sparser than those using local cosines.
This work, which extends our previous work with the MOTIF algorithm , was presented at a workshop . It was done in collaboration with the group of Pierre Vandergheynst (EPFL, Lausanne). We are currently working on other deformation classes, such as phase shifts for audio signals, and dilation and rotation for images.
Real-world phenomena involve complex interactions between multiple signal modalities. As a consequence, humans constantly integrate perceptions from all their senses in order to enrich their understanding of the surrounding world. This paradigm can also be extremely useful in many signal processing and computer vision problems involving mutually related signals. The simultaneous processing of multimodal data can in fact reveal information that is otherwise hidden when considering the signals independently. However, in natural multimodal signals, the statistical dependencies between modalities are in general not obvious. Learning fundamental multimodal patterns could offer deep insight into the structure of such signals. Typically, such recurrent patterns are shift-invariant, so the learning should seek the best matching filters.

We have developed an algorithm for iteratively learning multimodal generating functions that can be shifted to all positions in the signal. The learning is defined in such a way that it can be accomplished by iteratively solving a generalized eigenvector problem, which makes the algorithm fast, flexible and free of user-defined parameters. The algorithm was applied to audiovisual sequences and we show that it is able to discover underlying structures in the data. In particular, it is possible to locate the mouth of a speaker based on the learnt multimodal dictionaries, even in adverse conditions where the audio is corrupted by noise and other speakers are visible (but not audible) who utter the same words as the target speaker. This work, which was done in collaboration with G. Monaci, P. Jost and P. Vandergheynst from EPFL, was published in and has been submitted for possible journal publication.
Recent developments in sparse signal models mainly focus on analyzing sufficient conditions which guarantee that various algorithms (matching pursuits, basis pursuit, ...) can “recover” a sparse signal representation. Typical conditions involve both basic properties of the representation itself (which should be sufficiently sparse or compressible) and of the dictionary used to represent the signal, which should satisfy some uniform uncertainty principle. Even though random dictionary models can be used to prove that strong uniform uncertainty principles are met by “most” dictionaries, checking them for a specific dictionary seems to remain a combinatorial problem, and coherence-based estimates provide very pessimistic recovery conditions.
In parallel to these developments in sparse signal models, various application scenarios have motivated renewed interest in processing not just a single signal, but many signals or channels at the same time. A striking example is sensor networks, where signals are monitored by low-complexity devices whose observations are transferred to a central collector . This central node thus faces the task of analyzing many, possibly high-dimensional, signals. Moreover, signals measured in sensor networks are typically not uncorrelated: there are global trends or components that appear in all signals, possibly in slightly altered forms.
We developed an analysis of the theoretical performance of two families of simultaneous sparse representation algorithms. First, we considered p-thresholding, a simple algorithm for recovering simultaneous sparse approximations of multichannel signals. Our analysis studies the average behaviour in addition to the worst-case one, and the spirit of our results is the following: given a not too coherent dictionary and signals with coefficients that are sufficiently large and balanced over the channels, p-thresholding can, with overwhelming probability, recover superpositions of a number of atoms that grows with the signal dimension d, under conditions much less restrictive than in the worst case, where far fewer atoms can be recovered. Numerical simulations confirm our theoretical findings and show that p-thresholding is an interesting low-complexity alternative to simultaneous greedy or convex relaxation algorithms for processing sparse multichannel signals with balanced coefficients.
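As an illustration, a minimal sketch of p-thresholding for multichannel signals follows; the function name, the least-squares refit on the selected atoms, and the toy dictionary are our own simplifications of the algorithm analyzed above:

```python
import numpy as np

def p_thresholding(Y, D, k, p=2):
    """Simultaneous sparse approximation by p-thresholding (sketch).

    Y: (d, N) multichannel signal (one column per channel).
    D: (d, K) dictionary with unit-norm columns.
    k: number of atoms to keep.

    Score each atom by the l_p norm, across channels, of its
    correlations with the signal; keep the k best-scoring atoms,
    then project Y onto their span.
    """
    corr = D.T @ Y                                   # (K, N) correlations
    scores = np.sum(np.abs(corr) ** p, axis=1) ** (1.0 / p)
    support = np.argsort(scores)[-k:]                # k largest scores
    Phi = D[:, support]
    coeffs, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return support, Phi @ coeffs                     # support + approximation
```

The single pass over the correlations is what makes the algorithm low-complexity compared to greedy or convex relaxation methods; the analysis above characterizes when this simple scoring suffices to identify the correct atoms.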
This work was done in collaboration with Karin Schnass and Pierre Vandergheynst (EPFL) and Holger Rauhut (University of Vienna). A journal paper is in preparation and a conference paper has been submitted for publication.
This project entitled "Multimodal description for automatic structuring of TV streams" started in Oct. 2004 and is funded by the ACI Masse de Données. The partners are the METISS and TEXMEX groups at IRISA and the DCA group at INA.
The aim of this project is to propose and evaluate algorithms to structure the video stream in order to automate this tedious part of the indexing process at INA. The main scientific objectives are the joint modeling of different media (image, text, metadata, sound, etc.) in a statistical framework and the use of prior information, mainly the program guide, in combination with a statistical model.
In the framework of this project, our team works on the use of segment models for video structuring as well as on the segmentation and transcription of the video stream soundtrack.
The PELOPS project is a EUREKA-ITEA project which started in 2005. IRISA joined the project in July 2006, and the project ended in June 2007.
The partners are Thomson Multimedia, Acotec, Barco, EVS, Leo Vision, MOG and Telefonica.
The project was targeted towards content creation and repurposing for live sports events.
The contribution of IRISA focused on the design of audio analysis tools and processes for content analysis, structuring and prioritisation, using statistical approaches for audio classification and source separation techniques.
A bilateral collaboration with the Signal Processing group (LTS2) led by Pierre Vandergheynst at EPFL (Switzerland) was initiated a few years ago within the HASSIP European research training network. Since 2005, thanks to bilateral funding by the foreign affairs ministry, the collaboration has been reinforced and has led to several student exchanges and academic visits, including a two-month visit of Rémi Gribonval at EPFL in the summer of 2006. In the fall of 2005, a co-supervised Ph.D. thesis (Boris Mailhé) started, reinforcing the collaboration further, and an INRIA Associated Team called SPARS officially started in January 2007 to strengthen and build upon this collaboration in the coming years. The collaboration has so far resulted in joint theoretical contributions on sparse signal approximation, as well as on multimodal audiovisual signal analysis, drawing on the complementary competences in audio (METISS) and image/video (LTS2) applications of sparse signal models.
In the framework of the INRIA Associated Team SPARS, several junior and senior researchers from LTS2 (EPFL) visited the METISS group in 2007. A first visit by Pierre Vandergheynst and Karin Schnass was the occasion to complete a paper on the theoretical analysis of multichannel sparse approximation algorithms, which is currently submitted for publication. These results were presented at several international conferences this year. During a visit by Anna Llagostera and Gianluca Monaci, we experimented with multimodal signal models, using visual information from audiovisual data to train audio source models for single-channel source separation. Independently from the SPARS Associated Team, with the help of GDR ISIS (CNRS) funding, we invited Matthieu Kowalski, a Ph.D. student with Bruno Torrésani at LATP, Université de Provence, for a one-month visit during which we studied iterative optimization algorithms for structured multichannel decompositions and experimented with their possible application to convolutive source separation.
Guillaume Gravier visited the Language Technology Lab (LTL) at the INAOE (Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico) for two months in July and August 2007 with the goal of establishing a collaboration in the field of spoken document analysis. This stay was the opportunity to investigate the application of the natural language processing techniques developed at LTL (text segmentation, clustering, summarization, ...) to the output of the IRENE transcription system. Experiments were carried out on text segmentation and on text pattern discovery. As a result of this exploratory visit (INRIA exploratory visit program 01AUTR50508), a more formal collaboration with the LTL is under study, targeting in particular a joint participation in the QAst (Question Answering on speech transcriptions) track of the CLEF evaluation campaign.
Rémi Gribonval was an invited tutorial lecturer at the 7th International Conference on Independent Component Analysis and Signal Separation (ICA 2007), London, UK, September 2007.
Rémi Gribonval was a member of the Program Committee for the GRETSI French-speaking Workshop on Signal and Image Processing, held in Troyes, France, in September 2007.
Frédéric Bimbot is a member of the Programme Committee for the Odyssey 2008 Workshop on Speaker Recognition, to be held in Stellenbosch, South Africa, January 21-25, 2008.
Frédéric Bimbot is a member of the Programme Committee for the Eusipco 2008 Conference, to be held in Lausanne, Switzerland, August 25-29, 2008.
Guillaume Gravier is a member of the Network of Excellence (NoE) MUSCLE.
Emmanuel Vincent was one of the panelists of the discussion session on evaluation campaigns held at the ICA 2007 Conference, London, September 2007.
Rémi Gribonval participates in the CNRS expert committee “methods in signal and image processing”.
Guillaume Gravier is a member of the Administration Board of the Association Francophone de la Communication Parlée (AFCP).
Guillaume Gravier is the organiser of the second ESTER evaluation campaign on the segmentation and transcription of audio contents.
Emmanuel Vincent was the chair of the first Stereo Audio Source Separation Evaluation Campaign (SASSEC).
Rémi Gribonval has given 8 hours of lecture on signal and image representation within the ARD module of the Masters in Computer Science, Université de Rennes 1.
Guillaume Gravier has given two 2-hour lectures on Voice Technologies at the École Supérieure d'Applications des Transmissions (ESAT, Rennes) and the Institut de Formation Supérieure en Informatique et Communication (IFSIC, Univ. Rennes 1).
Frédéric Bimbot is the coordinator of the ARD module and has given 6 hours of lecture in speech and audio description within the FAV module of the Masters in Computer Science, Rennes I.
Frédéric Bimbot visited three secondary schools in Brittany and gave presentations on speaker recognition to several classes, in the context of “A la découverte de la Recherche”.
Guillaume Gravier has given 10 hours of lecture in Data Analysis and Statistical Modeling within the ADM module of the Master in Computer Science, Rennes I.
Guillaume Gravier has given 2 lectures (4 h) at the Ermites 2007 summer school (Ecole Recherche Multimodale d'informations) on automatic speech recognition and on multimodal information fusion.
Emmanuel Vincent gave lectures about audio rendering, coding and source separation for a total of 6 hours as part of the CTR module of the Masters in Computer Science, Rennes I.
Emmanuel Vincent taught general tools for signal compression and speech compression for 10 hours within the DT SIC RTL course at the École Supérieure d'Applications des Transmissions (ESAT, Rennes).
The project-team prepared demonstrations for the 40th anniversary of INRIA (Lille, 10-11 December 2007), under the technical coordination of Gilles Gonon.