Team METISS

Inria / Raweb 2004
Project: METISS


Section: New Results


Keywords : speaker recognition, speech recognition, Bayesian adaptation, structural models, hierarchical clustering, denoising, source separation.

Speaker and speech recognition

Comparison, normalisation and adaptation of speaker models

Participants : Mathieu Ben, Frédéric Bimbot, Mikaël Collet, Guillaume Gravier.

In speaker recognition, Bayesian adaptation of Gaussian Mixture Models (GMM) [62] with the Maximum A Posteriori (MAP) criterion has been shown to be more efficient than Maximum Likelihood (ML) estimation, because it limits over-adaptation to the training data by assuming a prior distribution on the model parameters. However, in practice this technique is not sufficient to compensate for the lack of training data, and the statistical behaviour of the score provided by the likelihood ratio test is not consistent with Bayesian theory.
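The MAP mean adaptation mentioned above can be sketched in its standard relevance-factor form (the function name, the default relevance factor of 16 and the restriction to diagonal covariances and means-only adaptation are illustrative assumptions, not taken from the report):

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_covs, features, relevance=16.0):
    """MAP-adapt the mean vectors of a diagonal-covariance GMM (the
    background model) towards enrolment features. Sketch of the usual
    relevance-factor formulation; weights and covariances stay fixed."""
    M, D = ubm_means.shape
    # E-step: posterior probability of each mixture component per frame
    log_prob = np.zeros((len(features), M))
    for m in range(M):
        diff = features - ubm_means[m]
        log_prob[:, m] = (np.log(ubm_weights[m])
                          - 0.5 * np.sum(np.log(2 * np.pi * ubm_covs[m]))
                          - 0.5 * np.sum(diff ** 2 / ubm_covs[m], axis=1))
    log_prob -= log_prob.max(axis=1, keepdims=True)
    post = np.exp(log_prob)
    post /= post.sum(axis=1, keepdims=True)
    # Sufficient statistics
    n = post.sum(axis=0)          # occupation counts, shape (M,)
    ex = post.T @ features        # first-order statistics, shape (M, D)
    # MAP interpolation between the data mean and the prior (UBM) mean:
    # the relevance factor keeps alpha small when n[m] is small, which
    # is what limits over-adaptation on scarce training data.
    alpha = n / (n + relevance)
    data_mean = ex / np.maximum(n, 1e-10)[:, None]
    return alpha[:, None] * data_mean + (1 - alpha)[:, None] * ubm_means
```

Components that receive little enrolment data keep means close to the prior, which is exactly the behaviour the hierarchical scheme discussed below tries to improve upon.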

This problem is usually addressed with score normalisation techniques such as z-norm and t-norm [1]. In the framework of his PhD [9], Mathieu Ben established formal relations between the statistics of likelihood ratio scores, the Kullback-Leibler divergence between GMMs and the Euclidean distance between GMM parameters (under specific yet realistic hypotheses). These results were then used to replace score normalisation with model normalisation, which proves as efficient in terms of speaker recognition performance and much more advantageous in terms of speaker representation and score computation complexity. These results should also have an impact on more recent work on anchor speaker models.
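For reference, the two classical score normalisations named above amount to standardising the raw log-likelihood-ratio score against an empirical impostor score distribution (a minimal sketch; the function names are illustrative):

```python
import numpy as np

def z_norm(raw_score, impostor_scores):
    """Z-norm: standardise a raw score with the distribution of the
    target model's scores against impostor utterances, estimated
    offline once per speaker model."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (raw_score - mu) / sigma

def t_norm(raw_score, cohort_scores):
    """T-norm: standardise with the scores of the same test utterance
    against a cohort of other speaker models, estimated at test time."""
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores)
    return (raw_score - mu) / sigma
```

Both require extra scoring passes (against impostor data or a cohort of models), which is the computational overhead that model normalisation avoids.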

We have also studied a structural adaptation scheme which assumes a hierarchical structure of speech common to all speakers. We introduce multi-resolution GMMs in which the mean vectors are organised in a binary tree, with coarse-to-fine resolution when going down the tree. Bayesian adaptation [45] is then performed hierarchically, propagating the estimated values of the coarsest GMM means down the tree via linear regression between contiguous depths. This allows means of the finest-resolution speaker GMM that are not observed in the training set to be adapted according to their parent (or ancestor) nodes. As in the classical Bayesian adaptation approach, the parameters of the multi-resolution prior background GMMs are estimated on prior data. However, apart from offering a more general formalism than the conventional approach, the hierarchical scheme has not yet yielded a clear advantage in practice [22].
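The coarse-to-fine propagation can be illustrated with a much-simplified sketch: a node with too little adaptation data inherits its parent's adaptation offset. The linear regression between contiguous depths used in the actual work is replaced here by a plain offset copy, so this only conveys the tree-structured idea, not the published method:

```python
import numpy as np

def propagate_offsets(prior_means, adapted_means, counts, min_count=1.0):
    """Coarse-to-fine propagation in a binary tree of GMM means.
    prior_means / adapted_means: lists over depths; level d is an
    array of shape (2**d, D), node k at depth d having children
    2k and 2k+1 at depth d+1. A node whose occupation count falls
    below min_count inherits its parent's adaptation offset instead
    of its own, poorly estimated, one."""
    result = [adapted_means[0].copy()]
    for d in range(1, len(prior_means)):
        level = adapted_means[d].copy()
        parent_offset = result[d - 1] - prior_means[d - 1]
        for k in range(level.shape[0]):
            if counts[d][k] < min_count:
                # unobserved node: shift the prior mean by the offset
                # estimated at the (better observed) parent node
                level[k] = prior_means[d][k] + parent_offset[k // 2]
        result.append(level)
    return result
```

The design intent is the one stated in the text: leaf means never seen in the enrolment data still move, because their ancestors were observed.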

Denoising speech using single sensor source separation techniques

Participants : Guillaume Gravier, Rémi Gribonval, Alexey Ozerov.

Real-life speech material often contains speech with background noise. In particular for broadcast news, it is common to hear a jingle in the background when listening to the headline titles. Detecting the presence of background music and being able to remove it from the speech signal is of utmost importance in order to obtain a better automatic transcription.

Both detection and removal of background music can be stated in terms of source separation using a single sensor, where one source is the speech signal while the second one is the background music signal.

In previous work [38], we demonstrated that, assuming statistics on the power spectral densities of the jingle and speech signals are known, the jingle can be efficiently removed from the speech material using adaptive Wiener filtering. In contrast, classical methods such as spectral subtraction or time-frequency shrinkage gave poor results because of the non-stationarity of both the noise and speech signals. However, the non-linearities introduced by the source separation algorithm limit the benefit in terms of speech recognition.
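The Wiener filtering step reduces, per frequency bin, to the familiar gain S/(S+N) applied to the mixture spectrum (a minimal sketch assuming, as in the text, that short-term PSD estimates of speech and jingle are available; function names are illustrative):

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, floor=1e-10):
    """Per-bin Wiener gain S/(S+N), in [0, 1]. The 'adaptive' aspect
    in the cited work comes from re-estimating the PSDs over time;
    here they are simply passed in."""
    return speech_psd / np.maximum(speech_psd + noise_psd, floor)

def denoise_frame(mixture_spectrum, speech_psd, noise_psd):
    """Apply the Wiener gain to one STFT frame of the noisy mixture;
    the (zero-phase, real) gain attenuates bins dominated by the jingle."""
    return wiener_gain(speech_psd, noise_psd) * mixture_spectrum
```

The time-varying gain is precisely the non-linearity (as a function of the observed signal) that the text says limits the downstream speech recognition benefit.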

Previous experiments were carried out on a limited corpus of 50 read sentences. In 2004, we validated these results on a larger corpus and showed that a robust front-end using normalised cepstral coefficients can partially compensate for the non-linearities introduced by the denoising process [27]. However, performance after denoising is still far from that obtained on the original clean signal, and a more realistic setup where the spectral characteristics of the noise are not known a priori must be investigated.
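One common form of such a robust front-end is per-utterance cepstral mean and variance normalisation; the report does not specify the exact normalisation used, so the following is only a plausible sketch of the idea:

```python
import numpy as np

def cmvn(cepstra):
    """Cepstral mean and variance normalisation over an utterance:
    each cepstral coefficient is standardised to zero mean and unit
    variance, reducing the impact of stationary channel effects and,
    partially, of residual distortion left by a denoising stage.
    cepstra: array of shape (n_frames, n_coefficients)."""
    mu = cepstra.mean(axis=0)
    sigma = cepstra.std(axis=0)
    return (cepstra - mu) / np.maximum(sigma, 1e-10)
```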

