Project : metiss
Section: New Results
Speaker and speech recognition
Structural adaptation of speaker models
In speaker recognition, Bayesian adaptation of GMMs with the Maximum A Posteriori (MAP) criterion has been shown to be more effective than Maximum Likelihood (ML) estimation, because it limits over-adaptation to the training data by assuming a prior distribution on the model parameters. However, when training data is very limited and sparse, this technique suffers from the fact that only the components of the model that are observed in the training set are adapted, while unseen components remain unchanged.
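The behaviour described above can be illustrated with a minimal means-only MAP adaptation sketch (diagonal covariances, a hypothetical relevance factor tau; not the exact implementation used in the experiments):

```python
import numpy as np

def map_adapt_means(prior_means, prior_weights, prior_covs, X, tau=16.0):
    """Means-only MAP adaptation of a diagonal-covariance GMM.

    Components that accumulate little or no posterior mass on the
    adaptation data keep (almost exactly) their prior mean, which is
    the limitation discussed above when training data is sparse.
    """
    # Posterior responsibilities gamma[t, k] under the prior (background) GMM.
    diff = X[:, None, :] - prior_means[None, :, :]                    # (T, K, D)
    log_g = -0.5 * np.sum(diff**2 / prior_covs
                          + np.log(2 * np.pi * prior_covs), axis=2)   # (T, K)
    log_g += np.log(prior_weights)
    log_g -= log_g.max(axis=1, keepdims=True)
    gamma = np.exp(log_g)
    gamma /= gamma.sum(axis=1, keepdims=True)

    n = gamma.sum(axis=0)                     # soft occupation counts, (K,)
    ex = gamma.T @ X                          # first-order statistics, (K, D)
    alpha = n / (n + tau)                     # data/prior interpolation weight
    return (alpha[:, None] * (ex / np.maximum(n, 1e-10)[:, None])
            + (1.0 - alpha)[:, None] * prior_means)
```

With data drawn near only one component, the other component's adapted mean stays at its prior value, which is precisely the sparse-data weakness that motivates the structural scheme below.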
We also study a structural adaptation scheme which assumes a hierarchical structure of speech common to all speakers. We introduce multi-resolution GMMs in which the mean vectors are organized in a binary tree, with coarse-to-fine resolution when going down the tree. Bayesian adaptation is then performed hierarchically, propagating the estimated values of the coarsest GMM means down the tree via linear regression between contiguous depths. This allows some of the means of the finest-resolution speaker GMM that are not observed in the training set to be adapted according to their parent (or ancestor) nodes. As in the classical Bayesian adaptation approach, the parameters of the multi-resolution prior background GMMs are estimated on prior data.
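The top-down propagation step can be sketched as follows. The data layout (dictionaries keyed by node id, per-node regression parameters A and b assumed pre-estimated on background data) is hypothetical and only illustrates the principle:

```python
import numpy as np

def adapt_tree(prior, adapted, children, A, b, root=0):
    """Top-down propagation of adapted means through a binary tree of
    GMM components.

    prior[i], adapted[i] : mean vectors per node (adapted[i] is None when
    the node received no adaptation data); children[i] : child node ids;
    A[i], b[i] : linear regression from the parent's mean to node i's
    mean, estimated on background data.

    Observed nodes keep their data-adapted mean; unobserved nodes get a
    mean predicted by regression from their (already adapted) parent.
    """
    out = {}
    stack = [(root, prior[root] if adapted[root] is None else adapted[root])]
    while stack:
        node, mean = stack.pop()
        out[node] = mean
        for c in children[node]:
            if adapted[c] is not None:
                child_mean = adapted[c]          # data seen: keep MAP estimate
            else:
                child_mean = A[c] @ mean + b[c]  # predict from adapted parent
            stack.append((c, child_mean))
    return out
```

In this way an unseen leaf component still moves with the speaker, through the adaptation of its ancestors.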
The same data are used to estimate the regression coefficients and the binary tree. The latter can be constructed by several hierarchical clustering methods; we use a hierarchical EM algorithm with a Gaussian splitting process. This structural Bayesian adaptation technique is still under development and experimentation, and has not yet shown improvement over classical Bayesian MAP adaptation.
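One splitting step of such a tree construction might look as follows; the perturbation heuristic (offsetting each mean along its per-dimension standard deviation) is a common choice, not necessarily the exact one used here:

```python
import numpy as np

def split_gaussians(means, covs, eps_scale=0.2):
    """One Gaussian-splitting step for growing the binary tree.

    Each component (mean, diagonal covariance) is replaced by two
    children whose means are perturbed by +/- eps_scale standard
    deviations per dimension.  A few EM iterations would then refine
    the children before the next split, doubling the number of
    components at each tree depth.
    """
    eps = eps_scale * np.sqrt(covs)             # small offset per dimension
    new_means = np.concatenate([means - eps, means + eps])
    new_covs = np.tile(covs, (2, 1))            # children inherit the covariance
    return new_means, new_covs
```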
Noise robust speech recognition using source separation techniques
Real-life speech material often contains speech with background noise. In broadcast news in particular, it is common practice to play a jingle in the background while the headlines are being read. Detecting the presence of background music and removing it from the speech signal is therefore of utmost importance for obtaining a good transcription.
Both the detection and the removal of background music can be stated as a single-sensor source separation problem, where one source is the speech signal and the other is the background music. This approach was validated on the BREF corpus, to which a jingle was artificially added at various signal-to-noise ratios. Assuming that statistics on the power spectral densities of the jingle and speech signals are known, we were able to show that the jingle can be efficiently removed from the speech material using adaptive Wiener filtering, while classical methods such as spectral subtraction or time-frequency shrinkage gave poor results because of the non-stationarity of both the noise and speech signals. However, the non-linearities introduced by this type of algorithm limit the benefit in terms of speech recognition. The use of smoothed representations of the speech signal in the recognizer, such as RASTA filtering or short-term Gaussianization, can partially compensate for these non-linearities, but more efficient spectral (or cepstral) smoothing techniques are still required.
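The core of the Wiener approach, under the assumption stated above that short-term power spectral densities (PSDs) of both sources are known, can be sketched per frame as follows (a minimal illustration, not the full adaptive system):

```python
import numpy as np

def wiener_gain(speech_psd, jingle_psd):
    """Frequency-domain Wiener gain for suppressing an additive jingle.

    Given the speech and jingle power spectral densities for the
    current frame, the gain attenuates each frequency bin according to
    the local speech-to-jingle power ratio.  Re-estimating the PSDs
    frame by frame is what makes the filter adaptive and lets it track
    the non-stationarity of both signals.
    """
    return speech_psd / (speech_psd + jingle_psd)

def denoise_frame(noisy_frame, speech_psd, jingle_psd):
    """Apply the Wiener gain to one windowed analysis frame."""
    spec = np.fft.rfft(noisy_frame * np.hanning(len(noisy_frame)))
    return np.fft.irfft(wiener_gain(speech_psd, jingle_psd) * spec)
```

The gain lies between 0 and 1 in every bin, so the filter attenuates rather than subtracts, which avoids the negative-power artifacts of spectral subtraction; the price is the non-linear spectral distortion mentioned above.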