Section: New Results

Keywords : speaker characterisation, speaker verification, Gaussian Mixture Models (GMM), Anchor Models, normalisation, speaker selection, Classification and Regression Trees (CART).

Speaker characterisation

Speaker characterisation in the model space

Participants : Mathieu Ben, Frédéric Bimbot, Guillaume Gravier.

In speaker recognition, Bayesian adaptation of Gaussian Mixture Models (GMM) [78] with the Maximum A Posteriori (MAP) criterion has been shown to be more effective than Maximum Likelihood (ML) estimation, because it limits over-adaptation to the training data by assuming a prior distribution for the model parameters. However, this technique is not sufficient in practice to compensate for the lack of training data, and the statistical behaviour of the score provided by the likelihood ratio test is not consistent with Bayesian theory.
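As an illustration of why a prior limits over-adaptation, the following minimal sketch adapts only the means of a diagonal-covariance GMM. It assumes the common relevance-factor form of MAP adaptation; the function names, the `relevance` parameter and its default value are hypothetical choices for this example and are not taken from the work described here.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a prior GMM (assumed relevance-factor form)."""
    n_comp, dim = len(weights), len(means[0])
    counts = [0.0] * n_comp                      # soft frame counts per component
    sums = [[0.0] * dim for _ in range(n_comp)]  # posterior-weighted frame sums
    for x in frames:
        likes = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for k in range(n_comp):
            post = likes[k] / total
            counts[k] += post
            for d in range(dim):
                sums[k][d] += post * x[d]
    adapted = []
    for k in range(n_comp):
        # alpha -> 1 with plenty of data (ML estimate), alpha -> 0 with none (prior).
        alpha = counts[k] / (counts[k] + relevance)
        ml_mean = [s / counts[k] for s in sums[k]] if counts[k] > 0 else means[k]
        adapted.append([alpha * ml + (1 - alpha) * m
                        for ml, m in zip(ml_mean, means[k])])
    return adapted

# Sixteen identical frames at 2.0: the ML mean of the first component would be 2.0,
# whereas the MAP mean stays halfway between the prior mean and the data.
adapted_means = map_adapt_means([[2.0]] * 16, [0.5, 0.5],
                                [[0.0], [10.0]], [[1.0], [1.0]])
```

With little data for a component, `alpha` stays small and the adapted mean remains close to the prior mean, which is the behaviour the paragraph above describes.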

This problem is usually dealt with by score normalisation techniques such as z-norm and t-norm [3]. As part of his PhD [1], Mathieu Ben has established formal relations between the statistics of likelihood ratio scores, the Kullback-Leibler distance between GMM models and the Euclidean distance between GMM parameters (under specific yet realistic hypotheses).
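For readers unfamiliar with these normalisations, here is a minimal sketch of their usual form. Both subtract a mean and divide by a standard deviation estimated on impostor scores; they differ in where those scores come from. The function names and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev

def normalise_score(raw_score, reference_scores):
    """Centre and scale a raw score with statistics of reference (impostor) scores."""
    return (raw_score - mean(reference_scores)) / pstdev(reference_scores)

def z_norm(raw_score, impostor_utterance_scores):
    """z-norm: statistics come from scoring impostor utterances against
    the claimed speaker's model (estimated once, offline)."""
    return normalise_score(raw_score, impostor_utterance_scores)

def t_norm(raw_score, cohort_model_scores):
    """t-norm: statistics come from scoring the test utterance against
    a cohort of impostor models (estimated at test time)."""
    return normalise_score(raw_score, cohort_model_scores)

normalised = z_norm(2.0, [0.0, 1.0, 2.0, 1.0])
```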

Furthermore, the relation between likelihood ratio scores and the Euclidean distance between GMM parameters can be exploited for efficient score computation, avoiding explicit likelihood computation. Experiments in speaker verification on the NIST 2005 Speaker Recognition corpus and in speaker tracking on the ESTER Broadcast News corpus demonstrated that the computation time can be reduced by up to 75% without any decrease in performance [28].
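The exact hypotheses used in [1] are not restated here, so the following is only an illustrative sketch under assumptions of our own: two GMMs that share their weights and diagonal covariances and differ only in their means (as with mean-only adaptation from a common background model). Under these assumptions, a commonly used upper bound on the Kullback-Leibler divergence is half the squared Euclidean distance between suitably scaled mean vectors, which can be computed from the parameters alone, without evaluating any likelihood. All function names are hypothetical.

```python
import math

def scaled_supervector(weights, means, variances):
    """Stack the means, each scaled by sqrt(weight) / standard deviation."""
    vector = []
    for w, mean, var in zip(weights, means, variances):
        for m, v in zip(mean, var):
            vector.append(math.sqrt(w) * m / math.sqrt(v))
    return vector

def parameter_space_divergence(weights, variances, means_a, means_b):
    """Half the squared Euclidean distance between scaled supervectors."""
    sv_a = scaled_supervector(weights, means_a, variances)
    sv_b = scaled_supervector(weights, means_b, variances)
    return 0.5 * sum((a - b) ** 2 for a, b in zip(sv_a, sv_b))

divergence = parameter_space_divergence(
    [0.5, 0.5], [[1.0], [4.0]], [[0.0], [0.0]], [[2.0], [2.0]])
```

The cost of this computation depends only on the number of model parameters, not on the number of speech frames, which suggests why a parameter-space score can be much cheaper than a frame-by-frame likelihood ratio.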

Relative speaker information and related metrics

Participants : Mikaël Collet, Frédéric Bimbot.

Representing speaker information relative to a set of other speaker models (anchor models) yields a compact representation, which can be advantageous for speaker segmentation, indexing, tracking and adaptation.

In this framework, the speaker-related properties of a speech segment can be represented as a vector of likelihood ratio values (SCV) corresponding to the speech observations being scored by a pre-determined collection of reference (anchor) speaker models.
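A minimal sketch of how such a vector could be computed is given below. To keep it short, each anchor model and the background model are stood in for by a single one-dimensional Gaussian instead of a GMM, and each coordinate is taken as a frame-averaged log-likelihood ratio against the background model; both simplifications, and the function names, are assumptions of this example.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-density of a 1-D Gaussian (stand-in for a full GMM log-likelihood)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def compute_scv(frames, anchors, background):
    """One coordinate per anchor model: the average log-likelihood ratio
    of the frames between that anchor and the background model."""
    bg_mean, bg_var = background
    scv = []
    for mean, var in anchors:
        llr = sum(gaussian_loglik(x, mean, var) - gaussian_loglik(x, bg_mean, bg_var)
                  for x in frames)
        scv.append(llr / len(frames))
    return scv

# Two hypothetical anchors, given as (mean, variance) pairs.
scv = compute_scv([0.0, 0.2, -0.1], [(0.0, 1.0), (1.0, 1.0)], (0.0, 1.0))
```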

In previous work, several deterministic metrics (Euclidean, angular and correlation) were investigated and evaluated for the comparison of speakers in the anchor space [79], [72], [32]. More recently, a probabilistic approach based on a speaker-dependent Gaussian modeling of the SCV was proposed [33] and yielded a considerable improvement of the anchor speaker approach, making it competitive with conventional GMMs.
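The three deterministic metrics can be sketched as follows for two vectors in the anchor space. The definitions used here (angle between vectors, and one minus the Pearson correlation) are the textbook forms and are assumed for illustration; the cited work may define them slightly differently.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angular_distance(a, b):
    """Angle in radians between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
    return math.acos(cosine)

def correlation_distance(a, b):
    """One minus the Pearson correlation of the coordinates."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    centred_a = [x - mean_a for x in a]
    centred_b = [y - mean_b for y in b]
    cov = sum(x * y for x, y in zip(centred_a, centred_b))
    norm = (math.sqrt(sum(x * x for x in centred_a))
            * math.sqrt(sum(y * y for y in centred_b)))
    return 1.0 - cov / norm

distances = (euclidean_distance([1.0, 0.0], [0.0, 1.0]),
             angular_distance([1.0, 0.0], [0.0, 1.0]),
             correlation_distance([1.0, 0.0], [0.0, 1.0]))
```

The angular and correlation metrics are insensitive to a global scaling of the vector, which is not the case for the Euclidean distance.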

Optimizing the speaker coverage of a speech database

Participants : Sacha Krstulovic, Mathieu Ben, Frédéric Bimbot.

State-of-the-art techniques in the various domains of Automatic Speech Processing (be it Automatic Speaker Recognition, Automatic Speech Recognition or Text-To-Speech Synthesis) make extensive use of speech databases. Nevertheless, the problem of optimizing the contents of these databases to suit the development of a given speech processing task has seldom been studied [73].

In this context, we have proposed a general database design method aiming at optimizing the contents of new speech databases by focusing the data collection on a selection of speakers chosen for its good coverage of the voice space. Such databases would be better adapted to the development of recent speech processing methods, such as those based on multi-models (e.g. adaptation of speech recognition with specialized models, speaker recognition with anchor models, speech synthesis by unit selection, etc.). Such developments indeed require a much larger quantity of data per speaker than traditional databases can offer [61]. Nevertheless, the increase in the collection cost for such newer and larger databases should be limited as much as possible, while preserving a good coverage of the speaker variability.

The corresponding work was carried out in the framework of the Neologos project, funded by the French Ministry of Research under the Technolangues program, with the following public, academic and industrial partners: ELDA, France Telecom R&D, IRISA-ENSSAT (Cordial), IRISA (Metiss), LORIA and TELISMA. It re-thinks the design of speech databases in the following terms: it focuses on optimizing the contents of the speech databases so as to guarantee the diversity of the recorded voices, both at the segmental and supra-segmental levels, so that each recorded speaker can be precisely modeled and localized in an abstract space of speakers. In addition to this scientific objective, the method addresses the practical concern of reducing the collection costs for new speech databases.

The resulting methodology performs the selection by optimizing a quality criterion defined in a variety of speaker similarity modeling frameworks [31]. The selection can be performed and validated with respect to a single similarity criterion, using classical clustering methods such as hierarchical or k-median clustering, or it can be performed and validated across several speaker similarity criteria, thanks to a newly developed clustering method that we call Focal Speakers selection [43]. In this framework, four different speaker similarity criteria have been tested, and three different speaker clustering algorithms have been compared. The outcome of this work has been used for the final specification of the list of speakers to be recorded in the Neologos database.
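To make the single-criterion case concrete, the sketch below selects k representative speakers from a matrix of pairwise speaker dissimilarities with a simple alternating k-medoids procedure, used here as a stand-in for the k-median clustering mentioned above. It is not the Focal Speakers selection method, whose details are not given in this section; the function name, the deterministic initialisation and the toy dissimilarities are assumptions.

```python
def select_speakers(dist, k, n_iter=20):
    """Pick k speakers (medoids) that minimise the summed dissimilarity
    between every speaker and its nearest selected speaker."""
    n = len(dist)
    medoids = list(range(k))  # naive initialisation: the first k speakers
    for _ in range(n_iter):
        # Assignment step: attach each speaker to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid is the most central member of each cluster.
        new_medoids = []
        for medoid, members in clusters.items():
            if not members:  # keep an orphaned medoid unchanged
                new_medoids.append(medoid)
                continue
            best = min(members, key=lambda c: sum(dist[c][j] for j in members))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

# Toy example: six speakers placed on a line, forming two groups.
positions = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dissimilarity = [[abs(p - q) for q in positions] for p in positions]
selected = select_speakers(dissimilarity, k=2)
```

Because the procedure only needs a dissimilarity matrix, any of the speaker similarity criteria could in principle be plugged in without changing the selection code.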

A manuscript detailing the methodology and the results of this speaker-driven database design was submitted to an international journal.

Improved CART trees for fast speaker verification

Participants : Gilles Gonon, Rémi Gribonval, Frédéric Bimbot.

The main motivation for using decision trees in the context of speaker recognition is that they can be directly applied in real-time implementations on a PC or a mobile device. They are also particularly suitable for embedded devices, as they require no log/exp computations.

We address the problem of using decision trees in the rather general context of estimating the Log Likelihood Ratio (LLR) used in speaker verification, from two GMM models (speaker model and "background" model). Former attempts at using trees performed quite poorly compared to state-of-the-art results with Gaussian Mixture Models (GMM). Two new solutions have been studied to improve the efficiency of the tree-based approach:

The first one is the introduction, at the tree construction level, of a priori information about the GMMs used in state-of-the-art techniques. Taking into account the training method of the models with EM algorithms and maximum a posteriori techniques, it is possible to implicitly choose locally optimal hyperplane splits for some nodes of the trees. This is equivalent to building oblique trees using a specific set of oblique directions determined by the baseline GMM, thus limiting the complexity of the training phase.

The second one is the use of score functions of varying complexity within each leaf of the trees. These functions are computed after the creation of the trees, by drawing data into the tree leaves and computing a regression function over the LLR scores. Mean score functions, linear score functions and quadratic score functions have been successfully tested, resulting in more accurate trees.
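The second idea can be illustrated with a deliberately small sketch: a tree with a single axis-aligned split on a one-dimensional feature, whose two leaves each hold either a mean or a linear score function fitted by least squares to target LLR values. The real trees are multi-dimensional and may use oblique splits; the function names, the single split and the toy target below are assumptions of this example.

```python
def fit_mean(xs, ys):
    """Constant leaf function: the mean of the target scores in the leaf."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Linear leaf function fitted by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def build_stump(xs, ys, threshold, fit_leaf):
    """Drop the data into the two leaves, then fit one score function per leaf."""
    left = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    left_fn = fit_leaf([x for x, _ in left], [y for _, y in left])
    right_fn = fit_leaf([x for x, _ in right], [y for _, y in right])
    # Scoring needs one comparison and at most one multiply-add: no log or exp.
    return lambda x: left_fn(x) if x < threshold else right_fn(x)

# Toy target standing in for LLR values: y = |x|.
features = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
targets = [abs(x) for x in features]
mean_tree = build_stump(features, targets, 0.0, fit_mean)
linear_tree = build_stump(features, targets, 0.0, fit_linear)
```

On this toy target the linear leaves reproduce the function exactly, whereas the mean leaves only return one constant per leaf, which gives an intuition for why richer leaf functions can make a tree of the same size more accurate.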

Applied to the classical classification and regression trees (CART) method in a speaker verification system, these improvements reduce the complexity of the LLR function computation by a factor of more than 10. Considering a baseline state-of-the-art system with an equal error rate (EER) of 11.6% on the NIST 2005 evaluation, a previous CART method provided typical EERs ranging between 19% and 22%, while the proposed improvements decrease the EER to 13.7% [38].
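For reference, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below estimates it from two lists of scores by scanning candidate thresholds; the function name and the toy scores are hypothetical, and official evaluations may use a different interpolation.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Return the error rate at the threshold where false acceptance and
    false rejection rates are closest (their average at that point)."""
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # A trial is accepted when its score reaches the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

eer = equal_error_rate([2.0, 3.0, 4.0], [0.0, 1.0, 2.5])
```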

This work was carried out in the framework of a feasibility study concerning security requirements for a "Trusted Personal Device" within the Inspired IST Project [54].

