Section: New Results
Speech Analysis and Synthesis
Participants : Anne Bonneau, Vincent Colotte, Dominique Fohr, Yves Laprie, Joseph di Martino, Slim Ouni, Asterios Toutios, Nadia Amar, Imen Jemaa, Sébastien Demange, Ammar Werghi, Fadoua Bahja, Farid Feïz, Agnès Piquard-Kipffer, Utpala Musti, Fabian Monnay.
Acoustic-to-articulatory inversion
Our approach to acoustic-to-articulatory inversion is an advanced table lookup method. The table is built by synthesizing speech spectra from a set of articulatory configurations generated by the articulatory model, which thus plays an important role. The current articulatory model is the one designed by Maeda [59] using a semi-polar grid. Some articulatory configurations, especially those corresponding to back vowels, are not well interpolated because the tongue contour does not intersect all the grid lines. We thus evaluated new strategies based on an adaptive grid attached to the jaw and compared them to the standard semi-polar grid [23]. The main advantage is a better deformation model of the tongue in the front part of the mouth. We also substantially improved the database of articulatory contours outlined from X-ray images recorded at the IPS laboratory in the eighties, by exploiting MRI images of the same speaker recorded in the framework of the ASPI project.
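As an illustration of the table lookup principle, the following is a minimal sketch; the articulatory_to_acoustic function is a toy stand-in for the articulatory synthesizer, and the parameter range and number of parameters are indicative only, not the actual settings of our codebook.

```python
import numpy as np

def articulatory_to_acoustic(a):
    """Toy stand-in for the articulatory synthesizer: maps an articulatory
    configuration to an acoustic vector (here three pseudo-formants)."""
    return np.array([500.0 + 300.0 * a[0], 1500.0 + 500.0 * a[1], 2500.0 + 200.0 * a[2]])

def build_codebook(n_samples=100000, n_artic=7, seed=0):
    """Sample articulatory configurations and store their synthetic acoustics."""
    rng = np.random.default_rng(seed)
    artic = rng.uniform(-3.0, 3.0, size=(n_samples, n_artic))   # indicative parameter range
    acoustic = np.array([articulatory_to_acoustic(a) for a in artic])
    return artic, acoustic

def lookup(observed, artic, acoustic, k=10):
    """Return the k articulatory configurations whose synthetic acoustic
    vectors are closest to one observed acoustic frame (the inversion step)."""
    dists = np.linalg.norm(acoustic - observed, axis=1)
    return artic[np.argsort(dists)[:k]]
```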
Our objective is to elaborate inversion algorithms that work on standard spectral data, e.g. cepstral vectors, in real time. Despite the great theoretical interest of the codebook approach, mainly the possibility of exploring the entire articulatory space, it is hard to imagine using it in a real-time context. We thus developed a new iterative approach that starts from a neutral articulatory trajectory, which is then deformed so as to account for the formant trajectories extracted from speech. The algorithm is inspired by the variational approach we previously designed to improve the acoustic proximity with data extracted from natural speech. Its main strong point is that convergence is guaranteed even if the initial neutral curve is not in the vicinity of the expected inverse trajectory [27]. Even if the current version of the algorithm does not run in real time, since it requires derivatives of the articulatory-to-acoustic synthesis, we envisage using the articulatory codebook as a fast approximation of the synthesis.
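The sketch below illustrates the general idea of deforming a neutral trajectory toward observed formants, under strong simplifications: the forward model is a toy linear map, and the deformation is a plain finite-difference gradient descent rather than the variational formulation cited above.

```python
import numpy as np

# Toy stand-in for the per-frame articulatory-to-formant synthesis
# (3 articulatory parameters -> 3 formants); the real mapping is the
# articulatory synthesizer, possibly approximated by the codebook.
W = np.array([[300.0, 80.0, -50.0],
              [-200.0, 600.0, 120.0],
              [100.0, -150.0, 700.0]])
F_NEUTRAL = np.array([500.0, 1500.0, 2500.0])

def forward_formants(frame):
    return F_NEUTRAL + W @ frame

def invert_trajectory(observed_formants, neutral_traj, lr=1e-7, n_iter=200, eps=1e-3):
    """Deform a neutral articulatory trajectory so that its synthesized formants
    approach the observed ones (crude finite-difference gradient descent)."""
    traj = neutral_traj.copy()                      # shape (n_frames, 3)
    for _ in range(n_iter):
        grad = np.zeros_like(traj)
        for t, frame in enumerate(traj):
            e0 = np.sum((forward_formants(frame) - observed_formants[t]) ** 2)
            for j in range(frame.size):             # numerical gradient per parameter
                pert = frame.copy()
                pert[j] += eps
                e1 = np.sum((forward_formants(pert) - observed_formants[t]) ** 2)
                grad[t, j] = (e1 - e0) / eps
        traj -= lr * grad
    return traj
```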
Using standard spectral data as input to inversion requires a distance able to compare synthetic speech spectra with natural speech spectra. This distance should minimize distortions due to the influence of the source, since synthetic spectra do not involve the source characteristics. The idea investigated was to design a lifter (i.e. a filtering of the cepstral coefficients) that minimizes a perceptual distance based on formants (spectral prominences affiliated to resonance cavities of the vocal tract) between a corpus of formant data extracted from natural speech and entries of the articulatory codebook. The optimal lifter was derived by minimizing the average perceptual distance. Preliminary results are very encouraging and this new distance will be exploited in our inversion framework.
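As an illustration, a lifter can be viewed as a vector of per-coefficient weights applied to the cepstral difference. The sketch below (hypothetical names, plain gradient descent rather than the actual derivation of the optimal lifter) shows how such weights could be adjusted so that the liftered cepstral distance approximates a formant-based perceptual distance on a set of natural/synthetic pairs.

```python
import numpy as np

def liftered_distance(c_nat, c_synth, lifter):
    """Weighted Euclidean distance between natural and synthetic cepstra."""
    return np.linalg.norm(lifter * (c_nat - c_synth))

def fit_lifter(cep_pairs, perceptual_dists, n_coef, n_iter=500, lr=1e-3):
    """Adjust lifter weights so that the liftered cepstral distance approximates
    a formant-based perceptual distance on (natural, synthetic) cepstrum pairs."""
    w = np.ones(n_coef)
    for _ in range(n_iter):
        grad = np.zeros(n_coef)
        for (c_nat, c_syn), d_ref in zip(cep_pairs, perceptual_dists):
            diff2 = (c_nat - c_syn) ** 2
            d_hat = np.sqrt(np.sum(w ** 2 * diff2)) + 1e-12
            # gradient of 0.5 * (d_hat - d_ref)^2 with respect to the weights
            grad += (d_hat - d_ref) * (w * diff2) / d_hat
        w -= lr * grad / len(perceptual_dists)
    return w
```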
Using articulography for speech production
The recent purchase of the articulograph AG500 makes it possible to acquire an almost unlimited quantity of articulatory data, and thus opens the way to several speech production studies. Electromagnetic articulography (EMA) is a well-established technique to record articulatory data with a very good temporal resolution, as movement signals are sampled at 200 Hz. This allows capturing very fine speech movements. The system uses 12 sensors that can be glued, for instance, on the tongue and the lips.
- Mapping EMA data to an articulatory model. Acoustic-to-articulatory maps based on articulatory models have typically been evaluated in terms of acoustic accuracy, that is, the distance between mapped and observed acoustic parameters. Since last year we have been developing a method that allows the evaluation of such maps in the articulatory domain. The proposed method estimates the parameters of Maeda's articulatory model from electromagnetic articulograph data, thus producing full midsagittal views of the vocal tract from the positions of a limited number of sensors attached to the articulators (see the fitting sketch after this list). The match between the EMA data and the articulatory model is good. However, some improvements are still needed to take the larynx position into account (it cannot be covered by EMA). This method will allow a direct comparison of articulatory trajectories derived by inversion against those corresponding to the actual vocal tract dynamics, as recorded by EMA [64].
- Studying pharyngealization using EMA. Pharyngealization is an important characteristic of a set of consonants in Arabic and it has important coarticulation effects on the neighboring vowels. Studying the articulatory aspects of pharyngealization is now possible using articulography. One way to study the coarticulation effect of pharyngealization is to compare the dynamics of the articulation of sequences containing pharyngealized phonemes with similar sequences containing their non-pharyngealized cognates. We highlighted the differences between pharyngealized and non-pharyngealized phonemes, as well as between pharyngeals and pharyngealized phonemes. The articulation of the tongue was tracked by four sensors glued on the tongue. A corpus of Arabic words uttered by a male speaker was recorded with the AG500, labeled and analyzed. The main finding of this work is that the secondary articulation of moving the tongue back can be observed, while the main articulation of the tongue is the forward movement toward alveolar and dental positions. The dynamic observations showed that, in a pharyngealized context, backing of the tongue starts before the production of the pharyngealized phoneme or pharyngeal. This anticipatory coarticulation is in accordance with earlier studies. We also showed that the phoneme context has an influence on the backing of the tongue during the articulation of pharyngealized phonemes, and that there is a mutual influence between pharyngealized phonemes and pharyngeals [18].
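Referring back to the first item above, here is a minimal sketch of how model parameters could be fitted to EMA sensor coordinates; model_sensor_positions is a toy linear stand-in for the prediction of sensor positions from Maeda's parameters, and the sensor count and parameter bounds are indicative assumptions only.

```python
import numpy as np
from scipy.optimize import least_squares

N_PARAMS, N_SENSORS = 7, 6
rng = np.random.default_rng(0)
A = rng.normal(size=(2 * N_SENSORS, N_PARAMS))   # toy linear articulatory model

def model_sensor_positions(params):
    """Toy stand-in: predicted (x, y) midsagittal positions of the EMA sensor
    locations for a given parameter vector (the real model is Maeda's)."""
    return (A @ params).reshape(N_SENSORS, 2)

def fit_frame(ema_xy, p0=None):
    """Estimate the articulatory model parameters that best match one frame
    of EMA sensor coordinates (least squares in the midsagittal plane)."""
    p0 = np.zeros(N_PARAMS) if p0 is None else p0
    res = least_squares(lambda p: (model_sensor_positions(p) - ema_xy).ravel(),
                        p0, bounds=(-3.0, 3.0))   # indicative parameter bounds
    return res.x
```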
Labial coarticulation
We investigated the effect of “adverse” contexts, especially that of the consonant / / and of the “transconsonantal” vowel /i/, on labial parameters for French /i/ and /y/. Five parameters were analysed: the height, width and area of lip opening, the distance between the corners of the mouth, and lip protrusion. Ten speakers uttered a corpus made up of isolated vowels, syllables and logatoms. A special procedure was designed to evaluate lip opening contours.
Results showed that the carry-over effect of the consonant / / can have drastic consequences on lip protrusion for /i/, impeding, for about half of the speakers involved in this study, the distinction between /i/ and /y/ along this dimension. The (labial) opposition between these vowels was nevertheless ensured by other labial parameters, such as the height of lip opening. Results also highlighted large variations among speakers' coarticulatory habits. More experiments appear necessary to tease apart the various influences of consonantal contexts and transconsonantal vowels on the visual cues for vowels [21].
Speech synthesis
Text-To-Speech
This year, the development of the TTS software platform SoJA (Synthesis platfOrm in JAva) has continued (after two years of an INRIA Associate Engineer grant, 2006-2008). The corrections and improvements that were made turned out to be more numerous than expected, which explains the delay of the official release.
Meanwhile, we have studied and proposed two improvements to our synthesis method. As previously explained in 3.2.4, the originality of our approach is that unit selection is made from linguistic features without using a prosodic model. With a prosodic model, the selection can be constrained so as to guarantee a good prosodic behavior of the result. In our approach, the lack of strong constraints is an advantage for prosodic variability, but it can occasionally be a drawback because the selection may choose a unit without the right prosodic features (for instance a unit that is too short for its position in the sentence). Duration is important in French, because the perception of stress is partly conveyed by the lengthening of the last syllable of a word. A “mistake” in the length of a selected unit can completely detach the word from its accentuation and give an unnatural effect to the whole sentence. To slightly constrain target selection, we proposed to penalize a unit during the selection by comparing its length with the distribution of the duration of the phoneme (according to several significant positions in the sentence). This distribution was computed on the synthesis corpus itself so as to reflect the (durational) prosody of the speaker rather than that of a standard model. The incorporation of this new feature gives good preliminary results.
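As an illustration of this duration penalty, the sketch below (with hypothetical data structures and a simple z-score penalty, not the exact cost we use) estimates the duration distribution of each phoneme per position class on the synthesis corpus and penalizes candidate units proportionally to their deviation from it.

```python
import numpy as np
from collections import defaultdict

def duration_statistics(corpus_units):
    """Mean/std of duration per (phoneme, position-in-sentence) class,
    estimated on the synthesis corpus itself."""
    groups = defaultdict(list)
    for u in corpus_units:                         # u: dict with phoneme, position, duration
        groups[(u["phoneme"], u["position"])].append(u["duration"])
    return {k: (np.mean(v), np.std(v) + 1e-6) for k, v in groups.items()}

def duration_penalty(candidate, target_phoneme, target_position, stats, weight=1.0):
    """Penalty added to the target cost: grows with the deviation of the
    candidate's duration from the distribution observed for this class."""
    mean, std = stats[(target_phoneme, target_position)]
    z = abs(candidate["duration"] - mean) / std
    return weight * z
```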
The second improvement deals with the concatenation of units. Concatenation is performed with a slight smoothing by a standard, pitch-synchronized OverLap and Add (OLA) technique. To avoid the complexity of GCI (Glottal Closure Instant) detection algorithms, we chose to put pitch marks on important negative or positive peaks of the speech signal [44]. Unfortunately, the position of these marks is not consistent from one period to another period elsewhere in the audio corpus. This can result in a large phase mismatch at the unit concatenation point, and consequently in acoustic artifacts (perceived as a glottal sound) in the resulting signal. Experimentally, this mismatch occurs either with positive peaks or with negative peaks, but not with both kinds of peaks for the same period. To eliminate the largest potential mismatches, we proposed to slightly modify the concatenation by adding an algorithm based on the computation of correlations to choose the right kind of peak (positive or negative). This method removes large phase mismatches, and the effects of the slight remaining ones disappear with the OLA smoothing.
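The sketch below illustrates one possible correlation-based criterion for choosing the peak polarity at a concatenation point; the function names and the use of a single period on each side of the join are illustrative assumptions, not the exact algorithm.

```python
import numpy as np

def _corr_across_join(left, right, mark_left, mark_right, period):
    """Normalized correlation between the last period before the mark in the
    left unit and the first period after the mark in the right unit."""
    a = left[max(mark_left - period, 0):mark_left]
    b = right[mark_right:mark_right + period]
    n = min(len(a), len(b))
    if n == 0:
        return -np.inf
    a, b = a[-n:], b[:n]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def choose_peak_polarity(left, right, pos_left, pos_right, neg_left, neg_right, period):
    """Pick the pitch-mark polarity (positive or negative peaks) that maximizes
    waveform correlation across the concatenation point, so that the OLA
    smoothing does not have to absorb a large phase mismatch."""
    if _corr_across_join(left, right, pos_left, pos_right, period) >= \
       _corr_across_join(left, right, neg_left, neg_right, period):
        return "positive"
    return "negative"
```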
Acoustic-Visual synthesis
This year, we started our work on acoustic-visual speech synthesis within the framework of the ANR Jeunes Chercheurs project ViSAC. This new challenge is a natural extension of our work from purely acoustic speech synthesis to acoustic-visual speech synthesis. In addition, it provides a tight link with our ongoing work on speech production. Our main goal is to develop an acoustic-visual speech synthesis system based on bimodal unit concatenation. This is a new approach to text-to-acoustic-visual speech synthesis, which animates a 3D talking head while providing the associated acoustic speech. The main originality of this work is to consider the speech signal as bimodal (composed of two channels, acoustic and visual) and "viewable" from either facet, visual or acoustic. We keep this association throughout the different processing steps. The key advantage is to guarantee that the redundancy of the two facets of speech, acknowledged as a determining perceptual factor, is preserved. As this work is done in collaboration with the Magrit team, they started by setting up the acquisition system for acoustic and stereo-visual data (the main challenge being to perform real-time acquisition and synchronization of the acoustic and video streams). The Magrit team processed the visual data to provide 3D data. We carried out some test sessions to verify the quality of the recordings and to fine-tune the content of the corpus and the recording conditions (synchronization, noise, etc.). We also started studying the visual alignment of the corpus and how to keep it linked with the acoustic alignment, for which existing alignment methods were used. We expect to finish developing a visual alignment algorithm during the first quarter of next year.
Phonemic discrimination evaluation in language acquisition and in dyslexia and dysphasia
The evaluation of phonemic discrimination is based on the test specifically designed by [50] for her longitudinal study. 36 pairs of pseudowords, similar or different, were presented to the child, who had to say whether he heard the same item twice or not.
Concerning dyslexia and the normal acquisition of reading, a group study has been conducted. The 85 children of our population (age 5.6) were separated into a group "at risk" for dyslexia (39 children) and a control group (45 children). The results were analysed to characterize the performance pattern of these subjects as a group. Three different types of oppositions were examined (voicing, place of articulation, inversions and insertions). Statistical analyses have been conducted. Publications have been submitted.
Concerning dyslexia, a multiple case study has been conducted in collaboration with the CNRS (Paris-Descartes University, Savoie University and University Hospital Paris-Bicêtre). The results indicate that, among the 15 French-speaking dyslexics studied, a deficit of phonemic awareness is more prevalent than a deficit in short-term memory or in rapid naming, compared with reading-level controls. This research was supported by a grant from the ACI 'Cognitique' (COG 129, French Ministry of Research). A publication is in revision [62].
For dysphasia, a multiple case study was started in September 2007. 3 dysphasic children will be tested, matched with 3 children who simply show a reading delay. A student speech and language therapist, Margaud Martin, is working on this project.
Enhancement of esophageal voice
Detection of F0 in real-time for audio: application to pathological voices
The subject of Fadoua Bahja's PhD thesis is "Detection of F0 in real-time for audio: application to pathological voices". The first step toward this goal consisted in optimizing the CATE algorithm developed by Joseph Di Martino and Yves Laprie [45]. The CATE (Circular Autocorrelation of the Temporal Excitation) algorithm is based on the computation of the autocorrelation of the temporal excitation signal, which is extracted from the speech log-spectrum. We tested the performance of the parameters using the Bagshaw database, which consists of fifty sentences pronounced by a male and a female speaker. The reference signal was recorded simultaneously with a microphone and a laryngograph in an acoustically isolated room; these data are used to compute the reference pitch contour. Once the new optimal parameters of the CATE algorithm had been calculated, we carried out statistical tests with the C functions provided by Paul Bagshaw. We first studied the different steps implemented in the CATE algorithm, then tuned new parameters and tested various thresholds in order to find the most pertinent ones. Finally, we compared the results with those of the eSRPD method (Enhanced Super Resolution Pitch Determination) developed by P. Bagshaw in 1993. The results obtained are satisfactory. We have also undertaken a bibliographical study of the various existing methods.
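For illustration only, the sketch below shows the general principle of estimating F0 from the circular autocorrelation of an excitation signal derived from the log spectrum; it is a simplified toy version, not the CATE algorithm itself, whose liftering, thresholds and parameter settings are precisely the object of the optimization described above.

```python
import numpy as np

def f0_from_frame(frame, fs, fmin=60.0, fmax=400.0, env_ms=2.0):
    """Toy pitch estimator: remove the spectral envelope in the log domain,
    rebuild a zero-phase excitation signal, and pick the peak of its circular
    autocorrelation within the plausible F0 range."""
    n = len(frame)
    log_mag = np.log(np.abs(np.fft.rfft(frame * np.hanning(n))) + 1e-10)
    ceps = np.fft.irfft(log_mag, n)                       # real cepstrum
    cut = int(env_ms * 1e-3 * fs)                         # low-quefrency part = envelope
    ceps_env = ceps.copy()
    ceps_env[cut:n - cut] = 0.0
    env = np.fft.rfft(ceps_env, n).real
    excitation = np.fft.irfft(np.exp(log_mag - env), n)   # zero-phase excitation
    ac = np.fft.irfft(np.abs(np.fft.rfft(excitation)) ** 2, n)  # circular autocorrelation
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + np.argmax(ac[lo:hi]))
```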
Voice conversion techniques applied to pathological voice repair
The subject of Ammar Werghi's thesis is the improvement of the esophageal voice using voice conversion techniques. To do this, we need to implement techniques similar to, or better than, those described in the literature. Voice conversion is a technique that modifies a source speaker's speech so that it is perceived as if a target speaker had spoken it. One of the most commonly used techniques is conversion by GMM (Gaussian Mixture Model). This model, proposed by Stylianou [63], allows for an efficient statistical modeling of the acoustic space of a speaker. Let "x" be a sequence of vectors characterizing the spectrum of a sentence pronounced by the source speaker and "y" be a sequence of vectors describing the same sentence pronounced by the target speaker. The goal is to estimate a function F that transforms each source vector so that it is as close as possible to the corresponding target vector. In the literature, two methods using GMM models have been developed. In the first method (Stylianou), the GMM parameters are determined by minimizing a mean squared distance between the transformed vectors and the target vectors. In the second method [49], source and target vectors are combined in a single vector "z", and the parameters of the joint distribution of the source and target speakers are estimated using the EM algorithm. Contrary to these two well-known techniques, in our laboratory the transformation function F is computed statistically and directly from the data: no EM or LSM technique is needed. On the other hand, F is refined by an iterative process. The consequence of this strategy is that the estimation of F is robust and is obtained in a reasonable amount of time. The preliminary results obtained so far are quite promising.
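As an illustration of the joint-density GMM method [49] mentioned above (not of our own direct, iterative estimation of F), the sketch below trains a GMM on stacked source/target vectors and applies the usual minimum mean-square-error mapping; the component count and feature layout are arbitrary choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_joint_gmm(X_src, Y_tgt, n_components=32):
    """Fit a GMM on stacked [source; target] vectors aligned frame by frame."""
    Z = np.hstack([X_src, Y_tgt])                  # shape (n_frames, dx + dy)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(Z)
    return gmm

def convert(gmm, x, dx):
    """MMSE mapping of one source vector x:
    F(x) = sum_m p(m|x) * (mu_y^m + Sigma_yx^m Sigma_xx^m^-1 (x - mu_x^m))."""
    means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
    # Responsibilities p(m|x) under the marginal source model
    log_post = np.empty(len(weights))
    for m in range(len(weights)):
        mu_x, S_xx = means[m, :dx], covs[m][:dx, :dx]
        diff = x - mu_x
        _, logdet = np.linalg.slogdet(S_xx)
        log_post[m] = np.log(weights[m]) - 0.5 * (logdet + diff @ np.linalg.solve(S_xx, diff))
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Component-wise linear regressions weighted by the responsibilities
    y = np.zeros(means.shape[1] - dx)
    for m in range(len(weights)):
        mu_x, mu_y = means[m, :dx], means[m, dx:]
        S_xx, S_yx = covs[m][:dx, :dx], covs[m][dx:, :dx]
        y += post[m] * (mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
    return y
```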