Section: New Results
Improving the lexical coverage of statistical parsers
Participants : Marie Candito, Benoit Crabbé, Djamé Seddah, Enrique Henestroza Anguiano.
Probabilistic parsers are trained on treebanks, i.e. syntactically annotated sentences, and this training allows them to capture syntactic regularities. Yet, although lexical information is known to play a crucial role in determining the syntactic structure of a sentence, many lexical phenomena cannot be learned simply by training on a treebank of a few thousand sentences (the French treebank we use contains about 12,000 sentences). First, treebanks cover only a small part of the French vocabulary. Second, lexical data is very sparse: a corpus contains a few very frequent words and a great many rare words. This is even truer for French than for English, and more generally for inflected languages: morphological marks for gender, number, tense, etc. drastically increase the vocabulary size.
Word clustering
To cope with this inherent limitation of statistical parsing techniques, we have investigated the use of word clusters instead of words as input to the parser. Our work was inspired by [91], who showed that word clusters obtained with unsupervised techniques could improve statistical dependency parsing when used as features for classifiers determining the weights of dependency arcs. We tried to use word clusters within the framework of generative statistical parsing. We first defined an algorithm to remove morphological marks for gender, number, tense and mood, without resorting to part-of-speech tagging. It makes use of the Lefff lexicon [108] and allows us to cluster forms on a morphological basis while preserving the morpho-syntactic ambiguities of input words. We applied this process to the L'Est Républicain corpus, a 125-million-word journalistic corpus freely available at CNRTL (http://www.cnrtl.fr/corpus/estrepublicain). Then, we applied Brown's unsupervised word clustering algorithm [71] to the resulting corpus. Using the resulting word-to-cluster mapping, we were able to train a parser (using Petrov's algorithm) on a modified treebank in which word forms are replaced by their cluster. This led to a significant improvement in parsing performance [25] when tested on the part of the treebank held out as a test set. The method has two advantages. First, the reduction of the vocabulary size (to clusters) leads to better probability estimates, which explains the improvement on a test set taken from the treebank. Second, this reduced vocabulary (the set of clusters) corresponds in fact to an augmented set of word forms known at training time, so there are fewer totally unknown word forms at parsing time. This suggests that parsing performance should also be better when parsing text from a domain different from that of the treebank.
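The substitution step described above can be sketched in a few lines. This is a minimal illustration, not the team's actual implementation: the mapping file format (one cluster ID and word form per line, tab-separated, as produced by common Brown clustering implementations) and the `UNK` fallback for unseen forms are assumptions.

```python
# Sketch: replace word forms in a treebank sentence with Brown cluster IDs,
# so the parser is trained on a much smaller "vocabulary" of clusters.

UNKNOWN_CLUSTER = "UNK"  # hypothetical fallback for forms absent from the mapping


def load_cluster_mapping(lines):
    """Parse 'cluster<TAB>word' lines into a word -> cluster dict."""
    mapping = {}
    for line in lines:
        cluster, word = line.rstrip("\n").split("\t")[:2]
        mapping[word] = cluster
    return mapping


def clusterize_sentence(tokens, mapping):
    """Replace each token by its cluster ID (reducing vocabulary size)."""
    return [mapping.get(tok, UNKNOWN_CLUSTER) for tok in tokens]


# Toy mapping: two inflected forms of "manger" fall in the same cluster.
mapping = load_cluster_mapping([
    "0110\tmangerions",
    "0110\tmangeons",
    "1011\tmaison",
])
print(clusterize_sentence(["nous", "mangeons"], mapping))
# → ['UNK', '0110']
```

Because `mangerions` and `mangeons` share a cluster, the parser sees the same symbol for both, which is what makes the probability estimates less sparse.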
Data driven lemmatization
In conjunction with the team's work on word clustering, where the goal is to obtain better probability estimates (cf. previous section), we are also working on the integration of a lemmatization step into our parsing chain. Recall that lemmatization is the process of recovering the canonical form of a given word form (e.g. mangerions is lemmatized as manger); it is therefore a means of reducing data sparseness issues (common when working with very small amounts of annotated data). A collaboration between Grzegorz Chrupała, a postdoctoral researcher at Saarland University, and our team was initiated through an invitation to Djamé Seddah to spend a week working on the adaptation of Chrupała's state-of-the-art data-driven morphology learning tool (Morfette, [77]) to French. This was done by integrating Alpage's wide-coverage lexicon (the Lefff, [108]) into Morfette's training data. This fruitful collaboration led to the development of a Morfette module for French that exhibits the best results so far for French in both POS tagging and lemmatization. For POS tagging, for example, Morfette reaches a state-of-the-art overall accuracy of 97.88% (versus 97.70% for MElt [30]; note that the MElt models do not use lemmatization information during training); on unseen words, Morfette reaches 92.50% (versus 90.01% for MElt). We therefore expect that feeding one of our parsers a lemma+POS pair instead of a simple word form will help improve parsing results. Papers on this topic are in preparation for submission to ACL and COLING 2010.
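The lemma+POS substitution can be illustrated with a small sketch. The triples below stand in for a lemmatizer's output (e.g. Morfette's); their format, the `lemma|pos` token encoding, and the POS tag names are illustrative assumptions, not Morfette's actual interface.

```python
# Sketch: feed the parser 'lemma|POS' tokens instead of raw word forms,
# collapsing inflectional variants onto a single lemma.


def to_parser_input(analyses):
    """Map (form, lemma, pos) triples to 'lemma|pos' parser tokens."""
    return ["{}|{}".format(lemma, pos) for _, lemma, pos in analyses]


# Hypothetical lemmatizer output for "nous mangerions une pomme".
analyses = [
    ("nous", "nous", "CLS"),
    ("mangerions", "manger", "V"),  # conditional collapses onto the lemma
    ("une", "un", "DET"),
    ("pomme", "pomme", "NC"),
]
print(to_parser_input(analyses))
# → ['nous|CLS', 'manger|V', 'un|DET', 'pomme|NC']
```

Any inflected form of manger now maps to the same parser token, so statistics gathered on one form benefit all the others.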