Team Alpage

Section: New Results

State-of-the-art French tagging with MElt

Participants: Pascal Denis, Benoît Sagot, Djamé Seddah.

Pascal Denis and Benoît Sagot developed MElt, a new MaxEnt-based tagger, and trained it on the French TreeBank to build a tagger for French. This baseline, which uses no external lexical resource, can be significantly improved by coupling it with the French morphosyntactic lexicon Lefff. The resulting tagger, MElt fr, reaches 97.7% accuracy, which is, to the best of our knowledge, state-of-the-art for this task (i.e., tagging with no lemmatization information). More precisely, adding lexicon-based features yields error reductions of 23.3% overall and 27.5% for unknown words (corresponding to accuracy improvements of 0.7 and 3.9 percentage points, respectively) compared to the baseline tagger [30].
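The relation between the accuracy gains and the error reductions quoted above can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the 97.0% baseline accuracy is not stated in the text but inferred as 97.7% minus the reported 0.7-point gain. Because the reported figures are rounded, the results are approximate.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Relative error reduction (in %) between two accuracies given in %."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


def implied_baseline_error(gain_points, reduction_pct):
    """Baseline error rate (in %) implied by an accuracy gain (in points)
    and the corresponding relative error reduction (in %)."""
    return 100.0 * gain_points / reduction_pct


# Overall: baseline accuracy inferred as 97.7 - 0.7 = 97.0 (assumption).
overall_reduction = relative_error_reduction(97.0, 97.7)
print(round(overall_reduction, 1))  # 23.3, matching the reported figure

# Unknown words: only the gain (3.9 points) and the reduction (27.5%) are
# reported, so we can only recover an approximate baseline error rate.
unknown_baseline_err = implied_baseline_error(3.9, 27.5)
print(round(unknown_baseline_err, 1))  # 14.2
```

Under these assumptions, the overall error rate drops from about 3.0% to 2.3%, and the baseline error rate on unknown words comes out at roughly 14%.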

Pascal Denis and Benoît Sagot also showed that using a lexicon improves tagger quality at every stage of lexicon and training corpus development. They further produced approximate estimates of the development time for both resources, and showed that the best way to optimize human effort in tagger development is to develop both an annotated corpus and a morphosyntactic lexicon.

In addition, Djamé Seddah has initiated a collaboration with Grzegorz Chrupała (University of Saarbrücken, Germany), who independently proposed a system called Morfette [77]. Morfette is based on the same machine learning techniques as MElt, but exploits lemmatization information in the training data to improve tagging accuracy and to output lemmas in addition to tags. This collaboration should lead to joint work on MElt and Morfette aimed at improving tagging and lemmatization accuracy, applying these techniques to other languages (including resource-scarce languages), and studying how tagging and lemmatization affect parser performance when used as pre-processing steps.

