Comparing various models for statistical parsing

Participants : Djamé Seddah, Marie Candito, Benoit Crabbé.

In parallel with the effort to reduce the data sparseness issues that arise from small treebanks with rich morphology, we are also experimenting with various parsing models for the French Treebank. This work started in the fall of 2008 and is still ongoing. It involved adapting Charniak's parser [74] to a Romance language, which had never been done before, as well as adapting various instances of Collins' model [78] and two types of Stochastic Tree Insertion Grammar [76], one of which is a very promising formalism (spinal-STIG, see [48]). For the latter, the idea is to treat sequences of unary branching rules as tree fragments (called spines) instead of viewing the trees as sets of CFG rules. For one instance of the French Treebank, the resulting grammar is very compact: it is made of 83 unlexicalized spines instead of 14 000 CFG trees for the same treebank. This topic is attracting attention in the parsing community: using a similar formalism, [73] achieved state-of-the-art results on parsing the WSJ. Given this activity, one can argue that a paradigm shift is under way in the parsing community: working in a horizontal (CFG) way leads to data sparseness, whereas switching to vertical grammars (spines) means working with very compact grammars. A preliminary paper was submitted last November [48].
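The contrast between horizontal (CFG) and vertical (spine) grammars can be illustrated with a small sketch. It is not the extraction procedure used in [48]: the tree encoding, the hand-marked head positions, the two toy sentences and the function names are all assumptions made for this example. A spine is taken here to be the chain of labels obtained by following head children from a node down to a part-of-speech tag.

```python
# Toy illustration only: the trees, the head marking and the helper names
# below are hypothetical and do not come from the French Treebank.
# A node is (label, head_index, children); a preterminal has one word child.

def pre(tag, word):
    """Build a preterminal node (part-of-speech tag over a word)."""
    return (tag, 0, [word])

def is_preterminal(node):
    _, _, children = node
    return len(children) == 1 and isinstance(children[0], str)

def collect_cfg_rules(node, rules):
    """Add every unlexicalized CFG rule (parent -> child labels) to `rules`."""
    if is_preterminal(node):
        return
    label, _, children = node
    rules.add((label, tuple(child[0] for child in children)))
    for child in children:
        collect_cfg_rules(child, rules)

def collect_spines(node, spines):
    """Add every unlexicalized spine (head path down to a tag) to `spines`."""
    path = []
    current = node
    while True:
        label, head, children = current
        path.append(label)
        if is_preterminal(current):
            break
        # Each non-head child starts a spine of its own.
        for i, child in enumerate(children):
            if i != head:
                collect_spines(child, spines)
        current = children[head]
    spines.add(tuple(path))

# Two toy sentences; the head index of each node is marked by hand (assumption).
trees = [
    ("S", 1, [("NP", 1, [pre("D", "le"), pre("N", "chat")]),
              ("VP", 0, [pre("V", "dort")])]),
    ("S", 1, [("NP", 1, [pre("D", "le"), pre("N", "chien")]),
              ("VP", 0, [pre("V", "mange"),
                         ("NP", 1, [pre("D", "la"), pre("N", "pomme")])])]),
]

rules, spines = set(), set()
for tree in trees:
    collect_cfg_rules(tree, rules)
    collect_spines(tree, spines)

print(len(rules), "CFG rules:", sorted(rules))
print(len(spines), "spines:", sorted(spines))
```

In this sketch the two verb phrases yield two distinct CFG rules (with and without an object) but share a single spine, which gives a small-scale intuition for why a spine inventory can stay compact as the treebank grows. The figures obtained on such a toy input say nothing about the counts reported above.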


