Team Alpage


Section: Scientific Foundations

Probabilistic parsing approaches

Participants: Marie Candito, Benoît Crabbé, Pascal Denis, Djamé Seddah, Benoît Sagot, Pierre Boullier.

The development of large-scale symbolic grammars has long been a lively topic in the French NLP community. Surprisingly, the acquisition of probabilistic grammars for stochastic parsing, whether by supervised or unsupervised methods, has attracted little attention despite the availability of large manually annotated syntactic data for French. Nevertheless, the availability of the Paris 7 French Treebank [57] allowed [86] to extract a Tree Adjoining Grammar [89] and led [58] to induce the first effective lexicalized parser for French. Yet, as noted by [112], the use of the treebank was “challenging”: before any experiment could succeed, the authors had to deeply restructure the data to remove errors and inconsistencies.

On the other hand, [79] showed that, with a newly released and corrected version of the treebank, it was possible to train statistical parsers from the original set of trees. This approach improves reproducibility and eases verification of reported results.

Before going further, it is important to describe the characteristics of the parsing task. In statistical parsing, two different aspects of syntactic structures must be considered: their capacity to capture regularities and their interpretability for further processing.


Learning for statistical parsing requires structures that best capture the underlying regularities of the language, so that these patterns can be applied to unseen data.

Since capturing underlying linguistic rules is also an objective for linguists, it makes sense to use supervised learning from linguistically defined generalizations. A typical generalization is the use of phrases, and of phrase-structure rules that govern how words are grouped together. It should be stressed that these syntactic rules exist, at least in part, independently of semantic interpretation.
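To make the idea of phrase-structure generalizations concrete, the sketch below induces a toy probabilistic context-free grammar from two hand-written bracketed trees using NLTK. The trees and category labels (SENT, NP, VN, etc.) are illustrative examples in the spirit of the French Treebank annotation, not actual treebank data:

```python
from nltk import Tree
from nltk.grammar import Nonterminal, induce_pcfg

# Two toy French trees in bracketed format (illustrative only,
# not taken from the French Treebank).
trees = [
    Tree.fromstring("(SENT (NP (DET le) (NC chat)) (VN (V dort)))"),
    Tree.fromstring("(SENT (NP (DET la) (NC souris)) (VN (V dort)))"),
]

# Collect the phrase-structure productions observed in the trees...
productions = []
for t in trees:
    productions += t.productions()

# ...and estimate rule probabilities by relative frequency.
grammar = induce_pcfg(Nonterminal("SENT"), productions)
for prod in grammar.productions():
    print(prod)
```

Rules shared by both trees (e.g. SENT -> NP VN) receive probability 1.0, while the two determiner rewrites (DET -> le, DET -> la) each receive 0.5, showing how regularities across trees become reusable probabilistic rules.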


But the main reason to use supervised learning for parsing is that we want structures that are as interpretable as possible, in order to extract knowledge from the analysis (such as deriving a semantic analysis from a parse). Typically, we need a syntactic analysis to reflect how words relate to each other. This is our main motivation for using supervised learning: the learnt parser outputs structures as defined by linguist annotators, and these structures are thus interpretable within the linguistic theory underlying the annotation scheme of the treebank. It is important to stress that this goes beyond capturing syntactic regularities: it has to do with the meaning of the words.

It is not certain, though, that both requirements (generalizability and interpretability) are best met by the same structures. In the case of supervised learning, this leads us to investigate different instantiations of the training trees that help learning while preserving as much of the trees' interpretability as possible. As some of our experiments show, it may be necessary to find a trade-off between generalizability and interpretability.

Further, it is not guaranteed that syntactic rules inferred from a manually annotated treebank produce the best language model. This leads to methods that apply semi-supervised techniques to a treebank-inferred grammar backbone, such as [94], [104].

Alpage's statistical parsing architecture

In order to build a statistical parser for French, we started by exploring the state of the art in statistical parsing technology for other languages. However, since much of the work in statistical parsing has been carried out by English-speaking teams, most publicly available parsers are specifically tuned for English, reflecting the field's current practice of training and evaluating mostly on the Wall Street Journal sections of the Penn Treebank.

That is why we decided to adapt to French the state-of-the-art parsers of the two phrase-structure parsing paradigms: lexicalized and unlexicalized parsing. We found that, in order to get the best performance from our annotated data, the annotation scheme had to be modified to include important morpho-syntactic information [79], [3]. This led the unlexicalized parser we adapted ([104], [79]) to achieve the best performance for French so far. Meanwhile, since we are working with a rather small data set, we explored various means (lemmatization, clustering) of reducing the data sparseness issues caused by a somewhat small lexicon. Working with various phrase-structure parsers naturally led us to design a data-driven phrase-structure-to-dependency conversion process that remains generic regardless of the parser being used. Overall, this architecture exhibits state-of-the-art results, even though recent work carried out in our team, for the sake of thoroughness, on adapting a purely statistical dependency parser to French shows that a model à la McDonald [96] yields a slight improvement over our main architecture [28].
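Phrase-structure-to-dependency conversion is typically driven by head-percolation rules. The sketch below is a minimal, hypothetical illustration of the general technique: a small head table selects one head child per phrase, and every non-head sibling becomes a dependent of that head. The table entries and category labels are invented for the example and are not Alpage's actual conversion rules:

```python
from nltk import Tree

# Hypothetical head-percolation table: for each phrase label, the
# preferred head-child categories, searched in order.
HEAD_TABLE = {
    "SENT": ["VN", "NP"],
    "NP": ["NC", "N"],
    "VN": ["V"],
}

def head_index(tree):
    """Index of the head child, according to the percolation table."""
    labels = [child.label() for child in tree]
    for cat in HEAD_TABLE.get(tree.label(), []):
        if cat in labels:
            return labels.index(cat)
    return 0  # fallback: leftmost child

def to_dependencies(tree, deps=None):
    """Return (lexical head, arcs) for a tree whose preterminals
    dominate single words; arcs are (dependent, head) pairs."""
    if deps is None:
        deps = []
    if isinstance(tree[0], str):      # preterminal: its word is the head
        return tree[0], deps
    child_heads = [to_dependencies(child, deps)[0] for child in tree]
    head = child_heads[head_index(tree)]
    for other in child_heads:
        if other != head:             # non-head children depend on the head
            deps.append((other, head))
    return head, deps

t = Tree.fromstring("(SENT (NP (DET le) (NC chat)) (VN (V dort)))")
root, arcs = to_dependencies(t)
print(root, arcs)
```

On this toy tree the procedure yields "dort" as the root, with "chat" depending on "dort" and "le" on "chat", which is the kind of generic, parser-independent conversion the architecture relies on.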

