Section: New Results
Optimized reduction of probabilized shared parse forests
Participants : Pierre Boullier, Benoît Sagot.
- PCFG (Probabilistic Context-Free Grammar): a Context-Free Grammar (CFG) with a probability associated with each production.
Collaboration with Alexis Nasr (LIF, Université de Marseille-Provence), within the ANR-funded project SEQUOIA (see 8.1.2).
The output of a CFG parser, such as the parsers created with Syntax, is a shared parse forest: an acyclic graph that represents all the syntactic parses of the parsed sentence. Such a graph can represent a number of parses that is exponential in the length of the sentence within a structure of cubic size. Therefore, when probabilistic information is associated with the rules of the CFG (Probabilistic CFG, PCFG), it is necessary to extract from the forest the n most likely parses with respect to the PCFG. Standard state-of-the-art algorithms that extract the n best parses (Huang 2005) produce a collection of trees: they lose the factorization achieved by the parser and duplicate identical sub-trees across several parses. This situation is unsatisfactory, since post-parsing processes (such as reranking) cannot take advantage of the factorization and repeat the same work on common sub-trees. One way to solve the problem is to prune the forest by eliminating sub-forests that do not contribute to any of the n most likely trees. Such techniques usually over-generate: the pruned forest contains more than the n most likely trees.
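The over-generation of pruning can be illustrated on a toy example. The sketch below is only an illustration under stated assumptions: the forest encoding (a dictionary from nodes to weighted hyperedges), the toy probabilities and the function names `count_trees`, `all_trees` and `prune_to_n_best` are all hypothetical, and the n best trees are found by brute-force enumeration rather than by the algorithm of (Huang 2005) or by anything implemented in Syntax.

```python
import heapq
from itertools import product
from math import prod

# Hypothetical toy forest: node -> list of (rule probability, children).
# Symbols that are not keys of the dict are terminals.
FOREST = {
    "S": [(1.0, ("A", "B"))],
    "A": [(0.6, ("a1",)), (0.4, ("a2",))],
    "B": [(0.7, ("b1",)), (0.3, ("b2",))],
}


def count_trees(forest, node, memo=None):
    """Number of trees below `node`, computed without enumerating them."""
    memo = {} if memo is None else memo
    if node not in forest:  # terminal: exactly one (trivial) tree
        return 1
    if node not in memo:
        memo[node] = sum(
            prod(count_trees(forest, c, memo) for c in children)
            for _, children in forest[node]
        )
    return memo[node]


def all_trees(forest, node):
    """Brute force: yield (probability, set of (node, edge index)) per tree."""
    if node not in forest:
        yield 1.0, frozenset()
        return
    for i, (p, children) in enumerate(forest[node]):
        for combo in product(*(list(all_trees(forest, c)) for c in children)):
            prob = p * prod(q for q, _ in combo)
            used = frozenset({(node, i)}).union(*(u for _, u in combo))
            yield prob, used


def prune_to_n_best(forest, root, n):
    """Keep only the hyperedges used by at least one of the n best trees."""
    best = heapq.nlargest(n, all_trees(forest, root), key=lambda t: t[0])
    keep = set().union(*(used for _, used in best))
    pruned = {}
    for node, edges in forest.items():
        kept = [e for i, e in enumerate(edges) if (node, i) in keep]
        if kept:
            pruned[node] = kept
    return pruned


print(count_trees(FOREST, "S"))                            # 4
print(count_trees(prune_to_n_best(FOREST, "S", 3), "S"))   # 4, although n = 3
```

In this toy forest, the three best trees together use every hyperedge, so pruning for n = 3 removes nothing and the fourth tree survives in the pruned forest.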
The new direction that we have explored since 2008 is the production of shared forests that contain exactly the n most likely trees, avoiding both the explicit construction of n separate trees and the over-generation of pruning techniques. This process can be seen as a forest transduction, which is applied to a forest and produces another forest. The transduction applies local transformations to the structure of the forest, expanding parts of the forest when necessary. Unless n is very small, the resulting forest is generally larger than the input forest, even though it contains fewer trees. We developed two types of algorithms for building such a forest containing exactly n trees, both of which try to minimize its size.
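To make the idea of a forest containing exactly n trees concrete, here is a deliberately naive sketch. It is not one of the two algorithms mentioned above, whose details are not given here, and it makes no attempt to minimize the size of the result. It assumes a hypothetical encoding of the forest as a dictionary from nodes to weighted hyperedges, and it splits each node into ranked copies `(symbol, rank)`, each deriving only its rank-th best sub-tree, so that identical sub-trees remain shared between the n best parses. All names (`FOREST`, `n_best`, `exact_n_forest`, `count_trees`) are illustrative.

```python
from itertools import product
from math import prod

# Hypothetical toy forest: node -> list of (rule probability, children).
# Symbols that are not keys of the dict are terminals.
FOREST = {
    "S": [(1.0, ("A", "B"))],
    "A": [(0.6, ("a1",)), (0.4, ("a2",))],
    "B": [(0.7, ("b1",)), (0.3, ("b2",))],
}


def count_trees(forest, node, memo=None):
    """Number of trees below `node`, computed without enumerating them."""
    memo = {} if memo is None else memo
    if node not in forest:
        return 1
    if node not in memo:
        memo[node] = sum(
            prod(count_trees(forest, c, memo) for c in children)
            for _, children in forest[node]
        )
    return memo[node]


def n_best(forest, node, n, memo):
    """n best derivations of `node` as (prob, edge index, child ranks)."""
    if node not in memo:
        cands = []
        for i, (p, children) in enumerate(forest[node]):
            # Per child: (probability, rank) of each of its n best sub-trees;
            # a terminal has a single sub-tree of probability 1.
            options = [
                [(d[0], r) for r, d in enumerate(n_best(forest, c, n, memo))]
                if c in forest else [(1.0, 0)]
                for c in children
            ]
            for combo in product(*options):
                prob = p * prod(q for q, _ in combo)
                cands.append((prob, i, tuple(r for _, r in combo)))
        memo[node] = sorted(cands, key=lambda c: -c[0])[:n]
    return memo[node]


def exact_n_forest(forest, root, n):
    """Forest whose trees are exactly the n best trees of `forest`."""
    memo, out = {}, {}

    def ranked(sym, rank):
        # Copy of `sym` that derives only its rank-th best sub-tree.
        if sym not in forest:
            return sym
        if (sym, rank) not in out:
            out[(sym, rank)] = [expand(sym, n_best(forest, sym, n, memo)[rank])]
        return (sym, rank)

    def expand(sym, derivation):
        _, i, ranks = derivation
        p, children = forest[sym][i]
        return p, tuple(ranked(c, r) for c, r in zip(children, ranks))

    out[root] = [expand(root, d) for d in n_best(forest, root, n, memo)]
    return out


result = exact_n_forest(FOREST, "S", 3)
print(count_trees(result, "S"))                    # 3 trees exactly
print(sum(len(e) for e in FOREST.values()))        # 5 hyperedges in the input
print(sum(len(e) for e in result.values()))        # 7 hyperedges in the output
```

Even on this tiny input, the output forest has more hyperedges than the input while containing fewer trees, which is the size trade-off described above.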
These algorithms have been integrated into the Syntax system, which made it possible to obtain quantitative results: in general, for reasonable values of n (say, 100), the size of the resulting forest has the same order of magnitude as that of the pruned forest, but it contains only the n best trees.