Section: Scientific Foundations
Treebank development and exploitation
Participants : Benoit Crabbé, Marie Candito, Éric Villemonte de La Clergerie.
- Treebank
A treebank is a set of sentences whose syntactic analysis has been performed manually (it is called a “treebank” because, in most cases, these analyses are represented as trees, be they constituency or dependency trees).
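To make the two kinds of trees concrete, here is a minimal sketch of one toy sentence encoded both ways. The sentence, the category labels, the relation names and the helper functions `leaves` and `roots` are illustrative assumptions; they do not reproduce the annotation scheme of any particular treebank.

```python
# Toy sentence "Le chat dort" ("The cat sleeps"); all labels are illustrative.

# Constituency analysis: nested (label, child, ...) tuples; leaves are words.
constituency = (
    "S",
    ("NP", ("DET", "Le"), ("N", "chat")),
    ("VP", ("V", "dort")),
)

# Dependency analysis: one (form, head_index, relation) triple per token.
# Tokens are numbered from 1; head_index 0 marks the root of the sentence.
dependency = [
    ("Le", 2, "det"),
    ("chat", 3, "suj"),
    ("dort", 0, "root"),
]


def leaves(tree):
    """Return the words of a constituency tree in left-to-right order."""
    _label, *children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def roots(dep):
    """Return the forms of all tokens attached to the artificial root (head 0)."""
    return [form for form, head, _rel in dep if head == 0]


print(leaves(constituency))
print(roots(dependency))
```

Both encodings cover the same words; they differ in whether structure is expressed by nested phrases or by word-to-word links.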
At the international level, the last decade has seen the emergence of a very strong trend of research on statistical methods in NLP. This trend has several causes, but one of them, in particular for English, is the availability of large annotated corpora, such as the Penn Treebank (1M words extracted from the Wall Street Journal, with syntactic annotations) or the British National Corpus (100M words covering various styles, annotated with parts of speech). Such annotated corpora are very valuable for extracting stochastic grammars or for setting the parameters of disambiguation algorithms.
These successes have led to many similar corpus annotation proposals. A long (but non-exhaustive) list may be found on the internet (http://www.ims.uni-stuttgart.de/projekte/TIGER/related/links.shtml); it mostly includes resources for languages other than French, apart from the French Treebank, developed in Anne Abeillé's team at University Paris 7 [57].
However, the development of such treebanks is very costly in terms of human labour and represents a long-standing effort. The volume of data that can be manually annotated remains limited and is generally not sufficient to learn very rich information (sparse data phenomena). Furthermore, designing an annotated corpus involves choices that may block future experiments aimed at acquiring new kinds of linguistic knowledge. Last but not least, it is worth mentioning that even manually annotated corpora are not error-free.
Hence, two directions are investigated by Alpage members, and will be of increasing importance. First, Alpage members are working actively on the exploitation of the French Treebank for developing probabilistic parsers.
Second, a bootstrapping approach is also investigated, where corpora can be parsed by many different parsing systems, so as to automatically build a consensual treebank which can reach a very large size (typically 100 million words); such a treebank (or the parsing results from individual parsers) can be used to acquire linguistic information so as to enrich lexica, leading to better parsers. This has been achieved at Alpage, for example, thanks to error mining techniques applied to parsing results, and the PASSAGE ANR project, led by Éric de La Clergerie, applies this bootstrapping approach at a national level [117]. Such an approach leads to resources and parsers that co-evolve in a virtuous circle: resources are used by tools on corpora to improve the resources and prepare the next generation of resources (by adding richer information). These are first steps towards the definition of generic learning algorithms that do not rely on costly manually annotated corpora.
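As a rough illustration of how a consensual analysis might be derived from several parsers, the sketch below applies per-token majority voting on dependency heads. This is only an assumed combination scheme for illustration, not a description of the method actually used at Alpage or in PASSAGE; the function name `consensus_heads`, the `min_agreement` parameter and the toy parser outputs are hypothetical.

```python
from collections import Counter


def consensus_heads(parses, min_agreement=2):
    """Combine the head predictions of several parsers for one sentence.

    parses: list of head lists, one per parser; parses[p][i] is the head
            index (0 = root) that parser p assigns to token i + 1.
    Returns one head per token, or None where fewer than `min_agreement`
    parsers agree. The result is not guaranteed to be a well-formed tree.
    """
    if not parses:
        raise ValueError("at least one parse is required")
    length = len(parses[0])
    if any(len(parse) != length for parse in parses):
        raise ValueError("all parses must cover the same tokens")

    combined = []
    for token_heads in zip(*parses):
        # Most frequent head proposed for this token, with its vote count.
        head, votes = Counter(token_heads).most_common(1)[0]
        combined.append(head if votes >= min_agreement else None)
    return combined


# Hypothetical outputs of three parsers on a three-token sentence.
parser_outputs = [
    [2, 3, 0],
    [2, 3, 0],
    [3, 3, 0],
]
print(consensus_heads(parser_outputs))
```

Tokens left as `None` are the ones on which the parsers disagree; in a bootstrapping setting such positions are natural candidates for closer inspection of the lexicon or grammar.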
Nevertheless, members of Alpage are involved in the Rhapsodie ANR project (see 8.1.4). One of the tasks of this project, coordinated by Sylvain Kahane, is to develop a dependency treebank for a small corpus of spoken French (3 hours, i.e. 36,000 words). The corpus, orthographically transcribed, is manually segmented by linguists into rectional units, within which words are linked by dependency relations. These units will be parsed by the Alpage team. A difficulty comes from the fact that, due to disfluencies, reformulations, and so on, rectional units are not disjoint, and the syntactic trees we obtain must be patched up. This is a first step towards the parsing of spoken language. The next step would be to see how to obtain the segmentation into rectional units automatically.