## Section: Scientific Foundations

### Machine learning for `XML` document transformations

Participants: Jérôme Champavère, Jean Decoster, Jean-Baptiste Faddoul, Édouard Gilbert, Rémi Gilleron, Grégoire Laurence, Aurélien Lemay, Joachim Niehren, Sławek Staworko, Marc Tommasi, Fabien Torre.

Automatic or semi-automatic tools for inferring tree transformations
are needed for information extraction. Annotated examples may support
the learning process. The learning target will be models of `XML` tree
transformations specified in some of the languages discussed above.

**Grammatical inference** is commonly used to learn languages
from examples and can be applied to learn transductions. Previous work
on grammatical inference for transducers remains limited to the case
of strings [29], [45]. For the tree
case, only very basic tree transducers have so far been shown to be
learnable, in previous work of the Mostrare project. These are
node-selecting tree transducers (NSTTs), which preserve the structure of
trees while relabeling their nodes deterministically.
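
To make the notion of a structure-preserving, deterministic relabeling concrete, here is a minimal sketch. It is not the NSTT formalism of the Mostrare project: the top-down state passing, the rule table `RULES`, the state names and the example document are all assumptions chosen for illustration. Trees are encoded as `(label, children)` pairs, and each node is relabeled with a Boolean that marks it as selected or not.

```python
# Hypothetical transition table: (state, input label) -> (selected?, state for children).
# It selects every `name` node that occurs below an `author` node.
RULES = {
    ("outside", "author"): (False, "inside"),
    ("inside", "name"): (True, "inside"),
}


def relabel(tree, state="outside"):
    """Relabel each node with (label, selected) without changing the tree shape."""
    label, children = tree
    # Deterministic: at most one rule applies to a given (state, label) pair.
    # When no rule applies, the node is not selected and the state is unchanged.
    selected, child_state = RULES.get((state, label), (False, state))
    return ((label, selected), [relabel(child, child_state) for child in children])


def shape(tree):
    """Return the tree with all labels erased, to compare structures."""
    _, children = tree
    return [shape(child) for child in children]


# Hypothetical input document.
doc = ("book", [("title", []), ("author", [("name", [])]), ("name", [])])
annotated = relabel(doc)
```

In this sketch only the `name` node below `author` is marked as selected, and `shape(annotated) == shape(doc)` holds because the transducer only rewrites labels.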

**Statistical inference** is most appropriate for dealing with
uncertain or noisy data. It is generally useful for information
extraction from textual data, given that current text understanding
tools are still quite limited. `XML` transformations with noisy
input data typically arise in data integration tasks, for instance
when converting `PDF` into `XML`.

Stochastic tree transducers have been studied in the context of
natural language processing [37], [39].
A set of pairs of input and output trees defines a relation that can
be represented by a 2-tape automaton called a *stochastic
finite-state transducer* (SFST). A major problem is
estimating the parameters of such a transducer. Training algorithms
for SFSTs are still lacking [33].

Probabilistic context-free grammars (pCFGs)
[43] are used in the context of `PDF` to `XML` conversion [30]. In the first step, a
procedure is learned that labels the leaves of the input document with
labels of the output DTD. In the second step, given a CFG as a generative
model of output documents, the probabilities of its rules are learned.
Such two-step approaches compete with one-step approaches that estimate
conditional probabilities directly.
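
As an illustration of the second step, the sketch below estimates rule probabilities of a pCFG by relative frequency from a set of output trees. This is the standard maximum-likelihood estimate for fully observed trees and is not necessarily the procedure used in [30]. The function names, the `(label, children)` tree encoding and the three toy documents are assumptions.

```python
from collections import Counter
from fractions import Fraction


def productions(tree):
    """Yield one (lhs, rhs) rule per internal node of a (label, children) tree."""
    label, children = tree
    if children:
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from productions(child)


def estimate_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rule for tree in trees for rule in productions(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: Fraction(n, lhs_counts[rule[0]]) for rule, n in rule_counts.items()}


# Hypothetical output documents, standing in for trees that conform to the output DTD.
trees = [
    ("article", [("title", []), ("section", [("para", [])])]),
    ("article", [("title", []), ("section", [("para", []), ("para", [])])]),
    ("article", [("section", [("para", [])])]),
]
probs = estimate_pcfg(trees)
```

Here the rule `article -> title section` occurs in two of the three `article` expansions, so its estimated probability is 2/3. Exact fractions are used so that the probabilities of all rules sharing a left-hand side sum to exactly 1.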

A popular non-generative model for information extraction is
*conditional random fields* (`CRF`; see the survey
[46]). One main advantage of `CRF` is that they take
into account long-distance dependencies in the observed data. `CRF` have been defined for general graphs but have mainly been applied to
sequences; `CRF` for `XML` trees should therefore be investigated.
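
For reference, the conditional distribution of a linear-chain `CRF` over a label sequence $y = (y_1, \dots, y_T)$ given an observation $x$ is usually written as follows. The symbols $f_k$ (feature functions), $\lambda_k$ (weights) and $Z(x)$ (normalization constant) are notation introduced here for illustration, and $y_0$ denotes a fixed start label.

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
$$

Each feature $f_k$ may inspect the whole observation $x$, which is how long-distance dependencies in the observed data enter the model. One possible adaptation to trees, stated here only as an assumption, would replace the sum over consecutive positions with a sum over the parent-child edges of the tree.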

So-called *structured output* learning has recently become a research
topic in machine learning
[48], [47]. It aims
at extending the classical categorization task, which consists in
associating one or several labels with each input example, in order to handle
structured output labels such as trees. The applicability of structured
output learning algorithms to real tasks such as `XML` transformations
remains to be assessed.