## Section: Scientific Foundations

Keywords: grammatical inference, statistical learning, wrapper induction, tree annotations, tree transformations.

### Machine learning for `XML` document transformations

Information extraction requires automatic or semi-automatic tools for inferring tree transformations. Annotated examples may support the learning process. The learning targets are models of `XML` tree transformations specified in some of the languages discussed above.

**Grammatical inference** is commonly used to learn languages from examples and can be applied to learn transductions. Previous work on grammatical inference for transducers remains limited to the case of strings [32], [49]. For the tree case, only very basic tree transducers have been shown to be learnable so far, in previous work of the Mostrare project. These are node-selecting tree transducers (NSTTs), which preserve the structure of trees while relabeling their nodes deterministically.
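
To make the idea of a structure-preserving, deterministic relabeling concrete, here is a minimal Python sketch. It is an illustration only and not the NSTT formalism of the Mostrare project: the `Node` class, the top-down rule table `RULES`, and the state names are all hypothetical. Each node keeps its position in the tree and receives a Boolean annotation (`1` for selected, `0` otherwise).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical rule table: (current state, node label) -> (next state, selected?)
Rules = Dict[Tuple[str, str], Tuple[str, bool]]


def relabel(node: Node, rules: Rules, state: str = "q0") -> Node:
    """Return a tree of the same shape whose labels carry a 0/1 selection mark."""
    # If no rule applies, stay in the current state and do not select the node.
    next_state, selected = rules.get((state, node.label), (state, False))
    return Node(
        label=f"{node.label}:{int(selected)}",
        children=[relabel(child, rules, next_state) for child in node.children],
    )


def to_term(node: Node) -> str:
    """Render a tree as a term, e.g. a(b, c)."""
    if not node.children:
        return node.label
    return f"{node.label}({', '.join(to_term(c) for c in node.children)})"


# Assumed example query: select every `td` that occurs below a `table`.
RULES: Rules = {
    ("q0", "table"): ("in_table", False),
    ("in_table", "td"): ("in_table", True),
}

doc = Node("html", [Node("td"), Node("table", [Node("tr", [Node("td"), Node("td")])])])
out = relabel(doc, RULES)
print(to_term(out))
```

Because the rule table maps each (state, label) pair to at most one outcome, the relabeling is deterministic, and the output tree always has exactly the shape of the input tree.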

**Statistical inference** is most appropriate for dealing with uncertain or noisy data. It is generally useful for information extraction from textual data, given that current text understanding tools are still quite limited. `XML` transformations with noisy input data typically arise in data integration tasks, for instance when converting `PDF` into `XML`.

Stochastic tree transducers have been studied in the context of natural language processing [40], [42]. A set of pairs of input and output trees defines a relation that can be represented by a 2-tape automaton called a *stochastic finite-state transducer* (SFST). A major problem is estimating the parameters of such a transducer: SFST training algorithms are lacking so far [36].

Probabilistic context-free grammars (pCFGs) [46] are used in the context of `PDF` to `XML` conversion [33]. In a first step, a procedure is learned that labels the leaves of the input document with labels of the output DTD. In a second step, given a CFG as a generative model of output documents, the probabilities of its rules are learned. Such two-step approaches compete with one-step approaches that estimate conditional probabilities directly.
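
As an illustration of the second step, the sketch below estimates rule probabilities from a handful of example output trees by relative frequency, which is the standard maximum-likelihood estimator for a pCFG when the trees are fully observed. This is an assumption made for illustration: the source does not say which estimator is used in [33], and the tree encoding and the labels `doc`, `title`, `section`, and `para` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# A tree is a tuple (label, child_1, ..., child_n); a leaf is a 1-tuple (label,).
Tree = tuple
Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, labels of the children)


def productions(tree: Tree) -> Iterator[Rule]:
    """Yield one rule per internal node of the tree."""
    label, *children = tree
    if children:
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from productions(child)


def estimate_rule_probabilities(trees: Iterable[Tree]) -> Dict[Rule, float]:
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rule for tree in trees for rule in productions(tree))
    lhs_counts: Counter = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


# Hypothetical sample of output documents.
sample = [
    ("doc", ("title",), ("section", ("para",), ("para",))),
    ("doc", ("title",), ("section", ("para",))),
    ("doc", ("section", ("para",))),
]

probs = estimate_rule_probabilities(sample)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.3f}")
```

By construction, the estimated probabilities of all rules sharing a left-hand side sum to one, as a pCFG requires.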

A popular non-generative model for information extraction is *conditional random fields* (`CRF`, see the survey [50]). One main advantage of `CRF` is that they take long-distance dependencies in the observed data into account. `CRF` have been defined for general graphs but have mainly been applied to sequences; `CRF` for `XML` trees should therefore be investigated.
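
The sequence case can be sketched in a few lines of Python. The example below is a toy linear-chain model under stated assumptions: the label set, the feature templates, and the weights are hypothetical, and the normalization constant is computed by brute-force enumeration rather than by the dynamic programming used in practice. It shows how a feature may inspect any position of the observed sequence (here, its first token) while scoring a label at position `t`.

```python
import itertools
import math
from typing import Dict, Sequence, Tuple

LABELS = ("O", "NAME")  # hypothetical label set
Weights = Dict[Tuple[str, str, str], float]


def score(x: Sequence[str], y: Sequence[str], weights: Weights) -> float:
    """Unnormalized log-score: sum of weighted features over all positions."""
    total = 0.0
    prev = "START"
    for t, label in enumerate(y):
        total += weights.get(("trans", prev, label), 0.0)   # label transition
        total += weights.get(("word", x[t], label), 0.0)    # local observation
        total += weights.get(("first", x[0], label), 0.0)   # distant observation
        prev = label
    return total


def conditional_probability(x: Sequence[str], y: Sequence[str], weights: Weights) -> float:
    """p(y | x) = exp(score(x, y)) / sum over all labelings y' of exp(score(x, y'))."""
    z = sum(
        math.exp(score(x, candidate, weights))
        for candidate in itertools.product(LABELS, repeat=len(x))
    )
    return math.exp(score(x, tuple(y), weights)) / z


# Hypothetical weights and input.
weights: Weights = {
    ("word", "Smith", "NAME"): 2.0,
    ("first", "Dr", "NAME"): 0.5,
    ("trans", "NAME", "NAME"): 0.3,
}
x = ("Dr", "Smith", "spoke")
print(conditional_probability(x, ("O", "NAME", "O"), weights))
```

Because the model is normalized over whole label sequences conditioned on `x`, the features need not be independent of one another. Extending the same definition from chains to trees amounts to replacing the `prev` label with the labels of neighboring nodes in the tree.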

So-called *structured output* learning has recently become a research topic in machine learning [52], [51]. It aims at extending the classical categorization task, which consists in associating one or several labels with each input example, in order to handle structured output labels such as trees. The applicability of structured output learning algorithms remains to be assessed for real tasks such as `XML` transformations.