## Section: New Results

### Machine learning for XML document transformations

#### Grammatical Inference

Participants : Jérôme Champavère, Jean Decoster, Rémi Gilleron, Grégoire Laurence, Aurélien Lemay, Joachim Niehren, Sławek Staworko, Marc Tommasi, Fabien Torre.

The PhD thesis of J. Champavère on schema-guided query induction, directed by Niehren, Lemay, and Gilleron, will be submitted at the beginning of 2010. It presents query learning algorithms based on grammatical inference. Schema guidance is based on new efficient inclusion algorithms for tree languages defined by deterministic tree automata or XML schemas [12]. In particular, the authors show how to translate XML schemas defined by various classes of EDTDs to bottom-up or top-down deterministic tree automata, based on the Curried or firstchild-nextsibling encoding of unranked trees into ranked trees.
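
As an illustration of the firstchild-nextsibling encoding mentioned above, the following is a minimal sketch in Python. The class names `Unranked` and `Binary` and the helpers `fcns` and `show` are hypothetical names chosen for this example; they are not taken from [12]. The sketch only shows how an unranked tree can be turned into a binary (ranked) tree, not the inclusion algorithms themselves.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Unranked:
    """Unranked tree node: a label with any number of children (hypothetical type)."""
    label: str
    children: List["Unranked"] = field(default_factory=list)


@dataclass
class Binary:
    """Binary tree node: left encodes the first child, right the next sibling."""
    label: str
    left: Optional["Binary"] = None
    right: Optional["Binary"] = None


def fcns(hedge: List[Unranked]) -> Optional[Binary]:
    """Encode a sequence of sibling trees as one binary tree (None for the empty sequence)."""
    if not hedge:
        return None
    head, *rest = hedge
    return Binary(head.label, fcns(head.children), fcns(rest))


def show(t: Optional[Binary]) -> str:
    """Print a binary tree as a term, writing '#' for a missing subtree."""
    if t is None:
        return "#"
    return f"{t.label}({show(t.left)},{show(t.right)})"


# Unranked tree a(b, c(d)) and its encoding.
doc = Unranked("a", [Unranked("b"), Unranked("c", [Unranked("d")])])
encoded = show(fcns([doc]))
print(encoded)  # a(b(#,c(d(#,#),#)),#)
```

In the encoded tree every node has exactly two positions, so automata for ranked trees can be run on it directly.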

Kong (a postdoc in 2008), Lemay, and Gilleron study keyword search for XML [19]. They designed an improvement of the MaxMatch algorithm, called Relaxed Tightest Fragment (RTF). RTF is a representation of the results of a keyword search, together with a query evaluation algorithm. Experiments showed that RTF is more precise than the MaxMatch algorithm, in the sense that it discards more irrelevant nodes while keeping more nodes relevant for the query.

Torre with Terlutte from Grappa reconsider classes of languages that are learnable from positive examples alone [23]. They introduce so-called rational languages with k-disjoint residuals, and show for every k that the family of languages with k-disjoint residuals subsumes the family of k-reversible languages, known to be learnable from positive examples only. It is also shown that the union over all k of the rational languages with k-disjoint residuals is the set of all rational languages. Finally, for each k, the corresponding family can be identified in polynomial space and time from positive examples only, when represented by a DFA.

In [24], they present a general framework for supervised classification based on so-called *most general generalizations*. They show that defining such generalizations offers, at no further cost, the opportunity to apply algorithms for supervised learning. The authors show how this generic framework can be used for grammatical inference and for classification. The interest of the method is confirmed by experiments.

#### Statistical Inference

Participants : Jean Decoster, Marc Tommasi, Fabien Torre.

Faddoul, Gilleron and Torre, in collaboration with Chidlovskii (Xerox Grenoble), study applications of machine learning to the task of labeling documents and authors in social networks. The aim is to propose new algorithms for this task, using relational representations for documents and networks, semi-supervised and multi-task techniques, and label propagation methods.
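
To make the notion of label propagation concrete, here is a minimal sketch in Python of one textbook variant: labeled nodes keep their score, and each unlabeled node repeatedly takes the average score of its neighbors. This is an assumption-laden illustration, not the algorithms under study; the function name `propagate_labels`, the graph, and the seed scores are all made up for the example.

```python
def propagate_labels(neighbors, seeds, n_iter=50):
    """Propagate binary label scores over an undirected graph.

    neighbors: dict mapping each node to the list of its neighbors.
    seeds: dict mapping labeled nodes to a score in [0, 1] (kept fixed).
    Returns a dict mapping every node to its score after n_iter rounds.
    """
    # Unlabeled nodes start at the neutral score 0.5.
    scores = {v: seeds.get(v, 0.5) for v in neighbors}
    for _ in range(n_iter):
        updated = {}
        for v, nbrs in neighbors.items():
            if v in seeds:
                updated[v] = seeds[v]  # clamp labeled nodes
            elif nbrs:
                updated[v] = sum(scores[u] for u in nbrs) / len(nbrs)
            else:
                updated[v] = scores[v]  # isolated node: nothing to propagate
        scores = updated
    return scores


# Hypothetical chain of four authors a - b - c - d, with a and d labeled.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = propagate_labels(graph, seeds={"a": 1.0, "d": 0.0})
print(scores)
```

On this chain the unlabeled node closer to `a` ends with a score above 0.5 and the one closer to `d` with a score below 0.5, which is the intended smoothing effect.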

Gilleron and Torre started the PhD project of Decoster in October.
They investigate the use of Inductive Logic Programming in order
to automatically classify or transform `XML` trees.
Inductive Logic Programming is a machine learning technique
that aims to learn logic programs from examples.

Gilbert, Gilleron and Tommasi continue their work on Tree Series and Weighted Tree Automata as part of the PhD project of Gilbert. Their work has focused on convergence problems for Tree Series and on extending the DEES learning algorithm to tagging tasks. Their collaboration with the LIF Marseille continues and has resulted in a new algorithm for Weighted Tree Automata inference based on Principal Component Analysis.
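
For readers unfamiliar with weighted tree automata, the following Python sketch shows how such an automaton assigns a real number to a tree, which is what makes it a representation of a tree series. It does not implement DEES or the PCA-based inference algorithm. The encoding of trees as nested tuples, the transition table format, and the names `state_weights` and `series_value` are assumptions made for this example.

```python
from itertools import product


def state_weights(tree, transitions, n_states):
    """Return, for each state q, the weight with which the automaton reaches q on tree.

    tree: nested tuple (label, child_1, ..., child_k).
    transitions: dict mapping (label, (q_1, ..., q_k), q) to a real weight;
    missing entries count as weight 0.
    """
    label, *children = tree
    child_vecs = [state_weights(c, transitions, n_states) for c in children]
    vec = [0.0] * n_states
    for q in range(n_states):
        # Sum over all assignments of states to the children.
        for qs in product(range(n_states), repeat=len(children)):
            w = transitions.get((label, qs, q), 0.0)
            for child_vec, child_state in zip(child_vecs, qs):
                w *= child_vec[child_state]
            vec[q] += w
    return vec


def series_value(tree, transitions, final, n_states):
    """Value of the tree series on tree: final weights applied to the root vector."""
    vec = state_weights(tree, transitions, n_states)
    return sum(f * v for f, v in zip(final, vec))


# Hypothetical two-state automaton whose series counts the leaves labeled 'a'.
# State 0 carries weight 1 on every tree; state 1 accumulates the count.
transitions = {
    ("a", (), 0): 1.0, ("a", (), 1): 1.0,
    ("b", (), 0): 1.0,
    ("f", (0, 0), 0): 1.0,
    ("f", (1, 0), 1): 1.0,
    ("f", (0, 1), 1): 1.0,
}
final = [0.0, 1.0]
tree = ("f", ("a",), ("f", ("b",), ("a",)))
value = series_value(tree, transitions, final, n_states=2)
print(value)  # 2.0
```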

Laurence, Lemay, Niehren, Staworko and Tommasi continue their work on learning tree transducers, as part of the PhD project of Laurence. They have recently devised an algorithm for learning top-down deterministic tree-to-word transducers from examples. This should make it possible to infer transformations between documents in different XML schemata.
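
The kind of object being learned can be illustrated with a small Python sketch of how a top-down deterministic tree-to-word transducer is applied to an input tree. The learning algorithm itself is not shown. The rule format (a dictionary from a state and a label to a sequence of output strings and calls on children), the function name `transduce`, and the example rules are hypothetical choices for this illustration.

```python
def transduce(state, tree, rules):
    """Apply a top-down deterministic tree-to-word transducer.

    tree: nested tuple (label, child_1, ..., child_k).
    rules: dict mapping (state, label) to a list whose items are either
    output strings or pairs (next_state, child_index). Determinism means
    there is at most one rule for each (state, label) pair.
    """
    label, *children = tree
    output = []
    for item in rules[(state, label)]:
        if isinstance(item, str):
            output.append(item)  # emit a fixed piece of the output word
        else:
            next_state, i = item
            output.append(transduce(next_state, children[i], rules))
    return "".join(output)


# Hypothetical transducer that swaps the two children of 'name'.
rules = {
    ("q", "name"): [("q", 1), ", ", ("q", 0)],
    ("q", "first"): ["FIRST"],
    ("q", "last"): ["LAST"],
}
word = transduce("q", ("name", ("first",), ("last",)), rules)
print(word)  # LAST, FIRST
```

Because the output is a word rather than a tree, such a transducer can reorder, copy, or drop the contents of subtrees while producing a serialized document.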