Team MOSTRARE

Section: New Results

Machine learning for XML document transformations

Grammatical Inference

Keywords: tree automata, node selection queries.

Participants: Jérôme Champavère, Rémi Gilleron, Aurélien Lemay [correspondent], Joachim Niehren, Grégoire Laurence, Marc Tommasi.

Champavère, Gilleron, Lemay, and Niehren [15] investigate the induction of monadic node selecting queries from partially annotated XML trees. They show how to incorporate document schema information into existing algorithms for learning tree automata queries, such as RPNI-based learning algorithms. None of the alternative approaches to wrapper induction has included schema information so far, most probably because schemas cannot guide the learning of queries that are represented stochastically (rather than by tree automata). In our case, monadic queries are represented by pruning node selecting tree transducers. Since the target queries of the learning problem are subject to schema constraints, the idea is to avoid generalization errors in the learning process by taking the schema information into account. Compatible queries select answers only from documents that are consistent with the given schema. We have implemented the new learning algorithm with schema guidance. Experimental results for guidance by the schema of HTML are presented.

From the algorithmic perspective, the central problem of schema guided query induction is inclusion checking in deterministic tree automata or DTDs. Champavère, Gilleron, Lemay, and Niehren [14] present a new efficient algorithm for this inclusion problem. For testing language inclusion $L(A) \subseteq L(B)$ between tree automata, it operates in time $O(|A| \cdot |B|)$ if $B$ is deterministic. It can be applied to testing inclusion $L(A) \subseteq L(D)$ in deterministic DTDs $D$ in time $O(|A| \cdot |\Sigma| \cdot |D|)$, where $\Sigma$ is the signature. No previous algorithms with these complexities existed.
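To make the inclusion problem concrete, here is a minimal Python sketch of a naive fixpoint test for L(A) ⊆ L(B) on ranked trees with bottom-up automata. It is not the algorithm of [14] and does not achieve the bound stated above; it only illustrates what is being decided. The function name `included`, the rule encodings, and the requirement that A be trimmed (every state accessible and co-accessible) are assumptions made for this illustration.

```python
from itertools import product


def included(rules_a, final_a, rules_b, final_b):
    """Return True iff L(A) is a subset of L(B) (naive fixpoint sketch).

    rules_a: set of (symbol, child_states_tuple, state); A may be nondeterministic.
    rules_b: dict (symbol, child_states_tuple) -> state; B is deterministic
             and may be incomplete.
    Assumption: every state of A is accessible and co-accessible.
    """
    # reach[p] = B-states q such that some tree evaluates to p in A and q in B.
    reach = {}
    changed = True
    while changed:
        changed = False
        for sym, kids, p in rules_a:
            # Pair every child A-state with each B-state reached so far.
            options = [reach.get(k, set()) for k in kids]
            for qs in product(*options):
                q = rules_b.get((sym, qs))
                if q is None:
                    # B has no run on a tree that A can extend to an
                    # accepted tree (p is co-accessible by assumption).
                    return False
                if q not in reach.setdefault(p, set()):
                    reach[p].add(q)
                    changed = True
    # A accepts a tree on which the unique run of B ends in a non-final state?
    return all(q in final_b for p in final_a for q in reach.get(p, ()))
```

The early `return False` is what lets the sketch work without completing B first; it is sound only because A is assumed to be trimmed.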

Tellier studies connections between Categorial Grammars and Recursive Automata in [24]. She exhibits connections between the learning strategies "by specialisation" implemented in both contexts. This leads to a new interpretation of previous work on learning categorial grammars and provides a better understanding of when this strategy can be used effectively. This is the case when some kind of "bound" on the target is available, in the form of information derived from it by a morphism.

Tommasi, Lemay, Staworko and Niehren started the PhD project of Laurence on learning streaming tree transducers. The objective is to lift previous learning algorithms for node selection queries to transformations, in order to approach new applications in Web Data Exchange.

Statistical Inference

Keywords: probabilistic automata, conditional random fields, XML trees, tree labeling.

Participants: Edouard Gilbert, Florent Jousse, Lingbo Kong, Rémi Gilleron, Aurélien Lemay, Isabelle Tellier, Marc Tommasi [correspondent].

Gilbert, Gilleron, and Tommasi, in collaboration with Denis, Habrard and Ouardi, study probability distributions over free algebras of trees and show that such distributions can be defined by weighted tree automata or by tree series. They adapt the definitions to handle the case of unranked trees and define learning algorithms for probability distributions in this case. In [16] they show that any representation of a rational stochastic tree language can be transformed into a reduced normalized representation that can be used to generate trees from the underlying distribution. They also study some consistency properties of rational stochastic tree languages and discuss their implications for inference.
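As an illustration of how a weighted tree automaton assigns a value to a tree, the following Python sketch evaluates a ranked tree bottom-up, multiplying rule weights along a run and summing over runs. The function name `tree_weight`, the dictionary encodings, and the toy automaton in the usage note are assumptions chosen for this example; they are not the representation used in [16], and the unranked case is not covered.

```python
from collections import defaultdict


def tree_weight(tree, rules, final):
    """Weight assigned to `tree` by a bottom-up weighted tree automaton.

    tree:  (symbol, [children]) with children encoded the same way.
    rules: dict (symbol, child_states_tuple) -> {state: weight}.
    final: dict state -> final weight.
    """
    def run(node):
        sym, kids = node
        kid_vecs = [run(k) for k in kids]
        out = defaultdict(float)  # state -> total weight of runs ending there
        for (s, qs), targets in rules.items():
            if s != sym or len(qs) != len(kids):
                continue
            w = 1.0
            for vec, q in zip(kid_vecs, qs):
                w *= vec.get(q, 0.0)
            if w:
                for state, rule_weight in targets.items():
                    out[state] += w * rule_weight
        return out

    vec = run(tree)
    return sum(final.get(q, 0.0) * w for q, w in vec.items())
```

For instance, with a single state `q`, a leaf rule `("a", ())` of weight 0.6, a binary rule `("f", ("q", "q"))` of weight 0.4, and final weight 1.0 for `q`, the leaf `a` receives weight 0.6 and the tree `f(a, a)` receives 0.4 × 0.6 × 0.6.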

Tommasi and Gilleron, in collaboration with Senellart, Mittal and Muschick [26], propose an original approach to the automatic induction of wrappers for sources of the hidden Web that does not need any human supervision. This approach only needs domain knowledge expressed as a set of concept names and concept instances. There are two parts in extracting valuable data from hidden-Web sources: understanding the structure of a given HTML form and relating its fields to concepts of the domain, and understanding how the resulting records are represented in an HTML result page. For the former problem, they use a combination of heuristics and of probing with domain instances; for the latter, they apply a supervised machine learning technique adapted to tree-like information (XCRFs) to an automatic, imperfect, and imprecise annotation obtained from the domain knowledge. Some experiments demonstrate the validity and potential of the approach.

Laurence and Tellier have started experiments using XCRFs for natural language processing. In the context of the ANR MDCO "Crotal" (CRFs for TAL), Tellier and colleagues ran experiments using XCRFs on linguistic treebanks. The corpus used was the French treebank produced by the Paris 7 team, a set of 10,000 sentences extracted from the French newspaper "Le Monde", syntactically analyzed and tagged with functional labels (such as "SUJ", "OBJ", ...). The purpose was to evaluate whether these labels could be inferred from the syntactic structures alone by XCRFs.

Kong, Lemay and Gilleron proposed new algorithms for adapting keyword search to XML data. They proposed a framework for retrieving meaningful fragments in XML data and defined new filtering mechanisms in order to improve the quality of answers.
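For readers unfamiliar with keyword search over XML, the sketch below computes a common baseline notion of answer: the smallest elements whose subtree contains all query keywords (often called smallest lowest common ancestors). This is an assumption chosen for illustration and is not claimed to be the framework or the filtering mechanisms proposed by the authors. The function name `slca` and the whitespace tokenization are likewise hypothetical simplifications.

```python
import xml.etree.ElementTree as ET


def slca(root, keywords):
    """Return the smallest elements whose subtree text contains every keyword."""
    keywords = {k.lower() for k in keywords}
    results = []

    def visit(elem):
        # Keywords found in this element's own text (whitespace tokenization).
        found = keywords & set((elem.text or "").lower().split())
        child_hit = False  # does some descendant already contain all keywords?
        for child in elem:
            sub, hit = visit(child)
            found |= sub
            child_hit |= hit
            found |= keywords & set((child.tail or "").lower().split())
        full = found == keywords
        if full and not child_hit:
            results.append(elem)
        return found, full

    visit(root)
    return results
```

On a small bibliography document, a query whose keywords all occur inside one `book` element returns that element, while a query whose keywords are spread over two books returns their common ancestor.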

