## Section: New Results

### XML transformation languages

#### Streaming for XML transformations

Participants: Alain Frisch, Keisuke Nakano [U. Tokyo].

The classical processing model for XML transformations consists of loading a textual input document into memory as a tree, processing it to produce a new tree, and then sending that tree to the output as a new textual document. This model makes poor use of three distinct resources: the input channel, the processing unit, and the output channel. By interleaving the three phases, one can start processing the document and producing the output while still parsing the input, and sometimes without having to keep the input entirely in memory. The advantages of this approach are evident when dealing with huge documents and/or with slow input or output channels (e.g. a transformation node in a network).
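The interleaving described above can be illustrated with a minimal OCaml sketch. It assumes a hypothetical event-based view of XML (the `event` type and the functions `rename`, `transform`, and `to_string` are illustrative names, not part of XStream) and uses the lazy sequences of the standard library so that each input event is read only when the corresponding output is demanded.

```ocaml
(* Hypothetical event-based view of an XML document (not XStream's API). *)
type event = Open of string | Text of string | Close of string

(* Rename every <b> element to <strong>, looking at one event at a time. *)
let rename = function
  | Open "b" -> Open "strong"
  | Close "b" -> Close "strong"
  | e -> e

(* Seq.map is lazy: an input event is read only when an output is demanded. *)
let transform (input : event Seq.t) : event Seq.t = Seq.map rename input

(* Naive serializer for the sketch: no escaping of special characters. *)
let to_string = function
  | Open tag -> "<" ^ tag ^ ">"
  | Text s -> s
  | Close tag -> "</" ^ tag ^ ">"

(* Record the order of reads (R) and writes (W) to observe the interleaving. *)
let trace = Buffer.create 16

let () =
  let input =
    List.to_seq [Open "p"; Open "b"; Text "hi"; Close "b"; Close "p"]
    |> Seq.map (fun e -> Buffer.add_char trace 'R'; e)
  in
  transform input
  |> Seq.iter (fun e -> Buffer.add_char trace 'W'; print_string (to_string e));
  print_newline ()
```

In this sketch the trace alternates reads and writes, so no more than one event is held in memory at a time. A tree-based implementation of the same renaming would perform all reads before the first write.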

Writing streaming transformations by hand means that the programmer has to deal explicitly with buffering. This is extremely tricky and error-prone. In particular, the same conceptual behavior can result in drastically different implementations depending on the "streaming context" in which it is used. Moreover, deciding whether a piece of a transformation can be evaluated in streaming mode or requires buffering cannot generally be done statically, because the answer can depend on dynamic properties (the input document). This adds to the complexity of hand-written code.
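As a small illustration of how the streaming context changes a hand-written implementation, the following OCaml sketch counts the `item` elements in a stream of sibling elements. The `event` type and the function names `count_at_end` and `count_at_start` are assumptions made for this example only. Emitting the count after the items needs a single counter, whereas emitting it before the items forces the whole input to be buffered.

```ocaml
(* Hypothetical event type for the sketch (not taken from any real library). *)
type event = Open of string | Text of string | Close of string

(* Count emitted AFTER the items: every event is forwarded as soon as it is
   read, and only an integer counter is kept in memory. *)
let count_at_end (emit : event -> unit) (input : event Seq.t) : unit =
  let n = ref 0 in
  Seq.iter
    (fun e ->
      (match e with Open "item" -> incr n | _ -> ());
      emit e)
    input;
  emit (Text (string_of_int !n))

(* Count emitted BEFORE the items: nothing can be written until the last
   event has been read, so the programmer must buffer the whole input. *)
let count_at_start (emit : event -> unit) (input : event Seq.t) : unit =
  let n = ref 0 in
  let buffer = Queue.create () in
  Seq.iter
    (fun e ->
      (match e with Open "item" -> incr n | _ -> ());
      Queue.add e buffer)
    input;
  emit (Text (string_of_int !n));
  Queue.iter emit buffer
```

Both functions express the same conceptual computation, yet their memory behavior and code structure differ. This is the kind of bookkeeping that a streaming-aware compiler is meant to take over from the programmer.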

There have been previous proposals for transformation languages that support streaming. They are all either limited in expressivity (e.g. to simple top-down transductions), overly conservative (requiring parts of the input to stay in memory when this is not necessary), or not automatic enough (requiring programmer-provided annotations).

We have designed and implemented a language, called XStream, that supports an efficient streaming evaluation model based on the theory of term rewriting. This language is purely functional and Turing-complete, and requires no annotations from the programmer. The language allows the programmer to express any computable tree transformation, including those that cannot be efficiently implemented in streaming; a "best effort" approach is taken to obtain streaming when possible, but without any optimality guarantee (optimal streaming is undecidable). Nevertheless, the compiler produces an optimal streaming behavior for all the examples we have tried. In practice, transformations compiled by XStream always compare well with widespread XML transformation technologies, and they are orders of magnitude more efficient when the transformation can be evaluated in streaming.

The theory behind the language and the techniques used in its implementation are described in a paper [23] to be presented at the PLAN-X 2007 workshop.

#### Extending Caml with XML types

Participant: Alain Frisch.

The type system of Objective Caml extends the original Hindley-Milner type system in various directions but keeps principal type inference based on first-order unification. In parallel, various type systems for manipulating XML documents in functional languages have been proposed recently. They usually build on an interpretation of types as sets of values, which induces a natural subtyping relation. Unlike Hindley-Milner type systems, they do not feature full type inference: function argument and return types have to be given by the programmer. Instead, they rely on tree automata algorithms to compute the subtyping relation and to propagate precise types through complex operations on XML documents such as regular expression pattern matching, deep iterations, and path navigation.

In order to facilitate the manipulation of XML documents in Caml, one may want to integrate some features from these domain-specific languages into Objective Caml. Alain Frisch developed a type system and a typing algorithm for a language that combines OCaml and CDuce. This system preserves nice theoretical and practical properties of ML and CDuce. It was implemented in OCamlDuce, a modified version of OCaml that embeds CDuce, by merging the OCaml and CDuce code bases and adding a relatively small amount of glue code.

The theoretical foundations of OCamlDuce's type system are described in two papers [21], [22] that were presented at the ICFP 2006 conference and at the PLAN-X 2006 workshop.

#### Exact type checking for XML transformations

Participants: Alain Frisch, Haruo Hosoya [U. Tokyo].

Type systems for programming languages are usually sound but incomplete: they reject some programs that would never cause type errors at run time. There are good reasons for this incompleteness: for most programming languages, exact (complete) typing is undecidable. However, this might not be the case for type systems for XML transformation languages, which usually rely on tree automata and regular tree languages to precisely constrain the structure of documents.

We are interested in importing results from the theory of tree transducers into programming languages for XML. There is a strong analogy between top-down tree transducers and functional programs (top-down traversal of values through pattern matching and mutually recursive functions). There is a rich literature about tree transducers. Many of the existing formalisms enjoy a property of exact type-checking: given two regular tree languages interpreted as input and output constraints, it is possible to decide without any approximation whether a given tree transducer is sound with respect to this specification.
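The analogy can be made concrete with a minimal OCaml sketch, written under assumptions of our own choosing: binary trees with string labels, a two-state top-down transducer expressed as the mutually recursive functions `even` and `odd`, and an output type given by a small deterministic bottom-up automaton `run`. All names and the alphabet are illustrative. The sketch only tests individual outputs at run time; exact type-checking would instead decide statically that every output of `even` belongs to the output type.

```ocaml
(* Binary trees over string labels (illustrative alphabet). *)
type tree = Leaf | Node of string * tree * tree

(* A top-down tree transducer with two states, one function per state:
   nodes at even depth are relabelled "a", nodes at odd depth "b". *)
let rec even = function
  | Leaf -> Leaf
  | Node (_, l, r) -> Node ("a", odd l, odd r)

and odd = function
  | Leaf -> Leaf
  | Node (_, l, r) -> Node ("b", even l, even r)

(* States of a deterministic bottom-up automaton for the output type:
   "a" nodes only have "b" or leaf children, and vice versa. *)
type state = QLeaf | QA | QB | QErr

(* Compute the state reached at the root, from the leaves upwards. *)
let rec run = function
  | Leaf -> QLeaf
  | Node (label, l, r) ->
    (match label, run l, run r with
     | "a", (QLeaf | QB), (QLeaf | QB) -> QA
     | "b", (QLeaf | QA), (QLeaf | QA) -> QB
     | _ -> QErr)

(* Accepting states: a leaf, or a well-formed tree rooted in "a". *)
let in_output_type t =
  match run t with
  | QLeaf | QA -> true
  | QB | QErr -> false
```

For instance, `in_output_type (even t)` holds for every input `t`, while `odd` applied to a non-leaf tree produces a tree rooted in "b", which the automaton rejects.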

There are several challenges to address if we want to integrate such techniques into programming languages: the design of new programming features and paradigms, the design of type systems based on tree transducers, and the extension of algorithms to deal with additional features not found in simple tree transducers.

The long-term objective is to design programming languages for XML with exact type-checking. This implies that no well-typed program is rejected even without any type annotations; that very precise errors can be produced for ill-typed programs; and that polymorphism comes for free. Exactness of type-checking cannot be obtained for a Turing-complete language; we need to introduce explicit typing approximations (presented as type annotations) to deal with that.

One reason for the relative lack of interest in tree transducer techniques within the programming language community is that most of the problems are EXPTIME-complete and the algorithms are quite complex. We believe that a proper reformulation of the algorithms will allow us to define interesting classes of transformations that support efficient type-checking and to experiment with original implementation techniques.

We are particularly interested in the formalism of so-called macro-tree transducers, which directly capture the essence of top-down functional transformations with accumulators. We have obtained a new backward type-inference algorithm for this kind of transducer. From a deterministic bottom-up tree automaton describing the output type, this algorithm produces, in polynomial time, an alternating tree automaton that represents all the valid input trees. (Alternating tree automata can have both conjunctive and disjunctive transitions, which makes them exponentially more succinct than ordinary tree automata.) The type-checking problem then reduces to checking emptiness of alternating tree automata, which is a DEXPTIME-complete problem in general, although some algorithms are efficient in many common situations. In particular, we established that a transducer that traverses the input tree a bounded number of times yields an alternating tree automaton whose emptiness can be checked in polynomial time. Most transducers that appear in practice satisfy this condition. We also started to experiment with efficient implementations of emptiness algorithms for alternating tree automata, in the hope of producing a usable type-checking tool for macro-tree transducers. In contrast, the classical algorithm produces a (non-alternating) tree automaton whose size is always exponential, even for simple transducers. This algorithm can be reinterpreted as the composition of our new algorithm and a classical (exponential) translation from alternating tree automata to tree automata.
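To give an idea of what a macro-tree transducer looks like when read as a functional program, here is a minimal OCaml sketch. The types `tree` and `out` and the functions `flat` and `leaves` are hypothetical names chosen for this illustration. The single state `flat` carries one accumulating parameter `acc`, which is the feature that distinguishes macro-tree transducers from plain top-down transducers. The sketch shows the formalism only, not the backward type-inference algorithm.

```ocaml
(* Input trees: binary trees with labelled leaves (illustrative). *)
type tree = Leaf of string | Node of tree * tree

(* Output trees: right-nested lists of labels. *)
type out = Nil | Cons of string * out

(* One state [flat] with one accumulating parameter [acc].
   Rules of the transducer, read as rewrite rules:
     flat (Leaf a,      acc) -> Cons (a, acc)
     flat (Node (l, r), acc) -> flat (l, flat (r, acc))
   The accumulator holds the output already built for everything to the
   right of the current subtree. *)
let rec flat (t : tree) (acc : out) : out =
  match t with
  | Leaf a -> Cons (a, acc)
  | Node (l, r) -> flat l (flat r acc)

(* Initial call: start with an empty accumulator. *)
let leaves (t : tree) : out = flat t Nil
```

In this particular sketch each input node is visited exactly once, which appears to be in the spirit of the bounded-traversal condition mentioned above; the precise definition of that condition is the one given in the authors' work, not this example.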