Section: New Results
XML specification and verification
Participants : Serge Abiteboul, Diego Figueira, Luc Segoufin, Cristina Sirangelo.
In general Dahu aims at making systems with data safer and more reliable. This means providing suitable models together with a toolbox for helping in the design and implementation of such systems.
This year we tackled several specific scenarios: models for describing and verifying incomplete information, models for exchanging data, models for distributed XML repositories, decision procedures for the query language XPath, and static analysis of dynamic XML systems.
XPath is arguably the most widely used XML query language: it is implemented in XSLT and XQuery and is a constituent part of several specification and update languages. Hence, to perform static analysis on a system manipulating XML data, it is important to master static analysis for XPath. Most of the important static analysis problems reduce to satisfiability checking: does a given query return a non-empty answer on some data? In general, satisfiability of XPath is undecidable; however, important fragments can be shown to be decidable. In [26] we have shown that when the navigational axes of XPath are restricted to the child and descendant relations, satisfiability can be decided in ExpTime. In [27] we have shown that for many other natural fragments of XPath, satisfiability is, if decidable, not primitive recursive.
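To make the satisfiability question concrete, the following minimal sketch evaluates a query that uses only the child and descendant axes on two small documents. The query, the documents, and the helper name `returns_nonempty` are illustrative assumptions, not taken from [26]. One witness document is enough to show that a query is satisfiable; failing on some documents proves nothing, which is why a dedicated decision procedure is needed.

```python
import xml.etree.ElementTree as ET

def returns_nonempty(query, document):
    """True if the query selects at least one node of the given document."""
    root = ET.fromstring(document)
    return len(root.findall(query)) > 0

# Hypothetical query using only the child (/) and descendant (//) axes:
# "some child labelled a has a descendant labelled b".
query = "./a//b"

witness = "<r><a><c><b/></c></a></r>"    # b occurs below a
non_witness = "<r><b/><a><c/></a></r>"   # b occurs, but not below a

satisfiable_on_witness = returns_nonempty(query, witness)
satisfiable_on_other = returns_nonempty(query, non_witness)
```

The sketch only evaluates a query on fixed documents; it does not search the infinite space of documents, and it ignores the data-value comparisons that make XPath satisfiability hard.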
Active XML is a high-level specification language tailored to data-intensive, distributed, dynamic Web services. Active XML is based on XML documents with embedded function calls. The state of a document evolves depending on the result of internal function calls (local computations) or external ones (interactions with users or other services). Function calls return documents that may themselves be active, and so may activate new sub-tasks. In [13], [12], we studied the verification of temporal properties of runs of Active XML systems, specified in a tree-pattern based temporal logic, Tree-LTL, that allows expressing a rich class of semantic properties of the application. The main results establish the boundary of decidability and the complexity of automatic verification of Tree-LTL properties.
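The evolution of a document with embedded function calls can be pictured with a small sketch. Everything named here is a hypothetical simplification: the `call` element, the `service` attribute, and the two services are invented for illustration and are not the Active XML syntax or the model studied in [13], [12]. Each step replaces one embedded call by the document it returns, and a returned document may itself contain a call.

```python
import xml.etree.ElementTree as ET

# Hypothetical services: each returns an XML fragment, possibly with a new call.
SERVICES = {
    "get_order": lambda: "<order><item>book</item><call service='get_price'/></order>",
    "get_price": lambda: "<price>12</price>",
}

def step(root):
    """Replace the first embedded call found by its result; False if none is left."""
    for parent in root.iter():
        for index, child in enumerate(list(parent)):
            if child.tag == "call":
                result = ET.fromstring(SERVICES[child.get("service")]())
                parent.remove(child)
                parent.insert(index, result)
                return True  # stop right after modifying the tree
    return False

def run(document):
    """Return the successive states of the document until no call remains."""
    root = ET.fromstring(document)
    states = [ET.tostring(root, encoding="unicode")]
    while step(root):
        states.append(ET.tostring(root, encoding="unicode"))
    return states

states = run("<shop><call service='get_order'/></shop>")
```

The list `states` is one run of this toy system. A temporal property such as "every order eventually carries a price" would be checked against the sequence of states; here the run is deterministic and finite, which real systems need not be.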
Towards a data-centric workflow approach, we introduced in [19] an artifact model to capture data and workflow management activities in distributed settings. As above, the model is built on Active XML. We argue that the model captures the essential features of service calls and the essential features of business artifacts as described informally by Nigam and Caswell in 2003. We also briefly consider the monitoring of distributed systems and the verification of temporal properties for them.
A distributed XML document is an XML document that spans several machines or Web repositories. We assume that a distribution design of the document tree is given, providing an XML tree some of whose leaves are "docking points", to which XML subtrees can be attached. These subtrees may be provided and controlled by peers at remote locations, or may correspond to the result of function calls, e.g., Web services. If a global type, e.g. a DTD, is specified for a distributed document T, it would be most desirable to be able to break this type into a collection of local types, called a local typing, such that the document satisfies the global type if and only if each peer (or function) satisfies its local type. In [21], we lay out the fundamentals of a theory of local typing and provide formal definitions for several variants of locality.
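The "if and only if" requirement on a local typing can be illustrated on a deliberately flattened example. The following sketch assumes a root whose children are a docking point `f1`, a fixed node `c`, and a docking point `f2`, with a DTD-style content model `a*cb*` for the root; peers supply words of labels rather than full subtrees. All names and the bounded check are assumptions made for illustration, not the definitions of [21].

```python
import re
from itertools import product

GLOBAL_TYPE = r"a*cb*"        # content model of the root, DTD-style
KERNEL = ("f1", "c", "f2")    # root children: docking point, fixed node, docking point
LOCAL_TYPING = {"f1": r"a*", "f2": r"b*"}

def assemble(parts):
    """Plug the word supplied by each peer into its docking point."""
    return "".join(parts.get(slot, slot) for slot in KERNEL)

def globally_valid(parts):
    return re.fullmatch(GLOBAL_TYPE, assemble(parts)) is not None

def locally_valid(parts, typing):
    return all(re.fullmatch(typing[p], w) is not None for p, w in parts.items())

def words(alphabet, max_len):
    """All words over the alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def agrees_up_to(typing, max_len=3):
    """Bounded check that local validity coincides with global validity."""
    return all(
        globally_valid({"f1": u, "f2": v}) == locally_valid({"f1": u, "f2": v}, typing)
        for u in words("abc", max_len)
        for v in words("abc", max_len)
    )

good = agrees_up_to(LOCAL_TYPING)
too_loose = agrees_up_to({"f1": r"(a|b)*", "f2": r"b*"})
```

The first typing agrees with the global type on every tested contribution, while the looser one lets `f1` send a `b` that breaks the global type. A bounded enumeration like this is only a sanity check, not a proof of locality.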
Data exchange between independent applications has been a central database application since the early development of database systems, and is now seeing renewed interest with XML, originally designed as an exchange language. The general problem is how to transfer data from a source database to a target database, structured according to different schemas, given a mapping between the two schemas. Two main semantics of data exchange exist in the literature: one based on the Open World Assumption (OWA), and another based on the Closed World Assumption (CWA) on target instances. We have studied the effect of introducing an explicit CWA/OWA annotation on target attributes of schema mappings, and we have formalized a corresponding mixed CWA/OWA semantics of data exchange. We have studied the complexity of answering queries over the set of all possible target solutions, establishing a complexity characterization based on the number of open attributes in schema mappings. We have also studied one of the main schema mapping operations, schema composition, for annotated schema mappings. We have shown that large classes of CWA schema mappings enjoy closure under composition. These results are surveyed in [31].
Key XML applications on the Web, such as data integration and exchange, make the presence of incomplete information unavoidable in XML data, due to the incompatibility of schemas and constraints among different sites. In [23] we have developed a general model of incomplete information in XML which, in analogy with its relational counterpart, is centered on the notion of null to represent missing information. However, the structure of XML documents is much more involved than that of relational databases, and missing information may occur not only in attribute values, but also in the tree structure of the documents. We have considered several models of incomplete information in XML and we have investigated how different features characterizing these models affect the complexity of some relevant computational problems, among them checking consistency of an incomplete representation and answering queries over it. As a result we have traced a boundary between tractability and intractability of these problems. In particular, this study allowed us to find a robust class of incomplete documents and queries that make query answering tractable.
Incomplete information can also be represented using probabilities. In addition to ordinary XML nodes, p-documents have distributional nodes that specify the possible worlds and their probability distribution. Particular families of p-documents are determined by the types of distributional nodes that can be used as well as by the structural constraints on the placement of those nodes in a p-document. Some of the resulting families provide natural extensions and combinations of previously studied probabilistic XML models. The expressive power of families of p-documents has been investigated in [15]. The evaluation of aggregate functions such as count, sum, avg, for probabilistic XML is the topic of [20].
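As a rough illustration of how distributional nodes induce possible worlds, the sketch below uses two kinds of distributional nodes chosen for this example: `ind`, whose children are kept independently of one another, and `mux`, which keeps at most one child. The tuple encoding, the function names, and the sample document are assumptions, and the brute-force enumeration is not the evaluation technique of [20].

```python
from fractions import Fraction as F
from itertools import product

def node(label, *children):
    return ("node", label, children)

def ind(*choices):   # each (probability, child) is kept independently
    return ("ind", choices)

def mux(*choices):   # at most one (probability, child) is kept
    return ("mux", choices)

def combine(option_lists):
    """Multiply probabilities and concatenate forests over all combinations."""
    for combo in product(*option_lists):
        p, forest = F(1), ()
        for q, f in combo:
            p, forest = p * q, forest + f
        yield p, forest

def worlds(n):
    """Yield (probability, forest) pairs; a forest is a tuple of ordinary trees."""
    if n[0] == "node":
        for p, forest in combine([list(worlds(c)) for c in n[2]]):
            yield p, (("node", n[1], forest),)
    elif n[0] == "mux":
        rest = F(1) - sum(q for q, _ in n[1])
        if rest > 0:
            yield rest, ()   # no child is chosen
        for q, child in n[1]:
            for p, f in worlds(child):
                yield q * p, f
    else:  # "ind"
        yield from combine(
            [[(1 - q, ())] + [(q * p, f) for p, f in worlds(child)] for q, child in n[1]]
        )

def count(forest, label):
    return sum((lab == label) + count(kids, label) for _, lab, kids in forest)

def expected_count(pdoc, label):
    return sum(p * count(forest, label) for p, forest in worlds(pdoc))

pdoc = node(
    "catalog",
    ind((F(1, 2), node("item")), (F(1, 4), node("item"))),
    mux((F(1, 3), node("item")), (F(1, 3), node("note"))),
)
expected_items = expected_count(pdoc, "item")
```

The enumeration lists one entry per combination of choices, so identical documents may appear more than once; this does not affect expected values. Since the number of combinations grows exponentially with the number of distributional nodes, the sketch is only suitable for tiny examples.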