## Section: Research Program

### Scientific Foundations

We now detail some of the scientific foundations of our research on complex data management. This is also the occasion to review connections between data management, especially the management of complex data that is the focus of Valda, and related research areas.

#### Complexity & Logic

Data management has been connected to logic since the advent of the relational model as the main representation system for real-world data, and of first-order logic as the logical core of database query languages [37]. Since these early developments, logic has also been successfully used to capture a large variety of query modes, such as data aggregation [63], recursive queries (Datalog), and the querying of XML databases [46]. Logical formalisms facilitate reasoning about the expressiveness of a query language and about its complexity.

The main problem of interest in data management is that of query
evaluation, i.e., computing the results of a query over a database.
The complexity of this problem has far-reaching consequences.
For example, it is because first-order logic is in the $\mathrm{AC}^0$
complexity class that the evaluation of SQL queries can be parallelized
efficiently. It is usual [74] in data management to distinguish
*data complexity*, where the query is
considered to be fixed, from *combined complexity*, where both the
query and the data are considered to be part of the input. For instance,
conjunctive queries, which correspond to a simple SELECT-FROM-WHERE fragment
of SQL, have PTIME data complexity but are NP-hard in combined
complexity. This distinction matters because data is often
far larger (up to the order of terabytes) than queries (rarely more than
a few hundred bytes). Beyond simple query evaluation, a central question
in data management remains
that of complexity; tools from algorithm analysis
and complexity theory can be used to pinpoint the tractability frontier
of data management tasks.
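As a toy illustration (our own sketch, not part of any particular system), naive conjunctive query evaluation can be written in a few lines of Python. Its running time is polynomial in the database size once the query is fixed, but exponential in the number of query atoms, mirroring the PTIME data complexity versus NP-hard combined complexity of conjunctive queries:

```python
from itertools import product

def eval_cq(database, atoms, head_vars):
    """Naively evaluate a conjunctive query.

    `database` maps relation names to sets of tuples; `atoms` is a list
    of (relation, variables) pairs; `head_vars` are the output variables.
    Running time is O(|db|^|atoms|): polynomial for a fixed query (data
    complexity), exponential in the query size (combined complexity).
    """
    results = set()
    # Try every way of matching each atom against a tuple of its relation.
    for choice in product(*(database[rel] for rel, _ in atoms)):
        binding = {}
        consistent = True
        for (rel, variables), tup in zip(atoms, choice):
            for var, val in zip(variables, tup):
                # setdefault binds a fresh variable, or returns its old value.
                if binding.setdefault(var, val) != val:
                    consistent = False
                    break
            if not consistent:
                break
        if consistent:
            results.add(tuple(binding[v] for v in head_vars))
    return results

# Q(x, z) :- R(x, y), R(y, z)  -- endpoints of paths of length 2
db = {"R": {(1, 2), (2, 3), (2, 4)}}
print(eval_cq(db, [("R", ("x", "y")), ("R", ("y", "z"))], ("x", "z")))
```

Practical engines avoid this exponential enumeration through join ordering and indexing, but the asymmetry between data and query size is exactly why data complexity is the measure of choice.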

#### Automata Theory

Automata theory and formal languages arise as important
components of the study of many data management tasks: in temporal
databases [36], queries, expressed in temporal
logics, can often be compiled to automata; in graph
databases [42], queries are naturally given as
automata; typical query and schema languages for XML databases such as
XPath and XML Schema
can be compiled to tree automata [67], or for more
complex languages to data tree
automata [4]. Another
reason for the importance of automata theory, and of tree automata in
particular, comes from Courcelle's results [50],
which show that very expressive queries (expressed in monadic
second-order logic) can be evaluated as tree automata over *tree
decompositions* of the original databases, yielding linear-time
algorithms (in data complexity) for a wide variety of applications.
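The bottom-up evaluation underlying such results can be illustrated by a minimal deterministic bottom-up tree automaton in Python (the and/or evaluation automaton below is our own toy example, not taken from the cited works); a single pass over the tree assigns a state to every node, in time linear in the tree size:

```python
def run_tree_automaton(delta, leaf_states, tree):
    """Run a deterministic bottom-up tree automaton on a binary tree.

    `tree` is either a leaf label or a triple (label, left, right);
    `leaf_states` maps leaf labels to states; `delta` maps
    (label, left_state, right_state) to a state. One state is computed
    per node, hence linear time in the size of the tree.
    """
    if not isinstance(tree, tuple):
        return leaf_states[tree]
    label, left, right = tree
    return delta[(label,
                  run_tree_automaton(delta, leaf_states, left),
                  run_tree_automaton(delta, leaf_states, right))]

# Automaton states are Booleans; it evaluates and/or trees over 0/1 leaves.
leaf = {"0": False, "1": True}
delta = {("and", a, b): a and b for a in (True, False) for b in (True, False)}
delta.update({("or", a, b): a or b for a in (True, False) for b in (True, False)})

tree = ("or", ("and", "1", "0"), "1")
print(run_tree_automaton(delta, leaf, tree))  # True
```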

#### Verification

Complex data management also has connections
to verification and static analysis. Besides query evaluation, a central
problem in data management is that of deciding whether two queries are
*equivalent* [37]. This is critical
for query optimization, in order to determine
if a rewriting of a query, possibly cheaper to evaluate, will return
the same result as the original query. Equivalence can easily be seen to
be an instance of the problem of (non-)satisfiability: $q \equiv q'$ if
and only if $(q \wedge \neg q') \vee (\neg q \wedge q')$ is not satisfiable.
In other words, some aspects of query optimization are static analysis
issues.
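For propositional formulas, this reduction can be checked by brute force. The following Python sketch (a toy illustration, not a query optimizer) tests the unsatisfiability of $(q \wedge \neg q') \vee (\neg q \wedge q')$ by enumerating all truth assignments:

```python
from itertools import product

def equivalent(q1, q2, variables):
    """Decide q1 ≡ q2 by checking that (q1 ∧ ¬q2) ∨ (¬q1 ∧ q2) is
    unsatisfiable, i.e. the formulas agree on every assignment.
    `q1` and `q2` map an assignment dict to a Boolean."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        a, b = q1(env), q2(env)
        if (a and not b) or (not a and b):
            return False  # found a satisfying assignment of the difference
    return True

# x ∧ (x ∨ y) simplifies to x (absorption law), so the two are equivalent.
q_long  = lambda e: e["x"] and (e["x"] or e["y"])
q_short = lambda e: e["x"]
print(equivalent(q_long, q_short, ["x", "y"]))  # True
```

Real query optimizers of course do not enumerate assignments; for conjunctive queries, equivalence is instead decided through homomorphisms between the queries, which is where the NP-hardness mentioned above reappears.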
Verification is also a critical part of any database application where it is
important to ensure that some property will never (or always) arise
[48].

#### Workflows

The orchestration of distributed activities (under the responsibility of a conductor) and their choreography (when they are fully autonomous) are complex issues that are essential for a wide range of data management applications, including, notably, e-commerce systems, business processes, health care, and scientific workflows. The difficulty is to guarantee consistency or, more generally, quality of service, and to statically verify critical properties of the system. Different approaches to workflow specification exist: automata-based, logic-based, or predicate-based control of function calls [34].
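An automata-based specification can be sketched as follows (the e-commerce workflow below is a hypothetical example of our own): states are workflow stages, activities are letters, and a run of the system is valid exactly when the automaton accepts it:

```python
def valid_run(transitions, start, accepting, run):
    """Check a workflow run against an automaton-based specification.

    `transitions` maps (state, activity) to a next state; a run is
    valid iff every activity is allowed from the current state and the
    run ends in an accepting state."""
    state = start
    for activity in run:
        if (state, activity) not in transitions:
            return False  # activity not permitted at this stage
        state = transitions[(state, activity)]
    return state in accepting

# Toy e-commerce workflow: an order must be paid before it is shipped.
spec = {("init", "order"): "ordered",
        ("ordered", "pay"): "paid",
        ("paid", "ship"): "shipped"}

print(valid_run(spec, "init", {"shipped"}, ["order", "pay", "ship"]))  # True
print(valid_run(spec, "init", {"shipped"}, ["order", "ship"]))         # False
```

Static verification then amounts to reasoning about all runs of the specification at once, rather than checking one run after the fact as above.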

#### Probability & Provenance

To deal with the uncertainty attached to data, proper models need to
be used (such as attaching
*provenance* information to data items
and viewing the whole database as being
*probabilistic*) and
practical methods and systems need to be developed to both reliably
estimate the uncertainty in data items and properly manage provenance
and uncertainty information throughout a long, complex system.

The simplest model of data uncertainty is the NULLs of SQL databases,
also called Codd tables [37]. This
representation system is too basic for any complex task, and has the
major drawback of not being closed under even simple queries or
updates. A solution to this has been proposed in the form of
*conditional tables* [61] where every tuple is
annotated with a Boolean formula over independent Boolean random events.
This model has been recognized as foundational and extended in two
different directions: to more expressive models of *provenance* than
what Boolean functions capture, through a semiring
formalism [57], and to a
probabilistic formalism by assigning independent probabilities to the
Boolean events [58]. These two extensions form the basis of
modern provenance and probability management, largely subsuming
previous work [49], [43]. Research in the past
ten years has focused on a better understanding of the tractability of
query answering with provenance and probabilistic annotations, in a
variety of specializations of this
framework [72], [62], [40].
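To make the model concrete, the probability of a tuple annotated with a Boolean formula over independent events can be computed naively by summing over all possible worlds, as in the Python sketch below (names and data are illustrative). The enumeration is exponential in the number of events, which is precisely the cost that the tractability research above seeks to avoid:

```python
from itertools import product

def tuple_probability(annotation, event_probs):
    """Probability that an annotated tuple exists, by summing the weights
    of all valuations of the independent Boolean events (possible worlds).

    `annotation` maps a valuation dict to a Boolean; `event_probs` maps
    each event name to its marginal probability."""
    events = list(event_probs)
    total = 0.0
    for values in product([False, True], repeat=len(events)):
        world = dict(zip(events, values))
        weight = 1.0
        for e in events:
            weight *= event_probs[e] if world[e] else 1 - event_probs[e]
        if annotation(world):
            total += weight
    return total

# Tuple annotated with e1 ∨ e2, e.g. produced by a union or projection.
probs = {"e1": 0.5, "e2": 0.4}
annotation = lambda w: w["e1"] or w["e2"]
print(tuple_probability(annotation, probs))  # ≈ 0.7, i.e. 1 - (1 - 0.5) * (1 - 0.4)
```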

#### Machine Learning

Statistical machine learning, and its applications to data mining and data analytics, is a major foundation of data management research. A large variety of research areas in complex data management, such as wrapper induction [68], crowdsourcing [41], focused crawling [56], or automatic database tuning [44] critically rely on machine learning techniques, such as classification [60], probabilistic models [55], or reinforcement learning [73].

Machine learning is also a rich source of complex data management problems: thus, the probabilities produced by a conditional random field [65] system result in probabilistic annotations that need to be properly modeled, stored, and queried.

Finally, complex data management also brings new twists to some classical
machine learning problems. Consider for instance the area of *active
learning* [70], a subfield of machine
learning concerned with how to optimally use a (costly) oracle, in an
interactive manner, to label training data that will be used to build a
learning model, e.g., a classifier. In most of the active learning
literature, the cost model is very basic (uniform or fixed-value costs),
though some works [69] consider
more realistic costs. Also, oracles are usually assumed to be perfect
with only a few exceptions [53]. These
assumptions usually break down when applied to complex data management
problems on real-world data, such as crowdsourcing.
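A cost-aware variant of uncertainty sampling, for instance, can be sketched as follows (the scoring rule below is a simple illustration of our own, not a method from the cited works): among candidate items, query the oracle on the one maximizing label uncertainty per unit cost:

```python
def pick_query(candidates):
    """Choose the next item to send to the (costly) oracle.

    `candidates` holds (item, p_positive, cost) triples, where
    p_positive is the current model's estimated probability that the
    item is positive. The score is label uncertainty divided by cost."""
    def score(candidate):
        _, p, cost = candidate
        uncertainty = 1 - abs(2 * p - 1)  # peaks at p = 0.5, zero at 0 or 1
        return uncertainty / cost
    return max(candidates, key=score)[0]

pool = [("a", 0.9, 1.0),   # confident prediction, cheap to label
        ("b", 0.5, 4.0),   # maximally uncertain, but expensive
        ("c", 0.6, 1.0)]   # fairly uncertain and cheap
print(pick_query(pool))  # "c"
```

In a crowdsourcing setting, both the per-item costs and the assumption of a perfect oracle would additionally have to be revised, as discussed above.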