Section: Scientific Foundations
The generation of massive amounts of data with varying levels of control and quality makes data uncertainty ubiquitous in many applications. Examples include web data cleaning, sensor networks, information extraction, data integration, and RFID stream analysis. Data uncertainty can be captured by associating probabilities with data, which is the basis for probabilistic databases. A probabilistic database management system (PDBMS) is thus a system that stores and retrieves probabilistic data and supports complex queries over that data. Any PDBMS must address two important issues: 1) how to represent a probabilistic database, i.e. the data model; 2) how to answer queries using the chosen representation, i.e. query evaluation.
There are two main probabilistic data models: the tuple-level and attribute-level models. In the tuple-level model, each tuple t has an attribute that indicates the membership probability (also called existence probability) of t, i.e. the probability that the tuple appears in a random instance of the database. In the attribute-level model, each tuple t has at least one uncertain attribute, e.g. a. The value of a in t is determined by a random variable whose probability density function (pdf) may be defined over a discrete or continuous domain. In both models, the tuples of the probabilistic database may be independent or correlated. Although the models that support correlation are more powerful than the others, they usually incur exponential processing complexity.
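To make the two models concrete, the following minimal Python sketch represents one tuple of each kind. The class and field names (TupleLevelRow, AttributeLevelRow, temperature_dist) and the sensor data are hypothetical and chosen only for illustration; tuples are assumed independent and the uncertain attribute is assumed discrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TupleLevelRow:
    """Tuple-level model: certain values plus a membership probability."""
    values: tuple
    prob: float  # probability that the tuple appears in a random instance


@dataclass(frozen=True)
class AttributeLevelRow:
    """Attribute-level model: one uncertain attribute with a discrete distribution."""
    key: str
    temperature_dist: dict  # maps each possible value to its probability


def expected_size(rows):
    """Expected number of tuples in a random instance (linearity of expectation)."""
    return sum(row.prob for row in rows)


def prob_at_least(row, threshold):
    """Probability that the uncertain attribute of `row` is >= threshold."""
    return sum(p for value, p in row.temperature_dist.items() if value >= threshold)


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    readings = [
        TupleLevelRow(("s1", "Paris"), 0.9),
        TupleLevelRow(("s2", "Lyon"), 0.5),
    ]
    sensor = AttributeLevelRow("s1", {18: 0.2, 20: 0.5, 22: 0.3})
    print(expected_size(readings))
    print(prob_at_least(sensor, 20))
```

A continuous attribute would replace the dictionary with a density function and the sum with an integral; correlated tuples would need a joint distribution rather than one probability per tuple.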
Query evaluation is the hardest technical challenge in a PDBMS. A naïve solution for evaluating probabilistic queries is to enumerate all possible worlds, i.e. all possible instances of the database, execute the query in each world, and return each possible answer together with the total probability of the worlds in which it appears. However, this solution is inefficient because the number of possible worlds of a probabilistic database can be exponential in its size. Some queries can be evaluated on a probabilistic database by pushing the probabilistic computation inside the query plan. For these queries, the output probabilities are computed inside the database engine, using normal query processing. Queries for which this computation is possible are called safe queries, and the execution plan that computes the output probabilities is called a safe plan. However, there are many queries for which there is no safe plan, e.g. many queries containing self-joins. For some complex queries, e.g. top-k and aggregate queries, the semantics of the query itself must be redefined. For example, for top-k queries we must decide how to take into account both tuple probabilities and scores when ranking the tuples. Although much research has been done in the last few years on complex query evaluation in probabilistic databases, many open problems remain in this domain.
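The contrast between the naïve approach and computation inside the plan can be illustrated on a small case. The Python sketch below assumes a tuple-independent relation and a single duplicate-eliminating projection (the distinct cities of a hypothetical READINGS table); the relation, its contents, and the function names are assumptions made for illustration and do not describe a particular system. The first function enumerates all possible worlds; the second computes the same probabilities directly, as 1 minus the product of (1 - p) over the tuples that project to the same answer.

```python
from collections import defaultdict
from itertools import product

# Hypothetical tuple-independent relation: (sensor, city, membership probability).
READINGS = [
    ("s1", "Paris", 0.5),
    ("s2", "Paris", 0.4),
    ("s3", "Lyon", 0.9),
]


def distinct_cities_by_worlds(rows):
    """Naive evaluation: enumerate all 2^n worlds and sum their probabilities."""
    answer = defaultdict(float)
    for mask in product([False, True], repeat=len(rows)):
        world_prob = 1.0
        cities = set()
        for present, (_, city, p) in zip(mask, rows):
            # Independence: the world probability is a product over tuples.
            world_prob *= p if present else 1.0 - p
            if present:
                cities.add(city)
        for city in cities:
            answer[city] += world_prob
    return dict(answer)


def distinct_cities_in_plan(rows):
    """Probabilities computed during the projection itself, in one pass."""
    all_absent = defaultdict(lambda: 1.0)
    for _, city, p in rows:
        # Probability that no tuple producing this city is present.
        all_absent[city] *= 1.0 - p
    return {city: 1.0 - q for city, q in all_absent.items()}


if __name__ == "__main__":
    print(distinct_cities_by_worlds(READINGS))
    print(distinct_cities_in_plan(READINGS))
```

Both functions return the same probabilities up to floating-point rounding, but the first visits 2^n worlds while the second makes a single pass over the n tuples. The shortcut relies on the independence assumption; with correlated tuples, or for queries without a safe plan, this direct product formula is no longer valid in general.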
Already difficult in centralized systems, query evaluation is even more complicated in distributed systems, particularly because of new challenges in schema mapping and query routing. The defined schema mappings may themselves be uncertain, and this uncertainty should be taken into account in query reformulation and in execution plans. Furthermore, the query must be routed to the nodes that hold relevant data with high probability.