Team ATLAS

Section: New Results

P2P Query Support

We addressed three aspects of efficient query support in P2P networks, exploiting in particular DHTs and gossiping. First, we exploit DHTs and gossiping to improve the performance of join queries over data streams. Second, we exploit them to improve content distribution. Third, we started a new research direction that considers uncertain data.

Join Queries over Data Streams

Participants: Reza Akbarinia, Esther Pacitti, Wenceslao Palma, Patrick Valduriez.

Recent years have witnessed the growth of a new class of data-intensive applications that do not fit the DBMS data model and querying paradigm. Instead, the data arrive at high speed as an unbounded sequence of values (data streams), and queries run continuously, returning new results as new data arrive. The unbounded nature of data streams makes it impossible to store the data entirely in bounded memory. However, approximate answers are often sufficient when the goal of a query is to understand trends and make decisions about measurements or utilization patterns. One technique for producing an approximate answer to a continuous query is to execute the query over a window that maintains a restricted number of recent data items. In continuous query processing, the join is one of the most important operators, since it can be used to detect trends across different data streams. To emphasize access to recent data, the window conceptually slides over the input streams, giving rise to a type of join called a sliding window join.
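
As an illustration of the sliding window join described above, here is a minimal single-machine sketch. It assumes count-based windows (each stream keeps only its most recent `window_size` tuples) and an equality join on a key; the class name `SlidingWindowJoin` and the stream labels `"R"` and `"S"` are hypothetical and not taken from the cited work.

```python
from collections import deque


class SlidingWindowJoin:
    """Symmetric equi-join over two streams, each buffered in a count-based window."""

    def __init__(self, window_size):
        # One bounded buffer per stream; deque(maxlen=...) drops the oldest tuple when full.
        self.windows = {
            "R": deque(maxlen=window_size),
            "S": deque(maxlen=window_size),
        }

    def insert(self, stream, key, value):
        """Process one arriving tuple and return the new join results it produces."""
        other = "S" if stream == "R" else "R"
        # Probe the opposite window for tuples with the same join key.
        # Results are always ordered as (R value, S value).
        matches = [
            (value, v) if stream == "R" else (v, value)
            for k, v in self.windows[other]
            if k == key
        ]
        # Store the new tuple so that later arrivals on the other stream can match it.
        self.windows[stream].append((key, value))
        return matches
```

Because each window holds only recent tuples, a result is missed once one of its two tuples has been evicted, which is why the answer is approximate.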

In [23], we addressed the problem of computing approximate answers to windowed joins over data streams. We proposed a method, called DHTJoin, which combines hash-based placement of tuples in a Distributed Hash Table (DHT) with dissemination of queries along the trees embedded in the underlying DHT, thereby incurring little overhead. Using query predicates, DHTJoin identifies the subset of tuples required by the users' queries and indexes only those, thus reducing network traffic [37]. This is more efficient than approaches based on structured P2P overlays, e.g. PIER and RJoin, which typically index all tuples in the network. In [37], we provided an analytical evaluation of the best number of nodes needed to obtain a given degree of completeness for a continuous join query. DHTJoin tackles the dynamic behavior of DHT networks during both query dissemination and query execution [23] [38]. When nodes fail during query dissemination, DHTJoin uses a gossip-based protocol that ensures 100% network coverage. When nodes fail during query execution, DHTJoin propagates messages that prevent nodes from sending intermediate results that do not contribute to join results, thereby reducing network traffic. Finally, DHTJoin provides an efficient solution for nodes that become overloaded as a result of data skew [23] [38]. The key idea is to distribute the tuples of an overloaded node to some underloaded nodes, called partners. When a node becomes overloaded, DHTJoin discovers partners using information in the routing table and determines which tuples to send them using the concept of domain partitioning. We showed that, in this case, DHTJoin incurs only one additional message per joined tuple produced, thus keeping response time low.

We evaluated the performance of DHTJoin through simulation. The results show the effectiveness of our solution compared with previous work.
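
The predicate-driven indexing and hash-based placement described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the DHTJoin implementation: a query is modeled as a pair of `(stream, attribute)` join operands, the DHT is reduced to a hash modulo `num_nodes`, and the function names `home_node` and `index_entries` are hypothetical.

```python
import hashlib


def home_node(join_value, num_nodes):
    """Map a join attribute value to a node id (stand-in for a real DHT lookup)."""
    digest = hashlib.sha1(repr(join_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def index_entries(stream, tup, queries, num_nodes):
    """Return the (node, attribute, value) placements for one arriving tuple.

    Only attributes that appear in a join predicate of some registered query
    are indexed; tuples that no query needs produce no network traffic.
    """
    needed = set()
    for (left_stream, left_attr), (right_stream, right_attr) in queries:
        if stream == left_stream:
            needed.add(left_attr)
        if stream == right_stream:
            needed.add(right_attr)
    return [
        (home_node(tup[attr], num_nodes), attr, tup[attr])
        for attr in sorted(needed)
        if attr in tup
    ]
```

Tuples from different streams that carry the same join value hash to the same node, so the join can be evaluated locally there. A tuple from a stream that no query references yields an empty placement list.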

P2P Content Distribution Network

Participants: Manal El Dick, Esther Pacitti.

P2P networks provide a very cost-effective alternative for building highly scalable content distribution infrastructures. This is particularly useful for websites with a large user base that are run by non-profit organizations and cannot afford to distribute their popular content at large scale via commercial content distribution networks (CDN) like Akamai.

Our first contribution [29], [30] consists in building a P2P CDN, Flower-CDN, that enables any under-provisioned website to efficiently distribute its content with the help of the non-profit community interested in that content. Our solution combines the strengths of both structured and unstructured P2P networks, exploiting DHT efficiency and gossip robustness. Flower-CDN introduces a novel DHT usage and management, called D-ring, that relies on a new locality- and interest-aware key service. It helps new peers to quickly find peers in the same network locality that are interested in the same website. We organize peers that share the same locality and are interested in the same website into unstructured overlay clusters (called petals). Within a petal, peers use gossip protocols to exchange information about their content and contacts, allowing Flower-CDN to maintain accurate information despite dynamic changes and thus to support subsequent queries. We use this novel two-layered architecture, consisting of a D-ring and petals, to provide hybrid locality-aware query routing. The D-ring ensures reliable access for new clients, while subsequent searches are performed within the petals. Thus, most of the query routing takes place within a local cluster, leading to short query searches and local data transfers. Our empirical analysis shows that Flower-CDN can reduce lookup latency by a factor of 9 and the transfer distance by a factor of 2, compared to an existing P2P CDN (Squirrel). Moreover, Flower-CDN incurs an acceptable overhead in terms of gossip bandwidth, which can also be tuned according to hit ratio requirements and bandwidth availability.
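
The two-layered routing described above can be illustrated with a toy, single-process model. It makes several simplifying assumptions that are not in the cited work: the D-ring is reduced to a dictionary keyed by `(website, locality)`, gossip inside a petal is replaced by a shared in-memory map of what each peer caches, and the class and method names (`FlowerCDNSketch`, `join`, `cache`, `lookup`) are hypothetical.

```python
class FlowerCDNSketch:
    """Toy model of a D-ring entry layer on top of per-locality petals."""

    def __init__(self):
        # D-ring stand-in: (website, locality) -> entry-point peer for that petal.
        self.d_ring = {}
        # Petals: (website, locality) -> {peer: set of objects the peer caches}.
        self.petals = {}

    def join(self, peer, website, locality):
        """Add a peer to its petal and return the petal's entry-point peer."""
        key = (website, locality)
        # The first peer to join a petal becomes its entry point (an assumption).
        self.d_ring.setdefault(key, peer)
        self.petals.setdefault(key, {})[peer] = set()
        return self.d_ring[key]

    def cache(self, peer, website, locality, obj):
        """Record that a peer now holds a copy of an object."""
        self.petals[(website, locality)][peer].add(obj)

    def lookup(self, website, locality, obj):
        """Return a peer in the requester's own petal that holds obj, else None."""
        petal = self.petals.get((website, locality), {})
        for peer in sorted(petal):
            if obj in petal[peer]:
                return peer
        return None
```

In this sketch a lookup never leaves the requester's locality: a miss returns `None`, at which point a real system would fall back to another petal or to the origin website.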

Our second contribution [28] aims at providing our P2P CDN with high scalability and robustness under large-scale and dynamic participation of peers. To this end, we propose PetalUp-CDN, which dynamically adapts Flower-CDN to increasing numbers of participants in order to avoid overload situations. In short, PetalUp-CDN enables D-ring to progressively expand to manage larger petals so that all the participants share the workload rather evenly. In addition, we maintain our P2P CDN in the face of high churn and failures by relying on low-cost gossip protocols. Our maintenance protocols preserve the locality- and interest-aware features of our architecture and enable fast and efficient recovery. Based on extensive simulations, we show that our approach leverages larger scales to achieve higher improvements. Furthermore, Flower-CDN can maintain excellent performance under highly dynamic participation of peers. This work was done in cooperation with Bettina Kemme (McGill Univ.).

Uncertain Data Management

Participants: Reza Akbarinia, Esther Pacitti, Patrick Valduriez.

We are witnessing a rapid and significant increase in interest in uncertain data management. One of the main reasons is the emergence of many applications in which data uncertainty is unavoidable, e.g. data cleaning, sensor networks and information extraction. In a recent work [36], we investigated the challenges of uncertain query processing in P2P online communities. In these environments the data are not 100% certain, precise and correct, particularly when they come from peers with different levels of confidence. Query processing techniques designed for P2P systems should be revisited to deal with data uncertainty at all levels. Similarly, the recent DBMS extensions that support data uncertainty should be revisited for P2P networks. We also addressed the problem of estimating data confidence in P2P community information management systems. Since the data are not certain, we need to estimate their degree of certainty (i.e. confidence). For this, we rely on the knowledge of all users of the system and use their feedback to estimate the data confidence. We proposed a new data model, called the feedback graph, that models the relations between the users, their data and their feedback. Based on this model, we developed a distributed approach for managing the feedback graph and computing the data confidence with a recursive formula.

In another work, we have started to study uncertain aggregate (aggr) queries, which have proven very useful for many uncertain data management applications. Examples of these applications are day-ahead energy market estimation, moving object surveillance, mortgage default prediction and stock market prediction. To evaluate aggr queries over uncertain data, we must first provide a definition (semantics) of these queries in uncertain databases. In initial works, aggr queries were defined based on the expected value semantics, i.e. the expected value of aggregate attributes in uncertain tuples. However, recent works have shown that this semantics is not sufficient for many applications, and that other semantics are needed. In our work, in addition to taking into account the previously proposed semantics, we proposed new semantics which are very useful for uncertain applications. The evaluation of aggr queries under both the new and the previously proposed semantics is quite challenging, particularly for SUM and AVG queries. Naïve algorithms, which are based on enumerating possible worlds, evaluate aggr queries in exponential time. We developed new algorithms that in most cases execute aggr queries in polynomial time. We plan to extend our algorithms to distributed systems, in particular P2P systems. There, we must take into account the data distribution, which makes uncertain aggr query processing much more complicated than in centralized systems. Furthermore, we must deal with the dynamic behavior of peers, which may leave the system or fail during query processing.
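
To make the gap between possible-world enumeration and cheaper evaluation concrete, here is a small sketch for SUM. It assumes a tuple-level uncertainty model in which each tuple is a `(value, probability)` pair and tuples exist independently; this model and all function names are assumptions made for illustration. The incremental routine is a standard dynamic-programming technique, not the algorithms developed in the work described above.

```python
from collections import defaultdict
from itertools import product


def sum_distribution_naive(tuples):
    """Distribution of SUM obtained by enumerating all 2^n possible worlds."""
    dist = defaultdict(float)
    for world in product([False, True], repeat=len(tuples)):
        prob, total = 1.0, 0
        for present, (value, p) in zip(world, tuples):
            prob *= p if present else 1.0 - p
            total += value if present else 0
        dist[total] += prob
    return dict(dist)


def sum_distribution_dp(tuples):
    """Same distribution, built by folding in one tuple at a time.

    The cost is proportional to n times the number of distinct partial sums,
    which is far below 2^n when many worlds share the same sum.
    """
    dist = {0: 1.0}
    for value, p in tuples:
        nxt = defaultdict(float)
        for total, prob in dist.items():
            nxt[total] += prob * (1.0 - p)      # tuple absent from the world
            nxt[total + value] += prob * p      # tuple present in the world
        dist = dict(nxt)
    return dist


def expected_sum(tuples):
    """Expected value semantics: by linearity, E[SUM] is the sum of value * probability."""
    return sum(value * p for value, p in tuples)
```

The expected value collapses the whole distribution into one number, whereas the two distribution routines keep the probability of every possible sum, which is the kind of information that semantics beyond the expected value can draw on.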

