Team KerData

Section: New Results


Participants: Bogdan Nicolae, Gabriel Antoniu, Luc Bougé.

Starting from a preliminary experimental implementation, we developed BlobSeer into a fully fledged data-storage service for large-scale, distributed, data-intensive applications that process unstructured data stored as huge sequences of bytes (BLOBs). We focused on demonstrating the benefits of versioning when manipulating such large sequences of bytes, as well as the benefits of decentralizing data and metadata to support heavy write concurrency efficiently.

Efficient Versioning for Large Object Storage

We targeted large-scale data-intensive distributed applications built on top of paradigms that exploit data parallelism explicitly. In this context, applications need to acquire and maintain huge unstructured datasets, while performing computations in the background over these datasets.

We formalized a simple yet versatile versioning-oriented access interface to optimize data management. This interface enables creating a BLOB, reading and writing parts of the BLOB, and appending new data to the BLOB. Data is never overwritten: each time a write or append occurs, a new snapshot of the BLOB is created. Every read operation must explicitly name the snapshot it accesses. Readers are thereby decoupled from writers, so data gathering and data processing can proceed without synchronizing with each other. Moreover, we guarantee linearizability for all operations, which eliminates the need for explicit synchronization at the operation level. Finally, we illustrated the benefits of our interface in a real-life, data-intensive MapReduce scenario.
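The semantics of such an interface can be illustrated with a minimal in-memory sketch. It is not BlobSeer's actual API: the class name `VersionedBlob` and its method names are hypothetical, the code is single-threaded (so it says nothing about how linearizability is achieved), and each snapshot is stored as a full copy, which a real storage service would avoid.

```python
class VersionedBlob:
    """Toy in-memory BLOB: every write or append publishes a new snapshot.

    Hypothetical illustration only; names and storage layout are assumptions.
    """

    def __init__(self):
        # Snapshot 0 is the empty BLOB. Published snapshots are never modified.
        self._snapshots = [b""]

    def latest(self):
        """Return the number of the most recently published snapshot."""
        return len(self._snapshots) - 1

    def read(self, version, offset, size):
        """Read `size` bytes at `offset` from an explicitly named snapshot."""
        if not 0 <= version < len(self._snapshots):
            raise ValueError("unknown snapshot version")
        return self._snapshots[version][offset:offset + size]

    def write(self, offset, payload):
        """Apply a write on top of the latest snapshot; return the new version."""
        base = self._snapshots[-1]
        if offset > len(base):
            raise ValueError("write offset beyond end of BLOB")
        # Build a new snapshot instead of overwriting the old one.
        new = base[:offset] + payload + base[offset + len(payload):]
        self._snapshots.append(new)
        return self.latest()

    def append(self, payload):
        """Append at the end of the latest snapshot; return the new version."""
        return self.write(len(self._snapshots[-1]), payload)


blob = VersionedBlob()
v1 = blob.append(b"hello world")
v2 = blob.write(0, b"HELLO")
# A reader that pinned snapshot v1 is unaffected by the later write.
old_view = blob.read(v1, 0, 11)
new_view = blob.read(v2, 0, 11)
```

Because a reader always names its snapshot, `old_view` keeps returning the original bytes no matter how many writes follow, which is the decoupling between data gathering and data processing described above.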

As a next step, we extended BlobSeer to provide efficient support for the interface we proposed. This involved implementing the append operation (missing from our previous implementation) and further developing our distributed metadata management scheme to accommodate this operation efficiently while maintaining the same level of performance for reads and writes.

We conducted preliminary large-scale experiments on the Grid'5000 testbed to evaluate append performance. The results suggest good scalability with respect to both the data size and the number of concurrent accesses. These results have been published in [8].

High Write Throughput in Desktop Grids

We evaluated BlobSeer as a storage service for write-intensive applications running in Desktop Grids that have high output data requirements and where both the access grain and the access pattern may be random.

In this context, the main challenge is to deal with heavy write concurrency in an efficient way. We addressed this challenge by combining data striping with our decentralized, versioning-oriented metadata structure built on top of distributed segment trees and spread over a Distributed Hash Table (DHT).
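The combination described above can be sketched in a few lines. This is a simplified, sequential illustration under stated assumptions, not BlobSeer's implementation: the names `ToyDHT` and `ToyStore`, the 4-byte stripe size, the fixed 8-chunk capacity, the round-robin chunk placement, and the CRC-based key placement are all hypothetical choices. Each write stripes its payload over several data providers, then publishes a new version of a segment tree whose nodes live in a hash-partitioned key-value store and which reuses the untouched subtrees of the previous version.

```python
import zlib

CHUNK = 4      # toy stripe size in bytes (assumption)
N_CHUNKS = 8   # toy BLOB capacity in chunks; must be a power of two


class ToyDHT:
    """Dictionary partitioned over several buckets, standing in for a DHT."""

    def __init__(self, n_nodes):
        self.nodes = [dict() for _ in range(n_nodes)]

    def _home(self, key):
        # Deterministic hash placement of a metadata key on one bucket.
        return self.nodes[zlib.crc32(repr(key).encode()) % len(self.nodes)]

    def put(self, key, value):
        self._home(key)[key] = value

    def get(self, key):
        return self._home(key)[key]


class ToyStore:
    def __init__(self, n_data=3, n_meta=3):
        self.providers = [dict() for _ in range(n_data)]  # data providers
        self.meta = ToyDHT(n_meta)                        # metadata providers
        self.version = 0   # version 0 is the empty BLOB and has no tree
        self._next = 0     # round-robin cursor for chunk placement

    def write(self, offset, payload):
        """Chunk-aligned write; returns the number of the new version."""
        assert payload and offset % CHUNK == 0 and len(payload) % CHUNK == 0
        first, count = offset // CHUNK, len(payload) // CHUNK
        assert first + count <= N_CHUNKS
        new_v, leaves = self.version + 1, {}
        for i in range(count):  # stripe the payload over the data providers
            p = self._next % len(self.providers)
            self._next += 1
            key = (new_v, first + i)
            self.providers[p][key] = payload[i * CHUNK:(i + 1) * CHUNK]
            leaves[first + i] = (p, key)
        prev_root = self.version if self.version > 0 else None
        self._build(0, N_CHUNKS, new_v, prev_root, leaves)
        self.version = new_v
        return new_v

    def _build(self, lo, hi, new_v, prev_v, leaves):
        """Create nodes for [lo, hi); return the version owning that subtree."""
        if not any(lo <= i < hi for i in leaves):
            return prev_v                      # untouched: share old subtree
        if hi - lo == 1:
            self.meta.put((new_v, lo, hi), ("leaf",) + leaves[lo])
            return new_v
        mid = (lo + hi) // 2
        old_l = old_r = None
        if prev_v is not None:
            _, old_l, old_r = self.meta.get((prev_v, lo, hi))
        left = self._build(lo, mid, new_v, old_l, leaves)
        right = self._build(mid, hi, new_v, old_r, leaves)
        self.meta.put((new_v, lo, hi), ("inner", left, right))
        return new_v

    def read(self, version, offset, size):
        """Chunk-aligned read of a given version; holes read as zero bytes."""
        assert 0 <= version <= self.version
        assert offset % CHUNK == 0 and size % CHUNK == 0
        first, last = offset // CHUNK, (offset + size) // CHUNK
        found = {}
        root = version if version > 0 else None
        self._collect(root, 0, N_CHUNKS, first, last, found)
        return b"".join(found.get(i, b"\0" * CHUNK) for i in range(first, last))

    def _collect(self, v, lo, hi, first, last, out):
        if v is None or hi <= first or last <= lo:
            return                             # empty subtree or no overlap
        node = self.meta.get((v, lo, hi))
        if node[0] == "leaf":
            out[lo] = self.providers[node[1]][node[2]]
            return
        mid = (lo + hi) // 2
        self._collect(node[1], lo, mid, first, last, out)
        self._collect(node[2], mid, hi, first, last, out)
```

In this sketch a write that touches one chunk creates only the tree nodes on the path from the root to that chunk, and those nodes are scattered over the metadata buckets by their key. The sketch assigns version numbers sequentially; how concurrent writers obtain versions is outside its scope.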

To demonstrate the benefits of our decentralized approach to data and metadata management, we conducted extensive experiments on the Grid'5000 testbed. We evaluated the impact of both data decentralization and metadata decentralization. In a final large-scale experiment, we demonstrated the importance of the latter in sustaining high write throughput under heavy write concurrency. The results suggest clear benefits of using a decentralized metadata approach. They have been published in [9].

