Team KerData

Section: Scientific Foundations

Managing massive unstructured data under heavy concurrency on large-scale distributed infrastructures

Massive unstructured data: BLOBs

Studies show that more than 80%  [42] of the data in circulation worldwide is unstructured. At the same time, data sizes are increasing at a dramatic rate: medical experiments  [56] , for example, have an average requirement of 1 TB per week. Large repositories for data analysis programs, data streams generated and updated by continuously running applications, and data archives are just a few examples of contexts where unstructured data easily reaches the order of 1 TB. Such unstructured data are often stored as binary large objects (BLOBs) within databases or files. However, traditional databases and file systems can hardly cope with BLOBs that grow to huge sizes.

Scalable processing of massive data: heavy access concurrency

To address the scalability issue, specialized programming frameworks such as Map-Reduce  [37] and Pig-Latin  [54] offer high-level data processing models intended to hide the details of parallelization from the user. Such platforms are implemented on top of huge object storage and target high performance by optimizing the parallel execution of the computation. This leads to heavy access concurrency on the BLOBs, hence the need for the storage layer to offer specific support. Parallel and distributed file systems also consider using objects for low-level storage (see the next subsection,  [38] , [62] , [41] ). In other application areas, such as high-energy physics, multimedia processing  [35] or astronomy, huge BLOBs need to be used concurrently in the highest layers of applications directly.
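To illustrate how such frameworks hide parallelization details, the sketch below shows the Map-Reduce model in miniature: the user writes only a map function and a reduce function, while the framework handles partitioning, grouping, and execution. The helper names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, not taken from any particular framework, and a real system would distribute these steps across many nodes over the object storage layer.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user-supplied map function to every input record."""
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    """Group intermediate key/value pairs by key (done across nodes in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the user-supplied reduce function to each key's group of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Canonical word-count example: the user writes only these two functions.
def word_map(document):
    for word in document.split():
        yield word, 1

def word_reduce(word, counts):
    return sum(counts)

docs = ["big data", "big blobs"]
result = reduce_phase(shuffle(map_phase(docs, word_map)), word_reduce)
# result == {"big": 2, "data": 1, "blobs": 1}
```

Because the map and reduce functions operate on independent records and independent key groups, the framework is free to run them in parallel on many machines, which is precisely what generates the heavy concurrent access to the underlying BLOBs.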


When addressing the problem of storing and efficiently accessing very large unstructured data objects  [50] , [56] in a distributed environment, a challenging case is the one where data is mutable and potentially accessed by a very large number of concurrent, distributed processes. In this context, versioning is an important feature. Not only does it allow rolling back data changes when desired, but it also enables cheap branching (possibly recursive): the same computation may proceed independently on different versions of the BLOB. Versioning should obviously not significantly impact access performance to the object, given that objects are under constant heavy access concurrency. On the other hand, versioning increases storage space usage, which becomes a major concern when the data size itself is huge. Versioning efficiency thus refers both to access performance under heavy load and to an acceptable storage space overhead.
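The cheap-branching idea above can be sketched with a copy-on-write scheme: each version records only the chunks it modified and shares the rest with its parent, so creating a snapshot or a branch costs space proportional to the change, not to the BLOB size. This is an illustrative design under assumed simplifications (whole-chunk writes only, class and method names invented for the example), not the team's actual storage system.

```python
class VersionedBlob:
    """Copy-on-write chunked BLOB: versions share unmodified chunks."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.deltas = [{}]     # version id -> {chunk index: chunk bytes}
        self.parents = [None]  # version id -> parent version id (0 is the empty root)

    def write(self, parent, offset, data):
        """Create a new version on top of `parent`, storing only the
        chunks covered by `data` (chunk-aligned writes only in this sketch)."""
        assert offset % self.chunk_size == 0 and len(data) % self.chunk_size == 0
        delta = {}
        for i in range(0, len(data), self.chunk_size):
            delta[(offset + i) // self.chunk_size] = data[i:i + self.chunk_size]
        self.deltas.append(delta)
        self.parents.append(parent)
        return len(self.deltas) - 1  # id of the new version

    def read_chunk(self, version, index):
        """Walk up the version chain until some ancestor wrote this chunk."""
        while version is not None:
            if index in self.deltas[version]:
                return self.deltas[version][index]
            version = self.parents[version]
        return b"\x00" * self.chunk_size  # chunk never written

blob = VersionedBlob()
v1 = blob.write(0, 0, b"AAAABBBB")  # version 1: chunks 0 and 1
v2 = blob.write(v1, 4, b"CCCC")     # version 2: overwrites only chunk 1
v3 = blob.write(v1, 0, b"DDDD")     # a branch, also derived from v1
# v2 and v3 proceed independently, yet both still share v1's unchanged chunks.
```

Note that writes never modify existing versions, only append new deltas: this immutability is what lets heavily concurrent readers proceed without locking, while the per-delta storage cost keeps the space overhead of versioning proportional to the actual changes.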
