Team KerData

Section: Scientific Foundations

Emerging large-scale infrastructures for distributed applications

Over the last few years, research and development in large-scale distributed computing has led to the emergence of several types of physical execution infrastructures for large-scale distributed applications.

Cloud computing infrastructures

The cloud computing model [61], [49], [33] is attracting strong interest from both industry and academia in the area of large-scale distributed computing. It provides a new paradigm for managing computing resources: instead of buying and managing hardware, users rent virtual machines and storage space. Various cloud software stacks have been proposed by leading industry players such as Google, Amazon or Yahoo!. They aim at providing fully configurable virtual machines or virtual storage (IaaS: Infrastructure-as-a-Service [19], [26], [20]), higher-level services including programming environments such as Map-Reduce [37] (PaaS: Platform-as-a-Service [21], [24]), or community-specific applications (SaaS: Software-as-a-Service [22], [25]). On the academic side, one of the most visible projects in this area is Nimbus [26], [46], from the Argonne National Laboratory (USA), which aims at providing a reference implementation of an IaaS. In parallel to these trends, other research efforts have focused on the concept of a grid operating system: a distributed operating system for large-scale, wide-area, dynamic infrastructures spanning multiple administrative domains. XtreemOS [52], [27] is such a grid operating system, which provides native support for virtual organizations. Since both the cloud approach and the grid operating system approach deal with resource management on large-scale distributed infrastructures, their relative positioning with respect to each other is currently under investigation within the Paris/Myriads Project-Team (http://www.irisa.fr/paris/) at Inria Rennes – Bretagne Atlantique [51].
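
As a purely illustrative aside, the sketch below shows the essence of the Map-Reduce programming model mentioned above: a map phase that emits key/value pairs and a reduce phase that aggregates them per key. It is a sequential toy word count written for this report, not the API of any particular cloud platform; a real PaaS runtime would distribute the map and reduce phases over many machines.

from collections import defaultdict

# Illustrative, single-process word count written in the Map-Reduce style.
# A real PaaS runtime would partition the input, run map tasks in parallel,
# shuffle intermediate pairs by key, then run reduce tasks in parallel;
# this sketch only shows the programming model.

def map_phase(document):
    # Emit (word, 1) pairs for every word of a document.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Aggregate all the counts emitted for a given word.
    return word, sum(counts)

def mapreduce_wordcount(documents):
    intermediate = defaultdict(list)              # "shuffle": group values by key
    for doc in documents:
        for word, count in map_phase(doc):
            intermediate[word].append(count)
    return dict(reduce_phase(w, c) for w, c in intermediate.items())

print(mapreduce_wordcount(["data management on clouds", "data storage on clouds"]))
# {'data': 2, 'management': 1, 'on': 2, 'clouds': 2, 'storage': 1}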

In the context of emerging cloud infrastructures, some of the most critical open issues relate to data management. Providing users with the ability to store and process data on externalized, virtual resources from the cloud requires simultaneously investigating important aspects related to security, efficiency and quality of service. Exploring ways to address the main challenges raised by data storage and management on cloud infrastructures is the major factor that motivated the creation of the KerData Research Team (http://www.irisa.fr/kerdata/) at Inria Rennes – Bretagne Atlantique. To this end, it becomes necessary to create mechanisms able to provide feedback about the state of the storage system and of the underlying physical infrastructure. The monitored information can then be fed back into the storage system and used by self-managing engines, in order to enable an autonomic behavior [47], [55], [44], with goals such as self-configuration, self-optimization or self-healing.
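
A minimal sketch of such a monitor/analyze/act feedback loop is given below, assuming hypothetical monitoring and storage components; the component names, metrics and thresholds are invented for this illustration and do not describe an existing KerData system.

# Hypothetical sketch of an autonomic feedback loop for a distributed
# storage service; names, metrics and thresholds are assumptions made
# for this illustration only.

class MonitoringService:
    def collect(self):
        # In a real deployment these metrics would come from probes running
        # on the physical nodes; here they are hard-coded for illustration.
        return {"node-1": {"load": 0.92, "free_gb": 12},
                "node-2": {"load": 0.35, "free_gb": 480}}

class StorageSystem:
    def throttle_writes(self, node):
        print("self-optimization: throttling writes to overloaded", node)
    def add_replica_target(self, node):
        print("self-configuration: placing new replicas on", node)

def autonomic_step(monitor, storage, load_threshold=0.85, free_threshold_gb=100):
    # One iteration of the loop; a self-managing engine would run this periodically.
    for node, metrics in monitor.collect().items():           # monitor
        if metrics["load"] > load_threshold:                   # analyze
            storage.throttle_writes(node)                      # act
        elif metrics["free_gb"] > free_threshold_gb:
            storage.add_replica_target(node)                   # act

autonomic_step(MonitoringService(), StorageSystem())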

Petascale infrastructures

In 2011, a new NSF-funded petascale computing system, Blue Waters, will go online at the University of Illinois. Blue Waters is expected to be the most powerful supercomputer in the world for open scientific research when it comes online, and the first system of its kind to sustain one petaflop of performance on a range of science and engineering applications. The goal of this facility is to open up new possibilities in science and engineering by providing a computational capability that allows investigators to tackle much larger and more complex research challenges across a wide spectrum of domains: predicting the behavior of complex biological systems, understanding how the cosmos evolved after the Big Bang, designing new materials at the atomic level, predicting the behavior of hurricanes and tornadoes, and simulating complex engineered systems such as the power distribution system, airplanes and automobiles.

To reach sustained petascale performance, Blue Waters relies on advanced, dedicated technologies under development at IBM at several levels: processor, memory subsystem, interconnect, operating system, programming environment and system administration tools. A similar effort was initiated by RIKEN (Japan), which aimed to build a next-generation supercomputer targeting a performance of 10 Petaflops at its research center in Kobe. (This program was stopped by the Japanese government; however, this decision may be reconsidered.)

In the context of such efforts, whose goal is to provide sustained petascale (and beyond) performance, data management is again a critical issue that strongly impacts application behavior. Petascale supercomputers exhibit specific architectural features (e.g., a multi-level memory hierarchy spanning tens to hundreds of thousands of nodes) that need to be specifically taken into account in order to enable a parallel file system to fully benefit from the capabilities of the machine. Providing scalable data throughput at such unprecedented scales is clearly an open challenge today.
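
As a simple illustration of why such architectural features matter, the sketch below shows round-robin striping of a file over a set of I/O servers, the basic mechanism by which parallel file systems aggregate throughput; the chunk size and server count are arbitrary assumptions, not the parameters of any specific petascale file system.

# Illustrative round-robin striping of a file over several I/O servers.
# Chunk size and server count are arbitrary, chosen for illustration only.

CHUNK_SIZE = 4 * 1024 * 1024   # 4 MiB stripes

def stripe_layout(file_size, num_servers, chunk_size=CHUNK_SIZE):
    # Map each chunk index of a file to the I/O server holding it.
    num_chunks = (file_size + chunk_size - 1) // chunk_size
    return {chunk: chunk % num_servers for chunk in range(num_chunks)}

def servers_for_range(offset, length, num_servers, chunk_size=CHUNK_SIZE):
    # Servers a client must contact to access the byte range [offset, offset + length).
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return sorted({chunk % num_servers for chunk in range(first, last + 1)})

# A 64 MiB read from a file striped over 8 servers touches all of them,
# so the client can exploit their aggregate bandwidth in parallel.
print(servers_for_range(offset=0, length=64 * 1024 * 1024, num_servers=8))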

Desktop grids

In recent years, desktop grids have been extensively investigated as an efficient way to build cheap, large-scale virtual supercomputers by gathering idle resources from a very large number of users. Physical infrastructures for grid computing typically rely on clusters of workstations belonging to institutions, interconnected through dedicated, high-throughput wide-area networks. In contrast, desktop grids rely on individual desktop computers provided by volunteer users and interconnected through the Internet. The initial, widespread usage of desktop grids for parallel applications consisting of non-communicating tasks with small input/output parameters is a direct consequence of this physical infrastructure (volatile nodes, low bandwidth), which is unsuitable for communication-intensive parallel applications with high input or output requirements. However, the increasing popularity of volunteer computing projects has progressively led to attempts to enlarge the set of application classes that might benefit from desktop grid infrastructures. For distributed applications whose tasks need very large input data, it is no longer feasible to rely on classical centralized, server-based desktop grid architectures, where the input data is typically embedded in the job description and sent to the workers: such a strategy can lead to significant bottlenecks as the central server gets overwhelmed by download requests. To cope with such data-intensive applications, alternative approaches have been proposed, with the goal of offloading the transfer of the input data from the central servers to other nodes participating in the system, which have potentially under-used bandwidth.

Two approaches follow this idea. The first adopts a P2P strategy, in which the input data is spread across the distributed desktop grid (on the same physical resources that serve as workers) [36]. A central data server is used as an initial data source, from which data is first distributed at a large scale; the workers can then download their input data from each other when needed, using, for instance, a BitTorrent-like mechanism. An alternative approach [36] proposes to use Content Distribution Networks (CDN) to improve the available download bandwidth by redirecting requests for input data from the central data server to an appropriate surrogate data server, based on a global scheduling strategy able to take into account criteria such as locality or load balancing. The CDN approach is more costly than the P2P approach (as it relies on a set of dedicated data servers); however, it is potentially more reliable (as the surrogate data servers are expected to be stable enough).
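
The scheduling decision at the heart of the CDN-based approach can be sketched as follows; the scoring function, its weights and the locality estimate are assumptions made for this illustration, not the strategy proposed in the cited work.

# Illustrative surrogate-server selection for a CDN-based desktop grid:
# redirect a worker's request for input data to the "best" surrogate,
# trading off locality against load balancing. The scoring function and
# its weights are assumptions made for this sketch only.

def network_distance(worker, server):
    # Hypothetical locality estimate; a real system might use latency probes
    # or network coordinates instead of this placeholder.
    return abs(hash(worker) - hash(server)) % 100

def choose_surrogate(worker, surrogates, load, alpha=0.7):
    # Pick the surrogate minimizing a weighted locality/load score.
    def score(server):
        return alpha * network_distance(worker, server) + (1 - alpha) * load[server]
    return min(surrogates, key=score)

surrogates = ["cdn-eu", "cdn-us", "cdn-asia"]
load = {"cdn-eu": 12, "cdn-us": 85, "cdn-asia": 40}   # pending downloads per server
print(choose_surrogate("worker-42", surrogates, load))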

More recent research goes a step further and considers using desktop grids for distributed applications with high output data requirements. Each such application consists of a set of distributed tasks that produce and potentially modify large amounts of data in parallel, under heavy concurrency. These characteristics are typical of 3D rendering applications or of massive data-processing applications that perform data transformations. Such a context calls for new approaches to data management, able to cope with both input and output data in a scalable way.

