
Section: New Results

Operating system and runtime for clusters and grids

Cluster operating systems

Participants : Matthieu Fertré, Jérôme Gallard, Adrien Lèbre, Christine Morin, Pierre Riteau.

Kerrighed is a Single System Image (SSI) OS providing the illusion that a distributed cluster is a virtual multiprocessor machine. In 2008, we continued to contribute to the design and implementation of Kerrighed in the framework of the XtreemOS European IP project. We contributed to several new releases of Kerrighed (2.2.1 and 2.3.0, based on Linux 2.6.20) and of its customized version for the cluster flavour of XtreemOS (LinuxSSI 0.9 in June 2008 and 1.0 in November 2008). We also contributed to the packaging of Kerrighed and LinuxSSI for the Mandriva Linux distribution and the XtreemOS LiveCD.

The implementation of System V IPC semaphore arrays and message queues in Kerrighed has been finalized [73]. These mechanisms have been heavily tested and fixed to run on SMP cluster nodes. Patches that fix the concurrent execution of IPC tests in an SMP context have been submitted to, and accepted by, the Linux Test Project (LTP).

Kerrighed checkpoint/restart mechanisms have been significantly improved [73]. First, session identifiers and process group identifiers are now restored correctly for all application processes. Second, an API has been developed to easily handle the checkpoint/restart of the objects that processes of the same tree may share, depending on the flags passed to the clone system call: file pointers and the files_struct, fs_struct, mm_struct, sighand_struct, signal_struct and sysvsem descriptors. Before this work, when an object was shared by two processes at the time of the checkpoint, the object was dumped twice, and when the application was restarted each process had its own copy of the object. Now, the object is checkpointed only once and the sharing is rebuilt at restart. Thanks to this API, checkpoint/restart of multi-threaded applications has been implemented (the prototype has been validated by checkpointing and restarting a Java Virtual Machine).
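The "dump once, rebuild the sharing" idea can be illustrated with a minimal user-space sketch. It is not the Kerrighed API: the names Proc, checkpoint and restart are hypothetical, plain Python objects stand in for kernel structures such as mm_struct, and Python object identity stands in for a kernel object address.

```python
import copy


class Proc:
    """Toy process: maps a resource name (e.g. 'mm', 'files') to an object."""

    def __init__(self, pid, resources):
        self.pid = pid
        self.resources = resources


def checkpoint(procs):
    """Dump each distinct object once; processes record only object keys."""
    objects = {}  # object key -> saved copy of the object's state
    table = {}    # pid -> {resource name -> object key}
    for p in procs:
        table[p.pid] = {}
        for name, obj in p.resources.items():
            key = id(obj)  # identity stands in for a kernel object address
            if key not in objects:
                objects[key] = copy.deepcopy(obj)  # dumped only the first time
            table[p.pid][name] = key
    return {"objects": objects, "table": table}


def restart(image):
    """Recreate each object once, then rebuild the sharing between processes."""
    restored = {key: copy.deepcopy(state)
                for key, state in image["objects"].items()}
    return [
        Proc(pid, {name: restored[key] for name, key in refs.items()})
        for pid, refs in image["table"].items()
    ]
```

With two toy processes sharing one 'mm' object but holding separate 'files' objects, the image contains three objects rather than four, and after restart both processes again reference the same 'mm' object.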

We also worked on the design and implementation of kDFS (kernel/ Kerrighed Distributed File System), a distributed file system exploiting the disks attached to the computing nodes of a cluster [58] , [30] , [73] .

In the context of Pierre Riteau's Master internship [70], we focused on the reliable execution of applications that use file systems for data storage in a distributed environment. An efficient and portable file versioning framework was designed and implemented in the distributed file system kDFS. This framework can snapshot file data when a process's volatile state is checkpointed, which makes it possible to restart a process with its files in a state consistent with the checkpoint. Our experiments showed that the overhead caused by the file versioning framework was negligible. A replication model synchronized with the checkpoint mechanisms was also proposed. It provides stable storage in a distributed architecture. The synchronization makes it possible to reduce network and disk I/O compared to a synchronous replication mechanism like RAID1 [60].
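A minimal sketch of the principle, assuming a hypothetical VersionedFile class: file content is snapshotted under the identifier of a process checkpoint and rolled back to that version at restart. For simplicity the sketch copies the whole content at each snapshot; the text above does not describe the granularity or data structures actually used in kDFS.

```python
class VersionedFile:
    """Toy in-memory file whose content can be snapshotted per checkpoint."""

    def __init__(self, data=b""):
        self._data = bytearray(data)
        self._versions = {}  # checkpoint id -> immutable snapshot of the content

    def write(self, offset, payload):
        """Write payload at offset, zero-filling if the file must grow."""
        end = offset + len(payload)
        if end > len(self._data):
            self._data.extend(b"\0" * (end - len(self._data)))
        self._data[offset:end] = payload

    def read(self):
        return bytes(self._data)

    def snapshot(self, ckpt_id):
        """Record the content alongside the process checkpoint ckpt_id."""
        self._versions[ckpt_id] = bytes(self._data)

    def rollback(self, ckpt_id):
        """Restore the content saved with ckpt_id (used when restarting)."""
        self._data = bytearray(self._versions[ckpt_id])
```

A process restarted from checkpoint 1 would call rollback(1) so that writes made after the checkpoint are not visible to the restarted execution.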

In the framework of Marco Obrovac's internship [69], we studied new scheduling strategies that take I/O usage into consideration. This work led to a theoretical proposal in which the load values of the resources are prioritized, i.e. every resource enters the load calculation with a specific weight. These weights are not fixed, so every cluster architect can adjust the scheduler policy to suit her needs.
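As an illustration of a weighted load calculation, the sketch below combines per-resource loads into a single score and selects the least loaded node. The normalized weighted average, the resource names and the functions weighted_load and pick_node are assumptions made for the example; the proposal itself may use a different formula.

```python
def weighted_load(loads, weights):
    """Combine per-resource loads (0.0 to 1.0) into one score using weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # A resource missing from the load report counts as idle.
    return sum(weights[r] * loads.get(r, 0.0) for r in weights) / total


def pick_node(cluster, weights):
    """Return the name of the node with the lowest weighted load."""
    return min(cluster, key=lambda node: weighted_load(cluster[node], weights))


cluster = {
    "n1": {"cpu": 0.2, "io": 0.9},
    "n2": {"cpu": 0.6, "io": 0.1},
}
cpu_first = {"cpu": 3, "io": 1}  # policy favouring idle CPUs
io_first = {"cpu": 1, "io": 3}   # policy favouring idle disks
```

With these values, pick_node(cluster, cpu_first) selects n1 while pick_node(cluster, io_first) selects n2, which shows how changing the weights changes the placement decision without changing the scheduler code.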

We improved the existing prototype in terms of stability and performance. kDFS can now successfully execute the Bonnie++ benchmark and the NTFS-3G test suite [73]. Using the Kargo tool, we deployed kDFS on 48 nodes of the Grid'5000 platform. This experiment provided an 8.1 TB storage space. We built virtual images of Kerrighed including kDFS for QEMU and VMware systems and made them available to the community for testing. Finally, kDFS was ported to the kDDM standalone framework [30]. This demonstrates that kDFS only relies on the kDDM kernel-level data sharing mechanisms for inter-node communications and that it can be used without Kerrighed.

Grid operating systems

Participants : Surbhi Chitre, Matthieu Fertré, Jérôme Gallard, Yvon Jégou, Sylvain Jeuland, Adrien Lèbre, Christine Morin, Pierre Riteau, Thomas Ropars, Oscar Sanchez.

Vigne, a system for large-scale, dynamic Grids

Participants : Christine Morin, Thomas Ropars.

Our research aims at easing the execution of distributed computing applications on computational grids composed of a large number of geographically-distributed computing resources and characterized by a high churn. To ease the use of such dynamic, distributed systems, we propose to build a distributed operating system which provides a Single System Image (SSI), which is self-healing, and which can be tailored to the needs of the users. Such an operating system is composed of a set of distributed services, each of them providing a Single System Image for a specific type of resource, in a fault-tolerant way [29] . We are implementing this system on a research prototype called Vigne .

In the framework of Rajib Kummar Nath's internship, we worked on fault tolerance mechanisms to make critical Vigne services (application execution manager, memory coherence manager) highly available [68]. We have demonstrated that combining active replication and peer-to-peer techniques is an attractive solution to provide transparent high availability mechanisms for grid services. We have specified how to implement an active replication system based on consensus on top of a structured peer-to-peer network. The implementation of the proposed mechanisms is ongoing.

To provide fault tolerance for large-scale message passing applications, we proposed two protocols, O2P [42] and O2P-CF [38]. O2P, targeting clusters, is an optimistic message logging protocol that aims at reducing the amount of data piggybacked on the application messages for the needs of the optimistic protocol [42]. Experiments conducted on the Grid'5000 platform have shown that the amount of data piggybacked on messages has a significant impact on application performance. O2P-CF combines O2P with a pessimistic message logging protocol. It targets applications executed on cluster federations. We are implementing these two protocols in the Open-MPI library and experimenting with them in the context of the Vigne system.

XtreemOS Grid operating system

Participants : Surbhi Chitre, Matthieu Fertré, Jérôme Gallard, Yvon Jégou, Sylvain Jeuland, Adrien Lèbre, Christine Morin, Pierre Riteau, Thomas Ropars, Oscar Sanchez.

The objective of the XtreemOS project is to design, implement and promote a Linux-based Grid operating system providing native virtual organization support [54]. The scientific coordination of the XtreemOS European project is done by Ch. Morin, assisted by O. Sanchez, Technical Manager and Release Manager, and S. L'Hermitte, Project Office Assistant [61], [62].

In 2008, the research activities of the Paris Project-Team focused on the design and implementation of virtual organization and security services and of a Grid checkpointing service, on the study of virtualized environments, and on the design and implementation of LinuxSSI, which leverages the Kerrighed SSI operating system for the cluster flavour of the XtreemOS system. Our work on LinuxSSI is described in Section 6.2.1.

A key feature of XtreemOS is its support for Virtual Organizations (VOs). We participated in the design of the XtreemOS approach for VO management in close collaboration with ICT, STFC and TID [15], [41]. We contributed to the integration of the VO and security services with the other XtreemOS services: application execution management (AEM), XtreemFS Grid file system, overlays, LinuxSSI [67], [66]. The current XtreemOS prototype does not properly address dynamic VOs. VOs are dynamic in a number of directions: addition and removal of users and resources, creation and deletion of subVOs, addition and removal of users and resources in subVOs, creation and deletion of attributes, addition and removal of user attributes, generation and invalidation of identity and attribute certificates, automatic VO generation when a new project is set up, and VO federation. We have started to revisit the XtreemOS VO and security management services to support dynamic VOs.

We contributed, in close collaboration with the University of Duesseldorf and BSC, to the design and implementation of the XtreemOS grid checkpointing service [32], [33], [59]. This service, which comprises three layers, is in charge of ensuring reliable application execution despite failures. It selects and applies the fault tolerance policy, manages data related to the application life cycle, and coordinates the fault tolerance actions for distributed applications spanning multiple Grid nodes. It interacts with the AEM service, which monitors the jobs and takes suspend, restart and migration decisions, and with the XtreemFS Grid file system, which is used to store checkpoints. We started to study the integration of the O2P-CF checkpointing protocol in the framework of the XtreemOS Grid checkpointing service.

We have also investigated the use of virtualization technologies in the context of XtreemOS and studied scenarios of use of XtreemOS in the area of Cloud computing [74]. We have initiated a study aiming at optimizing the performance of the migration of virtual clusters. Continuing the study on virtual clusters initiated in 2007 [28], we have proposed an extension of Goldberg's model [55]. Goldberg classifies virtualization techniques in two models (Type-I and Type-II). This classification cannot account for more recent virtualization techniques such as abstraction, emulation and partitioning. Our extension formally defines the following terms: virtualization, emulation, abstraction, partitioning, and identity. We also demonstrate that a single virtualization solution is generally composed of several layers of virtualization capabilities, depending on the granularity of the analysis.

In the framework of Oana Goga's internship [65], we designed and implemented a framework to (i) deploy virtual machines on the Grid'5000 platform and (ii) deploy Kerrighed on a set of Grid'5000 resources (physical or virtual nodes). Two software tools have been implemented: (i) VMdeploy, to deploy virtual machines on top of Grid'5000, and (ii) Kargo, to deploy Kerrighed on top of physical or virtual nodes [56].

Oscar Sanchez, as release manager, coordinated the production and the testing of the first integrated version of the XtreemOS Grid operating system, publicly released in November 2008 [71], [75]. A permanent, geographically distributed testbed made up of several computers provided by different XtreemOS partners has been set up and used for testing and demonstrating the XtreemOS prototype.

