Section: New Results
Operating system and runtime for clusters and grids
Cluster operating systems
Participants : Marko Obrovac, Christine Morin, Eugen Feller.
Evaluation of LinuxSSI single system image operating system for clusters
In 2009, we carried out an extensive performance evaluation of LinuxSSI, the foundation layer of the XtreemOS cluster flavour, which is based on the Kerrighed Linux-based single system image operating system. In particular, we evaluated the kDFS distributed parallel file system, the global scheduling policy and the checkpointing mechanisms.
Energy management in clusters
In 2009, we initiated a study on energy consumption management in clusters. This work is carried out in the framework of the Eco-grappe ANR project (PhD thesis of Eugen Feller). The objective is to adapt the cluster configuration (hardware parameters, number of nodes) to the actual workload in order to save energy. Experiments will be carried out with the Kerrighed open source cluster operating system.
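As an illustration of adapting the number of nodes to the workload, the following minimal sketch computes how many nodes could stay powered for a given load. It is not the policy studied in the project: the function name, the notion of a uniform per-node capacity and the headroom parameter are all assumptions made for this example.

```python
import math


def nodes_to_keep_on(load, per_node_capacity, total_nodes, min_nodes=1, headroom=0.0):
    """Return how many nodes should stay powered for the given load.

    All parameters are hypothetical: `load` and `per_node_capacity` use the
    same arbitrary unit (e.g. runnable tasks), `headroom` is a safety margin
    expressed as a fraction of the load.
    """
    if per_node_capacity <= 0:
        raise ValueError("per_node_capacity must be positive")
    wanted = math.ceil(load * (1.0 + headroom) / per_node_capacity)
    # Never go below the minimum, never exceed what the cluster has.
    return max(min_nodes, min(total_nodes, wanted))


if __name__ == "__main__":
    for load in (0, 35, 1000):
        print(load, "->", nodes_to_keep_on(load, per_node_capacity=10, total_nodes=8))
```

Nodes beyond the returned count are candidates for being powered down; a real policy would also have to account for the cost of migrating or restarting the processes they host.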
Grid operating systems
Participants : Surbhi Chitre, Marko Obrovac, Jérôme Gallard, Yvon Jégou, Sylvain Jeuland, Peter Linnell, Christine Morin, Pierre Riteau, Thomas Ropars.
Access Control and Interactive Jobs
XtreemOS aims to provide grid users with an interface similar to their usual Linux desktop interface: applications are run on the grid as if they were executed on the local desktop. To provide this interface, XtreemOS must support single sign-on (no need to authenticate each time an application is run), delegation (applications running on the grid have the same capabilities as applications running on the desktop) and interactive applications (users can interact with their applications through ttys and their desktop display, grid applications can participate in pipes, ...). The single sign-on and delegation system proposed by Yvon Jégou for XtreemOS has been selected by the security work package members and is currently being implemented. Yvon Jégou is also in charge of providing support for interactive applications. The proposed mechanism has been implemented and has been available since the 2.0 release of XtreemOS.
Dynamic Virtual Organizations and Execution of Coordinated Services
A key feature of XtreemOS is its support for Virtual Organizations (VOs). The current XtreemOS prototype does not properly address dynamic VOs. VOs are dynamic in a number of directions: addition and removal of users and resources, creation and deletion of attributes, addition and removal of user attributes, generation and invalidation of identity and attribute certificates, automatic VO generation when a new project is set up, and VO federation. Moreover, the execution of service-based applications in the XtreemOS framework was not addressed in the design of the first version of the system.
In 2009, we designed and implemented the XChor functionality, which permits users to communicate, exchange data and run jobs in a collaborative workflow. The multi-party aspect of choreography (as opposed to orchestration) is favoured because no centralized leader controls the execution. In our case, a choreography is a peer-to-peer collaboration of users who accomplish a common goal, where user behaviors are defined by job dependencies and ordered message exchanges. An XtreemOS choreography (XChor) is executed in a Dynamic Virtual Organization.
Since the long-lived VOs do not permit the building of XChors, we have introduced the Dynamic Virtual Organization (DVO) concept to support short-lived multi-user collaborations. A Dynamic Virtual Organization is generated on-the-fly when partners from different administration domains decide to run a choreography. Once all users have been registered and have retrieved their DVO credentials, the XtreemOS choreography is triggered and the collaborative jobs are submitted to XtreemOS Grid system resources according to the defined workflow.
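The lifecycle described above (register every partner, hand out DVO credentials, then submit jobs according to their dependencies) can be sketched as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, credentials are placeholder strings, and the submission order is computed with Python's standard graphlib module rather than by any XtreemOS component. Each peer could derive the same order locally from the shared job dependencies.

```python
from graphlib import TopologicalSorter


class DynamicVO:
    """Hypothetical short-lived VO created for one choreography."""

    def __init__(self, expected_users):
        self.expected = set(expected_users)
        self.credentials = {}

    def register(self, user):
        if user not in self.expected:
            raise ValueError(f"{user} is not a partner of this DVO")
        # Placeholder credential; a real system would issue a certificate.
        self.credentials[user] = f"dvo-cred-{user}"
        return self.credentials[user]

    def ready(self):
        # The choreography may start once every partner holds credentials.
        return set(self.credentials) == self.expected


def run_choreography(dvo, jobs):
    """jobs maps a job name to (owner, set of prerequisite job names).

    Returns the (job, owner) pairs in an order that respects dependencies.
    """
    if not dvo.ready():
        raise RuntimeError("not all partners have registered")
    graph = {name: deps for name, (_, deps) in jobs.items()}
    order = TopologicalSorter(graph).static_order()
    return [(name, jobs[name][0]) for name in order]


if __name__ == "__main__":
    dvo = DynamicVO(["alice", "bob"])
    dvo.register("alice")
    dvo.register("bob")
    jobs = {
        "prepare": ("alice", set()),
        "analyse": ("bob", {"prepare"}),
        "report": ("alice", {"analyse"}),
    }
    for job, owner in run_choreography(dvo, jobs):
        print(f"submit {job} on behalf of {owner}")
```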
A Checkpointing Service for the Grid
XtreemGCP is a service of the XtreemOS grid system that provides grid applications with fault tolerance. It is able to apply different fault tolerance strategies and to make use of the various kernel checkpointers available on the grid nodes. In 2009, we implemented a new kernel checkpointer that exploits OpenVZ features to suspend, checkpoint and restart jobs executed on individual PC nodes, and integrated it in the XtreemGCP service. We also improved the LinuxSSI checkpointer by providing callbacks, in order to integrate it in the XtreemGCP service. All these new functionalities have been integrated in the second XtreemOS release. In the context of his Master internship, Eugen Feller also implemented and integrated an independent checkpointing protocol in the XtreemGCP service, which makes it possible to checkpoint and restart distributed applications executed on heterogeneous Grid nodes (PCs, clusters). This work demonstrates the genericity of the XtreemGCP service, which is able to exploit different kernel checkpointers (BLCR, LinuxSSI checkpointer, OpenVZ-based checkpointer) and to drive different checkpointing protocols for distributed applications (coordinated checkpointing, independent checkpointing). This work has been carried out in close collaboration with the Heinrich-Heine Universitaet Duesseldorf, Germany (Michael Schoettner's group). Future work includes the integration of a Grid application checkpointing protocol based on the O2P protocol described below, and the use of checkpoint/restart functionalities to migrate jobs in a Grid.
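The genericity argument can be illustrated with a small sketch in which a grid-level service drives whichever kernel checkpointer is available on each node through one common interface. The class and method names below are hypothetical and do not reflect the actual XtreemGCP API; the in-memory checkpointer merely stands in for BLCR, the LinuxSSI checkpointer or the OpenVZ-based checkpointer.

```python
from abc import ABC, abstractmethod


class KernelCheckpointer(ABC):
    """Hypothetical common interface over node-level checkpointers."""

    @abstractmethod
    def checkpoint(self, job_unit):
        """Return an image from which the job unit can be restarted."""

    @abstractmethod
    def restart(self, image):
        """Rebuild the job unit state from an image."""


class InMemoryCheckpointer(KernelCheckpointer):
    """Stand-in for a real kernel checkpointer; copies a dict."""

    def __init__(self, name):
        self.name = name

    def checkpoint(self, job_unit):
        return {"by": self.name, "state": dict(job_unit)}

    def restart(self, image):
        return dict(image["state"])


class GridCheckpointService:
    """Dispatches to the checkpointer installed on each node."""

    def __init__(self, checkpointer_for_node):
        self.checkpointer_for_node = checkpointer_for_node

    def checkpoint_job(self, job):
        # job maps a node name to the state of the job unit running there.
        return {node: self.checkpointer_for_node[node].checkpoint(unit)
                for node, unit in job.items()}

    def restart_job(self, images):
        return {node: self.checkpointer_for_node[node].restart(image)
                for node, image in images.items()}


if __name__ == "__main__":
    service = GridCheckpointService({
        "pc-1": InMemoryCheckpointer("pc-checkpointer"),
        "cluster-1": InMemoryCheckpointer("cluster-checkpointer"),
    })
    job = {"pc-1": {"step": 3}, "cluster-1": {"step": 7}}
    images = service.checkpoint_job(job)
    print(service.restart_job(images) == job)
```

A checkpointing protocol for distributed applications (coordinated or independent) would sit above `checkpoint_job` and decide when each node-level checkpoint is taken.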
Fault Tolerance for Message Passing based Applications
O2P is an optimistic message logging protocol that targets large-scale message passing applications and that has been implemented in the Open MPI library. O2P is based on active optimistic message logging, a new message logging strategy that makes it more scalable than existing optimistic message logging protocols. The scalability limit of O2P is the event logger, i.e. the centralized process used to manage logging on stable storage. This centralized event logger was proposed in previous work on message logging protocols. In 2009, we proposed a new, completely distributed event logger designed for scalability. This event logger makes use of the memory of the computation nodes to implement stable storage. Experiments show that this new event logger makes O2P more scalable than a centralized event logger.
A new message logging model has been defined in the context of the MPI standard, which reduces the number of events that a message logging protocol has to log on stable storage. In that context, we have shown, in collaboration with Aurélien Bouteiller from the University of Tennessee, that the main cost of message logging protocols is no longer due to event logging on stable storage, but to sender-based message logging.
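To make the two costs discussed above concrete, the following toy sketch separates sender-based message logging (each sender keeps the payloads it sent in its own memory) from event logging (the receiver records the order in which it delivered messages). It illustrates the general technique only and is not the O2P protocol; the class names are hypothetical and a plain Python list stands in for the event logger.

```python
class Process:
    """Toy process using sender-based message logging."""

    def __init__(self, name, event_log):
        self.name = name
        self.event_log = event_log  # stands in for the event logger
        self.sent = {}              # sender-based log: (dest, seq) -> payload
        self.next_seq = 0
        self.delivered = []

    def send(self, dest, payload):
        seq = self.next_seq
        self.next_seq += 1
        # The payload is kept in the sender's memory, not on stable storage.
        self.sent[(dest.name, seq)] = payload
        dest.deliver(self, seq, payload)

    def deliver(self, sender, seq, payload):
        # Only the delivery event (who, which message) is logged.
        self.event_log.append((self.name, sender.name, seq))
        self.delivered.append(payload)


def replay(failed_name, event_log, processes):
    """Rebuild the delivery sequence of a failed process.

    Payloads come from the senders' logs, the order from the event log.
    """
    replayed = []
    for receiver, sender, seq in event_log:
        if receiver == failed_name:
            replayed.append(processes[sender].sent[(failed_name, seq)])
    return replayed


if __name__ == "__main__":
    log = []
    procs = {name: Process(name, log) for name in ("a", "b", "c")}
    procs["a"].send(procs["c"], "x")
    procs["b"].send(procs["c"], "y")
    procs["a"].send(procs["c"], "z")
    print(replay("c", log, procs) == procs["c"].delivered)
```

In this sketch every payload is copied into the sender's log, which is the part of the work that remains even when fewer delivery events need to be recorded.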
High Availability for Grid Services
To provide high availability and self-healing for stateful services, we have proposed a new framework called Semias, which combines peer-to-peer techniques with active replication. Active replication makes services highly available. We have designed and implemented a software stack comprising a failure detector, consensus, atomic broadcast and group membership protocols, and a reconfiguration service. In the framework of Sébastien Gillot's internship, we have validated the implementation of the consensus protocol (Paxos) and of the atomic broadcast protocol, using the Splay simulator. A tool to automatically deploy the Splay framework on Grid'5000 has been developed. One of the main challenges in service replication is to handle reconfigurations. In the context of Stefania Costache's Master internship, we have proposed a group monitoring layer which is in charge of gathering monitoring information about grid nodes in order to take appropriate reconfiguration decisions based on some safety conditions. To our knowledge, Semias is the first implementation of atomic broadcast on top of a structured peer-to-peer overlay. The peer-to-peer overlay provides fault-tolerant routing mechanisms and makes replication completely transparent for the clients. Thus, existing services can be replicated using Semias with very few modifications, limited to the service state transfer. To validate our approach, we have used Semias to make services of the Vigne grid system highly available and self-healing. Experiments run on the Grid'5000 testbed show that Semias can replicate services with a very small overhead and can efficiently and automatically handle failures.
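The principle behind active replication can be shown with a minimal sketch: if every replica applies the same commands in the same total order, all replicas end up in the same state. In the code below, a single in-process log stands in for the atomic broadcast layer, and a new replica catches up by replaying that log, which is a simplification of a real state transfer. None of the names come from Semias; they are assumptions made for this example.

```python
class Replica:
    """Deterministic key-value service replica."""

    def __init__(self):
        self.state = {}
        self.applied = 0  # number of commands applied so far

    def apply(self, index, command):
        if index != self.applied:
            raise RuntimeError("commands must be applied in log order")
        key, value = command
        self.state[key] = value
        self.applied += 1


class TotalOrderLog:
    """Stand-in for atomic broadcast: one agreed sequence of commands."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.log = []

    def broadcast(self, command):
        index = len(self.log)
        self.log.append(command)
        for replica in self.replicas:
            replica.apply(index, command)
        return index

    def add_replica(self, replica):
        # Simplified reconfiguration: the newcomer replays the whole log.
        for index, command in enumerate(self.log):
            replica.apply(index, command)
        self.replicas.append(replica)


if __name__ == "__main__":
    r1, r2 = Replica(), Replica()
    group = TotalOrderLog([r1, r2])
    group.broadcast(("job-42", "running"))
    group.broadcast(("job-42", "done"))
    r3 = Replica()
    group.add_replica(r3)
    print(r1.state == r2.state == r3.state)
```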
Scalable failure information base
A very important part of ensuring reliability in distributed systems is failure detection: being able to accurately determine when a node has failed, and to select the best nodes for a job. In the framework of Catalin Leonardu's internship, we have proposed a storage system for nodes' failure history information. This solution aims at increasing the reliability of a distributed system by providing failure information which is as accurate as possible, even in a very dynamic system. Evaluations made through simulation with Splay show the scalability of the system.
Federated Virtualized Infrastructures
Participants : Jérôme Gallard, Yvon Jégou, Christine Morin, Thierry Priol, Pierre Riteau.
Virtual infrastructure management
In 2009, we further investigated the design of systems that allow resources to be shared in a peer-to-peer manner. System virtualization is an enabling technology, as it decouples the software from the hardware and avoids the deployment of application-specific systems and libraries on the infrastructure nodes.
Based on our preliminary work on the VMdeploy environment, we have built the Saline system, which allows the Grid-wide management of applications run in virtual machines. Saline is now able to manage both regular and best-effort jobs. It has been interfaced with the OAR resource reservation system and tested on the Grid'5000 platform. In particular, we have designed and implemented a service managing virtual machine IP addresses in a Grid.
In 2009, we also studied the dynamic extension of the XtreemOS operating system in order to take advantage of cloud computing infrastructures. In the framework of Eliana-Dina Tirsa's internship, we have designed and implemented a service enabling the automatic deployment of the XtreemOS system in a set of virtual machines provisioned from a Nimbus cloud. Thus, we are now able to extend a Grid infrastructure with resources from an infrastructure as a service (IaaS) cloud. In future work, we will make the proposed service more generic and interface it with various kinds of IaaS clouds. We also plan to study resource management policies in the context of federated virtualized infrastructures.
In the framework of the SER-OS associated team, we have worked on the design and implementation of a novel tool for managing virtualized infrastructures comprising high performance computing clusters, massively parallel processing (MPP) systems, and grids. This management tool integrates three main concepts: (i) Virtual System Environments (VSEs) describing the application requirements in terms of software configuration, (ii) Virtual Organizations (VOs) defining sets of resources shared among user communities, and (iii) Virtual Platforms (VPs) describing the application requirements in terms of hardware platform.
Virtual machine migration
In relation to the previously described research activities, another research direction focuses on the management of sets of virtual machines, with the aim of improving the migration and storage of virtual machines in grid and cloud computing environments.
Best-effort jobs running in a set of virtual machines need to be suspended and saved to permanent storage to free a set of nodes in order to execute higher priority jobs. We have developed the Kget+ tool, which enables fast removal and storage of a set of virtual machines from a set of nodes without contention on the target storage system. This tool is exploited in the Saline system.
We have investigated the use of distributed content addressing to enable efficient live migration of virtual machines over wide-area networks. We designed a customized live migration protocol that only sends cryptographic hashes of the virtual machine pages to the destination node, which results in much less network traffic. The destination node then leverages a distributed hash table on the remote site to find VM pages that are already present in its local network. VM pages that cannot be found are requested from the source node.
A prototype has been implemented as a modification of the QEMU/KVM hypervisor. This prototype is currently being evaluated on the Grid'5000 platform.
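The hash-based transfer described above can be sketched in a few lines. This is an illustration under explicit assumptions, not the QEMU/KVM prototype: SHA-256 is chosen here only as an example of a cryptographic hash, a Python dict stands in for the distributed hash table at the destination site, and the function names are hypothetical.

```python
import hashlib


def page_digest(page: bytes) -> str:
    # SHA-256 is an assumption; the source only says "cryptographic hashes".
    return hashlib.sha256(page).hexdigest()


def migrate(source_pages, destination_store):
    """Rebuild the VM memory at the destination.

    source_pages: list of page contents held by the source node.
    destination_store: dict digest -> page, standing in for the DHT that
    indexes pages already present at the destination site.
    Returns (rebuilt pages, number of pages fetched from the source).
    """
    # Step 1: only the digests are sent over the wide-area link.
    digests = [page_digest(page) for page in source_pages]
    source_index = dict(zip(digests, source_pages))

    rebuilt, fetched = [], 0
    for digest in digests:
        page = destination_store.get(digest)
        if page is None:
            # Step 2: page unknown at the destination, request it from the source.
            page = source_index[digest]
            destination_store[digest] = page
            fetched += 1
        rebuilt.append(page)
    return rebuilt, fetched


if __name__ == "__main__":
    zero, data = bytes(4096), b"\x01" * 4096
    store = {page_digest(zero): zero}  # the destination already holds this page
    pages, fetched = migrate([zero, data, zero, data], store)
    print(len(pages), "pages rebuilt,", fetched, "fetched from the source")
```

In this sketch a page that appears several times in the VM, or that already exists at the destination, crosses the wide-area link at most once.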
In the context of Djawida Dib's Master internship, we have initiated work on network-transparent live migration, which allows virtual machines to migrate without any impact on their network connections. This will enable parallel jobs relying on network communications (e.g. MPI programs) to live migrate to a remote site.
XtreemOS Release and Deployment Tools
Yvon Jégou and Peter Linnell, as release manager, coordinated the production and the testing of the second integrated version of the XtreemOS Grid operating system (XtreemOS V2.0), publicly released in November 2009 (http://www.xtreemos.eu/software). We have contributed to the XtreemOS admin and user guides. The permanent, geographically distributed testbed made up of several computers provided by different XtreemOS partners has been updated with the new XtreemOS release and used for testing and demonstrating the XtreemOS prototype.
In 2009, we developed a set of tools and environments to facilitate the deployment of the XtreemOS system on various infrastructures. We have designed and implemented a configuration tool to facilitate the deployment of the XtreemOS system on a Grid made up of physical or virtual machines. Moreover, pre-configured sets of virtual machines for the KVM and VirtualBox hypervisors have been produced, documented and made available to the XtreemOS consortium and the open source community (http://www.xtreemos.eu/software). Tools to automatically deploy the XtreemOS PC and cluster flavours on the Grid'5000 platform have been developed and made available to the XtreemOS consortium.