Section: New Results
Keywords: cluster, cluster federation, grid, operating system, distributed system, single system image, global scheduling, process migration, high performance communication, data stream migration, distributed shared memory, distributed file system, resource management, checkpointing, high availability, MPI, peer-to-peer (peer to peer, p2p), self-organizing system, self-healing system, fault tolerance.
Operating system and runtime for clusters and grids
The Paris Project-Team is engaged in the design and development of Kerrighed, a genuine Single System Image (SSI) cluster operating system for general-purpose, high-performance computing. A genuine SSI offers users and programmers the illusion that a cluster is a single high-performance, highly available computer rather than a set of independent machines interconnected by a network. An SSI should offer four properties:
- Resource distribution transparency:
Offering processes transparent access to all resources, and resource sharing between processes whatever the resource and process location.
- High performance.
- High availability:
Tolerating node failures and allowing application checkpoint and restart.
- Dynamic system reconfiguration:
Node addition and eviction, transparently to applications.
In 2006, a major refactoring of Kerrighed was carried out: the previous stable version of the system, based on the Linux 2.4 kernel, was ported to the Linux 2.6.11 kernel.
Kerrighed was also ported to the User Mode Linux (UML) virtual machine architecture in 2006. The UML version of Kerrighed facilitates debugging of the system and is useful for demonstration purposes.
The robustness of Kerrighed has been significantly enhanced, and several new functionalities have been implemented, such as high-availability mechanisms that automatically reconfigure Kerrighed services when a node is hot-added or evicted. Kerrighed V2.0 was released at the end of the COCA contract in March 2006.
In 2006, we evaluated the potential of Kerrighed for application domains other than scientific computing: bio-informatics (internship of Jérôme Gallard) and Web services (internship of Robert Guziolowski).
A start-up, Kerlabs SARL (http://www.kerlabs.com/), was created in October 2006 by Pascal Gallard, Renaud Lottiaux and Louis Rilling in order to transfer the Kerrighed technology. Kerlabs has been hosted by the Inria Emergys incubator since February 2006. Kerlabs will continue the development and industrialization of the Kerrighed technology to deliver systems specifically suited to cluster management. Kerlabs intends to promote and develop a community of users and developers around the original Kerrighed free software.
OSCAR (http://oscar.openclustergroup.org/) is a distribution for Linux clusters which provides a snapshot of the best known methods for building, programming and using clusters. We have worked with the OSCAR Team at the Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA, to maintain the SSI-OSCAR package, which integrates Kerrighed in the OSCAR software suite. By combining OSCAR with the Kerrighed Single System Image operating system, a cluster becomes easy to install, administer, use and program.
Since 2005, the SSI-OSCAR package has been a standard third-party OSCAR package. As such, SSI-OSCAR is offered automatically, like all other official OSCAR packages (only core packages are directly included in the OSCAR suite; third-party packages are available through on-line repositories). In 2006, we implemented a tool that automatically generates RPM and Debian packages, as well as the corresponding SSI-OSCAR packages, from the Kerrighed source code. We have revisited the Kerrighed build process to conform to the de facto build standards used in the Linux community. We have also participated in the design of OSCAR 5.0, which implements a more modular architecture than previous OSCAR versions. The SSI-OSCAR packages have been adapted to conform to this new OSCAR version.
Tool for testing Kerrighed.
In the framework of the ODL contract of Jean Parpaillon (Opération de développement logiciel, an Inria short-term software development initiative), we have implemented a software testing tool. The goal of this tool is to test the Kerrighed software and the SSI-OSCAR packages automatically. The tool builds a Kerrighed binary image from the source code of the development version (available from the Inria Gforge repository). Test results are made available to developers through a DART server. Thanks to this tool, the compilation process of Kerrighed is now tested automatically every night, which allows regressions introduced during the development of Kerrighed to be detected. We plan to extend this tool to perform execution tests as well, for instance to validate the conformance of the Kerrighed system to the POSIX standard, using traditional test suites provided by the Linux community. The main issue is dealing with failing tests that crash the Kerrighed system.
Distributed file system.
KerFS, a Kerrighed distributed file system, has been designed and implemented to exploit the disks attached to cluster nodes. KerFS provides a unique, cluster-wide name space and allows the files of a directory to be stored on multiple disks throughout the cluster. It is implemented on top of the container concept, originally proposed for global memory management. Containers are used to manage not only memory pages, but also generic objects. The meta-data structures of the file system are kept consistent cluster-wide using object containers.
In 2006, in the framework of the XtreemOS project, we started to revisit the design of KerFS to improve its performance. We have worked on providing parallel access to files that are split and/or replicated on the cluster disks. We have also designed an I/O scheduler to improve the behavior of the disk I/O system when the cluster nodes execute multiple concurrent applications. We are now targeting the implementation of the proposed mechanisms. We will also study how fault-tolerance mechanisms can be integrated in KerFS.
One of the advantages of an SSI operating system for clusters is that users can launch applications interactively, in the same way as they do on a single PC running Linux. However, when multiple users launch applications on the same cluster (typically when a cluster is used as a departmental server), the workload may exceed the cluster capacity. One way to avoid this situation is to run a batch system on top of the SSI operating system. However, this makes application submission more difficult for users, who must provide a description of their applications.
We have investigated a different approach, which consists in integrating a fork-delay mechanism in an SSI cluster operating system to delay the execution of processes when the cluster is overloaded (Jérôme Gallard internship). When an application is launched with the fork-delay capability enabled, its processes are queued if the cluster is overloaded. When a process terminates, the global scheduler resumes the execution of delayed processes, if any. At any time, if the cluster load is too high, the global scheduler may decide to suspend the execution of some processes. We have validated this approach by implementing a first prototype in Kerrighed. We plan to refine the distributed management of the delayed process queue to take into account dependencies between processes belonging to the same application (gang scheduling).
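The queueing behaviour described above can be sketched as follows. This is a minimal single-node model, not Kerrighed code: the class name `ForkDelayScheduler`, the fixed `capacity` threshold used as the overload criterion, and the FIFO resume order are all assumptions made for illustration, and the suspension of already running processes is left out.

```python
from collections import deque


class ForkDelayScheduler:
    """Toy model of a fork-delay policy: run at most `capacity` processes."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed overload threshold
        self.running = set()          # process ids currently executing
        self.delayed = deque()        # FIFO queue of delayed process ids

    def fork(self, pid):
        """Start `pid` now if the cluster has room, otherwise queue it."""
        if len(self.running) < self.capacity:
            self.running.add(pid)
            return "running"
        self.delayed.append(pid)
        return "delayed"

    def exit(self, pid):
        """Record termination of `pid` and resume one delayed process, if any."""
        self.running.discard(pid)
        if self.delayed and len(self.running) < self.capacity:
            resumed = self.delayed.popleft()
            self.running.add(resumed)
            return resumed
        return None


sched = ForkDelayScheduler(capacity=2)
states = [sched.fork(pid) for pid in ("p1", "p2", "p3")]
resumed = sched.exit("p1")  # frees a slot, so the queued process is resumed
```

In this toy run the third process is queued because two are already running, and it is resumed as soon as one of them exits. A gang-scheduling refinement would resume all queued processes of one application together rather than one at a time.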
Kerrighed is a distributed system made up of co-operating kernels executing on the cluster nodes. Therefore, a node failure has a significant impact on the operating system itself, not only on the applications executing on top of it. We have implemented a generic service, to be used by the various services composing the Kerrighed operating system, that enables their automatic reconfiguration when a node is added to or removed from the cluster.
Monitoring is of the utmost importance for robust computing. It is needed for failure and attack detection, and can also be used for system management and load balancing. A monitoring system must have several properties. It should be non-intrusive (no need to modify the target OS), tamper-proof (no possible intrusion), and autonomous (no involvement of the target OS). It should provide a consistent view of the distributed OS state, and be customizable for flexibility. Moreover, it should be based on fail-safe communications.
In the context of the Phenix Associated Team, we have investigated the design of a monitoring system for Kerrighed based on the backdoor architecture developed in the DiscoLab laboratory of Rutgers University (Benoît Boissinot internship). Autonomy is achieved by basing the monitoring system on virtualization technology.
Our proposal is a distributed virtual backdoor architecture. To monitor the operating system running on a PC, the backdoor and the monitored OS are executed in different virtual machines on top of a virtual machine monitor. The main implementation issue is the extraction of the OS state from memory. As Kerrighed is a distributed system running on multiple machines, a co-operation protocol has been designed to compute a consistent global state from the partial information gathered by each virtual backdoor. We have implemented the proposed architecture on top of the Xen virtual machine monitor to demonstrate how distributed virtual backdoors can co-operate to monitor a distributed state.
Grid-aware Operating System
The Vigne System.
Our research aims at easing the execution of distributed computing applications on computational grids. These grids are composed of a large number of geographically-distributed computing resources. This large-scale distribution makes the system dynamic: failures of single resources are frequent (interconnecting network failures, and machine failures), and any participating entity may decide at any time to add or remove nodes from the grid.
To ease the use of such dynamic, distributed systems, we propose to build a distributed operating system which provides a Single System Image, is self-healing, and can be tailored to the needs of the users. Such an operating system is composed of a set of distributed services, each of them providing a Single System Image for a specific type of resource, in a fault-tolerant way. We are implementing this system in a research prototype called Vigne. Experimental evaluations are carried out on the Grid 5000 research grid. The work carried out in 2006 is twofold.
First, we have mainly worked on four services of the Vigne system: resource discovery, resource allocation, monitoring and system interface. We have extended the resource discovery service by designing and implementing a new resource discovery protocol called Random Walk Optimized for Grid Scheduling (RW-OGS). RW-OGS uses learning and broadcasting strategies to improve the quality of the results obtained after a resource discovery. As a result, the resource discovery service provides less loaded resources to the resource allocation service, and the efficiency of the global resource allocation is increased.
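To make the idea of walk-based discovery concrete, here is a minimal sketch of a plain random walk that collects the least loaded nodes it visits. It is not RW-OGS: the learning and broadcasting strategies are omitted, and the function name, the ring overlay, and the load values are hypothetical.

```python
import random


def random_walk_discovery(neighbors, load, start, steps, wanted, rng):
    """Walk the overlay for `steps` hops; return the `wanted` least loaded nodes seen."""
    seen = {start}
    current = start
    for _ in range(steps):
        current = rng.choice(neighbors[current])  # hop to a random neighbor
        seen.add(current)
    # Rank visited nodes by load, breaking ties by name for determinism.
    return sorted(seen, key=lambda node: (load[node], node))[:wanted]


# Hypothetical 5-node ring overlay with made-up load values.
neighbors = {
    "a": ["b", "e"], "b": ["a", "c"], "c": ["b", "d"],
    "d": ["c", "e"], "e": ["d", "a"],
}
load = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.1, "e": 0.5}

candidates = random_walk_discovery(neighbors, load, "a", steps=20, wanted=2,
                                   rng=random.Random(0))
```

The quality of the result depends on how much of the overlay the walk happens to cover, which is the kind of limitation that learning from earlier discoveries can mitigate.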
The resource allocation service has been extended to handle co-scheduled tasks. Vigne provides system features for coordinating tasks before execution, and for enriching the environment of each task with additional information useful for co-scheduling (actual location of the co-scheduled tasks and machine file). Vigne is thus able to execute complex applications, such as MPI or master/worker applications. A monitoring service has been designed and implemented to provide grid users with fine-grained information about application execution (internship of Thomas Ropars). This service is suited to a large-scale grid since it consumes very little bandwidth. To monitor applications, it uses a pre-loaded dynamic library that overrides some system calls such as fork, waitpid or exit. It is thus able to detect crashes of application processes, which is a cornerstone of reliable application execution and fault-tolerance policies. The service also provides accurate information about resource consumption. The system interface has been extended to allow job submission to the OpenPBS batch scheduler through Vigne. Vigne thus now handles three kinds of resources: Linux workstations, Kerrighed clusters, and OpenPBS clusters.
Second, we have worked on the integration of Vigne with the SALOME platform for numerical simulation (http://www.salome-platform.org/). We have designed and implemented a plug-in in the SALOME platform that allows SALOME applications to be executed through Vigne. The plug-in wraps SALOME instructions for job submission into Vigne queries, and it automatically deploys the linked libraries of the SALOME applications. We have also extended Vigne to provide system features for handling input/output files, and for extending jobs from the SALOME platform.
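The crash-detection step can be illustrated by how a parent process distinguishes a normal exit, an error exit, and death by signal. The sketch below is not the Vigne library, which interposes on the calls inside the monitored application. It uses Python's standard `subprocess` module from the outside instead, and the helper names and status strings are assumptions. It requires a POSIX system.

```python
import subprocess
import sys


def classify_exit(returncode):
    """Map a child's return code to a coarse status for a monitoring report."""
    if returncode == 0:
        return "finished"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as the value -N.
        return "crashed (signal %d)" % -returncode
    return "failed (exit code %d)" % returncode


def run_and_classify(code):
    """Run a Python snippet in a child process and classify how it ended."""
    child = subprocess.run([sys.executable, "-c", code])
    return classify_exit(child.returncode)


ok = run_and_classify("pass")
failed = run_and_classify("import sys; sys.exit(3)")
crashed = run_and_classify(
    "import os, signal; os.kill(os.getpid(), signal.SIGKILL)")
```

A pre-loaded library obtains the same information without a cooperating parent, because it observes the fork, waitpid and exit calls made by the application itself.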
The XtreemOS Linux-based Grid Operating System.
The European Integrated Project XtreemOS, coordinated by Christine Morin, addresses Section 2.5.4 of the 2006 Work Programme: Advanced Grid Technologies, Systems and Services. It was launched in June 2006. The overall objective of the XtreemOS Project is the design, implementation, evaluation and distribution of an open source Grid operating system with native support for virtual organizations (VO). The proposed approach is the construction of a Grid-aware OS made up of a set of system services based on the traditional general-purpose Linux OS, extended as needed to support VOs and to provide appropriate interfaces to the Grid OS services. The XtreemOS consortium includes 19 academic and industrial partners. Various end-users are involved in the consortium, providing a wide range of test cases in scientific and business computing domains.
In 2006, apart from setting up the project, we have worked on the specification of the XtreemOS operating system along three main directions.
XtreemOS flavor for clusters.
We have specified the cluster flavor of XtreemOS, LinuxSSI, which leverages Kerrighed technology. We plan to develop an efficient cluster file system and to provide mechanisms to tolerate reconfiguration events in a scalable way.
We have started to design a modular Grid checkpointer architecture. The proposed architecture is hierarchical, involving a grid-level checkpointer, a system-level checkpointer, and a kernel-level checkpointer. The grid checkpointer is in charge of coordinating the checkpointing protocols for applications composed of several units executed on multiple grid nodes. The system checkpointer is in charge of checkpointing an application unit on a single grid node. The kernel checkpointer, triggered by the system checkpointer, extracts, saves and restores the state of a process or thread on a grid node.
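The division of responsibilities between the three levels can be sketched as nested delegation. This is only an illustration of the hierarchy, not the XtreemOS design: the class and method names are hypothetical, a process is modelled as a plain dictionary, and no coordination protocol between nodes is implemented.

```python
class KernelCheckpointer:
    """Extracts and restores the state of one process (here: a plain dict)."""

    def save(self, process):
        return dict(process)

    def restore(self, snapshot):
        return dict(snapshot)


class SystemCheckpointer:
    """Checkpoints one application unit, i.e. the processes on a single node."""

    def __init__(self):
        self.kernel = KernelCheckpointer()

    def checkpoint_unit(self, processes):
        return [self.kernel.save(p) for p in processes]


class GridCheckpointer:
    """Coordinates the checkpoint of all units of an application."""

    def __init__(self, nodes):
        self.systems = {node: SystemCheckpointer() for node in nodes}

    def checkpoint_application(self, units):
        # `units` maps a node name to the list of processes it hosts.
        return {node: self.systems[node].checkpoint_unit(procs)
                for node, procs in units.items()}


units = {
    "node1": [{"pid": 10, "counter": 3}],
    "node2": [{"pid": 7, "counter": 5}, {"pid": 8, "counter": 1}],
}
grid = GridCheckpointer(nodes=units)
checkpoint = grid.checkpoint_application(units)
units["node1"][0]["counter"] = 99  # later progress must not alter the checkpoint
restored = KernelCheckpointer().restore(checkpoint["node1"][0])
```

Keeping the kernel level behind a narrow save/restore interface is what would allow different kernel checkpointers to be plugged in under the same grid-level logic.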
In the standard flavor of XtreemOS for individual PCs, the kernel checkpointer will be based on BLCR, which is one of the most advanced open-source implementations of a checkpoint/restart system for Linux. We plan to augment BLCR with the following features:
- Save the shared libraries used by the process in the checkpoint, rather than assume that they will be present on the system when the process is restarted.
- Save the security context (VO-specific information) in the snapshot of a process.
- Extend saving of the snapshot from a specific file to a generic file descriptor, so that checkpoints can be stored in a grid object in the future.
- At restart, provide information to the restarted process about the changes in the environment (process id, IP address, host name).
We will also study checkpointing strategies to be implemented in the system and grid checkpointers for large-scale applications executed on top of XtreemOS.
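As an illustration of the planned features, the sketch below writes a checkpoint record to any writable binary stream and reports environment changes when it is read back. It does not use BLCR or its API: the JSON record layout, the function names, and the field names are assumptions, and the shared libraries are represented by their names only, whereas the feature described above would save the libraries themselves.

```python
import io
import json


def write_checkpoint(stream, state, libraries, security_context, environment):
    """Serialize one checkpoint record to any writable binary stream."""
    record = {
        "state": state,
        "libraries": libraries,                # shared libraries used by the process
        "security_context": security_context,  # VO-specific information
        "environment": environment,            # e.g. pid, host name at save time
    }
    stream.write(json.dumps(record).encode("utf-8"))


def read_checkpoint(stream, new_environment):
    """Load a record and report which environment fields changed since the save."""
    record = json.loads(stream.read().decode("utf-8"))
    old = record["environment"]
    changes = {key: (old.get(key), value)
               for key, value in new_environment.items()
               if old.get(key) != value}
    return record, changes


buffer = io.BytesIO()  # stands in for a file, a socket, or a future grid object
write_checkpoint(buffer, {"counter": 3}, ["libc.so.6"], {"vo": "demo-vo"},
                 {"pid": 10, "host": "node1"})
buffer.seek(0)
record, changes = read_checkpoint(buffer, {"pid": 42, "host": "node1"})
```

Because the functions only rely on `write` and `read`, the destination can be changed without touching the checkpoint logic, which is the point of moving from a specific file to a generic descriptor.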
Virtual Organization Management
We have specified the overall approach for Virtual Organization management in XtreemOS. VO management involves two levels: the VO level (or global level) and the node level (to be implemented as extensions to the Linux operating system running on each grid node). VO-level management includes membership management of users and nodes that join or leave a VO, policy management (e.g., group and role assignment), and runtime information management (e.g., querying active processes or jobs in a VO). The main responsibilities of node-level management include: translating grid identities into local identities; granting or denying access to resources (files, services, etc.); checking limits on resource usage (CPU wall time, disk quotas, memory, etc.); protecting and isolating the resource usage of different users; logging and auditing resource usage, etc.
XtreemOS supports VO management through the co-operation of VO-level and node-level management services. The key challenge here is to co-ordinate VO-level policies with the local policies on nodes, which are set by autonomous domain administrators. On the one hand, the enforcement of multiple VO security policies should be differentiated; on the other hand, this enforcement should not conflict with any local node policy, and it should not impair the usability of resources for grid users.
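The node-level responsibilities listed above can be sketched as a single authorization check that maps the identity, tests access, enforces a usage limit, and logs the decision. This is a toy model under stated assumptions, not the XtreemOS mechanism: the class name `NodePolicy`, the identity strings, the account names, and the CPU-seconds limit are all hypothetical.

```python
class NodePolicy:
    """Toy node-level service: identity mapping, access control, usage limits."""

    def __init__(self, identity_map, allowed, cpu_limits):
        self.identity_map = identity_map  # grid identity -> local account
        self.allowed = allowed            # local account -> set of resources
        self.cpu_limits = cpu_limits      # local account -> CPU seconds allowed
        self.audit_log = []               # (grid identity, resource, decision)

    def authorize(self, grid_identity, resource, cpu_seconds):
        """Decide whether a request may proceed, and record the decision."""
        local = self.identity_map.get(grid_identity)
        if local is None:
            decision = "deny: unknown identity"
        elif resource not in self.allowed.get(local, set()):
            decision = "deny: access"
        elif cpu_seconds > self.cpu_limits.get(local, 0):
            decision = "deny: cpu limit"
        else:
            decision = "grant"
        self.audit_log.append((grid_identity, resource, decision))
        return decision


policy = NodePolicy(
    identity_map={"/VO=demo/CN=alice": "grid001"},
    allowed={"grid001": {"/data/shared"}},
    cpu_limits={"grid001": 3600},
)
granted = policy.authorize("/VO=demo/CN=alice", "/data/shared", 600)
over_limit = policy.authorize("/VO=demo/CN=alice", "/data/shared", 7200)
unknown = policy.authorize("/VO=demo/CN=bob", "/data/shared", 10)
```

In this model the tables are filled in locally. Co-ordinating them with policies decided at the VO level is the harder problem.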