Project : paris
Topic : Operating system and runtime for clusters
Section : New Results
Keywords : Cluster, cluster federation, operating system, distributed system, single system image, global scheduling, process migration, multithreading, high performance communication, data stream migration, distributed shared memory, cooperative caching, remote paging, checkpointing, high availability, Pthread, OpenMP, MPI, peer-to-peer, self-organizing system, synchronization.
Operating system and runtime for clusters
Clusters are the most widely used general-purpose platforms for scientific computing and, according to the http://www.top500.org/ site, have become the dominant platform for high-performance computing. While the cluster architecture is attractive with respect to price/performance, there is still great potential for efficiency improvements at the software level. System software must improve to better exploit cluster hardware resources and to ease cluster programming.
Since 1999, the Paris Project-Team has been engaged in the design and development of Kerrighed, a genuine Single System Image (SSI) cluster operating system for general high-performance computing (Kerrighed is a registered trademark). A genuine SSI offers users and programmers the illusion that a cluster is a single high-performance and highly available computer, instead of a set of independent machines interconnected by a network. An SSI should offer four properties: (1) Resource distribution transparency, i.e., offering processes transparent access to all resources, and resource sharing between processes whatever the resource and process location; (2) High performance; (3) High availability, i.e., tolerating node failures and allowing application checkpoint and restart; and (4) Scalability, i.e., dynamic system reconfiguration, node addition and eviction, transparently to applications.
Current achievements with Kerrighed
In 2003, thanks to the recruitment of two expert engineers within the COCA Contract, we integrated the results obtained last year by several PhD students into a single prototype. Two major releases (Kerrighed V0.70 and V0.72) have been delivered. The robustness of Kerrighed has been significantly enhanced, and several new functionalities have been implemented. Kerrighed V0.72 is now suitable for the execution of applications provided by our industrial partners (EDF, DGA).
In 2003, we focused on the implementation of a configurable global scheduler for Kerrighed. It should be able to adapt the global scheduling policy to the workload characteristics, and to change it dynamically depending on the cluster load. All components of the Kerrighed scheduler can be hot-plugged or hot-stopped. A development framework that makes it easy to implement global scheduling policies has been designed and implemented. These schedulers may rely on new components, developed without any kernel modification, or on components already existing in Kerrighed. Some preliminary global scheduling policies have been tested experimentally.
In the near future, we plan to implement a number of global scheduling policies in Kerrighed and to evaluate them on workloads made up of real sequential and parallel applications provided by EDF. We will also design and implement a simple batch system exploiting Kerrighed features to meet users' requirements.
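The idea of a scheduler whose policy components can be plugged in and replaced at run time can be illustrated with a minimal user-level sketch. It is not the Kerrighed framework: the class GlobalScheduler, its methods, and the two toy policies are hypothetical names introduced only for illustration, and the load metric is assumed to be a single number per node.

```python
from typing import Callable, Dict, Optional

# Hypothetical policy signature: given the load of each node, return the
# name of the node that should receive the next process.
Policy = Callable[[Dict[str, float]], str]


def least_loaded(loads: Dict[str, float]) -> str:
    """Pick the node with the smallest load (ties broken by node name)."""
    return min(sorted(loads), key=lambda node: loads[node])


def make_round_robin() -> Policy:
    """Build a policy that cycles through nodes in name order, ignoring load."""
    state = {"next": 0}

    def policy(loads: Dict[str, float]) -> str:
        nodes = sorted(loads)
        choice = nodes[state["next"] % len(nodes)]
        state["next"] += 1
        return choice

    return policy


class GlobalScheduler:
    """Toy scheduler whose placement policy can be swapped while it runs."""

    def __init__(self) -> None:
        self._policies: Dict[str, Policy] = {}
        self._active: Optional[str] = None

    def plug(self, name: str, policy: Policy) -> None:
        """Register a policy component without stopping the scheduler."""
        self._policies[name] = policy

    def unplug(self, name: str) -> None:
        """Remove a policy component; the active one cannot be removed."""
        if name == self._active:
            raise ValueError("cannot unplug the active policy")
        del self._policies[name]

    def activate(self, name: str) -> None:
        """Switch the policy used for subsequent placement decisions."""
        if name not in self._policies:
            raise KeyError(name)
        self._active = name

    def place(self, loads: Dict[str, float]) -> str:
        """Ask the active policy where the next process should run."""
        if self._active is None:
            raise RuntimeError("no active policy")
        return self._policies[self._active](loads)
```

A caller would plug both policies, activate one, and later call activate again to change the policy when the cluster load changes, without recreating the scheduler object.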
Process migration in Kerrighed
In order to cope with the migration of communicating processes, we have designed the Kernet System. It supports the global management of any data stream (socket, pipe, char devices) in Kerrighed. In 2003, Unix and Inet sockets have been implemented on top of Kernet. Other data stream interfaces will be implemented in 2004. Performance evaluations have been carried out with MPI applications based on the MPICH environment. (Note that no modification of the MPICH environment is needed to execute it within Kerrighed.) They demonstrate that an MPI process can be migrated in Kerrighed without any performance degradation in the communications taking place after migration.
Kernet relies on the Gimli/Gloïn System. It is a portable, high-performance communication system, providing a kernel-level send/receive interface to Kerrighed distributed services. We have revisited the design of Gimli/Gloïn to obtain better performance, and to extend its interface with active messages and pack/unpack primitives. The implementation of the new Gimli/Gloïn architecture, which offers high-performance communication both at kernel level and at user level, is in progress. In addition, we plan to implement a new Gloïn device to better exploit the Myrinet technology.
Thread support in Kerrighed
Today, Kerrighed offers complete support for the Posix thread standard. This important result has been obtained thanks to our previous results on distributed shared memory. It also crucially relies on the work carried out in 2003 on the design and implementation of distributed mechanisms for proper thread termination, cluster-wide signal management, and distributed synchronization facilities (locks, barriers, etc.) compatible with preemptive thread migration. The Kerrighed Pthread interface has been validated by executing existing OpenMP applications compiled with the unmodified Omni 1.4 OpenMP compiler targeting Pthreads on SMP multiprocessors. The correct execution of the 288 tests provided with the Omni compiler to check its installation on a given architecture demonstrates that Kerrighed is now mature enough to support multithreading on a cluster. The OpenMP version of the HRM1D EDF application, which includes 7,000 lines of Fortran code, has also been successfully executed on Kerrighed. However, performance has to be improved. In the future, we plan to explore cluster-aware compilation methods that produce more efficient code for OpenMP programs. Such methods should also provide OpenMP developers with tools to better understand the performance bottlenecks of their applications so that they can tune their parallelized algorithms.
Further work on Kerrighed
Future work on Kerrighed is three-fold. First, we will continue the development of Kerrighed with the design of a distributed file system based on containers. We will also integrate checkpointing mechanisms for parallel applications. Kerrighed will be ported to a 64-bit architecture based on Opteron processors. Second, we will work further on high-availability issues. On the one hand, we will continue the work started this summer in cooperation with Rutgers University during the internship of Pascal Gallard. This work addresses high-availability issues in exploiting the read and write RDMA features provided in the latest generation of Myrinet adapters. On the other hand, we will work on dynamic reconfiguration mechanisms for Kerrighed distributed services. Last, we will pursue our efforts to extend the community of Kerrighed users. In cooperation with EDF, we plan to build an OSCAR package based on Kerrighed (SSI-OSCAR) during the post-doctoral internship of Geoffroy Vallée at Oak Ridge National Laboratory.
High performance cluster-wide I/O
High-performance I/O is of primary importance for the applications executed on clusters. Some applications, like numerical computation or VOD, demand high-bandwidth sequential accesses. Others, like mail or Web servers, benefit from low-latency data and meta-data operations. As of today, no cluster file system provides good performance across the whole range of access patterns.
Rather than putting forward another middleware, we explore a new approach to make the operating system capable of efficient distributed I/O. We propose to manage a cluster-wide cache, consisting of both data and meta-data, through Distributed Shared Memory (DSM) techniques. Our goal is to shorten the data path, and efficiently overlap I/O, communication and computation, in order to avoid dedicated I/O nodes and network-attached storage. With the I/O system we propose, it is possible to take advantage of Direct Memory Access (DMA) transfers, thus avoiding stress on the local bus of a node. Our system is also compatible with striping and mirroring as in software RAID systems.
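As an illustration of the striping mentioned above, the following sketch maps a byte offset in a file to the node that stores it and to the offset on that node. It assumes a plain round-robin layout with a fixed stripe size, as in RAID-0, and a mirror placed on the next node; these choices and the function names are assumptions made for the example, not a description of the prototype.

```python
from typing import Tuple


def locate(offset: int, stripe_size: int, num_nodes: int) -> Tuple[int, int]:
    """Map a file offset to (node index, offset on that node).

    Assumes stripes are distributed round-robin over the nodes.
    """
    if offset < 0 or stripe_size <= 0 or num_nodes <= 0:
        raise ValueError("invalid striping parameters")
    stripe = offset // stripe_size          # global stripe number
    node = stripe % num_nodes               # node holding that stripe
    local_stripe = stripe // num_nodes      # rank of the stripe on its node
    return node, local_stripe * stripe_size + offset % stripe_size


def mirror_of(node: int, num_nodes: int) -> int:
    """Hypothetical mirroring rule: the copy lives on the next node."""
    return (node + 1) % num_nodes
```

With a stripe size of 4 bytes on 3 nodes, offset 13 falls in global stripe 3, which is the second stripe stored on node 0, hence local offset 5.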
A first prototype has been implemented based on a modified version of the Linux Kernel. We plan to complete this prototype to validate our approach with respect to standard benchmarks. We also plan to provide an MPI-IO interface.
Checkpointing parallel applications in clusters
Backward error recovery involving checkpointing and restart of tasks is an important component of any system providing fault tolerance to applications distributed over a network. In Kerrighed, one of our objectives is to be able to checkpoint and restart any kind of scientific application, from sequential applications to parallel applications that communicate by message passing, shared memory, or both.
We have identified common mechanisms for implementing a wide variety of checkpointing and rollback recovery protocols for both message passing and distributed shared memory systems. The idea is to treat pages as separate entities similar to regular tasks, and provide a mechanism to track direct dependencies among tasks and memory pages. This mechanism is thus common to both distributed shared memory and message-passing systems. It can moreover be efficiently implemented, since the overhead of each interaction is very light, both in terms of computation and control information. The proposed mechanism can finally support several optimizations discussed in the literature.
We have carried out a first implementation of a coordinated checkpointing protocol within the Kerrighed cluster operating system for multithreaded applications. Checkpointing the private state of a thread involves the same basic mechanisms as those used for process migration. Additionally, a mechanism to checkpoint container pages incrementally has been implemented. As the checkpoint storage has a high impact on performance during fault-free executions, Kerrighed can save checkpoints either in the memory of two nodes or on disk. Synchronization is an important issue when designing a checkpoint/recovery protocol for shared memory applications. In fact, parallel applications traditionally use locks and barriers, which may introduce causal dependences between processes. We have studied how to extend our previous work on dependence tracking to deal with synchronization. Both locks and barriers introduce resource contention conflicts at recovery time, which need to be resolved by heuristics. Moreover, due to these conflicts, a unique latest recovery line is not defined, which requires using additional heuristics.
This work will be continued next year. We will finalize the implementation in Kerrighed of all the mechanisms needed to checkpoint and recover parallel applications communicating by message passing (for instance, MPI applications) or shared memory (for instance, OpenMP applications). A preliminary global coordinated checkpointing strategy will be evaluated. Comparison with other systems will be carried out in the framework of the Procope 2004 bilateral collaboration with the University of Ulm, Germany.
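The principle of treating tasks and pages uniformly can be sketched as follows. Every entity (task or page) has a checkpoint interval counter, each interaction records one direct dependency, and a standard rollback-propagation loop computes where each entity must restart after a failure. This is a simplified illustration, not the mechanism implemented in Kerrighed: the class and method names are hypothetical, and lost in-transit messages and the optimizations mentioned above are ignored.

```python
from typing import Dict, Iterable, List, Tuple


class DependencyTracker:
    """Tracks direct dependencies among tasks and pages, treated alike."""

    def __init__(self, entities: Iterable[str]) -> None:
        # Interval k of an entity starts at its k-th checkpoint.
        self.interval: Dict[str, int] = {e: 0 for e in entities}
        # Each edge: (source, source interval, destination, destination interval).
        self.edges: List[Tuple[str, int, str, int]] = []

    def flow(self, src: str, dst: str) -> None:
        """Record that information flowed from src to dst.

        A message send is task -> task, a page write is task -> page,
        and a page read is page -> task.
        """
        self.edges.append((src, self.interval[src], dst, self.interval[dst]))

    def checkpoint(self, entity: str) -> None:
        """Save the entity state and open a new interval."""
        self.interval[entity] += 1

    def recovery_line(self, failed: str) -> Dict[str, int]:
        """Return, for each entity that must roll back, the interval whose
        starting checkpoint it restarts from. Entities absent from the
        result keep their current state.
        """
        restart = {failed: self.interval[failed]}
        changed = True
        while changed:
            changed = False
            for src, si, dst, di in self.edges:
                # The edge left src during an interval that is being undone,
                # so dst must return to a state preceding its reception.
                if src in restart and si >= restart[src]:
                    if dst not in restart or di < restart[dst]:
                        restart[dst] = di
                        changed = True
        return restart
```

Only one small record is stored per interaction, which reflects the light per-interaction overhead described above; the cost of the analysis is paid at recovery time.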
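Incremental checkpointing of pages, as mentioned above, can be illustrated by a small sketch that saves only the pages modified since the previous checkpoint. It is a user-level model under simplifying assumptions: pages are byte strings in a dictionary, the stable storage is another dictionary, and the class name is hypothetical.

```python
from typing import Dict, List, Set


class IncrementalCheckpointer:
    """Saves only the pages written since the last checkpoint."""

    def __init__(self) -> None:
        self.pages: Dict[int, bytes] = {}   # live pages
        self.saved: Dict[int, bytes] = {}   # last checkpointed contents
        self.dirty: Set[int] = set()        # pages written since then

    def write(self, page_id: int, data: bytes) -> None:
        """Modify a page and remember that it must be saved again."""
        self.pages[page_id] = data
        self.dirty.add(page_id)

    def checkpoint(self) -> List[int]:
        """Copy dirty pages to stable storage and return their ids."""
        saved_now = sorted(self.dirty)
        for page_id in saved_now:
            self.saved[page_id] = self.pages[page_id]
        self.dirty.clear()
        return saved_now

    def restore(self) -> None:
        """Roll the live pages back to the last checkpoint."""
        self.pages = dict(self.saved)
        self.dirty.clear()
```

After a first checkpoint that saves every written page, a second checkpoint following a single write saves that page only, which is where the reduction in checkpoint volume comes from.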
Checkpointing parallel applications in cluster federations
Federations of clusters (aka clusters of clusters) are very useful for applications like large-scale code coupling. Since faults may occur frequently, checkpointing strategies are needed to restart applications in the event of a node failure in a cluster. To take into account the constraints introduced by the cluster federation architecture, we propose a hierarchical checkpointing protocol. It uses a regular synchronous approach inside individual clusters, but only quasi-synchronous methods between clusters. Our protocol has been evaluated by simulation. It is well suited to applications that can be divided into modules with little inter-module communication.
Large companies exploit several medium-size clusters distributed over several geographic sites. Some applications, such as those using code coupling, may exceed the capacity of a single cluster. A solution is to run each component of the code on a different cluster. Moreover, for a given cluster, only a limited number of applications can be run at the same time. However, other clusters in the same company may be idle or underloaded at the same time, and hence could be enrolled in the computation. What is needed here is a Grid-aware operating system that could federate clusters in order to make them cooperate, in particular for sharing resources.
We have worked on the design of such a Grid-aware operating system. It should be able to manage a large number of nodes, and to deal with the dynamicity inherent to a federation, where multiple reconfigurations (node connection, disconnection, or failure) may be in progress at the same time. Our proposal is based on a peer-to-peer infrastructure. The idea is to build a virtual overlay network for a federation. Such a network provides a key-based routing protocol, making transparent the physical location of any object named by a key. The Grid-aware operating system would encompass several distributed services such as, for instance, services assembling a federation, managing and scheduling applications, controlling resource access, managing a virtual shared memory and a distributed file system, etc.
This year, we have implemented (using C) the peer-to-peer overlay network inspired by Pastry. It will serve as the basis for implementing the distributed services at the federation level.
The first service that we have studied is a service for executing distributed applications using the shared memory paradigm in a cluster federation. This raises the problem of executing shared memory parallel applications on dynamic and large-scale systems. The shared memory is private to each application, it is volatile, and the application components transparently access shared memory objects via their usual address space. The peer-to-peer system tolerates a bounded number of simultaneous reconfiguration events (node failure, disconnection, or join) and an infinite number of reconfigurations over time. We have designed a coherence protocol similar to Kai Li's protocols for replicas of memory objects. The protocol uses the peer-to-peer architecture to handle the simultaneous reconfiguration events. The number of such simultaneous events that can be handled is a parameter of the system. However, an infinite number of reconfigurations can be supported with a fail-stop model. Failures are tolerated using backward error recovery and replicated automata. This avoids restarting the applications when possible. We have proved that the protocol preserves the coherence of replicated memory objects, despite simultaneous reconfiguration events, and guarantees liveness if communications are reliable.
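Key-based routing can be illustrated by the rule used in overlays of the Pastry family: a key is handled by the live node whose identifier is numerically closest to it on a circular identifier space. The sketch below shows only this final mapping and how it changes when a node leaves. The prefix-based routing tables that make lookups efficient are omitted, and the 16-bit identifier space and function names are assumptions chosen for the example.

```python
import hashlib
from typing import Iterable

ID_BITS = 16                 # assumed size of the identifier space
ID_SPACE = 1 << ID_BITS


def key_of(name: str) -> int:
    """Derive a key from an object name (top 16 bits of its SHA-1 hash)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big")


def ring_distance(a: int, b: int) -> int:
    """Distance between two identifiers on the circular identifier space."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


def responsible_node(key: int, node_ids: Iterable[int]) -> int:
    """Return the node identifier numerically closest to the key.

    Ties are broken in favour of the smaller identifier.
    """
    return min(sorted(node_ids), key=lambda n: ring_distance(n, key))
```

Because the mapping depends only on the key and on the set of live nodes, a client never needs to know the physical location of an object: when a node disconnects, its keys are taken over by the remaining closest node.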
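For readers unfamiliar with Kai Li's protocols, the following sketch models their core write-invalidate rule for a single memory object: several nodes may hold read copies, but a write invalidates all other copies and transfers ownership to the writer. It is a single-process model with hypothetical names, and it deliberately leaves out what is specific to the protocol described above, namely reconfiguration handling, replication of automata, and recovery.

```python
from typing import Any, Set


class SharedObject:
    """Single-writer/multiple-reader coherence for one memory object."""

    def __init__(self, owner: str, value: Any = None) -> None:
        self.owner = owner            # node holding the reference copy
        self.copyset: Set[str] = {owner}   # nodes holding a valid copy
        self.value = value

    def read(self, node: str) -> Any:
        """Read fault handling: the node obtains a copy from the owner."""
        self.copyset.add(node)
        return self.value

    def write(self, node: str, value: Any) -> Set[str]:
        """Write fault handling: invalidate other copies, then write.

        Returns the set of nodes whose copies were invalidated.
        """
        invalidated = self.copyset - {node}
        self.copyset = {node}
        self.owner = node
        self.value = value
        return invalidated
```

The invariant preserved by both operations is that, after a write, exactly one valid copy exists, so no node can read a stale value.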
Optimizations to this protocol will be studied in the near future and both theoretical and experimental evaluations will be performed. Furthermore, we plan to study a home-based coherence protocol implementing a release consistency memory model.
Mome and OpenMP
Mome's initial target was the execution of programs from the high-performance computing community that exploit loop-level parallelism using an SPMD execution model. This implementation makes a clear distinction between shared and private data. The allocation of the shared data in the shared space must be explicitly requested by the application. This execution model is consistent with the HPF language where the variables are implicitly private and the shared variables must be explicitly specified.
The OpenMP specification targets SMP architectures: shared memory multiprocessors. In the OpenMP model, all variables are implicitly shared. The private variables (one instance per thread) must be explicitly specified. It is not possible through static analysis to decide at compile-time which objects are shared and which ones are private.
The Mome DSM implementation and the associated runtime system have been adapted to support standard OpenMP codes without adding complexity to compilers. Among other changes, the thread stacks can now be allocated in the shared space, signal handlers are executed on private stacks, the DSM internal code never reads or writes the application space, and the distributed synchronization objects are allocated in the shared space although the primitives do not touch the objects.
A new implementation of the nth_lib runtime system from the IST POP project on the Mome DSM is in progress, and experiments will start in the near future. The integration of the release consistency model in Mome is planned.