Section: Overall Objectives
Operating system and runtime for clusters and grids
Clusters, made up of homogeneous computers interconnected via high-performance networks, are now widely used as general-purpose, high-performance computing platforms for scientific computing. While such an architecture is attractive with respect to its price/performance ratio, there is still considerable room for efficiency improvement at the software level. System software can be improved to better exploit cluster hardware resources. Programming environments need to be developed with the efficiency of both the cluster and the human programmer in mind.
We believe that cluster programming remains difficult, mainly because clusters lack a dedicated operating system providing a single system image (SSI). A single system image gives cluster users and programmers the illusion of a single, powerful and highly available computer, as opposed to a set of independent computers whose resources have to be managed locally.
Several attempts to build an SSI have been made at the middleware level, such as Beowulf, PVM or MPI. However, these environments only provide a partial SSI. Our approach in the Paris Project-Team is to design and implement a full SSI in the operating system. Our objective is to combine ease of use, high performance and high availability. All physical resources (processor, memory, disk, etc.) and kernel resources (process, memory pages, data streams, files, etc.) need to be visible and accessible from all cluster nodes. Cluster reconfigurations due to node addition, eviction or failure need to be dealt with automatically by the system, transparently to the applications. Our SSI operating system (SSI OS) is designed to perform global, dynamic and integrated resource management.
As the execution time of scientific applications may exceed the cluster mean time between failures, checkpoint/restart facilities need to be provided, not only for sequential applications but also for parallel applications, and independently of the underlying communication paradigm. Even though backward error recovery (BER) has been extensively studied from the theoretical point of view, an efficient implementation of BER protocols, transparent to the applications, is still a research challenge. There are very few implementations of recovery schemes for parallel applications. Our approach is to identify and implement, as part of the SSI OS, a set of building blocks that can be combined to implement various checkpointing strategies and their optimizations for parallel applications, whatever inter-process communication (IPC) layer they use.
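The principle of backward error recovery can be illustrated with a minimal sketch. It is not the SSI OS mechanism: it assumes a single sequential process whose whole state fits in a dictionary, and the names `Checkpointable`, `run_with_ber`, `checkpoint_every` and `fail_at` are hypothetical. A parallel application would additionally need the saved states of all processes to form a consistent global state, which this sketch does not address.

```python
import copy


class Checkpointable:
    """Hypothetical process whose entire state is a small dictionary."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        # The initial state acts as the first recovery point.
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a private copy of the current state (stands in for stable storage).
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Backward error recovery: restore the last saved state.
        self.state = copy.deepcopy(self._saved)

    def advance(self):
        # One unit of application work: accumulate 1 + 2 + ... + step.
        self.state["step"] += 1
        self.state["total"] += self.state["step"]


def run_with_ber(n_steps, checkpoint_every, fail_at):
    """Run n_steps of work, injecting one failure right after each step in fail_at."""
    proc = Checkpointable()
    pending_failures = set(fail_at)
    rollbacks = 0
    while proc.state["step"] < n_steps:
        proc.advance()
        step = proc.state["step"]
        if step in pending_failures:
            # Simulated failure: work done since the last checkpoint is lost.
            pending_failures.discard(step)
            proc.rollback()
            rollbacks += 1
            continue
        if step % checkpoint_every == 0:
            proc.checkpoint()
    return proc.state["total"], rollbacks


if __name__ == "__main__":
    total, rollbacks = run_with_ber(n_steps=10, checkpoint_every=3, fail_at=[5, 8])
    print(total, rollbacks)
```

The run completes with the same result as a failure-free execution. The checkpoint interval sets the trade-off: frequent checkpoints cost more during normal execution but reduce the work lost at each rollback.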
In addition to our research activity on operating systems, we also study the design of runtimes for supporting parallel languages on clusters. A runtime is a software layer offering services dedicated to the execution of a particular language. Its objective is to tailor the general system mechanisms (memory management, communication, task scheduling, etc.) to achieve the best performance given the target machine and its operating system. The main originality of our approach is to use the concept of distributed shared memory (DSM) as the basic communication mechanism within the runtime. We are essentially interested in Fortran and its OpenMP extensions. The Fortran language is traditionally used in the simulation applications we focus on. Our work is based on the operating system mechanisms studied in the Paris Project-Team. In particular, the execution of OpenMP programs on a cluster requires a global address space shared by threads deployed on different cluster nodes. We rely on the two distributed shared memory systems we have designed: one at user level, implementing weak memory consistency models, and the other at operating-system level, implementing the sequential consistency model.
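To make the idea of a page-based distributed shared memory more concrete, here is a toy simulation of a write-invalidate scheme. It is only an illustrative sketch and not a description of either DSM system mentioned above: the class `ToyDSM`, its methods and the single "home" copy per page are assumptions, and operations are applied one at a time in a single Python process, so no real network or concurrency is involved.

```python
class ToyDSM:
    """Toy write-invalidate shared memory over simulated nodes (hypothetical)."""

    def __init__(self, n_nodes, n_pages):
        # Authoritative contents of each page (one integer per page here).
        self.home = [0] * n_pages
        # Per-node cache: page number -> locally cached value.
        self.copies = [dict() for _ in range(n_nodes)]
        # Number of page fetches caused by read misses.
        self.transfers = 0

    def read(self, node, page):
        cache = self.copies[node]
        if page not in cache:
            # Simulated page fault: fetch the page from its home copy.
            cache[page] = self.home[page]
            self.transfers += 1
        return cache[page]

    def write(self, node, page, value):
        # Invalidate every other node's copy before the write takes effect,
        # so that no node can later read a stale value from its cache.
        for other, cache in enumerate(self.copies):
            if other != node:
                cache.pop(page, None)
        self.home[page] = value
        self.copies[node][page] = value


if __name__ == "__main__":
    dsm = ToyDSM(n_nodes=2, n_pages=1)
    dsm.write(0, 0, 7)
    print(dsm.read(1, 0))   # node 1 misses and fetches the page
    dsm.write(1, 0, 9)      # invalidates node 0's copy
    print(dsm.read(0, 0))   # node 0 misses and fetches the new value
    print(dsm.transfers)
```

In this sketch every read returns the value of the most recent write, at the price of an invalidation on every write. Weaker consistency models relax that guarantee in order to reduce such traffic, which is the kind of trade-off a runtime can exploit.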