Section: New Results
Large Scale Distributed Systems
*A survey of Grid research tools: simulators, emulators and real life platforms*
Grid infrastructures are becoming the largest and most complex distributed systems ever built. Because of their size and complexity, they raise many algorithmic challenges for security, fault tolerance, fair share and performance. When investigating a research issue, researchers use different methodologies and different tools. Most published Grid studies were conducted on real production infrastructures or simulators. There are other research tools as well, such as mathematical models, emulators and large scale experimental testbeds. In  , we present a survey of existing tools and methodologies to investigate Grid research issues. We describe some mathematical models, the main generic simulators (Bricks, SimGrid, GridSim, GangSim and OptorSim), a couple of emulators (MicroGrid and Grid eXplorer) and a couple of experimental testbeds (DAS2 and Grid'5000). We briefly discuss their respective advantages and limitations and present the validation approach used by their authors.
*V-Meter: a Microbenchmark to Evaluate Virtualization Tools for Large Scale Emulation Systems*
V-GRID is a large scale emulator for testing applications that need a large number of machines. To do this, we need to run many (100) virtual machines on each physical machine. We had to choose among four virtualization tools to build this emulator: Vserver, Xen, UML and VMware. In  , we compare the performance of three of these systems: Vserver, UML and Xen, and we show that none meets all the conditions specified (scalability, speed, usability, ...) for our emulator.
Data-centric applications remain a challenging issue for large scale distributed computing systems. The emergence of new protocols and software for collaborative content distribution over the Internet offers a new opportunity for efficient and fast delivery of high volumes of data. In this paper, we investigate BitTorrent as a protocol for data diffusion in the context of Computational Desktop Grids. We show that BitTorrent is efficient for large file transfers and scalable as the number of nodes increases, but suffers from a high overhead when transmitting small files. The paper also investigates two approaches to overcome these limitations. First, we propose a performance model to select the better of the FTP and BitTorrent protocols according to the size of the file to distribute and the number of receiver nodes. Next, we propose an enhancement of the BitTorrent protocol that provides more predictable communication patterns. We design a model for communication performance and evaluate BitTorrent-aware versions of the MinMin, MaxMin and Sufferage scheduling heuristics (BT-MinMin, BT-MaxMin and BT-Sufferage) against a synthetic parameter-sweep application.
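The protocol selection model described above can be sketched as a simple decision rule. The function name and threshold values below are illustrative assumptions; the actual model in the paper fits measured transfer times for FTP and BitTorrent as functions of file size and receiver count.

```python
def select_protocol(file_size_bytes, n_receivers,
                    small_file_threshold=1 << 20,  # 1 MiB, illustrative cutoff
                    min_swarm_size=8):             # illustrative swarm minimum
    """Pick a distribution protocol for a one-to-many file transfer.

    Sketch only: BitTorrent's per-file overhead (tracker contact,
    piece negotiation) dominates for small files or few receivers,
    where a direct FTP push wins; collaborative piece exchange pays
    off for large files sent to many nodes.
    """
    if file_size_bytes < small_file_threshold or n_receivers < min_swarm_size:
        return "ftp"
    return "bittorrent"
```

In a real deployment the two thresholds would be calibrated from benchmark runs on the target platform rather than fixed constants.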
*Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI*
Fault tolerance in MPI has become a major issue in the HPC community. Several approaches are envisioned, from user- or programmer-controlled fault tolerance to fully automatic fault detection and handling. For the latter approach, several protocols have been proposed in the literature. In a recent paper, we demonstrated that uncoordinated checkpointing tolerates a higher fault frequency than coordinated checkpointing. Moreover, causal message logging protocols have been proved the most efficient message logging technique. These protocols consist in piggybacking non-deterministic events on computation messages. Their merits are usually evaluated on four metrics: a) piggybacking computation cost, b) piggyback size, c) application performance and d) fault recovery performance. In this paper, we investigate the benefit of using stable storage for logging message events in causal message logging protocols. To evaluate the advantage of this technique, we implemented three protocols: 1) a classical causal message logging protocol proposed in Manetho, 2) a state-of-the-art protocol known as LogOn, and 3) a light computation cost protocol called Vcausal. We demonstrate a major impact of this stable storage for the three protocols, on the four criteria, for micro-benchmarks as well as for the NAS benchmarks.
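The piggybacking mechanism at the heart of causal message logging, and the pruning effect of a stable event logger, can be illustrated with a minimal sketch. All class and method names here are hypothetical; this is not the Manetho, LogOn or Vcausal implementation, only the common idea they share.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A non-deterministic event (e.g. the order of a message receive)."""
    proc: int
    seq: int
    detail: str


@dataclass
class Message:
    payload: bytes
    piggyback: list  # events the receiver must know to replay the sender


class CausalProcess:
    """Sketch of causal piggybacking: each process appends the
    non-deterministic events it depends on to outgoing messages.
    A stable event logger (the technique studied in the paper) lets
    the process prune events already made stable, shrinking the
    piggyback and thus its computation and size costs."""

    def __init__(self, pid):
        self.pid = pid
        self.seq = 0
        self.unstable = []  # events not yet confirmed on stable storage

    def record(self, detail):
        ev = Event(self.pid, self.seq, detail)
        self.seq += 1
        self.unstable.append(ev)
        return ev

    def send(self, payload):
        # Piggyback every causally preceding event that is still unstable.
        return Message(payload, list(self.unstable))

    def ack_stable(self, upto_seq):
        # The event logger confirmed events up to upto_seq are stable:
        # they no longer need to travel with application messages.
        self.unstable = [e for e in self.unstable if e.seq > upto_seq]
```

The sketch makes the trade-off visible: without the logger the `unstable` list, and hence the piggyback, grows with execution length; with it, the piggyback stays bounded by the logger's latency.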
*Hybrid Preemptive Scheduling of MPI Applications on the Grids*
Time sharing of cluster resources in a Grid is a major issue in cluster and Grid integration. A classical Grid architecture involves a higher-level scheduler that submits non-overlapping jobs to the independent batch schedulers of each cluster of the Grid. The sequentiality induced by this approach does not fit the expected number of users and job heterogeneity of Grids. Time sharing techniques address this issue by allowing simultaneous execution of many applications on the same resources.
Co-scheduling and gang scheduling are the two best-known techniques for time sharing cluster resources. Co-scheduling relies on the operating system of each node to schedule the processes of every application. Gang scheduling ensures that the same application is scheduled on all nodes simultaneously. Previous work has proven that co-scheduling techniques outperform gang scheduling when physical memory is not exhausted. In this paper, we introduce a new hybrid sharing technique providing checkpoint-based explicit memory management. It consists in co-scheduling parallel applications within a set until the memory capacity of the node is reached, and using gang scheduling related techniques to switch from one set to another. We compare experimentally the merits of the three solutions (co-, gang and hybrid scheduling) in the context of out-of-core computing, which is likely to occur in the Grid context, where many users share the same resources. Additionally, we address the problem of heterogeneous applications by comparing hybrid scheduling to an optimized version relying on paired scheduling. The experiments show that the hybrid solution is as efficient as the co-scheduling technique when physical memory is not exhausted, can benefit from the paired-scheduling optimization when applications are heterogeneous, and is more efficient than both gang scheduling and co-scheduling when physical memory is exhausted.
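The set-formation step of the hybrid policy can be sketched as a greedy packing of applications into co-scheduled sets bounded by node memory; gang-scheduling techniques then switch between the resulting sets. The function name, the greedy heuristic and the `(name, memory)` input format are assumptions for illustration, not the paper's actual algorithm.

```python
def partition_into_sets(apps, node_memory):
    """Greedy sketch of the hybrid policy's memory management:
    pack applications into co-scheduled sets whose total memory
    requirement fits the node, so each set runs without paging.
    Switching from one set to the next would use checkpoint-based
    gang-scheduling techniques, as in the paper.

    apps: list of (name, memory_requirement) pairs.
    Returns a list of sets (lists of application names).
    """
    sets, current, used = [], [], 0
    # Largest-first placement tends to reduce the number of sets.
    for name, mem in sorted(apps, key=lambda a: -a[1]):
        if used + mem > node_memory and current:
            sets.append(current)       # close the full set
            current, used = [], 0
        current.append(name)
        used += mem
    if current:
        sets.append(current)
    return sets
```

This makes the paper's result intuitive: when everything fits in one set, the policy degenerates to plain co-scheduling; only under memory pressure does it pay the cost of gang-style set switching.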
*MPICH-V Project: a Multiprotocol Automatic Fault Tolerant MPI*
High performance computing platforms such as clusters, Grids and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most widely used message passing libraries in HPC applications. These two trends raise the need for fault tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault tolerance protocols for MPI applications. We present an extensive related work section highlighting the originality of our approach and the proposed protocols. We then present four fault tolerant protocols implemented in a new generic framework for fault tolerant protocol comparison, covering a large spectrum of known approaches, from coordinated checkpointing to uncoordinated checkpointing associated with causal message logging. We measure the performance of these protocols on a micro-benchmark and compare them on the NAS benchmarks, using an original fault tolerance test. Finally, we outline the lessons learned from this in-depth comparison of fault tolerant protocols for MPI applications.
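The key recovery-scope difference across the protocol spectrum mentioned above can be summarized in a few lines. This is a conceptual sketch of the general trade-off (function and protocol labels are ours, not MPICH-V identifiers): coordinated checkpointing rolls every process back to the last consistent global snapshot, while uncoordinated checkpointing with causal message logging replays only the failed process from its own checkpoint, using the logged events.

```python
def processes_to_restart(protocol, failed, all_procs):
    """Which processes roll back after one process fails?

    "coordinated": all processes restart from the last consistent
        global checkpoint (simple, but the whole application pays).
    "causal": uncoordinated checkpointing + causal message logging;
        only the failed process restarts, replaying its logged
        non-deterministic events (survivors keep computing).
    Sketch of the trade-off, not an MPICH-V implementation.
    """
    if protocol == "coordinated":
        return set(all_procs)
    if protocol == "causal":
        return {failed}
    raise ValueError(f"unknown protocol: {protocol}")
```

This contrast is what drives the fault-frequency result cited earlier: the smaller the restart scope, the higher the failure rate an application can tolerate while still making progress.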