Team AlGorille

Section: New Results

Transparent Resource Management

Participants : Pierre-Nicolas Clauss, Stéphane Genaud, Soumeya Leila Hernane, Constantinos Makassikis, Stéphane Vialle.

New Control and Data Structures for Efficiently Overlapping Computations, Communications and I/O

With the thesis of Pierre-Nicolas Clauss we introduced the framework of ordered read-write locks, ORWL, see [11] , [17] . These are characterized by two main features: a strict FIFO policy for access, and the attribution of access to lock-handles instead of processes or threads. These two properties allow applications to have a controlled, pro-active access to resources and thereby to achieve a high degree of asynchronism between the different tasks of the same application. For iterative computations with many parallel tasks that access their resources in a cyclic pattern, we provide a generic technique to implement them by means of ORWL. We showed that the possible execution patterns of such a system correspond to a combinatorial lattice structure, and that this lattice is finite iff the configuration contains a potential deadlock. In addition, we provide two efficient algorithms: one that allows for a deadlock-free initialization of such a system, and another for the detection of deadlocks in an already initialized system.

Whereas the first experiments with ORWL had been done with our library parXXL, we have now provided a standalone distributed implementation of the API that is based solely on C and POSIX socket communications. Our goal is to simplify the usage of ORWL and to allow portability to a large variety of platforms. This implementation runs on different flavors of Linux and BSD, on different processor types (Intel and ARM), and with different compilers (gcc, clang, opencc and icc). An experimental evaluation of its performance is under way.
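The two characteristic features above, strict FIFO ordering and attribution of access to lock-handles, can be sketched with a ticket-based lock. The following C code is only an illustration of the idea, with invented names; it is not the actual ORWL API, which in particular also distinguishes read from write access:

```c
#include <pthread.h>

/* Minimal sketch of a FIFO-ordered lock in the spirit of ORWL:
 * requests are queued in strict FIFO order and access is granted
 * to handles rather than to threads. Illustrative names only. */
typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    unsigned long   next_ticket;  /* next FIFO position to hand out */
    unsigned long   serving;      /* ticket currently granted       */
} fifo_lock;

typedef struct {
    fifo_lock    *lock;
    unsigned long ticket;         /* this handle's FIFO position    */
} lock_handle;

void fifo_lock_init(fifo_lock *l) {
    pthread_mutex_init(&l->mtx, NULL);
    pthread_cond_init(&l->cv, NULL);
    l->next_ticket = 0;
    l->serving = 0;
}

/* Pro-active request: reserves a FIFO position, never blocks. */
void handle_request(lock_handle *h, fifo_lock *l) {
    pthread_mutex_lock(&l->mtx);
    h->lock = l;
    h->ticket = l->next_ticket++;
    pthread_mutex_unlock(&l->mtx);
}

/* Blocks until the handle's ticket is served. */
void handle_acquire(lock_handle *h) {
    fifo_lock *l = h->lock;
    pthread_mutex_lock(&l->mtx);
    while (l->serving != h->ticket)
        pthread_cond_wait(&l->cv, &l->mtx);
    pthread_mutex_unlock(&l->mtx);
}

/* Releases the lock and grants it to the next ticket in order. */
void handle_release(lock_handle *h) {
    fifo_lock *l = h->lock;
    pthread_mutex_lock(&l->mtx);
    l->serving++;
    pthread_cond_broadcast(&l->cv);
    pthread_mutex_unlock(&l->mtx);
}
```

The separation between handle_request (which only reserves a position) and handle_acquire (which blocks) is what makes the access pro-active: a task can announce its future accesses and overlap other work in the meantime.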

Data Handover, DHO, is a general purpose API that combines locking and mapping of data in a single interface. The access strategies are similar to ORWL, but locks and maps can also be held only partially, for a consecutive range of the data object. It is designed to ease the access to data for client code, by ensuring data consistency and efficiency at the same time.
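A minimal sketch of the idea in C, with invented names; the real DHO interface is richer, and its acquisition step queues the request instead of failing when the lock is held:

```c
#include <stddef.h>

/* Hypothetical sketch of the DHO principle: a request names a
 * consecutive byte range of the object; acquiring the lock also
 * hands over a mapping of exactly that range. Illustrative only. */
enum dho_mode { DHO_READ, DHO_WRITE };

typedef struct {
    unsigned char *base;    /* backing storage of the data object */
    size_t         size;
    int            locked;  /* 0 = free, 1 = exclusively held     */
} dho_object;

typedef struct {
    dho_object *obj;
    size_t      offset, len; /* the consecutive range requested    */
    void       *map;         /* valid between acquire and release  */
} dho_handle;

/* Request a consecutive range of the object; returns -1 if the
 * range does not fit inside the object. */
int dho_request(dho_handle *h, dho_object *o,
                size_t offset, size_t len, enum dho_mode mode) {
    (void)mode;                       /* mode is ignored in this sketch */
    if (offset + len > o->size) return -1;
    h->obj = o; h->offset = offset; h->len = len; h->map = NULL;
    return 0;
}

/* Acquire the lock and hand over a mapping of the range. */
void *dho_acquire(dho_handle *h) {
    if (h->obj->locked) return NULL;  /* a real DHO would queue here */
    h->obj->locked = 1;
    h->map = h->obj->base + h->offset;
    return h->map;
}

/* Release: the mapping becomes invalid and the lock is freed. */
void dho_release(dho_handle *h) {
    h->obj->locked = 0;
    h->map = NULL;
}
```

The point of the combined interface is that client code never holds a pointer into the data outside a locked section: the mapping only exists between acquire and release, which is how consistency is enforced.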

In the thesis of Soumeya Hernane, we use the Grid Reality And Simulation (GRAS) environment of SimGrid, see  5.3 , as the support for an implementation of DHO. GRAS has the advantage of allowing execution either in the simulator or on a real platform. A first series of tests and benchmarks of that implementation demonstrates the ability of DHO to provide a robust and scalable framework [38] .

Energy performance measurement and optimization

Several experiments have been done on the GPU clusters of SUPÉLEC with different kinds of problems, ranging from an embarrassingly parallel one to a strongly coupled one, via an intermediate level. Our first results tend to confirm our initial intuition that GPUs are a good alternative to CPUs for problems which can be formulated in a SIMD or massively multi-threaded way. However, for applications that are not embarrassingly parallel, the supremacy of a GPU cluster tends to decrease when the number of nodes increases (these results have been presented to the European COST Action IC0804 on energy efficiency in large scale distributed systems [19] ). So, energy saving can clearly be one of the decision criteria for using GPUs instead of CPUs, depending on the parallel algorithm and the number of computing nodes. The other criteria will generally be the ease of development and maintenance, which have a direct impact on the economic cost of the software.

A direct sequel of that work has been to propose a model linking together the computing and energy performances  [60] . That model allows the user to estimate the minimal speedup that a GPU version must achieve over its CPU counterpart in order to become more energy efficient. Thanks to this model, it also becomes possible to design an Execution Control System (ECS) that would dynamically choose the best combination of software version and hardware to run a given scientific application.
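The core of such a minimal-speedup criterion can be illustrated as follows. This is only a sketch of the reasoning, assuming constant average powers, and not the actual model of [60], which may include further terms:

```c
/* With average powers p_cpu and p_gpu (watts), energy is power
 * times time, so the GPU version is more energy-efficient iff
 *   p_gpu * t_gpu < p_cpu * t_cpu
 * i.e. iff its speedup t_cpu / t_gpu exceeds the power ratio.
 * Simplified sketch; not the published model. */
double minimal_speedup(double p_cpu_watts, double p_gpu_watts) {
    return p_gpu_watts / p_cpu_watts;
}
```

For instance, under these assumptions, if a GPU node draws 250 W against 100 W for a CPU node, the GPU version must run at least 2.5 times faster before it saves any energy.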

We intend to continue our investigations in that domain at different levels. Among them, we can cite further studies of our model, as well as the development of the control system mentioned above. As we have developed an American option pricer on a CPU cluster and on a GPU cluster (see section 6.1.4 ), we plan to evaluate our performance model on this second distributed application. It is not obvious that the GPU version is always the most interesting one; this depends on the problem size and the number of nodes used. Also, as mentioned in the previous section, we plan to study the energy aspect of hybrid processing, which corresponds to the full exploitation of the computational power present on every node of a parallel system. Typically, this includes the use of all the cores of a node in conjunction with any co-processing device. Finally, we would like to determine the best trade-off between energy saving and design/implementation costs.

Load balancing

A load-balancing algorithm based on asynchronous diffusion with bounded delays has been designed to work on dynamic networks [14] . It is iterative by nature, and we have provided a proof of its convergence in the context of load conservation. We have also given some constraints on the load migration ratios of the nodes in order to ensure convergence. This work has been extended, in particular with a detailed study of the imbalance of the system during the execution of a parallel algorithm simulated on the SimGrid platform.
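As an illustration of the migration-ratio idea, here is a synchronous first-order diffusion step on a ring of nodes. The algorithm of [14] is asynchronous with bounded delays; this simplified synchronous sketch only shows how a migration ratio alpha drives convergence while conserving the total load:

```c
/* One synchronous diffusion step on a ring of n nodes: each node
 * exchanges alpha times the load difference with each neighbour.
 * Total load is conserved; for a ring (degree 2), convergence
 * requires alpha <= 1/2. Illustrative sketch only. */
void diffusion_step(double *load, double *next, int n, double alpha) {
    for (int i = 0; i < n; i++) {
        int left  = (i + n - 1) % n;
        int right = (i + 1) % n;
        next[i] = load[i]
                + alpha * (load[left]  - load[i])
                + alpha * (load[right] - load[i]);
    }
    for (int i = 0; i < n; i++) load[i] = next[i];
}
```

Iterating this step drives all loads toward the common average; the open question mentioned below, the optimal tuning of the migration ratio, corresponds here to the choice of alpha.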

The perspectives of that work are twofold. The first one concerns the internal functioning of our algorithm: there is an intrinsic parameter which tunes the load migration ratios, and we would like to determine its optimal value. The other aspect is on the application side, in a real parallel environment. Indeed, with Stéphane Genaud, we intend to apply this algorithm to a parallel version of the AdaBoost learning algorithm. We will compare our load-balancing scheme to other existing ones in different programming environments, among which the P2P-MPI framework.

Concerning the Neurad project, our parallel learning proceeds by decomposing the data set to be learned. However, a simple regular decomposition is not sufficient, as the obtained sub-domains may have very different learning times. Thus, we have designed a domain decomposition of the data set that yields sub-sets of similar learning times [32] . One of the main issues in this work has been the determination of the best estimator of the learning time of a sub-domain. As the learning time of a data set is directly linked to the complexity of the signal, several estimators taking that complexity into account have been tested, among them the entropy. Although our current results are satisfying, we are convinced that the quality of the decomposition can still be improved in several ways. The first one could be the design of a better learning time estimator. The second one is the decomposition strategy itself, i.e. the global scheme and the choices made within it. Until now, we have opted for an URB (Unbalanced Recursive Bisection) approach, but we are currently working on the characterization of the best choice of the dimension to divide at each decomposition step.
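The URB idea can be sketched on a 1-D example: given per-cell estimates of the learning time (for instance entropy-based), each bisection places the cut so that both parts carry roughly the same estimated cost, then recurses. This is an illustrative sketch with invented names, not the actual Neurad decomposition, which works on multi-dimensional data sets and must also choose the dimension to divide:

```c
/* Sum of the per-cell cost estimates over [lo, hi). */
static double total_cost(const double *cost, int lo, int hi) {
    double s = 0.0;
    for (int i = lo; i < hi; i++) s += cost[i];
    return s;
}

/* Unbalanced recursive bisection of [lo, hi) into ndomains parts
 * of roughly equal estimated cost. Writes the [lo, hi) bounds of
 * each sub-domain into out[] (2 ints per domain) and returns the
 * number of domains produced. Illustrative 1-D sketch only. */
int urb_split(const double *cost, int lo, int hi,
              int ndomains, int *out) {
    if (ndomains == 1 || hi - lo == 1) {
        out[0] = lo; out[1] = hi;
        return 1;
    }
    double half = total_cost(cost, lo, hi) / 2.0, acc = 0.0;
    int cut = hi - 1;                 /* keep both sides non-empty */
    for (int i = lo; i < hi - 1; i++) {
        acc += cost[i];
        if (acc >= half) { cut = i + 1; break; }
    }
    int nl = urb_split(cost, lo, cut, ndomains / 2, out);
    int nr = urb_split(cost, cut, hi, ndomains - ndomains / 2,
                       out + 2 * nl);
    return nl + nr;
}
```

Note that the cuts fall where the estimated cost balances, not where the cell count does, which is exactly why sub-domains of very different sizes can still have similar learning times.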

Fault Tolerance

Application-level fault tolerance

Concerning fault tolerance, we have worked with Marc Sauget, from the University of Franche-Comté, on a parallel and robust algorithm for neural network learning in the context of the Neurad project  [44] . A short description of that project is given in Section  4.1.7 .

As that learning algorithm is to be used on local clusters, we have opted for a simple approach based on the client-server model: a server distributes all the learning tasks (the data-set sub-domains to be learned) to the clients, which perform the learning. However, although the parallel context is very classical, the potentially very long learning times of the sub-domains (the data sets may be huge) require the insertion of a fault-tolerance mechanism to avoid the loss of any learning.

So, we have developed a mechanism for detecting client faults, together with a restarting process. We have also studied some variants of task redistribution, as a fault may occur only at the link level and thus may not imply the loss of the learning in progress on the corresponding client. Our final choice was to redistribute the task as soon as a fault is detected. If that fault is later cancelled by the client, two clients end up performing the same learning. However, we do not stop either of the clients, but let them run until one of them sends back its result; only then is the other client stopped.
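The redistribution policy described above can be summarized by the following server-side bookkeeping sketch; the names are illustrative, and time-out detection and messaging are abstracted away:

```c
/* A task suspected lost is handed to a second client; both are
 * left running, and whichever answers first wins. Sketch only. */
#define NO_CLIENT (-1)

typedef struct {
    int client[2];   /* up to two clients working on the task */
    int done;        /* 1 once a result has been received     */
} task;

void task_init(task *t, int first_client) {
    t->client[0] = first_client;
    t->client[1] = NO_CLIENT;
    t->done = 0;
}

/* Called when the server suspects a fault: the task is handed to
 * a backup client, but the first client is NOT stopped. */
void task_on_suspected_fault(task *t, int backup_client) {
    if (!t->done && t->client[1] == NO_CLIENT)
        t->client[1] = backup_client;
}

/* Called when some client returns a result; returns the id of the
 * now-redundant client to stop, or NO_CLIENT if there is none. */
int task_on_result(task *t, int from_client) {
    t->done = 1;
    return (t->client[0] == from_client) ? t->client[1]
                                         : t->client[0];
}
```

Keeping both clients alive costs some redundant computation, but it guarantees that a falsely suspected (transient) fault never wastes the learning already in progress.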

That strategy has proven to be rather efficient and robust in our different experiments, performed with real data on a local cluster where faults were injected. Although those results are rather satisfying, we would like to investigate yet more reactive mechanisms, as well as the insertion of robustness at the server level.

Programming model and frameworks for fault tolerant applications

In the framework of the PhD thesis of Constantinos Makassikis, supervised by Stéphane Vialle, we have designed a new fault tolerance model for distributed applications which is based on a collaboration between fault-tolerant development frameworks and application-semantic knowledge supplied by the users. Two development frameworks have been designed, according to two different parallel programming paradigms: one for Master-Workers applications [39] and another one for a class of SPMD applications with inter-node communications [25] . The user's task is limited: he merely needs to supply some computing routines (the functions of the application), add some extra code to use the parallel programming skeletons, and tune the checkpointing frequency.
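This division of labour between framework and user can be sketched as follows; the names are hypothetical, and the real frameworks of course add the distribution, communication and recovery machinery:

```c
/* Sketch of the collaboration: the framework drives the loop and
 * checkpoints every ckpt_freq iterations; the user only supplies
 * the computing routine and the application-semantic state to
 * save, and tunes ckpt_freq. Illustrative names only. */
typedef struct {
    void (*compute)(void *state, int step);  /* user routine       */
    void (*save)(const void *state);         /* user checkpointer  */
    int    ckpt_freq;                        /* user-tuned         */
} ft_framework;

/* Runs nsteps iterations; returns the number of checkpoints taken. */
int ft_run(ft_framework *fw, void *state, int nsteps) {
    int checkpoints = 0;
    for (int step = 0; step < nsteps; step++) {
        fw->compute(state, step);
        if ((step + 1) % fw->ckpt_freq == 0) {
            fw->save(state);       /* framework-driven checkpoint */
            checkpoints++;
        }
    }
    return checkpoints;
}
```

Because only the user knows which part of the state is semantically necessary to restart, the checkpoints can stay small, which is what keeps the failure-free overhead low.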

Our experiments have exhibited limited overheads when no failure happens, and acceptable overheads in worst-case failures. These overheads are lower than those obtained with all the fault-tolerant middleware we have experimented with, while the development-time overhead of our frameworks is very limited. Moreover, detailed experiments on up to 256 nodes of our cluster have shown that it is possible to finely tune the checkpointing policies of the frameworks in order to implement different fault tolerance strategies according, for example, to cluster reliability.

