## Section: New Results

### Transparent Resource Management

Participants : Louis-Claude Canon, Stéphane Genaud, Emmanuel Jeannot, Tchimou N'takpé.

#### Reliable Scheduling

We have worked on the case where jobs can fail. More precisely, we have studied the problem of random brokering for platforms that are directly inspired by existing grids such as EGEE. In such environments, incoming jobs are randomly dispatched to computing elements, each with a probability proportional to its accumulated speed. Resubmission is a general strategy employed to cope with failures in grids. We have studied resubmission analytically and experimentally for the case of random brokering, comparing two cases: jobs are resubmitted either to the broker or to the computing element. Results show that resubmitting to the broker is the better strategy. Our approach differs from most existing trace-based ones in that it is bottom-up: we start from a simple model of a grid and derive its characteristics, see [20].
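The two resubmission policies can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not taken from [20]: each computing element fails a job independently with a fixed probability, and the names `pick_element`, `attempts_until_success` and `mean_attempts` are hypothetical.

```python
import random


def pick_element(speeds, rng):
    """Pick an element index with probability proportional to its speed."""
    return rng.choices(range(len(speeds)), weights=speeds, k=1)[0]


def attempts_until_success(speeds, fail_probs, policy, rng, max_attempts=10_000):
    """Count the submissions of one job until it succeeds.

    policy "broker":  every resubmission goes through the broker again.
    policy "element": the job is resubmitted to the element chosen first.
    Assumption: element i fails each attempt independently with fail_probs[i].
    """
    element = pick_element(speeds, rng)
    for attempt in range(1, max_attempts + 1):
        if rng.random() >= fail_probs[element]:
            return attempt
        if policy == "broker":
            element = pick_element(speeds, rng)
    return max_attempts


def mean_attempts(speeds, fail_probs, policy, jobs=20_000, seed=0):
    """Average number of submissions per job over many simulated jobs."""
    rng = random.Random(seed)
    total = sum(
        attempts_until_success(speeds, fail_probs, policy, rng)
        for _ in range(jobs)
    )
    return total / jobs
```

With two elements of equal speed and failure probabilities 0.1 and 0.9, this toy model needs 2 submissions per job in expectation when resubmitting to the broker, against about 5.6 when resubmitting to the same element. A job stuck on an unreliable element stays there under the second policy, which is the intuition behind the result reported above.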

Also, we have studied the problem of scheduling tasks (with and without precedence constraints) on a set of related processors whose failures follow an exponential law. We have provided a general method for converting scheduling heuristics for heterogeneous clusters into heuristics that take reliability into account when precedence constraints exist. As shown by early experimental results, these heuristics are able to construct a good approximation of the Pareto front. Moreover, we have studied the case where duplication is allowed. In this case, we have shown that, for certain classes of problems, evaluating the reliability of a schedule is #P-complete.
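For the case without duplication, the reliability of a schedule under an exponential failure law can be sketched as follows. The model is an assumption made for illustration: processor `j` has a constant failure rate `failure_rate[j]`, processors fail independently, and the schedule succeeds if no processor fails while it is busy. The function name and its parameters are hypothetical.

```python
import math


def schedule_reliability(assignment, work, speed, failure_rate):
    """Probability that no processor fails while executing its tasks.

    assignment:   dict mapping task index -> processor index
    work:         amount of work of each task
    speed:        speed of each (related) processor
    failure_rate: rate of the exponential failure law of each processor
    """
    busy = [0.0] * len(speed)
    for task, proc in assignment.items():
        # On related processors, execution time is work divided by speed.
        busy[proc] += work[task] / speed[proc]
    # Independent exponential laws: the survival probabilities multiply,
    # so the exponents add up.
    exponent = sum(rate * time for rate, time in zip(failure_rate, busy))
    return math.exp(-exponent)
```

Under this model, a faster processor shortens the busy time and so raises reliability even when its failure rate is higher, which is why makespan and reliability have to be treated as two separate criteria.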

#### Robust Scheduling

A schedule is said to be robust if it is able to absorb some degree of uncertainty in task durations while maintaining a stable solution. This intuitive notion of robustness has led to many different metrics but almost no heuristics. We have performed an experimental study of these metrics and shown how they are correlated with each other. Additionally, we have proposed different strategies for minimizing the makespan while maximizing the robustness: from an evolutionary meta-heuristic (best solutions but longer computation time) to simpler heuristics making approximations (medium quality solutions but fast computation time). We have compared these approaches experimentally and shown that we are able to find different approximations of the Pareto front for this bi-criteria problem [14].

Computing the makespan distribution when task and communication durations are given by probability distributions is a #P-complete problem. We have studied different ways to approximate this distribution based on previous results from the literature on PERT networks. To compare these methods, we have computed the makespan distribution using Monte-Carlo simulation, see [39], [21].
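A Monte-Carlo estimate of the makespan distribution can be sketched as below. This is a minimal illustration, not the experimental setup of [39], [21]: communication durations are ignored (they could be modelled as extra tasks), each task starts as soon as its predecessors finish, and the names `makespan` and `sample_makespans` are hypothetical.

```python
import random


def makespan(durations, preds, order):
    """Makespan of a task graph for one realisation of the durations.

    durations: dict task -> duration
    preds:     dict task -> iterable of predecessor tasks
    order:     tasks listed in a topological order
    """
    finish = {}
    for task in order:
        start = max((finish[p] for p in preds.get(task, ())), default=0.0)
        finish[task] = start + durations[task]
    return max(finish.values())


def sample_makespans(samplers, preds, order, runs=10_000, seed=0):
    """Draw `runs` makespan samples.

    samplers: dict task -> function taking a random.Random and returning
              one random duration for that task
    """
    rng = random.Random(seed)
    return [
        makespan({t: samplers[t](rng) for t in order}, preds, order)
        for _ in range(runs)
    ]
```

The empirical distribution of the returned samples can then serve as the reference against which faster approximation methods are compared.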

#### Detecting Collusion in Desktop Grids

By exploiting idle time on volunteer machines, desktop grids provide a way to execute large sets of tasks with negligible maintenance and low cost. Although desktop grids are attractive for cost-conscious projects, relying on external resources may compromise the correctness of application execution due to the unreliability of nodes.

While most existing work assumes that failures in desktop grids are uncorrelated, we have explored efficient and accurate techniques for representing, detecting, and characterizing the presence of correlated or collusive behavior in desktop grid systems.

More precisely, we consider the most challenging model for threats: organized
groups of cheaters that may collude to produce incorrect results. We have
proposed two on-line algorithms to detect collusion and
to characterize the behavior of participants. Using several real-life traces,
we have shown that our approach is accurate and efficient in identifying
collusion and in estimating group behavior.

#### A general methodology for computing the Pareto-front

Optimization problems, such as scheduling, can often be tackled with respect to several objectives. In such cases, there can be several incomparable Pareto-optimal solutions. Computing or approximating such solutions is a major challenge in algorithm design. We have proposed a generalization of the greedy methodology to the multi-objective case. This methodology, called meta-greedy, makes it possible to design a multi-criteria algorithm when a mono-criterion greedy one is known. We have applied our proposed solution to two problems from the literature (a scheduling problem and a knapsack problem) and we have shown that we can generate, in a single execution, a set of non-dominated solutions that are close to the set of Pareto-optimal solutions.
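The notion of non-dominated solutions used above can be made concrete with a short filter. This sketch is not the meta-greedy methodology itself; it only shows how a set of candidate solutions is reduced to its non-dominated subset, assuming that every objective is to be minimized. The function names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better once.

    Assumption: all objectives are minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(
        x < y for x, y in zip(a, b)
    )


def non_dominated(points):
    """Keep the points that no other point dominates (quadratic time)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, among the objective vectors `(1, 5)`, `(2, 3)`, `(3, 4)` and `(4, 1)`, only `(3, 4)` is discarded, because `(2, 3)` is better on both objectives; the three remaining points are incomparable.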

#### Energetic performance measurement and optimization

Several experiments have been carried out on the GPU clusters of SUPÉLEC with
different kinds of problems, ranging from an embarrassingly parallel one
to a strongly coupled one, via an intermediate level. Our first results
tend to confirm our initial intuition that GPUs are a good alternative
to CPUs for problems that can be formulated in a SIMD or massively
multi-threaded way. However, for applications that are not embarrassingly
parallel, the advantage of a GPU cluster tends to decrease
as the number of nodes increases (these results have been presented to the European
COST Action IC0804 on *energy efficiency in large scale distributed systems*). So, energy
savings can clearly be one of the decision criteria for using GPUs instead of CPUs, depending on
the parallel algorithm and the number of computing nodes. The other
criteria will generally be the ease of development and maintenance, which
have a direct impact on the economic cost of the software.

We intend to continue our investigations in that domain at different levels. Among them, we can cite the hybrid processing context which corresponds to the full exploitation of the computational power present on every node of a parallel system. Typically, this includes the use of all the cores in a node in conjunction with any co-processing device. Also, we would like to determine the best trade-off between energy saving and design/implementation costs.

#### Load balancing

A load-balancing algorithm based on asynchronous diffusion with bounded delays has been designed to work on dynamic networks. It is iterative by nature, and we have provided a proof of its convergence in the context of load conservation. We have also given constraints on the load migration ratios of the nodes that ensure convergence. Although this work is still in progress, our first results are encouraging: experiments in the SimGrid framework have highlighted the efficiency of our algorithm.
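The principle of diffusion-based load balancing can be sketched in a much simpler setting than the one studied here. The code below is a synchronous variant on a static network without delays, so it is not our asynchronous algorithm; `alpha` plays the role of a single, uniform load migration ratio, and all names are hypothetical.

```python
def diffusion_step(loads, neighbors, alpha):
    """One synchronous diffusion step.

    loads:     list of node loads
    neighbors: dict node -> list of adjacent nodes (symmetric links)
    alpha:     fraction of the load difference sent to a lighter neighbor
    """
    new_loads = list(loads)
    for i, adjacent in neighbors.items():
        for j in adjacent:
            # Only the more loaded end of a link sends load, so each
            # link carries at most one transfer per step.
            if loads[i] > loads[j]:
                moved = alpha * (loads[i] - loads[j])
                new_loads[i] -= moved
                new_loads[j] += moved
    return new_loads


def balance(loads, neighbors, alpha, steps):
    """Apply `steps` diffusion steps and return the final loads."""
    for _ in range(steps):
        loads = diffusion_step(loads, neighbors, alpha)
    return loads
```

Every transfer removes from one node exactly what it adds to another, so the total load is conserved at each step. In this simplified setting, a value of `alpha` that is too large relative to the node degrees makes the loads oscillate instead of converging, which illustrates why constraints on the migration ratios are needed.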

The perspectives of this work are twofold. The first one concerns the internal functioning of our algorithm: an intrinsic parameter tunes the load migration ratios, and we would like to determine its optimal value. The other one concerns the application side in a real parallel environment. Indeed, with Stéphane Genaud, we intend to apply this algorithm to a parallel version of the AdaBoost learning algorithm. We will compare our load-balancing scheme to other existing ones in different programming environments, among which the P2P-MPI framework.

Concerning the Neurad project, our parallel learning proceeds by decomposing the data set to be learned. However, a simple regular decomposition is not sufficient, as the resulting sub-domains may have very different learning times. Thus, our work in progress concerns the determination of the best estimator of the learning time of a sub-domain, in order to obtain similar learning times for all the sub-domains.

We are also investigating the decomposition strategy, i.e. the global scheme and its inner choices. Until now, we have opted for a URB (Unbalanced Recursive Bisection) approach. We are currently working on the characterization of the best choice of dimension to divide at each decomposition step.
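The idea of a recursive bisection whose cuts balance an estimated cost rather than the number of points can be sketched as follows. It is an illustration under assumptions, not the decomposition used in Neurad: `cost` stands for a hypothetical learning-time estimator, and cutting along the dimension with the largest extent is only one possible rule, since the best choice of dimension is precisely what remains to be characterized.

```python
def bisect(points, cost, depth):
    """Recursively split points into at most 2**depth parts of similar cost.

    points: list of coordinate tuples
    cost:   function point -> estimated cost (hypothetical estimator)
    depth:  number of bisection levels
    """
    if depth == 0 or len(points) < 2:
        return [points]
    dims = range(len(points[0]))
    # Assumption: cut along the dimension with the largest extent.
    dim = max(
        dims,
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
    )
    ordered = sorted(points, key=lambda p: p[dim])
    half = sum(cost(p) for p in ordered) / 2
    running, cut = 0.0, 1
    # Move the cut until the left part holds about half of the total cost,
    # always leaving at least one point on each side.
    for index, point in enumerate(ordered[:-1], start=1):
        running += cost(point)
        cut = index
        if running >= half:
            break
    left = bisect(ordered[:cut], cost, depth - 1)
    right = bisect(ordered[cut:], cost, depth - 1)
    return left + right
```

With four points on a line where the first one costs three times as much as the others, one level of bisection isolates that point on one side and puts the three cheap points on the other, so both parts have the same estimated cost although their sizes differ.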

#### Fault Tolerance

##### Application-level fault tolerance

Concerning fault tolerance, we have worked with Marc Sauget, from the University of Franche-Comté, on a parallel and robust algorithm for neural network learning in the context of the Neurad project. A short description of that project is given in Section 6.2.6.

As that learning algorithm is to be used on local clusters, we have opted for a simple approach based on the client-server model: a server distributes all the learning tasks (the data-set sub-domains to be learned) to the clients, which perform the learning. However, although the parallel context is very classical, the potentially very long learning times of the sub-domains (the data sets may be huge) require a fault-tolerance mechanism to avoid losing any learning result.

So, we have developed a mechanism for detecting client faults together with a restarting process. We have also studied some variants of task redistribution, as a fault may occur only at the link level and may not imply the loss of the learning in progress on the corresponding client. Our final choice was to redistribute the task as soon as a fault is detected. If the client later recovers from that fault, two clients are then performing the same learning. In that case, we do not stop either of them but let them run until one sends back its result. Only then is the other client stopped.
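The redistribution policy described above can be summarized by the server-side bookkeeping it requires. The class below is a minimal sketch with hypothetical names; fault detection itself, networking and the actual learning are left out.

```python
class LearningTaskTracker:
    """Tracks which clients work on which learning task (illustrative)."""

    def __init__(self):
        self.running = {}  # task id -> set of clients working on it
        self.results = {}  # task id -> first result received

    def assign(self, task, client):
        """Record that `client` has been given `task`."""
        self.running.setdefault(task, set()).add(client)

    def on_fault_detected(self, task, spare_client):
        """Redistribute the task as soon as a fault is detected.

        The suspected client is kept in the set: the fault may only be
        a link failure, and its learning may still complete.
        """
        self.assign(task, spare_client)

    def on_result(self, task, client, result):
        """Store the first result; return the clients that must be stopped."""
        if task in self.results:
            return set()  # a duplicate that finished late is ignored
        self.results[task] = result
        return self.running.pop(task, set()) - {client}
```

Keeping both clients running costs some redundant computation, but it guarantees that the learning already done by a client hit by a transient link fault is not thrown away.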

That strategy has proved to be rather efficient and robust in our experiments, performed with real data on a local cluster where faults were generated [46]. Although those results are rather satisfying, we would like to investigate still more reactive mechanisms as well as the insertion of robustness at the server level.

##### Programming model and frameworks for fault tolerant applications

In the framework of the PhD thesis of Constantinos Makassikis, supervised by Stéphane Vialle, we
have designed a new fault tolerance model for distributed applications which is based on a
collaboration of fault-tolerant development frameworks and application-semantic knowledge supplied
by users. Two development frameworks have been designed according to two different parallel
programming paradigms: one for *Master-Workers* applications and another one for *SPMD*
applications including inter-node communications. The user's task is limited: they merely need to
supply some computing routines (functions of the application), and add some extra code to use
parallel programming skeletons and to tune the checkpointing frequency.

Our first experiments have exhibited limited overheads when no failure happens and acceptable overheads in the worst-case failure scenarios. These overheads appear lower than those obtained with all the fault-tolerant middleware we have experimented with, while the development time overhead is very limited using our frameworks. Moreover, detailed experiments on up to 256 nodes of our cluster have shown that it is possible to finely tune the checkpointing policies of the frameworks in order to implement different fault tolerance strategies according, for example, to cluster reliability.

##### System-level fault tolerance

The approach to fault tolerance we offer in the P2P-MPI framework is based on replication of computations. We have studied both theoretical and experimental aspects of this approach. Our protocol, which is an adaptation of the active replication principle, incurs an overhead compared to an execution without replication because extra messages must be sent to replicas. On the one hand, being longer, the execution is more failure-prone; on the other hand, replication ensures a greater level of robustness. Hence, there is a trade-off. First, we have studied this trade-off to determine the optimal replication degree [16]; our findings are based on the failure distribution model published in [53], which is itself based on real traces. Second, we have carried out a number of experiments to assess the overhead of replication. We have shown the cost of replication on varied programs from the Java Grande Forum benchmark and with two NAS benchmarks [25].