Team AlGorille

Section: New Results

Transparent Resource Management

Participants : Pierre-François Dutot, Emmanuel Jeannot, Tchimou N'Takpé, Luiz Angelo Steffenel, Frédéric Suter.

Scheduling under Uncertainty

When scheduling a set of tasks modeled by a DAG, runtime variations can make the actual makespan differ from the makespan computed by the scheduling algorithm. A schedule is said to be robust when this difference is not too large, i.e. when the schedule is not too sensitive to runtime variations.

We have addressed the problem of matching and scheduling DAG-structured applications so as to both minimize the makespan and maximize the robustness in a heterogeneous computing system. Because the two objectives conflict, it is usually impossible to achieve both goals at the same time. We have given two definitions of the robustness of a schedule, based on tardiness and on miss rate. Slack proved to be an effective metric for adjusting the robustness. We have employed the $\epsilon$-constraint method to solve the bi-objective optimization problem in which minimizing the makespan and maximizing the slack are the two objectives. We have defined the overall performance of a schedule in terms of both makespan and robustness, so that users have the flexibility to put the emphasis on either objective. Experimental results have validated the performance of the proposed algorithm.
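To make the two robustness notions and the $\epsilon$-constraint selection more concrete, here is a minimal Python sketch. It assumes that a set of candidate schedules is already available, each with a planned makespan and a slack value, and that realized makespans come from repeated runs or simulations. The function names, the normalization of tardiness by the planned makespan, and the form of the constraint (makespan at most epsilon times the best makespan) are assumptions made for this example and are not taken from the published definitions.

```python
def miss_rate(planned, realized):
    """Fraction of realizations whose makespan exceeds the planned one."""
    return sum(1 for m in realized if m > planned) / len(realized)


def mean_relative_tardiness(planned, realized):
    """Average lateness (zero when on time), relative to the planned makespan."""
    late = sum(max(0.0, m - planned) for m in realized)
    return late / (planned * len(realized))


def epsilon_constraint_pick(candidates, epsilon):
    """Maximize slack subject to makespan <= epsilon * best makespan.

    `candidates` is a list of dicts with hypothetical keys
    "makespan" and "slack"; `epsilon` >= 1.0 is the tolerated degradation.
    """
    best_makespan = min(c["makespan"] for c in candidates)
    feasible = [c for c in candidates if c["makespan"] <= epsilon * best_makespan]
    return max(feasible, key=lambda c: c["slack"])


# Illustrative (made-up) candidate schedules.
schedules = [
    {"name": "tight", "makespan": 100.0, "slack": 5.0},
    {"name": "medium", "makespan": 110.0, "slack": 12.0},
    {"name": "loose", "makespan": 130.0, "slack": 30.0},
]
chosen = epsilon_constraint_pick(schedules, epsilon=1.2)
```

Raising epsilon from 1.0 lets the user accept a longer makespan in exchange for more slack, which is one way to put the emphasis on either objective.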

Parallel Task Scheduling

When conducting our initial experimental comparison of M-HEFT [4] and the HCPA heuristic developed during the master's thesis of Tchimou N'Takpé, we identified several limitations in both algorithms. This year we proposed improvements to address these limitations. The allocation phase of HCPA has been improved with a new stopping criterion that leads to smaller but still efficient allocations. We also introduced a packing strategy to fill the gaps that may appear in its placement phase. Such gaps arise when a task is delayed unnecessarily just because its computed processor allocation is (perhaps only slightly) larger than the number of processors available at the time when the task is ready for execution.
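The idea behind such a packing step can be sketched as follows. This is only an illustration under stated assumptions: task runtimes follow Amdahl's law, the scheduler knows when the full allocation would become available, and the rule "shrink the allocation if that finishes no later than waiting" is a hypothetical decision rule, not necessarily the one used in the improved HCPA.

```python
def amdahl_runtime(seq_time, alpha, procs):
    """Runtime on `procs` processors; `alpha` is the sequential fraction."""
    return seq_time * alpha + seq_time * (1.0 - alpha) / procs


def pack(seq_time, alpha, wanted, free_now, now, time_wanted_free):
    """Return (start_time, processors, finish_time) for a ready task.

    wanted           -- allocation computed in the allocation phase
    free_now         -- processors idle at time `now`
    time_wanted_free -- earliest time at which `wanted` processors are idle
    """
    if free_now >= wanted:
        return (now, wanted, now + amdahl_runtime(seq_time, alpha, wanted))
    wait_finish = time_wanted_free + amdahl_runtime(seq_time, alpha, wanted)
    if free_now >= 1:
        shrink_finish = now + amdahl_runtime(seq_time, alpha, free_now)
        if shrink_finish <= wait_finish:
            # Starting now on fewer processors fills the gap.
            return (now, free_now, shrink_finish)
    # Otherwise keep the original allocation and wait.
    return (time_wanted_free, wanted, wait_finish)
```

With a slightly too large allocation (10 wanted, 8 free), starting immediately on 8 processors can finish earlier than waiting; with only 2 free processors, waiting for the full allocation is preferable.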

We also addressed a glaring drawback of M-HEFT, which is that it tends to use very large processor allocations for application tasks. This is simply due to the fact that a task's processor allocation is chosen "blindly" so that the task's completion time is minimized. To remedy this problem, we proposed three simple ways to bound a task's processor allocation.
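The following sketch illustrates why a completion-time-only choice drifts toward very large allocations, and how a bound can limit it. It assumes an Amdahl's-law runtime model, and the efficiency threshold used here is a hypothetical example of a bound; the text above does not specify the three bounds that were actually proposed.

```python
def runtime(seq_time, alpha, procs):
    """Amdahl's-law runtime; `alpha` is the sequential fraction (assumption)."""
    return seq_time * alpha + seq_time * (1.0 - alpha) / procs


def blind_allocation(seq_time, alpha, max_procs):
    """Pick the allocation that minimizes the task's runtime, nothing else."""
    return min(range(1, max_procs + 1),
               key=lambda p: runtime(seq_time, alpha, p))


def efficiency_bounded_allocation(seq_time, alpha, max_procs, min_efficiency):
    """Largest allocation whose parallel efficiency stays above a threshold."""
    t1 = runtime(seq_time, alpha, 1)
    best = 1
    for p in range(1, max_procs + 1):
        efficiency = t1 / (p * runtime(seq_time, alpha, p))
        if efficiency >= min_efficiency:
            best = p
    return best


blind = blind_allocation(100.0, 0.1, 64)
bounded = efficiency_bounded_allocation(100.0, 0.1, 64, 0.6)
```

Because the modeled runtime keeps decreasing with the number of processors, the blind choice takes every available processor even when the last ones bring almost no gain, whereas the bounded choice stops much earlier.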

Part of this research has been published in [31], [30], and a comparison between improved versions of our heuristics is under submission.

Finally, we are currently working on the implementation of a guaranteed heuristic in collaboration with Henri Casanova, at the University of Hawai`i, Manoa, and Pierre-François Dutot. An optimal allocation is computed by a linear program, and a list scheduling algorithm is then used to place the tasks with these allocations.

Data Redistribution

Various redistribution algorithms from the literature have been implemented and tested this year. Indeed, many algorithms have been proposed to redistribute data within a single cluster. However, no fair comparison exists between these algorithms. Moreover, some of them can easily be extended to solve the KPBS problem, which consists in redistributing data between clusters over a backbone.

We have carried out experiments on the Grid Explorer cluster and on Grid'5000 between the Orsay site and the Rennes site, as well as on a single cluster.

Surprisingly, on the machines we tested, we have found that avoiding contention is not always useful. Indeed, in most cases, the brute-force method is the fastest way to redistribute data from one block-cyclic distribution to another. This result is mainly due to the fact that contention does not degrade the performance of the networks we have used. However, when the pattern is irregular, OGGP is the best scheduling algorithm. We also showed that preemption is useful only if its cost is taken into account by the algorithm.
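For readers unfamiliar with the setting, the sketch below builds the communication matrix of a one-dimensional block-cyclic to block-cyclic redistribution and contrasts the two approaches: brute force issues every message at once, whereas a scheduled approach groups messages into steps in which each sender and each receiver takes part in at most one transfer. The greedy step construction is a deliberately simple stand-in and is not OGGP; senders and receivers are assumed to be two distinct sets of nodes, and all names are illustrative.

```python
def comm_matrix(n, p, r, q, s):
    """Elements to move from each sender to each receiver.

    n elements go from a block-cyclic layout with block size r on p senders
    to a block-cyclic layout with block size s on q receivers.
    """
    c = [[0] * q for _ in range(p)]
    for i in range(n):
        src = (i // r) % p  # owner of element i in the source layout
        dst = (i // s) % q  # owner of element i in the target layout
        c[src][dst] += 1
    return c


def greedy_steps(c):
    """Group messages into steps with one transfer per sender and receiver."""
    pending = {(i, j) for i, row in enumerate(c)
               for j, v in enumerate(row) if v > 0}
    steps = []
    while pending:
        busy_senders, busy_receivers, step = set(), set(), []
        for i, j in sorted(pending):
            if i not in busy_senders and j not in busy_receivers:
                step.append((i, j))
                busy_senders.add(i)
                busy_receivers.add(j)
        pending -= set(step)
        steps.append(step)
    return steps


matrix = comm_matrix(12, 2, 1, 2, 2)
steps = greedy_steps(matrix)
brute_force_messages = sum(1 for row in matrix for v in row if v > 0)
```

In the brute-force method all `brute_force_messages` transfers start simultaneously and the network arbitrates among them; the step-based schedule avoids contention at the network interfaces at the price of synchronization between steps.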

In conclusion, if performance is the only issue, the brute-force method is often the best one. However, if other issues have to be considered (QoS, memory constraints, predictability and stability), scheduling algorithms such as OGGP are very good options.

Performance Prediction

Being able to accurately estimate the runtime of a program and the communication time of a data transfer is critical for efficiently scheduling applications on distributed environments such as grids.

First, we have introduced a template-based modeling mechanism that is able to accurately predict the runtime of a service based on previous executions. It improves on the standard runtime estimation of GridSolve, as it is more accurate and takes into account the specificity of the service and of the machine it runs on. Second, we have developed an estimator of the communication cost between the client and the server. Since the communication cost is often very large, such an estimator makes it possible to discard a fast remote server if the gain in computation time is overshadowed by the communication time.
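The interplay between the two estimators can be sketched as follows. This is a simplified illustration, not GridSolve code: the "template" is assumed to be a known complexity function (here cubic in the problem size) whose single coefficient is fitted by least squares on previous executions, and the communication cost is assumed to follow a latency plus size over bandwidth model. All names and numbers are hypothetical.

```python
def fit_coefficient(history, template):
    """Least-squares fit of `a` in runtime ~= a * template(n).

    `history` is a list of (problem_size, measured_runtime) pairs.
    """
    num = sum(t * template(n) for n, t in history)
    den = sum(template(n) ** 2 for n, _ in history)
    return num / den


def predicted_total(n, data_bytes, server, template):
    """Predicted computation time plus client-server transfer time."""
    compute = server["coeff"] * template(n)
    transfer = server["latency"] + data_bytes / server["bandwidth"]
    return compute + transfer


def pick_server(n, data_bytes, servers, template):
    """Name of the server with the smallest predicted total time."""
    return min(servers,
               key=lambda name: predicted_total(n, data_bytes,
                                                servers[name], template))


def cubic(n):
    return float(n) ** 3


# Made-up servers: a slow nearby one and a fast one behind a slow link.
servers = {
    "local": {"coeff": 1e-6, "latency": 0.0, "bandwidth": 1e9},
    "remote": {"coeff": 1e-7, "latency": 0.05, "bandwidth": 1e6},
}
```

With a large data set and a moderate problem size, the transfer dominates and the fast remote server is discarded; with little data and a large problem, the remote server wins despite the slower link.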

We have also worked on modeling the dense LU factorization in order to predict its runtime on a parallel machine. With this model we are able to predict a block size close to the optimal one for a given matrix size and a given number of processors.

Total Exchange Performance Prediction

One of the most important collective communication patterns for scientific applications is the total exchange (also called All-to-All), in which each process holds n different data items of size m that should be distributed among the n processes, including itself. However, this communication pattern tends to saturate network resources, causing unexpected transmission delays, a phenomenon known as network contention.
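The data movement of a total exchange can be simulated in a few lines. The sketch below is plain Python rather than MPI code, and the function names are illustrative: item j held by process i ends up as item i of process j, so that n(n - 1) items cross the network while each process keeps one item for itself.

```python
def total_exchange(send):
    """Simulate an All-to-All: send[i][j] is the item process i holds for j."""
    n = len(send)
    if any(len(row) != n for row in send):
        raise ValueError("each process must hold exactly n items")
    # Process dst receives one item from every process src, including itself.
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def network_messages(n):
    """Items that actually cross the network (self-sends are local copies)."""
    return n * (n - 1)


received = total_exchange([["a0", "a1", "a2"],
                           ["b0", "b1", "b2"],
                           ["c0", "c1", "c2"]])
```

The quadratic growth of `network_messages(n)` is what makes this pattern prone to saturating the network.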

Having accurate predictions is extremely important for the development of application performance prediction frameworks such as PEMPIs [49] and GridSolve. Because it is not always possible to use contention-aware All-to-All implementations (as in the case of popular MPI libraries), it is important to design performance models that take into account the effects of network contention.

Studying the effects of network contention in the context of MPI programming environments, we introduced a new approach to model the performance of the All-to-All collective operation. Contrary to existing models, which rely on complex interference analysis, our strategy consists in identifying, based on a sample execution, a contention signature that characterizes a given network environment. Using this method, we were able to accurately predict the performance of the All-to-All operation on different network architectures (Fast Ethernet, Gigabit Ethernet and Myrinet, for example), as illustrated in our paper [35].
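The general idea of calibrating a model on a sample execution can be illustrated with a deliberately simplified sketch. It is not the model of [35]: here the "signature" is reduced to a single multiplicative ratio between measured times and a contention-free lower bound, which is assumed to be (n - 1)(alpha + m * beta) with a per-message latency alpha and a per-byte cost beta. The names, the lower bound and the single-ratio form are all assumptions made for this example.

```python
def lower_bound(n, m, alpha, beta):
    """Contention-free All-to-All time: n - 1 messages of m bytes per process."""
    return (n - 1) * (alpha + m * beta)


def fit_contention_ratio(samples, alpha, beta):
    """Least-squares ratio between measured times and the lower bound.

    `samples` is a list of (n, m, measured_time) from a sample execution.
    """
    num = sum(t * lower_bound(n, m, alpha, beta) for n, m, t in samples)
    den = sum(lower_bound(n, m, alpha, beta) ** 2 for n, m, _ in samples)
    return num / den


def predict(n, m, alpha, beta, ratio):
    """Predicted All-to-All time on the calibrated network."""
    return ratio * lower_bound(n, m, alpha, beta)


# Made-up measurements on a network where contention doubles the time.
samples = [(5, 2, 8.0), (3, 6, 8.0)]
ratio = fit_contention_ratio(samples, alpha=0.5, beta=0.25)
estimate = predict(9, 2, 0.5, 0.25, ratio)
```

Once the ratio has been fitted on one network, predictions for other process counts and message sizes on that network only require the point-to-point parameters alpha and beta.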

Grid-aware Total Exchange

As presented above, Total Exchange algorithms (also called All-to-All) are widely studied in the context of (partially) homogeneous clusters subject to network contention. Only a few works try to optimize the execution of such communication patterns on grid environments, and up to now the results are far from being widespread. Indeed, the heterogeneity of the communication environment turns the optimization of the All-to-All operation into an NP-hard problem.

Based on preliminary experiments conducted by [14], we were able to implement in LaPIe some scheduling heuristics that are efficient for small messages (or, more precisely, for strongly heterogeneous environments). Nevertheless, these heuristics fail with large messages, as they are unable to improve the utilisation of the wide-area bandwidth. We are currently observing the impact on the communication schedule of the different algorithms implemented in popular MPI distributions such as MPICH and OpenMPI, and trying to identify the causes of the low bandwidth utilisation sometimes observed with these algorithms. The next step will consist in developing specific heuristics to circumvent these restrictions. They should be tested in both simulated and real environments, using respectively GRAS/MSG (or SMPI, if available) and LaPIe.

