Section: New Results
Providing access to HPC servers on the Grid
Participants : Abdelkader Amar, Raphaël Bolze, Yves Caniou, Eddy Caron, Ghislain Charrier, Benjamin Depardon, Frédéric Desprez, Jean-Sébastien Gay, David Loureiro, Jean-Marc Nicod, Laurent Philippe, Vincent Pichon, Emmanuel Quemener, Cédric Tedeschi, Frédéric Vivien.
In many scientific areas, such as high-energy physics, bioinformatics, and astronomy, we encounter applications involving numerous simpler components that process large data sets, execute scientific simulations, and share both data and computing resources. Such data-intensive applications consist of multiple components (tasks) that may communicate and interact with each other. The tasks are often precedence-related, and the precedence constraints usually follow from the data flow between them: data files generated by one task are needed to start another. This problem is known as workflow scheduling. Surprisingly, the problem of scheduling multiple workflows online does not appear to have been fully addressed. We studied several heuristics based on list scheduling to solve this problem. We also implemented a simulator in order to classify the behaviors of these heuristics depending on the shape and size of the graphs. Some of these heuristics are implemented within Diet and tested with the bioinformatics applications involved in the Décrypthon program.
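As a minimal illustration of the list-scheduling principle behind such heuristics (the priority rule, data structures, and function names here are our own simplifications, not the actual Diet implementation):

```python
import heapq

def list_schedule(tasks, preds, cost, n_servers):
    """Greedy list scheduling: repeatedly pick a ready task and map it
    onto the server that becomes free earliest.

    tasks : list of task ids
    preds : dict task -> set of predecessor tasks
    cost  : dict task -> execution time (identical servers assumed)
    Returns a dict task -> (server, start_time).
    """
    finish = {}                                  # task -> finish time
    free_at = [(0.0, s) for s in range(n_servers)]
    heapq.heapify(free_at)                       # servers ordered by availability
    schedule = {}
    remaining = set(tasks)
    while remaining:
        # ready tasks: all predecessors already scheduled
        ready = [t for t in remaining if preds.get(t, set()) <= finish.keys()]
        # one simple list policy: schedule the longest ready task first
        t = max(ready, key=lambda x: cost[x])
        avail, server = heapq.heappop(free_at)
        start = max(avail,
                    max((finish[p] for p in preds.get(t, set())), default=0.0))
        finish[t] = start + cost[t]
        schedule[t] = (server, start)
        heapq.heappush(free_at, (finish[t], server))
        remaining.remove(t)
    return schedule
```

Online scheduling of multiple workflows then amounts to merging the ready tasks of all submitted graphs into this single ready list, with the priority rule deciding between workflows.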
We also work on scheduling workflows in the context where the services involved in the workflows are not necessarily available on all computing resources. In that case, services must be scheduled carefully so as not to favor only short-term performance: for example, a powerful resource may be kept idle so that it can later run a job that only it can execute. Numerous heuristics have been designed, and we are currently evaluating them before implementing them in the Diet Grid middleware.
Service Discovery in Peer-to-Peer environments
We study computational Grids, peer-to-peer systems, and their possible interactions to provide large scale service discovery within computational Grids. In order to address this issue, we first developed a new architecture called DLPT (Distributed Lexicographic Placement Table). This is a distributed indexing system based on a prefix tree that offers a totally distributed way of indexing and retrieving services at a large scale. The DLPT offers flexible search operations for users, techniques to replicate the structure for fault tolerance over dynamic platforms, and a greedy algorithm that partially takes into account the performance of the underlying network.
One of the fundamental aspects of peer-to-peer systems is fault tolerance. Starting from the fact that replication is costly and does not ensure that the system recovers after transient failures, we developed, in collaboration with Franck Petit from the LaRIA, new fault-tolerance algorithms for our architecture. First, we provided a protocol to repair the structure after node crashes. Second, we proposed a self-stabilizing version of such architectures. We have also begun a collaboration with Prof. Ajoy K. Datta from the University of Nevada, Las Vegas, on a self-stabilizing version of the structures used in the DLPT, designed in a less restricted model than our previous work. A paper resulting from this work is currently under submission.
We recently developed algorithms that efficiently map the logical structures used within our architecture onto networks structured as rings. We also developed new load-balancing heuristics for DHTs, and adapted both these and existing ones to our case. Our heuristic obtained good results when compared with the others. This work is also currently under submission.
Finally, a prototype of this architecture is being developed as part of a collaboration on Networked Service Discovery involving Pascale Primet and Pierre Bozonnet from the RESO team and Alcatel. This prototype is being prepared for experimentation on a real platform such as Grid'5000.
Deployment for DIET: Software and Research
Using distributed heterogeneous resources requires efficient and simple tools to deploy applications. However, current tools still lack maturity in their resource selection methods. This year we proposed an extension to the generic application description model (GADe) of a Grid deployment software: Adage (“Automatic Deployment of Applications in a Grid Environment”), developed in the PARIS project team at IRISA, Rennes.
Our extension to GADe proposes a model for hierarchical (tree-shaped) applications. We present a heuristic to find the shape of a hierarchical application on a given platform, together with two kinds of mapping heuristics: one based on two sub-heuristics (one to define a set of nodes, and one to choose among those nodes), and a second based on affinity lists between nodes and processes. We try to satisfy three criteria: minimize the communication costs, balance the load among the nodes, and maximize the number of deployed instances. Our simulations show that no heuristic dominates the others, even though the most interesting one is the affinity-based one. One has to choose a heuristic depending on which part of the objective function to prioritize (which combination of number of deployed instances, communication cost, and load balancing). We also deployed hundreds of Diet elements using Adage, and showed that this tool is far more efficient than the current Diet deployment tool, GoDiet.
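A rough sketch of the affinity-list idea, under simplifying assumptions of our own (a single pre-computed affinity score per process/node pair, and a plain per-node capacity standing in for load balancing; the actual heuristic combines communication cost, load, and number of instances):

```python
def affinity_map(processes, nodes, affinity, capacity):
    """Map each process to the available node for which it has the highest
    affinity, subject to a per-node capacity limit.

    affinity : dict (process, node) -> score (higher = better fit, e.g.
               combining low communication cost with node power)
    capacity : dict node -> maximum number of processes on that node
    Returns a dict process -> node.
    """
    load = {n: 0 for n in nodes}
    mapping = {}
    for p in processes:
        # candidate nodes ordered by decreasing affinity for this process
        candidates = sorted(nodes, key=lambda n: affinity[(p, n)], reverse=True)
        for n in candidates:
            if load[n] < capacity[n]:
                mapping[p] = n
                load[n] += 1
                break
    return mapping
```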
The subject is far from being closed. Future work will propose deployment heuristics for parallel applications, which represent a large part of the applications used on a Grid. An important point that has not yet been taken into account is the compatibility between applications and resources: an application may not be launchable on the whole platform due to memory, disk space, or even library constraints. This reduces the possible mappings of the instances.
Even if we do not know when the communications take place, they can generate bottlenecks on the communication links. Indeed, if we do not consider the platform as a fully connected graph, we must take into account the paths that connections follow, and if several communications take place simultaneously, a link may not be able to support the load.
Finally, our work considers only static deployments. We assume that we have a set of resources and a set of processes to deploy at a given time t, and that this deployment does not change afterwards. However, the utilization of the processes deployed at time t will certainly differ from their utilization later on. This raises the redeployment problem, that is to say, the modification of the current deployment to take new parameters into account. This requires taking into account the current mapping of the processes, as well as the modifications of the process parameters.
To validate our deployment heuristics in a real environment, we implemented them in the deployment software Adage. Our experiments on Grid'5000 allowed us to deploy hundreds of Diet elements in a much more efficient fashion than GoDiet (twice as fast). We intend to interface the Diet Dashboard with Adage and to replace GoDiet in the Diet deployment process.
Grid'5000 large scale experiments
Large Experiment for Cosmological Simulations. We studied the possibility of computing a large number of low-resolution simulations. The client requests a 128³-particle, 100 Mpc·h⁻¹ simulation (first part). When it receives the results, it simultaneously requests 100 sub-simulations (second part). As each server cannot work on more than one simulation at a time, no more than 11 computations can run in parallel. The experiment (including both the first and the second part of the simulation) lasted 16h 18min 43s (1h 15min 11s for the first part and an average of 1h 24min 1s for the second part). After the first part of the simulation, each SeD received 9 requests (one of them received 10) to compute the second part. The total execution time differs from one server to another: about 15h for Toulouse and 10h 30min for Nancy. Consequently, the schedule is not optimal: the equal distribution of the requests does not take the machines' processing power into account. In fact, when Diet receives the requests (all at the same time), the second part of the simulation has never been executed, so Diet knows nothing about its processing time; the best it can do is to share the total number of requests among the available SeDs. A better makespan could be attained by writing a plug-in scheduler; we are working on this problem.
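One candidate policy for such a plug-in scheduler would replace the equal split by a split proportional to the servers' processing power. A minimal sketch (the function and its parameters are illustrative, not the Diet plug-in API):

```python
def share_requests(n_requests, powers):
    """Split n_requests among servers proportionally to their relative
    processing power, assigning the rounding leftovers to the most
    powerful servers first.

    powers : list of per-server processing powers (arbitrary units)
    Returns a list with the number of requests given to each server.
    """
    total = sum(powers)
    shares = [int(n_requests * p / total) for p in powers]
    leftover = n_requests - sum(shares)
    # hand the remaining requests to the most powerful servers
    ranked = sorted(range(len(powers)), key=lambda i: powers[i], reverse=True)
    for i in ranked[:leftover]:
        shares[i] += 1
    return shares
```

With a server 1.5 times more powerful than another, this would have shifted requests away from the slower site and reduced the 15h versus 10h30 imbalance observed above.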
Diet at Supercomputing. At the Inria booth of the SuperComputing'07 conference, we showed from Reno a real execution on Grid'5000 through GRUDU, using the Diet Dashboard. The challenge was to perform this demo with all the steps of Grid usage, from operating system deployment to workflow execution. To evaluate the performance of Diet on the French Grid Grid'5000 and to present its functionalities in a demo, the Diet Dashboard and its fork GRUDU are very useful. GRUDU (Grid'5000 Reservation Utility for Deployment Usage) is a tool for managing Grid'5000 resources, reservations, and deployments. Initially developed to help Diet users on Grid'5000, this part of the Diet Dashboard can also be used as a stand-alone tool called GRUDU. The first part of the demo highlights how GRUDU can benefit Grid end-users; the second part focuses on how to use Diet and its workflow support on a real Grid through the Diet Dashboard.
Join Scheduling and Data Management
Usually, in existing Grid computing environments, data replication and scheduling are two independent tasks. In some cases, replication managers are requested to find the best replicas in terms of access costs. But the choice of the best replica has to be made at the same time as the scheduling of computation requests. We first proposed an algorithm that simultaneously computes the mapping of the data and of the computational requests on these data, using a linear program and a method to obtain a mixed solution, i.e., with both integer and rational numbers, of this program. However, our results hold only if the submitted requests precisely follow the usage frequencies given as input to the static replication and scheduling algorithm. Due to particular biological experiments, these patterns may occasionally change. To cope with such changes, we developed a dynamic algorithm and a set of heuristics that monitor the execution platform and decide when to move data and to reschedule requests. The main goal of this algorithm is to balance the computational load among the servers. Again using the OptorSim simulator, we compared the results of the different heuristics. The conclusion of these simulations is that, under our hypotheses, we have a set of heuristics that reliably adapt the data placement and request scheduling to achieve an efficient usage of all computational resources.
In this previous work, we designed a scheduling strategy based on the hypothesis that, over a large enough time interval, the proportion of jobs using a given data set is always the same. As observed in execution traces of bioinformatics clusters, this hypothesis seems to match the way these clusters are generally used. However, this algorithm takes into account neither the initial data distribution costs nor, in its original version, the dynamicity of the proportions of submitted jobs. We introduced algorithms that achieve good performance as soon as the process starts and that handle data redistribution when needed. We want to run a continuous stream of jobs, using linear-time algorithms whose running time depends on the size of the data on which they are applied. Each job is submitted to a Resource Broker, which chooses a Computing Element (CE) on which to queue the job. When a job is queued on a CE, it waits for the next worker node that can execute it, with a FIFO policy. These algorithms try to take into account temporary changes in the usage of the platform and do not need dynamic information about the nodes (CPU load, free memory, etc.). The only information used to make the scheduling decisions is the frequency of each kind of submitted job. Thus, all the information the scheduler needs is collected by the scheduler itself, avoiding the use of complex platform-monitoring services. In a next step, we will concentrate on the data redistribution process, which is itself a non-trivial problem. We will study redistribution strategies to improve the performance of the algorithms that dynamically choose where to replicate the data on the platform. Large scale experiments have already been performed on the Grid'5000 experimental platform using the Diet middleware. This work is done in collaboration with the PCSV team of the IN2P3 institute in Clermont-Ferrand.
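A toy version of such a monitoring-free decision rule, assuming the Resource Broker only tracks the queue lengths produced by its own past decisions (all names here are hypothetical, not the actual middleware interfaces):

```python
def choose_ce(job_kind, replicas, queued):
    """Route a job to the Computing Element that already holds a replica
    of its input data and currently has the shortest queue. Only the
    queue counters maintained by the scheduler itself are consulted:
    no external monitoring service (CPU load, free memory) is needed.

    replicas : dict job_kind -> list of CEs holding the needed data
    queued   : dict CE -> number of jobs already queued (FIFO per CE)
    Returns the chosen CE and updates its queue counter.
    """
    ce = min(replicas[job_kind], key=lambda c: queued[c])
    queued[ce] += 1
    return ce
```

The submission frequency of each job kind then drives the (separate) replication decisions, i.e., how many CEs appear in each replicas list.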
Parallel Job Submission Management
We have performed several experiments, some with Ramses (see Section 4.5) and others using the Décrypthon applications. We plan to build a client/server for the Lammps software (see Section 4.2). We have undertaken work to add performance prediction for parallel resources to Diet: communicating with batch systems and simulating them with the Simbatch simulator that we have developed (see next section). We will then have sufficient information to incorporate pertinent distributed scheduling algorithms into Diet.
Job Submission Simulations
Generally, a parallel computing resource is used via a batch reservation system. The algorithms involved can greatly impact performance and, consequently, be critical for the efficiency of Grid computing. Unfortunately, few Grid simulators take those batch reservation systems into account. They provide at best a very restricted model using an FCFS algorithm, and few of them deal with parallel tasks. In this context, we have proposed a reusable module named Simbatch (http://simgrid.gforge.inria.fr/doc/contrib.html) as a built-in for the Grid simulator Simgrid (http://simgrid.gforge.inria.fr/) that allows various batch schedulers to be modeled easily.
Simbatch is an API written in C providing the core functionalities to easily model batch schedulers, and to design and evaluate algorithms. For the moment, three of the best-known batch scheduling algorithms are already incorporated: Round Robin (RR), First Come First Served (FCFS), and Conservative BackFilling (CBF). A simple use of the batch schedulers provided by Simbatch in a Simgrid simulation goes through the two traditional SimGrid configuration files (platform file and deployment file) and an additional file, named simbatch.xml, describing every batch system used in the simulation. For advanced use of Simbatch, a set of functions is available to write new plug-in algorithms.
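For illustration only (this is not the Simbatch C API), the FCFS policy for rigid parallel jobs can be stated in a few lines, assuming all jobs are submitted at time zero and no job may jump ahead of an earlier one:

```python
def fcfs_schedule(jobs, total_procs):
    """First Come First Served for rigid parallel jobs.

    jobs        : list of (procs, walltime) pairs, in submission order
    total_procs : number of processors in the cluster
    Returns the start time of each job, in submission order.
    """
    running = []           # list of (end_time, procs) for running jobs
    free = total_procs
    clock = 0.0
    starts = []
    for procs, walltime in jobs:
        # advance time until enough processors have been released
        while free < procs:
            running.sort()                 # earliest-ending job first
            end, p = running.pop(0)
            clock = max(clock, end)
            free += p
        starts.append(clock)
        running.append((clock + walltime, procs))
        free -= procs
    return starts
```

Conservative BackFilling extends this by letting later, smaller jobs start in the resulting idle holes, provided they do not delay the reserved start time of any earlier job.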
We have compared, task by task, the flow metric (the time a task spends in the system) between a real batch system (OAR, developed in Grenoble, which implements CBF) and the Simbatch simulator. Simulations without communication costs show an error rate on the flow metric generally below 1%, while simulations involving communication costs show an error rate around 3%. In the majority of our experiments, the schedules produced by the two systems are strictly identical. These good results allow us to consider using Simbatch as a prediction tool that can be integrated into Grid middleware such as Diet.
Scheduling for Ocean-Atmosphere simulations
We have analyzed and modelled a real climatology application with the purpose of deriving appropriate scheduling heuristics. First, the application was modelled as independent identical workflows built by chaining several basic DAGs. Then a simplified model with clustered tasks, based on the actual timing parameters of the application, was derived. For this new model, a first scheduling heuristic was proposed, driven by the principle of allocating the same number of processors to all multi-processor tasks and leaving the remainder to the post-processing tasks. Three improved versions were then proposed: a first one that distributes the resources left unused evenly across the groups of processors; a second one that does not reserve any resource for the post-processing tasks and distributes all remaining resources evenly among the groups of processors; and a third one that models the problem of dividing the resources of the platform into disjoint sets as an instance of the Knapsack problem with a supplementary constraint. The three improved versions have been simulated and yielded gains of up to 9%.
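The even-distribution variants can be sketched as follows (a simplification of our own that assumes interchangeable groups; the function and parameter names are ours):

```python
def allocate_groups(n_procs, n_groups, reserve_post=0):
    """Give each group of multi-processor tasks an equal share of the
    processors, spreading any leftover one by one across the groups.

    n_procs      : total processors on the platform
    n_groups     : number of disjoint processor groups to build
    reserve_post : processors kept aside for post-processing tasks
                   (0 reproduces the variant that reserves nothing)
    Returns the size of each group.
    """
    usable = n_procs - reserve_post
    base, extra = divmod(usable, n_groups)
    # the first `extra` groups absorb one leftover processor each
    return [base + (1 if i < extra else 0) for i in range(n_groups)]
```

The Knapsack-based variant replaces this even split by a search for group sizes maximizing the processing rate under the same total-resource constraint.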
Finally, we proposed scheduling heuristics for the generalized problem of scheduling independent identical chains of identical DAGs (each DAG being composed of an independent pre-processing task, an independent post-processing task, a main processing task, and an inter-processing task linking successive DAGs, all tasks being multi-processor). We compared them to the approach of applying a mixed-parallelism scheduling algorithm to the composite DAG obtained by linking all entry tasks to a common entry node and all exit tasks to a common exit node. The results of the four proposed heuristics were highly encouraging, not only in terms of the gains obtained with respect to the CPA mixed-parallelism scheduling algorithm, but also in terms of the running time needed to find a solution: at most a second to determine the optimal pipeline, compared to tens of minutes or even an hour for running CPA on a problem of dimension 10 chains of 1800 iterations of the basic DAG each.