## Section: New Results

### Scheduling Strategies and Algorithm Design for Heterogeneous Platforms

Participants: Anne Benoît, Leila Ben Saad, Sékou Diakité, Alexandru Dobrila, Fanny Dufossé, Matthieu Gallet, Mathias Jacquelin, Loris Marchal, Jean-Marc Nicod, Laurent Philippe, Veronika Rehn-Sonigo, Paul Renaud-Goud, Clément Rezvoy, Yves Robert, Bernard Tourancheau, Frédéric Vivien.

#### Mapping simple workflow graphs

Mapping workflow applications onto parallel platforms is a challenging problem that becomes even more difficult when platforms are heterogeneous, which is nowadays a standard assumption. A high-level approach to parallel programming not only eases the application developer's task, but also provides additional information that can help realize an efficient mapping of the application. We focused on simple application graphs such as linear chains and fork patterns. Workflow applications are executed in a pipelined manner: a large set of data needs to be processed by all the different tasks of the application graph, which induces parallelism between the processing of different data sets. For such applications, several antagonistic criteria should be optimized, such as throughput, latency, failure probability and energy consumption.

We have considered the mapping of workflow applications onto
different types of platforms: *fully homogeneous* platforms
with identical processors and interconnection links;
*communication homogeneous* platforms, with identical links but
processors of different speeds; and finally, *fully heterogeneous*
platforms.

Once again, this year we have focused mainly on pipeline graphs, and considered platforms in which processors are subject to failure during the execution of the application. We have added a new optimization objective, namely energy minimization, and we have also addressed more sophisticated settings by considering several concurrent applications. On the theoretical side, we have established the complexity of many optimization problems involving energy and several concurrent applications. On the experimental side, we have designed several heuristics which aim at efficiently mapping concurrent applications on a heterogeneous platform (for an energy criterion), given some constraints on the throughput and latency of the application.
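To make the throughput and latency criteria concrete, the following minimal sketch evaluates a given mapping of a pipeline onto a communication homogeneous platform. It assumes an interval mapping (each processor receives a set of consecutive stages) and a model without overlap between communications and computations; the function name `period_and_latency` and all parameter names are hypothetical and only serve the illustration.

```python
from typing import List, Tuple


def period_and_latency(
    work: List[float],
    data: List[float],
    intervals: List[Tuple[int, int]],
    speeds: List[float],
    bandwidth: float,
) -> Tuple[float, float]:
    """Evaluate an interval mapping of a linear pipeline.

    work[i]   : computation cost of stage i (0 <= i < n)
    data[i]   : size of the data entering stage i; data[n] is the final output
    intervals : consecutive (first, last) stage ranges, inclusive, covering all stages
    speeds[k] : speed of the processor in charge of intervals[k]
    bandwidth : common bandwidth of all links (assumed identical)
    """
    period = 0.0
    latency = 0.0
    for (first, last), speed in zip(intervals, speeds):
        t_in = data[first] / bandwidth             # receive the input of the interval
        t_comp = sum(work[first:last + 1]) / speed  # process all stages of the interval
        t_out = data[last + 1] / bandwidth          # send the output of the interval
        # Without overlap, a processor handles one data set every t_in + t_comp + t_out.
        period = max(period, t_in + t_comp + t_out)
        # A data set traverses each interval once; outputs are counted as the
        # input of the next interval, and the final output is added below.
        latency += t_in + t_comp
    latency += data[len(work)] / bandwidth
    return period, latency


if __name__ == "__main__":
    # Three stages split into two intervals on processors of speed 2 and 3.
    print(period_and_latency([4, 2, 6], [1, 1, 1, 1], [(0, 1), (2, 2)], [2, 3], 1))
```

The throughput is the inverse of the period. Merging more stages onto one fast processor removes communications and tends to lower the latency, but it usually increases the period, which is why the two criteria are antagonistic.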

Also, in joint work with Kunal Agrawal (at MIT during the course of the study),
we have thoroughly
investigated the complexity of the *scheduling* problem: even when the
mapping is given, it turns out to be difficult to orchestrate
communication and computation operations, i.e., to decide at which time-step
each operation should begin (and end). We demonstrated that some
instances of this problem are NP-hard, and we provided some
approximation algorithms.

Finally, in collaboration with Oliver Sinnen, from the University of Auckland (New Zealand), we have investigated the bi-criteria problem of optimizing both throughput and reliability when processors are subject to failures. Replication, i.e., the mapping of an application stage onto more than one processor, can be used to increase throughput but also to increase reliability. Finding the right replication trade-off plays a pivotal role in this bi-criteria optimization problem. Our formal model includes heterogeneous processors, in terms of both execution speed and reliability. We have studied the complexity of the various subproblems and shown how a solution can be obtained for the polynomial cases. For the general NP-hard problem, we have proposed heuristic algorithms, which were experimentally evaluated. We have also proposed an exact algorithm based on A* state space search, which allows us to evaluate the performance of our heuristics on small problem instances.
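The two uses of replication can be illustrated with a small sketch. It is not the model of the study, only a simplified reading of it under stated assumptions: processor failures are independent, replicas used for reliability all process the same data set (the stage fails only if every replica fails), and replicas used for throughput process distinct data sets in round-robin order (so the slowest replica sets the pace). All function names are hypothetical.

```python
from math import prod
from typing import List


def stage_reliability(failure_probs: List[float]) -> float:
    """Reliability of one stage replicated for fault tolerance.

    The stage fails only if all of its replicas fail (independence assumed).
    """
    return 1.0 - prod(failure_probs)


def pipeline_reliability(mapping: List[List[float]]) -> float:
    """Reliability of a pipeline: every stage must succeed.

    mapping[i] lists the failure probabilities of the processors
    on which stage i is replicated.
    """
    return prod(stage_reliability(fp) for fp in mapping)


def round_robin_period(work: float, speeds: List[float]) -> float:
    """Period of one stage replicated for throughput.

    Data sets are dealt to the replicas in round-robin order, so the
    slowest replica bounds the rate: k replicas divide its time by k.
    """
    return max(work / s for s in speeds) / len(speeds)


if __name__ == "__main__":
    # Two processors for one stage of cost 6: use them for reliability...
    print(stage_reliability([0.1, 0.2]))
    # ...or for throughput.
    print(round_robin_period(6.0, [2.0, 3.0]))
```

With the same two processors, the stage either becomes more reliable at an unchanged period or gains throughput at an unchanged (or lower) reliability, which is the trade-off the heuristics have to arbitrate.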

#### Resource allocation strategies for in-network stream processing

We pursued the work on the operator mapping problem for in-network stream processing applications, initiated last year. In-network stream processing consists of applying a tree of operators in steady-state to multiple data objects that are continually updated at various locations on a network. Examples of in-network stream processing include the processing of data in a sensor network, or of continuous queries on distributed relational databases. Last year, we focused on a “constructive” scenario, i.e., a scenario in which one builds a platform dedicated to the application by purchasing processing servers with various costs and capabilities. The objective was to minimize the cost of the platform while ensuring that the application achieves a minimum steady-state throughput. This year, we considered a more general non-constructive scenario and investigated the problem in which several applications are using the platform concurrently. In particular, we demonstrated the importance of node reuse in such a context: if we can reuse some results from one application to another, we decrease the load on the processors at the cost of some additional communication. Several sophisticated heuristics have been designed and evaluated.

#### Scheduling small to medium batches of identical jobs

Steady-state scheduling is optimal for an infinite number of jobs. It defines a schedule for a subset of jobs which are performed within a period. The global schedule is obtained by infinitely repeating this period. In the case of a finite number of jobs, this scheduling technique can still be used if the number of computed jobs is large. Three phases are distinguished in the schedule: an initialization phase which computes the tasks needed to enter the steady state, an optimal phase composed of several full periods, and a termination phase which finishes the tasks remaining after the last period. With a finite number of jobs we must consider a different objective function, the makespan, instead of the throughput used in the steady-state case. We know that the steady-state phase of the schedule is optimal, thus we are interested in optimizing the initial and final phases.

We have worked on the improvement of the steady-state technique for a finite number of jobs. The main idea is to improve the scheduling of the sub-optimal phases: initialization and termination. By optimizing these two phases we reduce their weight in the global schedule and thus improve its performance. In the original algorithm the period is computed by a linear program. As a result, the period's length can be quite large, resulting in a lot of temporary job instances. Each of these temporary job instances must be prepared in the initialization phase, and finished in the termination phase. We have two directions of optimization: (i) limiting the period length using sub-optimal but much simpler solutions, and (ii) better organizing the period to reduce the number of inter-period dependencies. Both propositions have been studied and implemented in the SimGrid toolkit. We have demonstrated the usefulness of both approaches (and their combination) to obtain a steady-state schedule that is better suited to cases where the number of jobs to process is a few hundred.
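The effect of the period length on the makespan can be seen on a deliberately crude model, sketched below. It assumes that the initialization and termination phases have fixed durations and that all jobs are counted in full periods; the function name `finite_makespan` and the numerical values are hypothetical and were chosen only to illustrate the trade-off, not taken from the experiments.

```python
import math


def finite_makespan(
    n_jobs: int,
    period_length: float,
    jobs_per_period: int,
    init_time: float,
    term_time: float,
) -> float:
    """Makespan of a three-phase schedule: initialization, full periods, termination."""
    full_periods = math.ceil(n_jobs / jobs_per_period)
    return init_time + full_periods * period_length + term_time


if __name__ == "__main__":
    # Hypothetical "optimal" long period: 60 jobs every 120 time units,
    # but costly initialization and termination phases.
    long_period = dict(period_length=120, jobs_per_period=60, init_time=100, term_time=100)
    # Hypothetical simpler period: lower throughput (4 jobs every 10 time units),
    # but much cheaper transient phases.
    short_period = dict(period_length=10, jobs_per_period=4, init_time=8, term_time=8)

    for n in (200, 20000):
        print(n, finite_makespan(n, **long_period), finite_makespan(n, **short_period))
```

In this toy setting the shorter, sub-optimal period gives the better makespan for 200 jobs, while the long optimal period wins for 20000 jobs, which matches the intuition behind limiting the period length for small to medium batches.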

#### Static strategies for worksharing with unrecoverable interruptions

In this work, one has a large workload that is “divisible” and one has access to a number of remote computers that can assist in computing the workload. The problem is that the remote computers are subject to interruptions of known likelihood that kill all work in progress. One wishes to orchestrate sharing the workload with the remote computers in a way that maximizes the expected amount of work completed. In a previous work, we studied strategies for achieving this goal, by balancing the desire to checkpoint often, in order to decrease the amount of vulnerable work at any point, against the desire to avoid the context-switching required to checkpoint. That study assumed that interruptions follow a linear model and that the remote computers have the same characteristics.
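The tension between checkpointing often and avoiding checkpoint overhead can be sketched for a single remote computer. The code below assumes a linear risk: the probability that the computer has been interrupted by time t is min(1, kappa * t). It also assumes a unit-speed computer and a fixed overhead per chunk. The names `expected_work`, `kappa` and `overhead` are hypothetical and not taken from the study.

```python
from typing import List


def expected_work(chunks: List[float], kappa: float, overhead: float = 0.0) -> float:
    """Expected amount of completed work when the load is processed chunk by chunk.

    A chunk only counts if the computer is still alive when the chunk is
    finished and checkpointed; an interruption kills the chunk in progress.
    """
    elapsed = 0.0
    total = 0.0
    for w in chunks:
        elapsed += w + overhead                      # time at which this chunk is saved
        survive = max(0.0, 1.0 - kappa * elapsed)    # linear risk, clamped at 0
        total += w * survive
    return total


if __name__ == "__main__":
    # One unit of work, processed in one chunk or in two equal chunks.
    print(expected_work([1.0], kappa=0.5))
    print(expected_work([0.5, 0.5], kappa=0.5))
    print(expected_work([0.5, 0.5], kappa=0.5, overhead=0.1))
```

Splitting the work reduces the amount that is vulnerable at any time, so the expected work increases, but each extra chunk adds overhead that delays all later chunks and eats into the gain.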

We first extended that initial study by showing that the heuristics we designed could be straightforwardly extended to deal with any failure model. We validated this extension by simulating these heuristics using actual traces.

We also extended our initial study by considering heterogeneous platforms where the remote computers can be connected with different bandwidths, have different computing speeds, or be subject to different failure laws, as long as these laws are all linear. When at least two of these three characteristics are homogeneous, we proposed closed-form formulas or recurrences to derive the optimal solution. This was done under the hypothesis that the whole divisible load is distributed to computers, and that the work is distributed in a single round. We exposed the complexity of the general case.

#### Scheduling identical jobs with unreliable tasks

Depending on the context, the fault tolerance model may differ. We have studied the case where the fault probability depends on the tasks instead of on the execution resources. The practical use case is a micro-factory where operations are performed on microscopic components. Due to the size of the components, some operations are not as well controlled as others, and thus the complexity of a task affects its reliability. In this context, we consider the scheduling of a set of identical jobs composed of either linear chains or trees of tasks. Several objectives are studied depending on the available resources, in particular maximizing the throughput (number of components output per time unit), and minimizing the makespan (total time needed to output the required number of components). The resources in use are heterogeneous and general purpose but must be configured to execute a given task type. For this reason, finding a good schedule turns into an assignment problem. The simplest instances of this problem can be solved in polynomial time whereas the other cases are NP-complete; for those cases, we designed polynomial heuristics to solve the problem.
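To see why task failures change the load that each task puts on the resources, consider a linear chain. The sketch below assumes that a failed task destroys the component being processed and that failures are independent, so upstream tasks must be executed more often than downstream ones to deliver one finished component. The function name `expected_executions` is hypothetical.

```python
from typing import List


def expected_executions(failure_probs: List[float]) -> List[float]:
    """Expected number of executions of each task per finished component.

    failure_probs[i] is the probability that task i of the chain fails
    (and loses the component). Task i must be run often enough to feed
    all the later tasks despite their own losses.
    """
    counts = []
    needed = 1.0  # one finished component leaves the chain
    for f in reversed(failure_probs):
        needed /= (1.0 - f)  # executions required to get `needed` successes
        counts.append(needed)
    counts.reverse()
    return counts


if __name__ == "__main__":
    # The first and last tasks fail half of the time, the middle one never fails.
    print(expected_executions([0.5, 0.0, 0.5]))
```

These counts can then serve as weights when assigning task types to machines: the throughput is limited by the machine whose assigned tasks, weighted in this way, take the longest.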

This year, we focused on the case in which the failure probability may depend both on tasks and on execution resources, and developed more heuristics. Also, we were able to derive a linear programming formulation of the problem and thus to assess the absolute performance of our heuristics.

#### Resource allocation using virtual clusters

We proposed a novel job scheduling approach for sharing a homogeneous cluster computing platform among competing jobs. Its key feature is the use of virtual machine technology for sharing resources in a precise and controlled manner. We justified our approach and proposed several job scheduling algorithms. We presented results obtained in simulations for synthetic and real-world High Performance Computing (HPC) workloads, in which we compared our proposed algorithms with standard batch scheduling algorithms. We found that our approach provides drastic performance improvements over batch scheduling. In particular, we identified a few promising algorithms that perform well across most experimental scenarios. Our results demonstrate that virtualization technology coupled with lightweight scheduling strategies affords dramatic improvements in performance for HPC workloads. The key advantage of our approach over current cluster sharing solutions is that it increases cluster utilization while optimizing a user-centric metric that captures both notions of performance and fairness, the maximum stretch. Another important feature of our approach is that we do not assume any knowledge of the job running times and, thus, work in a non-clairvoyant setting.
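For reference, the maximum stretch can be computed as in the short sketch below. It uses the usual definition of the stretch of a job, namely the time the job spent in the system divided by the time it would have taken on a dedicated platform; the tuple layout and the function name `max_stretch` are assumptions made for the illustration.

```python
from typing import List, Tuple


def max_stretch(jobs: List[Tuple[float, float, float]]) -> float:
    """Maximum stretch over a set of completed jobs.

    Each job is (release_time, completion_time, dedicated_time), where
    dedicated_time is its running time alone on the platform.
    """
    return max((done - release) / dedicated for release, done, dedicated in jobs)


if __name__ == "__main__":
    jobs = [
        (0.0, 10.0, 10.0),  # long job, never delayed: stretch 1
        (2.0, 6.0, 1.0),    # short job that waited: stretch 4
        (5.0, 8.0, 3.0),    # stretch 1
    ]
    print(max_stretch(jobs))
```

Because the delay is normalized by the job size, a short job that waits behind a long one is penalized heavily, which is how the metric captures fairness as well as performance.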

#### Steady-state scheduling of dynamic bag-of-tasks applications

This work focused on sets of independent tasks (“bag-of-tasks” applications) and a simple master-worker platform. In this context, a main processor initially owns all the tasks and distributes them to a pool of secondary processors, or workers, which process the tasks. The aim is then to maximize the average number of tasks processed by the platform per time unit. In this work, the tasks of a bag-of-tasks application do not all have the same computation and communication sizes; instead, these sizes follow the distribution of a random variable. This makes it possible to model the inevitable variations between the multiple tasks of an application. The distribution is not assumed to be known, but is discovered empirically by observing an initial subset of the tasks submitted to the system (say, the first 100 tasks). We presented a method to obtain an ε-approximation of an optimal schedule in the case of a continuous flow of instances, as well as several heuristics. The quality of the different solutions was assessed through simulations. The proposed methods were compared to standard algorithms such as a Round-Robin distribution or an On-Demand method. The simulations showed that a little knowledge about the applications is sufficient to significantly improve scheduling results, and that steady-state static methods have significantly better performance when communication and computation costs are of the same magnitude. For the cases where either the communications or the computations significantly dominate, the on-demand dynamic method is shown to be asymptotically optimal.
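The two baseline strategies can be sketched in a few lines. This is a simplified illustration, not the simulator used in the study: it ignores communication costs entirely and only models computation on workers of different speeds. The function names are hypothetical.

```python
import heapq
from typing import List


def round_robin_makespan(task_sizes: List[float], speeds: List[float]) -> float:
    """Task i goes to worker i mod p, regardless of the worker's speed or load."""
    finish = [0.0] * len(speeds)
    for i, size in enumerate(task_sizes):
        w = i % len(speeds)
        finish[w] += size / speeds[w]
    return max(finish)


def on_demand_makespan(task_sizes: List[float], speeds: List[float]) -> float:
    """Each task goes to the first worker that becomes idle (ties: lowest index)."""
    # Heap of (time at which the worker becomes idle, worker index).
    idle = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(idle)
    for size in task_sizes:
        free_at, w = heapq.heappop(idle)
        heapq.heappush(idle, (free_at + size / speeds[w], w))
    return max(t for t, _ in idle)


if __name__ == "__main__":
    tasks = [4.0] * 6
    speeds = [1.0, 2.0]  # the second worker is twice as fast
    print(round_robin_makespan(tasks, speeds))
    print(on_demand_makespan(tasks, speeds))
```

In this computation-only setting the on-demand policy adapts to the worker speeds without knowing them, whereas round-robin overloads the slow worker. When communication and computation costs are comparable, the master's link becomes a shared bottleneck that neither baseline accounts for, which is where static steady-state methods are reported to help.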

#### Parallelizing the construction of the ProDom database

ProDom is a protein domain family database automatically built from a comprehensive analysis of all known protein sequences. ProDom development is headed by Daniel Kahn (Inria project-team BAMBOO, formerly HELIX). With the protein sequence databases increasing in size at an exponential pace, the parallelization of MkDom2, the algorithm used to build ProDom, has become mandatory (the original sequential version of MkDom2 took 15 months to build the 2006 version of ProDom and would have required at least twice that time to build the 2007 version).

The parallelization of MkDom2 is not a trivial task. The sequential MkDom2 algorithm is an iterative process, and parallelizing it involves forecasting which of these iterations can be run in parallel and detecting and handling dependency breaks when they arise. We have made progress towards handling larger databases efficiently. Such databases are prone to exhibit far larger variations in the processing time of query-sequences than was previously imagined. The collaboration with BAMBOO on ProDom continues today on the computational aspects of constructing ProDom on distributed platforms, on the biological aspects of evaluating the quality of the domain families defined by MkDom2, and on the qualitative enhancement of ProDom. This past year was devoted to improving the new parallel MPI_MkDom2 algorithm and code, to make it usable in a production setting. Among other improvements, the code was ported to run on the BlueGene/P machine from IDRIS.

#### Steady-state scheduling on the CELL processor

In this work, we have considered the problem of scheduling streaming applications described by complex task graphs on a heterogeneous multicore processor, the STI Cell BE processor. To this end, we have proposed a theoretical model of the Cell processor. Then, we have used this model to express the problem of maximizing the throughput of a streaming application on this processor. Although the problem is proven NP-complete, we have presented an optimal solution based on mixed linear programming. This allows us to compute the optimal mapping for a number of applications, ranging from a real audio encoder to complex random task graphs. These mappings have been tested on two real platforms embedding Cell processors, and compared to simple heuristic solutions. We have shown that these mappings achieve a good speed-up, whereas the heuristic solutions generally fail to deal with the strong memory and communication constraints of the Cell processors. We are currently extending this work to cope with the complex architecture of the IBM BladeCenter QS22, which embeds two Cell processors.

#### Fair distributed scheduling of bag-of-tasks applications on desktop grids

Desktop grids have become very popular, with projects that include hundreds of thousands of computers. Desktop grid scheduling faces two challenges. First, the platform is volatile, since users may reclaim their computer at any time, which makes centralized schedulers inappropriate. Second, desktop grids are likely to be shared among several users, thus we must be particularly careful to ensure a fair sharing of the resources.

In this work, we have proposed a distributed scheduler for bag-of-tasks applications on desktop grids, which ensures a fair and efficient use of the resources. It aims to provide a similar share of the platform to every application by minimizing their maximum stretch, using completely decentralized algorithms and protocols. This approach has been validated through extensive simulation. We have shown that its performance is close to that of the best centralized algorithms for fair scheduling, with limited bandwidth consumption. This work was conducted in collaboration with Javier Celaya, from the University of Zaragoza (Spain).