## Section: New Results

Keywords : Algorithm design, heterogeneous platforms, scheduling strategies, steady-state scheduling, online scheduling, load balancing, divisible loads, bioinformatics.

### Scheduling Strategies and Algorithm Design for Heterogeneous Platforms

Participants : Anne Benoît, Leila Ben Saad, Sékou Diakité, Alexandru Dobrila, Fanny Dufossé, Matthieu Gallet, Mourad Hakem, Mathias Jacquelin, Loris Marchal, Jean-Marc Nicod, Laurent Philippe, Jean-François Pineau, Veronika Rehn-Sonigo, Clément Rezvoy, Yves Robert, Bernard Tourancheau, Frédéric Vivien.

#### Mapping simple workflow graphs

Mapping workflow applications onto parallel platforms is a challenging problem that becomes even more difficult when platforms are heterogeneous, which is nowadays a standard assumption. A high-level approach to parallel programming not only eases the application developer's task, but also provides additional information that can help realize an efficient mapping of the application. We focused on simple application graphs such as linear chains and fork patterns. Workflow applications are executed in a pipelined manner: a large set of data needs to be processed by all the different tasks of the application graph, thus inducing parallelism between the processing of different data sets. For such applications, several antagonistic criteria should be optimized, such as throughput, latency, and failure probability.

We have considered the mapping of workflow applications onto different types of platforms: *Fully Homogeneous* platforms, with identical processors and interconnection links; *Communication Homogeneous* platforms, with identical links but processors of different speeds; and finally, *Fully Heterogeneous* platforms.

For linear chain graphs, we have extensively studied the complexity of the mapping problem, for throughput and latency optimization criteria. Different mapping policies have been considered: the mapping can be required to be one-to-one (a processor is assigned at most one stage), or interval-based (a processor is assigned an interval of consecutive stages), or fully general. The most important new result this year is the NP-completeness of the latency minimization problem for interval-based mappings on Fully Heterogeneous platforms, which was left open in our previous study. Furthermore, we proved that this problem, together with the similar one-to-one mapping problem, cannot be approximated within any constant factor (unless P=NP).
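To make the notion of an interval-based mapping concrete, the following minimal sketch computes the best achievable period for a linear chain in the simplest setting only. It assumes identical unit-speed processors and free communications, which is far more restrictive than the platforms studied here; the function name `min_period_interval_mapping` and its parameters are illustrative and do not come from our software.

```python
from functools import lru_cache
from itertools import accumulate


def min_period_interval_mapping(work, p):
    """Smallest period when a chain is cut into at most p consecutive intervals.

    Assumption: identical unit-speed processors and free communications,
    so the period is simply the largest total work of any interval.
    """
    n = len(work)
    prefix = [0, *accumulate(work)]  # prefix[i] = total work of stages 0..i-1

    @lru_cache(maxsize=None)
    def best(i, k):
        # Best period for stages i..n-1 using at most k processors.
        if i == n:
            return 0
        if k == 0:
            return float("inf")
        return min(
            max(prefix[j] - prefix[i], best(j, k - 1))
            for j in range(i + 1, n + 1)
        )

    return best(0, p)
```

For the chain `[4, 2, 7, 1, 3]` and two processors, the best cut is after the second stage, giving intervals of total work 6 and 11. The throughput is the inverse of the returned period.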

Once again, this year we have focused mainly on pipeline graphs, and considered platforms in which processors are subject to failure during the execution of the application. We derived new theoretical results for bi-criteria mappings aiming at optimizing both the latency (*i.e.*, the response time) and the reliability (*i.e.*, the probability that the computation will be successful) of the application. Latency is minimized by using faster processors, while reliability is increased by replicating computations on a set of processors. However, replication increases latency (additional communications, slower processors). The application fails to be executed only if all the processors fail during execution. While simple polynomial algorithms can be found for fully homogeneous platforms, the problem becomes NP-hard when tackling heterogeneous platforms.
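The effect of replication on reliability can be illustrated with a short sketch. It assumes independent processor failures and an interval mapping in which each interval is replicated on its own group of processors, so that the application succeeds when every interval keeps at least one working replica. The function name and the input layout are hypothetical.

```python
from math import prod


def mapping_reliability(replica_failure_probs):
    """Probability that every interval keeps at least one working replica.

    replica_failure_probs[i] lists the failure probabilities of the processors
    that replicate interval i. Assumption: failures are independent, so an
    interval is lost only if all of its replicas fail.
    """
    return prod(1.0 - prod(group) for group in replica_failure_probs)
```

With a single interval replicated on two processors that each fail with probability 0.1, the reliability is 1 - 0.1 × 0.1 = 0.99, against 0.9 without replication.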

On the experimental side, we have designed and implemented several polynomial heuristics for different instances of our problems. Experiments have been conducted for pipeline application graphs, on Communication-Homogeneous platforms, since clusters made of different-speed processors interconnected by either plain Ethernet or a high-speed switch constitute the typical experimental platforms in most academic or industry research departments. We can express the problem of maximizing the throughput as the solution of an integer linear program, and thus we have been able to compare the heuristics with an optimal solution for small instances of the problem. For bi-criteria optimization problems, we have compared different heuristics through extensive simulations. Typical applications include digital image processing, where images are processed in steady-state mode. This year, we have thoroughly studied the mapping of a particular image processing application, the JPEG encoding. Mapping pipelined JPEG encoding onto parallel platforms is useful for instance for encoding Motion JPEG images. The performance of our bi-criteria heuristics has been validated on this application.

#### Mapping linear workflows with computation/communication overlap

In this joint work with Kunal Agrawal (MIT), we extend our work on linear pipelined workflows to a more realistic architectural model with bounded communication capabilities and full computation/communication overlap. This model is representative of current multi-threaded systems. We present several complexity results related to period and/or latency minimization.

To be precise, we prove that minimizing the period (that is, maximizing the throughput) is NP-complete even for homogeneous platforms, and that minimizing the latency is NP-complete for heterogeneous platforms. Moreover, we present an approximation algorithm for throughput maximization for linear chain applications on homogeneous platforms, and an approximation algorithm for latency minimization for linear chain applications on all platforms where communication is homogeneous (the processor speeds can differ). In addition, we present algorithms for several important special cases for linear chain applications. Finally, we consider the implications of adding feedback to linear chain applications.

#### Energy-aware scheduling

We consider the problem of scheduling an application composed of independent tasks on a fully heterogeneous master-worker platform with communication costs. We introduce a bi-criteria approach aiming at maximizing the throughput of the application while minimizing the energy consumed by participating resources. Assuming arbitrary super-linear power consumption laws, we investigate different models for energy consumption, with and without start-up overheads. Building upon closed-form expressions for the uniprocessor case, we are able to derive optimal or asymptotically optimal solutions for both models.

#### Mapping filtering services

We explore the problem of mapping filtering services on large-scale heterogeneous platforms. Such applications can be viewed as regular workflow applications with arbitrary precedence graphs, with the additional property that each service (node) filters (shrinks or expands) its input data by a constant factor (its selectivity) to produce its output data. As always, period and/or latency minimization are the key objectives. For homogeneous platforms, the complexity of period minimization was already known; we derive an algorithm to solve the latency minimization problem in the general case with service precedence constraints; for independent services we also show that the bi-criteria problem (latency minimization without exceeding a prescribed value for the period) is of polynomial complexity. However, when adding heterogeneity to the platform, we prove that minimizing the period or the latency becomes NP-hard, and that these problems cannot be approximated by any constant factor (unless P=NP). The latter results hold true even for independent services. We provide an integer linear program to solve both problems in the heterogeneous case with independent services.
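The role of selectivities can be seen on a toy case: independent services chained on a single unit-speed server, with communications ignored. This is a deliberately simplified setting, not the general model above, and the helper names `chain_cost` and `best_chain` are illustrative. Each service processes the data as filtered by its predecessors, so the order of the chain changes the total cost.

```python
from itertools import permutations


def chain_cost(order, cost, selectivity):
    """Total computation for one data set processed by services in `order`."""
    total, size = 0.0, 1.0
    for s in order:
        total += size * cost[s]   # service s sees the data filtered so far
        size *= selectivity[s]    # and then shrinks or expands it
    return total


def best_chain(cost, selectivity):
    """Exhaustive search over all orders; only sensible for a few services."""
    return min(
        permutations(range(len(cost))),
        key=lambda order: chain_cost(order, cost, selectivity),
    )
```

For costs `[4, 1, 2]` and selectivities `[0.5, 0.9, 2.0]`, the search places the expanding service (selectivity 2.0) last, since running it earlier would inflate the input of every service that follows.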

#### Resource allocation strategies for in-network stream processing

This year we studied the operator mapping problem for in-network stream processing applications. In-network stream processing consists of applying a tree of operators in steady-state to multiple data objects that are continually updated at various locations on a network. Examples of in-network stream processing include the processing of data in a sensor network, or of continuous queries on distributed relational databases. We studied the operator mapping problem in a “constructive” scenario, i.e., a scenario in which one builds a platform dedicated to the application by purchasing processing servers with various costs and capabilities. The objective is to minimize the cost of the platform while ensuring that the application achieves a minimum steady-state throughput. We have formalized a set of relevant operator-placement problems as linear programs, and proved that even simple versions of the problem are NP-complete. We have also designed several polynomial-time heuristics, which are evaluated via extensive simulations and compared to theoretical bounds for optimal solutions.

#### Scheduling small to medium batches of identical jobs

Steady-state scheduling is optimal for an infinite number of jobs. It defines a schedule for a subset of jobs that are executed within a period. The global schedule is obtained by infinitely repeating this period. With a finite number of jobs, this scheduling technique can still be used, provided that the number of jobs is large. Three phases are distinguished in the schedule: an initialization phase, which computes the tasks needed to enter the steady state; an optimal phase, composed of several full periods; and a termination phase, which finishes the tasks remaining after the last period. With a finite number of jobs we must consider a different objective function, the makespan, instead of the throughput used in the steady-state case. We know that the steady-state phase of the schedule is optimal, thus we are interested in optimizing the initial and final phases.

We have worked on the improvement of the steady-state technique for a finite number of jobs. The main idea is to improve the scheduling of the sub-optimal phases: initialization and termination. By optimizing these two phases we reduce their weight in the global schedule and thus improve its performance. In the original algorithm the period is computed by a linear program. As a result, the period's length can be quite large, resulting in a lot of temporary job instances. Each of these temporary job instances must be prepared in the initialization phase, and finished in the termination phase. To shorten the initialization and termination phases, we propose to limit the period length. Another possible optimization is to better organize a given period to reduce the number of temporary instances, by transforming inter-period dependences into intra-period dependences when possible. Both proposals have been studied and implemented in the SimGrid toolkit, and we are conducting experiments to evaluate their efficiency.

#### Steady-state scheduling of task graph collections on heterogeneous resources

In this work, we focused on scheduling jobs on computing Grids. In our model, a Grid job is made of a large collection of input data sets, which must all be processed by the same task graph or *workflow*, thus resulting in a *collection of task graphs* problem. We are looking for a competitive scheduling algorithm that does not require complex control. We thus only consider single-allocation strategies. We present an algorithm based on mixed linear programming to find an optimal allocation, for different routing policies depending on how much latitude we have on routing communications. Then, using simulations, we compare our allocations to optimal multi-allocation schedules. Our results show that the single-allocation mixed linear programming approach almost always finds an allocation with a reasonably good throughput, especially under communication-intensive scenarios.

In addition to the mixed linear programming approach, we present different heuristic schemes. Then, using simulations, we compare the performance of our different heuristics to the performance of a classical scheduling policy in Grids, HEFT. The results show that some of our static-scheduling policies take advantage of their platform and application knowledge and outperform HEFT, especially under communication-intensive scenarios. In particular, one of our heuristics, DELEGATE, almost always achieves the best performance while having lower running times than HEFT.

#### Fault-tolerant scheduling of precedence task graphs

Heterogeneous distributed systems are widely deployed for executing computationally intensive parallel applications with diverse computing needs. Such environments require effective scheduling strategies that take into account both algorithmic and architectural characteristics. Unfortunately, most of the scheduling algorithms developed for such systems rely on a simple platform model where communication contention is not taken into account. In addition, it is generally assumed that processors are completely safe. To schedule precedence graphs in a more realistic framework, we introduce first an efficient fault tolerant scheduling algorithm that is both contention-aware and capable of supporting an arbitrary number of fail-silent (fail-stop) processor failures. Next, we derive a more complex heuristic that departs from the main principle of the first algorithm. Instead of considering a single task (one with highest priority) and assigning all its replicas to the currently best available resources, we consider a chunk of ready tasks, and assign all their replicas in the same decision making procedure. This leads to a better load balance of processors and communication links. We focus on a bi-criteria approach, where we aim at minimizing the total execution time, or latency, given a fixed number of failures supported in the system. Our algorithms have a low time complexity, and drastically reduce the number of additional communications induced by the replication mechanism. Experimental results fully demonstrate the usefulness of the proposed algorithms, which lead to efficient execution schemes while guaranteeing a prescribed level of fault tolerance.

#### Static strategies for worksharing with unrecoverable interruptions

In this work, one has a large workload that is “divisible” and one has access to p remote computers that can assist in computing the workload. The problem is that the remote computers are subject to interruptions of known likelihood that kill all work in progress. One wishes to orchestrate sharing the workload with the remote computers in a way that maximizes the expected amount of work completed. Strategies for achieving this goal, by balancing the desire to checkpoint often, in order to decrease the amount of vulnerable work at any point, vs. the desire to avoid the context-switching required to checkpoint, are studied. Strategies are devised that provably maximize the expected amount of work when there is only one remote computer (the case p = 1). Results suggest the intractability of such maximization for higher values of p, which motivates the development of heuristic approaches. Heuristics are developed that replicate work on several remote computers, in the hope of thereby decreasing the impact of work-killing interruptions. The quality of these heuristics is assessed through exhaustive simulations.
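The checkpointing trade-off for a single remote computer can be sketched as follows. The risk model used here is an assumption made for illustration only: the computer is taken to have been interrupted by time t with probability min(1, κt), and each chunk pays a fixed overhead. The names `expected_work`, `kappa` and `overhead` are hypothetical.

```python
def expected_work(chunks, kappa, overhead=0.0):
    """Expected work saved when chunks run back to back on one computer.

    Assumed risk model: the computer has been interrupted by time t with
    probability min(1, kappa * t). A chunk counts only if it completes
    (including its per-chunk overhead) before the interruption.
    """
    elapsed, total = 0.0, 0.0
    for w in chunks:
        elapsed += w + overhead
        total += w * max(0.0, 1.0 - kappa * elapsed)
    return total
```

Under this model with `kappa = 0.1`, sending a workload of 10 units as one chunk saves nothing in expectation, whereas splitting it into smaller chunks saves part of the work. A nonzero `overhead` penalizes splitting too finely.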

#### Scheduling identical jobs with unreliable tasks

Depending on the context, the fault tolerance model may differ. We have studied the case where the fault probability depends on the task rather than on the execution resource. The practical use case is a micro-factory where operations are performed on microscopic components. Due to the size of the components, some operations are not as well controlled as others, and thus the complexity of a task affects its reliability. In this context, we consider the scheduling of a set of identical jobs composed of either linear chains or trees of tasks. Several objectives are studied depending on the available resources, in particular maximizing the throughput (number of components output per time unit), and minimizing the makespan (total time needed to output the required number of components). The resources in use are heterogeneous and general purpose but must be configured to execute a determined task type. For this reason, finding a good schedule turns into an assignment problem. The simplest instances of this problem can be solved in polynomial time whereas the other cases are NP-complete; for those cases, we designed polynomial heuristics to solve the problem.
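To see how task-dependent failures inflate the load of a linear chain, the sketch below counts how many times each task must run, on average, per component that leaves the chain. It assumes independent losses and that a lost component is discarded, so a new one must restart from the first task. The function name is illustrative.

```python
def expected_executions(failure_probs):
    """Expected runs of each task per finished component, for a linear chain.

    failure_probs[i] is the probability that task i destroys the component.
    Assumption: losses are independent and a lost component is discarded,
    so a replacement restarts from the first task.
    """
    runs, needed = [], 1.0
    for f in reversed(failure_probs):
        needed /= (1.0 - f)  # inputs this task must receive per output
        runs.append(needed)
    return runs[::-1]
```

For failure probabilities `[0.5, 0.0, 0.2]`, the first task runs 2.5 times per finished component and the last one 1.25 times. Early tasks therefore carry more load than late ones, which matters when assigning task types to resources.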

#### Resource allocation using virtual clusters

We propose a novel approach for sharing cluster resources among competing jobs. The key advantage of our approach over current cluster sharing solutions is that it increases cluster utilization while optimizing a user-centric metric that captures both notions of performance and fairness. We motivate and formalize the corresponding resource allocation problem, determine its complexity, and propose several algorithms to solve it in the case of a static workload that consists of sequential jobs. Via extensive simulation experiments we identify an algorithm that runs quickly, that is always on par with, or better than, its competitors, and that produces resource allocations that are close to optimal. We find that the extension of our approach to workloads that comprise parallel jobs leads to similarly good results. Finally, we explain how to extend our work to handle dynamic workloads.

#### Parallelizing the construction of the ProDom database

ProDom is a protein domain family database automatically built from a comprehensive analysis of all known protein sequences. ProDom development is headed by Daniel Kahn (Inria project-team HELIX). With the protein sequence databases increasing in size at an exponential pace, the parallelization of MkDom2, the algorithm used to build ProDom, has become mandatory (the original sequential version of MkDom2 took 15 months to build the 2006 version of ProDom and would have required at least twice that time to build the 2007 version).

Parallelizing MkDom2 is not trivial. The sequential MkDom2 algorithm is an iterative process, and parallelizing it involves forecasting which iterations can be run in parallel, and detecting and handling dependency breaks when they arise. We have made progress towards efficiently handling larger databases. Such databases are prone to exhibit far larger variations in the processing time of query sequences than was previously imagined. The collaboration with HELIX on ProDom continues today, on the computational aspects of constructing ProDom on distributed platforms, on the biological aspects of evaluating the quality of the domain families defined by MkDom2, and on the qualitative enhancement of ProDom.

#### MPI in a sensor network

We study the potential benefit of using the MPI communication library in the distributed system formed by the networked microcontrollers of a sensor network. We follow the IETF standardization groups dealing with IP for sensor networks, and we are currently developing an IP stack for sensor networks based on the 6LoWPAN specifications. The originality of our design is its modularity, which allows us to experiment with several routing modules. This is especially necessary for testing and validating our multi-sink, multi-position theoretical approaches, in which route choices are scheduled depending on the sinks' locations in order to increase the lifespan of the overall sensor network. Our target assumptions and testbeds are real routing in buildings and urban environments, where sinks can only be placed at the few locations that are powered and networked. Moreover, the sink relocation frequency is very low because of the cost of manual operations.