## Section: New Results

### Scheduling Strategies and Algorithm Design for Heterogeneous Platforms

Participants: Anne Benoît, Marin Bougeret, Hinde Bouziane, Alexandru Dobrila, Fanny Dufossé, Matthieu Gallet, Mathias Jacquelin, Loris Marchal, Jean-Marc Nicod, Laurent Philippe, Paul Renaud-Goud, Clément Rezvoy, Yves Robert, Mark Stillwell, Bora Uçar, Frédéric Vivien.

#### Mapping simple workflow graphs

Mapping workflow applications onto parallel platforms is a challenging problem that becomes even more difficult when platforms are heterogeneous, which is now the norm. A high-level approach to parallel programming not only eases the application developer's task, but also provides additional information that can help realize an efficient mapping of the application. We focused on simple application graphs such as linear chains and fork patterns. Workflow applications are executed in a pipelined manner: a large set of data needs to be processed by all the different tasks of the application graph, thus inducing parallelism between the processing of different data sets. For such applications, several antagonistic criteria should be optimized, such as throughput, latency, failure probability, and energy consumption.

We have considered the mapping of workflow applications onto
different types of platforms: *fully homogeneous* platforms
with identical processors and interconnection links;
*communication homogeneous* platforms, with identical links but
processors of different speeds; and finally, *fully heterogeneous*
platforms.

This year, we have pursued the work involving the energy minimization criterion, and we studied the impact of sharing resources for concurrent streaming applications. For interval mappings, a processor is assigned a set of consecutive stages of the same application, so there is no resource sharing across applications. By contrast, the assignment is fully arbitrary for general mappings, hence a processor can be reused for several applications. On the theoretical side, we established complexity results for this tri-criteria mapping problem (energy, period, latency), classifying polynomial versus NP-complete instances. Furthermore, we derived an integer linear program that provides the optimal solution in the most general case. On the experimental side, we designed polynomial-time heuristics and assessed their absolute performance against the optimal solution given by the linear program. One main goal is to assess the impact of processor sharing on the quality of the solution.
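To make the three criteria concrete, the following minimal sketch evaluates an interval mapping of a single linear chain. It rests on simplifying assumptions that are not taken from the study itself: communication costs are ignored, each enrolled processor runs at one fixed speed, and the energy of a processor is modeled as `speed ** alpha` with a hypothetical exponent `alpha = 3`. The function name and its parameters are illustrative only.

```python
from typing import List, Tuple


def evaluate_interval_mapping(
    work: List[float],
    intervals: List[Tuple[int, int]],
    speeds: List[float],
    alpha: float = 3.0,
) -> Tuple[float, float, float]:
    """Return (period, latency, energy) of an interval mapping.

    work[i]      : computation cost of stage i of the chain
    intervals[p] : (first, last) stage indices, inclusive, run by processor p
    speeds[p]    : speed of processor p
    alpha        : assumed exponent of the energy model (hypothetical)
    """
    if len(intervals) != len(speeds):
        raise ValueError("one speed is required per interval")

    expected_first = 0
    times = []
    for (first, last), speed in zip(intervals, speeds):
        # Intervals must be consecutive and cover the chain without gaps.
        if first != expected_first or last < first:
            raise ValueError("intervals must partition the chain in order")
        times.append(sum(work[first:last + 1]) / speed)
        expected_first = last + 1
    if expected_first != len(work):
        raise ValueError("intervals must cover every stage")

    period = max(times)    # the slowest processor bounds the throughput
    latency = sum(times)   # a data set traverses every interval in turn
    energy = sum(s ** alpha for s in speeds)
    return period, latency, energy


# Four stages split into two intervals on processors of speed 2 and 3.
print(evaluate_interval_mapping([4, 2, 6, 3], [(0, 1), (2, 3)], [2.0, 3.0]))
```

Under this model, raising a processor speed lowers period and latency but increases energy, which is the kind of trade-off the tri-criteria problem captures.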

#### Throughput of probabilistic and replicated streaming applications

We have pursued the investigation of timed Petri nets to model the mapping of workflows with stage replication, which we started in 2009. In particular, we have provided bounds for the throughput when stage parameters are arbitrary I.I.D. (Independent and Identically Distributed) and N.B.U.E. (New Better than Used in Expectation) variables: the throughput is bounded from below by the exponential case and from above by the deterministic case. This work was conducted in collaboration with Bruno Gaujal (LIG Grenoble).

#### Multi-criteria algorithms and heuristics

We have investigated several multi-criteria algorithms and heuristics for the problem of mapping pipelined applications, each consisting of a linear chain of stages, onto heterogeneous platforms. The objective was to optimize the reliability under a performance constraint, i.e., while guaranteeing a threshold throughput. In order to increase reliability, we replicate the execution of stages on multiple processors. On the theoretical side, we proved that this bi-criteria optimization problem is NP-hard. We proposed heuristics for both interval and general mappings, and presented extensive experiments evaluating their performance.
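The interplay between replication, reliability, and throughput can be illustrated with a minimal sketch. It assumes a simplified model that is ours, not necessarily the one used in the study: processor failures are independent, an interval fails only if all of its replicas fail, communications are ignored, and the slowest replica of an interval dictates its processing time. All names and numerical values are hypothetical.

```python
from math import prod
from typing import List, Tuple


def replicated_mapping_metrics(
    interval_work: List[float],
    replicas: List[List[Tuple[float, float]]],
) -> Tuple[float, float]:
    """Return (reliability, period) of a replicated interval mapping.

    interval_work[i] : total computation cost of interval i
    replicas[i]      : (speed, failure_probability) of each processor
                       that executes a copy of interval i
    """
    if len(interval_work) != len(replicas):
        raise ValueError("one replica set is required per interval")

    reliability = 1.0
    period = 0.0
    for work, procs in zip(interval_work, replicas):
        if not procs:
            raise ValueError("each interval needs at least one processor")
        # The interval is lost only if every replica fails (independence assumed).
        all_fail = prod(fail for _, fail in procs)
        reliability *= 1.0 - all_fail
        # Assumption: the slowest replica bounds the interval's processing time.
        slowest = min(speed for speed, _ in procs)
        period = max(period, work / slowest)
    return reliability, period


reliability, period = replicated_mapping_metrics(
    [6.0, 9.0],
    [[(2.0, 0.1), (3.0, 0.2)], [(3.0, 0.5)]],
)
# A mapping is feasible when its period does not exceed the threshold period.
print(reliability, period, period <= 3.0)
```

Adding a replica to an interval can only increase its reliability, but in this model a slow replica may lengthen the period, which is why the throughput constraint limits how freely processors can be replicated.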

The first paper published on this work, “A. Benoit, H. L. Bouziane, Y. Robert. Optimizing the reliability of pipelined applications under throughput constraints. In ISPDC'2010, Istanbul, Turkey, July 2010” received the best paper award.

#### The impact of cache misses on the performance of matrix product algorithms on multicore platforms

The multicore revolution is underway, bringing new chips with more complex memory architectures. Classical algorithms must be revisited in order to take the hierarchical memory layout into account. The goal of this study is to design cache-aware algorithms that minimize the number of cache misses incurred during the execution of the matrix product kernel on a multicore processor. We have analytically studied how to achieve the best possible tradeoff between shared and distributed caches. We have also implemented and evaluated several algorithms on two multicore platforms, one equipped with a quad-core Xeon, and the second one enriched with a GPU. It turns out that the impact of cache misses differs greatly between the two platforms, and we have identified the main design parameters that lead to peak performance for each target hardware configuration.
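As a point of reference, the sketch below shows standard loop tiling (blocking) for the matrix product, the classical way to improve cache reuse. It is not one of the algorithms designed in this study, and the `tile` parameter is a hypothetical tuning knob that would in practice be chosen from the cache sizes. Pure Python is used for readability only; it does not exhibit the performance effect.

```python
from typing import List

Matrix = List[List[float]]


def tiled_matmul(a: Matrix, b: Matrix, tile: int) -> Matrix:
    """Compute c = a * b by iterating over tile x tile blocks.

    Each block of a, b, and c is reused while it is (ideally) still in
    cache, which reduces the number of cache misses compared to the
    naive triple loop on large matrices.
    """
    if tile < 1:
        raise ValueError("tile must be a positive integer")
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")

    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                # Multiply block (i0, k0) of a by block (k0, j0) of b.
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, min(k0 + tile, m)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(j0, min(j0 + tile, p)):
                            row_c[j] += a_ik * row_b[j]
    return c


print(tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1))
```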

#### Tree traversals with minimum memory usage

In this study, we focus on the complexity of traversing tree-shaped workflows whose tasks require large I/O files. Such workflows typically arise in the multifrontal method of sparse matrix factorization. We target a classical two-level memory system, where the main memory is faster but smaller than the secondary memory. A task in the workflow can be processed if all its predecessors have been processed, and if its input and output files fit in the currently available main memory. The amount of available memory at a given time depends on the order in which the tasks are executed. We focus on the problem of finding the minimum amount of main memory, over all postorder schemes, or over all possible traversals, that is needed for an in-core execution. We have established several complexity results that answer these questions. We have proposed a new polynomial-time exact algorithm that runs faster than a reference algorithm. We have also addressed the setting where the required memory renders a pure in-core solution infeasible. In this setting, we ask the following question: what is the minimum amount of I/O that must be performed between the main memory and the secondary memory? We have shown that this latter problem is NP-hard, and proposed efficient heuristics. All algorithms and heuristics were thoroughly evaluated on assembly trees arising in the context of sparse matrix factorizations.
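The postorder variant of the problem can be sketched as follows. The memory model is a simplification we assume for illustration: processing a node requires its children's output files, an execution workspace, and its own output file to be resident at the same time. The child-ordering rule used here (visit subtrees by decreasing "peak minus output size") is the classical rule from the multifrontal literature for finding a best postorder; it is not the new exact algorithm over all traversals mentioned above. Names and sizes are hypothetical.

```python
from typing import Dict, List, Tuple


def best_postorder(
    children: Dict[str, List[str]],
    out_size: Dict[str, int],
    exec_size: Dict[str, int],
    root: str,
) -> Tuple[int, List[str]]:
    """Return (peak memory, node order) of a memory-minimizing postorder."""

    def visit(node: str) -> Tuple[int, List[str]]:
        subtrees = [(child,) + visit(child) for child in children.get(node, [])]
        # Classical rule: largest (subtree peak - output size) first.
        subtrees.sort(key=lambda t: t[1] - out_size[t[0]], reverse=True)

        resident = 0   # output files of already-processed children
        peak = 0
        order: List[str] = []
        for child, child_peak, child_order in subtrees:
            peak = max(peak, resident + child_peak)
            resident += out_size[child]
            order.extend(child_order)
        # Processing the node itself: inputs + workspace + output.
        peak = max(peak, resident + exec_size[node] + out_size[node])
        order.append(node)
        return peak, order

    return visit(root)


children = {"r": ["b", "a"]}
out_size = {"r": 1, "a": 1, "b": 4}
exec_size = {"r": 0, "a": 9, "b": 1}
print(best_postorder(children, out_size, exec_size, "r"))
```

In this small instance, visiting `b` before `a` would keep the 4-unit output of `b` resident while `a` needs 10 units, for a peak of 14, whereas the sorted order reaches a peak of 10.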

#### Comparing archival policies for BlueWaters

In this work, we focus on the archive system which will be used in the BlueWaters supercomputer. We have introduced two archival policies tailored for the large tape storage system that will be available on BlueWaters. We have also shown how to adapt the well-known RAIT strategy (the counterpart of the RAID policy for tapes). We have provided an analytical model of the tape storage platform of BlueWaters, and we used it to assess and analyze the performance of the three policies through simulations. Storage requests were generated using random workloads whose characteristics model various realistic scenarios. The throughput of the system, as well as the average (weighted) response time for each user, are the main objectives.

#### Resource allocation using virtual clusters

We proposed a novel job scheduling approach for sharing a homogeneous cluster computing platform among competing jobs. Its key feature is the use of virtual machine technology for sharing resources in a precise and controlled manner. We followed up on our work on this subject by addressing the problem of resource utilization. We proposed a new measure for this utilization and we demonstrated how, following our approach, one can improve over batch scheduling by orders of magnitude in terms of job stretch, while leading to comparable or better resource utilization.
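For readers unfamiliar with the metric, the sketch below computes the stretch of a job under its usual definition: the time the job spends in the system divided by the time it would take on a dedicated platform. The job data are hypothetical, and taking the maximum over jobs is one common way to aggregate; the text above does not say which aggregate was used.

```python
from typing import Iterable, Tuple


def job_stretch(release: float, completion: float, dedicated_runtime: float) -> float:
    """Stretch = (completion - release) / runtime on a dedicated platform."""
    if dedicated_runtime <= 0:
        raise ValueError("dedicated runtime must be positive")
    if completion < release:
        raise ValueError("a job cannot complete before it is released")
    return (completion - release) / dedicated_runtime


def max_stretch(jobs: Iterable[Tuple[float, float, float]]) -> float:
    """Largest stretch over (release, completion, dedicated_runtime) triples."""
    return max(job_stretch(r, c, d) for r, c, d in jobs)


# Hypothetical jobs: (release time, completion time, dedicated runtime).
jobs = [(0.0, 10.0, 5.0), (2.0, 6.0, 4.0)]
print(max_stretch(jobs))
```

A stretch of 1 means the job ran as if it had the platform to itself; short jobs delayed behind long ones in a batch queue are what drive the stretch up.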

#### Checkpointing policies for post-petascale supercomputers

An alternative to classical fault-tolerant approaches for large-scale
clusters is failure avoidance, by which the occurrence of a fault is
predicted and a preventive measure is taken. We developed
analytical performance models for two such measures:
preventive checkpointing and preventive migration. We also developed an
analytical model of the performance of a standard periodic checkpoint
fault-tolerant approach. We instantiated these models for platform
scenarios that are representative of the current and future technology trends. We
found that preventive migration is the better approach in the short term, but
that both approaches have comparable merit in the longer term.
We also found that standard non-prediction-based fault tolerance
achieves poor scaling when compared to prediction-based failure
avoidance, thereby demonstrating the importance of failure prediction
capabilities. Our results also showed that achieving good utilization of
truly large-scale machines (e.g., 2^{20} nodes) for parallel workloads
will require more than the failure avoidance techniques evaluated in
this work.

In this previous work, we assumed that checkpoints occur periodically. Indeed, it is usually claimed that such a policy is optimal. However, most of the existing proofs rely on approximations. One such approximation is that the probability that a fault occurs during the execution of an application is very small, an assumption that is no longer valid in the context of exascale platforms. We have begun studying this problem in a fully general context. We have established that, when failures follow a Poisson process, the periodic checkpointing policy is optimal. We have also shown an unexpected result: in some cases, when the platform is sufficiently large, the checkpointing costs are sufficiently high, or the failures are frequent enough, one should limit the application parallelism and duplicate tasks, rather than fully parallelize the application on the whole platform. In other words, the expectation of the job duration is smaller with fewer processors! To establish this result, we derived and analyzed several scheduling heuristics.
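The effect of the checkpointing period on the expected job duration can be illustrated with a minimal sketch. It assumes exponentially distributed failure inter-arrival times with rate `failure_rate`, zero downtime, and zero recovery cost, so that the expected time to complete `w` units of work followed by a checkpoint of cost `c` is `(exp(rate * (w + c)) - 1) / rate`. The job is split into equal chunks, which corresponds to a periodic policy. These simplifications and all parameter values are ours and are not taken from the study.

```python
import math


def expected_makespan(
    total_work: float, checkpoint_cost: float, failure_rate: float, num_chunks: int
) -> float:
    """Expected duration when the work is split into equal checkpointed chunks.

    Assumes exponential failures, no downtime, and no recovery cost: after a
    failure, the current chunk restarts immediately from the last checkpoint.
    """
    chunk = total_work / num_chunks
    per_chunk = math.expm1(failure_rate * (chunk + checkpoint_cost)) / failure_rate
    return num_chunks * per_chunk


def best_chunk_count(
    total_work: float, checkpoint_cost: float, failure_rate: float, max_chunks: int = 1000
) -> int:
    """Scan chunk counts and return the one with the smallest expected duration."""
    return min(
        range(1, max_chunks + 1),
        key=lambda k: expected_makespan(total_work, checkpoint_cost, failure_rate, k),
    )


# Hypothetical job: 10,000 time units of work, checkpoints cost 10 units,
# and the mean time between failures is 1,000 units.
k = best_chunk_count(10_000.0, 10.0, 0.001)
print(k, expected_makespan(10_000.0, 10.0, 0.001, k))
```

With too few checkpoints, the work lost at each failure dominates; with too many, the checkpoint overhead does. The scan finds the balance point for this simplified model.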

#### Scheduling parallel iterative applications on volatile resources

In this work, we study the efficient execution of iterative applications on volatile resources. We studied a master-worker scheduling scheme that trades off between the speed and the (expected) reliability and availability of enrolled workers. A key feature of this approach is that it uses a realistic communication model that bounds the capacity of the master to serve the workers, which requires the design of sophisticated resource selection strategies. The contribution of this work is twofold. On the theoretical side, we assessed the complexity of the problem in its off-line version, i.e., when processor availability behaviors are known in advance. Even with this knowledge, the problem is NP-hard. On the pragmatic side, we proposed several on-line heuristics that were evaluated in simulation using a Markovian model of processor availabilities.
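A Markovian availability model can be sketched with a small discrete-time chain. The two states (`UP`, `DOWN`) and the transition probabilities below are hypothetical and chosen only for illustration; the model used in the study may have more states. The sketch computes the long-run fraction of time a worker is available, a quantity that a resource selection heuristic could weigh against the worker's speed.

```python
from typing import List


def stationary_distribution(
    transition: List[List[float]], iterations: int = 200
) -> List[float]:
    """Approximate the stationary distribution by repeated multiplication.

    transition[i][j] is the probability of moving from state i to state j
    in one time step; every row must sum to 1.
    """
    n = len(transition)
    for row in transition:
        if len(row) != n or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("transition must be a square stochastic matrix")

    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [
            sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)
        ]
    return dist


# Hypothetical worker: state 0 is UP, state 1 is DOWN.
worker = [
    [0.9, 0.1],  # UP   -> UP, DOWN
    [0.5, 0.5],  # DOWN -> UP, DOWN
]
availability = stationary_distribution(worker)[0]
print(availability)
```

For this chain, the exact long-run availability is 0.5 / (0.1 + 0.5) = 5/6, which the iteration approaches quickly.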

#### Parallelizing the construction of the ProDom database

ProDom is a protein domain family database automatically built from a comprehensive analysis of all known protein sequences. ProDom development is headed by Daniel Kahn (Inria project-team BAMBOO, formerly HELIX). With the protein sequence databases increasing in size at an exponential pace, the parallelization of MkDom2, the algorithm used to build ProDom, has become mandatory (the original sequential version of MkDom2 took 15 months to build the 2006 version of ProDom).

The parallelization of MkDom2 is not a trivial task. The sequential MkDom2 algorithm is an iterative process, and parallelizing it involves forecasting which of these iterations can be run in parallel, and detecting and handling dependency breaks when they arise. We have moved forward to be able to efficiently handle larger databases. Such databases are prone to exhibit far larger variations in the processing time of query sequences than was previously imagined. The collaboration with BAMBOO on ProDom continues today on the computational aspects of constructing ProDom on distributed platforms, on the biological aspects of evaluating the quality of the domain families defined by MkDom2, and on the qualitative enhancement of ProDom.

This past year was devoted to the full-scale validation of the new parallel MPI_MkDom2 algorithm and code. We proposed a new methodology to compare two clusterings of sub-sequences in domains. We used this methodology to verify that the parallelization using MPI_MkDom2 does not significantly impact the quality of the clustering produced, when compared to the one produced by MkDom2. We successfully processed all the sequences included in the April 2010 version of the UniProt database, namely 6 118 869 sequences and 2 194 382 846 amino acids. The whole computation would have taken 12 years and 97 days sequentially, and was completed in parallel in a wall-clock time of 19 days and 12 hours. After a post-processing phase, this will lead to a new release of ProDom in the upcoming months, after a four-year hiatus.
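The speedup implied by these two durations can be worked out directly. The short computation below assumes 365-day years, which is our convention for the conversion; the helper name is illustrative. The number of processors is not given here, so no parallel efficiency is derived.

```python
def to_days(years: float = 0, days: float = 0, hours: float = 0,
            days_per_year: int = 365) -> float:
    """Convert a duration to days (assumes 365-day years by default)."""
    return years * days_per_year + days + hours / 24


sequential_days = to_days(years=12, days=97)   # estimated sequential time
parallel_days = to_days(days=19, hours=12)     # measured wall-clock time
speedup = sequential_days / parallel_days

print(f"sequential: {sequential_days:.1f} days")
print(f"parallel:   {parallel_days:.1f} days")
print(f"speedup:    {speedup:.1f}")
```

Under this convention, the sequential estimate is 4477 days against 19.5 days in parallel, a speedup of roughly 230.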