Section: New Results
Keywords: Algorithm design, heterogeneous platforms, scheduling strategies, steady-state scheduling, online scheduling, load balancing, divisible loads, bioinformatics.
Scheduling Strategies and Algorithm Design for Heterogeneous Platforms
Participants: Anne Benoît, Lionel Eyraud-Dubois, Matthieu Gallet, Loris Marchal, Jean-Marc Nicod, Laurent Philippe, Jean-François Pineau, Veronika Rehn-Sonigo, Clément Rezvoy, Yves Robert, Bernard Tourancheau, Frédéric Vivien.
Steady-State Scheduling
The traditional objective when scheduling sets of computational tasks is to minimize the overall execution time (the makespan). However, in the context of heterogeneous distributed platforms, makespan minimization problems are in most cases NP-complete, sometimes even APX-complete. But when dealing with large problems, an absolute minimization of the total execution time is not really required. Indeed, deriving asymptotically optimal schedules is more than enough to ensure an efficient use of the architectural resources. In a nutshell, the idea is to reach asymptotic optimality by relaxing the problem to circumvent the inherent complexity of minimum makespan scheduling. The typical approach can be decomposed into three steps:

Neglect the initialization and cleanup phases, in order to concentrate on steady-state operation.

Derive an optimal steady-state schedule, for example using linear programming tools.

Prove the asymptotic optimality of the resulting schedule.
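The second step can be illustrated on a deliberately small case. The sketch below assumes a star platform whose master can send to only one worker at a time, where worker i needs c_i time units to receive a task and w_i time units to process it; these symbols, the one-port assumption and the function name are illustrative and are not taken from the text above. Under these assumptions the steady-state linear program maximizes the total task rate subject to one compute constraint per worker and one link constraint at the master, and this particular program can be solved greedily by serving the cheapest links first.

```python
from fractions import Fraction


def steady_state_throughput(workers):
    """Solve a toy steady-state LP for a one-port master (illustrative model).

    workers: list of (c, w) pairs, where c is the time to send one task to the
    worker and w is the time the worker needs to process it (hypothetical units).

    LP being solved:  maximize   sum(rate_i)
                      subject to rate_i * w_i <= 1      (worker never overloaded)
                                 sum(rate_i * c_i) <= 1 (master link never overloaded)
    This is a fractional knapsack, so a greedy pass by increasing c is optimal.
    """
    remaining = Fraction(1)  # fraction of the master's time still unallocated
    rates = [Fraction(0)] * len(workers)
    by_link_cost = sorted(range(len(workers)), key=lambda i: workers[i][0])
    for i in by_link_cost:
        c, w = Fraction(workers[i][0]), Fraction(workers[i][1])
        rate = min(1 / w, remaining / c)  # saturate the worker or the link
        rates[i] = rate
        remaining -= rate * c
        if remaining == 0:
            break
    return sum(rates), rates


if __name__ == "__main__":
    total, per_worker = steady_state_throughput([(1, 3), (2, 2), (3, 1)])
    print("tasks per time unit:", total)
    print("per-worker rates:", per_worker)
```

The rates are kept as exact fractions because a periodic schedule is usually reconstructed from their common denominator; real platforms and applications lead to richer linear programs than this toy one.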
This year we have studied a complex application where users, or clients, submit several bag-of-tasks applications on a heterogeneous master-worker platform, using a classical client-server model. The applications are submitted online, which means that there is no a priori (static) knowledge of the workload at the beginning of the execution. When several applications are executed simultaneously, they compete for hardware (network and CPU) resources. The traditional measure to quantify the benefits of concurrent scheduling on shared resources is the maximum stretch. The stretch of an application is defined as the ratio of its response time under the concurrent scheduling policy over its response time in dedicated mode, i.e., if it were the only application executed on the platform. The objective is then to minimize the maximum stretch of any application, thereby enforcing a fair trade-off between all applications. Because we target an online framework, the scheduling policy needs to be modified upon the arrival of a new application, or upon the completion of another one. Our scheduling strategy relies on complicated mathematical tools but can be computed in time polynomial in the problem size. Also, it can be shown to be optimal for the offline version of the problem, with release dates for the applications. On the practical side, we have run extensive simulations and several MPI experiments to assess the quality of our solutions.
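The stretch metric defined above can be written down in a few lines. The sketch below assumes each application is described by a release date, a completion date under the shared schedule, and its running time in dedicated mode; the tuple layout and function names are illustrative choices, not part of the scheduling strategy itself.

```python
def stretch(release, completion, dedicated_time):
    """Ratio of the response time under sharing to the dedicated-mode time."""
    response_time = completion - release
    return response_time / dedicated_time


def max_stretch(applications):
    """applications: iterable of (release, completion, dedicated_time) tuples."""
    return max(stretch(r, c, d) for r, c, d in applications)


if __name__ == "__main__":
    # Hypothetical figures for three applications sharing one platform.
    apps = [(0.0, 12.0, 4.0), (2.0, 12.0, 8.0), (5.0, 14.0, 6.0)]
    print("maximum stretch:", max_stretch(apps))
```

Minimizing this maximum, rather than the sum of response times, prevents a short application from being delayed disproportionately by long ones.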
Algorithmic kernels on master-slave platforms with limited memory
This work is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix product is well understood for homogeneous 2D arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are three key hypotheses that render our work original and innovative:
 Centralized data.
We assume that all matrix files originate from, and must be returned to, the master. The master distributes both data and computations to the workers (while in ScaLAPACK, input and output matrices are initially distributed among participating resources). Typically, our approach is useful in the context of speeding up MATLAB or SCILAB clients running on a server (which acts as the master and initial repository of files).
 Heterogeneous star-shaped platforms.
We target fully heterogeneous platforms, where computational resources have different computing powers. Also, the workers are connected to the master by links of different capacities. This framework is realistic when deploying the application from the server, which is responsible for enrolling authorized resources.
 Limited memory.
Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and reused for subsequent updates (as in ScaLAPACK). The amount of memory available in each worker is expressed as a given number m_i of buffers, where a buffer can store a square block of matrix elements. The size q of these square blocks is chosen so as to harness the power of Level 3 BLAS routines: q = 80 or 100 on most platforms.
We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages). We report a set of numerical experiments on various platforms at École Normale Supérieure de Lyon and the University of Tennessee. These platforms are either homogeneous or heterogeneous. In the latter case, the impact of our new algorithms on the overall performance is even greater.
Replica placement
This study consists in introducing and comparing several policies to place replicas in tree networks, subject to server capacity and QoS constraints. In this framework, the flows of client requests are known beforehand, while the number and location of the servers are to be determined. The standard approach in the literature is to enforce that all requests of a client be served by the closest server in the tree.
Our work on replica placement has been finalized this year by the preparation of a survey paper that assesses the usefulness of our two new policies (Upwards and Multiple) to place replicas in tree networks, subject to server capacity and QoS constraints. The survey encompasses many new theoretical results, and provides a comprehensive set of experiments. In particular, the experiments analyze the impact of server heterogeneity, together with the difficulty of finding a good trade-off between favoring clients with a large number of requests and clients with a very constrained QoS. The survey paper will appear in IEEE Trans. Parallel and Distributed Systems.
Mapping simple workflow graphs
Mapping workflow applications onto parallel platforms is a challenging problem that becomes even more difficult when platforms are heterogeneous, which is nowadays a standard assumption. A high-level approach to parallel programming not only eases the application developer's task, but also provides additional information that can help realize an efficient mapping of the application. We focused on simple application graphs such as linear chains and fork patterns. Workflow applications are executed in a pipeline manner: a large set of data needs to be processed by all the different tasks of the application graph, thus inducing parallelism between the processing of different data sets. For such applications, several antagonistic criteria should be optimized, such as throughput, latency, and failure probability.
This year, we have discussed the mapping of workflow applications onto different types of platforms: Fully Homogeneous platforms with identical processors and interconnection links; Communication Homogeneous platforms, with identical links but processors of different speeds; and finally, Fully Heterogeneous platforms.
For linear chain graphs, we have extensively studied the complexity of the mapping problem, for throughput and latency optimization criteria. Different mapping policies have been considered: the mapping can be required to be one-to-one (a processor is assigned at most one stage), or interval-based (a processor is assigned an interval of consecutive stages), or fully general. The most important result is the NP-completeness of the throughput maximization problem for interval-based mappings on Communication Homogeneous platforms, which is the extension of the well-known chains-to-chains problem to a heterogeneous setting.
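For reference, the classical chains-to-chains problem mentioned above is solvable in polynomial time when all processors are identical. The sketch below is a textbook dynamic program for that homogeneous case only, assuming unit-speed processors and no communication cost; the function name and the example weights are illustrative. It partitions the chain into at most p intervals of consecutive stages so that the most loaded interval, which determines the period and hence the throughput, is as small as possible.

```python
from itertools import accumulate


def min_period(work, p):
    """Smallest achievable bottleneck when cutting `work` into <= p intervals.

    work: list of stage computation costs, in pipeline order.
    p:    number of identical processors (assumed unit speed, free communication).
    """
    n = len(work)
    prefix = [0] + list(accumulate(work))  # prefix[j] = cost of stages 0..j-1
    inf = float("inf")
    # best[k][j]: minimal bottleneck for the first j stages with at most k intervals
    best = [[inf] * (n + 1) for _ in range(p + 1)]
    for k in range(p + 1):
        best[k][0] = 0
    for k in range(1, p + 1):
        for j in range(1, n + 1):
            for i in range(j):  # last interval covers stages i..j-1
                candidate = max(best[k - 1][i], prefix[j] - prefix[i])
                if candidate < best[k][j]:
                    best[k][j] = candidate
    return best[p][n]


if __name__ == "__main__":
    stages = [2, 3, 4, 1, 5]  # hypothetical stage costs
    for processors in (1, 2, 3):
        print(processors, "processor(s): period", min_period(stages, processors))
```

With processors of different speeds, the choice of which processor receives which interval interacts with the cuts, which is where the hardness result stated above comes from.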
We have established several new theoretical complexity results for a simplified model with no communication cost, but considering bicriteria optimization problems (throughput, latency) and both pipeline and fork graphs. We considered that pipeline or fork stages can be replicated in order to increase the throughput of the workflow, by sending consecutive data sets onto different processors. In some cases, stages can also be data-parallelized, i.e., the computation of one single data set is shared between several processors. This leads to a decrease of the latency and an increase of the throughput. Some instances of this simple model are shown to be NP-hard, thereby exposing the inherent complexity of the mapping problem. We provided polynomial algorithms for other problem instances. Altogether, we provided solid theoretical foundations for the study of mono-criterion or bicriteria mapping optimization problems.
This year we have focused mainly on pipeline graphs, and considered platforms in which processors are subject to failure during the execution of the application. We derived new theoretical results for bicriteria mappings aiming at optimizing both the latency (i.e., the response time) and the reliability (i.e., the probability that the computation will be successful) of the application. Latency is minimized by using faster processors, while reliability is increased by replicating computations on a set of processors. However, replication increases latency (additional communications, slower processors). The application fails to be executed only if all the processors fail during execution. While simple polynomial algorithms can be found for fully homogeneous platforms, the problem becomes NP-hard when tackling heterogeneous platforms.
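The effect of replication on reliability can be made concrete with a small calculation. The sketch below assumes that processors fail independently, each with a known failure probability over the execution, and that a pipeline cut into several intervals succeeds only if every interval keeps at least one surviving replica; both assumptions and all names are illustrative simplifications rather than the exact model of the study.

```python
from math import prod


def replicated_reliability(failure_probs):
    """Probability that at least one replica survives (independent failures)."""
    return 1 - prod(failure_probs)


def pipeline_reliability(intervals):
    """intervals: one list of replica failure probabilities per interval.

    Assumes the application succeeds only if every interval has a survivor.
    """
    return prod(replicated_reliability(replicas) for replicas in intervals)


if __name__ == "__main__":
    # Hypothetical failure probabilities.
    print("one replica   :", replicated_reliability([0.1]))
    print("two replicas  :", replicated_reliability([0.1, 0.2]))
    print("two intervals :", pipeline_reliability([[0.1, 0.2], [0.5]]))
```

Adding a replica can only increase the first quantity, which is why reliability pulls toward using many processors while latency pulls toward using only the fastest ones.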
On the experimental side, we have designed and implemented several polynomial heuristics for different instances of our problems. Experiments have been conducted for pipeline application graphs, on Communication Homogeneous platforms, since clusters made of different-speed processors interconnected by either plain Ethernet or a high-speed switch constitute the typical experimental platforms in most academic or industry research departments. We can express the problem of maximizing the throughput as the solution of an integer linear program, and thus we have been able to compare the heuristics with an optimal solution for small instances of the problem. For bicriteria optimization problems, we have compared different heuristics through extensive simulations.
VoroNet
In collaboration with ASAP (IRISA) and CEPAGE (LaBRI), we have proposed the design of VoroNet, an object-based peer-to-peer overlay network relying on Voronoi tessellations, along with its theoretical analysis and experimental evaluation. VoroNet differs from previous overlay networks in that peers are application objects themselves and get identifiers reflecting the semantics of the application instead of relying on hashing functions. Thus it provides scalable support for efficient search in large collections of data. In VoroNet, objects are organized in an attribute space according to a Voronoi diagram. VoroNet is inspired by Kleinberg's small-world model, where each peer gets connected to close neighbors and maintains an additional pointer to a long-range node. VoroNet improves upon the original proposal as it deals with general object topologies and therefore copes with skewed data distributions. We show that VoroNet can be built and maintained in a fully decentralized way. The theoretical analysis of the system proves that routing in VoroNet can be achieved in a number of hops that is polylogarithmic in the size of the system. The analysis is fully confirmed by our experimental evaluation by simulation.
Parallelizing the construction of the ProDom database
ProDom is a protein domain family database automatically built from a comprehensive analysis of all known protein sequences. ProDom development is headed by Daniel Kahn (Inria project-team HELIX). With the protein sequence databases increasing in size at an exponential pace, the parallelization of MkDom2, the algorithm used to build ProDom, has become mandatory (the original sequential version of MkDom2 took 15 months to build the 2006 version of ProDom and would have required at least twice that time to build the 2007 version).
The parallelization of MkDom2 is not trivial. The sequential MkDom2 algorithm is an iterative process, and parallelizing it involves forecasting which of its iterations can be run in parallel, and detecting and handling dependency breaks when they arise. We have demonstrated the feasibility of this parallelization at the scale of a cluster or a Grid, yielding an acceleration factor of more than 50 over the sequential algorithm. The collaboration with HELIX on ProDom continues today, both on the computational aspects of constructing ProDom on distributed platforms and on the biological aspects of evaluating the quality of the domain families defined by MkDom2 and of qualitatively enhancing ProDom.
Automatic discovery of platform topologies
Most of the advanced scheduling techniques require a good knowledge of the interconnection network. This knowledge, however, is rarely available. We are thus interested in automatically building models, from an application point of view, of the interconnection networks of distributed computational platforms.
In the scope of this work we have contributed to the ALNeM software, a framework for performing network measurements and modelling that can also be used to run simulations. In the same framework, we can therefore build a model and assess its quality.
Initially, we showed that the commonly used model-building algorithms (building cliques or spanning trees) all have serious weaknesses that prevent them from accurately predicting the running times of simple algorithmic kernels. We then proposed three new algorithms and assessed their quality. We have shown that one of these algorithms is able to produce an accurate model of an interconnection network. This algorithm requires network measurements to be performed on each of the active elements in the interconnection network (computing nodes, routers, etc.). In the future, we will try to overcome this limitation.
Scheduling multiple divisible loads on a linear processor network
Min, Veeravalli, and Barlas have recently proposed strategies to minimize the overall execution time of one or several divisible loads on a heterogeneous linear processor network, using one or more installments [76], [75]. We have shown on a very simple example that their approach does not always produce a solution and that, when it does, the solution is often suboptimal. We have also shown how to find an optimal schedule for any instance, once the number of installments per load is given. Then, we formally stated that any optimal schedule has an infinite number of installments under a linear cost model such as the one assumed in [76], [75]. Therefore, such a cost model cannot be used to design practical multi-installment strategies. Finally, through extensive simulations we confirmed that the best solution is always produced by our linear programming approach.
Scheduling small to medium batches of identical jobs
When scheduling small to medium batches of identical jobs, we have the choice between steady-state, makespan-oriented and batch-oriented techniques. Steady-state techniques achieve an optimal use of the resources for job series of infinite size. However, the cost of the initialization and termination phases is not controlled and, if the considered series is too small, the overhead generated during the initial phase may lead to an inefficient schedule. Makespan-oriented and online-oriented techniques are usually designed to optimize the execution of graphs of tasks on a platform. As the problem of scheduling a set of jobs on a heterogeneous platform is NP-complete, these techniques rely on heuristics to compute a suboptimal schedule. Makespan-oriented techniques compute this schedule offline, before the execution, with the assumption that no other jobs will be run on the platform during the execution of the task set. Each task to be scheduled is managed independently of the other tasks. As these techniques try to optimize the execution of the whole set of tasks, they do not suffer from the initialization problem. However, as the number of tasks scales up, the time needed to compute an optimized schedule usually becomes too long due to the complexity of the algorithm. Online-oriented techniques compute the schedule dynamically, taking already running tasks into account when placing the tasks of the arriving jobs. These last two techniques do not benefit from the knowledge that the executed jobs are identical.
Using the SimGrid toolkit, we have developed a simulator to compare the performance of these three approaches. We have identified the domain in which each approach is of interest, depending on the batch size and on the application and platform characteristics. Generally, makespan-oriented or online-oriented schedules perform better for small batches, where steady-state schedules are penalized by their initialization and termination phases. Steady-state scheduling is however not time-consuming to compute, so it is worth adapting it to this context. As the size of the suboptimal phases directly depends on the period size, we compute an optimal period size to extend the use of steady-state scheduling to small to medium batches.
Scheduling for real-time brain-machine interfaces
In collaboration with the ACIS Laboratory, University of Florida, we have studied how to schedule a particular application with real-time constraints in a distributed environment. The target application is a brain-machine interface: it receives signals coming from the premotor cortex of an animal's brain, processes these signals and produces a motor command that operates a robotic arm. The signal processing consists in the collaboration of a large number of “expert models”: each expert model computes an output (the motor command). All outputs are gathered using a responsibility estimator to predict the significance of the models at a given time. The large number of models to be computed supports the use of a distributed architecture. Each model is implemented as a linear filter of the neural signals with its own set of parameters. Periodically, a set of models has to be trained, and their parameters updated, so that they can improve their accuracy. Before considering the scheduling problems linked to this application, we concentrate on optimizing the computation of the models, and especially the training phase. About 10 seconds of data are needed to train a model and compute its new parameters. Performing the whole training computation at once leads to huge running times, which are not acceptable in the context of the real-time application. We proposed that this computation be performed “online”, without waiting for the last data to be received before starting the training. We adapt existing adaptive filters from the signal processing literature, such as recursive least-squares filters, to take into account the particularities of the application, such as multidimensional inputs and outputs. This online computation offers satisfactory reactivity and numerical stability.
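The idea of training a linear filter online, one sample at a time, can be sketched with the standard recursive least-squares (RLS) update. The code below is a generic textbook RLS for a multidimensional input and a single output component, written with plain Python lists; it is not the adapted filter developed in this work, and the class name, the forgetting factor and the initial value of the matrix P are illustrative assumptions.

```python
class RecursiveLeastSquares:
    """Textbook RLS for one linear output y = w . x, updated sample by sample."""

    def __init__(self, n_inputs, forgetting=1.0, delta=1000.0):
        self.n = n_inputs
        self.lam = forgetting  # 1.0 means no forgetting of old samples
        self.w = [0.0] * n_inputs
        # P approximates the inverse input correlation matrix; start at delta * I.
        self.P = [[delta if i == j else 0.0 for j in range(n_inputs)]
                  for i in range(n_inputs)]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, target):
        """Consume one (input, target) sample and return the a priori error."""
        n = self.n
        Px = [sum(self.P[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(n))
        gain = [v / denom for v in Px]
        error = target - self.predict(x)
        self.w = [wi + gi * error for wi, gi in zip(self.w, gain)]
        # P stays symmetric, so x^T P equals (P x)^T and Px can be reused.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam
                   for j in range(n)] for i in range(n)]
        return error


if __name__ == "__main__":
    # Hypothetical noiseless data generated by a known linear model.
    true_w = [2.0, -1.0, 0.5]
    rls = RecursiveLeastSquares(3)
    for t in range(200):
        x = [((7 * t) % 11) / 11 - 0.5,
             ((3 * t) % 13) / 13 - 0.5,
             ((5 * t) % 17) / 17 - 0.5]
        rls.update(x, sum(a * b for a, b in zip(true_w, x)))
    print("estimated weights:", rls.w)
```

Each update costs a number of operations quadratic in the input dimension and needs only the current sample, so the work is spread over the acquisition window instead of being concentrated after the last sample arrives.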
Numerical Simulation for Energy Efficiency
Numerical simulation can have a great impact on the design of buildings by predicting their energy consumption. We worked on the Ener+ framework provided by CETHIL (Centre de Thermique de Lyon, UMR 5008) to optimize the design of a house through parametric optimization, which relies on multiple executions of the TRNSYS simulation engine [65]. The results were very promising: the final design is a house that produces, over a year, twice the energy it needs, where these needs include both the house's own energy use and the specific energy consumption of its four inhabitants.
Routing in Low Power and Lossy Networks
The size reduction of computing and networking components has opened a new field of applications: networks of motes carrying embedded sensors and actuators. Ad hoc networking provides a foundation for these new devices, but their low-power and lossy characteristics raise new challenges, as does their potential deployment on a very large scale in our environment for monitoring and control purposes. We explored these networked sensor platforms and set up an experimental testbed at the lab. We developed a parameterized and programmable interface on top of the existing middleware provided by the vendor. We have closely followed the IETF charters related to sensor networks, especially last year's 6lowPAN, RL2N and RSN discussions. In this context we are now working to implement an IPv6 stack for motes running TinyOS. Our aim is to take advantage of our extensive experience in networking optimization and communication scheduling in highly connected graphs to provide both very efficient mesh routing for mote networks and long-distance connectivity for mote clouds. On the application side, we are working on the calibration of our platform in order to better estimate the quality of the measurements obtained from low-cost embedded sensors.