## Section: New Results

### New Services for Scheduling and Processing on Large Scale Platforms

#### Requests and Task Scheduling on Large Scale Semi-Stable Distributed Platforms

##### Divisible Load Scheduling

Participants : Olivier Beaumont, Nicolas Bonichon, Lionel Eyraud-Dubois.

Although few applications currently target large scale platforms, these platforms are clearly not suited to tightly coupled codes, and we need to concentrate on simple scheduling problems in the context of large scale distributed unstable platforms. Indeed, most scheduling problems are already NP-Complete, with poor approximation ratios, on static homogeneous platforms even when communication costs are not taken into account.

Recently, many algorithms have been derived, under several communication models, for master-slave tasking [59], [111] and Divisible Load Scheduling (DLS) [70], [69], [55].

In this case, we aim at executing a large bag of independent, same-size tasks. First we assume that there is a single master, that initially holds all the (data needed for all) tasks. The problem is to determine an architecture for the execution. Which processors should the master enroll in the computation? How many tasks should be sent to each participating processor? In turn, each processor involved in the execution must decide which fraction of the tasks must be computed locally, and which fraction should be sent to which neighbor (these neighbors must be determined too).

Parallelizing the computation by spreading the execution across many
processors may well be limited by the induced communication volume.
Rather than aiming at makespan minimization, a more relevant
objective is the optimization of the throughput in steady-state
mode. There are three main reasons for focusing on the steady-state
operation. The first is *simplicity*, as steady-state
scheduling is in fact a relaxation of the makespan minimization
problem in which the initialization and clean-up phases are ignored.
One only needs to determine, for each participating resource, which
fraction of time is spent computing for which application, and which
fraction of time is spent communicating with which neighbor; the
actual schedule then arises naturally from these quantities.
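As a minimal sketch of this steady-state view, consider a single-master star platform where the master may serve several workers at once as long as its total outgoing bandwidth is not exceeded. The function name `steady_state_rates`, the units (tasks per time unit) and the example values are assumptions made for illustration and do not come from the cited papers. Each worker receives tasks at a rate bounded by its link and by its speed, and because every task counts equally in the throughput, filling the workers in any order is enough to reach the optimum of this small linear program.

```python
from typing import List, Tuple


def steady_state_rates(master_bw: float,
                       workers: List[Tuple[float, float]]) -> List[float]:
    """Per-worker task rates maximizing steady-state throughput on a star.

    workers[i] = (link_bw, speed), both expressed in tasks per time unit
    (hypothetical units). The underlying linear program is:
        maximize   sum(x_i)
        subject to 0 <= x_i <= min(link_bw_i, speed_i)
                   sum(x_i) <= master_bw
    """
    rates = []
    remaining = master_bw
    for link_bw, speed in workers:
        # A worker cannot absorb more than its link, its speed,
        # or what is left of the master's outgoing bandwidth.
        x = min(link_bw, speed, remaining)
        rates.append(x)
        remaining -= x
    return rates


# Hypothetical platform: three workers given as (link bandwidth, speed).
workers = [(4.0, 2.0), (1.0, 3.0), (5.0, 5.0)]
rates = steady_state_rates(6.0, workers)
throughput = sum(rates)
print(rates, throughput)
```

The rates give the fraction of time each resource spends computing and communicating; a periodic schedule can then be rebuilt from them, while the initialization and clean-up phases are ignored.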

In [66], we have considered task scheduling for parallel multi-frontal methods, which corresponds to mapping a set of tasks whose dependencies are described by a tree. In [65], we have proposed several distributed scheduling algorithms for the case where several applications are to be mapped simultaneously onto a heterogeneous platform.

In [64], we discuss complexity issues for DLS on heterogeneous systems under the bounded multi-port model. To the best of our knowledge, this is the first attempt to consider DLS under a realistic communication model, where the master node can communicate simultaneously with several slaves, provided that bandwidth constraints are not exceeded. We concentrate on one-round distribution schemes, where a given node starts its processing only once all its data has been received. Our main contributions are (i) the proof that processors start working immediately after receiving their work, (ii) the study of the optimal schedule in the case of 2 processors, and (iii) the proof that scheduling divisible load under the bounded multi-port model is NP-complete. This last result differs strongly from the divisible load literature and represents the first NP-completeness result when latencies are not taken into account.

Another important and still open issue for Divisible Load Scheduling is that of return communications. Under the classical model, it is assumed that the time needed to send results from the slaves back to the master node can be neglected, which strongly limits the application field. In particular, the complexity of the problem with return messages is still open. This question has been studied in cooperation with Abhay Ghatpande, from Waseda University, in [107], [108], [126]. In particular, we have proposed two heuristics for scheduling return messages with different computational costs.

In this context, we have participated in the writing of two book chapters: [43], on the different possible models of communication, and [40], on steady-state scheduling.

##### Broadcasting and Independent Tasks Scheduling under bounded Multiport Model

Participants : Olivier Beaumont, Christopher Thraves-Caro, Lionel Eyraud-Dubois, Hejer Rejeb.

We have revisited several classical scheduling problems (broadcasting, independent tasks scheduling) under more realistic communication models, whose parameters can be instantiated at runtime. We have proved that resource augmentation techniques make it possible to derive quasi-optimal algorithms even if the underlying scheduling problems are strongly NP-Complete.

In [25], [24], we have considered the problem of allocating a large number
of independent, equal-sized tasks to a heterogeneous
large scale computing platform. We model the platform
using a set of servers (masters) that initially hold (or generate)
the tasks to be processed by a set of clients (slaves).
All resources have different speeds of communication and
computation and we model contentions using the bounded
multi-port model. This model corresponds well to modern
networking technologies, but for the sake of realism,
another parameter needs to be introduced in order to
bound the number of simultaneous connections that can
be opened at a server node. We prove that unfortunately,
this additional parameter makes the problem of maximizing
the overall throughput NP-Complete. On the other
hand, we also propose a polynomial time algorithm, based
on a slight resource augmentation, to solve this problem.
More specifically, we prove that, if d_{j} denotes the maximal
number of connections that can be opened at server node
S_{j}, then the throughput achieved using this algorithm and
d_{j} + 1 simultaneous connections is at least the same as the
optimal one with d_{j} simultaneous connections. This algorithm
also provides a good approximation for the dual
problem of minimizing the maximal number of connections
that need to be opened in order to achieve a given
throughput, and it can be turned into a standard approximation
algorithm (i.e., without resource augmentation).
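The constraints of this model can be made concrete with a small feasibility checker. This is only a sketch under stated assumptions: the function name `throughput_if_feasible` and the example values are hypothetical, and each client's incoming bandwidth and computing speed are folded into a single capacity. The checker validates a given allocation; it is not the polynomial time algorithm of [25], [24].

```python
from typing import List, Sequence


def throughput_if_feasible(rate: List[Sequence[float]],
                           server_bw: Sequence[float],
                           server_deg: Sequence[int],
                           client_cap: Sequence[float],
                           eps: float = 1e-9) -> float:
    """Return the total throughput of an allocation, or raise ValueError.

    rate[j][i] is the task rate sent by server S_j to client i.
    Checked constraints (degree-constrained bounded multi-port model):
      - rates are non-negative;
      - the sum of the rates leaving S_j is at most server_bw[j];
      - S_j serves at most server_deg[j] clients (d_j connections);
      - the sum of the rates entering client i is at most client_cap[i].
    """
    for j, row in enumerate(rate):
        if any(r < 0 for r in row):
            raise ValueError(f"negative rate at server {j}")
        if sum(row) > server_bw[j] + eps:
            raise ValueError(f"bandwidth exceeded at server {j}")
        if sum(1 for r in row if r > eps) > server_deg[j]:
            raise ValueError(f"too many connections at server {j}")
    for i, cap in enumerate(client_cap):
        if sum(row[i] for row in rate) > cap + eps:
            raise ValueError(f"capacity exceeded at client {i}")
    return sum(sum(row) for row in rate)


# Hypothetical instance: two servers, three clients.
server_bw = [5, 3]
server_deg = [1, 2]      # d_0 = 1, d_1 = 2
client_cap = [4, 2, 2]
total = throughput_if_feasible([[4, 0, 0], [0, 2, 1]],
                               server_bw, server_deg, client_cap)
print(total)
```

In this instance the first server is left with unused bandwidth because it may open a single connection only, which illustrates why allowing d_{j} + 1 connections can recover throughput.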

We have also considered in [39], [26] an extension of the above problem (MTBD) that represents the more realistic situation where clients can join or leave the system at any time. This extension is called online MTBD. First, we have studied the complexity of the problem and obtained a negative result: no totally online algorithm can guarantee a given approximation factor, even if the algorithm uses resource augmentation. On the other hand, if new connections are allowed each time a client arrives or leaves, we propose an algorithm that achieves the optimal throughput using resource augmentation while allowing only one new connection per server each time a client arrives or leaves.

In [23], we have considered the
problem of broadcasting a large message in a large scale distributed
platform, with the same degree-constrained multi-port model. The
message must be sent from a source node, with the help of the
receiving peers which may forward the message to other peers. In this
context, we are not interested in minimizing the makespan for a given
message size but rather in maximizing the throughput (i.e. the maximum
broadcast rate, once steady state has been reached). In this case
also, the degree constraint makes the problem of maximizing the
overall throughput NP-Complete. On the other hand, we also propose a
polynomial time algorithm based on a slight resource augmentation to
solve this problem. More specifically, we prove that if d_{j} denotes
the maximal number of connections that can be opened at node C_{j},
then the throughput achieved by this algorithm, using at most
max(4, d_{j} + 2) simultaneous connections at node C_{j}, is at least the
same as the optimal one with d_{j} simultaneous connections.
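To give a feel for what throughput means here, the sketch below computes a simple upper bound on the steady-state broadcast rate when degree constraints and incoming bandwidth limits are ignored. It is not the algorithm of [23]; the function name and the values are assumptions chosen for illustration. The bound follows from two necessary conditions: fresh data leaves the source at a rate of at most its outgoing bandwidth, and each of the n receivers must receive the whole stream, so n times the rate cannot exceed the total outgoing bandwidth of the platform. Degree constraints can only lower the achievable rate.

```python
from typing import Sequence


def broadcast_rate_upper_bound(source_bw: float,
                               peer_bw: Sequence[float]) -> float:
    """Upper bound on the steady-state broadcast rate.

    source_bw: outgoing bandwidth of the source node.
    peer_bw:   outgoing bandwidths of the n receiving peers.
    Degree constraints and incoming bandwidths are deliberately ignored.
    """
    n = len(peer_bw)
    if n == 0:
        raise ValueError("at least one receiving peer is required")
    # Condition 1: the source emits fresh data at rate <= source_bw.
    # Condition 2: n * rate <= source_bw + sum(peer_bw).
    return min(source_bw, (source_bw + sum(peer_bw)) / n)


bound = broadcast_rate_upper_bound(4.0, [1.0, 1.0, 2.0, 0.0])
print(bound)
```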

In a closely related context, we have investigated in [27] the influence of bandwidth sharing mechanisms on the performance of scheduling algorithms on large scale distributed platforms. More specifically, we have considered three scheduling problems (file redistribution, independent tasks scheduling and broadcasting) on large scale heterogeneous platforms under the Bounded Multi-port Model. This model can be used when programming at TCP level and is also implemented in modern message passing libraries such as MPICH2. We prove, using the three scheduling problems mentioned above, that this model is tractable and that even very simple distributed algorithms can achieve optimal performance, provided that bandwidth sharing policies can be enforced. Our goal is to establish the need for such QoS mechanisms, which are now available in the kernels of modern operating systems, to achieve optimal performance. We prove that implementations of optimal algorithms that do not enforce the prescribed bandwidth sharing can fall short of the optimum by a large amount if only TCP contention mechanisms are used. More precisely, for each scheduling problem considered, we have established upper bounds on the performance loss that can be induced by TCP bandwidth sharing mechanisms, and we have proved that these upper bounds are tight by exhibiting instances that achieve them.

#### New Services for Processing on Large Scale Distributed Platforms

##### Heterogeneous Dating Service and Distributed Storage Systems

Participants : Olivier Beaumont, Philippe Duchon, Hejer Rejeb.

In many distributed applications on large distributed systems, nodes
may offer some local resources and request some remote resources. For
instance, in a distributed storage environment, nodes may offer some
space to store remote files and request some space to duplicate
remotely some of their files. In the context of broadcasting, offer
may be seen as the outgoing bandwidth and request as the incoming
bandwidth. In the context of load balancing, overloaded nodes may
request to get rid of some tasks whereas underloaded nodes may offer
to process them. In this context, we propose a distributed algorithm,
called *dating service*, which is meant to
randomly match demands and supplies of some resource of many nodes
into couples. In a given round it produces a matching between demands
and supplies which is of linear size (compared to the optimal one),
even if available resources of individual nodes are very
heterogeneous, and is chosen uniformly at random from all matchings of
this size.
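The following is a simplified, centralized simulation of one round of such a matching, given only to illustrate the idea. It assumes that every unit of supply and every unit of demand is sent to a rendezvous node chosen uniformly at random, and that each rendezvous node pairs what it has received; the function name `dating_round` and this uniform choice are assumptions, and the sketch carries none of the guarantees stated above.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def dating_round(supply: Dict[int, int],
                 demand: Dict[int, int],
                 rng: random.Random) -> List[Tuple[int, int]]:
    """Simulate one round and return a list of (supplier, demander) pairs.

    supply[v] / demand[v]: number of resource units offered / requested
    by node v. Nodes absent from a dictionary offer or request nothing.
    """
    nodes = sorted(set(supply) | set(demand))
    offers = defaultdict(list)     # rendezvous node -> suppliers received
    requests = defaultdict(list)   # rendezvous node -> demanders received
    for v in nodes:
        for _ in range(supply.get(v, 0)):
            offers[rng.choice(nodes)].append(v)
        for _ in range(demand.get(v, 0)):
            requests[rng.choice(nodes)].append(v)

    matches = []
    for r in nodes:
        rng.shuffle(offers[r])
        rng.shuffle(requests[r])
        # zip stops at the shorter list, so the surplus stays unmatched.
        matches.extend(zip(offers[r], requests[r]))
    return matches


# Hypothetical heterogeneous instance: node 0 offers much more than node 1.
supply = {0: 3, 1: 1}
demand = {2: 2, 3: 2}
matches = dating_round(supply, demand, random.Random(0))
print(matches)
```

Units that are not matched in a round can simply be resubmitted in the next one.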

We believe that this basic operation can be of great interest in many practical applications and could be used as a building block for writing efficient software on large distributed unstable platforms. We plan to demonstrate its practical efficiency for content distribution, management of large databases and distributed storage applications described in Section 5.

We also have ongoing work on using this dating service for the maintenance of a randomized overlay network against arbitrary arrivals and departures of nodes, and are trying to remove the requirement for the algorithm to work in a succession of rounds.

In the context of our collaboration with Yahoo!, we have presented in [60] a new algorithm for disk reconfiguration in the context of *Vespa*, a scalable platform for storing, retrieving, processing and searching large amounts of data developed by *Yahoo! Technologies Norway*. The corresponding scheduling problem is closely related to independent tasks scheduling on heterogeneous platforms, when communication costs are taken into account and each task can only be processed on a prescribed set of processors. We show how to derive, from a linear programming formulation in rational numbers, an approximation algorithm whose approximation ratio is close to 1 under the conditions of use of *Vespa*.

##### Building Heterogeneous Clusters

Participants : Olivier Beaumont, Nicolas Bonichon, Philippe Duchon, Lionel Eyraud-Dubois, Hubert Larchevêque.

As already noted in Section 2.1 with the example of the WCG call for proposals, the application field of Grid computing is limited by several constraints. In particular, the target application should be easy to divide into small independent pieces of work, so that each individual piece can be executed on a single node. This strongly limits the application field since, in many cases, the data may be too large to fit into the memory of a single node.

In this context, we would like to propose a distributed algorithm to dynamically build clusters of nodes able to process large tasks. These sets of nodes should satisfy constraints on their overall available memory and processing power, together with constraints on the maximal latency and the minimal bandwidth between any two participating nodes.

We believe that such a distributed service would make it possible to address a much larger application field. We plan to first demonstrate its practical efficiency on the molecular dynamics application (based on NAMD) described in more detail in Section 5.

In [63], we present a model of this problem called
*bin-covering problem with distance constraint* and we propose
a distributed approximation algorithm in the case where the elements are in a space
of dimension 1. In [62], we describe a generic
two-phase algorithm, based on resource augmentation, whose
approximation ratio is 1/3. We also propose a distributed version of
this algorithm when the metric space is ℚ^{D} (for a small
value of D) and the L_{∞} norm is used to define
distances. This algorithm takes O(4^{D} log^{2} n) rounds and O(4^{D} n log n) messages, both in expectation and with high probability,
where n is the total number of hosts.
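For intuition about the one-dimensional case, here is a naive centralized greedy sweep. It is a sketch under stated assumptions, not the distributed approximation algorithm of [63] or [62]: hosts are points on a line, each with an integer weight (for instance available memory), and a cluster is valid when its total weight reaches `capacity` while its diameter stays within `dmax`. All names and values are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Host = Tuple[int, int]  # (position on the line, weight)


def greedy_clusters(hosts: List[Host], capacity: int,
                    dmax: int) -> List[List[Host]]:
    """Build disjoint clusters with weight >= capacity and diameter <= dmax.

    Hosts are swept from left to right; a sliding window keeps only the
    hosts within distance dmax of the current one, and a cluster is closed
    as soon as the window is heavy enough. No optimality is claimed.
    """
    clusters = []
    window = deque()
    total = 0
    for pos, weight in sorted(hosts):
        window.append((pos, weight))
        total += weight
        # Drop hosts that are too far from the newest one.
        while pos - window[0][0] > dmax:
            total -= window.popleft()[1]
        if total >= capacity:
            clusters.append(list(window))
            window.clear()
            total = 0
    return clusters


hosts = [(0, 2), (1, 2), (2, 3), (10, 5), (11, 1), (12, 4)]
clusters = greedy_clusters(hosts, capacity=5, dmax=2)
print(clusters)
```

The number of clusters built is the quantity to maximize in a bin-covering formulation; the greedy sweep only provides a feasible solution.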

##### Complex Queries for Non-Trivial Parallel Algorithms

Participants : Olivier Beaumont, Nicolas Bonichon, Philippe Duchon, Lionel Eyraud-Dubois.

In many applications on large scale distributed platforms, the application data files are distributed across the platform, and the volatility of resource availability rules out relying on a centralized system to locate data.

In this context, complex queries, such as finding a node holding a given set of files, or holding a file whose index is close to a given value, or a set of (close) nodes covering a given set of files, should be treated in a distributed manner. The queries supported by existing P2P systems are not expressive enough to handle such requests.

We plan to demonstrate the usefulness and efficiency of such requests on the molecular dynamics application and on the continuous integration application described in Section 5. Again, we strongly believe that these operations can be considered as useful building blocks for most large scale distributed applications that cannot be executed in a client-server model, and that providing a library with such mechanisms would be of great interest.

A sound approach is to structure peer-to-peer overlay networks in such a way that they reflect the structure of the application. Peers represent objects of the application, so that neighbours in the peer-to-peer network are objects having similar characteristics from the application's point of view. Such structured peer-to-peer overlay networks provide natural support for range and complex queries. We have proposed in [67] to use complex structures such as a Voronoï tessellation, where each peer is associated with a cell in the space. Moreover, since the cost of computing and maintaining these structures is usually extremely high in dimensions larger than 2, we have proposed to weaken the Voronoï structure to deal with higher dimensional spaces [68].
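A minimal sketch of how such an overlay can answer a proximity query is given below. It assumes two-dimensional coordinates and an explicitly provided neighbour relation between cells; the function names, the peers and the coordinates are hypothetical and are not taken from [67] or [68]. A point belongs to the Voronoï cell of its nearest peer, and a query can be forwarded greedily from neighbour to neighbour towards the target.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def cell_owner(sites: Dict[str, Point], target: Point) -> str:
    """Peer whose Voronoi cell contains target, i.e. the nearest peer."""
    return min(sorted(sites), key=lambda p: math.dist(sites[p], target))


def greedy_route(sites: Dict[str, Point],
                 neighbours: Dict[str, List[str]],
                 start: str, target: Point) -> List[str]:
    """Forward a query to the neighbour closest to target until no
    neighbour improves on the current peer. Uses local knowledge only."""
    path = [start]
    current = start
    while True:
        best = min(neighbours[current],
                   key=lambda p: math.dist(sites[p], target),
                   default=current)
        if math.dist(sites[best], target) >= math.dist(sites[current], target):
            return path  # the distance strictly decreases, so this terminates
        current = best
        path.append(current)


# Hypothetical overlay of three peers whose cells are pairwise adjacent.
sites = {"a": (0.0, 0.0), "b": (4.0, 0.0), "c": (0.0, 4.0)}
neighbours = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
owner = cell_owner(sites, (3.0, 1.0))
path = greedy_route(sites, neighbours, "c", (3.0, 1.0))
print(owner, path)
```

Here `cell_owner` uses global knowledge and serves as a reference answer, whereas `greedy_route` only inspects the neighbours of the current peer, as a peer in the overlay would.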

We are currently adapting the techniques proposed in these papers to the molecular dynamics application in collaboration with Juan Elezgaray from IECB.