Section: Scientific Foundations
Structuring of Applications for Scalability
Participants: Pierre-Nicolas Clauss, Sylvain Contassot-Vivier, Jens Gustedt, Soumeya Leila Hernane, Vassil Iordanov, Thomas Jost, Wilfried Kirschenmann, Stéphane Vialle.
Our approach is based on a “good” separation of the different problem levels that we encounter with Grid problems. At the same time, this separation has to ensure good data locality (a computation uses data that are “close”) and good granularity (the computation is divided into non-preemptive tasks of reasonable size). For problems that have no natural data parallelism or control parallelism, such a division (into data and tasks) is mandatory when tackling the issues related to spatial and temporal distances as we encounter them in the Grid.
Several parallel models offering simplified frameworks that ease the design and implementation of algorithms have been proposed. The best known of these provide a modeling that is called “fine grained”, i.e., at the instruction level. Their lack of realism with respect to existing parallel architectures and their inability to predict the behavior of implementations have triggered the development of new models that allow a switch to a coarse grained paradigm. In the framework of parallel and distributed (but homogeneous) computing, these models started with the fundamental work of Valiant [55]. Their common characteristics are:

Maximally exploit the data that is located on a particular node by a local computation.

Collect all requests for other nodes during the computation.

Only transmit these requests when the computation cannot progress anymore.
The coarse grained models aim at being realistic with regard to two different aspects: algorithms and architectures. In fact, the coarseness of these models exploits a common characteristic of today's parallel settings: the size of the input is orders of magnitude larger than the number of available processors. In contrast to the PRAM (Parallel Random Access Machine) model, the coarse grained models are able to integrate the cost of communications between different processors. This allows them to give realistic predictions about the overall execution time of a parallel program. As examples, we refer to BSP (Bulk Synchronous Parallel model) [55], LogP (Latency overhead gap Procs) [48], CGM (Coarse Grained Multicomputer) [50] and PRO (Parallel Resource Optimal Model) [5].
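To illustrate how such a model integrates communication costs, the cost that BSP assigns to a program can be written as below. The symbols follow the usual presentation of the BSP model and are not taken from this report; the other models cited above use different parameters.

```latex
% Cost of a BSP program with S supersteps (standard BSP notation, assumed here):
%   w_s  : maximum local computation time of any processor in superstep s
%   h_s  : maximum number of words sent or received by any processor in superstep s
%   g    : communication cost per word (inverse bandwidth)
%   \ell : cost of one barrier synchronization
\[
  T \;=\; \sum_{s=1}^{S} \bigl( w_s + g\,h_s + \ell \bigr)
    \;=\; \sum_{s=1}^{S} w_s \;+\; g \sum_{s=1}^{S} h_s \;+\; S\,\ell
\]
```

The last term shows why the number of supersteps matters for the predicted running time: every superstep pays the synchronization cost once, regardless of how much work it contains.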
The assumptions on the architecture are very similar: p homogeneous processors with local memory, distributed on a point-to-point interconnection network. They also have similar models for program execution that are based on supersteps: an alternation of computation and communication phases. At the algorithmic level, this takes the distribution of the data on the different processors into account. However, none of the mentioned models allows the design of algorithms for the Grid, since they all assume homogeneity, for the processors as well as for the interconnection network.
Our approach is algorithmic. We try to provide a modeling of a computation on grids that allows an easy design of algorithms and implementations with realistic performance. Even if there are problems for which the existing sequential algorithms may be easily parallelized, an extension to other, more complex problems such as computing on large discrete structures (e.g., web graphs or social networks) is desirable. Such an extension will only be possible if we accept a paradigm change. We have to explicitly decompose data and tasks.
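The superstep structure can be sketched with a small sequential simulation. This is only an illustration under our own assumptions: the names `run_supersteps`, `local_sum`, `apply_offset` and `prefix_sums` are hypothetical, the "processors" are simulated by a loop, and the example problem (prefix sums over distributed blocks) is chosen for brevity.

```python
from itertools import accumulate


def run_supersteps(states, phases):
    """Simulate p processors that execute the given phases as supersteps.

    Each phase is a function (rank, p, state, inbox) -> (new_state, outgoing),
    where outgoing is a list of (destination_rank, payload) pairs.
    """
    p = len(states)
    inboxes = [[] for _ in range(p)]
    for compute in phases:
        outboxes = [[] for _ in range(p)]
        # Computation phase: every processor works on its local data only
        # and buffers its requests for other processors.
        for rank in range(p):
            states[rank], outgoing = compute(rank, p, states[rank], inboxes[rank])
            for dest, payload in outgoing:
                outboxes[dest].append((rank, payload))
        # Communication phase: all buffered messages are delivered in bulk.
        inboxes = outboxes
    return states


def local_sum(rank, p, block, inbox):
    # Superstep 1: sum the local block, send the total to higher ranks.
    total = sum(block)
    return block, [(dest, total) for dest in range(rank + 1, p)]


def apply_offset(rank, p, block, inbox):
    # Superstep 2: shift the local prefix sums by the totals received.
    offset = sum(payload for _, payload in inbox)
    return [offset + x for x in accumulate(block)], []


def prefix_sums(blocks):
    """Prefix sums of the concatenated blocks, computed in two supersteps."""
    return run_supersteps([list(b) for b in blocks], [local_sum, apply_offset])
```

For instance, `prefix_sums([[1, 2], [3, 4], [5]])` yields the blocks `[1, 3]`, `[6, 10]` and `[15]`. The number of supersteps is independent of the input size, and each processor sends at most p - 1 values.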
We are convinced that this new paradigm should have the following properties:

(1) It should use asynchronous algorithmic schemes when possible. Those algorithms are very well suited to grid contexts but are not applicable to all scientific problems.

(2) Otherwise, it should be guided by the idea of supersteps (BSP). This enforces a concentration of the computation on the local data.

(3) It should ensure an economic use of all available resources.
At the same time, we have to be careful that the model (and the design of algorithms) remains simple.
Several studies have demonstrated the efficiency of (1) in large scale local or grid contexts [43], [42] or have dealt with the implementation aspects [44]. But to fully exploit the benefits of those algorithms, not only the computations but also the control of those algorithms needs to be asynchronous. To fulfill such needs, a decentralized convergence detection algorithm has been proposed in [45].
A natural extension of this work is the study of asynchronism in hierarchical and hybrid clusters, that is to say, clusters in which there are different levels of computational elements and in which those elements may be of different kinds. Typically, a cluster of workstations with at least one GPU in each node forms such a hierarchical and hybrid system.
To the best of our knowledge, although GPGPU has enjoyed great success over the last few years, it is not yet widely used in clusters. This is probably mainly due to the rather high cost of data transfers between the GPU memory and its host memory, which generates an additional communication overhead in parallel algorithms.
Still, some algorithms may be less affected by that overhead than others: the asynchronous iterative ones. This is because they implicitly overlap communications with computations and because the iterations are no longer synchronized, which provides much more flexibility with respect to the parallel system.
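The behavior of an asynchronous iterative algorithm can be sketched with a small simulation. This is a sketch under our own assumptions, not the solver discussed in this section: the function `async_jacobi`, its parameters and the random scheduling are hypothetical, and the nodes are simulated sequentially. Each node owns a block of unknowns, iterates on a private and possibly stale copy of the remaining unknowns, and never waits for the other nodes.

```python
import random


def async_jacobi(A, b, blocks, steps=1000, push_prob=0.5, seed=0):
    """Simulate asynchronous block-Jacobi iterations for A x = b.

    blocks[k] lists the components owned by node k. The scheduling and the
    message model are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(b)
    # views[k] is node k's private, possibly outdated, copy of the vector.
    views = [[0.0] * n for _ in blocks]
    for _ in range(steps):
        # Whichever node is ready iterates; there is no global barrier.
        k = rng.randrange(len(blocks))
        view = views[k]
        updated = {
            i: (b[i] - sum(A[i][j] * view[j] for j in range(n) if j != i)) / A[i][i]
            for i in blocks[k]
        }
        for i, value in updated.items():
            view[i] = value
        # Non-blocking sends: another node may or may not have received the
        # new values before its next iteration, so it may use stale data.
        for other, other_view in enumerate(views):
            if other != k and rng.random() < push_prob:
                for i, value in updated.items():
                    other_view[i] = value
    # Assemble the result from the components each node owns.
    x = [0.0] * n
    for k, block in enumerate(blocks):
        for i in block:
            x[i] = views[k][i]
    return x
```

With a strictly diagonally dominant matrix, such as the tridiagonal matrix with 4 on the diagonal and 1 next to it, the iteration converges even though nodes regularly compute with outdated values. This tolerance to stale data is what allows communications to overlap with computations.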
In that context, we study the adaptation of asynchronous iterative algorithms to a cluster of GPUs for solving PDE problems. In our solver, the space is discretized by finite differences and all the derivatives are approximated by Euler equations. The inner computations of our PDE solver consist of solving (generally sparse) linear systems, so a linear solver is included in our solver. As that part is the most time consuming one, it is essential to make it as fast as possible in order to decrease the overall computation time. This is why we have decided to implement it on GPU. Our parallel scheme uses the Multisplitting-Newton method, which is a more flexible kind of block decomposition. In particular, it allows for asynchronous iterations.
Finally, each subdomain of the problem is treated on one node. The nonlinear computations are performed on the CPU, whereas the linear systems are solved on the local GPU. The nodes communicate their local results to each other according to a dependency graph induced by the problem.
Concerning (2), the number of supersteps and the minimization thereof should not be a goal by themselves. This number has to be constrained by other, more “natural” parameters coming from the architecture and the problem instance. A first solution that uses (2) to combine these objectives for homogeneous environments has been given in [5] with PRO.
In a complementary approach, we have addressed (3) by developing a simple interface that gives a consistent view of the data services that are exported to an application, see [7].
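As an illustration of why a sparse linear solve sits at the core of such a solver, consider the following minimal sketch. It rests on assumptions that are ours alone: a one-dimensional heat equation u_t = u_xx with zero boundary values, central finite differences in space, an implicit Euler step in time, and a plain CPU tridiagonal solver standing in for the GPU linear solver. The names `solve_tridiagonal` and `implicit_euler_step` are hypothetical.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system (no pivoting).

    lower[0] and upper[-1] are ignored. Stands in for the sparse linear
    solver; suitable for diagonally dominant systems.
    """
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def implicit_euler_step(u, dt, h):
    """Advance the interior values u of u_t = u_xx by one time step dt.

    Space is discretized with central differences of mesh width h, which
    gives the sparse system (I + r L) u_new = u with r = dt / h**2.
    """
    r = dt / (h * h)
    n = len(u)
    lower = [0.0] + [-r] * (n - 1)
    diag = [1.0 + 2.0 * r] * n
    upper = [-r] * (n - 1) + [0.0]
    return solve_tridiagonal(lower, diag, upper, u)
```

Every time step requires one such solve, which is why the linear part dominates the running time. In a nonlinear problem, each Newton iteration would require a solve of the same kind with the Jacobian matrix.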
Starting from these models, we try to design high level algorithms for grids. They will be based upon an abstract view of the architecture and as far as possible be independent of the intermediate levels. They aim at being robust with regard to the different hardware constraints and should be sufficiently expressive. The applications for which our approach will be feasible are those that fulfill certain constraints:

they need a lot of computing power,

they need a lot of data that is distributed over several resources, or

they need a lot of temporary storage exceeding the capacity of a single machine.
To become useful on grids, coarse grained models (and the algorithms designed for them) must first of all overcome a principal constraint: the assumption of homogeneity of the processors and connections. The long term goal should be arbitrarily mixed architectures, but it would not be realistic to expect to achieve this in one step.