Section: Application Domains
High Performance Computing
Participants: Pierre-Nicolas Clauss, Sylvain Contassot-Vivier, Jens Gustedt, Soumeya Leila Hernane, Vassil Iordanov, Thomas Jost, Wilfried Kirschenmann.
Models and Algorithms for Coarse Grained Computation
With this work we aim at extending coarse-grained modeling (and the resulting algorithms) to hierarchically composed machines such as clusters of clusters or clusters of multiprocessors.
To be usable in a Grid context, this modeling must first overcome a principal assumption of the existing models: the homogeneity of the processors and of the interconnection network. Even if the long-term goal is to target arbitrary architectures, it would not be realistic to attempt this directly; we proceed in several steps:

Hierarchical but homogeneous architectures: these are composed of a homogeneous set of processors (or processors of the same computing power) interconnected by a non-uniform network or bus that is hierarchical (CC-NUMA machines, clusters of SMPs).

Hierarchical heterogeneous architectures: here there is no established, measurable notion of efficiency or speedup. Moreover, certainly not every arbitrary collection of processors will be useful for computation on the Grid. Our aim is to give a set of concrete indications on how to construct an extensible Grid.
In parallel, we have to work on the characterization of architecture-robust efficient algorithms, i.e., algorithms that are independent, up to a certain degree, of low-level components or the underlying middleware.
Asynchronous algorithms are very good candidates, as they are robust to dynamic variations in the performance of the interconnection network. Moreover, they even tolerate the loss of computation-related messages. However, as mentioned before, they cannot be used in all cases. We will therefore focus on the feasibility of modifying those schemes in order to widen their range of applicability while preserving as much asynchronism as possible.
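As a concrete illustration, the following toy sketch (not our actual implementation) mimics an asynchronous Jacobi-style relaxation: components are updated one at a time, in an arbitrary order, using whatever values are currently visible, and the iteration still converges for a contracting update such as a diagonally dominant linear system.

```python
import random

def async_jacobi(A, b, sweeps=500, seed=0):
    """Asynchronous Jacobi-style relaxation for A x = b.

    Each 'sweep' picks one component at random and recomputes it from the
    values currently visible, mimicking unsynchronized workers reading
    possibly stale data. For a (strictly diagonally dominant) contracting
    update, the fixed point is still reached.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        i = rng.randrange(n)                       # one worker updates "its" component
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]                # uses whatever values are visible now
    return x

# Toy 2x2 diagonally dominant system: 4x + y = 1, x + 3y = 2.
x = async_jacobi([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Despite the arbitrary update order, `x` converges to the exact solution (1/11, 7/11); losing or reordering individual updates only delays, but does not break, convergence.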
The literature on fine-grained parallel algorithms is quite extensive. It contains many examples of algorithms that could be translated to our setting, and we will look for systematic descriptions of such a translation. List ranking, tree contraction and graph coloring algorithms have already been designed following the coarse-grained setting given by the PRO model [5].
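For instance, the classical fine-grained pointer-jumping idea behind list ranking, which coarse-grained variants build upon, can be sketched as follows (an illustrative sequential simulation, not the PRO implementation):

```python
import math

def list_ranking(succ):
    """Pointer jumping: succ[i] is the successor of node i in a linked
    list; the terminal node points to itself. After O(log n) synchronous
    rounds, rank[i] is the distance from i to the end of the list.
    """
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = list(succ)
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # Synchronous round: all reads use the previous round's values.
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return rank

# List 0 -> 1 -> 2 -> 3 -> 4, where node 4 is terminal.
ranks = list_ranking([1, 2, 3, 4, 4])
```

Each round doubles the distance covered by every pointer, which is why logarithmically many rounds suffice; a coarse-grained version processes blocks of nodes per processor instead of one node each.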
External Memory Computation
In the mid-nineties, several authors [47], [49] developed a connection between two different types of computation models: BSP-like models of parallel computation and I/O-efficient external memory algorithms. Their main idea is to enforce data locality during the execution of a program by simulating a parallel computation of several processors on one single processor.
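The simulation idea can be sketched as follows; the function names and the message-spooling lists (which stand in for external storage) are illustrative assumptions, not an actual implementation:

```python
def simulate_bsp(p, supersteps, init):
    """Run p virtual BSP processes on a single machine.

    Only one virtual processor's context is touched at a time, and the
    messages exchanged between supersteps are spooled in between (plain
    lists here; on disk in a real external-memory setting).

    supersteps: list of functions (rank, state, inbox) -> (state, outmsgs),
    where outmsgs is a list of (dest_rank, payload) pairs.
    """
    states = [init(r) for r in range(p)]
    inboxes = [[] for _ in range(p)]
    for step in supersteps:
        outboxes = [[] for _ in range(p)]
        for r in range(p):                 # one virtual processor at a time
            states[r], out = step(r, states[r], inboxes[r])
            for dest, payload in out:
                outboxes[dest].append(payload)
        inboxes = outboxes                 # the "disk" spool between supersteps
    return states

# Toy usage: every rank sends its value to rank 0, which then sums them.
def init(r):
    return r + 1

def send_all(r, s, inbox):
    return s, [(0, s)]

def reduce0(r, s, inbox):
    return (sum(inbox) if r == 0 else s), []

states = simulate_bsp(4, [send_all, reduce0], init)
```

Because each virtual processor is processed in one contiguous scan per superstep, the access pattern is sequential, which is exactly the locality property that makes the simulation I/O-efficient.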
While such an approach is convincing on a theoretical level, its efficient and competitive implementation is quite challenging in practice. In particular, it needs software that itself induces as little computational overhead as possible. Up to now, this seems to have been provided only by software specialized in I/O-efficient implementations.
In fact, the stability of our library parXXL, see Section 5.1, permitted its extension towards external memory computing [6]. parXXL consistently implements an abstraction between the data of a process execution and the memory of a processor. The programmer acts upon these at two different levels:

with a handle on a data array, an abstract object that is common to all parXXL processes;

with a map of its (local) part of that data into the address space of the parXXL processor, accessible as a conventional pointer.
Another add-on is the possibility of fixing a maximal number of processors (i.e., threads) to be executed concurrently.
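The two levels might be pictured roughly as follows; note that this is a hypothetical Python analogy, not the actual parXXL C++ API — `GlobalArray` and `local_map` are invented names:

```python
from array import array

class GlobalArray:
    """Hypothetical analogy of the two-level abstraction: the object itself
    plays the role of the abstract handle common to all processes, while
    local_map() yields the per-process view, analogous to mapping the local
    part of the data into the process's address space as a raw pointer."""

    def __init__(self, size, nprocs):
        assert size % nprocs == 0, "illustrative: assume an even split"
        self.data = array('d', [0.0] * size)   # the globally shared data
        self.chunk = size // nprocs

    def local_map(self, rank):
        """Zero-copy view of rank's local slice (stand-in for a pointer)."""
        lo = rank * self.chunk
        return memoryview(self.data)[lo:lo + self.chunk]

# Usage: rank 2 of 4 writes through its local view into the shared array.
g = GlobalArray(8, 4)
view = g.local_map(2)
view[0] = 3.14        # visible in g.data, no copy involved
```

The point of the separation is that the handle can live anywhere (and the data may even reside out of core), while the map gives conventional, pointer-like access only to the part a process actually needs.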
Irregular Problems
Irregular data structures like sparse graphs and matrices are in wide use in scientific computing and discrete optimization. The importance and variety of the application domains are the main motivation for the study of efficient methods on such objects. The main approaches to obtaining good results are parallel, distributed and out-of-core computation.
We follow several tracks to tackle irregular problems: automatic parallelization, design of coarse grained algorithms and the extension of these to external memory settings.
In particular, we study the management of very large graphs as they occur in reality. Here, the notion of “networks” appears in two ways: on one side, many of these graphs originate from networks that we use or encounter (Internet, Web, peer-to-peer, social networks); on the other side, the handling of these graphs has to take place in a distributed Grid environment. The principal techniques to handle these large graphs will be provided by the coarse-grained models. With the PRO model [5] and the parXXL library, we already provide tools to better design algorithms (and implement them afterward) that are adapted to these irregular problems.
In addition we will be able to rely on certain structural properties of the relevant graphs (short diameter, small clustering coefficient, power laws). This will help to design data structures that will have good locality properties and algorithms that compute invariants of these graphs efficiently.
Large Scale Computing
In application of our main algorithmic techniques, we have developed a distributed version of a stochastic control algorithm based on Dynamic Programming, which has been successfully applied to large problems on large-scale architectures.
Since 1957, Dynamic Programming has been extensively used in the field of stochastic optimization. The success of this approach is due to the fact that its implementation by backward recursion is very easy. The main drawback of the method is the number of actions and state controls to test at each time step. To tackle this problem, other methods are described in the literature, but they either require convexity of the underlying function to optimize or are not suitable for large multi-step optimizations. The Stochastic Dynamic Programming method is usually thought to be limited to problems with fewer than 3 or 4 state variables. But our parallel version currently allows us to optimize an electricity asset management problem with 7 energy stocks and 10 state variables, and still achieves both speedup and size-up.
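The backward recursion itself is indeed simple, which the following sketch illustrates on a toy inventory model (purely illustrative; not our electricity asset management code):

```python
def backward_dp(T, states, actions, transition, cost, probs):
    """Backward recursion for a finite-horizon stochastic control problem:
    V[t][s] = min over a of E_w[ cost(t,s,a,w) + V[t+1][transition(s,a,w)] ].
    probs is a list of (noise value w, probability) pairs."""
    V = {T: {s: 0.0 for s in states}}          # terminal cost-to-go
    policy = {}
    for t in range(T - 1, -1, -1):             # walk backward in time
        V[t] = {}
        for s in states:
            best = None
            for a in actions(s):               # enumerate every admissible action
                exp_cost = sum(p * (cost(t, s, a, w)
                                    + V[t + 1][transition(s, a, w)])
                               for w, p in probs)
                if best is None or exp_cost < best[0]:
                    best = (exp_cost, a)
            V[t][s], policy[(t, s)] = best
    return V, policy

# Toy model: stock s in {0, 1}, order a units (capacity 1), random demand
# w in {0, 1}; ordering costs 1 per unit, unmet demand is penalized by 5.
V, policy = backward_dp(
    T=1,
    states=[0, 1],
    actions=lambda s: range(0, 2 - s),
    transition=lambda s, a, w: max(s + a - w, 0),
    cost=lambda t, s, a, w: a + 5 * max(w - s - a, 0),
    probs=[(0, 0.5), (1, 0.5)],
)
```

The sketch also makes the drawback visible: the nested loops over states, actions and noise values are exactly the combinatorial cost that explodes with the number of state variables, and whose redistribution across processors is the difficulty addressed below.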
From a parallel computing point of view, the main difficulty has been to efficiently redistribute data and computations at each step of the algorithm. Our parallel algorithm has been successfully implemented and experimented on multi-core PC clusters (up to 256 nodes and 512 cores) and on Blue Gene/L and Blue Gene/P supercomputers (using up to 8192 nodes and 32768 cores; this machine was ranked 13th in the Top500 list in the first half of 2008). Furthermore, a strong collaboration with IBM allowed us to implement many serial optimizations that helped to decrease the execution times significantly, both on PC clusters and on the Blue Gene architecture.
Heterogeneous Architecture Programming
Clusters of heterogeneous nodes, composed of CPUs and GPUs, require complex multi-grain parallel algorithms: coarse grain to distribute tasks on cluster nodes and fine grain to run computations on each GPU. Algorithms are implemented on these architectures using a multi-paradigm parallel development environment composed of the MPI and CUDA libraries (compiling with both the gcc and NVIDIA nvcc compilers).
We investigate the design of multi-grain parallel algorithms and of a multi-paradigm parallel development environment for GPU clusters, in order to achieve both speedup and size-up on different kinds of algorithms and applications. Our main application targets are financial computations, PDE solvers, and relaxation methods.
Energy
Nowadays, people are becoming more and more aware of energy issues and are concerned with reducing their energy consumption. Computer science is no exception, and effort has to be made in our domain to optimize the energy efficiency of our systems and algorithms.
In that context, we investigate the potential benefit of using massively parallel devices such as GPUs in addition to CPUs. Although such devices have quite a high instantaneous power consumption, their energy efficiency, that is to say their flops-per-watt ratio, is often much greater than that of CPUs.
Load balancing
Although load balancing in parallel computing has been intensively studied, it is still an issue in the most recent parallel systems, whose complexity and dynamic nature keep increasing. For the Grid in particular, where nodes or links may be intermittent, the demand for decentralized algorithms is ever stronger.
With Jacques M. Bahi from the University of Franche-Comté, we work on the design and implementation of a decentralized load-balancing algorithm that works on dynamic networks. In this context, we consider that the nodes are always available but the links between them may be intermittent. For the load-balancing task, this is a rather difficult context of use. Our algorithm is based on asynchronous diffusion.
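The flavor of diffusion-based balancing on intermittent links can be sketched as follows; the link-availability model, the parameters, and the synchronous round structure are illustrative simplifications of our asynchronous setting:

```python
import random

def diffuse(load, edges, alpha=0.25, rounds=200, up_prob=0.7, seed=1):
    """First-order diffusion load balancing on a graph whose links may be
    intermittently down. Each available edge moves a fraction alpha of the
    load difference between its endpoints; total load is conserved."""
    rng = random.Random(seed)
    load = list(load)
    for _ in range(rounds):
        for i, j in edges:
            if rng.random() < up_prob:       # is the link currently available?
                flow = alpha * (load[i] - load[j])
                load[i] -= flow              # conservation: what leaves i ...
                load[j] += flow              # ... arrives at j
    return load

# All the load starts on one node of a 4-cycle; links are up 70% of the time.
load = diffuse([16.0, 0.0, 0.0, 0.0], [(0, 1), (1, 2), (2, 3), (3, 0)])
```

As long as the graph stays connected over time, the load converges toward the uniform average (here 4.0 per node); intermittent links slow convergence down but never destroy the conservation of the total load, which is what makes diffusion attractive for dynamic networks.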
Another aspect of load balancing is addressed by our team in the context of the Neurad project, a multidisciplinary project involving our team together with computer scientists and physicists from the University of Franche-Comté, centered on the problem of treatment planning for cancerous tumors by external radiotherapy. In that work, we have already proposed an original approach in which a neural network is used inside a numerical algorithm to provide radiation dose deposits in any heterogeneous environment, see [9]. The interest of the Neurad software is to combine very small computation times with an accuracy close to that of the most accurate methods (Monte Carlo). It has to be noted that Monte Carlo methods take several hours to deliver their results, whereas Neurad requires only a few minutes on a single machine.
In fact, in Neurad most of the computational cost is hidden in the learning of the internal neural network. This is why we work on the design of a parallel learning algorithm based on domain decomposition. However, as the learning of the obtained subdomains may take quite different times, a pertinent load balancing is required in order to obtain approximately the same learning time for all subdomains. The work here thus focuses on the decomposition strategy as well as on the load estimator in the context of neural learning.