Section: Application Domains
High Performance Computing
Models and Algorithms for Coarse Grained Computation
With this work we aim at extending the coarse grained modeling (and the resulting algorithms) to hierarchically composed machines such as clusters of clusters or clusters of multiprocessors.
To be usable in a Grid context this modeling has first of all to overcome a principal constraint of the existing models: the idea of an homogeneity of the processors and the interconnection network. Even if the long term goal is to target arbitrary architectures it would not be realistic to think to achieve this directly, but in different steps:
Hierarchical but homogeneous architectures: these are composed of an homogeneous set of processors (or of the same computing power) interconnected with a non-uniform network or bus which is hierarchic (CC-Numa, clusters of SMPs).
Hierarchical heterogeneous architectures: there is no established measurable notion of efficiency or speedup. Also most certainly not any arbitrary collection of processors will be useful for computation on the Grid. Our aim is to be able to give a set of concrete indications of how to construct an extensible Grid.
In parallel, we have to work upon the characterization of architecture-robust efficient algorithms, i.e., algorithms that are independent, up to a certain degree, of low-level components or the underlying middleware.
The literature about fine grained parallel algorithms is quite exhaustive. It contains a lot of examples of algorithms that could be translated to our setting, and we will look for systematic descriptions of such a translation.
List ranking, tree contraction and graph coloring algorithms already have been designed following the coarse grained setting given by the model PRO  .
To work in the direction of understanding of what problems might be ``hard '' we tackled a problem that is known to be P-complete in the PRAM/NC framework, but for which not much had been known when only imposing the use of relatively few processors: the lexicographic first maximal independent set (LFMIS) problem  .
We already are able to give a work optimal algorithm in case we have about logn processors and thus to prove that the NC classification is not necessarily appropriate for today's parallel environments which consist of few processors (up to some thousands) and large amount of data (up to some terabytes).
External Memory Computation
In the mid-nineties several authors  ,  developed a connection between two different types of computation models: BSP-like models of parallel computation and IO efficient external memory algorithms. Their main idea is to enforce data locality during the execution of a program by simulating a parallel computation of several processors on one single processor.
Whereas such an approach is convincing on a theoretical level, its efficient and competitive implementation is quite challenging in practice. In particular, it needs software that induces as little computational overhead as possible by itself. Up to now, it seems that this has only been provided by software specialized in IO efficient implementations.
In fact, the stability of our library parXXL (formerly SSCRAP ), see Section 5.1 , also showed in its extension towards external memory computing  . parXXL nas a consequent implementation of an abstraction between the data of a process execution and the memory of a processor. The programmer acts upon these on two different levels:
with a sort of handle on some data array which is an abstract object that is common to all parXXL processes.
with a map of its (local) part of that data into the address space of the parXXL processor, accessible as a conventional pointer.
Another add-on was the possibility to fix a maximal number of processors (i.e., threads) that should be executed concurrently
In  , we develop a pipeline algorithm aware of the use of external memory to store the handled data. The originality of our approach is to overlap computation, communication, and IO through an original strategy using several memory blocks accessed in a cyclic manner. The resulting pipeline algorithm achieves a saturation of the disk resource which is the bottleneck in algorithms relying on external memory.
Irregular data structures like sparse graphs and matrices are in wide use in scientific computing and discrete optimization. The importance and the variety of application domains are the main motivation for the study of efficient methods on such type of objects. The main approaches to obtain good results are parallel, distributed and out-of-core computation.
We follow several tracks to tackle irregular problems: automatic parallelization, design of coarse grained algorithms and the extension of these to external memory settings.
In particular we study the possible management of very large graphs, as they occur in reality. Here, the notion of ``networks '' appears twofold: on one side many of these graphs originate from networks that we use or encounter (Internet, Web, peer-to-peer, social networks) and on the other the handling of these graphs has to take place in a distributed Grid environment. The principal techniques to handle these large graphs will be provided by the coarse grained models. With the PRO model  and the parXXL library we already provide tools to better design algorithms (and implement them afterwards) that are adapted to these irregular problems.
In addition we will be able to rely on certain structural properties of the relevant graphs (short diameter, small clustering coefficient, power laws). This will help to design data structures that will have good locality properties and algorithms that compute invariants of these graphs efficiently.