Section: New Results
Algorithms and high-performance solvers
Participants: Mathieu Faverge, Sébastien Fourestier, Damien Genêt, Hervé Guillard [(Pumas)], Laurent Hascoët [(Tropics)], Pascal Hénon [Corresponding member], Cédric Lachat, Xavier Lacoste, François Pellegrini, Pierre Ramet, Cécile Dobrzynski.
Parallel domain decomposition and sparse matrix reordering
As in the previous year, the work carried out within the Scotch project (see Section 5.5) focused on four main axes.
The first axis regards the parallelization of the static mapping routines already available in the sequential version of Scotch. Since version 5.1, Scotch has provided parallel graph partitioning capabilities, but graph partitions are computed to date by means of a parallel multilevel recursive bisection framework. This framework provides partitions of very high quality for a moderate number of parts (up to about 512), but load imbalance increases dramatically for larger numbers of parts. Also, the more parts the user requests, the more expensive they are to compute, because of the recursive bisection process. In order to reduce load imbalance in the recursive bipartitioning process, a parallel load imbalance reduction algorithm has been devised for the bipartitioning case. This algorithm yields perfectly balanced subdomains, at almost no cost for mesh graphs compared to direct k-way methods, while it may significantly increase the cut for very irregular graphs. Load imbalance reduction algorithms for the k-way case are consequently mandatory, and are an objective for the year to come. In spite of these drawbacks, and thanks to the recoding of some of its routines, PT-Scotch can now partition graphs of more than 2 billion vertices, a barrier that many users wanted removed. For example, it has been able to provide perfectly balanced partitions of distributed meshes of 1.6 billion edges on 8096 processors at LLNL.
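The growth of load imbalance with the number of parts can be made concrete with the usual imbalance metric, the maximum part load divided by the average part load. The sketch below is illustrative Python, not Scotch code, and the function name is hypothetical:

```python
def imbalance(part, nparts, weights=None):
    """Load imbalance of a partition: max part load / average part load.
    A value of 1.0 means perfect balance."""
    loads = [0.0] * nparts
    for v, p in enumerate(part):
        loads[p] += 1.0 if weights is None else weights[v]
    avg = sum(loads) / nparts
    return max(loads) / avg

# 8 unit-weight vertices in 3 parts: recursive bisection may first split
# 8 -> 4 + 4, then split one half into 2 + 2, yielding parts of sizes
# 4, 2, 2 -- balanced at each bisection, yet imbalanced overall.
print(imbalance([0, 0, 0, 0, 1, 1, 2, 2], 3))  # -> 1.5
```

This is why bisection-level balance does not compose into k-way balance when the number of parts is not a power of two, and why dedicated load imbalance reduction algorithms are needed.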
The second axis concerns dynamic repartitioning. Since graphs may now comprise more than one billion vertices, distributed on machines having more than one hundred thousand processing elements, it is important to be able to compute partitions that induce as few data movements as possible with respect to a prior partition. The integration of repartitioning features into the sequential version of Scotch is now complete, with very good results, which are about to be published. The third year of Sébastien Fourestier's PhD is devoted to transposing these results to the parallel case.
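The quantity a repartitioner tries to keep small, in addition to the edge cut, is the migration volume between the old and new partitions. A minimal illustration (plain Python, not the Scotch API; the function name is ours):

```python
def migration_volume(old_part, new_part, weights=None):
    """Total weight of the vertices whose part changes between two
    partitions, i.e. the amount of data the application would have to
    move when adopting the new partition."""
    return sum((1 if weights is None else weights[v])
               for v, (p, q) in enumerate(zip(old_part, new_part))
               if p != q)

old = [0, 0, 1, 1, 2, 2]
new = [0, 1, 1, 1, 2, 0]
print(migration_volume(old, new))  # -> 2 (vertices 1 and 5 move)
```

Repartitioning is thus a bi-objective problem: a fresh partition computed from scratch may have a slightly smaller cut, yet cost far more in data redistribution.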
A third research axis regards the design of specific graph partitioning algorithms. Several applications, such as Schur complement methods for hybrid solvers (see Section 5.4), need k-way partitions in which load balance takes into account not only the vertices belonging to the subdomains, but also boundary vertices, which induce computations on each of the subdomains that share them. A sequential version is now available as a prototype, thanks to the work of Jun-Ho Her in the context of the ANR project “PETAL”, and has been successfully used in conjunction with the HIPS solver. A paper is in preparation. The transposition of these algorithms to the parallel case may prove difficult. A new direction for this research is the design of other specific algorithms, in the context of a collaboration with Sherry Li at Berkeley.
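The balance criterion involved can be sketched as follows: a boundary vertex contributes to the load of every subdomain that shares it, not just its own. This toy computation is for illustration only and is not the algorithm implemented in Scotch:

```python
def part_loads_with_halo(adj, part, nparts):
    """Per-part load where a boundary vertex (one with a neighbour in
    another part) also contributes to each neighbouring part, as in
    Schur complement methods where interface vertices induce work on
    all the subdomains sharing them."""
    loads = [0] * nparts
    for v, p in enumerate(part):
        touched = {p} | {part[w] for w in adj[v]}  # parts this vertex loads
        for q in touched:
            loads[q] += 1
    return loads

# A path graph 0-1-2-3 split into parts {0,1} and {2,3}: the interface
# vertices 1 and 2 each count towards both parts.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(part_loads_with_halo(adj, [0, 0, 1, 1], 2))  # -> [3, 3]
```

A plain vertex-count criterion would report loads of [2, 2] here; the two metrics diverge increasingly as interfaces grow, which is why dedicated partitioning algorithms are needed.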
The fourth axis is the design of efficient and scalable software tools for parallel dynamic remeshing. This is joint work with Cécile Dobrzynski, in the context of the PhD of Cédric Lachat, funded by the PUMAS team. PaMPA (“Parallel Mesh Partitioning and Adaptation”) is a middleware library dedicated to the management of distributed meshes. Its purpose is to relieve solver writers from the tedious and error-prone task of repeatedly writing service routines for mesh handling, data communication and exchange, remeshing, and data redistribution. An API for the future platform has been devised, and the coding of the mesh handling and redistribution routines is in progress. As a direct application of PaMPA, Damien Genêt, who started his PhD this fall, will write a new-generation fluid dynamics solver on top of this middleware.
High-performance direct solvers on multicore platforms
New supercomputers incorporate many microprocessors, each of which contains one or more computational cores. These new architectures induce strongly hierarchical topologies, known as NUMA (Non-Uniform Memory Access) architectures. In the context of distributed NUMA architectures, work has begun, in collaboration with the INRIA RUNTIME team, to study optimization strategies and to improve the scheduling of communications, threads and I/O. Sparse direct solvers are a basic building block of many numerical simulation algorithms. We propose to introduce dynamic scheduling designed for NUMA architectures into the PaStiX solver. The data structures of the solver, as well as its communication patterns, have been modified to meet the needs of these architectures and of dynamic scheduling. We are also interested in dynamically adapting the computation grain to efficiently exploit multicore architectures and shared memory. Experiments on several numerical test cases have been performed to demonstrate the efficiency of the approach on different architectures. M. Faverge defended his Ph.D. [39] on these aspects in the context of the NUMASIS ANR CIGC project.
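The flavour of such NUMA-aware dynamic scheduling can be conveyed by a local-first task queue with work stealing: each node prefers tasks whose data live in its own memory and only steals remote work when idle. The toy simulation below is a sketch under our own simplifying assumptions (the name `numa_schedule` and the round-robin worker model are hypothetical), not PaStiX's actual scheduler, which also weighs data placement and communication costs:

```python
from collections import deque

def numa_schedule(tasks, nnodes):
    """Locality-first dynamic scheduling: each NUMA node keeps its own
    task queue; a worker pops local tasks first and only steals from the
    most loaded remote queue when its own queue is empty.
    `tasks` maps task id -> preferred node; returns the node each task
    actually ran on."""
    queues = [deque() for _ in range(nnodes)]
    for t, node in tasks.items():
        queues[node].append(t)
    ran_on = {}
    node = 0
    while any(queues):
        if queues[node]:
            ran_on[queues[node].popleft()] = node    # local execution
        else:
            victim = max(range(nnodes), key=lambda n: len(queues[n]))
            ran_on[queues[victim].popleft()] = node  # work stealing
        node = (node + 1) % nnodes                   # round-robin workers
    return ran_on

# Three tasks prefer node 0, one prefers node 1: node 1 runs out of local
# work and steals one of node 0's tasks.
print(numa_schedule({0: 0, 1: 0, 2: 0, 3: 1}, 2))
```

The point of the sketch is the trade-off the real scheduler must arbitrate: strict locality keeps memory accesses cheap but leaves cores idle, while unrestricted stealing keeps cores busy at the price of remote memory traffic.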
Hybrid direct-iterative solver based on a Schur complement approach
In HIPS, we propose several algorithmic variants to solve the Schur complement system that can be adapted to the geometry of the problem: typically, some strategies are more suitable for systems coming from a 2D problem discretisation and others for a 3D problem; the choice of the method also depends on the numerical difficulty of the problem. We have a parallel version of HIPS that provides full iterative methods as well as hybrid methods that mix a direct factorization inside the domains with an iterative method on the Schur complement.
In [41], we presented a hybrid version of the solver where the Schur complement preconditioner was built using a parallel scalar ILUT algorithm. This year we have also developed a parallel version of the algorithms where the incomplete factorization of the Schur complement is done using a dense block structure. That is to say, no terms are dropped in the Schur complement preconditioner other than those prescribed by the block pattern defined by the HID graph partitioning. This variant of the preconditioner is more expensive in terms of memory, but for some difficult test cases it is the only alternative to direct solvers. A general comparison of all the hybrid methods in HIPS has been presented in [42].
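For reference, the object these preconditioners approximate is the Schur complement S = A22 − A21 A11⁻¹ A12 obtained after eliminating the interior unknowns of each subdomain. The small dense sketch below (exact rational arithmetic, Gaussian elimination) is for illustration only; HIPS works on sparse block structures and, in the variants above, on incomplete factorizations of S:

```python
from fractions import Fraction as F

def schur_complement(A, n1):
    """Dense Schur complement S = A22 - A21 * inv(A11) * A12 for a matrix
    partitioned after its first n1 rows/columns (the interior unknowns),
    computed by eliminating those unknowns with Gaussian elimination."""
    n = len(A)
    M = [row[:] for row in A]
    for k in range(n1):                      # eliminate interior unknowns
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n):
                M[i][j] -= f * M[k][j]
    return [row[n1:] for row in M[n1:]]      # the remaining block is S

A = [[F(4), F(1), F(2)],
     [F(1), F(3), F(1)],
     [F(2), F(1), F(5)]]
print(schur_complement(A, 2))  # -> [[Fraction(43, 11)]]
```

Even in this 3-by-3 example the Schur complement entry mixes contributions from the whole interior block, which is why storing S exactly is memory-hungry and why dropping strategies (scalar ILUT, or the block pattern of the HID partitioning) matter.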
J. Gaidamour defended his Ph.D. [40] on the hybrid solver techniques developed in HIPS.
MURGE, a common interface to sparse linear solvers
This year we have also defined a general programming interface for sparse linear solvers. Our goal is to normalize the API of sparse linear solvers and to provide very simple ways of performing tedious tasks such as parallel matrix assembly. We have thus proposed a generic API specification called MURGE (http://murge.gforge.inria.fr), for which we also provide test programs and documentation. This interface has been implemented in HIPS and PaStiX. We have also tested this interface in RealfluiDS and JOREK, with both HIPS and PaStiX.
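The kind of bookkeeping such an interface hides from the user can be illustrated by a toy sequential assembly routine that sums duplicate contributions (as produced, e.g., by finite-element assembly) into CSR storage. The function below is a hypothetical sketch of ours, not part of the MURGE specification:

```python
def assemble_csr(n, entries):
    """Sum duplicate (i, j, v) contributions and build CSR arrays
    (rowptr, colind, values) for an n-by-n sparse matrix -- the tedious
    bookkeeping a common solver interface takes off the user's hands,
    and which MURGE additionally handles in parallel."""
    acc = {}
    for i, j, v in entries:                     # accumulate duplicates
        acc[(i, j)] = acc.get((i, j), 0.0) + v
    rowptr = [0] * (n + 1)
    colind, values = [], []
    for (i, j) in sorted(acc):                  # row-major CSR ordering
        rowptr[i + 1] += 1
        colind.append(j)
        values.append(acc[(i, j)])
    for i in range(n):                          # prefix-sum row counts
        rowptr[i + 1] += rowptr[i]
    return rowptr, colind, values

# Two elements both contribute to entry (0, 0):
print(assemble_csr(2, [(0, 0, 1.0), (0, 1, 2.0), (0, 0, 3.0), (1, 1, 5.0)]))
# -> ([0, 2, 3], [0, 1, 1], [4.0, 2.0, 5.0])
```

In the parallel setting the same task also involves deciding which process owns each entry and exchanging off-process contributions, which is precisely what makes a normalized assembly API valuable.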
To summarize, new supercomputers incorporate many microprocessors, each of which contains one or more computational cores, inducing strongly hierarchical topologies. On the one hand, we have introduced dynamic scheduling designed for these architectures into the PaStiX solver. On the other hand, we have a parallel version of HIPS that provides full iterative methods as well as hybrid methods that mix a direct factorization inside the domains with an iterative method on the Schur complement. Moreover, graph and mesh partitioners (the Scotch software, for instance) are now able to deal with problems that have several billion unknowns. Solving the corresponding linear systems is clearly the limiting step towards this challenge in numerical simulations. An important aim of this work is the design and implementation of a sparse linear solver that can exploit the power of these new supercomputers. We will have to propose solutions to the following problems:

fully parallel and scalable preprocessing steps (ordering and symbolic factorization);

efficient algorithmic coupling of direct and iterative methods, allowing a powerful management of all levels of parallelism;

adapted scheduling of computation tasks to take advantage of runtime systems that operate on mixed architectures with multicore processors and GPUs.