## Section: New Results

### Algorithms and high-performance solvers

Participants: Mathieu Faverge, Sébastien Fourestier, Jérémie Gaidamour, Pascal Hénon [Corresponding member], Jun-Ho Her, Cédric Lachat, Xavier Lacoste, François Pellegrini, Pierre Ramet.

#### Parallel domain decomposition and sparse matrix reordering

The work carried out within the `Scotch` project (see Section 5.5)
focused on four main axes.

The first one regards the parallelization of the static mapping
routines already available in the sequential version of `Scotch`. Since
version 5.1, released last year, `Scotch` has provided
parallel graph partitioning capabilities, but graph partitions are
to date computed by means of a parallel multilevel recursive bisection
framework. This framework provides partitions of very high quality for
a moderate number of parts (up to about 512), but load imbalance
increases dramatically for larger numbers of parts. Also, the more
parts the user wants, the more expensive it is to compute them,
because of the recursive bisection process. Consequently, efforts
this year focused on designing a direct k-way parallel graph
partitioning framework. In fact, the problem considered
in this respect is not plain graph partitioning but static mapping,
because of the increasing need to take into account the topology of
the target machine when assigning computations to processing elements.
Preliminary results were obtained [38] during the
second post-doc year of Jun-Ho Her, but much remains to be done, as
the cost of parallel direct k-way static mapping algorithms is still
extremely high compared to sequential methods.
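
To make the distinction between plain partitioning and static mapping concrete, the following minimal sketch evaluates both objectives on a toy graph. It is an illustration only: the function names and the distance-matrix model of the target machine are assumptions made here, not the `Scotch` API or its internal cost function.

```python
# Illustrative sketch only: names and the distance model are assumptions,
# not the Scotch API.

def edge_cut(edges, part):
    """Total weight of edges whose two ends lie in different parts."""
    return sum(w for u, v, w in edges if part[u] != part[v])

def mapping_cost(edges, part, dist):
    """Communication cost: each edge weight is multiplied by the distance
    between the processing elements hosting its two ends."""
    return sum(w * dist[part[u]][part[v]] for u, v, w in edges)

def imbalance(vertex_weights, part, nparts):
    """Ratio of the heaviest part load to the average load (1.0 is perfect)."""
    loads = [0] * nparts
    for v, w in enumerate(vertex_weights):
        loads[part[v]] += w
    return max(loads) / (sum(loads) / nparts)

# Path graph 0-1-2-3 with unit edge weights, one vertex per part.
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]
weights = [1, 1, 1, 1]

# Hypothetical target machine: 4 processing elements on a line,
# so the distance between elements p and q is |p - q|.
dist = [[abs(p - q) for q in range(4)] for p in range(4)]

aligned = [0, 1, 2, 3]    # neighbouring vertices on neighbouring elements
scattered = [0, 2, 1, 3]  # same parts, placed without regard to topology

print(edge_cut(edges, aligned), mapping_cost(edges, aligned, dist))      # 3 3
print(edge_cut(edges, scattered), mapping_cost(edges, scattered, dist))  # 3 5
```

Both assignments have the same edge cut and the same load balance, yet their communication costs differ once the topology of the target machine is taken into account, which is what a static mapping objective captures.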

The second axis concerns dynamic repartitioning. Since graphs may now
comprise more than one billion vertices, distributed on machines
having more than one hundred thousand processing elements, it is
important to be able to compute partitions which create as few data
movements as possible with respect to a prior partition. The
integration of repartitioning features into the sequential version of
`Scotch` is currently under way, in the context of the PhD of
Sébastien Fourestier, and will be extended to the parallel domain
afterwards. These two axes were partially supported by the
ANR-CIS project “SOLSTICE”.
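
The quantity that repartitioning tries to keep small can be sketched as follows. This is a simplified model written for illustration, under the assumption that the migration cost of a vertex equals its weight; the function name is hypothetical and does not come from `Scotch`.

```python
# Illustrative sketch only: migration cost is assumed equal to vertex weight.

def migration_volume(vertex_weights, old_part, new_part):
    """Total weight of the vertices that change part between two partitions."""
    return sum(w for w, p, q in zip(vertex_weights, old_part, new_part) if p != q)

weights = [1, 1, 1, 1, 1, 1]
old = [0, 0, 0, 0, 1, 1]          # prior partition, now imbalanced (4 vs 2)

candidate_a = [0, 0, 0, 1, 1, 1]  # rebalanced by moving a single vertex
candidate_b = [1, 1, 1, 0, 0, 0]  # same two groups, part labels swapped

print(migration_volume(weights, old, candidate_a))  # 1
print(migration_volume(weights, old, candidate_b))  # 5
```

The two candidates split the vertices into the same two groups and are therefore equally good as partitions, but the second one relocates almost all of the data. A repartitioning method has to account for the prior partition to prefer the first.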

A third research axis regards the design of specific graph partitioning algorithms. Several applications, such as Schur complement methods for hybrid solvers (see Section 5.4), need k-way partitions in which load balance takes into account not only the vertices belonging to the sub-domains, but also the boundary vertices, which induce computations on each of the sub-domains that share them. This work, which had been temporarily set aside for lack of manpower, is now being resumed by Jun-Ho Her, in the context of the ANR project “PETAL”.
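
The effect of boundary vertices on load balance can be illustrated with a small sketch. It assumes a vertex-separator model in which separator vertices are marked with the value `-1` and each vertex carries unit load; the names and the convention are chosen here for illustration and are not taken from `Scotch`.

```python
# Illustrative sketch only: unit vertex loads, separator vertices marked -1.

SEPARATOR = -1

def domain_loads(adjacency, part, nparts):
    """Load of each sub-domain, counting its interior vertices plus every
    separator vertex adjacent to at least one of its interior vertices."""
    loads = [0] * nparts
    for v, p in enumerate(part):
        if p != SEPARATOR:
            loads[p] += 1
        else:
            touched = {part[u] for u in adjacency[v] if part[u] != SEPARATOR}
            for q in touched:
                loads[q] += 1  # a shared vertex costs work in every domain it touches
    return loads

# Path graph with 8 vertices: 0-1-2-3-4-5-6-7.
n = 8
adjacency = [[u for u in (v - 1, v + 1) if 0 <= u < n] for v in range(n)]

# Three sub-domains of two interior vertices each, split by vertices 2 and 5.
part = [0, 0, SEPARATOR, 1, 1, SEPARATOR, 2, 2]

print(domain_loads(adjacency, part, 3))  # [3, 4, 3]
```

The interiors are perfectly balanced (two vertices each), but the middle sub-domain shares two boundary vertices and ends up with a larger load, which a partitioner balancing only interior vertices would not see.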

The fourth axis is the design of efficient and scalable software tools
for parallel dynamic remeshing. This is joint work with Cécile
Dobrzynski, which took form this fall with the start of the PhD of
Cédric Lachat. Cédric started his work by devising cache-oblivious
orderings of the unknowns in the `FluidBox` and `MMG3D` 3D software, in
order to speed up computations.

#### High-performance direct solvers on multi-platforms

In order to solve linear systems of equations arising from 3D problems with more than 50 million unknowns, which is now a reachable challenge for new SMP supercomputers, parallel solvers must keep good time scalability and must control the memory overhead caused by the extra structures required to handle communications.

**Static parallel supernodal approach.**
In the context of new SMP node architectures, we proposed to
fully exploit shared memory advantages. A relevant approach is then
to use a hybrid MPI-thread implementation. This approach, not yet
explored in the framework of direct solvers, aims at efficiently solving
3D problems with many more than 50 million unknowns. The rationale
that motivates this hybrid implementation is that the communications
within an SMP node can be advantageously replaced by direct
accesses to shared memory between the processors of the SMP node using
threads. In addition, the MPI communications between processes are
grouped by SMP node. We have shown that this approach allows a great
reduction of the memory required for communications.
Many factorization algorithms are now implemented in real or complex
variables, for single or double precision: LLt (Cholesky), LDLt
(Crout) and LU with static pivoting (for non-symmetric matrices having
a symmetric pattern). This latter version is now integrated in the
`FluidBox` software.
A survey article on these techniques is under preparation and will be
submitted to the SIAM Journal on Matrix Analysis and Applications.
It will present the detailed algorithms and the most recent results.
We still have to add numerical pivoting techniques to our processing to
improve the robustness of our solver.

**Adaptation to NUMA architectures.**
New supercomputers incorporate many microprocessors, which themselves include one or many computational cores. These new architectures induce strongly hierarchical topologies and are called NUMA architectures.
In the context of distributed NUMA architectures,
work has begun, in collaboration with the INRIA RUNTIME
team, to study optimization strategies and to improve the scheduling
of communications, threads and I/O.
Sparse direct solvers are a basic building block of many numerical simulation algorithms. We propose to introduce a dynamic scheduling designed for NUMA architectures in the `PaStiX` solver. The data structures of the solver, as well as the communication patterns, have been modified to meet the needs of these architectures and of dynamic scheduling. We are also interested in the dynamic adaptation of the computation grain, in order to make efficient use of multi-core architectures and shared memory. Experiments on several numerical test cases have been performed to assess the efficiency of the approach on different architectures.
M. Faverge defended his Ph.D. [1] on these aspects in the context of the NUMASIS ANR CIGC project.

#### Hybrid direct-iterative solver based on a Schur complement approach

In `HIPS`, we propose several algorithmic variants to solve the Schur complement
system that can be adapted to the geometry of the problem: typically,
some strategies are more suitable for systems coming from a 2D problem
discretisation and others for a 3D problem; the choice of the method
also depends on the numerical difficulty of the problem.
We have a parallel version of `HIPS` that provides fully iterative methods as
well as hybrid methods that mix a direct factorization inside the
domains and an iterative method on the Schur
complement.
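
As a reminder of the underlying algebra, the sketch below solves a small system by eliminating the interior unknowns and then solving the Schur complement system on the interface. It is a dense, exact-arithmetic illustration written for this text: the helper names (`solve`, `block`, `schur_solve`) are hypothetical and not part of `HIPS`, and the Schur system is solved here directly, whereas a hybrid solver applies a preconditioned iterative method at that step.

```python
# Illustrative sketch only: dense exact arithmetic, not the HIPS implementation.
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[piv][col] == 0:
            raise ValueError("singular matrix")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x

def block(a, rows, cols):
    """Extract the sub-matrix of a restricted to the given rows and columns."""
    return [[a[i][j] for j in cols] for i in rows]

def schur_solve(a, b, interior, interface):
    """Solve a x = b through the Schur complement on the interface unknowns."""
    ni, nb = len(interior), len(interface)
    a_ii, a_ib = block(a, interior, interior), block(a, interior, interface)
    a_bi, a_bb = block(a, interface, interior), block(a, interface, interface)
    b_i = [b[i] for i in interior]
    b_b = [b[i] for i in interface]
    # Interior solves (the "direct" part): y = A_II^-1 b_I and Z = A_II^-1 A_IB.
    y = solve(a_ii, b_i)
    z = [solve(a_ii, [row[k] for row in a_ib]) for k in range(nb)]  # columns of Z
    # Schur complement S = A_BB - A_BI Z and reduced right-hand side g.
    s = [[a_bb[r][k] - sum(a_bi[r][j] * z[k][j] for j in range(ni))
          for k in range(nb)] for r in range(nb)]
    g = [b_b[r] - sum(a_bi[r][j] * y[j] for j in range(ni)) for r in range(nb)]
    x_b = solve(s, g)  # a hybrid solver would iterate on this system instead
    # Back-substitution for the interior unknowns.
    rhs = [b_i[j] - sum(a_ib[j][k] * x_b[k] for k in range(nb)) for j in range(ni)]
    x_i = solve(a_ii, rhs)
    x = [None] * len(b)
    for j, i in enumerate(interior):
        x[i] = x_i[j]
    for k, i in enumerate(interface):
        x[i] = x_b[k]
    return x

# 1D Laplacian on 3 unknowns; the middle unknown is the interface.
a = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
b = [1, 1, 1]
x = schur_solve(a, b, interior=[0, 2], interface=[1])
print([str(v) for v in x])  # ['3/2', '2', '3/2']
```

In this example the two interior unknowns are decoupled once the interface is removed, so the interior solves are independent of each other; this independence is what makes the approach attractive in parallel.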

In [34], we presented a hybrid version of the solver where the Schur complement preconditioner was built using a parallel scalar ILUT algorithm. This year we have also developed a parallel version of the algorithms where the incomplete factorization of the Schur complement is done using a dense block structure. That is to say, no terms are dropped in the Schur complement preconditioner other than those prescribed by the block pattern defined by the HID graph partitioning. This variant of the preconditioner is more expensive in terms of memory, but for some difficult test cases it is the only alternative to direct solvers. A general comparison of all the hybrid methods in HIPS has been presented in [35].
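
The difference between the two dropping policies can be sketched on a single sparse row. The code below is a simplified illustration and not the `HIPS` implementation: `ilut_drop` applies the usual ILUT dual rule (relative threshold `tau`, then at most `p` entries kept) to the whole row at once, whereas a full ILUT treats the lower and upper parts separately and always keeps the diagonal; `pattern_drop` keeps every entry allowed by a prescribed pattern, here a hypothetical set of column indices.

```python
# Illustrative sketch only: simplified dropping rules applied to one row.
import math

def ilut_drop(row, tau, p):
    """Keep entries with |v| >= tau * ||row||_2, then only the p largest."""
    norm = math.sqrt(sum(v * v for v in row.values()))
    kept = {j: v for j, v in row.items() if abs(v) >= tau * norm}
    largest = sorted(kept, key=lambda j: abs(kept[j]), reverse=True)[:p]
    return {j: kept[j] for j in sorted(largest)}

def pattern_drop(row, allowed_cols):
    """Keep every entry inside the prescribed pattern, whatever its size."""
    return {j: v for j, v in row.items() if j in allowed_cols}

row = {0: 4.0, 1: -0.01, 2: 3.0, 5: 0.5}  # column index -> value

print(ilut_drop(row, tau=0.05, p=2))  # {0: 4.0, 2: 3.0}
print(pattern_drop(row, {0, 1, 2}))   # {0: 4.0, 1: -0.01, 2: 3.0}
```

Numerical dropping yields a sparser but less accurate row, while pattern-based dropping retains small entries inside the allowed blocks, which costs more memory.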

This year, J. Gaidamour defended his Ph.D. [33] on the hybrid solver techniques developed in HIPS.

#### MURGE: a common interface to sparse linear solvers

This year we have also defined a general programming interface for sparse linear solvers.
Our goal is to standardize the API of sparse linear solvers and to provide very simple ways
of performing tedious tasks such as parallel matrix assembly.
We have thus proposed a generic API specification called MURGE (http://murge.gforge.inria.fr)
and also provided some test programs and documentation.
This interface has been implemented in `HIPS` and `PaStiX`.
We have also tested this interface in `FluidBox` and `JOREK` with both `HIPS` and `PaStiX`.