## Section: New Results

Keywords: sparse matrix ordering, parallel sparse matrix ordering, graphs and irregular meshes, scalable parallel graph partitioning, high-performance computing, direct and hybrid direct-iterative solvers, parallel fast multipole methods.

### Algorithms and high-performance solvers

Participants : Patrick Amestoy, Olivier Beaumont, Cédric Chevalier, Olivier Coulaud, Mathieu Faverge, Pierre Fortin, Jérémie Gaidamour, Abdou Guermouche, Pascal Hénon, François Pellegrini, Pierre Ramet, Jean Roman.

#### Parallel domain decomposition and sparse matrix reordering

The work carried out within the `Scotch` project (see section 5.8) focused
on the design and coding of the parallel graph ordering routines,
which are the first building blocks of the PT-Scotch (for ``Parallel
Threaded `Scotch` '') parallel graph partitioning and ordering library.
PT-Scotch is being designed to handle graphs of up to
a billion vertices on architectures of a thousand processors. This
work is carried out in the context of the PhD thesis of Cédric
Chevalier, who started in October 2004 and will graduate by September
2007. Achieving efficient and scalable parallel graph
partitioning requires a parallel multi-level
framework in which distributed graphs are collapsed down to a size
that can be handled by a single processor, on which a sequential
partition is computed by means of the existing `Scotch` tool; this
coarse solution is then expanded and refined, level by level, until
a partition of the original distributed graph is obtained.
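The coarsen-partition-prolong cycle described above can be sketched as follows. This is a minimal illustration, not PT-Scotch code: the heavy-edge-free greedy matching, the naive coarsest-level split and all names are ours, and the per-level refinement pass (essential in practice) is only marked by a comment.

```python
# Illustrative multilevel bipartitioning sketch (NOT PT-Scotch's algorithm).
# A graph is an adjacency dict {vertex: set(neighbors)}.

def coarsen(adj):
    """Collapse a maximal matching of edges; return the coarse graph
    and the fine-to-coarse vertex mapping."""
    mapping, used, nxt = {}, set(), 0
    for u in adj:
        if u in used:
            continue
        # match u with its first unmatched neighbor, if any
        v = next((w for w in adj[u] if w not in used and w != u), None)
        used.add(u)
        mapping[u] = nxt
        if v is not None:
            used.add(v)
            mapping[v] = nxt
        nxt += 1
    coarse = {c: set() for c in range(nxt)}
    for u, nbrs in adj.items():
        for w in nbrs:
            if mapping[u] != mapping[w]:
                coarse[mapping[u]].add(mapping[w])
    return coarse, mapping

def bipartition_small(vertices):
    """Naive split of the coarsest graph (a real partitioner runs here)."""
    verts = sorted(vertices)
    half = len(verts) // 2
    return {v: (0 if i < half else 1) for i, v in enumerate(verts)}

def partition(adj, threshold=4):
    """Multilevel bipartition: coarsen until small, split, project back."""
    if len(adj) <= threshold:
        return bipartition_small(adj)
    coarse, mapping = coarsen(adj)
    if len(coarse) == len(adj):          # nothing matched: stop coarsening
        return bipartition_small(adj)
    coarse_part = partition(coarse, threshold)
    # prolong the coarse partition to the fine graph; a local-optimization
    # refinement pass would normally run here at every level
    return {v: coarse_part[mapping[v]] for v in adj}
```

On a path graph, the coarse split projects back to a balanced bipartition cutting a single edge.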

The multi-level framework that we have developed is as
follows. Distributed graphs are coarsened using an
asynchronous multi-threaded algorithm, and coarsened graphs are
folded onto one half of the processors, while the other half receives a
similar copy. The coarsening and folding process then goes on
independently on each of the halves, then quarters, eighths, and so
on, of the processors, so that a different coarsened graph is obtained
on every processor; this provides a better exploration of the problem
space when initial partitions are sequentially computed on each of
them. The best solutions are then unfolded and propagated back to the
finer graphs, selecting the better of the two partitions at each
unfolding stage. In order to preserve partition quality at the finest
levels, the best unfolded partitions have to be refined at every step
by means of local optimization algorithms.
The problem is that the best sequential local optimization algorithms
do not parallelize well, as they are intrinsically sequential, and attempts
to relax this strong sequential constraint can lead to a severe loss of
partition quality as the number of processors increases. However,
since the refinement is local, we have overcome this difficulty
by proposing in [5] an algorithm in which the optimization
is applied only to band graphs, which contain the vertices
that lie at a small fixed distance (typically 3) from the
projected separators. What we have implemented is a multi-sequential
approach: at every uncoarsening step, a distributed band
graph is created, and centralized copies of this band graph are
then created on every participating processor. These copies can be
used collectively to run a scalable parallel multi-deme genetic
optimization algorithm, or fully independent runs of
a full-featured sequential optimization algorithm. The best refined band
separator is then projected back to the distributed graph, and the
uncoarsening process goes on.
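Extracting the band around a projected separator amounts to a breadth-first search truncated at the chosen distance. The following sketch (our own illustrative code, not Scotch's) builds the induced subgraph on all vertices within `width` hops of the separator:

```python
from collections import deque

# Illustrative band-graph extraction: keep only vertices within `width`
# hops of the projected separator; refinement then runs on this subgraph.
def band_graph(adj, separator, width=3):
    dist = {v: 0 for v in separator}
    queue = deque(separator)
    while queue:
        u = queue.popleft()
        if dist[u] == width:        # frontier reached: stop expanding
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    band = set(dist)
    # induced subgraph on the band (a real implementation also adds
    # anchor vertices standing for the two discarded graph parts)
    return {v: adj[v] & band for v in band}
```

On a path of 10 vertices with separator {5} and width 3, the band is the 7 vertices 2 through 8.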
All of these algorithms have been successfully implemented in
PT-Scotch, which can compute in parallel orderings whose quality
equals, and sometimes exceeds, that of the orderings computed by the
sequential `Scotch` [74]. However, due to the amount of data
to be exchanged during the coarsening phase, scalability
remains an issue to be addressed.

#### High-performance direct solvers for multiple platforms

In order to solve linear systems of equations arising from 3D problems with more than 10 million unknowns, which is now a reachable challenge for new SMP supercomputers, parallel solvers must retain good time scalability and must control the memory overhead caused by the extra structures required to handle communications.

**Static parallel supernodal approach.** Some experiments were run
on the TERA parallel computer at CEA, and the factorization times are
close to the ones obtained on the IBM SP3 of CINES (Montpellier,
France). For example, on our largest problem (26 million unknowns
for a 2.5D problem), we reach 500 Gflops on 768 processors, that is,
about 50% of the peak performance of the TERA computer.
In the context of new SMP node architectures, we proposed to
fully exploit the advantages of shared memory. A relevant approach is
then to use a hybrid MPI-thread implementation. This approach, not yet
explored in the framework of direct solvers, aims at efficiently solving
3D problems with far more than 10 million unknowns. The rationale
behind this hybrid implementation is that communications
within an SMP node can advantageously be replaced by direct
accesses to shared memory between the processors of the SMP node, using
threads. In addition, the MPI communications between processes are
grouped by SMP node. We have shown that this approach allows a great
reduction of the memory required for
communications [79], [81].
Many factorization algorithms are now implemented in real or complex
arithmetic, in single or double precision: LLt (Cholesky), LDLt
(Crout) and LU with static pivoting (for non-symmetric matrices having
a symmetric pattern). This latter version is now integrated in the
`FluidBox` software [48].
A survey article on these techniques is under preparation and will be
submitted to the SIAM Journal on Matrix Analysis and Applications;
it will present the detailed algorithms and the most recent results.
We still have to add numerical pivoting techniques to our processing to
improve the robustness of our solver. Moreover, in collaboration with
the MUMPS developers (see section 5.4), we want to adapt out-of-core
techniques to overcome physical memory constraints.
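The factorize-once, solve-many pattern of a sparse direct solver can be illustrated with SciPy's SuperLU interface. This is only a point of comparison, not the team's solver: SuperLU is a different package, and the 1D Poisson matrix below is merely a convenient symmetric-pattern test case.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Sparse direct solve with SuperLU via SciPy: an LU factorization with
# pivoting of the same general kind as discussed above (illustration only).
n = 100
# 1D Poisson matrix: symmetric pattern, well suited to an LLt/LU solver
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

lu = splu(A)      # factorize once ...
x = lu.solve(b)   # ... then solve, possibly for many right-hand sides
```

The residual `A @ x - b` is at machine-precision level, since the factorization is exact up to rounding.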

**Dynamic parallel multifrontal approach.**
The memory usage of
sparse direct solvers can be the bottleneck to solve large-scale
problems involving sparse systems of linear equations of the form
Ax=b. If memory is not large enough to treat a given problem, disks
must be used to store data that cannot fit in memory (*out-of-core* storage). In a previous work, we proposed a first
out-of-core extension of a parallel multifrontal approach based on the
solver MUMPS, where only the computed factors were written to disk
during the factorization. This year, we have studied in detail the
minimum memory requirements of this parallel multifrontal approach and
proposed several mechanisms to decrease those requirements further. For
a given amount of memory, we have also studied the
volume of disk accesses involved during the factorization of the matrix A,
providing insight into the extra cost that can be expected from I/O.
Furthermore, we have studied the impact of low-level I/O mechanisms,
and have in particular shown the interest (and difficulty) of relying
on direct I/O. Large-scale problems from applicative fields have been
used to illustrate our discussions. This work is performed in the
context of the PhD of Emmanuel Agullo, in collaboration with Jean-Yves
L'Excellent (INRIA project GRAAL) and Patrick Amestoy (ENSEEIHT-IRIT).
Once the factors are on disk, they have to be read back for the solution
step. In order to improve that step, we collaborate with
Tzvetomila Slavova (Ph.D., CERFACS), who focuses on it.
For instance, we are currently designing an algorithm that
schedules the I/O so as to separate the L and U factors on disk
during the factorization step in the unsymmetric case: this will
halve the number of reads at the solution step for unsymmetric
matrices.
A collaboration with Xiaoye S. Li and Esmond G. Ng (Lawrence Berkeley
National Laboratory, Berkeley, California, USA) was started to compare
the multifrontal factorization with the left-looking one.

We work with Tzvetomila Slavova (Ph.D. at CERFACS) on the study of the performance of the out-of-core solution phases (forward and backward substitutions). In many applications, the solution phase can be invoked many times for a single factorization; in an out-of-core context, the solution phase can thus become even more costly than the factorization. As a first approach, one can rely on system buffers (the page cache), accessing the disks with a demand-driven scheme. We have shown that this approach is not well suited, because the system cannot correctly "choose" which data must be kept in memory; furthermore, it is important to really control the amount of memory effectively used, system buffers included. Therefore, a version with direct I/O mechanisms (which bypasses the system buffers) has been introduced, and we have shown that its performance is comparable to that of the system approach, with the advantage of effectively controlling the memory usage (no system caches are hidden from the application). In a parallel environment, we have also shown that the order in which the dependency tree is processed can have a very strong impact on performance, because of the (ir)regularity of the disk accesses involved. Finally, we proposed some heuristics that aim at influencing the scheduling decisions taken during the solution step so as to ensure high locality of the disk accesses.
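The basic mechanism behind such out-of-core schemes, writing factor blocks to disk during factorization and fetching them by recorded offset during the solve, can be sketched in a few lines. This is a toy layout of our own, not MUMPS code; the block names and pickle-based serialization are purely illustrative.

```python
import pickle
import tempfile

# Toy out-of-core factor storage (hypothetical layout, not MUMPS):
# during "factorization", each factor block is appended to a file and
# indexed by its offset; the solution step reads blocks back in the
# order the forward/backward substitutions need them.

def write_factors(blocks, path):
    """Append each named block sequentially; return {name: (offset, size)}."""
    index = {}
    with open(path, "wb") as f:
        for name, data in blocks:
            payload = pickle.dumps(data)
            index[name] = (f.tell(), len(payload))
            f.write(payload)
    return index

def read_factor(path, index, name):
    """Read one block back using its recorded offset and size."""
    offset, size = index[name]
    with open(path, "rb") as f:
        f.seek(offset)
        return pickle.loads(f.read(size))

# write two toy "factor blocks", then fetch one for the solve phase
with tempfile.NamedTemporaryFile(suffix=".fct", delete=False) as tmp:
    path = tmp.name
index = write_factors([("L_1", [1.0, 2.0]), ("L_2", [3.0])], path)
block = read_factor(path, index, "L_2")
```

Because each read seeks to an arbitrary offset, the processing order of the dependency tree directly determines whether the resulting disk accesses are sequential or scattered, which is the locality issue discussed above.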

Finally, in the context of distributed NUMA architectures,
work with the INRIA RUNTIME team has recently begun to study
optimization strategies and to describe and implement the scheduling of
communications, threads and I/O. Our solvers will use the *Madeleine*
and *Marcel* libraries in order to provide an experimental application
to validate these strategies. M. Faverge started a Ph.D. in October 2006
to study these aspects in the context of the NUMASIS ANR CIGC project.
Note that in the out-of-core context, new problems linked to the
scheduling and management of the computational tasks may arise
(processors may be slowed down by I/O operations). Thus, we have to
design and study specific algorithms for this particular context, by
extending our work on scheduling for heterogeneous platforms.

#### High-performance hybrid direct-iterative solvers for large sparse systems

The resolution of large sparse linear systems is often the most time-consuming step in scientific applications. Parallel sparse direct solvers are now able to solve efficiently real-life three-dimensional problems with several million equations, but they are limited by their memory requirements. Iterative methods, on the other hand, require less memory, but they often fail to solve ill-conditioned systems.

We have developed two approaches in order to find a trade-off between these two classes of methods.

In this work, we consider an approach which, we hope, will bridge the gap between direct and iterative methods. The goal is to provide a method that exploits the parallel blockwise algorithms used in the framework of high-performance sparse direct solvers. We want to extend these high-performance algorithms to develop robust parallel preconditioners, based on incomplete factorization, for iterative methods such as GMRES or the Conjugate Gradient method.
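The overall scheme, an incomplete factorization used to precondition a Krylov method, can be illustrated with SciPy's scalar ILU. This shows only the iterative shell: the preconditioner developed by the team is blockwise, which SciPy does not provide.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, gmres, spilu

# ILU-preconditioned GMRES (scalar ILU from SciPy; illustration only --
# the approach described above uses a blockwise incomplete factorization).
n = 200
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-4)                  # incomplete LU factors
M = LinearOperator((n, n), matvec=ilu.solve)   # wrap them as M^{-1}

x, info = gmres(A, b, M=M)                     # info == 0 on convergence
```

With a good preconditioner, the iteration count (and hence runtime) drops sharply compared to unpreconditioned GMRES, which is the whole point of the trade-off discussed above.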

**Block ILU(k) preconditioner.**
The first idea is to define an adaptive blockwise incomplete factorization
that is much more accurate (and numerically more robust) than the
scalar incomplete factorizations commonly used to precondition
iterative solvers. Such an incomplete factorization can take advantage
of the latest breakthroughs in sparse direct methods and should in
particular be very competitive in CPU time (effective use of processor
power and good scalability), while avoiding the memory limitation
encountered by direct methods. In this way, we expect to be able to
solve systems on the order of a hundred million unknowns, and even
one billion unknowns. Another goal is to analyse and justify the
parameters that can be chosen to define the block sparse pattern
of our incomplete factorization.

The driving rationale for this study is that it is easier to incorporate incomplete factorization methods into direct solution software than it is to develop new incomplete factorizations. Our main goal at this point is to achieve a significant reduction of the memory needed to store the incomplete factors (with respect to the complete factors) while keeping enough fill-in to make the use of BLAS3 (in the factorization) and BLAS2 (in the triangular solves) primitives profitable.

In [78] and [80], we have shown
the benefit of this approach over a classic scalar implementation and
also over direct factorizations. Indeed, on the AUDI problem (a
reference 3D test case for direct solvers, with about one million
unknowns), we are able to solve the system in half the time required
by the direct solver while using only one tenth of the memory
(for a relative residual precision of 10^{-7}). We now aim to improve
the convergence of our solver, which fails on more difficult problems.

This research was part of an NSF/INRIA project and is carried out in collaboration with Yousef Saad (University of Minnesota, Minneapolis, USA).

Recently, we have focused on the critical problem of finding approximate supernodes of ILU(k) factorizations, that is, of finding a coarser block structure of the incomplete factors. The ``exact'' supernodes that are exhibited from the nonzero pattern of the incomplete factors are usually very small, so the resulting dense blocks are not large enough for an efficient use of the BLAS3 routines. A remedy to this problem is to merge supernodes that have nearly the same structure. Preliminary results [53], [54], [77] have shown the benefits of this new approach.
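The idea of merging supernodes with nearly identical row patterns can be sketched as a greedy pass over consecutive column blocks. The merging criterion below (accepted fill bounded by a ratio of the merged pattern size) and all names are our own illustration, not the algorithm of [53], [54], [77].

```python
# Hypothetical sketch of approximate-supernode merging: two consecutive
# column blocks are merged when taking the union of their row patterns
# introduces only a bounded fraction of extra (fill-in) entries.
def merge_supernodes(patterns, max_fill=0.25):
    """patterns: list of row-index sets, one per small exact supernode,
    in column order. Returns the merged, coarser block structure."""
    merged = [set(patterns[0])]
    for pat in patterns[1:]:
        union = merged[-1] | pat
        # extra entries created if both blocks adopt the union pattern
        extra = 2 * len(union) - len(merged[-1]) - len(pat)
        if extra <= max_fill * len(union):
            merged[-1] = union          # patterns close enough: merge
        else:
            merged.append(set(pat))     # too much fill: start a new block
    return merged
```

Merging trades a little extra storage for dense blocks large enough to make BLAS3 kernels profitable, which is exactly the compromise described above.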

**Hybrid direct-iterative solver based on a Schur complement
approach.** In recent years, a few incomplete LU factorization techniques
were developed with the goal of combining some of the features of
standard ILU preconditioners with the good scalability features of
multi-level methods. The key feature of these techniques is to reorder
the system so as to extract parallelism in a natural way. Often, a
number of ideas from domain decomposition are utilized and mixed to
derive parallel factorizations.

Within this framework, we have developed, in collaboration with Yousef Saad (University of Minnesota), algorithms that generalize the notions of ``faces'' and ``edges'' of the ``wire-basket'' decomposition. The interface decomposition algorithm is based on defining a ``hierarchical interface structure'' (HID). This decomposition consists in partitioning the set of unknowns of the interface into components called connectors, which are grouped into ``classes'' of independent connectors [8].

This year, in the context of robust preconditioning techniques,
we have developed an approach that uses the HID ordering
to define a new hybrid direct-iterative solver.
The principle is to build a decomposition of the adjacency matrix
of the system into a set of small overlapping subdomains (the typical
size of a subdomain is a few hundred or thousand nodes). We build this
decomposition from the nested dissection separator tree obtained with
a sparse matrix reordering software such as `Scotch` or `METIS`. Thus,
at a certain level of the separator tree, the subtrees are considered
as the interiors of the subdomains, while the union of the separators
in the upper part of the elimination tree constitutes the interface
between the subdomains.

The interiors of these subdomains are treated by a direct method. Solving the whole system is then equivalent to solving the Schur complement system on the interface between the subdomains, which has a much smaller dimension. We use the hierarchical interface decomposition (HID) to reorder and partition this system; indeed, the HID gives a natural dense block structure of the Schur complement. Based on this partition, we define efficient block preconditioners that allow the use of BLAS routines and a high degree of parallelism, thanks to the HID properties.
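Reduction to the interface system can be written in a few lines for a dense toy matrix. The 2x2 block split below (interior unknowns first, interface unknowns last) is standard; the sizes and the random diagonally dominant matrix are illustrative only, and the real solver of course works on sparse blocks and many subdomains.

```python
import numpy as np

# Dense toy example of the Schur complement reduction on the interface.
rng = np.random.default_rng(0)
ni, ng = 6, 3                      # interior / interface sizes
A = rng.standard_normal((ni + ng, ni + ng)) + (ni + ng) * np.eye(ni + ng)
b = rng.standard_normal(ni + ng)

Aii, Aig = A[:ni, :ni], A[:ni, ni:]
Agi, Agg = A[ni:, :ni], A[ni:, ni:]

# Eliminate the interior unknowns with a direct solve, then solve the
# (much smaller) interface system  S x_g = b_g - Agi Aii^{-1} b_i,
# where  S = Agg - Agi Aii^{-1} Aig  is the Schur complement.
S = Agg - Agi @ np.linalg.solve(Aii, Aig)
xg = np.linalg.solve(S, b[ni:] - Agi @ np.linalg.solve(Aii, b[:ni]))
xi = np.linalg.solve(Aii, b[:ni] - Aig @ xg)   # back-substitute interiors

x = np.concatenate([xi, xg])
```

In the hybrid solver, `S` is never formed and solved exactly as here: it is preconditioned blockwise (following the HID structure) and solved iteratively.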

We propose several algorithmic variants to solve the Schur
complement system that can be adapted to the geometry of the problem:
typically, some strategies are more suitable for systems coming from
the discretization of a 2D problem and others for a 3D problem; the
choice of the method also depends on the
numerical difficulty of the problem [52].
For the moment, only a sequential version of these techniques has been
implemented, in a library called `HIPS`. It provides several methods to
build an efficient preconditioner in many of these situations,
and handles symmetric, unsymmetric, real
or complex matrices. `HIPS` has been built on top of the `PHIDAL`
library and thus also provides some scalar preconditioners based on the
multistage ILUT factorization (defined in [8]).
This work is the thesis subject of Jérémie Gaidamour, who started
in October 2006.

#### Parallel fast multipole method

The Fast Multipole Method (FMM) is a hierarchical method that computes the interactions of the N-body problem in O(N) time for any given precision. In order to compute energies and forces on large systems, we need to improve the computation speed of the method.

This has been achieved thanks to a matrix formulation of the main operator of the far-field computation, which is implemented with BLAS (Basic Linear Algebra Subprograms) routines. Even if it is straightforward to use level 2 BLAS (matrix-vector operations), the use of level 3 BLAS (matrix-matrix operations) is attractive because it is much more efficient. Thus, thanks to a careful data storage in memory, we have rewritten the algorithm so as to use level 3 BLAS, greatly improving the overall runtime. Other enhancements of the Fast Multipole Method, such as the use of the Fast Fourier Transform, of ``rotations'' or of plane wave expansions, reduce the theoretical operation count. Comparison tests have shown that, for the precisions required in astrophysics or in molecular dynamics, our approach is either faster (compared to rotations and plane waves) or as fast but without numerical instabilities (compared to the FFT-based method), hence justifying our BLAS approach. These results are detailed in [76] and have been submitted for publication [67].

Our BLAS version has then been extended to non-uniform distributions, which required a new octree data structure, named the octree with indirection, that is efficient for both uniform and non-uniform distributions. We have also designed an efficient algorithm that detects uniform areas in structured non-uniform distributions, since these areas are more suitable for BLAS computations. These results have been presented in [29] and in [75]. An efficient parallel code of our BLAS version has finally been developed and validated on shared and distributed memory architectures [14].
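The level 2 versus level 3 distinction can be made concrete with NumPy, whose `@` operator dispatches to BLAS GEMV/GEMM. The expansion size, the batch of cells, and the single shared transfer matrix below are illustrative shapes only, not the actual FMM operators.

```python
import numpy as np

# Why level 3 BLAS pays off in the far field: applying one transfer
# matrix to many multipole expansions at once is a single matrix-matrix
# product (GEMM) instead of many matrix-vector products (GEMV).
rng = np.random.default_rng(1)
p = 64            # size of one expansion (illustrative)
m = 128           # number of cells sharing this transfer matrix
M2L = rng.standard_normal((p, p))
multipoles = rng.standard_normal((p, m))   # expansions stored column-wise

# level 2 style: one GEMV per cell
locals_gemv = np.stack([M2L @ multipoles[:, j] for j in range(m)], axis=1)
# level 3 style: one GEMM for all cells (same result, far better data reuse)
locals_gemm = M2L @ multipoles
```

Both computations give the same local expansions; the GEMM form simply exposes the whole batch to the BLAS at once, which is what the careful data storage mentioned above makes possible.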

We now plan to improve our parallelization through hybrid MPI-thread programming, and to integrate our FMM into complete codes for real simulations in astrophysics and molecular dynamics.