## Section: Software

Keywords : complete and incomplete supernodal sparse parallel factorizations.

`PaStiX`

Participants : Pascal Hénon, François Pellegrini, Pierre Ramet [ corresponding member ] , Jean Roman.

This work is supported by the French "Commissariat à l'Energie Atomique CEA/CESTA" in the context of structural mechanics and electromagnetism applications.

`PaStiX` (Parallel Sparse matriX package)
(http://pastix.gforge.inria.fr) is a scientific library that provides a high performance parallel solver for very large sparse linear systems based on block direct and block ILU(k) iterative methods.
Numerical algorithms are implemented in single or double
precision (real or complex): LLt (Cholesky), LDLt (Crout) and LU with
static pivoting (for nonsymmetric matrices having a symmetric
pattern). The latter version is now used in `FluidBox` (see section
5.3).
The `PaStiX` library is planned to be released this year under the INRIA CeCILL licence.
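
As an illustration of the LDLt (Crout) variant listed above, here is a minimal dense sketch in pure Python. It is not taken from `PaStiX`: the function name `ldlt` is hypothetical, no pivoting is performed, and the sparse, supernodal and parallel aspects of the real solver are ignored entirely.

```python
def ldlt(a):
    """Factor a symmetric matrix as A = L * diag(d) * L^T.

    `a` is a list of rows. L is unit lower triangular and d holds the
    diagonal of D. No pivoting is done (an assumption of this sketch),
    so every pivot d[j] must be nonzero.
    """
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # Diagonal entry: remove the contributions of previous columns.
        d[j] = a[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] * d[k] for k in range(j))
            L[i][j] = (a[i][j] - s) / d[j]
    return L, d


A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L, d = ldlt(A)
```

When every entry of `d` is positive, the Cholesky factor is obtained by scaling column `j` of `L` by the square root of `d[j]`, which is why LLt and LDLt are usually presented together.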

The `PaStiX` library uses the graph partitioning and sparse matrix
block ordering package `Scotch` (see section 5.8).
`PaStiX` is based on an efficient static scheduling and memory
manager, in order to solve 3D problems with more than 10 million
unknowns. The mapping and scheduling algorithm handles a combination
of 1D and 2D block distributions. This algorithm computes an efficient
static scheduling of the block computations for our supernodal
parallel solver which uses a local aggregation of contribution
blocks. This is done by taking into account, very precisely, the
computational costs of the BLAS 3 primitives, the communication costs
and the cost of local aggregations. We also improved this static
computation and communication scheduling algorithm to anticipate the
sending of partially aggregated blocks, in order to free memory
dynamically. By doing this, we are able to dramatically reduce the
aggregated memory overhead while keeping good performance.
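
The idea of a static schedule driven by a cost model can be illustrated with a toy list scheduler. This is only a sketch under simplifying assumptions, not the `PaStiX` mapping algorithm: the name `static_schedule`, the single uniform `comm_cost`, and the three-task graph are hypothetical, and the 1D/2D block distributions, the aggregation costs and the calibrated BLAS 3 timings are not modeled.

```python
def static_schedule(tasks, deps, n_procs, comm_cost):
    """Greedy static mapping of tasks onto processors.

    tasks: dict name -> compute cost, given in a topological order.
    deps:  dict name -> list of predecessor task names.
    comm_cost: delay added when a predecessor ran on another processor
               (a uniform value, assumed for this sketch).
    Returns (mapping, finish): processor index and finish time per task.
    """
    proc_free = [0.0] * n_procs  # time at which each processor is idle
    mapping, finish = {}, {}
    for task, cost in tasks.items():
        best = None
        for p in range(n_procs):
            # The task is ready once all its inputs have arrived on p.
            ready = 0.0
            for q in deps.get(task, []):
                delay = 0.0 if mapping[q] == p else comm_cost
                ready = max(ready, finish[q] + delay)
            end = max(ready, proc_free[p]) + cost
            # Keep the processor with the earliest predicted finish time.
            if best is None or end < best[0]:
                best = (end, p)
        finish[task], mapping[task] = best
        proc_free[best[1]] = best[0]
    return mapping, finish


tasks = {"a": 2, "b": 3, "c": 1}   # "c" updates from "a" and "b"
deps = {"c": ["a", "b"]}
mapping, finish = static_schedule(tasks, deps, n_procs=2, comm_cost=1)
```

Because all costs are known before the factorization starts, the whole mapping is computed once and then replayed at run time; this is what makes the approach dependent on the architecture having predictable performance.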

Another important point is that our approach is suitable for any heterogeneous parallel/distributed architecture whose performance is predictable, such as clusters of SMP nodes. In particular, we now propose a high performance version with a low memory overhead for SMP node architectures, which fully exploits shared memory advantages by using a hybrid MPI-thread implementation.

Direct methods are numerically robust, but very large three-dimensional problems may lead to systems that would require a huge amount of memory despite any memory optimization. One approach under study is to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Such an incomplete factorization can take advantage of the latest breakthroughs in sparse direct methods and, in particular, should be very competitive in CPU time (effective use of processor power and good scalability) while avoiding the memory limitation encountered by direct methods.
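
To show how an incomplete factorization limits memory, the following sketch computes the fill pattern retained by the standard scalar level-of-fill rule used in ILU(k). The text above concerns a blockwise variant; this scalar version, the function name `iluk_pattern` and the arrow-shaped test pattern are assumptions chosen only to illustrate the rule.

```python
def iluk_pattern(pattern, n, k_max):
    """Symbolic ILU(k): return dict (i, j) -> fill level of kept entries.

    pattern: set of (i, j) positions of the nonzeros of A, diagonal included.
    Original entries have level 0. A fill entry created through pivot p
    gets level lev(i, p) + lev(p, j) + 1 and is kept only if that level
    does not exceed k_max.
    """
    lev = {pos: 0 for pos in pattern}
    for p in range(n):
        rows = [i for i in range(p + 1, n) if (i, p) in lev]
        cols = [j for j in range(p + 1, n) if (p, j) in lev]
        for i in rows:
            for j in cols:
                new = lev[(i, p)] + lev[(p, j)] + 1
                # Dropped entries are never stored, so they cannot
                # generate further fill.
                if new <= k_max and new < lev.get((i, j), k_max + 1):
                    lev[(i, j)] = new
    return lev


# Arrow matrix: full first row and column plus the diagonal.
n = 4
arrow = {(i, i) for i in range(n)}
arrow |= {(0, j) for j in range(n)} | {(i, 0) for i in range(n)}
ilu0 = iluk_pattern(arrow, n, k_max=0)   # no fill is kept
ilu1 = iluk_pattern(arrow, n, k_max=1)   # level-1 fill is kept
```

On this pattern a complete factorization fills the whole matrix, ILU(0) keeps only the original entries, and ILU(1) already retains every fill entry because they all have level 1. Raising `k_max` therefore trades memory for accuracy of the preconditioner.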