Section: Scientific Foundations
Keywords: high-performance computing, parallel sparse linear algebra, fast multipole methods.
Algorithms and high-performance solvers
High-performance direct solvers for distributed clusters
Solving large sparse systems Ax = b of linear equations is a crucial and time-consuming step, arising in many scientific and engineering applications. Consequently, many parallel techniques for sparse matrix factorization have been studied and implemented.
We started this research by working on the parallelization of an industrial structural mechanics code, a 2D and 3D finite element code that is nonlinear in time. This computational finite element code solves plasticity problems (or thermoplasticity problems, possibly coupled with large displacements). Since the matrices of these systems are very ill-conditioned, classical iterative methods are not a viable option. Therefore, to obtain a robust and versatile industrial software tool, high-performance sparse direct solvers are mandatory, and parallelism is in turn necessary for reasons of memory capacity and acceptable solution time. Moreover, in order to efficiently solve 3D problems with more than 10 million unknowns, which is now a reachable challenge with new SMP supercomputers, we must achieve good time scalability and control memory overhead.
In the ScAlApplix project, we focused first on the block partitioning and scheduling problem for high-performance sparse LDL^{T} or LL^{T} parallel factorization without dynamic pivoting for large sparse symmetric positive definite systems. Our strategy is suitable for nonsymmetric sparse matrices with symmetric pattern, and for general distributed heterogeneous architectures whose computation and communication performances are predictable in advance.
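As a point of reference for the factorization kernel mentioned above, the following is a minimal dense LDL^{T} factorization sketch without pivoting, the operation that a supernodal solver applies to its dense blocks. This is an illustrative toy written for this text, not the project's actual block algorithm, and the small example matrix is invented.

```python
def ldlt(A):
    """Factor a symmetric matrix A (list of lists) as L * D * L^T.

    Returns (L, D) with L unit lower triangular and D a diagonal (as a list).
    No pivoting is performed, as is valid for symmetric positive definite
    systems -- the setting named in the text above.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        # Diagonal entry: subtract contributions of already-eliminated columns.
        D[j] = A[j][j] - sum(L[j][k] * L[j][k] * D[k] for k in range(j))
        for i in range(j + 1, n):
            # Fill column j of L below the diagonal.
            L[i][j] = (A[i][j]
                       - sum(L[i][k] * L[j][k] * D[k] for k in range(j))) / D[j]
    return L, D

# Small SPD example (invented for illustration).
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L, D = ldlt(A)   # here D == [4.0, 4.0, 4.0]
```

In a distributed block solver, the same recurrence is applied to dense supernodal blocks (using BLAS kernels) rather than scalars, which is what makes the static block partitioning and scheduling problem discussed above central to performance.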
Research on high-performance sparse direct solvers is carried out in collaboration with P. Amestoy (ENSEEIHT – IRIT) and J.-Y. L'Excellent (INRIA Rhône-Alpes), and has led to software developments (see sections 5.4, 5.5, 5.8) and to industrial contracts with CEA (Commissariat à l'Energie Atomique).
High-performance iterative solvers
In addition to the project activities on direct solvers, we also study robust preconditioning algorithms for iterative methods. The goal of these studies is to overcome the huge memory consumption inherent in direct solvers, in order to solve 3D problems of very large size (several million unknowns). Our studies focus on building generic parallel preconditioners based on ILU factorizations. Classical ILU preconditioners use scalar algorithms that exploit CPU power poorly and are difficult to parallelize. Our work aims at finding unknown orderings and partitionings that lead to a dense block structure of the incomplete factors. Based on this block pattern, efficient parallel blockwise algorithms can then be devised to build robust preconditioners that are also able to exploit the full capabilities of modern high-performance computers.
We study two approaches:

the first approach consists in building block ILU(k) preconditioners. The main idea is to adapt the classical ILU(k) factorization in order to reuse the algorithmic components that have been developed for direct methods. In this case, the ordering we use is the same as in the direct factorization, and a dense block pattern (i.e., a partition of the unknowns) is obtained using an algorithm that lumps together columns having few differences in their nonzero patterns. We have adapted the parallel direct solver chain in order to deal with the incomplete block factors defined by this process. Thus the preconditioner computation benefits from the breakthroughs made by the direct solver techniques studied in PaStiX (sections 5.5 and 6.4 ).

the second approach we recently developed is based on the Schur complement. In this case, we use a partition of the adjacency graph of the system matrix into a set of small subdomains with overlap. The interiors of these subdomains are treated by a direct method. Solving the whole system is then equivalent to solving the Schur complement system on the interface between the subdomains, a system of much smaller dimension. We use the hierarchical interface decomposition (HID) developed in PHIDAL to reorder and partition this system; indeed, the HID gives a natural dense block structure of the Schur complement. Based on this partition, we define efficient block preconditioners that allow the use of BLAS routines and a high degree of parallelism thanks to the HID properties. All these algorithms are implemented in a new library named HIPS. HIPS contains the PHIDAL library (HID ordering and partitioning) and adds extensions and new algorithms (multilevel functionalities, a hybrid direct-iterative approach) to it. Details can be found in sections 5.6 and 6.4 .
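The Schur complement reduction underlying this second approach can be sketched on a toy problem: interior unknowns of each subdomain are eliminated by a direct method, leaving a smaller system on the interface. The code below is an illustration written for this text, not the HIPS implementation; the block names (Aii, Aig, Agi, Agg) and the tiny 1D Laplacian example are notation chosen here.

```python
def solve_dense(A, b):
    """Toy dense direct solver: Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# 1D Laplacian on 5 points: subdomain interiors {0,1} and {3,4}, interface {2}.
# Unknowns reordered as interior = [0, 1, 3, 4], interface = [2], so the
# interior block Aii is block diagonal (one block per subdomain).
Aii = [[ 2.0, -1.0,  0.0,  0.0],
       [-1.0,  2.0,  0.0,  0.0],
       [ 0.0,  0.0,  2.0, -1.0],
       [ 0.0,  0.0, -1.0,  2.0]]
Aig = [[0.0], [-1.0], [-1.0], [0.0]]   # interior-to-interface coupling
Agi = [[0.0, -1.0, -1.0, 0.0]]
Agg = [[2.0]]
bi, bg = [1.0, 1.0, 1.0, 1.0], [1.0]

ni, ng = len(Aii), len(Agg)
# Schur complement S = Agg - Agi * Aii^{-1} * Aig (one interior solve per
# column of Aig; in practice each subdomain solve is independent).
Y = [solve_dense(Aii, [row[c] for row in Aig]) for c in range(ng)]
S = [[Agg[r][c] - sum(Agi[r][k] * Y[c][k] for k in range(ni))
      for c in range(ng)] for r in range(ng)]

# Reduce the right-hand side, solve on the interface, then back-substitute.
z = solve_dense(Aii, bi)
g = [bg[r] - sum(Agi[r][k] * z[k] for k in range(ni)) for r in range(ng)]
xg = solve_dense(S, g)                 # interface solution: [4.5]
xi = solve_dense(Aii, [bi[k] - sum(Aig[k][c] * xg[c] for c in range(ng))
                       for k in range(ni)])
```

In the approach described above, S is not formed and solved directly as here: it is kept in the dense block form induced by the HID partition and treated iteratively with a block preconditioner, which is where the parallelism comes from.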
Fast Multipole Methods
In many of the scientific computing applications regarded today as computational challenges, such as biological systems, astrophysics, or electromagnetism, the introduction of hierarchical methods based on an octree structure has dramatically reduced the amount of computation needed to simulate these systems for a given error tolerance.
Among these methods, the Fast Multipole Method (FMM) allows the computation of the interactions in, for example, a molecular dynamics system of N particles in O(N) time, versus O(N^{2}) for a direct approach. The extension of these methods and their efficient implementation on current parallel architectures is still a critical issue. Moreover, the use of periodic boundary conditions, or of duplications of the system in two of the three space dimensions, as well as the use of higher-order approximations for integral equations, remain open questions.
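The core idea behind the O(N) versus O(N^{2}) gap can be illustrated with the lowest-order multipole approximation: the potential of a distant cluster of charges at a target point is well approximated by a single "monopole", the total charge placed at the charge-weighted centroid, replacing O(N) pairwise terms by O(1) work per well-separated target. The sketch below (1D, random charges, all names invented here) shows only this idea; a real FMM adds higher-order expansions, an octree, and translation operators to reach O(N) overall.

```python
import random

random.seed(0)

# Source cluster near x = 0; targets well separated from it, near x = 10.
sources = [(random.uniform(-0.5, 0.5), random.uniform(0.5, 1.5))
           for _ in range(100)]          # (position, charge) pairs
targets = [10.0 + 0.1 * k for k in range(5)]

def direct_potential(x):
    """Exact O(N) sum over all sources for one target point."""
    return sum(q / abs(x - y) for y, q in sources)

# Monopole expansion: total charge Q placed at the charge-weighted centroid xc.
Q = sum(q for _, q in sources)
xc = sum(y * q for y, q in sources) / Q

errors = []
for x in targets:
    exact = direct_potential(x)
    approx = Q / abs(x - xc)             # O(1) far-field evaluation
    errors.append(abs(approx - exact) / abs(exact))
# The relative errors are small because the targets are far from the cluster;
# the leading correction scales like (cluster radius / distance)^2.
```

Hierarchical methods apply this approximation recursively: an octree groups sources at every scale, so each target interacts with O(log N) or O(1) cluster expansions instead of N individual sources.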
In order to treat biological systems of up to several million atoms, these methods must be integrated into the QC++ platform (see section 5.7 ). They can be used in all three (quantum, molecular, and continuum) models: for atom-atom interactions in quantum or molecular mechanics, for atom-surface interactions in the coupling between the continuum and the other models, and also for fast matrix-vector products in the iterative solution of the linear system given by the integral formulation of the continuum method. Moreover, the significant experience gained in the Scotch and PaStiX projects (see sections 5.8 and 5.5 ) will be useful for developing efficient implementations of the FMM on parallel clusters of SMP nodes.