Section: Scientific Foundations
Algorithms and highperformance solvers
Participants : Cécile Dobrzynski, Pascal Hénon, François Pellegrini, Pierre Ramet.
Highperformance direct solvers for distributed clusters
Solving large sparse systems Ax = b of linear equations is a crucial and timeconsuming step, arising in many scientific and engineering applications. Consequently, many parallel techniques for sparse matrix factorization have been studied and implemented.
Sparse direct solvers are mandatory when the linear system is very illconditioned; such a situation is often encountered in structural mechanics codes, for example. Therefore, to obtain an industrial software tool that must be robust and versatile, highperformance sparse direct solvers are mandatory, and parallelism is then necessary for reasons of memory capability and acceptable solving time. Moreover, in order to solve efficiently 3D problems with more than 50 million unknowns, which is now a reachable challenge with new SMP supercomputers, we must achieve good scalability in time and control memory overhead. Solving a sparse linear system by a direct method is generally a highly irregular problem that induces some challenging algorithmic problems and requires a sophisticated implementation scheme in order to fully exploit the capabilities of modern supercomputers.
In the BACCHUS project, we focused first on the block partitioning and scheduling problem for high performance sparse LDL^{T} or LL^{T} parallel factorization without dynamic pivoting for large sparse symmetric positive definite systems. Our strategy is suitable for nonsymmetric sparse matrices with symmetric pattern, and for general distributed heterogeneous architectures the computation and communication performance of which are predictable in advance. This has led to software developments (see sections 5.3 , 5.5 )
Highperformance iterative and hybrid direct/iterative solvers
In addition to the project activities on direct solvers, we also study some robust preconditioning algorithms for iterative methods. The goal of these studies is to overcome the huge memory consumption inherent to the direct solvers in order to solve 3D problems of huge size (several million of unknowns). Our studies focus on the building of generic parallel preconditioners based on ILU factorizations. The classical ILU preconditioners use scalar algorithms that do not exploit well CPU power and are difficult to parallelize. Our work aims at finding some unknown orderings and partitioning that lead to a dense block structure of the incomplete factors. Then, based on the block pattern, some efficient parallel blockwise algorithms can be devised to build robust preconditioners that are also able to fully exploit the capabilities of modern highperformance computers.
In this context, we study two approaches.

The first idea is to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Such incomplete factorization can take advantage of the latest breakthroughs in sparse direct methods and particularly should be very competitive in CPU time (effective power used from processors and good scalability) while avoiding the memory limitation encountered by direct methods. By this way, we expect to be able to solve systems in the order of hundred million of unknowns and even one billion of unknowns. Another goal is to analyze and justify the chosen parameters that can be used to define the block sparse pattern in our incomplete factorization.
The driving rationale for this study is that it is easier to incorporate incomplete factorization methods into direct solution software than it is to develop new incomplete factorizations.
Our main goal at this point is to achieve a significant diminution of the memory needed to store the incomplete factors (with respect to the complete factors) while keeping enough fillin to make the use of BLAS3 (in the factorization) and BLAS2 (in the triangular solves) primitives profitable.
In this approach, we focus on the critical problem to find approximate supernodes of ILU(k) factorizations. The problem is to find a coarser block structure of the incomplete factors. The “exact” supernodes that are exhibited from the incomplete factor non zero pattern are usually very small and thus the resulting dense blocks are not large enough for an efficient use of the BLAS3 routines. A remedy to this problem is to merge supernodes that have nearly the same structure. The benefits of this approach have been shown in [43] . These algorithms are implemented in the PaStiX library.

The second technique makes use of a Schur complement approach.
In recent years, a few Incomplete LU factorization techniques were developed with the goal of combining some of the features of standard ILU preconditioners with the good scalability features of multilevel methods. The key feature of these techniques is to reorder the system in order to extract parallelism in a natural way. Often a number of ideas from domain decomposition are utilized and mixed to derive parallel factorizations.
Under this framework, we developed in collaboration with Yousef Saad (University of Minnesota) algorithms that generalize the notion of “faces” and “edge” of the “wirebasket” decomposition. The interface decomposition algorithm is based on defining a “hierarchical interface structure” (HID). This decomposition consists in partitioning the set of unknowns of the interface into components called connectors that are grouped in “classes” of independent connectors [44] .
In the context of robust preconditioner technique, we have developed an approach that uses the HID ordering to define a new hybrid directiterative solver. The principle is to build a decomposition of the adjacency matrix of the system into a set of small subdomains (the typical size of a subdomain is around a few hundreds or thousand nodes) with overlap. We build this decomposition from the nested dissection separator tree obtained using a sparse matrix reordering software as Scotch . Thus, at a certain level of the separator tree, the subtrees are considered as the interior of the subdomains and the union of the separators in the upper part of the elimination tree constitutes the interface between the subdomains.
The interior of these subdomains are treated by a direct method. Solving the whole system is then equivalent to solve the Schur complement system on the interface between the subdomains which has a much smaller dimension. We use the hierarchical interface decomposition (HID) to reorder and partition this system. Indeed, the HID gives a natural dense block structure of the Schur complement. Based on this partition, we define some efficient block preconditioners that allow the use of BLAS routines and a high degree of parallelism thanks to the HID properties.
We propose several algorithmic variants to solve the Schur complement system that can be adapted to the geometry of the problem: typically some strategies are more suitable for systems coming from a 2D problem discretisation and others for a 3D problem; the choice of the method also depends on the numerical difficulty of the problem. In the HIPS library, we provide full iterative methods (very low memory consumption) as well as hybrid methods that mixes a direct factorization inside the domain and an iterative method in the Schur complement. The library provides many options that allow one to deal with real or complex arithmetic, and symmetric or unsymmetric matrices. In particular, the very interesting feature of HIPS is that it allows one to find some good tradeoff between memory, robustness and time consumption in almost every case.
These works are also supported by the ANRCIS project “SOLSTICE”.
Meshes and graph partitioning
Parallel graph partitioning and static mapping
Finding vertex separators for sparse matrix ordering is only one of the many uses of generic graph partitioning tools. For instance, finding balanced and compact domains in problem graphs is essential to the efficiency of parallel iterative solvers. Here again, because of the size of the problems at stake, parallel graph partitioning tools are mandatory to provide good load balance and minimal communication cost.
The execution of parallel applications implies communication between processes executed on the different cores. On NUMA architectures which are strongly heterogeneous in terms of latency and capacity, communication cost strongly depends on the repartition of tasks among cores. Architectureaware load balancing must take into account both the characteristics of the parallel applications (including for instance task processing costs and the amount of communication between tasks) and the topology of the target architecture (providing the powers of cores and the costs of communication between all of them). When processes are assumed to coexist simultaneously for all the duration of the program, this optimization problem is called mapping. A mapping is called static if it is computed prior to the execution of the program and is never modified at runtime.
The sequential Scotch tool was able to perform static mapping since its first version, but this feature was not widely known nor used by the community. With the increasing need to map very large problem graphs onto very large and strongly heterogeneous parallel machines (whether hierarchical NUMA clusters or GPUbased systems), there is an increasing demand for parallel static mapping tools.
Adaptive dynamic mesh partitioning
Many simulations which model the evolution of a given phenomenon along with time (turbulence and unsteady flows, for instance) need to remesh some portions of the problem graph in order to capture more accurately the properties of the phenomenon in areas of interest. This remeshing is performed according to criteria which are closely linked to the undergoing computation and can involve large mesh modifications: while elements are created in critical areas, some may be merged in areas where the phenomenon is no longer critical.
Performing such remeshing in parallel creates additional problems. In particular, splitting an element which is located on the frontier between several processors is not an easy task, because deciding when splitting some element, and defining the direction along which to split it so as to preserve numerical stability most, require shared knowledge which is not available in distributed memory architectures. Adhoc data structures and algorithms have to be devised so as to achieve these goals without resorting to extra communication and synchronization which would impact the running speed of the simulation.
Most of the works on parallel mesh adaptation attempt to parallelize in some way all the mesh operations: edge swap, edge split, point insertion, etc. It implies deep modifications in the (re)mesher and often leads to bad performance in term of CPU time. An other work [38] proposes to base the parallel remeshing on existing mesher and load balancing to be able to modify the elements located on the frontier between several processors.
In addition, the preservation of load balance in the remeshed simulation requires dynamic redistribution of mesh data across processing elements. Several dynamic repartitioning methods have been proposed in the literature [46] , [45] , which rely on diffusionlike algorithms and the solving of flow problems to minimize the amount of data to be exchanged between processors. However, integrating such algorithms into a global framework for handling adaptive meshes in parallel has yet to be done.