Team Bacchus


Section: New Results

Algorithms and high-performance solvers

Participants: Mathieu Faverge, Sébastien Fourestier, Jérémie Gaidamour, Pascal Hénon [Corresponding member], Jun-Ho Her, Cédric Lachat, Xavier Lacoste, François Pellegrini, Pierre Ramet.

Parallel domain decomposition and sparse matrix reordering

The work carried out within the Scotch project (see Section 5.5) focused on four main axes.

The first one regards the parallelization of the static mapping routines already available in the sequential version of Scotch. Since version 5.1, released last year, Scotch has provided parallel graph partitioning capabilities, but graph partitions are to date computed by means of a parallel multilevel recursive bisection framework. This framework provides partitions of very high quality for a moderate number of parts (fewer than about 512), but load imbalance increases dramatically for larger numbers of parts. Also, the more parts the user wants, the more expensive it is to compute them, because of the recursive bisection process. Consequently, efforts this year have focused on designing a direct k-way parallel graph partitioning framework. In fact, the problem considered here is not plain graph partitioning but static mapping, because of the increasing need to take into account the topology of the target machine when assigning computations to processing elements. Preliminary results have been achieved [38], during the second post-doc year of Jun-Ho Her, but much remains to be done, as the cost of parallel direct k-way static mapping algorithms is still extremely high compared to sequential methods.
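One simple way to see why recursive bisection can lose balance as the number of parts grows is a worst-case bound: if every bisection is allowed a relative imbalance eps, the heaviest of k = 2^d parts can weigh up to (1 + eps)^d times the ideal part weight. The sketch below only evaluates this bound. The function name and the 2% tolerance are illustrative assumptions, not values taken from Scotch, and the bound says nothing about the imbalance actually observed in practice.

```python
def worst_case_imbalance(num_parts: int, eps: float) -> float:
    """Upper bound on (heaviest part weight / ideal part weight) after
    recursive bisection into num_parts parts, when each bisection may put
    up to (1 + eps) / 2 of the weight on one side.

    num_parts must be a power of two so that every branch has the same depth.
    """
    if num_parts < 1 or num_parts & (num_parts - 1):
        raise ValueError("num_parts must be a power of two")
    levels = num_parts.bit_length() - 1  # log2(num_parts)
    return (1.0 + eps) ** levels


if __name__ == "__main__":
    # Hypothetical per-bisection tolerance of 2%.
    for k in (2, 64, 512, 4096, 65536):
        print(k, round(worst_case_imbalance(k, 0.02), 4))
```

A direct k-way scheme enforces the tolerance once, on the final parts, so this compounding over levels does not arise by construction.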

The second axis concerns dynamic repartitioning. Since graphs may now comprise more than one billion vertices, distributed across machines with more than one hundred thousand processing elements, it is important to be able to compute partitions which induce as little data movement as possible with respect to a prior partition. The integration of repartitioning features into the sequential version of Scotch is currently under way, in the context of the PhD of Sébastien Fourestier, and will be extended to the parallel domain afterwards. These two axes were partially supported by the ANR-CIS project “SOLSTICE”.
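The quantity that repartitioning tries to keep small can be illustrated as the total load of the vertices whose part changes between the prior and the new partition. The following is a minimal sketch of that measure only; the function name and the unit default load are assumptions made for illustration, and it is not the cost model used in Scotch.

```python
def migration_volume(old_part, new_part, vertex_load=None):
    """Total load of the vertices assigned to a different part in new_part
    than in old_part. Both partitions are sequences of part indices,
    indexed by vertex. vertex_load defaults to 1 per vertex."""
    if len(old_part) != len(new_part):
        raise ValueError("partitions must cover the same vertices")
    if vertex_load is None:
        vertex_load = [1] * len(old_part)
    return sum(w for o, n, w in zip(old_part, new_part, vertex_load) if o != n)


if __name__ == "__main__":
    old = [0, 0, 1, 1]
    # Same cut, but the part labels are swapped: every vertex has to move.
    print(migration_volume(old, [1, 1, 0, 0]))
    # A small adjustment of the prior partition: a single vertex moves.
    print(migration_volume(old, [0, 0, 0, 1]))
```

The first call shows why a partition computed from scratch can be costly to apply even when its cut is identical: nothing ties its part labels to the prior distribution of the data.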

A third research axis regards the design of specific graph partitioning algorithms. Several applications, such as Schur complement methods for hybrid solvers (see Section 5.4), need k-way partitions in which load balance takes into account not only the vertices belonging to the sub-domains, but also the boundary vertices, which induce computations on each of the sub-domains that share them. This work, which had been temporarily set aside for lack of manpower, is now being resumed by Jun-Ho Her, in the context of the ANR project “PETAL”.
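To make the balance criterion concrete, the sketch below counts, for each sub-domain, its interior vertices plus every boundary vertex adjacent to it, so that a boundary vertex shared by several sub-domains is charged to each of them. It assumes a vertex separator in which boundary vertices are marked with part index -1 and unit vertex weights; the function name, this encoding and the cost model are illustrative assumptions, not the algorithm under development.

```python
def domain_loads(adjacency, part, num_parts):
    """Return (interior, boundary) vertex counts per sub-domain.

    adjacency[v] lists the neighbours of vertex v; part[v] is the sub-domain
    of v, or -1 when v is a boundary (separator) vertex. A boundary vertex
    is counted once for every sub-domain it is adjacent to.
    """
    interior = [0] * num_parts
    boundary = [0] * num_parts
    for v, p in enumerate(part):
        if p >= 0:
            interior[p] += 1
        else:
            for q in {part[u] for u in adjacency[v] if part[u] >= 0}:
                boundary[q] += 1
    return interior, boundary


if __name__ == "__main__":
    # Path graph 0-1-2-3-4-5-6 with vertex 2 as separator.
    adjacency = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5]]
    part = [0, 0, -1, 1, 1, 1, 1]
    interior, boundary = domain_loads(adjacency, part, 2)
    print([i + b for i, b in zip(interior, boundary)])
```

Under this model, a partition that balances interior vertices alone can still be unbalanced once the shared boundary work is added.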

The fourth axis is the design of efficient and scalable software tools for parallel dynamic remeshing. This is joint work with Cécile Dobrzynski, which took shape this fall with the start of the PhD of Cédric Lachat. Cédric started his work by devising cache-oblivious orderings of the unknowns in the FluidBox and MMG3D 3D software, in order to speed up computations.
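The report does not specify which orderings were devised. As a generic illustration of the idea, one well-known family of locality-preserving orderings sorts the unknowns along a space-filling curve, so that unknowns close in space tend to be close in memory whatever the cache size. The sketch below uses a Morton (Z-order) curve; the function names and the 10-bit quantization are assumptions made for this example, and it should not be read as the ordering used in FluidBox or MMG3D.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three non-negative integers
    (x in the lowest position of each bit triple) into one Morton key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key


def morton_order(points, bits=10):
    """Return the vertex indices sorted along a Z-order curve.

    points is a non-empty list of (x, y, z) coordinates; they are quantized
    onto a 2^bits grid spanning their bounding box before interleaving.
    """
    lo = [min(p[d] for p in points) for d in range(3)]
    hi = [max(p[d] for p in points) for d in range(3)]
    scale = (1 << bits) - 1

    def quantize(p):
        return tuple(
            int((p[d] - lo[d]) / (hi[d] - lo[d]) * scale) if hi[d] > lo[d] else 0
            for d in range(3)
        )

    return sorted(
        range(len(points)),
        key=lambda i: morton_key(*quantize(points[i]), bits=bits),
    )


if __name__ == "__main__":
    points = [(0.9, 0.9, 0.9), (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.9, 0.8, 0.9)]
    print(morton_order(points))
```

The returned permutation can then be applied to the arrays of unknowns so that traversals in index order follow the curve.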

High-performance direct solvers on multi-platforms

In order to solve linear systems of equations arising from 3D problems with more than 50 million unknowns, which is now a reachable challenge for new SMP supercomputers, parallel solvers must maintain good time scalability and must control the memory overhead caused by the extra structures required to handle communications.

Static parallel supernodal approach. In the context of new SMP node architectures, we proposed to fully exploit the advantages of shared memory. A relevant approach is then to use a hybrid MPI-thread implementation. This approach, not yet explored in the framework of direct solvers, aims at efficiently solving 3D problems with well over 50 million unknowns. The rationale that motivated this hybrid implementation is that communications within an SMP node can be advantageously replaced by direct accesses to shared memory between the processors of the node, using threads. In addition, the MPI communications between processes are grouped by SMP node. We have shown that this approach allows a great reduction of the memory required for communications. Many factorization algorithms are now implemented in real or complex variables, for single or double precision: LLt (Cholesky), LDLt (Crout) and LU with static pivoting (for non-symmetric matrices having a symmetric pattern). This latter version is now integrated in the FluidBox software. A survey article on these techniques is in preparation and will be submitted to the SIAM Journal on Matrix Analysis and Applications. It will present the detailed algorithms and the most recent results. We still have to add numerical pivoting techniques to our solver in order to improve its robustness.
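As a reminder of what static pivoting means in general, the dense sketch below performs an LU factorization without any row exchange and replaces a pivot whose magnitude falls below a threshold by that threshold, keeping its sign. This is the textbook idea only, under our own assumptions: the function name, the threshold handling and the dense storage are illustrative and do not describe the supernodal, distributed implementation of the solver.

```python
def lu_static_pivoting(a, pivot_threshold):
    """In-place style LU factorization of a dense square matrix without row
    exchanges. Returns (lu, replaced): lu holds U on and above the diagonal
    and the multipliers of the unit lower factor L below it; replaced is the
    number of pivots that were perturbed.

    pivot_threshold must be positive; a pivot with |pivot| < pivot_threshold
    is replaced by +/- pivot_threshold instead of triggering a permutation.
    """
    if pivot_threshold <= 0:
        raise ValueError("pivot_threshold must be positive")
    n = len(a)
    lu = [[float(x) for x in row] for row in a]  # work on a copy
    replaced = 0
    for k in range(n):
        if abs(lu[k][k]) < pivot_threshold:
            lu[k][k] = pivot_threshold if lu[k][k] >= 0 else -pivot_threshold
            replaced += 1
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, replaced


if __name__ == "__main__":
    print(lu_static_pivoting([[4, 3], [6, 3]], 1e-8))
    # A zero pivot is perturbed rather than permuted away.
    print(lu_static_pivoting([[0, 1], [1, 0]], 1e-8))
```

Because the elimination order never changes at run time, the data structures and communication pattern can be fixed in advance; the price is that a perturbed pivot yields the factorization of a nearby matrix, which is typically compensated by iterative refinement.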

Adaptation to NUMA architectures. New supercomputers incorporate many microprocessors, which themselves include one or many computational cores. These new architectures induce strongly hierarchical topologies, and are called NUMA architectures. In the context of distributed NUMA architectures, work has begun, in collaboration with the INRIA RUNTIME team, to study optimization strategies and to improve the scheduling of communications, threads and I/O. Sparse direct solvers are a basic building block of many numerical simulation algorithms. We propose to introduce into the PaStiX solver a dynamic scheduling scheme designed for NUMA architectures. The data structures of the solver, as well as its communication patterns, have been modified to meet the needs of these architectures and of dynamic scheduling. We are also interested in the dynamic adaptation of the computation grain, in order to use multi-core architectures and shared memory efficiently. Experiments on several numerical test cases have been performed to demonstrate the efficiency of the approach on different architectures. M. Faverge defended his Ph.D. [1] on these aspects in the context of the NUMASIS ANR CIGC project.

Hybrid direct-iterative solver based on a Schur complement approach.

In HIPS, we propose several algorithmic variants to solve the Schur complement system that can be adapted to the geometry of the problem: typically, some strategies are more suitable for systems coming from the discretisation of a 2D problem and others for a 3D problem; the choice of the method also depends on the numerical difficulty of the problem. We have a parallel version of HIPS that provides fully iterative methods as well as hybrid methods that mix a direct factorization inside the domains and an iterative method on the Schur complement.
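For reference, the Schur complement system mentioned above can be written as follows once the unknowns are split into interior unknowns (index I) and interface unknowns (index B). The notation is introduced here for illustration and is not taken from the HIPS documentation.

```latex
\[
\begin{pmatrix} A_{II} & A_{IB} \\ A_{BI} & A_{BB} \end{pmatrix}
\begin{pmatrix} x_I \\ x_B \end{pmatrix}
=
\begin{pmatrix} b_I \\ b_B \end{pmatrix},
\qquad
S = A_{BB} - A_{BI} A_{II}^{-1} A_{IB},
\]
\[
S\, x_B = b_B - A_{BI} A_{II}^{-1} b_I,
\qquad
x_I = A_{II}^{-1} \left( b_I - A_{IB}\, x_B \right).
\]
```

In a hybrid method, the applications of the inverse of A_II rely on a direct factorization of the interior blocks, while the system in S is solved iteratively with a preconditioner.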

In [34], we presented a hybrid version of the solver in which the Schur complement preconditioner was built using a parallel scalar ILUT algorithm. This year we have also developed a parallel version of the algorithms in which the incomplete factorization of the Schur complement is done using a dense block structure. That is to say, there is no additional term dropping in the Schur complement preconditioner other than that prescribed by the block pattern defined by the HID graph partitioning. This variant of the preconditioner is more expensive in terms of memory, but for some difficult test cases it is the only alternative to direct solvers. A general comparison of all the hybrid methods in HIPS has been presented in [35].

This year, J. Gaidamour defended his Ph.D. [33] on the hybrid solver techniques developed in HIPS.

MURGE, a common interface to sparse linear solvers.

This year we have also defined a general programming interface for sparse linear solvers. Our goal is to standardize the API of sparse linear solvers and to provide very simple ways of performing tedious tasks such as parallel matrix assembly. We have thus proposed a generic API specification called MURGE ( ) and also provided some test programs and documentation. This interface has been implemented in HIPS and PaStiX. We have also tested this interface in FluidBox and JOREK, for both HIPS and PaStiX.

