Section: New Results
Keywords: sparse matrix ordering, parallel sparse matrix ordering, graphs and irregular meshes, scalable parallel graph partitioning, high-performance computing, direct and hybrid direct-iterative solvers, parallel fast multipole methods.
Algorithms and high-performance solvers
Participants : Patrick Amestoy, Olivier Beaumont, Cédric Chevalier, Olivier Coulaud, Mathieu Faverge, Pierre Fortin, Jérémie Gaidamour, Abdou Guermouche, Pascal Hénon, François Pellegrini, Pierre Ramet, Jean Roman.
Parallel domain decomposition and sparse matrix reordering
The work carried out within the Scotch project (see section 5.8) was focused on the design and coding of the parallel graph ordering routines which are the first building blocks of the PT-Scotch (for ``Parallel Threaded Scotch'') parallel graph partitioning and ordering library. PT-Scotch is being designed to handle graphs of up to a billion vertices on architectures of a thousand processors. This work is carried out in the context of the PhD thesis of Cédric Chevalier, who started in October 2004 and will graduate by September 2007. In order to achieve efficient and scalable parallel graph partitioning, it is necessary to implement a parallel multi-level framework in which distributed graphs are collapsed down to a size that can be handled by a single processor, on which a sequential partition is computed by means of the existing Scotch tool; this coarse solution is then prolonged and refined, level by level, until a partition of the original distributed graph is obtained.
The multi-level framework that we have developed is as follows. Distributed graphs are coarsened using an asynchronous multi-threaded algorithm, and coarsened graphs are folded onto one half of the processors, while the other half receives a similar copy. The coarsening and folding process then goes on independently on each of the halves, then quarters, eighths, and so on, of the processors, so that every processor eventually holds a different coarsened graph, which provides for a better exploration of the problem space when initial partitions are sequentially computed on each of them. Then, the best solutions are unfolded and propagated back to the finer graphs, selecting the better of the two partitions at each unfolding stage. In order to keep partition quality at the finest levels, the best unfolded partitions have to be refined at every step by means of local optimization algorithms. The problem is that the best sequential local optimization algorithms do not parallelize well, as they are intrinsically sequential, and attempts to relax this strong sequential constraint can lead to severe loss of partition quality when the number of processors increases. However, as the refinement is local, we have overcome this difficulty by proposing an algorithm where the optimization is applied only to band graphs that contain vertices at a small fixed distance (typically 3) from the projected separators. What we have implemented is a multi-sequential approach: at every uncoarsening step, a distributed band graph is created. Centralized copies of this band graph are then created on every participating processor. These copies can be used collectively to run a scalable parallel multi-deme genetic optimization algorithm, or fully independent runs of a full-featured sequential optimization algorithm. The best refined band separator is projected back to the distributed graph, and the uncoarsening process goes on.
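The coarsening step of such a multi-level framework can be illustrated by the following minimal sketch: vertices are greedily matched along their heaviest incident edge and each matched pair is collapsed into one coarse vertex. This is only an illustration of the general multi-level idea, not PT-Scotch's actual asynchronous multi-threaded algorithm; the `coarsen` helper and its graph representation are hypothetical.

```python
def coarsen(adj):
    """One coarsening step by heavy-edge matching.

    adj: {vertex: {neighbor: edge_weight}} -> (coarse adj, fine-to-coarse mapping).
    """
    mate = {}
    for u in adj:
        if u in mate:
            continue
        # match u with its heaviest unmatched neighbor, if any
        candidates = [v for v in adj[u] if v not in mate and v != u]
        if candidates:
            v = max(candidates, key=lambda w: adj[u][w])
            mate[u], mate[v] = v, u
        else:
            mate[u] = u                      # u stays single
    # number the coarse vertices
    coarse_id, mapping = 0, {}
    for u in adj:
        if u not in mapping:
            mapping[u] = mapping[mate[u]] = coarse_id
            coarse_id += 1
    # accumulate fine edge weights between distinct coarse vertices
    coarse = {c: {} for c in range(coarse_id)}
    for u in adj:
        for v, w in adj[u].items():
            cu, cv = mapping[u], mapping[v]
            if cu != cv:
                coarse[cu][cv] = coarse[cu].get(cv, 0) + w
    return coarse, mapping
```

Applied recursively, this roughly halves the graph at each level until it is small enough for a sequential partitioner.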
All of these algorithms have been successfully implemented in PT-Scotch, which can compute in parallel orderings that are of the same, and even better, quality than the ones computed by the sequential Scotch. However, due to the amount of communication exchanged during the coarsening phase, scalability is still an issue to be addressed.
High-performance direct solvers on multiple platforms
In order to solve linear systems of equations coming from 3D problems with more than 10 million unknowns, which is now a reachable challenge for new SMP supercomputers, parallel solvers must keep good time scalability and must control the memory overhead caused by the extra structures required to handle communications.
Static parallel supernodal approach. Some experiments were run on the TERA parallel computer at CEA, and the factorization times are close to the ones obtained on the IBM SP3 of CINES (Montpellier, France). For example, on our largest problem (26 million unknowns for a 2.5D problem), we reach 500 Gflops on 768 processors, that is, about 50% of the peak performance of the TERA computer. In the context of new SMP node architectures, we proposed to fully exploit the advantages of shared memory. A relevant approach is then to use a hybrid MPI-thread implementation. This approach, not yet explored in the framework of direct solvers, aims at efficiently solving 3D problems with many more than 10 million unknowns. The rationale that motivated this hybrid implementation was that the communications within an SMP node can be advantageously substituted by direct accesses to shared memory between the processors of the SMP node using threads. In addition, the MPI communications between processes are grouped by SMP node. We have shown that this approach allows a great reduction of the memory required for communications. Many factorization algorithms are now implemented in real or complex arithmetic, in single or double precision: LLt (Cholesky), LDLt (Crout) and LU with static pivoting (for non-symmetric matrices having a symmetric pattern). This latter version is now integrated in the FluidBox software. A survey article on these techniques is under preparation and will be submitted to the SIAM Journal on Matrix Analysis and Applications; it will present the detailed algorithms and the most recent results. We still have to add numerical pivoting techniques to our processing to improve the robustness of our solver. Moreover, in collaboration with the MUMPS developers (see section 5.4), we want to adapt out-of-core techniques to overcome physical memory constraints.
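The static pivoting mentioned above can be sketched as follows: instead of exchanging rows (which would break the statically computed block structure), a pivot that is too small is replaced by a small perturbation proportional to the matrix norm. This is a dense toy version of the general technique, with an assumed threshold `eps`; it is not the solver's actual supernodal code.

```python
import numpy as np

def lu_static_pivoting(A, eps=1e-8):
    """Dense LU without row exchanges: tiny pivots are perturbed instead.

    Returns L (unit lower) and U packed in the same array.
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    norm = np.linalg.norm(A, 1)
    for k in range(n):
        if abs(A[k, k]) < eps * norm:
            # replace the unusable pivot by +/- eps * ||A||, keeping its sign
            A[k, k] = np.copysign(eps * norm, A[k, k] if A[k, k] != 0 else 1.0)
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A
```

The perturbation keeps the elimination order, and hence the communication pattern, fully static; the price is a factorization of a slightly perturbed matrix, which iterative refinement can usually compensate for.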
Dynamic parallel multifrontal approach. The memory usage of sparse direct solvers can be the bottleneck when solving large-scale problems involving sparse systems of linear equations of the form Ax=b. If memory is not large enough to treat a given problem, disks must be used to store the data that cannot fit in memory (out-of-core storage). In a previous work, we proposed a first out-of-core extension of a parallel multifrontal approach based on the solver MUMPS, where only the computed factors were written to disk during the factorization. This year, we have studied in detail the minimum memory requirements of this parallel multifrontal approach and proposed several mechanisms to decrease those memory requirements further. For a given amount of memory, we have also studied the volume of disk accesses involved during the factorization of the matrix A, providing insight on the extra cost that can be expected due to I/O. Furthermore, we have studied the impact of low-level I/O mechanisms, and have in particular shown the interest (and difficulty) of relying on direct I/O. Large-scale problems from applicative fields have been used to illustrate our discussions. This work is performed in the context of the PhD of Emmanuel Agullo, in collaboration with Jean-Yves L'Excellent (INRIA project GRAAL) and Patrick Amestoy (ENSEEIHT-IRIT). Once the factors are on disk, they have to be read back for the solution step. In order to improve that step, we collaborate with Tzvetomila Slavova (PhD at CERFACS), who focuses on it. For instance, we are currently designing an algorithm that schedules the I/O so as to separate the L and U factors on disk during the factorization step in the unsymmetric case: this will allow half as many reads at the solution step for unsymmetric matrices. A collaboration with Xiaoye S. Li and Esmond G. Ng (Lawrence Berkeley National Laboratory, Berkeley, California, USA) was started to compare the multifrontal factorization to the left-looking one.
We work with Tzvetomila Slavova (PhD at CERFACS) on the study of the performance of the out-of-core solution phases (forward and backward substitutions). In many applications, the solution phase can be invoked many times for a single factorization phase. In an out-of-core context, the solution phase can thus become even more costly than the factorization. As a first approach, one can rely on system buffers (or page cache) using a demand-driven scheme to access the disks. We have shown that this approach is not well suited, because the system cannot correctly "choose" the data that must be kept in memory. Furthermore, it is important to really control the amount of memory effectively used (system buffers included). Therefore, a version with direct I/O mechanisms (which does not use system buffers) has been introduced, and we have shown that its performance is comparable with that of the system approach, with the advantage of effectively controlling the memory usage (no system caches hidden from the application). In a parallel environment, we have also shown that the order in which the dependency tree is processed can have a very strong impact on performance, because of the (ir)regularity of the disk accesses involved. Finally, we proposed some heuristics that aim at influencing scheduling decisions taken during the solution step to ensure a high locality of disk accesses.
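The out-of-core principle described above can be reduced to a toy example: during the factorization, each computed factor column is written to disk and then forgotten, and during the forward substitution the columns are read back one at a time. The file layout and the two helpers below are hypothetical and purely illustrative of the idea, not of MUMPS's actual multifrontal out-of-core scheme.

```python
import numpy as np

def factor_to_disk(A, path):
    """Dense LU (no pivoting); each factor column is written to disk as
    soon as it is computed, so it need not stay in memory."""
    A = A.astype(float).copy()
    n = A.shape[0]
    with open(path, "wb") as f:
        for k in range(n):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
            A[:, k].copy().tofile(f)     # column k of L and U, then discard it

def forward_solve_from_disk(path, b):
    """Solve L y = b, reading one factor column from disk at a time."""
    y = b.astype(float).copy()
    n = y.size
    with open(path, "rb") as f:
        for k in range(n):
            col = np.fromfile(f, dtype=float, count=n)
            y[k+1:] -= col[k+1:] * y[k]  # use the unit-lower part of column k
    return y
```

In this toy the read order is fixed by the column order on disk; the heuristics mentioned above address precisely the choice of a tree traversal that keeps such disk accesses local.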
Finally, in the context of distributed NUMA architectures, work with the INRIA RUNTIME team has recently begun to study optimization strategies and to describe and implement communication, thread and I/O scheduling. Our solvers will use the Madeleine and Marcel libraries in order to provide an experimental application to validate those strategies. M. Faverge started a PhD in October 2006 to study these aspects in the context of the NUMASIS ANR CIGC project. Note that in the out-of-core context, new problems linked to the scheduling and the management of the computational tasks may arise (processors may be slowed down by I/O operations). Thus, we have to design and study specific algorithms for this particular context (by extending our work on scheduling for heterogeneous platforms).
High-performance hybrid direct-iterative solvers for large sparse systems
The resolution of large sparse linear systems is often the most time-consuming step in scientific applications. Parallel sparse direct solvers are now able to efficiently solve real-life three-dimensional problems with several million equations, but they remain limited by their memory requirements. On the other hand, iterative methods require less memory, but they often fail to solve ill-conditioned systems.
We have developed two approaches in order to find a trade-off between these two classes of methods.
In this work, we consider an approach which, we hope, will bridge the gap between direct and iterative methods. The goal is to provide a method that exploits the parallel blockwise algorithms used in the framework of high-performance sparse direct solvers. We want to extend these high-performance algorithms to develop robust parallel preconditioners, based on incomplete factorization, for iterative methods such as GMRES or the Conjugate Gradient.
Block ILU(k) preconditioner. The first idea is to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Such an incomplete factorization can take advantage of the latest breakthroughs in sparse direct methods and particularly should be very competitive in CPU time (effective use of the processors' power and good scalability) while avoiding the memory limitation encountered by direct methods. In this way, we expect to be able to solve systems with on the order of a hundred million unknowns, and even a billion unknowns. Another goal is to analyse and justify the parameters that can be used to define the block sparse pattern in our incomplete factorization.
The driving rationale for this study is that it is easier to incorporate incomplete factorization methods into direct solution software than it is to develop new incomplete factorizations. Our main goal at this point is to achieve a significant reduction of the memory needed to store the incomplete factors (with respect to the complete factors) while keeping enough fill-in to make the use of BLAS3 (in the factorization) and BLAS2 (in the triangular solves) primitives profitable.
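The scalar baseline that the blockwise approach improves on can be sketched in a few lines: ILU(0) performs the same eliminations as a full LU but only updates positions that are nonzero in A, so the factors take no more memory than the original matrix. This minimal dense-storage sketch illustrates the principle only; the preconditioner described above works on dense blocks with BLAS3 kernels, not entry by entry.

```python
import numpy as np

def ilu0(A):
    """ILU(0): LU restricted to the nonzero pattern of A (dense storage
    for simplicity; no fill-in is created outside the pattern)."""
    F = A.astype(float).copy()
    n = F.shape[0]
    pattern = A != 0
    for k in range(n):
        for i in range(k + 1, n):
            if pattern[i, k]:
                F[i, k] /= F[k, k]
                for j in range(k + 1, n):
                    if pattern[i, j]:             # drop updates outside the pattern
                        F[i, j] -= F[i, k] * F[k, j]
    return F
```

On a matrix with no zero entries, ILU(0) coincides with the exact LU factorization; on a sparse matrix, the dropped updates are exactly what makes the factorization approximate but cheap.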
We have shown the benefit of this approach over classic scalar implementations and also over direct factorizations. Indeed, on the AUDI problem (a reference 3D test case for direct solvers, with about one million unknowns), we are able to solve the system in half the time required by the direct solver while using only one tenth of the memory (for a relative residual precision of 10^-7). We now expect to improve the convergence of our solver, which fails on more difficult problems.
This research was included in an NSF/INRIA project and is carried out in collaboration with Yousef Saad (University of Minnesota, USA).
Recently, we have focused on the critical problem of finding approximate supernodes of ILU(k) factorizations. The problem is to find a coarser block structure of the incomplete factors. The ``exact'' supernodes that are exhibited from the non-zero pattern of the incomplete factors are usually very small, and thus the resulting dense blocks are not large enough for an efficient use of the BLAS3 routines. A remedy to this problem is to merge supernodes that have nearly the same structure. Preliminary results have shown the benefits of this new approach.
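The merging step can be sketched as follows: consecutive supernodes, each described by its set of row indices, are amalgamated when the extra explicit zeros introduced by taking their union stay below a tolerance relative to the merged structure. The similarity criterion and the `amalgamate` helper below are hypothetical illustrations of the idea, not the actual algorithm used.

```python
def amalgamate(structures, tol=0.25):
    """Greedily merge consecutive supernodes with similar row structures.

    structures: list of sets of row indices, one set per supernode.
    Two neighbors are merged when the number of zero entries added by the
    merge is at most tol times the size of the merged structure.
    """
    merged = [structures[0]]
    for s in structures[1:]:
        prev = merged[-1]
        union = prev | s
        extra = len(union - prev) + len(union - s)  # explicit zeros created
        if extra <= tol * len(union):
            merged[-1] = union                      # amalgamate
        else:
            merged.append(s)                        # keep the supernode separate
    return merged
```

A larger tolerance yields larger dense blocks (better BLAS3 efficiency) at the cost of storing and computing on more explicit zeros, which is exactly the trade-off studied above.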
Hybrid direct-iterative solver based on a Schur complement approach. In recent years, a few incomplete LU factorization techniques have been developed with the goal of combining some of the features of standard ILU preconditioners with the good scalability features of multi-level methods. The key feature of these techniques is to reorder the system in order to extract parallelism in a natural way. Often, a number of ideas from domain decomposition are utilized and mixed to derive parallel factorizations.
Under this framework, we have developed, in collaboration with Yousef Saad (University of Minnesota), algorithms that generalize the notions of ``faces'' and ``edges'' of the ``wire-basket'' decomposition. The interface decomposition algorithm is based on defining a ``hierarchical interface structure'' (HID). This decomposition consists in partitioning the set of unknowns of the interface into components called connectors, which are grouped in ``classes'' of independent connectors.
This year, in the context of robust preconditioning techniques, we have developed an approach that uses the HID ordering to define a new hybrid direct-iterative solver. The principle is to build a decomposition of the adjacency matrix of the system into a set of small subdomains (the typical size of a subdomain is a few hundred to a few thousand nodes) with overlap. We build this decomposition from the nested dissection separator tree obtained using a sparse matrix reordering software such as Scotch or METIS. Thus, at a certain level of the separator tree, the subtrees are considered as the interiors of the subdomains, and the union of the separators in the upper part of the elimination tree constitutes the interface between the subdomains.
The interiors of these subdomains are treated by a direct method. Solving the whole system is then equivalent to solving the Schur complement system on the interface between the subdomains, which has a much smaller dimension. We use the hierarchical interface decomposition (HID) to reorder and partition this system. Indeed, the HID gives a natural dense block structure of the Schur complement. Based on this partition, we define efficient block preconditioners that allow the use of BLAS routines and a high degree of parallelism thanks to the HID properties.
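The reduction to the interface can be written down explicitly: with interior unknowns I and interface unknowns G, eliminating the interior block A_II leaves the Schur complement S = A_GG - A_GI A_II^{-1} A_IG on the interface. The dense toy example below illustrates only this algebra; in the actual solver, A_II is factored by the direct method and never explicitly inverted.

```python
import numpy as np

def schur_complement(A, interior, interface):
    """S = A_GG - A_GI * A_II^{-1} * A_IG for the given index sets."""
    AII = A[np.ix_(interior, interior)]
    AIG = A[np.ix_(interior, interface)]
    AGI = A[np.ix_(interface, interior)]
    AGG = A[np.ix_(interface, interface)]
    return AGG - AGI @ np.linalg.solve(AII, AIG)
```

Solving S x_G = b_G - A_GI A_II^{-1} b_I on the interface, then back-substituting x_I = A_II^{-1} (b_I - A_IG x_G), reproduces the solution of the full system, which is what makes the (much smaller) interface system the only one that needs an iterative method.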
We propose several algorithmic variants to solve the Schur complement system that can be adapted to the geometry of the problem: typically, some strategies are more suitable for systems coming from a 2D problem discretization and others for a 3D problem; the choice of the method also depends on the numerical difficulty of the problem. For the moment, only a sequential version of these techniques has been implemented in a library (HIPS). It provides several methods to build an efficient preconditioner in many of these situations. It handles symmetric, unsymmetric, real or complex matrices. HIPS has been built on top of the PHIDAL library and thus also provides scalar preconditioners based on the multistage ILUT factorization. This work is the thesis subject of Jérémie Gaidamour, who started in October 2006.
Parallel fast multipole method
The Fast Multipole Method (FMM) is a hierarchical method which computes interactions for the N-body problem in O(N) time for any given precision. In order to compute energy and forces on large systems, we need to improve the computation speed of the method.
This has been achieved thanks to a matrix formulation of the main operator in the far-field computation: this matrix formulation is implemented with BLAS routines (Basic Linear Algebra Subprograms). Even though it is straightforward to use level 2 BLAS (corresponding to matrix-vector operations), the use of level 3 BLAS (corresponding to matrix-matrix operations) is interesting because it is much more efficient. So, thanks to a careful data memory storage, we have rewritten the algorithm in order to use level 3 BLAS, thus greatly improving the overall runtime. Other enhancements of the Fast Multipole Method, such as the use of the Fast Fourier Transform, of ``rotations'' or of plane wave expansions, reduce the theoretical operation count. Comparison tests have shown that, for the precisions required in astrophysics or in molecular dynamics, our approach is either faster (compared to rotations and plane waves) or as fast and without any numerical instability (compared to the FFT-based method), hence justifying our BLAS approach. These results have been submitted for publication. Our BLAS version has then been extended to non-uniform distributions, requiring a new octree data structure, the octree with indirection, which is efficient for both uniform and non-uniform distributions. We have also designed an efficient algorithm that detects uniform areas in structured non-uniform distributions, since these areas are more suitable for BLAS computations. An efficient parallel code of our BLAS version has finally been developed and validated on shared and distributed memory architectures.
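The level 2 versus level 3 reformulation can be illustrated on the far-field transfer step: applying the same transfer operator T to many multipole expansions one by one is a series of matrix-vector (level 2) products, while stacking the expansions as the columns of a matrix turns it into a single matrix-matrix (level 3) product. The operator, expansion size and counts below are arbitrary stand-ins; this only illustrates the restructuring, not the actual FMM operators.

```python
import numpy as np

p = 16                                   # expansion size (arbitrary for the example)
rng = np.random.default_rng(1)
T = rng.standard_normal((p, p))          # one shared far-field transfer matrix
expansions = [rng.standard_normal(p) for _ in range(64)]

# level 2 BLAS style: one matrix-vector product per expansion
local_l2 = [T @ m for m in expansions]

# level 3 BLAS style: a single matrix-matrix product on the stacked expansions
local_l3 = T @ np.column_stack(expansions)
```

Both variants compute the same local expansions; the level 3 form is what the careful memory layout mentioned above makes possible, and it runs much closer to peak because GEMM reuses T across all columns from cache.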
We now plan to improve our parallelization thanks to a hybrid MPI-thread programming and to integrate our FMM in complete codes for real simulations in astrophysics and in molecular dynamics.