Section: New Results
Parallel Sparse Direct Solvers
This year, we have continued our work to add functionalities to and improve the Mumps software package, with strong interactions and informal collaborations with many users who provide challenging problems and help us validate and improve our algorithms: (i) industrial teams who experiment with and validate our package, (ii) members of research teams with whom we discuss desired future functionalities, (iii) designers of finite element packages who integrate Mumps as a solver for their internal linear systems, (iv) teams working on optimization, and (v) physicists, chemists, etc., in various fields where robust and efficient solution methods are critical to their simulations. In all cases, we validate our research and algorithms on large-scale industrial problems, coming either directly from Mumps users or from publicly available collections of sparse matrices (the Davis collection, Rutherford-Boeing and PARASOL).
In the context of the Solstice project, funded by the ANR (French Research Agency), we have improved the algorithms for null pivot detection and made them available in Mumps 4.8.4. This work is the object of collaborations with other partners of the project, including Cerfacs and EDF, the latter being one of the main users of this functionality. We also worked closely with Bora Uçar (Cerfacs) and Patrick Amestoy (ENSEEIHT-IRIT) on parallel scaling algorithms and their integration into Mumps. After feedback from several users on the version integrated in Mumps 4.8.0, we developed an improved parallel algorithm that accelerates the scaling phase, and we tuned its default numerical behavior. The improved version is included in Mumps 4.8.4. We collaborated with Luc Giraud (ENSEEIHT-IRIT) on hybrid direct-iterative solvers, providing a direct solver with the ability to return a Schur complement, which is then used within an iterative scheme based on domain decomposition. We also collaborate with ENSEEIHT-IRIT on an expertise site for sparse direct solvers, called GRID TLSE. The site has been used extensively to exchange test problems and sparse matrices with users of sparse direct solvers. The goal is also to provide scenarios allowing users to experiment with various combinations of algorithms and solvers on their typical test problems. More information can be obtained from http://gridtlse.org .
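To illustrate the kind of computation performed by a scaling phase, the following is a minimal serial sketch of an iterative row/column equilibration (a simplified Ruiz-style scheme; the function name and iteration count are ours for illustration, and the parallel algorithm integrated in Mumps is considerably more elaborate):

```python
import numpy as np

def equilibrate(A, iters=30):
    """Serial, simplified Ruiz-style equilibration: alternately scale
    rows and columns by the square root of their largest absolute
    entry, so that all row and column infinity-norms tend to 1.
    Returns the scaled matrix and the accumulated scalings, such that
    diag(dr) @ A @ diag(dc) equals the scaled matrix."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    dr, dc = np.ones(m), np.ones(n)
    for _ in range(iters):
        r = np.sqrt(np.abs(A).max(axis=1))
        c = np.sqrt(np.abs(A).max(axis=0))
        r[r == 0] = 1.0   # leave all-zero rows/columns untouched
        c[c == 0] = 1.0
        A = A / r[:, None] / c[None, :]
        dr /= r
        dc /= c
    return A, dr, dc
```

On a badly scaled matrix such as [[1000, 1], [2, 0.001]], a few tens of iterations bring every row and column maximum close to 1, which typically improves the numerical behavior of the subsequent factorization.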
In the following paragraphs, we give details on the work performed towards an efficient out-of-core factorization and a parallel analysis phase, two critical aspects when dealing with large sparse matrices on limited-memory computers.
Although factorizing a sparse matrix is a robust way to solve large sparse systems of linear equations, such an approach is costly in terms of both computation and storage. When the storage required to process a matrix exceeds the amount of memory available on the platform, so-called out-of-core approaches have to be employed: disks extend the main memory to provide enough storage capacity. A first robust approach that stores factors on disk is now officially available within the Mumps package. Further research was done in the context of the PhD of Emmanuel Agullo, in which we investigated both theoretical and practical aspects of such out-of-core factorizations. The Mumps and SuperLU software packages have been used to illustrate the difficulties on real-life matrices. First, we proposed and studied various out-of-core models that aim at limiting the overhead due to data transfers between memory and disks on uniprocessor machines. To do so, we revisited the algorithms that schedule the operations of the factorization and proposed new memory management schemes that fit out-of-core constraints. Then we focused on a particular factorization method, the multifrontal method, which we pushed as far as possible in a parallel out-of-core context with a pragmatic approach. We have shown that out-of-core techniques make it possible to solve large sparse linear systems efficiently, and that special attention must be paid to low-level I/O mechanisms; in particular, we have shown that system I/Os have several drawbacks, which can be avoided by using direct I/Os together with an asynchronous approach at the application level.
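As a toy illustration of the principle, the sketch below factorizes a dense matrix while writing each factor panel to disk as soon as it is final, then streams the panels back from disk during the solve. The names and file layout are hypothetical; the actual out-of-core multifrontal code operates on sparse frontal matrices and uses asynchronous and direct I/O:

```python
import os
import numpy as np

def ooc_lu(A, dirpath):
    """Toy out-of-core LU (no pivoting): each column of L and row of U
    is written to disk as soon as it is final, so it could be freed
    from memory immediately afterwards."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n):
        A[k + 1:, k] /= A[k, k]                           # L column k
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
        np.save(os.path.join(dirpath, f"lcol{k}.npy"), A[k + 1:, k])
        np.save(os.path.join(dirpath, f"urow{k}.npy"), A[k, k:])

def ooc_solve(n, b, dirpath):
    """Forward and backward substitution, streaming the factor panels
    back from disk one at a time."""
    x = np.array(b, dtype=float)
    for k in range(n):                                    # solve L y = b
        lcol = np.load(os.path.join(dirpath, f"lcol{k}.npy"))
        x[k + 1:] -= lcol * x[k]
    for k in reversed(range(n)):                          # solve U x = y
        urow = np.load(os.path.join(dirpath, f"urow{k}.npy"))
        x[k] = (x[k] - urow[1:] @ x[k + 1:]) / urow[0]
    return x
```

This sketch assumes the matrix can be factorized without pivoting (e.g. it is diagonally dominant); handling numerical pivoting while factors reside on disk is part of what makes the real problem difficult.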
When only the factors are stored on disk, particular attention must be paid to temporary data, which remain in core memory. We have therefore started to rethink the whole schedule of the out-of-core parallel factorization with the objective of achieving high scalability of the core memory usage.
This work was done in close collaboration with Abdou Guermouche (Université de Bordeaux and LaBRI). Work on supernodal methods and SuperLU was done in collaboration with Xiaoye S. Li (Lawrence Berkeley National Laboratory, Berkeley, USA).
Although the analysis phase of a sparse direct solver only involves symbolic computations, it may still induce considerable computational and memory requirements. The parallelization of the analysis phase can thus provide significant benefits to the solution of large-scale problems and represents an essential feature on computer systems with limited memory capabilities. The core of the analysis phase consists of two main operations:
1. Elimination tree computation: this step provides a pivotal order that reduces the fill-in generated at factorization time and identifies independent computational tasks that can be executed in parallel.
2. Symbolic factorization: this step simulates the actual factorization in order to estimate the memory that has to be allocated for the factorization phase.
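For the symmetric case, these two operations can be sketched in a few lines (a serial illustration only: etree follows Liu's classical elimination tree algorithm with path compression, and symbolic merges child column structures up the tree):

```python
def etree(n, adj):
    """Elimination tree of an n x n symmetric matrix whose off-diagonal
    pattern is given by adjacency lists adj (Liu's algorithm with path
    compression).  parent[v] is the parent of column v, -1 for a root."""
    parent = [-1] * n
    ancestor = [-1] * n
    for j in range(n):
        for i in adj[j]:
            while i != -1 and i < j:      # climb from i towards j
                nxt = ancestor[i]
                ancestor[i] = j           # path compression
                if nxt == -1:
                    parent[i] = j
                i = nxt
    return parent

def symbolic(n, adj, parent):
    """Symbolic Cholesky factorization: the below-diagonal structure of
    column j of L is the pattern of A's column j merged with the
    structure of every child column (minus the child index itself)."""
    struct = [set(i for i in adj[j] if i > j) for j in range(n)]
    for j in range(n):                    # children come before parents
        p = parent[j]
        if p != -1:
            struct[p] |= struct[j] - {p}
    return struct
```

For instance, on the 4 x 4 pattern with off-diagonal entries (0,1), (1,2), (2,3) and (0,3), the elimination tree is the chain 0 -> 1 -> 2 -> 3 and the symbolic factorization predicts a fill-in entry at position (3,1).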
In our approach, we can use either PT-Scotch or ParMETIS for step 1; these packages return an index permutation and a separator tree resulting from a nested dissection ordering. Based on this tree, we first select a number of subtrees equal to the number of working processors, and perform a symbolic factorization on each of these subtrees that relies on quotient graphs to limit memory consumption. Once every processor has finished with its subtree, the symbolic factorization of the unknowns associated with the top part of the tree is performed sequentially; because quotient graphs are used on entry, this still allows for good overall memory scalability. We have observed that although PT-Scotch is slower than ParMETIS, the quality of the ordering it provides is considerably better and does not degrade with the degree of parallelism.
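The subtree-selection step can be sketched as follows (a toy version that assumes the separator tree is given as children lists and deliberately ignores load balance between subtrees):

```python
def select_subtrees(children, root, nproc):
    """Expand the separator tree from the root until at least nproc
    independent subtrees are available.  Returns the subtree roots
    (one per processor, to be processed in parallel) and the expanded
    top nodes (to be processed sequentially afterwards)."""
    frontier, top = [root], []
    while len(frontier) < nproc:
        internal = [v for v in frontier if children[v]]
        if not internal:          # only leaves left: cannot split more
            break
        v = internal[0]
        frontier.remove(v)
        top.append(v)
        frontier.extend(children[v])
    return frontier, top
```

On a complete binary separator tree of depth 2 with 4 processors, this selects the four leaf subtrees and leaves the root and the two first-level separators for the sequential top part.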
This work was presented in  . We now plan to include it in Mumps and make it available in a future release of the package.