## Section: New Results

Keywords : sparse matrices, direct solvers, multifrontal method, scheduling, memory, out-of-core.

### Parallel Sparse Direct Solvers

Participants : Emmanuel Agullo, Aurélia Fèvre, Jean-Yves L'Excellent.

#### Extension, support and maintenance of the software package MUMPS

This year, we have pursued some work to add functionalities and improve the Mumps software package. The out-of-core functionality (in which computed factors are written to disk) has been made available to a number of users (Samtech S.A., Free Field Technologies, EADS, ...) and we are taking into account their feedback to design and develop a new out-of-core version. As usual, we have had strong interactions with many users, and this has led us to work on the following points:

Reduction of the memory requirements of the solution step. By modifying our algorithms and storing parts of solutions in a datastructure that follows the tree structure, some workarrays of fixed size could be reduced and now scale with the number of processors. This was critical for some applications (e.g., seismic imaging at Geosciences Azur), where the memory usage for the solution step could lead to system paging when using many right-hand sides.

Build a reduced right-hand side and use a partial solution. This new functionality can be used in conjunction with the Schur complement/partial factorization functionality. The need for it arised during the Mumps Users' day (24 October 2006, ENS Lyon). It is particularly useful in domain decomposition methods or in coupled problems, where we can distinguish interior and interface variables. The idea is that when Mumps works on the interior variables, it returns to the user a reduced problem (Schur complement) but now also a reduced right-hand side corresponding to the interface variables. The calling application uses this information to build the solution on the interface problem, which can be reinjected to Mumps in order to build the solution on the interior problem. Note that this implies that forward and backward substitution steps may now be called separately.

Detection of null pivots. A first version was developed that allows the detection of null (tiny) pivots and the treatment of singular problems.

Various more minor improvements and bug corrections.

While the above developments are part of Mumps 4.7.3, the latest release, we have also been working on research work (see the following subsections) that will affect future releases of Mumps . For instance, as part of the ANR Solstice project, we are currently working on coupling parallel scalings algorithms developed by Bora Ucar (CERFACS) with Mumps .

Furthermore, in the context of an Inria ODL (“Opération de Développement Logiciel”), Aurélia Fèvre worked on several software issues: portability to various platforms including Grid'5000, tools for checking the performance from one release to the next, extension of the non-regression tests running each night, as well as code coverage with associated tests, in order to validate existing code and identify dead parts.

Notice that Mumps users constantly provide us with new challenging problems to solve and help us validate and improve our algorithms. We have informal collaborations around Mumps with a number of institutions: (i) industrial teams which experiment and validate our package, (ii) members of research teams with whom we discuss future functionalities wished, (iii) designers of finite element packages who integrate Mumps as a solver for the internal linear systems, (iv) teams working on optimization, (v) physicists, chemists, etc., in various fields where robust and efficient solution methods are critical for their simulations. In all cases, we validate all our research and algorithms on large-scale industrial problems, either coming directly from Mumps users, or from publically available collections of sparse matrices (Davis collection, Rutherford-Boeing and PARASOL).

#### Memory usage and asynchronous communications

The memory usage for the management of asynchronous communications can represent a large portion of the total memory of the parallel multifrontal approach. This is especially true if we envisage out-of-core executions, where part of the other required storage is on disk, but even for in-core executions, that memory is generally far from being negligible. In the algorithm used, data are copied to a buffer before being sent; this is necessary because sent data are not contiguous in memory and we need to reuse that memory in order to perform other tasks during the asynchronous communications. The granularity for the send operations depends on the tasks graph, and one message (containing a block of matrix) is sent from one task to the one that depends on it. The work consisted in dividing such messages into several smaller ones, at the cost of possibly stronger synchronizations: when communication buffers are full (and one process cannot send all the data it needs to send), it sends as much as it can. Because the receiver may be in a similar situation (trying to send but without enough space in its send buffer), we need a mechanism to avoid deadlock. This mechanism is the following: when a process is blocked not being able to send as much as it can, it tries to perform receptions and process messages (slave tasks, assemblies, ...), thus freeing some memory in the buffers of other processes.

This work should have a strong impact on the overall memory usage of parallel sparse direct methods like Mumps as it will allow to process significantly larger problems both in-core and out-of-core. We still have to study more precisely its impact on performance before making it available in a future release.

#### Computing the nullspace of large sparse matrices

In the context of the
LU or
LDL^{t} factorization of large sparse matrices, we have developed a new approach allowing to (i) detect (pseudo-) zero pivots; (ii) delay the factorization of suspect (small) pivots, on which a specific numerical treatment (rank-revealing
QR for example) is performed. Furthermore, once this has been done, an approximation of the null space basis is computed thanks to backward substitutions. A first prototype including these features has been developed within
Mumps . A numerical study of the behaviour and the limits of this approach is currently being done in collaboration with
Cerfacs on test problems provided by EDF, in order to validate the approach, tune it and understand its limits. This type of algorithms is particularly useful in electromagnetism (where the dimension of the nullspace can be large), and in FETI-like
domain decomposition methods (where the dimension of the nullspace is typically smaller than 6).

#### Serial out-of-core factorization

Because of the large memory requirements of sparse direct solvers, the use of out-of-core approaches is mandatory for large-scale problems, where disks are used to extend the core memory. In this context, left-looking and multifrontal methods are the two most widely used approaches. Even though several algorithms have been proposed for both methods, it is not yet obvious which one best fits an out-of-core context. Noticing that there was still room before reaching the intrinsic limits of each method, the natural first step has been to study each method separately in order to improve their respective efficiency.

Concerning the multifrontal method, we have modelled the problem of the minimization of the I/O volume
[Oops!] in the classical (
*terminal allocation scheme* ) case. We have proposed several algorithms and possible associated memory managements for 3 different assembly schemes, including a new assembly scheme that we specifically designed for an out-of-core context. We have then extended this minimization problem
to the more general
*flexible parent allocation* algorithm
[9] : we have explained
[53] how to compute the I/O volume with respect to this scheme and proposed an efficient greedy heuristic
which further reduces the I/O volume on real-life problems in this new context. These new algorithms should allow to improve the new generation of serial out-of-core multifrontal codes based on the flexible allocation scheme (such as
[71] ), as well as the serial parts of parallel codes
[Oops!] .

Thanks to a collaboration with Xiaoye S. Li and Esmond G. Ng (Lawrence Berkeley National Laboratory, Berkeley, California, USA), the I/O volume minimization problem for unsymmetric supernodal methods has also been studied. This was done in the context of a six-month visit of Emmanuel Agullo to Berkeley from June to November 2007, which followed a first visit of fifteen days in September 2006. Contrary to multifrontal methods for which locality is quite natural, one difficulty of supernodal approaches consists in designing a correct scheme respecting some locality constraints, before intending to minimize the I/O volume under these constraints. We have selected two such different schemes from the state-of-the-art: (i) a left-looking method, for which we have designed a prototype simulating the out-of-core behaviour (the actual I/Os are not performed) by extending the SuperLU in-core code; and (ii) a left-looking/right-looking hybrid method for which we have computed the subsequent I/O volume by instrumenting the previous prototype. Note that the implementation of the latter method would require to write the whole kernel of computations and is out of the scope of our short-term perspectives. For both approaches, the I/O volume depends on the partition of the supernodal elimination tree. We have proposed an optimal I/O minimization partitioning algorithm for the left-looking approach. We have shown that this problem is NP-complete in the hybrid method case for which we have thus proposed and implemented a new heuristic. This is a work in progress: we are currently doing some experiments to assess the potential of these algorithms.

Now each method has been accurately modelled and improved to further fit out-of-core requirements, the next step will consist in comparing multifrontal and supernodal approaches both through simulations of the I/O volume and experiments on large real-life problems.

#### Parallel out-of-core factorization

As well as reducing the factorization time, parallelism allows one to further decrease the memory requirements of direct solvers. We have designed (by improving a previous prototype
[52] ) a robust parallel out-of-core multifrontal solver to solve very large sparse linear systems (several
million equations and several hundred million nonzeros). This solver processes the factors out-of-core (but not temporary data). We have shown
[Oops!] how the low-level I/O mechanisms impact the performance and have designed a low-level I/O layer that
avoids the perturbations introduced by system buffers and allows consistently good performance results. This out-of-core solver is publicly available and is already used by several academic and industrial groups. To go significantly further in the memory reduction, it will be interesting to
also store the intermediate working memory on disk. However, before that, two critical issues are being addressed: the first one concerns the memory consumption of
*communication buffers* (see the corresponding paragraph above); the second one concerns the memory for
*I/O buffers* used in the case of asynchronous direct I/Os. To reduce that memory usage, we have worked on reducing the granularity of I/Os, now done panel by panel, in order to allow for almost arbritrarily small buffers. This work is performed in the context of the PhD of Emmanuel
Agullo, in collaboration with Abdou Guermouche (LaBRI and
Inria project ScAlAppliX) and Patrick Amestoy (ENSEEIHT-IRIT).

Once the factors are on disk, they have to be read back for the solution step. In order to improve that step, we collaborate with Tzvetomila Slavova (Ph.D. Cerfacs ) who focuses on this phase of the computation. For instance we are currently designing an algorithm which allows to schedule the I/O in a way that separates the L and U factors on disk during the factorization step in the unsymmetric case: this allows to perform twice less read operations at the solution step.

#### Hybrid Direct-Iterative Methods

We collaborate with Haidar Azzam (Ph.D., Cerfacs ) and Luc Giraud (ENSEEIHT-IRIT) on hybrid direct-iterative solvers. The substructuring methods developed in this context rely on the possibility to compute a partial factorization, with a so-called Schur complement matrix, that is typically computed by a direct solver such as Mumps . The direct solver is called on each subdomain of a physical mesh, and the iterative approach takes care of the interface problem, based on Schur complements provided by our direct solver. We have been working on tuning this functionality and giving advice on how to best exploit the direct solver in the context of such iterative approaches.

#### Expertise site for sparse direct solvers (GRID TLSE project)

The GRID TLSE project (see [51] ), coordinated by ENSEEIHT-IRIT, is an expertise site providing a one-stop shop for users of sparse linear algebra software. This project was initially funded by the ACI Grid (2002-2005). A user can access matrices, databases, information and references related to sparse linear algebra, and can also obtain actual statistics from runs of a variety of sparse matrix solvers on his/her own problem. Each expertise request leads to a number of elementary requests on a Grid of computers for which the Diet middleware developed by Graal is used. Mumps is one of the packages interfaced within the project and that a user will be able to experiment through GRID TLSE. This year, part of the site was opened to the public: functionalities to share sparse matrices and bibliographic tools. For instance, test problems related to the Solstice project are shared among Solstice partners thanks to the TLSE site. Much work has also been done this year (mainly at ENSEEIHT-IRIT) to develop and validate the missing functionalities and be able to define expertise scenarios, with the goal to open the site to the public sometime in 2008.