## Section: New Results

### Parallel Sparse Direct Solvers and Combinatorial Scientific Computing

Participants : Maurice Brémond, Indranil Chowdhury, Guillaume Joslin, Jean-Yves L'Excellent, Bora Uçar.

#### Extension, support and maintenance of the software package MUMPS

This year, we continued to add functionality to the MUMPS software package and to improve it. For example, the parallel analysis that we worked on last year is now available by default in the public releases of the package, and some of the research work on out-of-core issues has been integrated and validated. As usual, we have had strong interactions with many users (e.g., in the context of the Samtech or Solstice projects, but also through informal collaborations), and this has led us to work on the following points: (i) 64-bit integers to address larger memories; (ii) improved load balance and better scalability on specific classes of matrices from EDF and from the French-Israeli Multicomputing project; (iii) a more flexible interface from the memory usage point of view, compatibility of compressed orderings with more ordering packages, and various performance improvements and bug fixes.

To conclude this section, note that a technological development action funded by Inria (ADTMUMPS) has just started. It should significantly help to improve software engineering aspects, documentation, and the developers' tools for validating and experimenting with the package.

#### Multithreading

The aim of this recently started work is to multithread several parts of the MUMPS solver by adding an OpenMP layer on top of the existing MPI implementation, in order to use modern multi-core machines more effectively. In addition, threaded BLAS libraries (such as Goto, MKL and ACML) can provide significant speedups for the BLAS operations within MUMPS. Pure shared-memory codes do not exhibit linear speedups as the number of cores increases; they tend to saturate. The idea here is to increase parallelism by combining OpenMP with the MPI processes. So far, we have identified the bottlenecks of the serial code using profilers such as TAU and VTUNE, and have experimented with OpenMP directives to improve the utilization of the available cores. A significant speedup was observed during the assembly and memory stacking phases; other key areas, such as pivot search and solution operations, are still being investigated. We have found that the mixed MPI and OpenMP strategy works well for large unsymmetric cases, where we achieve a speedup of almost 6 using 8 cores. For symmetric cases, however, a pure MPI run on the available cores seems to perform best, for reasons that are not yet clear. In the next few months we intend to wrap up the OpenMP work and set guidelines for users who are interested in a shared-memory implementation of MUMPS.
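To put the reported figure in perspective, the short sketch below computes the parallel efficiency and the Karp-Flatt serial fraction for a given speedup. The input values are an assumption: the text reports a speedup of "almost 6" on 8 cores, so the round figure 6 is only an upper estimate, and the function names are illustrative.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of the ideal (linear) speedup that is actually obtained."""
    return speedup / cores


def karp_flatt_serial_fraction(speedup, cores):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


# Assumed round figures: the report only states "almost 6" on 8 cores.
speedup, cores = 6.0, 8
print("efficiency:", parallel_efficiency(speedup, cores))
print("serial fraction:", karp_flatt_serial_fraction(speedup, cores))
```

With these assumed inputs the efficiency is 0.75 and the estimated serial fraction is about 0.048. A serial fraction that grows with the number of cores would point to parallel overhead rather than to inherently sequential code.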

#### Exact algorithms for a task assignment problem

We consider the following task assignment problem. Communicating tasks are to be assigned to heterogeneous processors interconnected with a heterogeneous network. The objective is to minimize the total sum of the execution and communication costs. The problem is NP-hard. We present an exact algorithm based on the well-known A* search. We report simulation results over a wide range of parameters where the largest solved instance contains about three hundred tasks to be assigned to eight processors.
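As an illustration of how A* search applies to this problem, here is a minimal sketch. It is not the algorithm of the paper: the cost model (`exec_cost`, `comm_vol`, `link_cost`), the fixed task order, and the simple lower bound are all assumptions made for the example.

```python
import heapq
from itertools import count


def astar_assign(exec_cost, comm_vol, link_cost):
    """Exact task assignment by A* over partial assignments.

    exec_cost[t][p]: cost of running task t on processor p.
    comm_vol[s][t]:  data volume exchanged between tasks s and t.
    link_cost[p][q]: per-unit communication cost between p and q
                     (0 when p == q). All costs must be nonnegative.
    Returns (total_cost, assignment), where assignment[t] is a processor.
    """
    n_tasks = len(exec_cost)
    n_procs = len(exec_cost[0])
    # Admissible heuristic: every unassigned task costs at least its
    # cheapest execution cost; communication is bounded below by zero.
    bound = [0] * (n_tasks + 1)
    for t in range(n_tasks - 1, -1, -1):
        bound[t] = bound[t + 1] + min(exec_cost[t])

    tie = count()  # tie-breaker so the heap never compares tuples of procs
    heap = [(bound[0], next(tie), 0, ())]
    while heap:
        _, _, g, assigned = heapq.heappop(heap)
        t = len(assigned)  # tasks are assigned in index order
        if t == n_tasks:
            return g, assigned  # first complete state popped is optimal
        for p in range(n_procs):
            comm = sum(comm_vol[s][t] * link_cost[assigned[s]][p]
                       for s in range(t))
            g2 = g + exec_cost[t][p] + comm
            heapq.heappush(
                heap, (g2 + bound[t + 1], next(tie), g2, assigned + (p,)))
    return None


# Hypothetical instance: 3 tasks, 2 processors.
exec_cost = [[4, 9], [8, 3], [5, 6]]
comm_vol = [[0, 2, 0], [2, 0, 1], [0, 1, 0]]
link_cost = [[0, 3], [3, 0]]
print(astar_assign(exec_cost, comm_vol, link_cost))
```

Because the search space is a tree and the heuristic never overestimates the remaining cost, the first complete assignment removed from the priority queue is optimal. A tighter lower bound would be needed to reach instances of the size reported above.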

#### On the block triangular form of symmetric matrices

We present some observations on the block triangular form (btf) of structurally symmetric, square, sparse matrices. If the matrix is structurally rank deficient, its canonical btf has at least one underdetermined and one overdetermined block. We prove that these blocks are transposes of each other. We further prove that the square block of the canonical btf, if present, has a special fine structure. These findings help us recover symmetry around the anti-diagonal in the block triangular matrix. The uncovered symmetry helps us permute the matrix into a special form that is symmetric about the main diagonal while exhibiting the blocks of the original btf. As the square block of the canonical btf has full structural rank, the observation relating to the square block also applies to structurally nonsingular, square, symmetric matrices.
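The notion of structural rank deficiency used above can be made concrete with a small sketch. The code computes the structural rank of a sparsity pattern as the size of a maximum bipartite matching between rows and columns, using plain augmenting paths. The function name and the example pattern are assumptions chosen for illustration; this is not the procedure used in the paper.

```python
def structural_rank(pattern, n_cols):
    """Size of a maximum row-column matching of a sparsity pattern.

    pattern[i] lists the column indices of the nonzeros in row i.
    """
    match_col = [-1] * n_cols  # match_col[j] = row matched to column j

    def augment(i, seen):
        # Depth-first search for an augmenting path starting at row i.
        for j in pattern[i]:
            if j not in seen:
                seen.add(j)
                if match_col[j] == -1 or augment(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    return sum(1 for i in range(len(pattern)) if augment(i, set()))


# Structurally symmetric 4x4 "star" pattern: nonzeros only in the first
# row and the first column, with a zero diagonal.
star = [[1, 2, 3], [0], [0], [0]]
print(structural_rank(star, 4))
```

The star pattern has structural rank 2 although it is 4-by-4. In its canonical btf, the underdetermined block is row 0 with columns 1 to 3, and the overdetermined block is rows 1 to 3 with column 0, so the two blocks are transposes of each other, in line with the result stated above.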

#### On two-dimensional sparse matrix partitioning: Models, methods, and a recipe

We consider two-dimensional partitioning of general sparse matrices for the parallel sparse matrix-vector multiply operation. We present three hypergraph-partitioning-based methods, each having unique advantages. The first one treats the nonzeros of the matrix individually and hence produces fine-grain partitions. The other two produce coarser partitions: one of them imposes a limit on the number of messages sent and received by a single processor, and the other trades that limit for a lower communication volume. We also present a thorough experimental evaluation of the proposed two-dimensional partitioning methods together with the hypergraph-based one-dimensional partitioning methods, using an extensive set of public domain matrices. Furthermore, for the users of these partitioning methods, we present a partitioning recipe that chooses one of the partitioning methods according to some matrix characteristics.
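The quantity that a fine-grain partition tries to minimize can be sketched as follows. Given an assignment of individual nonzeros to processors, the code counts the communication volume of y = Ax with the usual connectivity-minus-one metric. It assumes that each vector entry is owned by one of the processors holding a nonzero in the corresponding column (for x) or row (for y); the function name and the example data are hypothetical.

```python
from collections import defaultdict


def fine_grain_volume(nonzero_owner):
    """Communication volume of y = A x for a nonzero-level partition.

    nonzero_owner maps (row, col) -> processor id.
    Returns (expand, fold): words sent to distribute x entries and
    words sent to accumulate partial y results.
    """
    row_parts = defaultdict(set)
    col_parts = defaultdict(set)
    for (i, j), p in nonzero_owner.items():
        row_parts[i].add(p)
        col_parts[j].add(p)
    # A column spread over k processors needs k - 1 copies of x_j.
    expand = sum(len(parts) - 1 for parts in col_parts.values())
    # A row spread over k processors yields k - 1 partial sums to send.
    fold = sum(len(parts) - 1 for parts in row_parts.values())
    return expand, fold


# Hypothetical 3x3 pattern with six nonzeros on two processors.
owner = {(0, 0): 0, (0, 1): 0, (1, 1): 1,
         (1, 2): 1, (2, 0): 1, (2, 2): 0}
print(fine_grain_volume(owner))
```

In this example every column is split between the two processors, while only row 2 is, so the expand phase costs three words and the fold phase one.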

#### On the scalability of hypergraph models for sparse matrix partitioning

We investigate the scalability of the hypergraph-based sparse matrix partitioning methods with respect to the increasing sizes of matrices and number of nonzeros. We propose a method to rowwise partition the matrices that correspond to the discretization of two-dimensional domains with the five-point stencil. The proposed method obtains perfect load balance and achieves very good total communication volume. We investigate the behaviour of the hypergraph-based rowwise partitioning method with respect to the proposed method, in an attempt to understand how scalable the former method is. In another set of experiments, we work on general sparse matrices under different scenarios to understand the scalability of some other hypergraph-based partitioning methods.
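To illustrate how the total communication volume of a rowwise partition depends on the shape of the parts, the sketch below builds the five-point stencil pattern on an n-by-n grid and compares two simple rowwise partitions. Neither partition is claimed to be the method proposed in the paper; the strip and block layouts, the function names, and the convention that x_j lives with row j are assumptions for this example.

```python
def five_point_pattern(n):
    """Nonzero columns of each row; grid point (r, c) has index r*n + c."""
    rows = []
    for r in range(n):
        for c in range(n):
            cols = [r * n + c]
            if r > 0:
                cols.append((r - 1) * n + c)
            if r < n - 1:
                cols.append((r + 1) * n + c)
            if c > 0:
                cols.append(r * n + c - 1)
            if c < n - 1:
                cols.append(r * n + c + 1)
            rows.append(cols)
    return rows


def rowwise_volume(rows, owner):
    """Words sent in y = A x when row i and x_i both live on owner[i]."""
    needers = [set() for _ in rows]
    for i, cols in enumerate(rows):
        for j in cols:
            needers[j].add(owner[i])
    # x_j is sent once to every processor that needs it, except its owner.
    return sum(len(parts - {owner[j]}) for j, parts in enumerate(needers))


def strip_owner(n, k):
    """k horizontal strips of grid rows."""
    return [(r * k) // n for r in range(n) for _ in range(n)]


def block_owner(n, q):
    """q-by-q grid of square blocks; assumes q divides n."""
    b = n // q
    return [(r // b) * q + (c // b) for r in range(n) for c in range(n)]


n = 8
pattern = five_point_pattern(n)
print("strips:", rowwise_volume(pattern, strip_owner(n, 4)))
print("blocks:", rowwise_volume(pattern, block_owner(n, 2)))
```

For the 8-by-8 grid and four processors, both layouts give each processor 16 rows, but the block layout has a shorter total boundary and therefore a lower volume than the strip layout.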