## Section: New Results

### High performance scientific computing

Participants : Laura Grigori, Guy Atenekeng, Simplice Donfack, Amal Khabou, Pawan Kumar, Federico Stivoli, Ke Wang.

This research focuses on the design of efficient parallel algorithms for problems in numerical linear algebra, such as solving very large systems of linear equations and large least squares problems, often with millions of rows and columns. These problems arise in many numerical simulations, and solving them is very time-consuming.

#### Communication avoiding algorithms for LU and QR factorizations

This research focuses on developing new algorithms for linear algebra problems that minimize the required communication, in terms of both latency and bandwidth. The results obtained to date concern algorithms for solving dense linear systems and dense least squares problems.

In [61] we study algorithms for performing the LU and QR factorizations of dense matrices. Two communication-optimal algorithms were recently introduced for distributed memory architectures, referred to as communication-avoiding LU (CALU) and communication-avoiding QR (CAQR) (joint work with J. Demmel and M. Hoemmen from U.C. Berkeley, J. Langou from C.U. Denver, and H. Xiang from University Paris 6). We study two algorithms based on CAQR and CALU that are adapted to multicore architectures. They combine the communication-reducing ideas of communication-avoiding algorithms with asynchronism and dynamic task scheduling. For tall and skinny matrices, that is, matrices with many more rows than columns, the two algorithms outperform the corresponding routines from the Intel MKL vendor library on a dual-socket, quad-core machine based on the Intel Xeon EM64T processor and on a four-socket, quad-core machine based on the AMD Opteron processor. For matrices with m = 10^{5} and m = 10^{6} rows and n varying from 10 to 1000 columns, multithreaded CALU outperforms the corresponding routine dgetrf from the Intel MKL library by up to a factor of 2.3 and the corresponding routine dgetrf from the AMD ACML library by up to a factor of 5. Multithreaded CAQR outperforms the corresponding dgeqrf routine from the MKL library by a factor of 5.3. For square matrices, CALU is slightly slower than the MKL dgetrf routine when m = n < 5000, while for m = n = 10^{4} it is faster than this routine by up to a factor of 1.5. For square matrices, CAQR is slower than the corresponding routines in the other libraries.
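To make the tall-and-skinny case concrete, the following is a minimal serial sketch of the tall-skinny QR (TSQR) reduction on which the published CAQR panel factorization is built. It is not the multithreaded implementation studied in [61]: the function names `qr_r_factor` and `tsqr_r_factor` are hypothetical, the local kernel is modified Gram-Schmidt rather than the Householder QR a production code would use, and only the R factor is computed. Each row block is assumed to have at least as many rows as columns and full column rank.

```python
import math


def qr_r_factor(rows):
    """Return the n-by-n upper triangular R of a thin QR factorization.

    `rows` is an m-by-n list of lists with m >= n and full column rank
    (assumption). Modified Gram-Schmidt is used for brevity.
    """
    m, n = len(rows), len(rows[0])
    cols = [[rows[i][j] for i in range(m)] for j in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        norm = math.sqrt(sum(x * x for x in cols[j]))
        R[j][j] = norm
        q = [x / norm for x in cols[j]]
        # Orthogonalize the remaining columns against q.
        for k in range(j + 1, n):
            R[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - R[j][k] * a for a, b in zip(q, cols[k])]
    return R


def tsqr_r_factor(A, num_blocks):
    """R factor of a tall-skinny A via a binary reduction tree.

    Each leaf factors one block of rows independently; each inner node
    stacks two small R factors and factors the 2n-by-n result. In a
    parallel setting only these n-by-n factors would be exchanged.
    """
    m = len(A)
    step = -(-m // num_blocks)  # ceiling division
    factors = [qr_r_factor(A[s:s + step]) for s in range(0, m, step)]
    while len(factors) > 1:
        factors = [
            qr_r_factor([row for R in factors[i:i + 2] for row in R])
            for i in range(0, len(factors), 2)
        ]
    return factors[0]
```

Because R is normalized to have a positive diagonal, the tree-based result agrees with the R factor of a direct factorization up to rounding error, while each tree node touches only a small block of data.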

In 2008 we introduced a Communication-Avoiding LU factorization (CALU) algorithm for computing in parallel the LU factorization of a dense matrix A distributed in a two-dimensional (2D) layout. To decrease the communication required in the LU factorization, CALU uses a new pivoting strategy, referred to as ca-pivoting, that may lead to a different row permutation than the classic LU factorization with partial pivoting. We have further investigated the numerical stability of CALU. Our numerical results show that the ca-pivoting scheme is stable in practice. We observe that it behaves like threshold pivoting: in practical experiments |L| is bounded by 3, while in LU factorization with partial pivoting |L| is bounded by 1, where |L| denotes the matrix of absolute values of the entries of L. Extensive testing on many different matrices always resulted in residuals ||Ax-b|| comparable to those from conventional partial pivoting. In particular, we have shown that CALU is equivalent to performing Gaussian elimination with partial pivoting (GEPP) on a larger matrix formed by entries of the input matrix (sometimes slightly perturbed) and zeros. The paper describing this research is in preparation.
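The pivot selection behind ca-pivoting can be illustrated with a small serial sketch, given here as an assumption-laden illustration rather than the CALU implementation. For a tall panel with b columns, each block of rows proposes b candidate pivot rows using GEPP, and pairs of candidate sets are then merged by running GEPP again on the stacked candidate rows until b rows remain. The names `gepp_pivot_rows` and `tournament_pivot_rows` are hypothetical, every block is assumed to have at least b rows, and no zero pivot is assumed to occur.

```python
def gepp_pivot_rows(block, b):
    """Indices (into `block`) of the first b pivot rows chosen by GEPP."""
    work = [row[:] for row in block]
    order = list(range(len(work)))
    for k in range(b):
        # Partial pivoting: largest entry in column k among remaining rows.
        p = max(range(k, len(work)), key=lambda i: abs(work[i][k]))
        work[k], work[p] = work[p], work[k]
        order[k], order[p] = order[p], order[k]
        for i in range(k + 1, len(work)):
            f = work[i][k] / work[k][k]
            for j in range(k, len(work[0])):
                work[i][j] -= f * work[k][j]
    return order[:b]


def tournament_pivot_rows(panel, num_blocks):
    """Select b pivot rows of a tall panel (b = number of columns).

    Returns global row indices. The selection always works on the
    original rows of `panel`, never on partially eliminated ones.
    """
    m, b = len(panel), len(panel[0])
    step = -(-m // num_blocks)  # ceiling division
    candidates = []
    for s in range(0, m, step):
        idx = list(range(s, min(s + step, m)))
        local = gepp_pivot_rows([panel[g] for g in idx], b)
        candidates.append([idx[i] for i in local])
    # Binary reduction tree over the candidate sets.
    while len(candidates) > 1:
        merged = []
        for i in range(0, len(candidates), 2):
            idx = [g for c in candidates[i:i + 2] for g in c]
            local = gepp_pivot_rows([panel[g] for g in idx], b)
            merged.append([idx[j] for j in local])
        candidates = merged
    return candidates[0]
```

With a single column (b = 1) the tournament returns the row with the entry of largest magnitude, exactly as partial pivoting does. For b > 1 the selected rows may differ from those of partial pivoting, which is why the stability of the scheme has to be studied separately.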

#### Preconditioning techniques

A different direction of research is related to preconditioning large sparse linear systems of equations. In this research we consider different preconditioners based on incomplete factorizations. Tangential filtering is an incomplete factorization technique in which the factorization can be made to coincide with the original matrix on a specified vector. This research is performed in the context of the ANR PETAL project. The participants at INRIA Saclay are L. Grigori, P. Kumar and K. Wang.
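In symbols, the filtering property can be sketched as follows. The notation is ours and not taken from the cited work: $A$ denotes the system matrix, $M$ the incomplete factorization used as preconditioner, and $t$ the chosen filter vector. The right-sided form is shown; a left-sided variant would impose $t^T M = t^T A$ instead.

```latex
% Filtering condition (assumed notation): the preconditioner M
% reproduces the action of A on the filter vector t.
M\,t = A\,t
\quad\Longleftrightarrow\quad
(M - A)\,t = 0 .
```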

Recent research has shown that ILU combined with tangential filtering leads to a very efficient preconditioner for matrices arising from the discretization of scalar equations on structured grids. Over the past year we have further investigated the properties of these preconditioners.

We consider the problem of solving block tridiagonal linear systems arising from the discretization of PDEs. The nested factorization preconditioner introduced by [J. R. Appleyard and I. M. Cheshire, *Nested Factorization*, SPE 12264, presented at the Seventh SPE Symposium on Reservoir Simulation, San Francisco, 1983] is an effective preconditioner for a certain class of problems, and a similar method is implemented in Schlumberger's Eclipse oil reservoir simulator. In [63], a relaxed version of the nested factorization preconditioner is proposed as a replacement for ILU0. The proposed preconditioner is symmetric positive definite (SPD) and leads to a stable splitting if the input matrix is SPD. For ILU0, equivalent properties hold if the input matrix is an M-matrix. Moreover, the proposed preconditioner has no storage cost. Effective preconditioning is achieved by combining it with a tangential filtering preconditioner whose filter vector is chosen as the vector of ones. Numerical tests are carried out with both additive and multiplicative combinations. With this setup the new preconditioner is as robust as the combination of ILU0 with the tangential filtering preconditioner.
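For reference, the two kinds of combination can be written in their common textbook form. This is an assumption about notation, not a transcription of [63]: $M_{1}$ stands for the relaxed nested factorization preconditioner, $M_{2}$ for the tangential filtering preconditioner, and $A$ for the system matrix.

```latex
% Additive combination (assumed textbook form)
M_{\mathrm{add}}^{-1} = M_{1}^{-1} + M_{2}^{-1}

% Multiplicative combination (assumed textbook form)
M_{\mathrm{mult}}^{-1} = M_{1}^{-1} + M_{2}^{-1} - M_{2}^{-1} A\, M_{1}^{-1},
\qquad\text{equivalently}\qquad
I - M_{\mathrm{mult}}^{-1} A = \bigl(I - M_{2}^{-1} A\bigr)\bigl(I - M_{1}^{-1} A\bigr)
```

In the multiplicative form the error propagation operators of the two preconditioners are applied one after the other, whereas the additive form applies both preconditioners to the same residual and sums the results.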

We also designed preconditioners based on Kronecker product approximation or Schilder factorization for saddle point problems arising from PDEs or optimization [23].