Section: New Results
High performance scientific computing
Participants : Laura Grigori, Guy Atenekeng, Simplice Donfack, Amal Khabou, Pawan Kumar, Federico Stivoli, Ke Wang.
The focus of this research is on the design of efficient parallel algorithms for solving problems in numerical linear algebra, such as very large systems of linear equations and large least squares problems, often with millions of rows and columns. These problems arise in many numerical simulations, and solving them is very time consuming.
Communication avoiding algorithms for LU and QR factorizations
This research focuses on developing new algorithms for linear algebra problems that minimize the required communication, in terms of both latency and bandwidth. The results we have obtained to date concern algorithms for solving dense linear systems and dense least squares problems.
In [61] we study algorithms for performing the LU and QR factorizations of dense matrices. Recently, two communication-optimal algorithms have been introduced for distributed-memory architectures, referred to as communication-avoiding LU (CALU) and communication-avoiding QR (CAQR) (joint work with J. Demmel and M. Hoemmen from U.C. Berkeley, J. Langou from C.U. Denver, and H. Xiang from University Paris 6). We study two algorithms based on CAQR and CALU that are adapted to multicore architectures. They combine the communication-reducing ideas of communication-avoiding algorithms with asynchronism and dynamic task scheduling. For tall and skinny matrices, that is, matrices with many more rows than columns, the two algorithms outperform the corresponding routines from the Intel MKL vendor library on a dual-socket, quad-core machine based on the Intel Xeon EM64T processor and on a four-socket, quad-core machine based on the AMD Opteron processor. For matrices with m = 10^5 and m = 10^6 rows and n varying from 10 to 1000 columns, multithreaded CALU outperforms the corresponding routine dgetrf from the Intel MKL library by up to a factor of 2.3 and the corresponding routine dgetrf from the AMD ACML library by up to a factor of 5. Multithreaded CAQR outperforms the corresponding dgeqrf routine from the MKL library by a factor of 5.3. For square matrices, CALU is slightly slower than the MKL dgetrf routine when m = n < 5000, while for m = n = 10^4 it is faster than this routine by up to a factor of 1.5; CAQR is slower than the corresponding routines in the other libraries.
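The reduction idea underlying CAQR on tall and skinny matrices can be illustrated with a small sequential sketch. It is not the multithreaded implementation studied in [61]: the function name `tsqr_r`, the choice of a binary reduction tree, and the parameter `num_blocks` are assumptions made for illustration only. Each row block is factored independently, and the resulting triangular factors are combined pairwise until a single R factor remains.

```python
import numpy as np

def tsqr_r(A, num_blocks=4):
    """Return the R factor of a tall-skinny matrix A via a binary reduction tree.

    Hypothetical helper for illustration; sequential, no task scheduling.
    """
    # Leaves of the tree: an independent QR factorization of each row block.
    blocks = np.array_split(A, num_blocks, axis=0)
    Rs = [np.linalg.qr(B, mode="r") for B in blocks]
    # Inner nodes: stack pairs of R factors and factor them again.
    while len(Rs) > 1:
        merged = []
        for i in range(0, len(Rs), 2):
            stacked = np.vstack(Rs[i:i + 2])
            merged.append(np.linalg.qr(stacked, mode="r"))
        Rs = merged
    return Rs[0]
```

In a parallel setting each leaf factorization would run on a different core or processor, and only the small n-by-n triangular factors would be exchanged between them. The R factor obtained this way agrees with that of a direct QR factorization up to the signs of its rows.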
In 2008 we introduced a Communication-Avoiding LU factorization
(CALU) algorithm for computing in parallel the LU factorization of a
dense matrix A distributed in a two-dimensional (2D) layout. To
decrease the communication required in the LU factorization, CALU uses
a new pivoting strategy, referred to as ca-pivoting, that may lead to
a different row permutation than the classic LU factorization with
partial pivoting. We have further investigated the numerical
stability of CALU. Our numerical results show that the ca-pivoting scheme
is stable in practice. We observe that it behaves like threshold
pivoting: in practical experiments the entries of |L| are bounded
by 3, whereas in LU factorization with partial pivoting they are
bounded by 1, where |L| denotes the matrix of absolute values of
the entries of L. Extensive testing on many
different matrices always resulted in residuals ||Ax-b|| comparable
to those from conventional partial pivoting. In particular, we have
shown that CALU is equivalent to performing Gaussian elimination with
partial pivoting (GEPP) on a larger matrix
formed by entries of the input matrix (sometimes slightly perturbed)
and zeros. The paper describing this research is in preparation.
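The pivot selection described above can be sketched as a tournament over row blocks of a panel. The sketch below is a sequential illustration under stated assumptions, not the CALU implementation: the function names, the binary reduction tree, and the parameter `num_blocks` are hypothetical. Each block proposes b candidate rows by GEPP, and candidates from pairs of blocks are merged and reduced again, always working on the original (not updated) rows. The helper `max_abs_L` then measures the largest entry of |L| obtained when the selected rows are used as pivots.

```python
import numpy as np

def gepp_pivot_rows(B, b):
    """Indices (into B) of the first b pivot rows chosen by partial pivoting."""
    W = B.astype(float).copy()
    idx = np.arange(W.shape[0])
    steps = min(b, W.shape[0], W.shape[1])
    for k in range(steps):
        p = k + np.argmax(np.abs(W[k:, k]))   # largest entry in column k
        W[[k, p]] = W[[p, k]]
        idx[[k, p]] = idx[[p, k]]
        if W[k, k] != 0:
            W[k + 1:, k:] -= np.outer(W[k + 1:, k] / W[k, k], W[k, k:])
    return idx[:steps]

def tournament_pivot_rows(panel, b, num_blocks=4):
    """Select b pivot rows of the panel with a binary reduction tree."""
    groups = np.array_split(np.arange(panel.shape[0]), num_blocks)
    cands = [g[gepp_pivot_rows(panel[g], b)] for g in groups]
    while len(cands) > 1:
        merged = []
        for i in range(0, len(cands), 2):
            rows = np.concatenate(cands[i:i + 2])
            # GEPP is applied again to the original rows of the candidates.
            merged.append(rows[gepp_pivot_rows(panel[rows], b)])
        cands = merged
    return cands[0]

def max_abs_L(panel, rows):
    """Largest entry of |L| when the given rows are used as pivots, in order."""
    rest = np.setdiff1d(np.arange(panel.shape[0]), rows)
    W = panel[np.concatenate([rows, rest])].astype(float)
    for k in range(len(rows)):
        W[k + 1:, k] /= W[k, k]               # multipliers: column k of L
        W[k + 1:, k + 1:] -= np.outer(W[k + 1:, k], W[k, k + 1:])
    return np.abs(np.tril(W, -1)).max()
```

With `num_blocks=1` the tournament reduces to ordinary partial pivoting, so `max_abs_L` does not exceed 1. With several blocks the selected rows may differ from those of GEPP and entries of |L| larger than 1 can appear, which is the quantity the experiments above report as bounded by 3 in practice.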
Preconditioning techniques
A different direction of research is related to preconditioning large sparse linear systems of equations. In this research we consider different preconditioners based on incomplete factorizations. Tangential filtering is an incomplete factorization technique in which the factorization can be made to coincide with the original matrix on a specified vector. This research is performed in the context of the ANR PETAL project. The participants at INRIA Saclay are L. Grigori, P. Kumar and K. Wang.
Recent research has shown that ILU combined with tangential filtering leads to very efficient preconditioners for matrices arising from the discretization of scalar equations on structured grids. In the previous year we further investigated their properties.
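The filtering property can be illustrated on a small block tridiagonal matrix. The sketch below is one possible construction and makes several assumptions that are not taken from this report: it uses dense matrices, a diagonal approximation `beta` of the inverse of each pivot block, and a 2D Poisson matrix as test problem; the function names are hypothetical. It builds M = (L + T) T^{-1} (T + U), where L and U are the strictly block lower and upper parts of A and T is block diagonal, so that M t = A t holds for the chosen filter vector t. The construction requires that the off-diagonal block applied to t has no zero entries.

```python
import numpy as np

def poisson_2d(n):
    """Dense 2D Poisson matrix on an n-by-n grid (block tridiagonal, block size n)."""
    T1 = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return np.kron(np.eye(n), T1) + np.kron(T1, np.eye(n))

def tangential_filtering(A, n, t):
    """Return M = (L + T) T^{-1} (T + U) such that M @ t equals A @ t."""
    nb = A.shape[0] // n
    sl = lambda i: slice(i * n, (i + 1) * n)
    T = np.zeros_like(A)    # block diagonal pivot blocks
    Lb = np.zeros_like(A)   # strictly block lower part of A
    Ub = np.zeros_like(A)   # strictly block upper part of A
    T[sl(0), sl(0)] = A[sl(0), sl(0)]
    for i in range(1, nb):
        L = A[sl(i), sl(i - 1)]
        U = A[sl(i - 1), sl(i)]
        Lb[sl(i), sl(i - 1)] = L
        Ub[sl(i - 1), sl(i)] = U
        ut = U @ t[sl(i)]   # assumed to have no zero entries
        # Diagonal beta with beta @ ut == inv(T_prev) @ ut (filtering condition).
        beta = np.diag(np.linalg.solve(T[sl(i - 1), sl(i - 1)], ut) / ut)
        T[sl(i), sl(i)] = A[sl(i), sl(i)] - L @ beta @ U
    return (Lb + T) @ np.linalg.solve(T, T + Ub)
```

For example, with `A = poisson_2d(6)` and `t = np.ones(36)`, the matrix returned by `tangential_filtering(A, 6, t)` differs from A but acts identically on t. A practical implementation would keep the blocks sparse and apply the factors through triangular block solves rather than forming M explicitly.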
We consider the problem of solving block tridiagonal linear systems arising from the discretization of PDEs. The nested factorization preconditioner introduced by [J. R. Appleyard and I. M. Cheshire, Nested Factorization, SPE 12264, presented at the Seventh SPE Symposium on Reservoir Simulation, San Francisco, 1983] is an effective preconditioner for a certain class of problems, and a similar method is implemented in Schlumberger's Eclipse oil reservoir simulator. In [63], a relaxed version of the nested factorization preconditioner is proposed as a replacement for ILU0. The proposed preconditioner is symmetric positive definite (SPD) and leads to a stable splitting if the input matrix is SPD; for ILU0, equivalent properties hold if the input matrix is an M-matrix. Moreover, it has no storage cost. Effective multiplicative or additive preconditioning is achieved in combination with the tangential filtering preconditioner, with the filter vector chosen as the vector of ones. Numerical tests are carried out with both additive and multiplicative combinations. With this setup the new preconditioner is as robust as the combination of ILU0 with the tangential filtering preconditioner.
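The additive and multiplicative combinations mentioned above can be sketched generically. The following is a minimal illustration of the standard constructions, assuming each preconditioner is available as a function that applies its inverse to a vector. The names `additive`, `multiplicative`, `solve1` and `solve2` are hypothetical, and [63] may combine the two preconditioners differently.

```python
import numpy as np

def additive(solve1, solve2):
    """Additive combination: z = M1^{-1} r + M2^{-1} r."""
    return lambda r: solve1(r) + solve2(r)

def multiplicative(A, solve1, solve2):
    """Multiplicative combination: apply M1, then correct the residual with M2.

    Equivalent to z = (M1^{-1} + M2^{-1} - M2^{-1} A M1^{-1}) r.
    """
    def apply(r):
        z = solve1(r)                  # first preconditioner
        return z + solve2(r - A @ z)   # second one acts on the remaining residual
    return apply
```

In this setting `solve1` would apply the nested factorization preconditioner and `solve2` the tangential filtering preconditioner. The additive form allows the two solves to run independently, whereas the multiplicative form requires one extra matrix-vector product with A per application.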
We also designed preconditioners based on Kronecker product approximation or Schilders factorization for saddle point problems arising from PDEs or optimization [23].