## Section: New Results

### High performance solvers for large linear algebra problems

#### Partitioning and communication strategies for sparse non-negative matrix factorization

Non-negative matrix factorization (NMF), the problem of finding two non-negative low-rank factors whose product approximates an input matrix, is a useful tool for many data mining and scientific applications such as topic modeling in text mining and blind source separation in microscopy. In this paper, we focus on scaling algorithms for NMF to very large sparse datasets and massively parallel machines by employing effective algorithms, communication patterns, and partitioning schemes that leverage the sparsity of the input matrix. In machine learning workflows, the computations that follow the sparse-dense matrix multiplication (SpMM) must deal with dense matrices, because the product of a sparse matrix and a dense matrix is dense. Hence, a partitioning strategy that considers only SpMM leads to a severe load imbalance in the overall workflow, especially in the computations after SpMM, which for NMF are the non-negative least squares computations. To address this, we consider two previous works developed for related problems, one that uses a fine-grained partitioning strategy with a point-to-point communication pattern and one that uses a checkerboard partitioning strategy with a collective-based communication pattern. We show that a combination of the previous approaches balances the demands of the various computations within NMF algorithms and achieves high efficiency and scalability. In our experiments, the proposed algorithm communicates at least 4x less than the collective-based approach and achieves up to 100x speedup over the baseline FAUN on real-world datasets. The algorithm was evaluated on two different supercomputing platforms and scaled up to 32000 processors on Blue Gene/Q.
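To make the structure of the computation concrete, here is a minimal serial sketch of NMF. It is not the parallel algorithm of [21]: it uses Lee-Seung multiplicative updates as a simple stand-in for the non-negative least squares step, stores the input densely for brevity, and the name `nmf_mu` and all parameters are illustrative assumptions. The point is only that the products `W.T @ A` and `A @ H.T` (the SpMM step when `A` is sparse) produce dense matrices, so everything after them is dense work.

```python
import numpy as np

def nmf_mu(A, k, iters=200, seed=0, eps=1e-12):
    """Approximate a non-negative A (m x n) by W @ H with W, H >= 0.

    Hypothetical helper: multiplicative updates, not the method of [21].
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))          # dense m x k factor
    H = rng.random((k, n))          # dense k x n factor
    errs = []
    for _ in range(iters):
        # W.T @ A and A @ H.T stand for the SpMM step: a (mostly zero)
        # input times a dense factor gives a dense k x n or m x k result.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
        errs.append(np.linalg.norm(A - W @ H))
    return W, H, errs

# Small synthetic input with roughly 10% non-zeros (assumed test data).
rng = np.random.default_rng(42)
A = rng.random((60, 40)) * (rng.random((60, 40)) < 0.1)
W, H, errs = nmf_mu(A, k=5)
print(f"residual: first iteration {errs[0]:.3f}, last iteration {errs[-1]:.3f}")
```

In a distributed setting, the sparse products and the dense factor updates have different cost profiles, which is why a partitioning tuned only for the sparse products can leave the dense part unbalanced.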

More information on these results can be found in [21].

#### Low-rank factorizations in data sparse hierarchical algorithms for preconditioning symmetric positive definite matrices

We consider the problem of choosing low-rank factorizations in data sparse matrix approximations for preconditioning large scale symmetric positive definite (SPD) matrices. These approximations are memory efficient schemes that rely on hierarchical matrix partitioning and compression of certain sub-blocks of the matrix. Typically, these matrix approximations can be constructed very fast, and their matrix product can be applied rapidly as well. The common practice is to express the compressed sub-blocks by low-rank factorizations, and the main contribution of this work is the numerical and spectral analysis of SPD preconditioning schemes represented by $2 \times 2$ block matrices, whose off-diagonal sub-blocks are low-rank approximations of the original matrix off-diagonal sub-blocks. We propose an optimal choice of low-rank approximations which minimizes the condition number of the preconditioned system, and demonstrate that the analysis can be applied to the class of hierarchically off-diagonal low-rank matrix approximations. We provide spectral estimates that take into account the error propagation through the levels of the hierarchy and quantify the impact of the choice of low-rank compression on the global condition number. The numerical results indicate that the properties of the preconditioning scheme using proper low-rank compression are superior to those obtained with standard choices for low-rank compression. A major goal of this work is to provide insight into how proper reweighting prior to low-rank compression influences the condition number for a simple case, which would lead to an extended analysis for more general and more efficient hierarchical matrix approximation techniques.
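The $2 \times 2$ setting can be explored numerically with a small sketch. The code below is an illustration under our own assumptions, not the construction of [5]: it reads "reweighting prior to low-rank compression" as truncating the scaled block $A_{11}^{-1/2} A_{12} A_{22}^{-1/2}$ instead of $A_{12}$ itself, and all function names and the random test matrix are hypothetical. For this reweighted choice, the eigenvalues of the preconditioned matrix are $1$ and $1 \pm \sigma_i$ for $i > k$ in exact arithmetic, where $\sigma_i$ are the singular values of the scaled block, so the condition number is $(1+\sigma_{k+1})/(1-\sigma_{k+1})$.

```python
import numpy as np

def sqrt_and_inv_sqrt(S):
    """Return S^{1/2} and S^{-1/2} for a symmetric positive definite S."""
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(w)) @ Q.T, (Q / np.sqrt(w)) @ Q.T

def truncate(B, k):
    """Best rank-k approximation of B via truncated SVD."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def block_preconditioner(A11, A22, B):
    """2 x 2 block matrix with exact diagonal blocks and off-diagonal B."""
    return np.block([[A11, B], [B.T, A22]])

def spectral_ratio(A, M):
    """max|lambda| / min|lambda| over the eigenvalues of M^{-1} A.

    Equals the condition number of the preconditioned system when M is SPD.
    """
    lam = np.abs(np.linalg.eigvals(np.linalg.solve(M, A)))
    return lam.max() / lam.min()

# Random SPD test matrix (assumed data), split into 2 x 2 blocks.
rng = np.random.default_rng(0)
n, m, k = 40, 20, 5
G = rng.standard_normal((n, n))
A = G @ G.T / n + np.eye(n)
A11, A12, A22 = A[:m, :m], A[:m, m:], A[m:, m:]

# Standard choice: truncate the off-diagonal block directly.
M_std = block_preconditioner(A11, A22, truncate(A12, k))

# Reweighted choice: truncate the scaled block, then scale back.
R1, R1_inv = sqrt_and_inv_sqrt(A11)
R2, R2_inv = sqrt_and_inv_sqrt(A22)
C = R1_inv @ A12 @ R2_inv
M_rw = block_preconditioner(A11, A22, R1 @ truncate(C, k) @ R2)

sigma = np.linalg.svd(C, compute_uv=False)
kappa_std = spectral_ratio(A, M_std)
kappa_rw = spectral_ratio(A, M_rw)
kappa_pred = (1 + sigma[k]) / (1 - sigma[k])   # sigma[k] is sigma_{k+1}
print(f"standard: {kappa_std:.4f}  reweighted: {kappa_rw:.4f}  "
      f"predicted: {kappa_pred:.4f}")
```

The script prints the spectral ratio for both choices together with the closed-form value for the reweighted one, which gives a quick way to compare them on other test matrices or ranks.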

More information on these results can be found in [5].

#### Analyzing the effect of local rounding error propagation on the maximal attainable accuracy of the pipelined Conjugate Gradient method

Pipelined Krylov subspace methods typically offer improved strong scaling on parallel HPC hardware compared to standard Krylov subspace methods for large and sparse linear systems. In pipelined methods the traditional synchronization bottleneck is mitigated by overlapping time-consuming global communications with useful computations. However, to achieve this communication-hiding strategy, pipelined methods introduce additional recurrence relations for a number of auxiliary variables that are required to update the approximate solution. This paper aims at studying the influence of local rounding errors that are introduced by the additional recurrences in the pipelined Conjugate Gradient (CG) method. Specifically, we analyze the impact of local round-off effects on the attainable accuracy of the pipelined CG algorithm and compare it to the traditional CG method. Furthermore, we estimate the gap between the true residual and the recursively computed residual used in the algorithm. Based on this estimate we suggest an automated residual replacement strategy to reduce the loss of attainable accuracy on the final iterative solution. The resulting pipelined CG method with residual replacement improves the maximal attainable accuracy of pipelined CG while maintaining the efficient parallel performance of the pipelined method. This conclusion is substantiated by numerical results for a variety of benchmark problems.
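The additional recurrences and the idea of residual replacement can be sketched in a few lines of serial code. This is an unpreconditioned pipelined CG written from the standard recurrences, with auxiliary vectors `w = A r`, `s = A p` and `z = A s`. The replacement rule used here, recomputing these quantities explicitly every `replace_every` iterations, is a deliberately naive assumption; the automated criterion based on the estimated residual gap is described in [7] and is not reproduced. The function name and test problem are hypothetical.

```python
import numpy as np

def pipelined_cg(A, b, iters, replace_every=None):
    """Serial sketch of unpreconditioned pipelined CG.

    Returns the iterate x and the recursively updated residual r.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    w = A @ r                       # auxiliary: w = A r
    p = np.zeros_like(b)
    s = np.zeros_like(b)            # auxiliary: s = A p
    z = np.zeros_like(b)            # auxiliary: z = A s
    gamma_old, alpha = 1.0, 1.0
    for i in range(iters):
        gamma = r @ r               # in parallel, these two dot products
        delta = w @ r               # overlap with the product A @ w below
        q = A @ w
        if i == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha)
        z = q + beta * z            # extra recurrences that replace
        s = w + beta * s            # explicit matrix-vector products
        p = r + beta * p
        x = x + alpha * p
        r = r - alpha * s
        w = w - alpha * z
        gamma_old = gamma
        if replace_every and (i + 1) % replace_every == 0:
            # Naive periodic replacement (assumption, not the rule of [7]).
            r = b - A @ x
            w = A @ r
            s = A @ p
            z = A @ s
    return x, r

# Assumed well-conditioned SPD test problem.
rng = np.random.default_rng(1)
n = 50
G = rng.standard_normal((n, n))
A = G @ G.T / n + np.eye(n)
b = rng.standard_normal(n)
for every in (None, 5):
    x, r = pipelined_cg(A, b, iters=25, replace_every=every)
    gap = np.linalg.norm((b - A @ x) - r)
    print(f"replace_every={every}: residual gap {gap:.2e}")
```

The printed gap is the norm of the difference between the true residual `b - A @ x` and the recursive residual `r`, the quantity whose growth limits the attainable accuracy.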

More information on these results can be found in [7].

#### Sparse supernodal solver using block low-rank compression: Design, performance and analysis

We propose two approaches using a Block Low-Rank (BLR)
compression technique to reduce the memory footprint and/or the
time-to-solution of the sparse supernodal solver `PaStiX`. This flat,
non-hierarchical compression method makes it possible to take advantage of the
low-rank property of the blocks appearing during the factorization of
sparse linear systems, which come from the discretization of partial
differential equations. The proposed solver can be used either as a direct
solver at a lower precision or as a very robust preconditioner.
The first approach, called *Minimal Memory*,
illustrates the maximum memory gain that can be obtained with the BLR
compression method, while the second approach, called *Just-In-Time*,
mainly focuses on reducing the computational complexity and thus the
time-to-solution. Singular Value Decomposition (SVD) and
Rank-Revealing QR (RRQR), as compression kernels, are both compared in
terms of factorization time, memory consumption, as well as numerical
properties.
Experiments on a shared memory node with 24 threads and 128 GB of memory
are performed to evaluate the potential of both strategies. On a set
of matrices from real-life problems, we demonstrate a memory footprint
reduction of up to 4 times using the
*Minimal Memory* strategy and a computational time speedup of up to $3.5$
times with the *Just-In-Time* strategy. Then, we study the impact
of configuration parameters of the BLR solver that allowed us to solve
a 3D Laplacian of 36 million unknowns on a single node, while the
full-rank solver stopped at 8 million due to memory limitations.
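As a rough illustration of what compressing one block to a given tolerance means, the sketch below compares a truncated SVD with a greedy column-pivoted Gram-Schmidt procedure used here as a simple stand-in for RRQR. These are not the kernels implemented in `PaStiX`; the function names, the tolerance, and the test block (a $1/|x-y|$ interaction between two well-separated point sets) are assumptions chosen only because such a block is numerically low-rank.

```python
import numpy as np

def svd_compress(B, tol):
    """Smallest-rank truncated SVD with ||B - U V||_F <= tol * ||B||_F."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    # tail[i] is the Frobenius norm of the discarded part when keeping i terms.
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
    k = int(np.sum(tail > tol * np.linalg.norm(B)))
    return U[:, :k] * s[:k], Vt[:k]

def rrqr_compress(B, tol):
    """Greedy column-pivoted Gram-Schmidt, a stand-in for RRQR."""
    R = np.array(B, dtype=float)            # residual, updated in place
    threshold = tol * np.linalg.norm(B)
    basis = []
    while len(basis) < min(B.shape) and np.linalg.norm(R) > threshold:
        j = np.argmax(np.sum(R ** 2, axis=0))   # pivot: largest residual column
        q = R[:, j] / np.linalg.norm(R[:, j])
        basis.append(q)
        R -= np.outer(q, q @ R)                 # project q out of the residual
    Q = np.column_stack(basis) if basis else np.zeros((B.shape[0], 0))
    return Q, Q.T @ B

# Assumed test block: interaction between two well-separated point sets.
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(3.0, 4.0, 200)
B = 1.0 / np.abs(x[:, None] - y[None, :])
tol = 1e-8
for name, compress in [("SVD", svd_compress), ("RRQR-like", rrqr_compress)]:
    U, V = compress(B, tol)
    err = np.linalg.norm(B - U @ V) / np.linalg.norm(B)
    print(f"{name}: rank {U.shape[1]}, stored entries {U.size + V.size} "
          f"of {B.size}, relative error {err:.1e}")
```

The SVD yields the smallest rank for a given Frobenius tolerance, while pivoted QR-type kernels are typically cheaper to compute, which is the trade-off the comparison in the text refers to.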

These contributions have been published in the Journal of Computational Science (JoCS) [9].

#### Supernodes ordering to enhance Block Low-Rank compression in a sparse direct solver

Solving sparse linear systems appears in many scientific
applications, and sparse direct linear solvers are widely used for
their robustness. Still, both time and memory complexities limit the
use of direct methods to solve larger problems. In order to tackle
this problem, low-rank compression techniques have been introduced
in direct solvers to compress large dense blocks appearing in the
symbolic factorization. In this paper, we consider the Block
Low-Rank compression (BLR) format and address the problem of clustering
unknowns that come from separators issued from the nested dissection
process. We show that methods considering only intra-separator
connectivity (i.e., k-way or recursive bisection) as well as methods
managing only interactions between separators have some
limitations. We propose a new strategy that considers interactions
between a separator and its children to pre-select some interactions
while reducing the number of off-diagonal blocks in the symbolic
structure. We demonstrate how this new method enhances the BLR
strategies in the sparse direct supernodal solver `PaStiX`.

These contributions have been submitted to the SIAM Journal on Matrix Analysis and Applications (SIMAX) [22].