## Section: New Results

### High performance solvers for large linear algebra problems

#### Deflation and preconditioning strategies for sequences of sampled stochastic elliptic equations

We are interested in the quantification of uncertainties in discretized elliptic partial differential equations with random coefficients. In sampling-based approaches, this relies on solving large numbers of symmetric positive definite linear systems with different matrices. In this work, we investigate recycling Krylov subspace strategies for the iterative solution of sequences of such systems. The linear systems are solved using deflated conjugate gradient (CG) methods, where the Krylov subspace is augmented with approximate eigenvectors of the previously sampled operator. These operators are sampled by Markov chain Monte Carlo, which leads to sequences of correlated matrices. First, the following aspects of eigenvector approximation, and their effect on deflation, are investigated: (i) projection technique, and (ii) restarting strategy of the eigen-search space. Our numerical experiments show that these aspects only impact convergence behaviors of deflated CG at the early stages of the sampling sequence. Second, unlike sequences with multiple right-hand sides and a constant operator, our experiments with multiple matrices show the necessity to orthogonalize the iterated residual of the linear system with respect to the deflation subspace, throughout the sampling sequence. Finally, we observe a synergistic effect of deflation and block-Jacobi (bJ) preconditioning. While the action of bJ preconditioners leaves a trail of isolated eigenvalues in the spectrum of the preconditioned operator, for 1D problems, the corresponding eigenvectors are well approximated by the recycling strategy. Then, up to a certain number of blocks, deflated CG methods with bJ preconditioners achieve similar convergence behaviors to those observed with CG when using algebraic multigrid (AMG) as a preconditioner.

This work, developed in the framework of the PhD thesis of Nicolas Venkovic in collaboration with P. Mycek (Cerfacs) and O. Le Maitre (CMAP, Ecole Polytechnique), will be presented at the next Copper Mountain conference on iterative methods.

#### Robust preconditionners via generalized eigenproblems for hybrid sparse linear solvers

The solution of large sparse linear systems is one of the most time consuming kernels in many numerical simulations. The domain decomposition community has developed many efficient and robust methods in the last decades. While many of these solvers fall into the abstract Schwarz (aS) framework, their robustness has originally been demonstrated on a case-by-case basis. In this work, we propose a bound for the condition number of all deflated aS methods provided that the coarse grid consists of the assembly of local components that contain the kernel of some local operators. We show that classical results from the literature on particular instances of aS methods can be retrieved from this bound. We then show that such a coarse grid correction can be explicitly obtained algebraically via generalized eigenproblems, leading to a condition number independent of the number of domains. This result can be readily applied to retrieve or improve the bounds previously obtained via generalized eigenproblems in the particular cases of Neumann-Neumann (NN), Additive Schwarz (AS) and optimized Robin but also generalizes them when applied with approximate local solvers. Interestingly, the proposed methodology turns out to be a comparison of the considered particular aS method with generalized versions of both NN and AS for tackling the lower and upper part of the spectrum, respectively. We furthermore show that the application of the considered grid corrections in an additive fashion is robust in the AS case although it is not robust for aS methods in general. In particular, the proposed framework allows for ensuring the robustness of the AS method applied on the Schur complement (AS/S), either with deflation or additively, and with the freedom of relying on an approximate local Schur complement. Numerical experiments illustrate these statements.

More information on these results can be found in [3]

#### Rank Revealing QR Methods for Sparse Block Low Rank Solvers

In the context of the ANR Sashimi project and the Phd of Esragul Korkmaz, we have investigated several compression methods of dense blocks appearing inside sparse matrix solvers to reduce the memory consumption, as well as the time to solution.

Solving linear equations of type Ax=b for large sparse systems frequently emerges in science and engineering applications, which creates the main bottleneck. In spite that the direct methods are costly in time and memory consumption, they are still the most robust way to solve these systems. Nowadays, increasing the amount of computational units for the supercomputers became trendy, while the memory available per core is reduced. Therefore, when solving these linear equations, memory reduction becomes as important as time reduction. While looking for the lowest possible compression rank, Singular Value Decomposition (SVD) gives the best result. It is however too costly as the whole factorization is computed to find the resulting rank. In this respect, rank revealing QR decomposition variants are less costly, but can introduce larger ranks. Among these variants, column pivoting or matrix rotation can be applied on the matrix A, such that the most important information in the matrix is gathered to the leftmost columns and the remaining unnecessary information can be omitted. For reducing the communication cost of the classical QR decomposition with column pivoting, blocking versions with randomization are suggested as an alternative solution to find the pivots. In these randomized variants, the matrix A is projected on a much lower dimensional matrix by using an independent and identically distributed Gaussian matrix so that the pivoting/rotational matrix can be computed on the lower dimensional matrix. In addition, to avoid unnecessary updates of the trailing matrix at each iteration, a truncated randomized method is suggested and shown to be more efficient for larger matrix sizes. Thanks to these methods, closer results to SVD can be obtained and the cost of compression can be reduced.

A comparison of all these methods in terms of complexity, numerical stability and performance have been presented at the national conference COMPAS'2019 [18], and at the international workshop SparseDay'2019 [19].

#### Accelerating Krylov linear solvers with agnostic lossy data compression

In the context of the Inria International Lab JLESC we have an ongoing collaboration with Argonne National Laboratory on the use of agnostic compression techniques to reduce the memory footprint of iterative linear solvers. Krylov methods are among the most efficient and widely used algorithms for the solution of large linear systems. Some of these methods can, however, have large memory requirements. Despite the fact that modern high-performance computing systems have more and more memory available, the memory used by applications remains a major concern when solving large scale problems. This is one of the reasons why interest in lossy data compression techniques has grown tremendously in the last two decades: it can reduce the amount of information that needs to be stored and communicated. Recently, it has also been shown that Krylov methods allow for some inexactness in the matrix-vector product that is typically required in each iteration. We showed that the loss of accuracy caused by compressing and decompressing the solution of the preconditioning step in the flexible generalized minimal residual method can be interpreted as an inexact matrix-vector product. This allowed us to find a bound on the maximum compression error in each iteration based on the theory of inexact Krylov methods. We performed a series of numerical experiment in order to validate our results. A number of “relaxed compression strategies” was also considered in order to achieve higher compression ratios.

The results of this joint effort will be presented to the next SIAM conférence on Parallel processing SIAM-PP'20.

#### Energy Analysis of a Solver Stack for Frequency-Domain Electromagnetics

High-performance computing aims at developing models and simulations for applications in numerous scientific fields. Yet, the energy consumption of these HPC facilities currently limits their size and performance, and consequently the size of the tackled problems. The complexity of the HPC software stacks and their various optimizations makes it difficult to finely understand the energy consumption of scientific applications. To highlight this difficulty on a concrete use-case, we perform in [8] an energy and power analysis of a software stack for the simulation of frequency-domain electromagnetic wave propagation. This solver stack combines a high order finite element discretization framework of the system of three-dimensional frequency-domain Maxwell equations with an algebraic hybrid iterative-direct sparse linear solver. This analysis is conducted on the KNL-based PRACE-PCP system. Our results illustrate the difficulty in predicting how to trade energy and runtime.

#### Exploiting Parameterized Task-graph in Sparse Direct Solvers

Task-based programming models have been widely studied in the context of dense linear algebra, but remains less studied for the more complex sparse solvers.
In this talk [17], we have presented the use of two different programming models: Sequential Task Flow from StarPU, and Parameterized Task Graph from PaRSEC to parallelize the factorization step of the `PaStiX` sparse direct solver.
We have presented how those programming models have been used to integrate more complex and finer parallelism to take into account new architectures with many computational units.
Efficiency of such solutions on homogeneous and heterogeneous architectures with a spectrum of matrices from different applications have been shown.
We also have presented how such solutions enable, without extra cost to the programmer, better performance on irregular computations such as in the block low-rank implementation of the solver.

#### Block Low-rank Algebraic Clustering for Sparse Direct Solvers

In these talks [20], [21], we adressed the Block Low-Rank (BLR) clustering problem, to cluster unknowns within separators appearing during the factorization of sparse matrices. We have shown that methods considering only intra-separators connectivity (i.e., k-way or recursive bissection) as well as methods managing only interaction between separators have some limitations. The new strategy we proposed consider interactions between a separator and its children to pre-select some interactions while reducing the number of off-diagonal blocks. We demonstrated how this method enhance the BLR strategies in the sparse direct supernodal solver PaStiX, and discuss how it can be extended to low-rank formats with more than one level of hierarchy.

#### Leveraging Task-Based Polar Decomposition Using PARSEC on Massively Parallel Systems

In paper [13], we describe how to leverage a task-based implementation of the polar decomposition on massively parallel systems using the PARSEC dynamic runtime system. Based on a formulation of the iterative QR Dynamically-Weighted Halley (QDWH) algorithm, our novel implementation reduces data traffic while exploiting high concurrency from the underlying hardware architecture. First, we replace the most time-consuming classical QR factorization phase with a new hierarchical variant, customized for the specific structure of the matrix during the QDWH iterations. The newly developed hierarchical QR for QDWH exploits not only the matrix structure, but also shortens the length of the critical path to maximize hardware occupancy. We then deploy PARSEC to seamlessly orchestrate, pipeline, and track the data dependencies of the various linear algebra building blocks involved during the iterative QDWH algorithm. PARSEC enables to overlap communications with computations thanks to its asynchronous scheduling of fine-grained computational tasks. It employs look-ahead techniques to further expose parallelism, while actively pursuing the critical path. In addition, we identify synergistic opportunities between the task-based QDWH algorithm and the PARSEC framework. We exploit them during the hierarchical QR factorization to enforce a locality-aware task execution. The latter feature permits to minimize the expensive inter-node communication, which represents one of the main bottlenecks for scaling up applications on challenging distributed-memory systems. We report numerical accuracy and performance results using well and ill-conditioned matrices. The benchmarking campaign reveals up to 2X performance speedup against the existing state-of-the-art implementation for the polar decomposition on 36,864 cores.