## Section: New Results

### High-performance computing on next generation architectures

#### Evaluation of dataflow programming models for electronic structure theory

Dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. In this paper we evaluate different dataflow programming models for electronic structure methods and compare them in terms of programmability, resource utilization, and scalability. In particular, we evaluate two programming paradigms for expressing scientific applications in a dataflow form: (1) explicit dataflow, where the dataflow is specified explicitly by the developer, and (2) implicit dataflow, where a task scheduling runtime derives the dataflow using per-task data-access information embedded in a serial program. We discuss our findings and present a thorough experimental analysis using methods from the NWChem quantum chemistry application as our case study, and OpenMP, StarPU and PaRSEC as the task-based runtimes that enable the different forms of dataflow execution. Furthermore, we derive an abstract model to explore the limits of the different dataflow programming paradigms.

More information on these results can be found in [8].

#### On soft errors in the Conjugate Gradient method: sensitivity and robust numerical detection

The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. Although more than sixty years old, it is still a serious candidate for extreme-scale computation on large computing platforms. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices dramatically affect their sensitivity to natural radiation, and thus diminish their reliability. One of the most common effects produced by natural radiation is the single event upset, which consists of a bit-flip in a memory cell producing unexpected results at application level. Consequently, future computing facilities at extreme scale might be more prone to errors of any kind, including bit-flips during calculation. These numerical and technological observations are the main motivations for this work, where we first investigate through extensive numerical experiments the sensitivity of CG to bit-flips in its main computationally intensive kernels, namely the matrix-vector product and the preconditioner application. We further propose numerical criteria to detect the occurrence of such faults, and we assess their robustness through extensive numerical experiments.

More information on these results can be found in [16].

#### Energy analysis of a solver stack for frequency-domain electromagnetics

High-performance computing (HPC) aims at developing models and simulations for applications in numerous scientific fields. Yet, the energy consumption of these HPC facilities currently limits their size and performance, and consequently the size of the tackled problems. The complexity of HPC software stacks and their various optimizations makes it difficult to finely understand the energy consumption of scientific applications. To highlight this difficulty on a concrete use-case, we perform an energy and power analysis of a software stack for the simulation of frequency-domain electromagnetic wave propagation. This solver stack combines a high-order finite element discretization framework for the system of three-dimensional frequency-domain Maxwell equations with an algebraic hybrid iterative-direct sparse linear solver. This analysis is conducted on the KNL-based PRACE-PCP system. Our results illustrate the difficulty of predicting how to trade energy against runtime.

More information on these results can be found in [18].

#### A compiler front-end for OpenMP's variants

OpenMP 5.0 introduced the concept of *variant*: a directive which can be
used to indicate that a function is a variant of another existing *base
function* in a specific context (e.g., `foo_gpu_nvidia` could be declared
as a variant of `foo`, but only when executing on specific NVIDIA
hardware).

In the context of PRACE-5IP, in collaboration with the Inria STORM team, we
want to leverage this construct to be able to take advantage of the StarPU
heterogeneous scheduler through the interoperability layer between OpenMP and
StarPU. We started this work by implementing the necessary changes in the Clang
front-end to support OpenMP's *variant*. We have assessed this support in
the `Chameleon` dense linear algebra package. Indeed, `Chameleon` relies on
sequential task-based algorithms where sub-tasks of the overall algorithms are
submitted to a runtime system. In addition to the `quark`, `PaRSEC` and
`StarPU` support, we have implemented OpenMP support in `Chameleon`. The
originality of the proposed approach is that this OpenMP support can either rely
on a native OpenMP runtime system or indirectly use the above-mentioned
OpenMP-StarPU back-end. We are currently assessing the approach on multicore
homogeneous machines, the next step being heterogeneous architectures.