## Section: Scientific Foundations

### High-performance computing on next generation architectures

Participants : Rached Abdelkhalek, Olivier Coulaud, Iain Duff, Pierre Fortin, Luc Giraud, Abdou Guermouche, Guillaume Latu, Jean Roman.

The research directions proposed in `HiePACS` are strongly
influenced by both the applications we are studying and the
architectures that we target (e.g., massively parallel
architectures). Our main goal is to study the methodology needed to
efficiently exploit the new generation of high-performance computers
with all the constraints that it induces. To achieve
high performance with complex applications, we have to study both
algorithmic problems and the impact of the architectures on
algorithm design.

From the application point of view, the project will focus on multiresolution, multiscale and hierarchical approaches, which lead to multi-level parallelism schemes. This hierarchical parallelism is necessary to achieve good performance and high scalability on modern massively parallel platforms. In this context, more specific algorithmic problems are very important to obtain high performance. Indeed, the applications we are interested in often rely on data redistribution (e.g., code coupling applications). This well-known issue becomes very challenging as both the number of computational nodes and the amount of data increase. Thus, we have both to study new algorithms and to adapt the existing ones. In addition, some issues like task scheduling have to be restudied in this new context. The work done in this area will be applied, for example, in the context of code coupling (see Section 3.5).

Considering the complexity of modern architectures like
massively parallel architectures (e.g., Blue Gene-like platforms) or
new generation heterogeneous multicore architectures, task
scheduling becomes a challenging problem which is central to
obtaining high efficiency. Of course, this work requires the
use/design of scheduling algorithms and models specifically tailored to
our target problem. This has to be done in collaboration with
our colleagues from the scheduling community, for example
O. Beaumont (INRIA CEPAGE Project-Team). It is important to note that this
topic is strongly linked to the underlying programming
model. Indeed, for multicore architectures, it has become apparent over
the last five years that the best programming model is an approach
mixing multi-threading within computational nodes and message
passing between them. Over the same period, a lot of work has been
carried out in the high-performance computing community to understand
what is critical to efficiently exploit the massively multicore
platforms that will appear in the near future. It appeared that the
first key to performance is the grain of
computations. Indeed, on such platforms the grain of parallelism
must be small so that we can feed all the processors with a
sufficient amount of work. It is thus crucial for us to design new high
performance tools for scientific computing in this new context. This
will be done, for example, by adapting our solvers to
this new parallel scheme. Secondly, the larger the number of cores
inside a node, the more complex the memory hierarchy. This remark
impacts the behaviour of the algorithms within the node. Indeed, on
this kind of platform, NUMA effects will become more and more
problematic. Thus, it is very important to study and design
data-aware algorithms which take into account the affinity between
computational threads and the data they access. This is particularly
important in the context of our high-performance tools. Note that
this work has to be based on an intelligent, cooperative underlying
runtime (like the `marcel` thread library developed by the INRIA
RUNTIME Project-Team) which allows fine-grained management of data
distribution within a node.
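
The remark on the grain of computations can be illustrated with a minimal sketch. It is not taken from our solvers: the names `chunk_ranges` and `parallel_sum` are hypothetical, and Python threads are used only to show the decomposition (they do not give true parallel speed-up for this kind of pure-Python loop). The point is simply that the grain determines how many tasks are available to feed the cores.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n, grain):
    """Split the index range [0, n) into consecutive chunks of at most `grain` items."""
    return [(start, min(start + grain, n)) for start in range(0, n, grain)]


def parallel_sum(data, grain, workers):
    """Sum `data` with one task per chunk; a smaller grain creates more tasks."""
    chunks = chunk_ranges(len(data), grain)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task reduces one chunk; the partial sums are combined at the end.
        partials = pool.map(lambda r: sum(data[r[0]:r[1]]), chunks)
        return sum(partials)
```

With 1000 items and 4 workers, a grain of 500 yields only 2 tasks, so half of the workers stay idle, whereas a grain of 64 yields 16 tasks and leaves room for load balancing.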

Another very important issue concerns high-performance computing
using “heterogeneous” resources within a computational
node. Indeed, with the emergence of the `GPU` and the use of
more specific co-processors (like ClearSpeed cards), it is
important for our algorithms to efficiently exploit these new kinds of
architectures. To adapt our algorithms and tools to these
accelerators, we need to identify what can be done on the `GPU`
for example and what cannot. Note that recent results in the field
have shown the benefit of using both regular cores and `GPU`
to perform computations. Note also that, in contrast to the fine
grain parallelism needed by regular multicore
architectures, `GPU` requires coarser grain parallelism. Thus,
making both `GPU` and regular cores work together will lead to two
types of tasks in terms of granularity. This represents a challenging
problem, especially in terms of scheduling.
Our final goal would be to have high performance solvers
and tools which can efficiently run on all these types of
complex architectures by exploiting all the resources of the
platform (even if they are heterogeneous).
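
To make the scheduling difficulty concrete, the following is a minimal sketch of a greedy list scheduler over regular cores and one accelerator. It is an illustration under stated assumptions, not the scheduling strategy of our solvers: the `Task` type, the function `greedy_schedule` and all run times are hypothetical, tasks are assumed independent, and data transfer costs are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    cpu_time: float  # hypothetical run time on one regular core
    gpu_time: float  # hypothetical run time on the accelerator


def greedy_schedule(tasks, n_cpu_cores, n_gpus=1):
    """Place each task on the resource where it would finish earliest."""
    resources = [("cpu", i) for i in range(n_cpu_cores)]
    resources += [("gpu", i) for i in range(n_gpus)]
    ready = {r: 0.0 for r in resources}  # time at which each resource is free
    placement = {}
    # Handle the most expensive tasks first (classical list-scheduling heuristic).
    for task in sorted(tasks, key=lambda t: t.cpu_time, reverse=True):
        def finish(r, task=task):
            cost = task.cpu_time if r[0] == "cpu" else task.gpu_time
            return ready[r] + cost
        best = min(resources, key=finish)
        ready[best] = finish(best)
        placement[task.name] = best
    return max(ready.values()), placement
```

For instance, with two coarse tasks (8.0 on a core, 1.0 on the accelerator), four fine tasks (1.0 on a core, 2.0 on the accelerator) and two cores, this heuristic sends the coarse tasks to the accelerator and the fine tasks to the cores. A realistic scheduler would also have to account for task dependencies and data movement.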

To gain advanced knowledge of the design of
efficient computational kernels to be used in our high performance
algorithms and codes, we will develop research activities first on
regular frameworks before extending them to more irregular and complex
situations.
In particular, we will work first on optimized dense linear algebra
kernels and we will use them in our more complicated hybrid
solvers for sparse linear algebra and in our fast multipole algorithms for
interaction computations.
In this context, we will participate in the development of those kernels
in collaboration with groups specialized in dense linear algebra. In particular,
we intend to develop a strong collaboration with the group of Jack Dongarra
at the University of Tennessee. The objectives will be to
develop dense linear algebra algorithms and libraries for multicore
architectures in the context of the PLASMA project
(http://icl.cs.utk.edu/plasma/)
and for `GPU` and hybrid multicore/`GPU` architectures in the context of the
MAGMA project
(http://icl.cs.utk.edu/magma/).

The applications targeting massively parallel architectures are
very sensitive to communication or I/O management schemes. This
observation becomes particularly true when we consider applications
dealing with a huge amount of data, like very large scale simulations
that may produce petabytes of data. Thus, in the continuation of the
work we did around `out-of-core` extensions of our former sparse linear
solvers, we will study how we can efficiently deal with this huge
amount of data. Obtaining performance when relying on I/O operations
or on data transfers is mainly constrained by the capacity to
overlap these operations with computations as much as
possible. Another key feature is prefetching in the context of
I/O intensive applications. Even if this is a well-known
issue which has been studied in the past decade, it remains very
complex given the complexity of our target platforms, where we
already need prefetching and asynchronism to efficiently exploit the
platform (this is particularly true in the case of `GPU`).
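
The overlap of I/O with computation through prefetching can be sketched as follows. This is a minimal illustration, not the `out-of-core` mechanism of our solvers: `process_blocks` is a hypothetical name, and `load_block` and `compute` stand for whatever I/O routine and kernel the application provides. A single background thread loads block i+1 while block i is being processed.

```python
from concurrent.futures import ThreadPoolExecutor


def process_blocks(load_block, compute, n_blocks):
    """Apply `compute` to blocks 0..n_blocks-1, prefetching the next block."""
    results = []
    if n_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_block, 0)  # start loading the first block
        for i in range(n_blocks):
            block = pending.result()  # wait until the prefetched block is ready
            if i + 1 < n_blocks:
                # Issue the next load before computing, so both can overlap.
                pending = io_pool.submit(load_block, i + 1)
            results.append(compute(block))
    return results
```

Only two blocks are resident at any time (the one being processed and the one being loaded), which is the usual double-buffering trade-off between memory footprint and overlap.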

A more prospective objective is to study fault tolerance in the
context of large-scale scientific applications for massively
parallel architectures. Indeed, as the number of
computational cores per node increases, the probability of a hardware crash on
a core increases dramatically. This represents a crucial problem
that needs to be addressed. However, we will only study it at the
algorithmic/application level, even if it requires lower-level
mechanisms (at the OS level or even the hardware level). Of course, this
work can be done at lower levels (at the operating system level, for
example), but we do believe that handling faults at the application
level provides more knowledge about what has to be done (at the
application level we know what is critical and what is not). The
approach that we will follow will be based on the use of a
combination of fault-tolerant implementations of the run-time
environments we use (for example `FT-MPI`) and
an adaptation of our algorithms to try to manage this kind of
fault. This topic represents a very long-term objective which
needs to be addressed to guarantee the robustness of our solvers and
applications.
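
As an illustration of what handling faults at the application level can mean, here is a minimal checkpoint/rollback sketch for an iterative computation. It is only one possible technique and makes several assumptions: faults are simulated by an exception, checkpoints are kept in memory, and the names `SimulatedFault` and `run_with_checkpoints` are hypothetical. It does not use `FT-MPI` or any fault-tolerant runtime.

```python
import copy


class SimulatedFault(Exception):
    """Stands in for a hardware crash detected by the application."""


def run_with_checkpoints(step, state, n_iters, checkpoint_every, fail_at=()):
    """Run `step` n_iters times, rolling back to the last checkpoint on a fault."""
    pending_faults = set(fail_at)
    saved_iter, saved_state = 0, copy.deepcopy(state)
    i, restarts = 0, 0
    while i < n_iters:
        try:
            if i in pending_faults:
                pending_faults.discard(i)  # each simulated fault fires once
                raise SimulatedFault(i)
            state = step(state)
            i += 1
            if i % checkpoint_every == 0:
                # The application decides what is critical enough to save.
                saved_iter, saved_state = i, copy.deepcopy(state)
        except SimulatedFault:
            i, state = saved_iter, copy.deepcopy(saved_state)
            restarts += 1
    return state, restarts
```

The checkpoint interval controls the trade-off between the cost of saving state and the amount of work lost on a fault.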

Finally, it is important to note that the main goal of `HiePACS` is to
design tools and algorithms that will be used within
complex simulation frameworks on next-generation parallel
machines. Thus, together with our partners, we intend to use the proposed
approaches in complex scientific codes and to validate them within
very large scale simulations.