## Section: Scientific Foundations

### High performance Fast Multipole Method for N-body problems

Participants : Olivier Coulaud, Pierre Fortin, Luc Giraud, Jean Roman.

In most scientific computing applications considered nowadays as
computational challenges (like biological and material systems,
astrophysics or electromagnetism), the introduction of hierarchical
methods based on an octree structure has dramatically reduced the
amount of computation needed to simulate those systems for a given
error tolerance. For instance, in the N-body problem arising from
these application fields, we must compute all pairwise
interactions among N objects (particles, lines, ...) at every
timestep. Among these methods, the Fast Multipole
Method (FMM) developed for gravitational potentials in astrophysics
and for electrostatic (coulombic) potentials in molecular simulations
solves this N-body problem for any given precision with O(N) runtime
complexity against O(N^{2}) for the direct computation.

The potential field is decomposed in a near field part, directly
computed, and a far field part approximated thanks to multipole and
local expansions.
In the former `ScAlApplix` project, we introduced a matrix formulation of the
FMM that exploits the cache hierarchy on a processor through the Basic
Linear Algebra Subprograms (BLAS). Moreover, we developed a parallel
adaptive version of the FMM algorithm for heterogeneous particle
distributions, which is very efficient on parallel clusters of SMP
nodes. Finally on such computers, we developed the first hybrid
MPI-thread algorithm, which enables to reach better parallel
efficiency and better memory scalability.
We plan to work on the following points in `HiePACS` .

#### Improvement of calculation efficiency

Nowadays, the high performance computing community is examining
alternative architectures that address the limitations of modern
cache-based designs. `GPU` (Graphics Processing Units) and the Cell
processor have thus already been used in astrophysics and in molecular
dynamics. The Fast Mutipole Method has also been implemented on `GPU` .
We intend to examine the
potential of using these forthcoming processors as a building block
for high-end parallel computing in N-body calculations. More
precisely, we want to take advantage of our specific underlying BLAS routines
to obtain an efficient and easily portable FMM for these new architectures.
Algorithmic issues such as dynamic load balancing among heterogeneous
cores will also have to be solved in order to gather all the available
computation power.
This research action will be conduced on close connection with the
activity described in
SectionÂ
3.2 .

#### Non uniform distributions

In many applications arising from material physics or astrophysics, the distribution of the data is highly non uniform and the data can grow between two time steps. As mentioned previously, we have proposed a hybrid MPI-thread algorithm to exploit the data locality within each node. We plan to further improve the load balancing for highly non uniform particle distributions with small computation grain thanks to dynamic load balancing at the thread level and thanks to a load balancing correction over several simulation time steps at the process level.

#### Fast Multipole Method for dislocation operators

The engine that we develop will be extended to new potentials arising from material physics such as those used in dislocation simulations. The interaction between dislocations is long ranged (O(1/r) ) and anisotropic, leading to severe computational challenges for large-scale simulations. Several approaches based on the FMM or based on spatial decomposition in boxes are proposed to speed-up the computation. In dislocation codes, the calculation of the interaction forces between dislocations is still the most CPU time consuming. This computation has to be improved to obtain faster and more accurate simulations. Moreover, in such simulations, the number of dislocations grows while the phenomenon occurs and these dislocations are not uniformly distributed in the domain. This means that strategies to dynamically balance the computational load are crucial to acheive high performance.

#### Fast Multipole Method for boundary element methods

The boundary element method (BEM) is a well known
solution of boundary value problems appearing in various fields of
physics. With this approach, we only have to solve an integral
equation on the boundary. This implies an interaction that decreases in space, but results
in the solution of a dense linear system with
O(N^{3}) complexity. The FMM calculation that performs the
matrix-vector product enables the use of Krylov subspace
methods. Based on the parallel data distribution of the underlying
octree implemented to perform the FMM, parallel preconditioners can be
designed that exploit the local interaction matrices computed at the
finest level of the octree.
This research action will be conduced on close connection with the
activity described in SectionÂ
3.3 .
Following our earlier experience, we plan to first consider
approximate inverse preconditionners that can efficiently exploit
these data structures.