Team AlGorille


Section: New Results

Structuring of Applications for Scalability

Participants: Pierre-Nicolas Clauss, Sylvain Contassot-Vivier, Vassil Iordanov, Thomas Jost, Jens Gustedt, Soumeya Leila Hernane, Constantinos Makassikis, Stéphane Vialle.

Large Scale and Interactive Fine Grained Simulations

Our library parXXL allows for the validation of a wide range of fine-grained applications and problems. This year we were able to test the interactive simulation of PDEs in physics on a large scale, see [4]. Also, biologically inspired neural networks have been investigated using parXXL and the InterCell software suite. The InterCell suite and these applicative results have been presented in [22].

Distribution of an N-dimensional problem

In previous years we have distributed a Stochastic Control Algorithm with EDF R&D. Our parallel algorithm has been designed to support dynamic N-dimensional problems on large-scale architectures. It successfully ran an electricity asset management problem with 7 energy stocks and 10 state variables, and achieved both speedup and size-up on PC clusters (up to 256 nodes and 512 cores) and on a Blue Gene/P (up to 8192 nodes and 32768 cores). In 2009, EDF already used this distributed and optimized algorithm and implementation in three applications. In 2010 we published a book chapter that introduces all algorithmic issues of this research [33].

However, this parallel algorithm was designed embedded in an applicative stochastic control algorithm. We have therefore set up an applied research project aiming to develop a portable library to distribute and manage dynamic N-dimensional arrays on large-scale architectures, independently of the final application. Collaboration with EDF would be greatly helpful, but the project can be achieved by SUPÉLEC and INRIA alone.
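The kind of decomposition such a library must provide can be illustrated by a minimal sketch (hypothetical code, not the envisioned library itself): split an N-dimensional grid into contiguous slabs along its largest axis, one slab per node.

```python
def block_distribution(shape, n_nodes):
    """Split the largest axis of an N-dimensional grid into contiguous
    slabs, one per node; returns a list of (axis, start, stop) ranges.
    Illustrative sketch only, assuming a simple 1D slab decomposition."""
    axis = max(range(len(shape)), key=lambda i: shape[i])  # axis to split
    size = shape[axis]
    base, extra = divmod(size, n_nodes)  # spread the remainder evenly
    ranges, start = [], 0
    for node in range(n_nodes):
        stop = start + base + (1 if node < extra else 0)
        ranges.append((axis, start, stop))
        start = stop
    return ranges

# Example: a 100x100x100 grid distributed over 4 nodes
print(block_distribution((100, 100, 100), 4))
# -> [(0, 0, 25), (0, 25, 50), (0, 50, 75), (0, 75, 100)]
```

A real library must additionally handle redistribution when the array shape evolves dynamically, which is where the difficulty lies.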

Large Scale Models and Algorithms for Random Structures

Realistic graph generation is crucial as an input for testing large-scale algorithms, both theoretical graph algorithms and network algorithms, e.g., our platform generator in Section 6.2.

Commonly used techniques for the random generation of graphs have two disadvantages, namely their lack of bias with respect to the history of the evolution of the graph, and their inability to produce families of graphs with a non-vanishing prescribed clustering coefficient. In this work we propose a model for the genesis of graphs that tackles these two issues. When translated into random generation procedures, it generalizes well-known procedures such as those of Erdős & Rényi and Barabási & Albert. When seen simply as a composition scheme for graphs, it generalizes the perfect elimination schemes of chordal graphs. The model iteratively adds so-called contexts that introduce an explicit dependency on the previous evolution of the graph. They thereby reflect a historical bias during this evolution that goes beyond the simple degree constraint of preferential edge attachment. Fixing certain simple statistical quantities during the genesis leads to families of random graphs with a clustering coefficient that can be bounded away from zero.

This year, we have run intensive simulations of these models that confirm the theoretical results and show the ability of this approach to model the properties of graphs from application domains. A manuscript reporting on these experimental results has been submitted to a journal, see [37].
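The effect of history-dependent generation on clustering can be illustrated by a generic toy growth model (not the context model of [37]; names and parameters below are purely illustrative): each new vertex attaches by preferential attachment, except that with some probability an edge instead closes a triangle, which keeps the clustering coefficient away from zero.

```python
import random

def grow_graph(n, m=2, p_triad=0.5, seed=0):
    """Toy history-dependent growth model: each new vertex gains m edges.
    With probability p_triad an edge attaches to a neighbour of an already
    chosen target (triad closure, creating a triangle); otherwise it is
    chosen proportionally to degree (preferential attachment)."""
    rng = random.Random(seed)
    adj = {v: set(range(m + 1)) - {v} for v in range(m + 1)}  # seed clique
    ends = [v for v in adj for _ in adj[v]]  # edge-end multiset for pref. att.
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            if targets and rng.random() < p_triad:
                u = rng.choice(sorted(targets))
                cand = rng.choice(sorted(adj[u]))   # triad closure
            else:
                cand = rng.choice(ends)             # degree-proportional
            if cand != new:
                targets.add(cand)
        adj[new] = set(targets)
        for t in targets:
            adj[t].add(new)
            ends += [new, t]
    return adj

def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are themselves adjacent."""
    nb = sorted(adj[v])
    k = len(nb)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nb[j] in adj[nb[i]])
    return 2.0 * links / (k * (k - 1))
```

With p_triad = 0 this degenerates to plain preferential attachment, whose clustering vanishes as the graph grows; a positive p_triad biases the genesis towards the graph's own history, the phenomenon the context model captures in far greater generality.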

Structuring algorithms for co-processing units

In 2009 and 2010, we designed and experimented with several algorithms and applications, in the fields of option pricing for financial computations, generic relaxation methods, and PDE solving applied to a 3D transport model simulating chemical species in shallow waters. We aim to design a large range of algorithms for GPU cluster architectures, to develop real knowledge about mixed coarse- and fine-grained parallel algorithms, and to accumulate practical experience with heterogeneous cluster programming.

Our PDE solver on GPU clusters has been designed in the context of a larger project on the study of asynchronism (see Sections 3.1 and 6.1.5). We needed an efficient sparse linear solver, so we designed and developed such a solver on a cluster of GPUs (up to 16 GPUs). As GPU memory is still limited and iterative algorithms are less memory-consuming than direct ones, our approach was to compare several iterative algorithms on a GPU. The results led to several conclusions, discussed below.

In 2010 we optimized our synchronous and asynchronous algorithms and implementations of our PDE solver, both on CPU and on GPU clusters. The asynchronous parallel algorithm runs faster iterations, but requires more iterations and a more complex convergence detection, see Section 6.1.5. It is therefore not always faster than the synchronous algorithm, depending on the problem size and on the cluster features and size. We measured both the computing and the energy performance of our PDE solver in order to track the best solution as a function of the problem size, the cluster size and the features of the cluster nodes. The most efficient solution for a given configuration can be based on a CPU or a GPU computing kernel, and on a synchronous or an asynchronous parallel algorithm. Moreover, the fastest solution is not always the least energy consuming, see Section 6.2.2. Our recent results are introduced in [18] and [31]. We aim to design an automatic selection of the right kernel and the right algorithm, and to implement an auto-adaptive application that relieves the user from having to choose the kernel and the algorithm to run.
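The synchronous/asynchronous trade-off can be illustrated by a minimal single-process sketch (the actual solver runs distributed on a GPU cluster). Synchronous sweeps read only the previous sweep's values, as if all components were exchanged at a barrier; as a crude one-process analogue of asynchronism, the variant below reuses each component as soon as it is updated, so an "iteration" is cheaper to synchronize but the iteration count and convergence behaviour change.

```python
def iterate(A, b, sweeps, reuse_updates=False):
    """Relaxation sweeps for A x = b on a dense matrix (toy version).
    reuse_updates=False: Jacobi-style synchronous sweeps (snapshot of x);
    reuse_updates=True: components are read as soon as they are written,
    a single-process stand-in for asynchronous, barrier-free updates."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        src = x if reuse_updates else list(x)  # snapshot for the sync case
        for i in range(n):
            s = sum(A[i][j] * src[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x

# Toy diagonally dominant system; exact solution is (1/11, 7/11)
x_sync = iterate([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], 60)
x_async = iterate([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], 60, reuse_updates=True)
```

In the distributed setting the asynchronous variant additionally needs a non-trivial global convergence detection, which is precisely one of the costs discussed above.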

In parallel, in the framework of the PhD thesis of Wilfried Kirschenmann, co-supervised by Stéphane Vialle (SUPELEC & AlGorille team) and Laurent Plagne (EDF SINETICS team), we have designed and implemented a unified framework based on generic programming to achieve a development environment adapted to multi-core CPUs, multi-core CPUs with SSE units, and GPUs, for linear algebra applied to neutronic computations, see [27] and [23]. Our framework is composed of two layers: (1) MTPS, a low-level layer hiding the actual parallel architecture used, and (2) Legolas++, a high-level layer allowing the application developer to rapidly implement linear algebra operations. The Legolas++ layer aims to decrease development time, while the MTPS layer aims to automatically generate highly optimized code for the target architecture and thus to decrease execution time. The experimental performance of the MTPS layer proved very good: the same source code achieved performance close to 100% of the theoretical peak on every supported target architecture. Our strategy is to generate optimized data storage and data access code for each target architecture, not just different computing codes. A new version of Legolas++, optimized to use the MTPS layer, is under development and will be completed in 2011.
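The two-layer separation of concerns can be sketched in miniature (MTPS and Legolas++ are C++ template libraries; the Python classes and names below are purely illustrative, not their API): the low-level layer owns the data layout and the kernels, the high-level layer expresses the algebra without knowing which architecture runs it.

```python
class Backend:
    """Low-level layer (MTPS-like role, hypothetical): chooses the data
    layout for the target architecture and provides the tuned kernels."""
    def __init__(self, layout):
        self.layout = layout  # e.g. 'aos' for CPU, 'soa' for SSE/GPU

    def axpy(self, a, x, y):
        # single kernel: y + a*x, elementwise; a real backend would
        # generate architecture-specific storage and access code here
        return [a * xi + yi for xi, yi in zip(x, y)]

class Vector:
    """High-level layer (Legolas++-like role, hypothetical): expresses
    linear algebra independently of the backend that executes it."""
    def __init__(self, data, backend):
        self.data, self.backend = list(data), backend

    def __add__(self, other):
        return Vector(self.backend.axpy(1.0, self.data, other.data),
                      self.backend)

    def scaled_add(self, a, other):
        """self*a + other, delegated to the backend kernel."""
        return Vector(self.backend.axpy(a, self.data, other.data),
                      self.backend)

# The same high-level code runs on any backend:
v = Vector([1.0, 2.0], Backend('soa'))
w = v.scaled_add(2.0, Vector([3.0, 4.0], v.backend))
print(w.data)  # -> [5.0, 8.0]
```

The key design point mirrored here is that the backend controls storage and access patterns, not merely the computation, which is what allows near-peak performance from a single high-level source.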

Finally, we have continued to design option pricers on clusters of GPUs, with Lokman Abbas-Turki (PhD student at University of Marne-la-Vallée) and some colleagues from financial computing. In the past we developed European option pricers, distributing independent Monte Carlo computations on the nodes of a GPU cluster. In 2010 we succeeded in developing an American option pricer on our GPU clusters, distributing strongly coupled Monte Carlo computations. The Monte Carlo trajectories depend on each other, which leads to many data transfers between CPUs and GPUs, and to many communications between cluster nodes. The first results are encouraging: we achieve both speedup and size-up. Our algorithm and implementation will be optimized in 2011. Again, we investigate both the computing and the energy performance of our developments, in order to compare the interest of CPU clusters and GPU clusters with respect to execution speed and exploitation cost.
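For the European case, plain Monte Carlo pricing consists of independent draws, which is why it distributes trivially over cluster nodes; a minimal serial sketch under standard Black-Scholes assumptions (not the team's GPU implementation) shows the structure of what each node computes:

```python
import math
import random

def mc_european_call(S0, K, r, sigma, T, n_paths, seed=0):
    """European call by plain Monte Carlo under Black-Scholes dynamics:
    simulate terminal prices S_T = S0*exp((r - sigma^2/2)T + sigma*sqrt(T)*Z),
    average the discounted payoff max(S_T - K, 0). Each path is independent,
    so paths can be partitioned across nodes and the means combined."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * T
    vol = sigma * math.sqrt(T)
    payoff_sum = 0.0
    for _ in range(n_paths):
        ST = S0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        payoff_sum += max(ST - K, 0.0)
    return math.exp(-r * T) * payoff_sum / n_paths

# At-the-money example; the closed-form Black-Scholes price is about 10.45
price = mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 200_000)
```

The American pricer is precisely the case where this independence breaks down: the optimal exercise decision couples the trajectories, hence the heavy CPU-GPU transfers and inter-node communications mentioned above.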


A project including the study of sparse linear solvers on GPUs was mentioned above. That project deals with the study of asynchronism in hierarchical and hybrid clusters, as mentioned in Section 3.1.

In that context, we study the adaptation of asynchronous iterative algorithms to a cluster of GPUs for solving PDE problems. In our solver, space is discretized by finite differences and all derivatives are approximated by Euler schemes. The inner computations of our PDE solver consist of solving (generally sparse) linear systems; thus, a linear solver is included in our solver. As that part is the most time-consuming one, it is essential to make it as fast as possible in order to decrease the overall computation time. This is why we have decided to implement it on GPU, as discussed above. Our parallel scheme uses the Multisplitting-Newton method, which is a more flexible kind of block decomposition. In particular, it allows for asynchronous iterations.
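The block-decomposition idea underlying multisplitting can be sketched on a toy linear system (illustrative only, not the Multisplitting-Newton code): the unknowns are partitioned into blocks, each block solves its own subsystem exactly using the latest available values of the other blocks, and in the real solver each block lives on its own GPU and may update asynchronously.

```python
def solve2(a, b, c, d, e, f):
    """Direct solve of the 2x2 system [[a, b], [c, d]] x = [e, f]."""
    det = a * d - b * c
    return (e * d - b * f) / det, (a * f - e * c) / det

def block_multisplit(A, rhs, blocks, iters):
    """Toy multisplitting iteration on a dense matrix: each block solves
    its diagonal subsystem exactly, treating the other blocks' current
    values as fixed. Blocks are swept sequentially here; in the real
    solver they run in parallel, possibly asynchronously."""
    n = len(rhs)
    x = [0.0] * n
    for _ in range(iters):
        for blk in blocks:              # one block per node/GPU in practice
            i, j = blk                  # 2-unknown blocks for brevity
            ri = rhs[i] - sum(A[i][k] * x[k] for k in range(n) if k not in blk)
            rj = rhs[j] - sum(A[j][k] * x[k] for k in range(n) if k not in blk)
            x[i], x[j] = solve2(A[i][i], A[i][j], A[j][i], A[j][j], ri, rj)
    return x
```

Because each block only needs the other blocks' values as right-hand-side data, an update can proceed with slightly stale values, which is exactly what makes asynchronous iterations possible in this scheme.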

Our first experiments, conducted on an advection-diffusion problem, have shown very interesting performance results [54]. Moreover, another aspect worth studying is the full use of all the computational power present on each node, in particular the multiple cores, in conjunction with the GPU. This is work in progress.

