Section: New Results
Structuring of Applications for Scalability
Participants: Pierre-Nicolas Clauss, Sylvain Contassot-Vivier, Vassil Iordanov, Thomas Jost, Jens Gustedt, Soumeya-Leila Hernane, Constantinos Makassikis, Stéphane Vialle.
Large Scale and Interactive Fine-Grained Simulations
Our library parXXL allows the validation of a wide range of fine-grained applications and problems. This year we were able to test the interactive simulation of PDEs in physics, see [4], on a large scale. Also, biologically inspired neural networks have been investigated using parXXL and the InterCell software suite. The InterCell suite and these applicative results have been presented in [22].
Distribution of an N-dimensional problem
In previous years we have distributed a Stochastic Control Algorithm with EDF R&D. Our parallel algorithm has been designed to support dynamic N-dimensional problems on large-scale architectures. It successfully ran an electricity asset management problem with 7 energy stocks and 10 state variables, and achieved both speedup and size-up on PC clusters (up to 256 nodes and 512 cores) and on a Blue Gene/P (up to 8192 nodes and 32768 cores). In 2009, EDF already used this distributed and optimized algorithm and implementation in three applications. In 2010 we published a book chapter that introduces all algorithmic issues of this research [33].
However, this parallel algorithm was embedded in an application-specific stochastic control algorithm. We have therefore set up an applied research project aiming to develop a portable library that distributes and manages dynamic N-dimensional arrays on large-scale architectures, independently of the final application. Collaboration with EDF would be very helpful, but the project can be carried out by SUPÉLEC and INRIA alone.
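To give an idea of the basic service such a library would provide, the following sketch (our own illustration; the function name and interface are hypothetical, not an existing API) computes the block of an N-dimensional array owned by each process when the array is split into near-equal slabs along one axis:

```python
# Hypothetical sketch of block-distributing an N-dimensional array over P
# processes; each rank owns a contiguous slab along one chosen axis.

def local_slice(global_shape, axis, rank, nprocs):
    """Return the tuple of slices of the global array owned by `rank`
    when the array is split into near-equal blocks along `axis`."""
    n = global_shape[axis]
    base, rem = divmod(n, nprocs)
    start = rank * base + min(rank, rem)
    stop = start + base + (1 if rank < rem else 0)
    slices = [slice(None)] * len(global_shape)
    slices[axis] = slice(start, stop)
    return tuple(slices)

# Example: a 3-dimensional stock/state grid split over 4 ranks along axis 0.
shape = (10, 7, 5)
parts = [local_slice(shape, axis=0, rank=r, nprocs=4) for r in range(4)]
# The slabs tile the first axis without overlap: sizes 3, 3, 2, 2.
```

A real library would, on top of such index arithmetic, handle the communication of ghost regions and the dynamic resizing of the arrays.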
Large Scale Models and Algorithms for Random Structures
Realistic graph generation is crucial as input for testing large-scale algorithms, both theoretical graph algorithms and network algorithms, e.g. our platform generator in Section 6.2.
Commonly used techniques for the random generation of graphs have two disadvantages, namely their lack of bias with respect to the history of the evolution of the graph, and their inability to produce families of graphs with a non-vanishing prescribed clustering coefficient. In this work we propose a model for the genesis of graphs that tackles these two issues. When translated into random generation procedures, it generalizes well-known procedures such as those of Erdős & Rényi and Barabási & Albert. When seen as composition schemes for graphs, it generalizes the perfect elimination schemes of chordal graphs. The model iteratively adds so-called contexts that introduce an explicit dependency on the previous evolution of the graph. Thereby they reflect a historical bias during this evolution that goes beyond the simple degree constraint of preferential edge attachment. Fixing certain simple statistical quantities during the genesis leads to families of random graphs with a clustering coefficient that can be bounded away from zero.
This year, we ran intensive simulations of these models; they confirm the theoretical results and show the ability of the approach to model the properties of graphs from application domains. A manuscript reporting on these experimental results has been submitted to a journal, see [37].
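For context, the two classical baselines that the model generalizes can be sketched in a few lines (our own illustration, not the model of [37]): an Erdős–Rényi graph and a Barabási–Albert preferential-attachment graph, together with the average local clustering coefficient, which vanishes for both families as the graph grows, precisely the limitation the context-based genesis addresses.

```python
# Classical random graph baselines and their clustering coefficient.
import random

def erdos_renyi(n, p, rng):
    """G(n, p): every edge present independently with probability p."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v); adj[v].add(u)
    return adj

def barabasi_albert(n, m, rng):
    """Start from a clique on m+1 vertices, then attach each new vertex to
    m targets drawn proportionally to degree (repeated-endpoint list)."""
    adj = {v: set() for v in range(n)}
    endpoints = []
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            adj[u].add(v); adj[v].add(u)
            endpoints += [u, v]
    for u in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for v in targets:
            adj[u].add(v); adj[v].add(u)
            endpoints += [u, v]
    return adj

def avg_clustering(adj):
    """Average over all vertices of the local clustering coefficient."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

rng = random.Random(42)
g = barabasi_albert(500, 3, rng)   # avg_clustering(g) is small for large n
h = erdos_renyi(300, 0.02, rng)    # avg_clustering(h) concentrates near p
```

In both cases the clustering coefficient tends to zero with growing n, whereas a history-biased, context-adding genesis can keep it bounded away from zero.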
Structuring algorithms for co-processing units
In 2009 and 2010, we designed and experimented with several algorithms and applications in the fields of option pricing for financial computations, generic relaxation methods, and PDE solving applied to a 3D transport model simulating chemical species in shallow waters. We aim to design a wide range of algorithms for GPU cluster architectures, to develop real expertise in mixed coarse- and fine-grained parallel algorithms, and to accumulate practical experience with heterogeneous cluster programming.
Our PDE solver on GPU clusters has been designed in the context of a larger project on the study of asynchronism (see 3.1 and 6.1.5). We needed an efficient sparse linear solver, so we designed and developed one on a GPU cluster (up to 16 GPUs). As GPU memory is still limited and iterative algorithms consume less memory than direct ones, our approach was to compare several iterative algorithms on a GPU. The results have led to several conclusions:

there does not exist a single method that provides both performance and generality;

GPU speedups over CPUs vary considerably, but are most often around 10;

the GPU versions are less accurate than their CPU counterparts.
In 2010 we optimized our synchronous and asynchronous algorithms and implementations of our PDE solver, both on CPU and GPU clusters. The asynchronous parallel algorithm runs faster iterations, but requires more of them and a more complex convergence detection, see Section 6.1.5. It is therefore not always faster than the synchronous algorithm, depending on the problem size and on the cluster features and size. We measured both the computing and the energy performance of our PDE solver in order to track the best solution as a function of the problem size, the cluster size and the features of the cluster nodes. The most efficient solution for a given configuration can be based on a CPU or a GPU computing kernel, and on a synchronous or an asynchronous parallel algorithm. Moreover, the fastest solution is not always the least energy consuming. See Section 6.2.2. Our recent results are introduced in [18] and [31]. We aim to design an automatic selection of the right kernel and the right algorithm, and to implement a self-adaptive application that frees the user from having to choose the kernel and algorithm to run.
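As a minimal, sequential illustration of the kind of iterative kernel being compared (our own sketch, not the project's GPU code), consider Jacobi iterations on the 1-D Poisson system A x = b with A = tridiag(-1, 2, -1). In a synchronous parallel run, boundary values are exchanged at every sweep; the asynchronous variant lets each node keep iterating with possibly stale neighbour values, trading faster iterations against more of them and a harder convergence test:

```python
# Sequential Jacobi sweeps on the 1-D Poisson system (sketch only).
import math

def jacobi_poisson_1d(b, iters):
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # Every component is updated from the *previous* iterate only:
        # this is the synchronous scheme; an asynchronous run would let
        # some components use newer or older values.
        x = [(b[i]
              + (x[i - 1] if i > 0 else 0.0)
              + (x[i + 1] if i < n - 1 else 0.0)) / 2.0
             for i in range(n)]
    return x

n = 50
x_true = [math.sin(math.pi * (i + 1) / (n + 1)) for i in range(n)]
# Build b = A @ x_true for the tridiagonal A so the exact solution is known.
b = [2 * x_true[i]
     - (x_true[i - 1] if i > 0 else 0.0)
     - (x_true[i + 1] if i < n - 1 else 0.0)
     for i in range(n)]
x = jacobi_poisson_1d(b, 5000)
err = max(abs(u - v) for u, v in zip(x, x_true))
# err becomes small after enough sweeps; the rate degrades as n grows.
```

On a cluster, each node would own a slice of x and the sweep count, communication scheme and convergence detection are exactly the knobs compared above.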
In parallel, in the framework of the PhD thesis of Wilfried Kirschenmann, co-supervised by Stéphane Vialle (SUPELEC & AlGorille team) and Laurent Plagne (EDF SINETICS team), we have designed and implemented a unified framework based on generic programming to provide a development environment suited to multicore CPUs, multicore CPUs with SSE units, and GPUs, for linear algebra applied to neutronic computations, see [27] and [23]. Our framework is composed of two layers: (1) MTPS, a low-level layer hiding the actual parallel architecture used, and (2) Legolas++, a high-level layer allowing the application developer to rapidly implement linear algebra operations. The Legolas++ layer aims to decrease development time, while the MTPS layer automatically generates highly optimized code for the target architecture in order to decrease execution time. Experimental performance of the MTPS layer turned out very good: the same source code achieved close to 100% of the theoretical performance on every supported target architecture. Our strategy is to generate optimized data storage and data access code for each target architecture, not just different computing codes. A new version of Legolas++, optimized to use the MTPS layer, is under development and will be completed in 2011.
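The underlying idea of separating the algorithm from the storage layout can be conveyed by a toy sketch (ours, far simpler than MTPS/Legolas++ and not their API): the same axpy kernel is written once against an abstract view, while each backend chooses its own memory layout, array-of-structures versus structure-of-arrays here standing in for CPU-, SSE- and GPU-specific layouts:

```python
# Same kernel, two storage layouts (toy version of layout abstraction).

class AoS:
    """Array of structures: one (x, y) record per element."""
    def __init__(self, xs, ys):
        self.data = [[x, y] for x, y in zip(xs, ys)]
    def get(self, field, i):
        return self.data[i][field]
    def set(self, field, i, v):
        self.data[i][field] = v
    def __len__(self):
        return len(self.data)

class SoA:
    """Structure of arrays: contiguous per-field storage (SIMD/GPU friendly)."""
    def __init__(self, xs, ys):
        self.data = [list(xs), list(ys)]
    def get(self, field, i):
        return self.data[field][i]
    def set(self, field, i, v):
        self.data[field][i] = v
    def __len__(self):
        return len(self.data[0])

X, Y = 0, 1

def axpy(view, a):
    # Written once, independently of the layout: y <- a*x + y.
    for i in range(len(view)):
        view.set(Y, i, a * view.get(X, i) + view.get(Y, i))

for backend in (AoS, SoA):
    v = backend([1.0, 2.0, 3.0], [10.0, 10.0, 10.0])
    axpy(v, 2.0)
    # Both layouts yield y = [12.0, 14.0, 16.0].
```

In MTPS this dispatch happens at compile time via C++ generic programming, so the abstraction costs nothing at run time; the Python indirection above only illustrates the principle.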
Finally, we have continued to design option pricers on GPU clusters, with Lokman Abbas-Turki (PhD student at University of Marne-la-Vallée) and colleagues from financial computing. In the past we developed European option pricers, distributing independent Monte-Carlo computations on the nodes of a GPU cluster. In 2010 we succeeded in developing an American option pricer on our GPU clusters, distributing strongly coupled Monte-Carlo computations. The Monte-Carlo trajectories depend on each other, which leads to many data transfers between CPUs and GPUs, and to many communications between cluster nodes. First results are encouraging: we achieve both speedup and size-up. Our algorithm and implementation will be optimized in 2011. Again, we investigate both computing and energy performance of our developments, in order to compare the merits of CPU clusters and GPU clusters with respect to execution speed and exploitation cost.
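The European case, whose independent Monte-Carlo batches parallelize trivially over cluster nodes, can be sketched sequentially as follows (our own illustration with arbitrary example parameters; the American pricer's coupled trajectories are far more involved):

```python
# Monte-Carlo pricing of a European call, checked against Black-Scholes.
import math, random

def mc_european_call(s0, k, r, sigma, t, n_paths, rng):
    """Simulate terminal prices under geometric Brownian motion and
    discount the mean payoff."""
    drift = (r - 0.5 * sigma * sigma) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        st = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        payoff_sum += max(st - k, 0.0)
    return math.exp(-r * t) * payoff_sum / n_paths

def bs_call(s0, k, r, sigma, t):
    """Closed-form Black-Scholes price, used here only as a reference."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return s0 * phi(d1) - k * math.exp(-r * t) * phi(d2)

rng = random.Random(0)
mc = mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 200_000, rng)
ref = bs_call(100.0, 100.0, 0.05, 0.2, 1.0)
# mc approaches ref (about 10.45 for these parameters) as n_paths grows.
```

On a cluster, each node runs an independent batch with its own random stream and only the batch means are reduced; it is the American pricer's inter-trajectory dependencies that break this independence and force the CPU-GPU transfers and inter-node communications mentioned above.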
Asynchronism
The previous paragraphs mentioned a project including the study of sparse linear solvers on GPUs. That project deals with the study of asynchronism in hierarchical and hybrid clusters mentioned in 3.1.
In that context, we study the adaptation of asynchronous iterative algorithms to a cluster of GPUs for solving PDE problems. In our solver, space is discretized by finite differences and all derivatives are approximated by Euler equations. The inner computations of our PDE solver consist in solving (generally sparse) linear systems, so a linear solver is included in our solver. As that part is the most time consuming, it is essential to make it as fast as possible in order to decrease the overall computation time. This is why we have decided to implement it on GPU, as discussed above. Our parallel scheme uses the Multisplitting-Newton method, a more flexible kind of block decomposition. In particular, it allows for asynchronous iterations.
Our first experiments, conducted on an advection-diffusion problem, have shown very interesting results in terms of performance [54]. Moreover, another aspect worth studying is the full use of all the computational power available on each node, in particular the multiple cores, in conjunction with the GPU. This is a work in progress.
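The block decomposition behind such multisplitting schemes can be sketched in the linear case (block Jacobi, our own illustration rather than the Multisplitting-Newton code): the unknowns are split into blocks, each block solves its own sub-system exactly and only exchanges interface values with its neighbours; dropping the barrier between sweeps yields the asynchronous variant.

```python
# Block Jacobi on the 1-D Poisson system tridiag(-1, 2, -1) (sketch only).
import math

def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-, main-, super-diagonals a, b, c."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def block_jacobi_poisson(b_rhs, nblocks, sweeps):
    """Each block solves its sub-system exactly (Thomas algorithm), using
    the neighbouring blocks' values from the previous sweep as boundary
    data: this is the synchronous scheme; an asynchronous run would let
    blocks proceed with whatever neighbour values are currently available."""
    n = len(b_rhs)
    x = [0.0] * n
    size = n // nblocks
    bounds = [(k * size, n if k == nblocks - 1 else (k + 1) * size)
              for k in range(nblocks)]
    for _ in range(sweeps):
        new_x = x[:]
        for lo, hi in bounds:
            m = hi - lo
            d = [b_rhs[lo + i]
                 + (x[lo - 1] if i == 0 and lo > 0 else 0.0)
                 + (x[hi] if i == m - 1 and hi < n else 0.0)
                 for i in range(m)]
            new_x[lo:hi] = thomas([-1.0] * m, [2.0] * m, [-1.0] * m, d)
        x = new_x
    return x

n = 32
x_true = [math.sin(math.pi * (i + 1) / (n + 1)) for i in range(n)]
b = [2 * x_true[i]
     - (x_true[i - 1] if i > 0 else 0.0)
     - (x_true[i + 1] if i < n - 1 else 0.0)
     for i in range(n)]
x = block_jacobi_poisson(b, nblocks=4, sweeps=600)
err = max(abs(u - v) for u, v in zip(x, x_true))
```

Multisplitting-Newton wraps such block solves inside Newton linearizations of the nonlinear PDE system, which is what makes the asynchronous iterations between blocks admissible.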