Section: New Results
Structuring of Applications for Scalability
Participants: Pierre-Nicolas Clauss, Sylvain Contassot-Vivier, Vassil Iordanov, Thomas Jost, Jens Gustedt, Soumeya Leila Hernane, Constantinos Makassikis, Stéphane Vialle.
Large Scale and Interactive Fine Grained Simulations
The integration of the formerly separate libraries ParCeL and SSCRAP into parXXL makes it possible to validate the combined library on a wide range of fine-grained applications and problems. Among the applications that we started testing this year is the interactive simulation of PDEs in physics, based on the InterCell project, see [4]. The idea there is to express PDEs as local equations in the discrete variable space and to map them to update functions on a cellular automaton. With the help of parXXL this fine-grained automaton can then be mapped onto the coarse-grained target machine, and the automaton cells can communicate synchronously or asynchronously. In this way, any fine-grained automaton required to solve the expressed PDE can be generated quickly and evaluated efficiently. Our hope is to find solutions for certain types of physically motivated problems for which no efficient solvers currently exist.
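To make the mapping from a PDE to cell update functions concrete, here is a minimal sketch for the 1D heat equation with an explicit finite-difference scheme and synchronous updates. It does not use parXXL or InterCell; the function names and the parameter `alpha` (standing for dt/dx²) are assumptions chosen for illustration.

```python
def heat_update(left, centre, right, alpha=0.25):
    """Local update rule of one cell: one explicit step of u_t = u_xx.

    alpha = dt / dx**2 is an assumed parameter; this explicit scheme
    is stable for alpha <= 0.5.
    """
    return centre + alpha * (left - 2.0 * centre + right)


def step(cells, alpha=0.25):
    """Synchronous update: every interior cell reads the OLD neighbour values.

    The two boundary cells are held fixed (Dirichlet condition).
    """
    new = list(cells)
    for i in range(1, len(cells) - 1):
        new[i] = heat_update(cells[i - 1], cells[i], cells[i + 1], alpha)
    return new


def simulate(cells, steps, alpha=0.25):
    """Apply the synchronous update `steps` times."""
    for _ in range(steps):
        cells = step(cells, alpha)
    return cells


if __name__ == "__main__":
    # A hot spot in the middle of a cold rod whose ends are kept at 0.
    rod = [0.0] * 11
    rod[5] = 1.0
    print(simulate(rod, 50))
```

Each cell only needs the values of its two neighbours, which is what makes this kind of automaton a candidate for distribution over a coarse-grained machine.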
A first application result has been presented in [32], and several experiments were exhibited at SuperComputing 2009 on the INRIA booth. At the end of 2009, several complex physical problems as well as biologically inspired neural networks are under investigation using parXXL and the InterCell software suite.
Distribution of a Stochastic Control Algorithm
The current version of our Stochastic Control Algorithm, see Section 4.1.4, successfully optimizes an electricity asset management problem with 7 energy stocks and 10 state variables, and achieves both speedup and sizeup on PC clusters (up to 256 nodes and 512 cores) and on a Blue Gene/P (up to 8192 nodes and 32768 cores, ranked 13th in the Top500 in the first half of 2008).
In 2009, EDF used this distributed and optimized algorithm and implementation in three applications: (1) an electricity asset management tool at EDF R&D (optimizing the simultaneous control of N electricity production units to minimize the cost of electricity production), (2) a valuation tool for a thermal power station of the English subsidiary EDF Energy (tracking the best production control of this power station as a function of the energy market), and (3) another valuation tool for a thermal power station in Holland (with more constraints on the production control). In parallel, we have attempted to adapt our parallel algorithm to GPU clusters. We succeeded in designing and implementing a combined coarse- and fine-grained solution on GPU clusters, but we did not achieve good performance, which pointed out the limits of the GPU approach for this problem. New large-scale experiments are planned for 2010 at EDF, and we aim to test the fault tolerance mechanisms designed in the PhD thesis of Constantinos Makassikis on some of these applications.
In summary, in 2009 we attempted to port this parallel algorithm to a GPU cluster while EDF used the multicore CPU cluster version in three applications. New applications are under development at EDF, and large-scale experiments are planned for 2010.
Large Scale Models and Algorithms for Random Structures
Realistic generation of graphs is crucial for producing inputs to test large-scale algorithms, theoretical graph algorithms as well as network algorithms, e.g., our platform generator in Section 6.2.
Commonly used techniques for the random generation of graphs have two disadvantages, namely their lack of bias with respect to the history of the graph's evolution, and their inability to produce families of graphs with a non-vanishing prescribed clustering coefficient. In this work we propose a model for the genesis of graphs that tackles these two issues. When translated into random generation procedures it generalizes well-known procedures such as those of Erdős & Rényi and Barabási & Albert. When seen purely as composition schemes for graphs they generalize the perfect elimination schemes of chordal graphs. The model iteratively adds so-called contexts that introduce an explicit dependency on the previous evolution of the graph. Thereby they reflect a historical bias during this evolution that goes beyond the simple degree constraint of preferential edge attachment. Fixing certain simple statical quantities during the genesis leads to families of random graphs with a clustering coefficient that can be bounded away from zero. The description of the model is found in [38]; two internships show promising experimental results.
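As a point of reference for the second disadvantage, the following minimal sketch generates sparse Erdős & Rényi graphs G(n, p) and measures their average local clustering coefficient. In G(n, p) the expected clustering coefficient is about p, so with p = c/n it tends to zero as n grows. The function names, the seed and the values of n and c are assumptions chosen for illustration; the context-based model of [38] is not implemented here.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, rng):
    """Return G(n, p) as an adjacency dict {vertex: set of neighbours}."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:  # each possible edge is drawn independently
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_clustering(adj):
    """Average over all vertices of the local clustering coefficient.

    Vertices with fewer than two neighbours contribute 0.
    """
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count the edges that exist among the neighbours of this vertex.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


if __name__ == "__main__":
    rng = random.Random(2009)  # fixed seed, assumed for reproducibility
    c = 4.0                    # assumed target average degree
    for n in (100, 200, 400):
        graph = erdos_renyi(n, c / n, rng)
        print(n, average_clustering(graph))
```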
New Control and Data Structures for Efficiently Overlapping Computations, Communications and I/O
Mutual exclusion is one of the classical problems of distributed computing. Several solutions have been devised in the literature, but most of them remain relatively far from practitioners' needs. This concerns first of all distributed platforms, which are particularly difficult to control and to certify. We proposed an extension to the classical Naimi-Trehel algorithm that allows partial locks on a given piece of data. Such ranged locks offer semantics close to POSIX file locking, where each thread locks the subpart of the file it is working on. We hope that this work can prove useful to high performance computing practitioners [30].
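For readers unfamiliar with the POSIX file locking that the ranged locks are compared to, here is a minimal single-machine sketch using Python's `fcntl.lockf`, which locks a byte range of an open file. It is not the distributed Naimi-Trehel extension of [30]; the helper names `lock_range`, `unlock_range` and `write_chunk` are assumptions made for illustration.

```python
import fcntl
import os
import tempfile


def lock_range(fd, start, length, exclusive=True):
    """Block until the byte range [start, start + length) is locked."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    fcntl.lockf(fd, mode, length, start, os.SEEK_SET)


def unlock_range(fd, start, length):
    """Release a previously locked byte range."""
    fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)


def write_chunk(path, start, data):
    """Write `data` at offset `start` while holding a lock on that range only."""
    fd = os.open(path, os.O_RDWR)
    try:
        lock_range(fd, start, len(data))
        try:
            os.pwrite(fd, data, start)
        finally:
            unlock_range(fd, start, len(data))
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Note: POSIX record locks are owned by the process, so this sketch
    # coordinates cooperating processes rather than threads of one process.
    fd, path = tempfile.mkstemp()
    os.write(fd, b"\0" * 8)
    os.close(fd)
    write_chunk(path, 0, b"aaaa")  # one worker owns bytes 0..3
    write_chunk(path, 4, b"bbbb")  # another worker owns bytes 4..7
    with open(path, "rb") as f:
        print(f.read())
    os.remove(path)
```

Because the two ranges do not overlap, workers holding them never block each other, which is the property that ranged locks aim to provide.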
Further, with the thesis [10] we introduced the framework of ordered read-write locks (ORWL), which are characterized by two main features: a strict FIFO policy for access and the attribution of access to lock-handles instead of processes or threads. These two properties allow applications to have controlled, proactive access to resources and thereby to achieve a high degree of asynchronism between different tasks of the same application. For the case of iterative computations with many parallel tasks that access their resources in a cyclic pattern, we provide a generic technique to implement them by means of ORWL. It was shown that the possible execution patterns for such a system correspond to a combinatorial lattice structure and that this lattice is finite iff the configuration contains a potential deadlock. In addition, we provide efficient algorithms: one that allows for a deadlock-free initialization of such a system and another one for the detection of deadlocks in an already initialized system. The model and its theoretical properties are published in [15]. An experimental validation of our approach is given in [22].
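The two features named above can be illustrated with a small sketch: requests are queued in strict FIFO order on a resource, and access is granted to a handle object rather than to a thread. This is not the ORWL implementation or its API; the classes `Resource` and `Handle` and their methods are hypothetical, and the rule that consecutive readers at the head of the queue share access is an assumption of this sketch.

```python
import threading
from collections import deque


class Resource:
    """A resource with a strict FIFO queue of lock requests."""

    def __init__(self):
        self._queue = deque()  # pending and active requests, oldest first
        self._cond = threading.Condition()

    def _granted(self, handle):
        # A writer is granted only at the head of the queue; a reader is
        # granted if only readers are queued ahead of it.
        for other in self._queue:
            if other is handle:
                return True
            if handle.exclusive or other.exclusive:
                return False
        return False  # the handle has no queued request


class Handle:
    """Access is attributed to this handle, not to a thread or process."""

    def __init__(self, resource, exclusive):
        self.resource = resource
        self.exclusive = exclusive

    def request(self):
        """Proactively enqueue a request without blocking."""
        with self.resource._cond:
            self.resource._queue.append(self)

    def test(self):
        """Return True if the queued request is currently granted."""
        with self.resource._cond:
            return self.resource._granted(self)

    def acquire(self):
        """Block until the queued request is granted."""
        with self.resource._cond:
            self.resource._cond.wait_for(lambda: self.resource._granted(self))

    def release(self):
        """Remove the request from the queue and wake up waiting handles."""
        with self.resource._cond:
            self.resource._queue.remove(self)
            self.resource._cond.notify_all()


if __name__ == "__main__":
    res = Resource()
    writer, reader = Handle(res, True), Handle(res, False)
    writer.request()
    reader.request()  # queued early, long before the data is needed
    print(writer.test(), reader.test())
    writer.release()
    print(reader.test())
```

Separating `request` from `acquire` is what lets a task announce its future accesses ahead of time, so the order of accesses is fixed by the queue rather than by thread scheduling.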
Structuring Algorithms for Co-processing Units
In 2009 we designed and experimented with several algorithms and applications in the fields of option pricing for financial computations, generic relaxation methods, and PDE solving applied to a 3D transport model simulating chemical species in shallow waters. We aim to design a large range of algorithms for GPU cluster architectures, to develop real expertise in mixed coarse- and fine-grained parallel algorithms, and to accumulate practical experience in heterogeneous cluster programming.
Our PDE solver for GPU clusters has been designed in the context of a larger project on the study of asynchronism (see 3.1 and 6.1.6). Since we needed an efficient sparse linear solver, we designed and developed such a solver on a cluster of GPUs (up to 16 GPUs). As GPU memory is still limited and iterative algorithms consume less memory than direct ones, our approach was to compare several iterative algorithms on a GPU. The results have led to several conclusions:

no single method offers both performance and generality;

speedups of GPUs relative to CPUs vary considerably but are most often around 10;

the GPU versions are less accurate than their CPU counterparts.
Overall, these results are rather encouraging for the use of GPUs on linear problems. They have been presented in [33]. The next step in this work is to develop a direct method, the main interest of direct methods being their better performance compared to iterative ones. This work will be the subject of the thesis of Thomas Jost, which will be co-supervised by Bruno Lévy from the Alice INRIA team and Sylvain Contassot-Vivier from the AlGorille team.
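For readers unfamiliar with iterative sparse solvers, here is a minimal CPU sketch of one classical example, the Jacobi iteration. The choice of Jacobi, the dictionary-based row storage and all names are assumptions made for illustration; the text above does not state which iterative algorithms were compared, and this code says nothing about GPU performance or accuracy.

```python
def jacobi(rows, b, tol=1e-10, max_iter=10_000):
    """Solve A x = b iteratively.

    rows[i] is a dict {column: value} holding only the nonzeros of row i,
    so memory use is proportional to the number of nonzeros plus two vectors.
    Returns the approximate solution and the number of iterations performed.
    Convergence is guaranteed e.g. for strictly diagonally dominant A.
    """
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iter + 1):
        x_new = [0.0] * n
        for i, row in enumerate(rows):
            # Every component is computed from the PREVIOUS iterate only,
            # so all rows could be updated in parallel.
            off_diag = sum(v * x[j] for j, v in row.items() if j != i)
            x_new[i] = (b[i] - off_diag) / row[i]
        change = max(abs(a - c) for a, c in zip(x_new, x))
        x = x_new
        if change < tol:
            return x, iteration
    return x, max_iter


if __name__ == "__main__":
    # Small diagonally dominant tridiagonal system whose solution is [1, 2, 3].
    rows = [
        {0: 4.0, 1: -1.0},
        {0: -1.0, 1: 4.0, 2: -1.0},
        {1: -1.0, 2: 4.0},
    ]
    solution, iterations = jacobi(rows, [2.0, 4.0, 10.0])
    print(solution, iterations)
```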
In parallel, in the framework of the PhD thesis of Wilfried Kirschenmann, co-supervised by Stéphane Vialle (SUPELEC & AlGorille team) and Laurent Plagne (EDF SINETICS team), we investigate the design of a unified framework based on generic programming to achieve a development environment adapted both to multicore CPUs and GPUs, for linear algebra applied to neutronic computations.
Asynchronism
The previous paragraph mentioned a project within which the study of sparse linear solvers on GPUs was conducted. That project deals with the study of asynchronism in hierarchical and hybrid clusters mentioned in 3.1.
Our first experiments, conducted on an advection-diffusion problem, have shown very interesting results in terms of performance [26] as well as in terms of energy efficiency (in submission). Another aspect worth studying is the full use of all the computational power present on each node, in particular the multiple cores, in conjunction with the GPU. This is work in progress.
Heterogeneous Architecture Programming
In 2009 we designed and experimented with several algorithms and applications in the fields of option pricing for financial computations, PDE solving applied to a 3D transport model simulating chemical species in shallow waters, and generic relaxation methods. In parallel, in a collaboration between EDF and SUPÉLEC, we investigate the design of a unified framework based on generic programming to achieve a development environment adapted both to multicore CPUs and GPUs, for linear algebra applied to neutronic computations.
Moreover, we have measured the computing and energy performance of our algorithms in order to compare the merits of CPU clusters and GPU clusters as a function of the algorithms and application domains.