Team AlGorille

Section: New Results

Experimentation Methodology

Participants : Tomasz Buchert, Pierre-Nicolas Clauss, Fekari El Mehdi, Jens Gustedt, Martin Quinson, Cristian Rosa, Lucas Nussbaum.

Overall Improvements of the SimGrid Framework

This year was the second (out of three) of the ANR project centered on SimGrid, of which we are principal investigator (see 8.2.4). Several improvements were therefore made to the framework, with numerous contributions from the ANR participants. This served as a flagship for the whole SimGrid project and hosted several of our research efforts, detailed in the subsequent sections (up to 6.3.6).

In addition, the software quality efforts were pursued through the INRIA ADT project (see 8.2.1) in order to maximize the impact of our research on our current user community. First, we further improved our automated regression testing infrastructure by increasing the test coverage and by running nightly builds on the INRIA pipol infrastructure. Then, we began reorganizing the user documentation. Performance tuning also received a lot of our attention this year. Finally, two new bindings were added to make the framework usable by users who prefer the Ruby or Lua programming languages over C (Java bindings were already available).

Finally, several operations were conducted to grow our user community, such as tutorials and flier distributions during conferences, and the publication of a SimGrid newsletter explaining the newest evolutions of the tool to the users. The most prominent action was the organization of the first SimGrid User Days in April in Corsica. This event, which gathered about 20 members of the development team alongside 20 (current or potential) users for one week, was the occasion to present some new features and get feedback on them, as well as to ensure that the planned developments match the expectations of the user community. We were also present at the SuperComputing conference to meet potential users.

Synthesizing Generic Experimental Environments for Simulation

Simulation allows the fast and reproducible exploration of numerous experimental scenarios, including what-if scenarios that test conditions the experimenter does not have at hand. But this almost unbounded freedom in the experimental setup can also prove daunting. We analyzed the requirements expressed by different research communities. As the existing tools in the literature are too specific, we then proposed a more generic experimental environment synthesizer called Simulacrum. This tool allows its users to select a model of a currently deployed computing grid or to generate a random environment, to extract a subset of it that fulfills their requirements, and to import the result into the SimGrid framework. This work, conducted in collaboration with F. Suter from the Computing Center at IN2P3 as well as with L. Bobelin from the Mescal team at INRIA Rhône-Alpes, led to a publication [28].

Model-Checking of Distributed Applications

For a few years we have cooperated with Stephan Merz from the VeriDis project team on adding model checking capabilities to the SimGrid framework. The expected benefit of such an integration is that programmers can complement simulation runs by exhaustive state space exploration in order to detect errors such as race conditions that would be hard to reproduce by testing. Indeed, a simulation platform provides a controlled execution environment that mediates interactions between processes, and between processes and the environment, and thus provides the basic functionality for implementing a model checker. The principal challenge is the state explosion problem, as a naive approach to systematic generation of all possible process interleavings would be infeasible beyond the most trivial programs. Moreover, it is impractical to store the set of global system states that have already been visited: the programs under analysis are arbitrary C programs with full access to the heap, making even a hashed representation of system states very difficult and costly to implement.

In 2010, we made major advances on both the theoretical and the practical side of designing and implementing a model checker for SimGrid. Given that processes interact through explicit message passing, which is ultimately implemented in the SIMIX layer of SimGrid, the model checker need only control the possible interleavings of the operations provided at this layer (Send, Recv, Test, and WaitAny). Redundant interleavings can be ruled out by relying on Dynamic Partial-Order Reduction (DPOR) [53], which requires determining which interleaving orders may potentially lead to different results. Because we have only four primitive operations to consider, we could formally specify them in TLA+ and prove independence results based on this model. This compares very favorably to a similar analysis carried out by Pervez et al. [56] at the MPI level, requiring more than 100 pages of TLA+ specifications alone, for comparable reductions. This result has been published at a workshop [29].

In fact, the same techniques are also useful for avoiding redundant computations when performing large numbers of simulation runs. A stateless model checker relying on DPOR has now been implemented within the SimGrid platform, and a submission to a major conference is in preparation.
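
To illustrate why independence results matter, the following Python sketch enumerates every interleaving of two short processes and groups the runs that differ only by the order of independent operations. It is not DPOR itself and not the SimGrid model checker: the dependence rule (two operations conflict when they belong to the same process or use the same mailbox), the mailbox names and the helper functions are all assumptions made for the example, and they do not reproduce the independence theorems proved from the TLA+ specification.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each sequence's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        chosen = set(slots)
        ia, ib = iter(a), iter(b)
        yield tuple(next(ia) if i in chosen else next(ib) for i in range(n))

def dependent(x, y):
    """Assumed rule (illustration only): same process or same mailbox."""
    return x[0] == y[0] or x[2] == y[2]

def canonical(trace):
    """Smallest run (in tuple order) equivalent to `trace` up to reordering
    of independent operations; used as a key for its equivalence class."""
    remaining, out = list(trace), []
    while remaining:
        # An operation is ready if no earlier remaining operation conflicts with it.
        ready = [op for i, op in enumerate(remaining)
                 if not any(dependent(prev, op) for prev in remaining[:i])]
        best = min(ready)
        remaining.remove(best)
        out.append(best)
    return tuple(out)

def count_classes(a, b):
    """Return (number of interleavings, number of equivalence classes)."""
    runs = list(interleavings(a, b))
    return len(runs), len({canonical(r) for r in runs})

# Operations are (process, primitive, mailbox); the mailboxes are hypothetical.
p1 = [("P1", "Send", "m1"), ("P1", "Send", "m2")]
p2 = [("P2", "Send", "m1"), ("P2", "Recv", "m3")]
total, classes = count_classes(p1, p2)
print(total, classes)  # 6 interleavings, 2 equivalence classes
```

Only the relative order of the two operations on mailbox m1 can matter here, so a reduction that explores one run per class would visit 2 executions instead of 6. The gap widens quickly as processes and operations are added.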


The final goal of SMPI is to simulate, on a single computer and without any source code modification, a C/C++/FORTRAN MPI program designed for a multi-processor system. This addresses one of the main limitations of SimGrid, which requires the application to be written using one of the specific interfaces atop the simulator. New efforts have been put into this project since July 2009, continuing the work initiated by Henri Casanova and Mark Stilwell at University of Hawai'i at Manoa.

Previous work included a prototype implementation of various MPI primitives such as send, recv, isend, irecv and wait. Since the project's revival, many of the collective operations (such as bcast, alltoall, reduce) have been implemented. The standard network model used in SimGrid has also been reworked to reach a higher precision in communication timings. Indeed, MPI programs are traditionally run on high performance computers such as clusters, and fine network details must be captured to correctly model the program behavior. Starting from the existing, validated network model of SimGrid, we have derived a specific model for SMPI: a piece-wise linear model which closely fits real measurements. In particular, it correctly models both small messages and messages above the eager/rendezvous protocol limit. This work has been accepted for publication at the IPDPS conference in May 2011, and is already published as a research report [35].
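
As a rough illustration of what a piece-wise linear communication model looks like, the Python sketch below selects a (latency, bandwidth) pair according to the message size and returns latency + size / bandwidth. The segment boundaries and all numeric values are hypothetical placeholders chosen for the example; they are neither the parameters nor the exact formulation of the SMPI model.

```python
from bisect import bisect_right

# (lower bound of the segment in bytes, latency in seconds, bandwidth in bytes/s).
# Hypothetical values for illustration only, not measured ones.
SEGMENTS = [
    (0,     20e-6, 60e6),   # small messages
    (1024,  30e-6, 90e6),   # medium messages
    (65536, 90e-6, 110e6),  # large messages, above an assumed rendezvous threshold
]
LOWER_BOUNDS = [segment[0] for segment in SEGMENTS]

def comm_time(size_bytes):
    """Predicted transfer time: affine in the size within each segment."""
    if size_bytes < 0:
        raise ValueError("message size must be non-negative")
    # Index of the last segment whose lower bound is <= size_bytes.
    index = bisect_right(LOWER_BOUNDS, size_bytes) - 1
    _, latency, bandwidth = SEGMENTS[index]
    return latency + size_bytes / bandwidth

for size in (100, 4096, 1_000_000):
    print(size, comm_time(size))
```

In practice, each segment would be fitted separately against measurements taken on the target cluster, which is what lets such a model follow the change of regime around the protocol threshold.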

Application Workload and Trace Replay

Simulations in SimGrid are usually written as sets of agents exchanging messages. In some settings, however, an event-oriented approach may lead to simpler solutions. Instead of specifying a large function containing the full logic of the agent, one simply specifies how it should react to each possible incoming message or event. This could be particularly interesting, for example, when replaying post-mortem traces captured on real applications. In 2010, we worked on two aspects of this problem: the trace capture and the trace replay.
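
The event-oriented style can be sketched in a few lines of Python. The EventAgent class, the event kinds and the handlers below are hypothetical and do not correspond to any SimGrid interface; the point is only that the agent's logic is split into one small reaction per event kind instead of living in a single large function.

```python
class EventAgent:
    """Hypothetical agent whose behavior is a table of per-event handlers."""

    def __init__(self):
        self.handlers = {}
        self.log = []

    def on(self, kind):
        """Decorator registering the reaction to events of the given kind."""
        def register(fn):
            self.handlers[kind] = fn
            return fn
        return register

    def dispatch(self, kind, payload):
        """Deliver one event; unknown kinds are recorded and skipped."""
        handler = self.handlers.get(kind)
        if handler is None:
            self.log.append(("ignored", kind))
            return
        handler(self, payload)

worker = EventAgent()

@worker.on("task")
def handle_task(agent, payload):
    agent.log.append(("computed", payload * 2))

@worker.on("stop")
def handle_stop(agent, payload):
    agent.log.append(("stopped", None))

# Replaying a recorded list of events amounts to dispatching them in order.
for kind, payload in [("task", 3), ("ping", None), ("stop", None)]:
    worker.dispatch(kind, payload)
print(worker.log)  # [('computed', 6), ('ignored', 'ping'), ('stopped', None)]
```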

Concerning the trace capture, we initiated the Simterpose project, which aims at providing emulation capabilities on top of SimGrid by intercepting the actions of the application and providing them to the simulator. During the internship of Marion Guthmuller in 2010, different methods to intercept the actions of applications were evaluated (ptrace, LD_PRELOAD, DynInst, Valgrind), and a trace extraction tool using ptrace was developed (in this respect similar to the classical tools strace and gdb). A preliminary publication is in preparation, while we aim at improving this prototype and integrating it properly into the SimGrid framework to actually allow the emulation of arbitrary applications through the simulator.

Concerning the trace replay, we worked on a replay mechanism in SimGrid specialized for the study of MPI applications through post-mortem analysis. This work was conducted in collaboration with F. Suter from the Computing Center at IN2P3 as well as with F. Desprez and G. Markomanolis from the Graal team at INRIA Rhône-Alpes. Its main originality is to rely on time-independent execution traces. This allows us to completely decouple the acquisition process from the actual replay of the traces in a simulation context. We are thus able to acquire traces for large application instances without being limited to an execution on a single compute cluster. Finally, our replay framework is built directly on top of the SimGrid simulation kernel. This work was recently submitted to the CCGrid conference, and is also available as a research report [36].
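
The decoupling between acquisition and replay can be sketched as follows, assuming that a time-independent trace records volumes (amounts of computation and of data exchanged) rather than timestamps. The trace format, the action names and the platform parameters are hypothetical, and the timing rule is a deliberately naive stand-in for the SimGrid simulation kernel.

```python
# Hypothetical trace of one process: (action, volume) pairs with no timestamps.
# "compute" volumes are in floating-point operations, "send" volumes in bytes.
TRACE = [("compute", 2e9), ("send", 1e6), ("compute", 1e9)]

def replay(trace, flops_per_s, bytes_per_s, latency_s):
    """Return the simulated duration of `trace` on a platform described
    by its compute speed, network bandwidth and network latency."""
    clock = 0.0
    for action, volume in trace:
        if action == "compute":
            clock += volume / flops_per_s
        elif action == "send":
            clock += latency_s + volume / bytes_per_s
        else:
            raise ValueError(f"unknown action: {action}")
    return clock

# The same trace is replayed on two different (made-up) platforms.
slow = replay(TRACE, flops_per_s=1e9, bytes_per_s=1e8, latency_s=1e-4)
fast = replay(TRACE, flops_per_s=2e9, bytes_per_s=1e9, latency_s=1e-5)
print(slow, fast)
```

Because the trace contains no timing information, it does not matter where or how it was acquired: only the platform description passed to the replay determines the predicted duration.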

SimGrid Scalability Improvements

In addition to the software tuning and improvement described in 6.3.1, we tackled the main SimGrid scalability limitations at an algorithmic level.

One of the main remaining limitations was the memory consumption due to the network representation. Indeed, SimGrid used a complete routing table, storing the whole set of links used to go from any host to any other host. This large O(N²) routing table prevented SimGrid from simulating platforms with more than a few thousand nodes, regardless of the simulated application. De Munck et al. [49] proposed to recompute the routes dynamically using shortest-path algorithms. However, this approach suffers from both performance and scalability issues.

This year, we proposed a new platform description formalism to handle very large platforms, in collaboration with A. Legrand and L. Bobelin from the Mescal team at INRIA Rhône-Alpes, and with F. Suter from the Computing Center at IN2P3. This formalism takes advantage of the hierarchical structure of the platforms and of the regularity of some of their parts. It makes it possible to build a description of the platform with a small memory footprint, which would replace the current comprehensive routing table and would be much faster and more scalable than the generic approaches proposed in [49]. This improvement will make it possible to keep using the same realistic flow-based models as before, but with platforms that are many orders of magnitude larger than what is currently possible. The implementation of this new approach constituted the main work of David Marquez from University of Buenos Aires, Argentina, during his internship in our team. A publication summarizing these improvements is currently in preparation.
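
The memory argument can be made concrete with a small Python sketch. It assumes a platform made of regular clusters in which every host has a single private link to its cluster, so that one rule per cluster plus a cluster-to-cluster table is enough to rebuild any route. This layout, the link naming scheme and the function names are assumptions made for the example, not the formalism itself.

```python
def flat_table_entries(n_clusters, hosts_per_cluster):
    """Routes stored by a complete host-to-host table: N squared."""
    n = n_clusters * hosts_per_cluster
    return n * n

def hierarchical_entries(n_clusters):
    """One inter-cluster route per ordered cluster pair, plus one rule per cluster."""
    return n_clusters * n_clusters + n_clusters

def route(src, dst, backbone):
    """Rebuild the list of links between two hosts given as (cluster, index).

    `backbone` maps an ordered pair of cluster names to the links joining them.
    """
    if src == dst:
        return []
    (c1, h1), (c2, h2) = src, dst
    up = f"{c1}-host{h1}-link"      # assumed private link of the source host
    down = f"{c2}-host{h2}-link"    # assumed private link of the destination host
    if c1 == c2:
        return [up, down]
    return [up] + backbone[(c1, c2)] + [down]

backbone = {("A", "B"): ["A-B-backbone"], ("B", "A"): ["A-B-backbone"]}
print(route(("A", 3), ("B", 7), backbone))
print(flat_table_entries(100, 1000), hierarchical_entries(100))
```

For 100 clusters of 1000 hosts, the complete table holds 10^10 routes whereas the hierarchical description holds 10,100 entries, and each route is rebuilt on demand in constant time under the regularity assumption above.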

These improvements completely solved the memory pressure due to the internal network representation, but memory remains the main limitation in other cases, for example when simulating high performance applications through the new SMPI interface. In that case, it may prove necessary to distribute the simulation in order to leverage the memory of several computing facilities. To that end, we completely redesigned the SIMIX layer of the simulator in order to further decouple the execution of each user process from the others and from the execution of the simulation kernel. This decoupling was recently completed, and we are now working on executing the simulation both in parallel (to leverage several computing cores) and in a distributed manner (to leverage the memory of several computers).

Formal Verification of Distributed Algorithms' Specifications

In joint research with Stephan Merz and Sabina Akhtar of the Mosel team of INRIA Nancy and LORIA, we extended the PlusCal language [55] to allow the description and verification of distributed algorithms, whereas the original language is geared towards shared-memory concurrent programming. In 2010, the compiler from this language to the TLA+ tool suite was made operational and was presented in [13]. We are now exploring how to allow the TLC model checker to apply dynamic partial-order reduction techniques to fight the combinatorial explosion. This could be made possible by adding specific information to the TLA+ files generated from our extension of PlusCal.


During the internship of Tomasz Buchert, the implementation of CPU performance emulation in Wrekavoc was reconsidered to handle the emulation of multi-core systems using multi-core nodes. Three different methods for the emulation of multi-core CPU performance were designed (Fracas, CPU-Hogs, CPU-Gov). This work resulted in a first publication on the Fracas method in a workshop [15]. A more complete publication is in preparation, and the work is already available as a research report [34]. Tomasz Buchert submitted this work as a master's thesis at his home university [46].

Grid'5000 and ADT Aladdin-G5K

Grid'5000 is an experimental platform for research on distributed systems, composed of 9 sites in France. In 2010, the “grelon” cluster was replaced by the newly bought “graphene” cluster, composed of 144 nodes (with 4 cores and 16 GB of RAM each). In addition to “graphene”, the “griffon” cluster (92 nodes, 8 cores each) is still operational.

Lucas Nussbaum collaborated with the INRIA CACAO team on the technical aspects of the RSA-768 experiment that led to a new record in integer factorization. This work led to a publication [24] .


The DSL-Lab platform was built during the ANR JC DSL-Lab. It is composed of 40 nodes dedicated to experiments on the broadband Internet. In 2010, a final publication summarizing the results of the project was published [21] .

Experimental cluster of GPUs

The experimental platform of SUPÉLEC for "GPGPU" (see Section 4.2.6) was improved again in 2010.

First, the 16 NVIDIA GTX285 GPUs have been replaced by new NVIDIA GTX480 GPUs ("Fermi" architecture), and the 16 old GT8800 have been replaced by the 16 GTX285. The first cluster is thus now composed of 16 PCs, each one hosting a dual-core CPU and a GPU card: an NVIDIA GeForce GTX285 with 1GB of RAM (on the GPU card). The 16 nodes are interconnected through a dedicated Gigabit Ethernet switch. The second cluster has 16 more recent nodes, each composed of an Intel Nehalem CPU with 4 hyper-threaded cores at 2.67GHz and an NVIDIA GTX480 ("Fermi") GPU card with 1.5GB of memory. This cluster has a Gigabit Ethernet interconnection network too. These 2 clusters can be accessed and used as a single 32-node heterogeneous cluster of hybrid nodes. This platform has allowed us to experiment with different algorithms on a heterogeneous cluster of GPUs.

Second, the energy consumption of each node of the cluster hosting the GTX285 GPUs is monitored by a Raritan DPXS20A-16 device that continuously measures the electric power consumption (in Watts). The cluster hosting the GTX480 GPUs, however, draws more power and exceeds the maximum supported by a single Raritan DPXS20A-16 device. We have therefore improved its energy monitoring system, and it is now monitored by two different Raritan devices.

Third, we have also improved our software (Perl and shell scripts), which samples the electrical power (in Watts) measured by the Raritan devices and computes the energy (in Joules or Watt-hours) consumed by the computation on each node and on the complete cluster (including the interconnection switch). This energy consumption monitoring system (hardware and software) has been intensively used to measure the performance of our PDE solver (see sections 6.1.4 and 6.1.5) and of our American option pricer (see section 6.1.4).
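
The conversion from sampled power to consumed energy can be sketched as a numerical integration. The Python example below is not the Perl and shell software mentioned above: it assumes that the samples are available as (time in seconds, power in Watts) pairs, and it uses the trapezoidal rule, which is one possible choice among others.

```python
def energy_joules(samples):
    """Integrate (time_s, power_W) samples, sorted by time, into Joules
    using the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by time")
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

def joules_to_watt_hours(joules):
    """1 Wh is 3600 J."""
    return joules / 3600.0

# Hypothetical readings for two nodes over one hour, one sample every 30 minutes.
node_samples = {
    "node1": [(0, 100.0), (1800, 100.0), (3600, 100.0)],
    "node2": [(0, 150.0), (1800, 250.0), (3600, 150.0)],
}
per_node = {name: energy_joules(s) for name, s in node_samples.items()}
cluster_total = sum(per_node.values())
print(per_node, joules_to_watt_hours(cluster_total))
```

The energy of the complete cluster is the sum of the per-node energies, to which the integral of the switch's own power samples would be added in the same way.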

This platform has been intensively used to obtain the experimental performance measurements published in [31], [19], [18] and in a book chapter to appear in 2011 [60].

