Section: New Results
Network services for high demanding applications
Design and development of an MPI gateway
Keywords : MPI, high-speed interconnects, Grid, Grid5000, heterogeneity, relays.
The MPI standard is widely used for communication in parallel applications. Most of these applications are designed for homogeneous clusters, but MPI implementations for grids must take into account both heterogeneity and long-distance network links in order to maintain a high level of performance. Existing MPI implementations do not address these two constraints together, which raises the question of MPI efficiency on grids. Our goal is to significantly improve the execution performance of MPI applications on the grid.
We have surveyed the state of the art and evaluated, analysed and tuned the performance of four recent MPI implementations for the grid: MPICH-Madeleine, GridMPI, OpenMPI and MPICH2. The comparison is based on runs of a pingpong test, the NAS Parallel Benchmarks and a real geophysics application. These experiments were carried out on the national GRID'5000 testbed. We show that tuning both the TCP protocol and the MPI implementation is necessary to obtain good performance on the grid. For each NAS Parallel Benchmark, we study the impact on application execution time of a long-distance latency between two groups of 8 MPI tasks. Our experiments and tuning presented in  lead to the conclusion that GridMPI performs better than the other implementations and that executing MPI applications on a grid can be beneficial if some specific parameters are well tuned.
Based on these results, we propose a new transparent layer called MPI5000, placed between MPI and TCP, which allows an application composed of several tasks to be distributed appropriately over the available nodes according to the grid topology and the application's communication scheme. To do so, our layer needs two data files: one describing the grid topology, including the available nodes and the latency and bandwidth both between nodes and between sites; the other describing the application's communication patterns, with the size and number of messages sent between MPI processes. Using these two data files, our layer should achieve an efficient placement of tasks on grid nodes.
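As an illustration of how such a placement could be derived from the two descriptions, here is a minimal sketch in Python. It is not the MPI5000 algorithm: the greedy heuristic, the function names `place_tasks` and `wan_bytes`, and the dictionary formats are all assumptions made for this example, and the sketch only uses node counts per site, ignoring the latency and bandwidth values that the topology file also provides.

```python
def place_tasks(site_capacity, traffic):
    """Greedy placement that tries to co-locate heavily communicating ranks.

    site_capacity: {site_name: number of free nodes}   (hypothetical format)
    traffic:       {(rank_a, rank_b): bytes exchanged} (hypothetical format)
    Returns {rank: site_name}.
    """
    free = dict(site_capacity)
    placement = {}

    def assign(rank, preferred=None):
        if rank in placement:
            return
        # Prefer the partner's site if it still has a free node,
        # otherwise fall back to the site with the most free nodes.
        if preferred is not None and free[preferred] > 0:
            site = preferred
        else:
            site = max(free, key=lambda s: free[s])
        if free[site] == 0:
            raise ValueError("not enough nodes for all ranks")
        free[site] -= 1
        placement[rank] = site

    # Visit rank pairs from heaviest to lightest traffic.
    for (a, b), _ in sorted(traffic.items(), key=lambda kv: -kv[1]):
        assign(a, placement.get(b))
        assign(b, placement.get(a))
    return placement


def wan_bytes(placement, traffic):
    """Total bytes exchanged between ranks placed on different sites."""
    return sum(v for (a, b), v in traffic.items()
               if placement[a] != placement[b])


# Illustrative input: two sites with two nodes each, four MPI ranks.
sites = {"site_a": 2, "site_b": 2}
pattern = {(0, 1): 1000, (2, 3): 900, (0, 2): 10, (1, 3): 10}
placement = place_tasks(sites, pattern)
print(placement, wan_bytes(placement, pattern))
```

In this toy input the two heavily communicating pairs each end up on a single site, so only the light traffic crosses the WAN. A realistic placement would also weigh the inter-site latency and bandwidth.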
Our layer also proposes to transparently split TCP connections between MPI processes in order to take the grid topology into account. This new architecture is based on a system of relays placed at the LAN/WAN interface. We replace each end-to-end TCP connection by three connections (two on the LAN, each between a node and a relay, and one on the WAN between the two relays). This allows faster loss recovery on the LAN as well as a reduction in memory use, because the size of TCP buffers depends on the RTT of the connection. With this architecture, we have proposed using different TCP implementations for local and distant communications. The relays could also implement a different scheduling strategy for MPI messages: for instance, we could give priority to small messages (usually MPI control messages). Finally, as MPI applications mostly use small messages, they are penalised more when the network is congested by large flows. Thanks to communication aggregation between relays, we have shown that our architecture keeps the congestion window closer to the available throughput on the long-distance network.
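The memory argument can be made concrete with the bandwidth-delay product, the usual rule of thumb for sizing a TCP buffer so that a connection can fill its path. The sketch below compares the buffer a compute node would need for an end-to-end connection with the buffer it needs for the LAN leg to its relay. The bandwidth and RTT values are illustrative assumptions, not measurements from this work, and `bdp_bytes` is a name chosen for the example.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes: data in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8


# Illustrative values only (assumptions, not figures from the report).
BANDWIDTH = 1e9     # 1 Gbit/s on every leg
RTT_WAN = 10e-3     # 10 ms round trip between two sites
RTT_LAN = 0.1e-3    # 0.1 ms round trip between a node and its relay

end_to_end = bdp_bytes(BANDWIDTH, RTT_WAN)  # buffer on a node without relays
lan_leg = bdp_bytes(BANDWIDTH, RTT_LAN)     # buffer on a node behind a relay

print(f"end-to-end buffer per connection: {end_to_end / 1e6:.2f} MB")
print(f"LAN-leg buffer per connection:    {lan_leg / 1e3:.1f} kB")
print(f"reduction factor on the node:     {end_to_end / lan_leg:.0f}x")
```

Under these assumptions, the relay-to-relay connection still needs a WAN-sized buffer, but that buffer sits on the relay rather than on every compute node.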
This work is detailed and evaluated in  , and shows which applications can benefit from these optimisations. We analyse, on several points, the overhead and the benefits of using proxies. The theoretical analysis is supported by experiments. We conclude that for MPI applications that use collective operations, the gain on losses and retransmissions generally does not offset the overhead added by splitting the connections. Other applications benefit from this mechanism if they communicate sufficiently.
The implementation of MPI5000 relies both on a library inserted between MPI and the operating system and on relays. The proposed architecture is therefore independent of the MPI implementation and fully transparent to applications.
Development of a metrology platform on Grid5000
Keywords : metrology, monitoring, Gtrc-Net1, packet capture, header extraction.
This activity is partially supported by the GridNets-FJ program (Équipe associée) between INRIA and AIST (Japan).
Research in network traffic analysis embraces a wide diversity of goals and relies on a variety of methodologies and tools. To gain better insight into the real nature and the evolution of network traffic, we argue that fine-grain analysis of real traffic traces has to complement simulation studies as well as the coarse-grain measurements performed by classical flow measurement systems. In particular, packet-level measurements and analysis are needed. However, such methodologies are resource-consuming and require very high-performance devices to be operational in real high-speed networks. In we present the Metroflux system, which aims to provide researchers and network operators with a very flexible and accurate packet-level traffic analysis toolkit configured for 1 Gbps and 10 Gbps links. This system is based on the GtrcNet FPGA-based device technology and on specific statistical analysis tools. We show the potential and the facilities offered by the Metroflux system coupled with the Grid5000 large-scale experimental platform and the Network eXperiment Engine (NXE) we have developed. In we illustrate the application of Metroflux with the practical validation of the theoretical prediction relating self-similarity and heavy tails given by Taqqu's theorem. We also illustrate several uses of this toolset, such as the investigation of the conditions under which several traffic theories apply, as well as studies of the interactions between traffic, protocols and systems.
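For reference, the prediction usually attributed to Taqqu's theorem is that aggregating many ON/OFF sources whose ON or OFF durations are heavy-tailed with tail index alpha, 1 < alpha < 2, yields self-similar traffic with Hurst parameter H = (3 - alpha) / 2. The short Python sketch below only encodes this relation. The function name is an assumption for the example, and it does not reproduce the estimation procedures used with Metroflux.

```python
def hurst_from_tail_index(alpha):
    """Hurst parameter predicted for heavy-tail index alpha: H = (3 - alpha) / 2.

    The relation is stated for 1 < alpha < 2 (finite mean, infinite variance),
    which gives 0.5 < H < 1, i.e. long-range dependent traffic.
    """
    if not 1.0 < alpha < 2.0:
        raise ValueError("the relation is stated for 1 < alpha < 2")
    return (3.0 - alpha) / 2.0


# The heavier the tail (smaller alpha), the closer H gets to 1.
for alpha in (1.2, 1.5, 1.8):
    print(alpha, hurst_from_tail_index(alpha))
```

Validating the prediction on real traces then amounts to estimating alpha and H independently from packet-level measurements and checking whether the pair falls on this line.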