Team Runtime

Section: New Results

High-performance message passing over generic Ethernet hardware

Participants: Nathalie Furmento, Brice Goglin.

The Open-MX message passing stack (described in Section 5.8) offers a native message passing layer on any Ethernet hardware. Its API compatibility with the native Myrinet Express stack already enables existing parallel applications to use Open-MX. Indeed, several legacy high-performance layers such as MPICH2 or Open MPI run transparently on top of Open-MX with satisfactory performance thanks to advanced data movement techniques [8].

We showed that Open-MX leaves considerable room for innovative memory management optimizations. Since Open-MX does not require complex synchronization between the application, the driver and the NIC, we were able to implement an overlapped memory pinning model that hides the expensive pinning overhead behind the actual communication time [32]. Moreover, by combining this idea with I/O AT copy offload, we implemented a dramatically improved intra-node communication stack in Open-MX. As soon as large messages are involved or the communicating processes do not share a cache, Open-MX now outperforms most existing MPI implementations [33]. This work raised awareness of inefficient large-message intra-node communication in MPICH2 and Open MPI, leading to the development of our KNem driver (see Section 6.8).
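The overlapped pinning idea can be sketched as a simple software pipeline: while the current chunk of a message is being transmitted, the next chunk is already being pinned, so the pinning cost overlaps with communication instead of preceding it. The sketch below is illustrative only, not the actual Open-MX driver code: the chunk size, the `mlock`-based pinning, and the `pipelined_send` name are all assumptions, and a plain `memcpy` stands in for the NIC's DMA.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define CHUNK 4096  /* hypothetical pipeline granularity */

/* Pipelined send sketch: pin chunk i+1 while "transmitting" chunk i.
 * In the real driver the pinning and the DMA run concurrently; this
 * sequential version only shows the ordering of the two streams. */
static void pipelined_send(const char *buf, size_t len, char *dst)
{
    size_t nchunks = (len + CHUNK - 1) / CHUNK;
    if (nchunks == 0)
        return;

    /* Only the first chunk is pinned up front. */
    (void)mlock(buf, len < CHUNK ? len : CHUNK);

    for (size_t i = 0; i < nchunks; i++) {
        size_t off = i * CHUNK;
        size_t sz  = (len - off < CHUNK) ? len - off : CHUNK;

        /* Start pinning the next chunk before transmitting this one. */
        if (i + 1 < nchunks) {
            size_t noff = (i + 1) * CHUNK;
            size_t nsz  = (len - noff < CHUNK) ? len - noff : CHUNK;
            (void)mlock(buf + noff, nsz);
        }

        memcpy(dst + off, buf + off, sz);  /* stands in for the NIC DMA */
        (void)munlock(buf + off, sz);
    }
}
```

The point of the pipeline is that, with asynchronous pinning, the only pinning latency on the critical path is that of the first chunk.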

Finally, Open-MX is also an interesting framework for studying next-generation hardware features that could help Ethernet hardware become a mainstream interconnect for high-performance computing. We exposed cache-inefficiency problems in the Open-MX receive stack that are inherited from the Ethernet model. By adding Open-MX-aware packet filtering capabilities to the Multiqueue firmware of Myri-10G boards, we can control where incoming Open-MX traffic is processed. We extended this model with an automatic binding facility for user-space applications, so that each incoming Open-MX packet is processed entirely on the core that runs its target application, which dramatically improves overall cache efficiency [34], [20].
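The steering idea above can be sketched as follows: the firmware picks an RX queue from each packet's destination Open-MX endpoint, and since each queue's interrupt is serviced by one fixed core, binding the application owning that endpoint to the matching core keeps the whole receive path cache-local. The names (`omx_rx_queue`, `core_for_endpoint`), the modulo hash, and the queue-to-core identity mapping are illustrative assumptions, not the actual Myri-10G firmware logic.

```c
#include <assert.h>

#define NQUEUES 8  /* hypothetical number of RX queues on the NIC */

/* Firmware side (sketch): steer each incoming Open-MX packet to a queue
 * derived from its destination endpoint, so that all of an endpoint's
 * traffic is processed on one fixed core. */
static unsigned omx_rx_queue(unsigned dst_endpoint)
{
    return dst_endpoint % NQUEUES;  /* assumed hash; real firmware may differ */
}

/* Host side (sketch): assuming queue i's interrupt is pinned to core i,
 * the automatic binding facility would migrate the process owning this
 * endpoint onto that core (e.g. via sched_setaffinity on Linux). */
static unsigned core_for_endpoint(unsigned endpoint)
{
    return omx_rx_queue(endpoint);
}
```

Because the queue choice depends only on the endpoint, every packet of a given endpoint is processed on the same core as the packets before it, which is what restores cache locality.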

Another example of a stateless offload capability that can easily be added to Ethernet NICs to improve message passing performance is smarter interrupt coalescing. Indeed, we showed that the usual interrupt coalescing, designed for TCP/IP, only favors large messages while dramatically increasing small-message latency. We designed a dedicated coalescing mechanism and showed that its implementation in Myri-10G NICs improves Open-MX performance with respect to both of these important HPC metrics [30].
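One way to picture such a message-aware policy is as a per-packet decision: a packet that completes a message raises the interrupt immediately (preserving small-message latency), while packets in the middle of a large message are coalesced until a threshold is reached (preserving throughput). This is a hedged sketch of the general idea, not the mechanism of [30]; the function name and the threshold logic are assumptions made for illustration, and the real mechanism lives in the Myri-10G NIC firmware.

```c
#include <assert.h>

/* Sketch of message-aware interrupt coalescing: fire immediately when a
 * packet completes a message (the latency-critical case for small
 * messages), otherwise defer until enough packets are pending (the
 * throughput-friendly case for the middle of large messages). */
static int should_interrupt_now(int completes_message,
                                unsigned pending_pkts,
                                unsigned coalesce_threshold)
{
    if (completes_message)
        return 1;  /* small-message latency path: no coalescing delay */
    return pending_pkts >= coalesce_threshold;  /* large-message path */
}
```

A fixed TCP/IP-style timer, by contrast, would delay the completing packet of a small message just as much as any other, which is exactly the latency penalty described above.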

