Section: New Results
Topology-aware High-Performance Computing
Participants: François Broquedis, Jérôme Clet-Ortega, Brice Goglin, Emmanuel Jeannot, Guillaume Mercier, Stéphanie Moreaud, Samuel Thibault.
The spread of multicore processors and NUMA machines brings complex, hierarchical architectures to the whole high-performance computing world and beyond. Until recently, mastering the internal hardware topology was critical only on large shared-memory machines; it now matters on smaller nodes and clusters as well.
We showed that a proper binding policy for MPI processes within NUMA nodes has a significant impact on parallel application performance. We proposed an automatic placement scheme that gathers information about the application's communication patterns during a preliminary run, then places processes according to their communication affinities and to hardware characteristics such as shared caches or NUMA nodes. We developed a specific algorithm, called TreeMatch, that matches processes to resources in order to reduce the communication cost of the application. However, placing MPI processes onto the computing cores requires as complete a view of the architecture as possible.
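The idea of placing processes according to their communication affinities can be illustrated with a minimal Python sketch. This is not the TreeMatch algorithm: it is a simple greedy heuristic substituted here for illustration, and it assumes a flat hardware model in which cores come in equally sized groups (for example, cores sharing a cache). The function names `greedy_group` and `intra_group_volume`, and the communication matrix `comm`, are hypothetical.

```python
from itertools import combinations


def greedy_group(comm, group_size):
    """Partition ranks 0..n-1 into groups of `group_size`.

    `comm[i][j]` is the (assumed symmetric) volume exchanged between
    ranks i and j, as measured during a preliminary run.
    Greedy heuristic for illustration only, NOT TreeMatch.
    """
    n = len(comm)
    if n % group_size != 0:
        raise ValueError("number of processes must be a multiple of group_size")
    unplaced = set(range(n))
    groups = []
    while unplaced:
        # Seed the group with the unplaced rank that talks most to the
        # other unplaced ranks (ties broken by lowest rank).
        seed = max(sorted(unplaced),
                   key=lambda p: sum(comm[p][q] for q in unplaced if q != p))
        group = [seed]
        unplaced.remove(seed)
        # Fill the group with the ranks that exchange the most data
        # with the ranks already in it.
        while len(group) < group_size:
            best = max(sorted(unplaced),
                       key=lambda p: sum(comm[p][q] for q in group))
            group.append(best)
            unplaced.remove(best)
        groups.append(sorted(group))
    return groups


def intra_group_volume(comm, groups):
    """Total volume exchanged between ranks placed in the same group."""
    return sum(comm[a][b] for g in groups for a, b in combinations(g, 2))


if __name__ == "__main__":
    # Hypothetical pattern: ranks 0<->2 and 1<->3 communicate heavily.
    comm = [
        [0, 1, 10, 1],
        [1, 0, 1, 10],
        [10, 1, 0, 1],
        [1, 10, 1, 0],
    ]
    groups = greedy_group(comm, group_size=2)
    print(groups, intra_group_volume(comm, groups))
```

On this toy matrix the heuristic keeps the heavily communicating pairs together, whereas binding ranks in order (0 with 1, 2 with 3) would send most of the traffic across groups. A real placement must also account for the multi-level hierarchy of the machine, which this flat model ignores.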
The hwloc software (see Section 5.2) addresses this need by exposing detailed knowledge of the hardware in a portable and abstracted manner. We showed that hwloc can help popular high-performance OpenMP and MPI software. Indeed, scheduling OpenMP threads according to their affinities, or placing MPI processes according to their communication patterns, yields performance improvements thanks to hwloc. An optimized MPI communication strategy may also be chosen dynamically according to the location of the communicating processes in the machine and its hardware characteristics.
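As a rough illustration of choosing a communication strategy from process locations, the following Python sketch finds the deepest hardware level shared by two processes and maps it to a strategy. It does not call hwloc: locations are given by hand as hypothetical `(machine, NUMA node, shared cache, core)` index paths, and the level names, strategy labels, and function names are assumptions made for this example only.

```python
# Levels of a simplified, hypothetical topology, from outermost to innermost.
LEVELS = ("machine", "numa_node", "shared_cache", "core")

# Hypothetical strategy labels; a real MPI implementation defines its own.
STRATEGY = {
    "network": "network transfer",
    "machine": "shared-memory copy across NUMA nodes",
    "numa_node": "shared-memory copy within a NUMA node",
    "shared_cache": "copy through the shared cache",
    "core": "same core",
}


def deepest_shared_level(loc_a, loc_b):
    """Return the innermost level that both locations have in common.

    Each location is a path of indices, one per entry of LEVELS.
    Returns "network" when the processes are not on the same machine.
    """
    shared = "network"
    for level, a, b in zip(LEVELS, loc_a, loc_b):
        if a != b:
            break  # paths diverge here; deeper indices are unrelated
        shared = level
    return shared


def choose_strategy(loc_a, loc_b):
    """Pick a strategy label from the deepest shared hardware level."""
    return STRATEGY[deepest_shared_level(loc_a, loc_b)]


if __name__ == "__main__":
    p0 = (0, 0, 0, 0)  # machine 0, NUMA node 0, cache 0, core 0
    p1 = (0, 0, 0, 1)  # shares a cache with p0
    p2 = (0, 1, 0, 0)  # same machine, other NUMA node
    p3 = (1, 0, 0, 0)  # other machine
    for other in (p1, p2, p3):
        print(p0, other, "->", choose_strategy(p0, other))
```

The comparison stops at the first level where the two paths differ, so equal indices deeper in the tree (such as cache 0 under two different NUMA nodes) are not mistaken for shared resources.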