Team Runtime


Section: New Results

Efficient scheduling of OpenMP threads on NUMA machines

Participants: Olivier Aumage, François Broquedis, Raymond Namyst, Pierre-André Wacrenier.

To express parallelism, scientific programmers commonly rely on OpenMP, a high-level parallel programming interface based on a set of annotations (including scheduling directives). While OpenMP-parallelized applications are well suited to SMP computers, their execution on NUMA architectures is far from optimal, particularly for irregular applications. The cause is the difficulty of combining load balancing with thread/memory affinity. Indeed, current OpenMP runtimes do not map the parallel structure of the application onto the underlying architecture in a way that accounts for the relations between threads and data.

To solve this problem, we designed “ForestGOMP”, an extension to the GNU OpenMP (GOMP) run-time support that relies on the Marcel/BubbleSched thread scheduling package already described in Section 5.1. This structured approach extends the scope of OpenMP to NUMA architectures and nested parallelism. Indeed, while the raw performance of ForestGOMP on flat parallelism is similar to that of GOMP and icc, ForestGOMP outperforms both on irregular applications that use nested parallelism.
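To make the notion of an irregular, nested-parallel application concrete, here is a minimal sketch in standard OpenMP C. It does not use any ForestGOMP-specific interface. The names `inner_work`, `nested_irregular_sum` and `N_OUTER` are hypothetical and chosen for illustration only. Each outer iteration opens its own inner parallel region, and the inner work grows with the outer index, which is the kind of unbalanced team structure a hierarchical scheduler has to map onto the machine.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define N_OUTER 8   /* hypothetical number of outer tasks */

/* Irregular inner workload: outer iteration i performs (i + 1) * 1000
 * inner iterations, so the outer iterations are deliberately unbalanced. */
static long inner_work(int i)
{
    assert(i >= 0 && i < N_OUTER);

    long n = (long)(i + 1) * 1000;
    long sum = 0;

    /* Nested parallel region: each outer thread may create its own team. */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long k = 0; k < n; k++)
        sum += k % 7;

    return sum;
}

/* Outer parallel loop over the unbalanced tasks. */
long nested_irregular_sum(void)
{
    long total = 0;

#ifdef _OPENMP
    /* Allow two active levels of parallelism (standard OpenMP call). */
    omp_set_max_active_levels(2);
#endif

    #pragma omp parallel for reduction(+:total) schedule(dynamic)
    for (int i = 0; i < N_OUTER; i++)
        total += inner_work(i);

    return total;
}
```

Compiled without OpenMP support, the pragmas are ignored and the code runs sequentially with the same result. With OpenMP enabled, how the inner teams are placed relative to the data they touch is left entirely to the runtime, which is the decision the text above is concerned with.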

ForestGOMP is now also able to take thread/memory affinities into account when distributing the load on hierarchical architectures. It relies on the MaMI memory manager to allocate, bind or migrate memory buffers [26], [42]. Moreover, ForestGOMP adopts a two-way mechanism [24] to decide how often the distribution needs to be updated. First, every time the application programmer updates the memory affinities, the bubble scheduler is called to check the current distribution. This approach may not be sufficient for irregular applications, so ForestGOMP also provides a more dynamic mechanism based on the inspection of hardware counters. The runtime checks the counters on a regular basis and infers the amount of remote memory accesses initiated from the current processor. Once this amount exceeds a given threshold, ForestGOMP calls the scheduler to check the current distribution. These two approaches are complementary. Indeed, in some cases, updates from the application programmer do not require the scheduler to reconsider the current distribution. In other cases, the programmer is able to define roughly which part of the application will work on which data, but cannot tell precisely when and how. In these situations, hardware counters help the runtime react at the right time.
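The two triggers described above can be sketched as follows. This is only an illustration under stated assumptions, not the ForestGOMP implementation: the types `hw_sample` and `rebalance_monitor`, the functions `monitor_on_tick` and `monitor_on_affinity_update`, and the idea of comparing the number of remote accesses since the previous check against a fixed threshold are all hypothetical. The call to the bubble scheduler is replaced by a simple counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of a cumulative hardware counter for one processor. */
struct hw_sample {
    uint64_t remote_accesses;   /* remote memory accesses observed so far */
};

/* Hypothetical monitor state; not part of any real ForestGOMP interface. */
struct rebalance_monitor {
    struct hw_sample last;       /* counter value at the previous check */
    uint64_t remote_threshold;   /* assumed tunable trigger level */
    unsigned scheduler_calls;    /* stand-in for calls to the bubble scheduler */
};

/* Trigger 1: the programmer has updated the memory affinities,
 * so the scheduler is asked to check the current distribution. */
void monitor_on_affinity_update(struct rebalance_monitor *m)
{
    assert(m != NULL);
    m->scheduler_calls++;
}

/* Trigger 2: periodic check. The scheduler is called only when the number
 * of remote accesses since the previous check exceeds the threshold.
 * Returns true when the scheduler was (notionally) called. */
bool monitor_on_tick(struct rebalance_monitor *m, const struct hw_sample *cur)
{
    assert(m != NULL && cur != NULL);

    uint64_t delta = cur->remote_accesses - m->last.remote_accesses;
    m->last = *cur;

    if (delta > m->remote_threshold) {
        m->scheduler_calls++;
        return true;
    }
    return false;
}
```

In this sketch, a period with few remote accesses costs only a counter read and a comparison, while an explicit affinity update always reaches the scheduler, which may then decide that the current distribution is still adequate.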