Section: New Results
Scheduling over Heterogeneous Multicore Architectures
Participants: Cédric Augonnet, Raymond Namyst, Samuel Thibault.
In almost every computer nowadays lies a Graphics Processing Unit (GPU), and in that GPU lies a nominal computing power that makes even the most recent multicore central processing units look anemic in comparison. Innovative architectures such as IBM's Cell Broadband Engine, found in Sony's PlayStation 3, also hold great promise. In these days of ever-growing computing needs, it thus did not take long for people to start exploring these new lands.
The power of these new heterogeneous computing architectures does not come for free, however. The Cell's multiple Synergistic Processing Units (SPUs) are equipped with a very small amount of memory, while GPUs put drastic constraints on data access patterns and require highly regular computations to actually deliver their full power. When scheduling over such architectures, mapping tasks onto the available units is no longer the only problem. One also has to provide constructs and mechanisms to tailor those tasks to the characteristics of a given processing unit (refinement/filtering mechanisms), and to make sure that tasks have the suitable data at hand when needed (memory/caching management and consistency mechanisms), as sketched below.
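To make this concrete, the following minimal sketch shows how such per-unit tailoring is expressed through the codelet interface of StarPU, the runtime system presented below. It is written against the C interface of recent StarPU releases (field names have changed across versions), and the kernel is a toy illustration rather than code from the work reported here.

    #include <starpu.h>

    /* Toy CPU implementation of a task: scale a vector in place.
     * The runtime hands the kernel its data through buffers[]. */
    static void scal_cpu_func(void *buffers[], void *cl_arg)
    {
        (void) cl_arg;
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        float *v   = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
        for (unsigned i = 0; i < n; i++)
            v[i] *= 2.0f;
    }

    /* A codelet bundles the per-architecture variants of one logical task;
     * the scheduler picks the variant matching the unit the task is mapped to. */
    static struct starpu_codelet scal_cl = {
        .cpu_funcs  = { scal_cpu_func },
        /* .cuda_funcs = { scal_cuda_func },  a CUDA variant would go here */
        .nbuffers   = 1,              /* the task accesses one piece of data */
        .modes      = { STARPU_RW },  /* in read-write mode */
    };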
We thus designed StarPU, a unified scheduling engine that makes it possible to easily implement task scheduling policies on top of heterogeneous multicore architectures. Combined with performance models, which can be obtained through auto-tuning mechanisms [21], [17], [16], we have shown that such scheduling policies yield substantial performance improvements. Not only does StarPU allow all computing resources to be used at the same time (regular CPUs as well as a mixture of heterogeneous GPUs), but its scheduling engine even makes it possible to benefit from the actual heterogeneity of the machines [23], [41].
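Continuing the sketch above, the fragment below attaches a history-based performance model to the codelet and submits a task: the runtime times each execution and feeds the accumulated history to cost-aware scheduling policies, which are themselves selected at run time (for instance through the STARPU_SCHED environment variable in current releases). This is again a hedged illustration of the present-day C interface, not the exact code used in the work reported here.

    #include <stdint.h>
    #include <starpu.h>

    /* History-based performance model: the runtime records the duration of
     * each execution and uses the history to predict run times per unit. */
    static struct starpu_perfmodel scal_model = {
        .type   = STARPU_HISTORY_BASED,
        .symbol = "vector_scal",   /* key for the calibration database */
    };

    int main(void)
    {
        static float v[1024];
        starpu_data_handle_t vh;

        if (starpu_init(NULL) != 0)
            return 1;

        /* Hand the vector over to the runtime, which manages coherency. */
        starpu_vector_data_register(&vh, STARPU_MAIN_RAM,
                                    (uintptr_t) v, 1024, sizeof(v[0]));

        scal_cl.model = &scal_model;  /* used by cost-aware policies */

        /* The selected scheduling policy (e.g. STARPU_SCHED=dmda) decides
         * which processing unit eventually executes the task. */
        starpu_task_insert(&scal_cl, STARPU_RW, vh, 0);

        starpu_task_wait_for_all();
        starpu_data_unregister(vh);
        starpu_shutdown();
        return 0;
    }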
The memory management library of the StarPU runtime system (described in Section 5.10) was designed to hide the inner complexity of accelerator programming by automating data coherency management; it can even prefetch data ahead of the actual computation to further increase efficiency. Our approach thus makes it possible to efficiently handle arbitrarily large datasets instead of being limited by the size of the accelerators' embedded memory. We used StarPU to implement dense linear algebra kernels that run simultaneously on multicore processors and GPGPUs, and we demonstrated the flexibility of our approach by adding support for the Cell/BE processor with very little effort [22]. We also showed that, in addition to a unified execution model, it is important to provide an expressive interface that lets the programmer (or an upper-level library) express knowledge at the algorithmic level, so as to give hints to the runtime system.
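As an illustration of how datasets larger than an accelerator's memory can be handled, the sketch below partitions one registered vector into blocks using a StarPU data filter; each block then becomes an independently schedulable (and prefetchable) piece of data. It reuses the illustrative scal_cl codelet from the sketches above and assumes the filter names of recent StarPU releases.

    #include <stdlib.h>
    #include <stdint.h>
    #include <starpu.h>

    #define N      (1 << 24)  /* deliberately large dataset */
    #define NPARTS 64         /* NPARTS must divide N evenly here */

    /* Uses the scal_cl codelet defined in the sketches above. */
    void scale_big_vector(void)
    {
        float *big = malloc(N * sizeof(*big));
        starpu_data_handle_t vh;

        starpu_vector_data_register(&vh, STARPU_MAIN_RAM,
                                    (uintptr_t) big, N, sizeof(*big));

        /* Split the vector into NPARTS equal blocks. */
        struct starpu_data_filter f = {
            .filter_func = starpu_vector_filter_block,
            .nchildren   = NPARTS,
        };
        starpu_data_partition(vh, &f);

        for (unsigned i = 0; i < NPARTS; i++)
            /* Only the blocks a task touches travel to the accelerator,
             * so the whole vector never has to fit in its memory. */
            starpu_task_insert(&scal_cl,
                               STARPU_RW, starpu_data_get_sub_data(vh, 1, i),
                               0);

        starpu_task_wait_for_all();
        starpu_data_unpartition(vh, STARPU_MAIN_RAM);  /* gather in RAM */
        starpu_data_unregister(vh);
        free(big);
    }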