Section: New Results
Program optimizations
Practical Approach
Participants : Grigori Fursin, Albert Cohen, Cédric Bastoul, Louis-Noël Pouchet, Walid Benabderrahmane.
Here are the most recent key scientific achievements.
- Empirically demonstrating that significant performance gains can be achieved through program optimization, provided architectural phenomena are better factored into the optimization process, while also observing that long compositions of program transformations are required.
- Releasing the first machine-learning-based research compiler (MILEPOST GCC [98]), which combines the Interactive Compilation Interface [54] with a static program feature extractor to automatically predict good program optimizations, reducing execution time, code size and compilation time for a given program on a given architecture using predictive modeling and statistical techniques. This compiler opens many research opportunities and is used in the EU HiPEAC network of excellence [55] as a default compilation platform. The development of MILEPOST GCC was coordinated by Grigori Fursin (project coordinator: Michael O'Boyle). IBM issued two press releases about this work, in June 2008 and May 2009 [57], [56].
- Showing that it is possible to capture the complex interplay between architecture and program behavior using machine-learning techniques, and to use that knowledge to drive program optimizations.
Publications of 2008: [97], [94], [54], [98]. Publications of 2009: [28], [45], [16].
- Developing multiversioning applications to make static programs adaptable at run time [41], [32], [31] (a minimal sketch of this technique appears after this list).
- Enabling predictive run-time code scheduling on heterogeneous (CPU-GPU) architectures [40].
- Developing collective optimization approaches that leverage the knowledge of multiple users to transparently and continuously optimize programs or improve the default compiler optimization heuristic [32], [31].
- Developing a polyhedral program representation that facilitates the composition of complex transformation sequences (see the illustration after this list).
- Addressing the code generation performance issues associated with the polyhedral program representation.
- Further leveraging the polyhedral program representation to propose novel methods for scanning the space of program transformations.
- Extending the polyhedral model to irregular control flow (thus significantly increasing its application domain) and demonstrating that the extension allows existing optimization techniques to apply successfully to relevant benchmarks (this work was submitted and accepted for publication at Compiler Construction 2010).
Collective Tuning Center
Participants : Grigori Fursin, Olivier Temam.
We created an open, community-driven, collaborative wiki-based portal, http://cTuning.org, that brings together academia, industry and end users to develop intelligent collective tuning technology that automates and simplifies compiler, program and architecture design and optimization. This technology minimizes repetitive, time-consuming tasks and human intervention using collective optimization, run-time adaptation, and statistical and machine-learning techniques. It can already help end users and researchers automatically improve execution time, code size, power consumption, reliability and other important characteristics of available computing systems, ranging from supercomputers to embedded systems, and should eventually enable the development of emerging intelligent self-tuning adaptive computing systems. The Collective Optimization Database is intended to improve the quality of academic research by avoiding costly duplicate experiments and providing reproducible results.
Transitive Closure of Union of Affine Relations
Participants : Denis Barthou, Anna Beletska, Albert Cohen, Konrad Trifunovic.
We studied a method to compute the transitive closure of a union of affine relations on integer tuples. Within Presburger arithmetic, complete algorithms to compute the transitive closure exist for convex polyhedra only. In the presence of non-convex relations, little exists beyond special cases and incomplete heuristics. We introduce novel necessary and sufficient conditions defining a class of relations for which an exact computation is possible. These conditions can be relaxed to define larger classes where conservative approximations and/or more complex closed forms can be obtained. Our method is immediately applicable to a wide range of symbolic computation problems. It is illustrated on representative examples and compared with state-of-the-art approaches.
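To illustrate the flavor of the problem on a case where an exact closed form does exist (a standard example, not taken from the paper itself): for a union of unconstrained translations over integer tuples, the translations commute, so the closure is exactly the set of non-trivial non-negative combinations of the offsets.

\[
R = \{\, x \to x + d_1 \,\} \cup \{\, x \to x + d_2 \,\}
\quad\Longrightarrow\quad
R^{+} = \{\, x \to x + k_1 d_1 + k_2 d_2 \mid k_1, k_2 \ge 0,\ k_1 + k_2 \ge 1 \,\}
\]

As soon as the relations carry affine guards on their domains, the applications no longer commute and such closed forms may only be approximations; delineating when exactness is preserved is precisely the role of the necessary and sufficient conditions above.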
Optimizing code through iterative specialization
Participants : Minhaj Khan, Henri-Pierre Charles, Denis Barthou.
Code specialization is a way to obtain significant improvements in the performance of an application. It works by exposing the values of certain parameters in the source code. The availability of these specialized values enables the compiler to generate better-optimized code. Although most efficient source-code implementations contain specialized code to benefit from these optimizations, the real impact of specialization may vary depending on the value of the specializing parameter.
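As a minimal sketch of the mechanism (a generic example under our own assumptions, not code from [116]): exposing the value of a parameter, here an array stride, lets the compiler vectorize the specialized version as a contiguous sweep, while a wrapper dispatches on the parameter's actual value.

    /* Generic version: the stride is a run-time parameter, which
       prevents the compiler from simplifying the address computation. */
    void scale(double *a, int stride, int n, double c) {
        for (int i = 0; i < n; i++)
            a[i * stride] *= c;
    }

    /* Specialized version for stride == 1: the accesses are now
       contiguous and the loop is straightforward to vectorize. */
    void scale_stride1(double *a, int n, double c) {
        for (int i = 0; i < n; i++)
            a[i] *= c;
    }

    /* Wrapper dispatching on the value of the specializing parameter. */
    void scale_dispatch(double *a, int stride, int n, double c) {
        if (stride == 1)
            scale_stride1(a, n, c);
        else
            scale(a, stride, n, c);
    }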
We have studied in [116] an iterative approach to code specialization. Starting from some specialized code, we search for a better version by re-specializing the code, followed by a low-level code analysis. The specialized versions fulfilling the required criteria are then transformed to generate another equivalent version of the original specialized code. The approach, tested on the Itanium 2 architecture with the gcc and icc compilers, shows significant improvements in the performance of different benchmarks.
Simulation of the Lattice QCD and Technological Trends in Computation
Participants : Mouad Bahi, Denis Barthou, Cédric Bastoul, Walid Benabderrahmane, Christine Eisenbeis, Julien Jaeger, Louis-Noël Pouchet.
This is a joint ANR project, “PetaQCD”, with LAL (Orsay), IRISA Rennes (CAPS/ALF), IRFU (CEA Saclay), LPT (Orsay), CAPS Entreprise (Rennes), Kerlabs (Rennes) and LPSC (Grenoble).
Simulation of Lattice QCD is a challenging computational problem. Current technological trends show multiple divergent models of computation: we are witnessing homogeneous multicore architectures and the use of on-chip or off-chip accelerators, in addition to the traditional architectural models.
Faced with this technological abundance, assessing the performance tradeoffs of computing nodes based on these technologies is of crucial importance to many scientific computing applications.
In this study [114], we focus on assessing the efficiency and performance expected for the Lattice QCD problem on representative architectures; we project the expected improvements of these architectures and their impact on Lattice QCD performance, and we try to pinpoint the limiting factors for performance on these architectures. This work took place within ANR PARA and ANR QCDNEXT (both 2005-2008) and has led to the ANR PetaQCD project (2009-2011) [33].
Loop Optimization using Adaptive Compilation and Kernel Decomposition
Participants : J. Jaeger, P. Oliveira, S. Louise, D. Barthou.
We study a new hierarchical compilation approach for the generation of high-performance applications, relying on the use of state-of-the-art compilers. This approach is not application-dependent and does not require any assembly hand-coding. It relies on decomposing the loop nests of the hottest functions of the application into simpler kernels, typically 1D to 2D loops, which are much simpler to optimize. We successfully applied this approach to dense linear algebra in 2005, reaching the performance of vendor libraries. The advantage of the generated kernels is that their performance no longer depends on the input data, but only on its location in the memory hierarchy. Using a performance model of the memory hierarchy, it is possible to find the best composition of kernels to use.
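The following sketch illustrates the decomposition (an example of our own, not the exact kernels of the cited work): a 2D loop nest from a hot function is rewritten as a composition of a 1D kernel that is much easier for a compiler to optimize in isolation, and whose performance depends only on where its operands reside in the memory hierarchy.

    #include <stddef.h>

    /* 1D kernel: simple enough for a compiler to optimize well in
       isolation; its performance depends only on operand locality. */
    static double dot1d(const double *x, const double *y, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }

    /* The 2D loop nest of a hot function, expressed as a composition
       of 1D kernels; a memory-hierarchy model can then choose among
       alternative kernel compositions. */
    void matvec(const double *A, const double *x, double *y,
                size_t rows, size_t cols) {
        for (size_t i = 0; i < rows; i++)
            y[i] = dot1d(&A[i * cols], x, cols);
    }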
For larger applications, the code is no longer regular; in particular, data accesses are irregular (use of indirections). Working with applications of the ANR PARA project (MPEG4, QCD, oil simulation and BLAST), we study how to adapt the previous approach to these cases. When control is irregular (involving different execution paths), we study the worst-case execution time (WCET), in particular in the context of embedded applications for MPSoC architectures. This is the subject of an ongoing collaboration with CEA/Lastre.
Dataflow Analysis for Irregular Programs and its applications
Participants : M. Belaoucha, S. Touati, D. Barthou.
Instance-wise dataflow analysis identifies, for each value read at some point during a program execution, the exact statement instance that defined it. This analysis generates more precise information than traditional dependence analyses and can therefore validate more optimizing transformations. An implementation of this analysis as a standalone library has been performed by M. Belaoucha (funded by the Teraops and PARMA contracts), and its integration into gcc/Graphite is in progress.
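A small example of the extra precision (illustrative only, not taken from the library): where a classical dependence analysis only reports that statement S2 depends on statement S1, instance-wise analysis names the exact defining instance for each read.

    static double f(int i) { return (double)i; }  /* some producer */

    void kernel(double *A, double *B, int n) {
        for (int i = 0; i < n; i++)
            A[i] = f(i);                /* S1: instance S1(i) */
        for (int j = 1; j < n; j++)
            B[j] = A[j - 1] + A[j];     /* S2: instance S2(j) */
        /* A classical dependence analysis only reports: S2 depends on S1.
           Instance-wise analysis gives the exact source of each read:
           the value of A[j-1] read by S2(j) was defined by S1(j-1), and
           the value of A[j] by S1(j), for every 1 <= j < n. */
    }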