Section: Scientific Foundations
Back-end code optimizations for embedded processors
Participants : Benoit Boissinot, Florian Brandner, Quentin Colombet, Alain Darte, Fabrice Rastello.
Compilation, and back-end code optimization in particular, is a long-established activity. We first explain why the development of embedded systems has brought compilation back as a research topic. We then detail the code optimizations that we are interested in, for both aggressive and just-in-time compilation.
Embedded systems and the revival of compilation & code optimizations
Applications for embedded computing systems generate complex programs and need more and more processing power. This evolution is driven, among other things, by the increasing impact of digital television, the first instances of UMTS networks, the increasing size of digital media such as recordable DVD, and even Internet applications. Furthermore, standards are evolving very rapidly (see for instance the successive versions of MPEG). As a consequence, the industry has rediscovered the interest of programmable structures, whose flexibility more than compensates for their larger size and power consumption. The appliance provider has a choice between hard-wired structures (ASIC), special-purpose processors (ASIP), or (quasi) general-purpose processors (DSP for multimedia applications). Our cooperation with STMicroelectronics leads us to investigate the last solution, as implemented for example in the ST100 (DSP processor) and ST200 (VLIW DSP processor) families. Compilation and, in particular, back-end code optimizations find a second life in the context of such embedded computing systems.
At the heart of this progress is the concept of virtualization, which is the key to more portability, more simplicity, more reliability, and of course more security. This concept, implemented through binary translation, just-in-time compilation, etc., consists in hiding the architecture-dependent features as far as possible during the compilation process. It has been used for quite a long time on servers (HotSpot, for example) and a bit more recently on workstations, and it is quite recent in embedded computing, for reasons we now explain.
As previously mentioned, the definition of “embedded systems” is rather imprecise. However, one can at least agree on the following features:
even programmable processors (as opposed to hardware accelerators) have some architectural specificities, and are very diverse;
many processors (but not all of them) have some limited resources, in particular in terms of memory;
for some processors, power consumption is an issue;
in some cases, aggressive compilation (through cross-compilation) is possible, and even highly desirable for important functions.
This diversity is one of the reasons why virtualization, which is becoming more mature, is more and more common in programmable embedded systems, in particular through CIL (a standardization of MSIL). This implies a late compilation of programs, through just-in-time (JIT) compilation, including dynamic compilation. Some people even think that dynamic compilation, which has access to more information because it is performed at run-time, can outperform “ahead-of-time” compilation.
Performing code generation (and some higher-level optimizations) in a late phase is potentially advantageous, as it can exploit architectural specificities and run-time program information such as constants and aliasing, but it is more constrained in terms of time and available resources. Indeed, the processor that performs the late compilation phase is, a priori, less powerful (in terms of memory for example) than a processor used for cross-compilation. The challenge is thus to spread the compilation process over time by deferring some optimizations (“deferred compilation”) and by propagating some information for those whose computation is expensive (“split compilation”). Classically, a compiler has to deal with different intermediate representations (IR) where high-level (i.e., more target-independent) information coexists with low-level information. Split compilation has to solve a similar problem where, this time, the compactness of the information representation, and thus its relevance, is also an important criterion. Indeed, the IR evolves not only from a target-independent description to a target-dependent one, but also from a situation where the compilation time is almost unlimited (cross-compilation) to one where every type of resource is limited. This is also a reason why static single assignment (SSA) is becoming specific to embedded compilation, even if it was first used for workstations. Indeed, SSA is a sparse (i.e., compact) representation of liveness information. In other words, while time constraints are common to all JIT compilers (not only for embedded computing), the benefit of using SSA also lies in its good ratio between the relevance of the information and its storage cost. It also makes it possible to simplify algorithms, which is important for increasing the reliability of the compiler.
In addition, this continuum of compilation strategies should integrate the need to exploit the parallel computing resources that all recent (and future) architectures provide. One solution is to develop domain-specific languages (DSL), which adds yet another dimension to the problem of designing intermediate representations.
We now give more details on the code optimizations we want to consider and on the methodology we want to follow.
Aggressive and just-in-time optimizations of assembly-level code
Compilation for embedded processors is difficult because the architecture and the operations are specially tailored to the task at hand, and because the amount of resources is strictly limited. For instance, the potential for instruction-level parallelism (SIMD, MMX), the limited number of registers and the small size of the memory, the use of direct-mapped instruction caches and of predication, but also the special form of applications, generate many open problems. Our goal is to contribute to their understanding and their solutions.
As previously explained, compilation for embedded processors includes both aggressive and just-in-time (JIT) optimizations. Aggressive compilation consists in allowing more time to implement costly solutions (so, looking for complete, even expensive, studies is mandatory): the compiled program is loaded in permanent memory (ROM, flash, etc.) and its compilation time is not significant; also, for embedded systems, code size and energy consumption usually have a critical impact on the cost and the quality of the final product. Hence, the application is cross-compiled, in other words, compiled on a powerful platform distinct from the target processor. Just-in-time compilation corresponds to compiling applets on demand on the target processor. For compatibility and compactness, the source languages are CLI or Java. The code can be uploaded or sold separately on a flash memory. Compilation is performed at load time and even dynamically during execution. The heuristics used, constrained by time and limited resources, are far from being aggressive. They must be fast but smart enough.
Our aim is, in particular, to find exact or heuristic solutions to combinatorial problems that arise in compilation for VLIW and DSP processors, and to integrate these methods into industrial compilers for DSP processors (mainly ST100, ST200, StrongARM). Such combinatorial problems can be found for example in register allocation, in opcode selection, or in code placement for optimization of the instruction cache. Another example is the problem of removing the multiplexer functions (known as φ-functions) that are inserted when converting into SSA form. These optimizations are usually done in the last phases of the compiler, using an assembly-level intermediate representation. In industrial compilers, they are handled in independent phases using heuristics, in order to limit the compilation time. We want to develop a more global understanding of these optimization problems to derive both aggressive heuristics and JIT techniques, the main tool being the SSA representation.
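To make the φ-function removal problem concrete, here is a minimal sketch in Python of the naive translation out of SSA form: each φ-function is replaced by copies placed at the end of the predecessor blocks, and the copies attached to one predecessor are treated as a parallel copy that must be ordered carefully. The data layout (blocks as lists of strings without terminator instructions), the function names, and the temporary name `tmp` are assumptions made for illustration. The sketch ignores critical edges and does not try to coalesce the copies, which is where the combinatorial difficulty mentioned above lies.

```python
from collections import defaultdict

def sequentialize(copies, tmp="tmp"):
    """Order a parallel copy [(dest, src), ...] into sequential copies.

    A copy is emitted once no other pending copy still reads its
    destination; remaining cycles are broken with a temporary
    (the name `tmp` is hypothetical and assumed unused elsewhere).
    """
    pending = [(d, s) for d, s in copies if d != s]
    out = []
    while pending:
        for i, (d, s) in enumerate(pending):
            others = (s2 for j, (_, s2) in enumerate(pending) if j != i)
            if all(s2 != d for s2 in others):
                out.append(f"{d} = {s}")
                pending.pop(i)
                break
        else:
            # Only cycles remain: save one source in the temporary.
            _, s = pending[0]
            out.append(f"{tmp} = {s}")
            pending = [(d2, tmp if s2 == s else s2) for d2, s2 in pending]
    return out

def remove_phis(blocks, phis):
    """Replace phi-functions by copies appended to predecessor blocks.

    blocks: {block name: [instruction strings]}
    phis:   {block name: [(dest, {predecessor name: source})]}
    """
    result = {name: list(instrs) for name, instrs in blocks.items()}
    for phi_list in phis.values():
        parallel = defaultdict(list)
        for dest, sources in phi_list:
            for pred, src in sources.items():
                parallel[pred].append((dest, src))
        for pred, copies in parallel.items():
            result[pred].extend(sequentialize(copies))
    return result

# A diamond: x3 = phi(x1 from "then", x2 from "else") at the join point.
blocks = {"then": ["x1 = 1"], "else": ["x2 = 2"], "join": ["use x3"]}
phis = {"join": [("x3", {"then": "x1", "else": "x2"})]}
lowered = remove_phis(blocks, phis)
```

Every copy inserted this way costs an instruction unless the source and the destination end up in the same register, which is why φ-function removal interacts with coalescing and register allocation.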
In particular, we want to investigate the interaction of register allocation, coalescing, and spilling with the different code representations, such as SSA. One of the challenging features of today's processors is predication, which interferes with all optimization phases, as the SSA form does. Many classical algorithms become inefficient for predicated code. This is especially surprising since, besides giving a better trade-off between the number of conditional branches and the length of the critical path, converting control dependences into data dependences increases the size of basic blocks and hence creates new opportunities for local optimization algorithms. One has first to adapt classical algorithms to predicated code, but also to study the impact of predicated code on the whole compilation process.
As mentioned in Section 2.3, a lot of progress has already been made in this direction in our past collaborations with STMicroelectronics. In particular, the goal of the Sceptre project was to revisit, in the light of SSA, some code optimizations in an aggressive context, i.e., by looking for the best performance without limiting, a priori, the compilation time and the memory usage. One of the major results of this collaboration was to show that it is possible to exploit SSA to design a register allocator in two phases: first a spilling phase that is relatively target-independent, then the allocator itself, which takes into account architectural constraints and optimizes other aspects (in particular, coalescing). This new way of considering register allocation has shown its interest for aggressive static compilation. But it offers three other perspectives:
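As an illustration of how predication converts control dependences into data dependences, the following Python sketch merges the two arms of an if/else into a single block of guarded instructions and checks that it computes the same result as the branching version. The instruction encoding (a destination, a Python function, and operand names) and all function names are assumptions made for this example, and the sketch assumes that neither arm redefines the predicate.

```python
import operator

def if_convert(cond, then_ops, else_ops):
    """Merge both arms of an if/else into one predicated block.

    Each op is (dest, function, operand names). The result pairs every
    op with a guard (predicate name, expected value).
    Assumption: no op writes to `cond`.
    """
    guarded = [((cond, True), op) for op in then_ops]
    guarded += [((cond, False), op) for op in else_ops]
    return guarded

def run_predicated(code, env):
    """Execute a predicated block: an op only takes effect if its guard holds."""
    env = dict(env)
    for (pred, expected), (dest, fn, args) in code:
        if bool(env[pred]) == expected:
            env[dest] = fn(*(env[a] for a in args))
    return env

def run_branchy(cond, then_ops, else_ops, env):
    """Reference semantics: execute only the arm selected by the branch."""
    env = dict(env)
    for dest, fn, args in (then_ops if env[cond] else else_ops):
        env[dest] = fn(*(env[a] for a in args))
    return env

then_ops = [("r", operator.add, ("a", "b"))]
else_ops = [("r", operator.sub, ("a", "b"))]
block = if_convert("p", then_ops, else_ops)
```

After the conversion, the two definitions of `r` sit in the same basic block and are distinguished only by their guards. A liveness or interference analysis that ignores the guards would treat them as conflicting, which gives one hint of why classical algorithms need to be adapted to predicated code.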
A simplification of the allocator, which again goes toward a more reliable compiler design, based on SSA.
The possibility to handle the hardest part, the spilling phase, as a preliminary phase, thus a good candidate for split compilation.
The possibility of a fast allocator, with a much higher quality than usual JIT approaches such as “linear scan”, thus suitable for virtualization and JIT compilation.
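The two-phase idea can be sketched on a deliberately simplified case: straight-line code in which each SSA value has a single live interval `[start, end)`. This is not the allocator developed in the Sceptre project; the interval model, the spilling heuristic (spill the value whose interval ends last, and remove it entirely), and all names below are assumptions made for illustration. The point of the sketch is the decoupling: once the spilling phase has lowered the register pressure to at most `k`, the assignment phase cannot fail.

```python
def max_live_point(intervals):
    """Return (pressure, live values) at the point of highest pressure."""
    best = (0, [])
    for point in sorted({start for start, _ in intervals.values()}):
        live = [v for v, (s, e) in intervals.items() if s <= point < e]
        if len(live) > best[0]:
            best = (len(live), live)
    return best

def spill_phase(intervals, k):
    """Phase 1: spill values until at most k are live at any point."""
    kept, spilled = dict(intervals), []
    while True:
        pressure, live = max_live_point(kept)
        if pressure <= k:
            return kept, spilled
        # Hypothetical heuristic: evict the value that stays live longest.
        victim = max(live, key=lambda v: kept[v][1])
        spilled.append(victim)
        del kept[victim]

def assign_phase(intervals, k):
    """Phase 2: assign registers in definition order (needs pressure <= k)."""
    free, active, regs = list(range(k)), [], {}
    for v, (start, end) in sorted(intervals.items(), key=lambda it: it[1][0]):
        for e, u in list(active):
            if e <= start:  # u is dead: its register can be reused
                active.remove((e, u))
                free.append(regs[u])
        regs[v] = free.pop()
        active.append((end, v))
    return regs

intervals = {"a": (0, 5), "b": (1, 3), "c": (2, 6), "d": (4, 7)}
kept, spilled = spill_phase(intervals, k=2)
regs = assign_phase(kept, k=2)
```

On a general control-flow graph, the analogous guarantee relies on the known property that the interference graph of a program in strict SSA form is chordal, so that values can be assigned registers greedily in an order compatible with dominance. Because the spilling decisions are taken separately, they could be computed ahead of time and passed on to a late compilation phase.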
These additional possibilities have not been fully studied or developed yet. The objective of our new contract with STMicroelectronics, called Mediacom, is to address them. More generally, we want to continue to develop our activity on code optimizations, exploiting SSA properties, following our two-phase strategy:
First, revisit code optimizations in an aggressive context to develop better strategies, without discarding too quickly solutions that may have been considered too expensive in the past.
Then, exploit the new concepts introduced in the aggressive context to design better algorithms in a JIT context, focusing on the speed of algorithms and their memory footprint, without compromising too much on the quality of the generated code.
We want to consider more code optimizations and more architectural features, such as registers with aliasing, predication, and, possibly in the longer term, vectorization/parallelization again.