Section: New Results
Instruction Cache Optimization
The instruction cache is a small memory with fast access. All binary instructions of a program are executed from it. In the ST220 processor from STMicroelectronics, the instruction cache is direct mapped: cache line i can hold only instructions whose addresses are equal to i modulo L, where L is the size of the cache. When a program starts executing a block that is not in the cache, the block must be loaded from main memory; this is called a cache miss. A miss happens either at the first use of a function (cold miss) or after a conflict (conflict miss). There is a conflict when two functions share the same cache line: each of them evicts the other from the cache when their executions are interleaved. The cost of a cache miss is on the order of 150 cycles for the ST220, hence the interest in minimizing the number of conflicts by avoiding line sharing when two functions are executed in the same time slot. For this problem, we have considered the following three objective functions:
COL: Minimizing the number of conflicts for a given execution trace. This can be reduced to the Max-K-Cut and Ship-Building problems.
EXP: Minimizing the size of the code. This is equivalent to a traveling salesman problem (building a Hamiltonian circuit) on a very special graph called Cyclic-Metric.
NBH: Maximizing code locality at cache line granularity (the neighboring problem). This can be reduced to the traveling salesman problem.
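To make the conflict mechanism and the first objective concrete, the following minimal sketch counts cold and conflict misses of a direct-mapped cache for a given trace and procedure placement. It is an illustration only, not the tooling used in this work: the function name `count_misses`, the placement encoding, and the assumption that each call executes every line of the called function are all hypothetical simplifications.

```python
from typing import Dict, List, Optional, Tuple


def count_misses(
    trace: List[str],
    placement: Dict[str, Tuple[int, int]],
    num_lines: int,
) -> Tuple[int, int]:
    """Return (cold_misses, conflict_misses) for a direct-mapped cache.

    trace     -- sequence of function names in execution order
    placement -- hypothetical encoding: name -> (first line address, length in lines)
    num_lines -- L, the number of lines in the cache

    Assumption: every call touches all lines of the called function.
    """
    cache: List[Optional[int]] = [None] * num_lines  # cache[i] = line address held in slot i
    seen = set()  # line addresses that have been loaded at least once
    cold = conflict = 0
    for func in trace:
        start, length = placement[func]
        for addr in range(start, start + length):
            slot = addr % num_lines  # direct mapping: address modulo L
            if cache[slot] == addr:
                continue  # hit
            if addr in seen:
                conflict += 1  # line was loaded before and evicted since
            else:
                cold += 1  # first use of this line
                seen.add(addr)
            cache[slot] = addr
    return cold, conflict


# Two functions of two lines each, interleaved, in a 4-line cache.
trace = ["f", "g", "f", "g"]
overlapping = {"f": (0, 2), "g": (4, 2)}  # both map to slots 0 and 1
disjoint = {"f": (0, 2), "g": (2, 2)}     # f uses slots 0-1, g uses slots 2-3
print(count_misses(trace, overlapping, 4))
print(count_misses(trace, disjoint, 4))
```

In this toy setting, moving `g` by two lines removes all conflict misses while the cold misses are unchanged, which is the effect a procedure placement aims for.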
We have proved several non-approximability and polynomiality results related to the COL, EXP, and NBH problems. For the COL problem, Gloy and Smith (GS) proposed a greedy heuristic based on a conflict graph called the temporal relationship graph (TRG); their heuristic does not take EXP into consideration. Pettis and Hansen (PH) proposed a greedy fusion without code size expansion to enhance locality. As a side effect, for small codes (compared to the cache size), PH also reduces cache conflicts. We proposed several algorithmic improvements to these approaches: we have lowered the complexity of GS by a factor that varies from 50 to 500 on our benchmark suite; we had previously proposed a modified version of GS that takes EXP into account; we have improved this heuristic (B); we have also developed a heuristic solution (FILLGAP) that yields no code size expansion at all; finally, we have defined an affinity graph that reflects the NBH problem and implemented a heuristic more aggressive than PH to optimize the NBH problem. Depending on the structure of the conflict graph, we choose one of these three strategies. The code size expansion obtained this way is usually less than 2 percent, and the cache conflict rate improves by 5 percent on average compared to the best known algorithms.
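As an illustration of what a profile-based conflict graph can look like, the sketch below builds a simplified temporal relationship graph from a function-level trace. It follows the general idea of weighting pairs of functions whose executions interleave, but it is not the exact GS construction nor the improved version described above: the name `build_trg`, the window bound `window_factor * cache_size`, and the function-level granularity are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_trg(
    trace: List[str],
    sizes: Dict[str, int],
    cache_size: int,
    window_factor: int = 2,  # assumed bound on the history window
) -> Dict[Tuple[str, str], int]:
    """Build a simplified temporal relationship graph.

    The weight of edge (p, q) counts how often q was executed between two
    consecutive executions of p (or the reverse) while both were still in a
    bounded window of recently executed functions.
    """
    weights: Dict[Tuple[str, str], int] = defaultdict(int)
    recent: List[str] = []  # recently executed functions, oldest first
    for p in trace:
        if p in recent:
            idx = recent.index(p)
            # Every function executed since the previous execution of p
            # could have evicted p if they shared cache lines.
            for q in recent[idx + 1:]:
                weights[tuple(sorted((p, q)))] += 1
            recent.pop(idx)
        recent.append(p)
        # Forget the oldest functions once the window is too large.
        while len(recent) > 1 and sum(sizes[f] for f in recent) > window_factor * cache_size:
            recent.pop(0)
    return dict(weights)


sizes = {"f": 1, "g": 1, "h": 1}
trg = build_trg(["f", "g", "f", "h", "g", "f"], sizes, cache_size=4)
print(trg)
```

A placement heuristic would then try to assign heavily weighted pairs, here `f` and `g`, to disjoint cache lines.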
We also worked on a completely different approach whose goal is to provide a procedure placement that yields no code size expansion and takes NBH into account. Our algorithm still has to be implemented. Finally, we are currently working on the following problem, for which we have no satisfactory answer yet:
- Profiling vs static
The GS algorithm is based on a conflict graph built using profiling. In practice, it leads to a much better conflict reduction than any known static approach. But profiling has several drawbacks: a compilation process with profiling feedback is too complex for most application developers, and finding representative data sets for profiling is an important issue. The problem is then to be able to build a TRG using fully static, fully dynamic, or hybrid techniques.
This work is a continuation of the work done in the previous contract with the cec team at STMicroelectronics (see Section 7.1).