Project : tropics
Section: Scientific Foundations
Keywords : static analysis, program transformation, compilation, abstract syntax tree, control flow graph, data flow analysis, data dependence graph, abstract interpretation.
Static Analyses and Transformation of Programs
- abstract syntax tree
Tree representation of a computer program that keeps only the semantically significant information and abstracts away syntactic sugar such as indentation, parentheses, or separators.
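As an illustration, here is a minimal sketch (in Python, with hypothetical node classes chosen for this example only) of how the assignment a = (b + c) * 2 could be stored as an abstract syntax tree. The parentheses and spacing are not stored; only the operator structure remains, implied by the shape of the tree.

```python
from dataclasses import dataclass

# Hypothetical AST node types, for illustration only.
@dataclass
class Var:
    name: str

@dataclass
class Const:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

@dataclass
class Assign:
    target: Var
    value: object

# The statement "a = (b + c) * 2": syntactic sugar is gone,
# only the semantically significant structure is kept.
tree = Assign(Var("a"), BinOp("*", BinOp("+", Var("b"), Var("c")), Const(2)))

def unparse(node):
    """Regenerate a textual form; parentheses are reintroduced
    from the tree shape, not retrieved from the source text."""
    if isinstance(node, Var):
        return node.name
    if isinstance(node, Const):
        return str(node.value)
    if isinstance(node, BinOp):
        return f"({unparse(node.left)} {node.op} {unparse(node.right)})"
    if isinstance(node, Assign):
        return f"{unparse(node.target)} = {unparse(node.value)}"
```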
- control flow graph
Representation of a procedure body as a directed graph whose nodes, known as basic blocks, each contain a list of instructions to be executed in sequence, and whose arcs represent all possible control jumps that can occur at run time.
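For instance, a simple while loop could be stored as the following control flow graph (a sketch with invented block names): each basic block is a straight-line list of instructions, and the arcs enumerate every jump that may occur at run time.

```python
# Hypothetical flow graph for:
#   x = 0
#   while x < 10: x = x + 1
#   print(x)
blocks = {
    "entry": ["x = 0"],
    "test":  ["x < 10 ?"],
    "body":  ["x = x + 1"],
    "exit":  ["print(x)"],
}
# Arcs: the loop creates a cycle test -> body -> test.
arcs = [("entry", "test"), ("test", "body"), ("body", "test"), ("test", "exit")]

def successors(block):
    """All blocks that control may jump to after this one."""
    return [dst for src, dst in arcs if src == block]
```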
- abstract interpretation
Framework that describes static program analyses as a special sort of execution, in which all branches of control switches are taken simultaneously and computed values are replaced by abstract values from a given semantic domain. Each particular analysis gives rise to a specific semantic domain.
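A classic textbook example of such a semantic domain is the domain of signs. The toy sketch below (not any particular tool's implementation) replaces numbers by their signs and merges the two branches of a test with a join, exactly as the definition above describes.

```python
# Abstract values of the sign domain: negative, zero, positive, unknown.
NEG, ZERO, POS, TOP = "-", "0", "+", "T"

def abs_const(c):
    """Abstraction of a concrete number to its sign."""
    return POS if c > 0 else NEG if c < 0 else ZERO

def abs_mul(a, b):
    """Abstract multiplication: the rule of signs."""
    if ZERO in (a, b):
        return ZERO
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG

def join(a, b):
    """Both branches of a control switch are taken; their results merge."""
    return a if a == b else TOP

# if cond: x = 3   else: x = -5
x = join(abs_const(3), abs_const(-5))   # sign of x after the branch: unknown
y = abs_mul(x, ZERO)                    # yet y = x * 0 is known to be zero
```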
- data flow analysis
Program analysis that studies how a given property of variables evolves with the execution of the program. Data-flow analyses are static, therefore studying all possible run-time behaviors and making conservative approximations. A typical data-flow analysis detects whether, at each location in the source program, a given variable is initialized or not.
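The initialized-variable analysis can be sketched as follows (a deliberately simplified model of one branch, not a full analysis): since the analysis is static and cannot know which branch will run, it keeps only the conservative intersection of the two outcomes.

```python
# Conservative "definitely initialized" sets across a two-way branch.
def defined_after_branch(defined_before, then_defs, else_defs):
    then_out = defined_before | then_defs   # variables set in the then-branch
    else_out = defined_before | else_defs   # variables set in the else-branch
    return then_out & else_out              # conservative approximation at the join

# if cond: x = 1; y = 2   else: x = 3
out = defined_after_branch(set(), {"x", "y"}, {"x"})
# Only x is definitely initialized afterwards; a later use of y is suspect.
```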
- data dependence analysis
Program analysis that studies the itinerary of values during program execution, from the place where a value is generated to the places where it is used, and finally to the place where it is overwritten. The collection of all these itineraries is often stored as a data dependence graph, and data-flow analyses most often rely on this graph.
- data dependence graph
Directed graph that relates accesses to program variables: from the write access that defines a new value to the read accesses that use this value, and conversely from the read accesses to the write access that overwrites this value. Dependences express a partial order between operations that must be preserved to preserve the program's results.
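Both kinds of arcs in the definition above can be computed by a simple pairwise scan, sketched below for straight-line code (array accesses and aliasing, which make the real analysis hard, are ignored here). Write-to-read arcs are the so-called flow dependences; read-to-write arcs are anti-dependences.

```python
# Each instruction: (name, set of variables read, variable written).
instrs = [
    ("i1", {"b"}, "a"),   # a = b + 1
    ("i2", {"a"}, "c"),   # c = a * 2
    ("i3", {"c"}, "a"),   # a = c - 3
]

def dependences(instrs):
    edges = []
    for i, (ni, ri, wi) in enumerate(instrs):
        for nj, rj, wj in instrs[i + 1:]:
            if wi in rj:
                edges.append((ni, nj, "flow"))  # value defined at ni is used at nj
            if wj in ri:
                edges.append((ni, nj, "anti"))  # ni must read before nj overwrites
    return edges
```

Any rescheduling of i1, i2, i3 that respects this partial order leaves the program's results unchanged.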
The most obvious example of a program transformation tool is certainly a compiler. Other examples are program translators, which go from one language or formalism to another, and optimizers, which transform a program to improve its performance. AD is just one such transformation. These tools use sophisticated analyses to improve the quality of the produced code, and they share their technological basis. More importantly, there are common mathematical models to specify and analyze them.
An important principle is abstraction: the core of a compiler should not bother about syntactic details of the compiled program. In particular, it is desirable that the optimization and code generation phases be independent of the particular input programming language. This can generally be achieved through separate front-ends that produce an internal language-independent representation of the program, generally an abstract syntax tree. For example, compilers like gcc for C and g77 for Fortran 77 have separate front-ends but share most of their back-end.
One can go further: as abstraction increases, the internal representation becomes more language-independent, and semantic constructs such as declarations, assignments, calls, and I/O operations can be unified. Analyses can then concentrate on the semantics of a small set of constructs. We advocate an internal representation composed of three levels.
At the top level is the call graph, whose nodes are the procedures. There is an arrow from node A to node B iff A may call B. Recursion leads to cycles. The call graph captures the visibility rules between procedures that come from modules or classes.
At the middle level is the control flow graph. There is one flow graph per procedure, i.e. per node in the call graph. The flow graph captures the control flow between atomic instructions. Flow control instructions are represented uniformly inside the control flow graph.
At the lowest level are abstract syntax trees for the individual atomic instructions. Certain semantic transformations can benefit from representing expressions as directed acyclic graphs that share common sub-expressions.
To each basic block is associated a symbol table that gives access to properties of variables, constants, function names, type names, and so on. Symbol tables must be nested to implement lexical scoping.
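Nested symbol tables implementing lexical scoping can be sketched as follows (a minimal model with invented property names): a lookup walks outward from the innermost table to the enclosing ones.

```python
class SymbolTable:
    """A table of symbol properties, chained to the enclosing scope."""
    def __init__(self, parent=None):
        self.parent = parent
        self.entries = {}

    def declare(self, name, props):
        self.entries[name] = props

    def lookup(self, name):
        """Search this scope, then the enclosing scopes."""
        table = self
        while table is not None:
            if name in table.entries:
                return table.entries[name]
            table = table.parent
        raise KeyError(name)

# A global scope and a nested procedure scope.
globals_tab = SymbolTable()
globals_tab.declare("pi", {"kind": "constant", "type": "real"})
proc_tab = SymbolTable(parent=globals_tab)
proc_tab.declare("x", {"kind": "variable", "type": "integer"})
```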
Static program analyses can be defined on this internal representation, which is largely language independent. The simplest analyses on trees can be specified with inference rules. But many analyses are more complex, and are thus better defined on graphs than on trees. This is the case for data-flow analyses, which look for run-time properties of variables. Since flow graphs are cyclic, these global analyses generally require an iterative resolution. Data-flow equations are a practical formalism to describe data-flow analyses. There exists a more precise formalism, which can distinguish between separate instances of instructions; however, it is still based on trees, and its cost forbids application to large codes. Abstract interpretation is a theoretical framework to study the complexity and termination of these analyses.
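The iterative resolution mentioned above can be sketched on the earlier while-loop shape (a toy solver, with invented block names): the "definitely initialized" sets are intersected over predecessors, and the equations are re-evaluated until a fixed point is reached, which the cycle in the graph makes necessary.

```python
# Flow graph   entry -> test -> body -> test, test -> exit   (a loop).
preds = {"entry": [], "test": ["entry", "body"], "body": ["test"], "exit": ["test"]}
gen = {"entry": {"x"}, "test": set(), "body": {"y"}, "exit": set()}
ALL = {"x", "y"}

def solve(preds, gen, all_vars):
    out = {b: set(all_vars) for b in preds}   # optimistic start
    changed = True
    while changed:                            # iterate to a fixed point
        changed = False
        for b in preds:
            if preds[b]:
                inp = set(all_vars)
                for p in preds[b]:
                    inp &= out[p]             # conservative: all paths agree
            else:
                inp = set()                   # nothing initialized at entry
            new = inp | gen[b]
            if new != out[b]:
                out[b] = new
                changed = True
    return out

result = solve(preds, gen, ALL)
# y is not guaranteed initialized at exit: the loop body may never run.
```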
Data flow analyses must be carefully designed to avoid or control combinatorial explosion. The classical solution is to choose a hierarchical model. In this model, information, or at least a computationally expensive part of it, is synthesized. Specifically, it is computed bottom up, starting on the lowest (and smallest) levels of the program representation and then recursively combined at the upper (and larger) levels. Consequently, this synthesized information must be made independent of the context (i.e., the rest of the program). When the synthesized information is built, it is used in a final pass, essentially top down and context dependent, that propagates information from the ``extremities'' of the program (its beginning or end) to each particular subroutine, basic block, or instruction.
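The bottom-up synthesis phase can be sketched on a toy call graph (invented procedure names; recursion, which would require a fixed point, is left out): each procedure gets a context-independent summary, here the set of variables it may write, computed once and reused at every call site.

```python
# Variables written directly by each procedure, and the call graph.
direct_writes = {"leaf": {"x"}, "mid": {"y"}, "main": {"z"}}
calls = {"main": ["mid"], "mid": ["leaf"], "leaf": []}

# Bottom-up pass over the call graph: a callee's summary is already
# available when its callers are processed, so it is never reanalyzed.
summary = {}
for proc in ["leaf", "mid", "main"]:
    w = set(direct_writes[proc])
    for callee in calls[proc]:
        w |= summary[callee]      # reuse the synthesized summary
    summary[proc] = w
```

A final top-down pass would then propagate context-dependent information (e.g. which of these writes matter for a given use of main) using the same summaries.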
Even then, data-flow analyses are limited, because they are static and thus have very little knowledge of actual run-time values. Most of the questions they address are undecidable; that is, there always exists a program for which the result of the analysis is uncertain. This is a strong, yet very theoretical, limitation. More concretely, there are always cases where one cannot decide statically that, for example, two variables are equal. This is even more frequent with two pointers or two array accesses. Therefore, in order to obtain safe results, conservative over-approximations of the computed information are generated. For instance, such approximations are made when analyzing the activity or the TBR (``To Be Recorded'') status of some individual element of an array. Static and dynamic array region analyses provide very good approximations. Otherwise, we make a coarse approximation, such as considering all cells of an array equivalent.
When studying program transformations, one often wants to move instructions around without changing the results of the program. The fundamental tool for this is the data dependence graph. This graph defines an order between run-time instructions such that, if this order is preserved when instructions are rescheduled, then the output of the program is unchanged. The data dependence graph is the basis for automatic parallelization. It is also useful in AD. Data dependence analysis is the static data-flow analysis that builds the data dependence graph.