Project : tropics
Section: Scientific Foundations
Keywords : program transformation, automatic differentiation, scientific computing, simulation, optimization, adjoint models.
- automatic differentiation
(AD) Automatic transformation of a program, that returns a new program that computes some derivatives of the given initial program, i.e. some combination of the partial derivatives of the program's outputs with respect to its inputs.
- adjoint model
Mathematical manipulation of the partial derivative equations that define a problem, that returns new differential equations that define the gradient of the original problem's solution.
General trade-off technique, used in the reverse mode of AD, that trades duplicate execution of a part of the program to save some memory space that was used to save intermediate results. Checkpointing a code fragment amounts to running this fragment without any storage of intermediate values, thus saving memory space. Later, when such an intermediate value is required, the fragment is run a second time to obtain the required values.
Automatic or Algorithmic Differentiation (AD) differentiates programs. An AD tool takes as input a source computer program P that, given a vector argument XIRn, computes some vector function Y = F(X) IRm. The AD tool generates a new source program that, given the argument X, computes some derivatives of F. In short, AD first assumes that P represents all its possible run-time sequences of instructions, and it will in fact differentiate these sequences. Therefore, the control of P is put aside temporarily, and AD will simply reproduce this control into the differentiated program. In other words, P is differentiated only piecewise. Experience shows that this is reasonable in most cases, and going further is still an open research problem. Then, any sequence of instructions is identified with a composition of vector functions. Thus, for a given control:
where each fk is the elementary function implemented by instruction Ik. Finally, AD simply applies the chain rule to obtain derivatives of F. Let us call Xk the values of all variables after each instruction Ik, i.e. X0 = X and Xk = fk(Xk-1). The chain rule gives the Jacobian of F
which can be mechanically translated back into a sequence of instructions , and these sequences inserted back into the control of P, yielding program . This can be generalized to higher level derivatives, Taylor series, etc.
In practice, the above Jacobian is often far too expensive to compute and store. Notice for instance that equation (2) repeatedly multiplies matrices, whose size is of the order of m×n. Moreover, some problems are solved using only some projections of . For example, one may need only sensitivities, which are for a given direction in the input space. Using equation (2), sensitivity is
which is easily computed from right to left, interleaved with the original program instructions. This is the principle of the tangent mode of AD, which is the most straightforward, of course available in tapenade.
which is most efficiently computed from right to left, because matrix×vector products are so much cheaper than matrix×matrix products. This is the principle of the reverse mode of AD.
This turns out to make a very efficient program, at least theoretically . The computation time required for the gradient is only a small multiple of the run time of P. It is independent from the number of parameters n. In contrast, notice that computing the same gradient with the tangent mode would require running the tangent differentiated program n times.
We can observe that the Xk are required in the inverse of their computation order. If the original program overwrites a part of Xk, the differentiated program must restore Xk before it is used by . There are two strategies for that:
Recompute All (RA): the Xk is recomputed when needed, restarting P on input X0 until instruction Ik. The taf tool uses this strategy. Brute-force RA strategy has a quadratic time cost with respect to the total number of run-time instructions p.
Store All (SA): the Xk are restored from a stack when needed. This stack is filled during a preliminary run of P, that additionally stores variables on the stack just before they are overwritten. The adifor and tapenade tools use this strategy. Brute-force SA strategy has a linear memory cost with respect to p.
Both RA and SA strategies need a special storage/recomputation trade-off in order to be really profitable, and this makes them become very similar. This trade-off is called checkpointing. Since tapenade uses the SA strategy, let us describe checkpointing in this context. The plain SA strategy applied to instructions I1 to Ip builds the differentiated program sketched on figure 1, where
an initial ``forward sweep'' runs the original program and stores intermediate values (black dots), and is followed by a ``backward sweep'' that computes the derivatives in the reverse order, using the stored values when necessary (white dots). Checkpointing a fragment C of the program is illustrated on figure 2. During the forward sweep, no value is stored while in C. Later, when the backward sweep needs values from C, the fragment is run again, this time with storage. One can see that the maximum storage space is grossly divided by 2. This also requires some extra memorization (a ``snapshot''), to restore the initial context of C. This snapshot is shown on figure 2 by slightly bigger black and white dots.
Checkpoints can be nested. In that case, a clever choice of checkpoints can make both the memory size and the extra recomputations grow like only the logarithm of the size of the program.