# Project: Tropics

## Section: Scientific Foundations

Keywords: program transformation, automatic differentiation, scientific computing, simulation, optimization, adjoint models.

### Automatic Differentiation

Participants: Mauricio Araya-Polo, Benjamin Dauvergne, Laurent Hascoët, Christophe Massol, Valérie Pascual.

- **automatic differentiation (AD)**: Automatic transformation of a program, that returns a new program computing some derivatives of the given initial program, i.e. some combination of the partial derivatives of the program's outputs with respect to its inputs.

- **adjoint model**: Mathematical manipulation of the partial derivative equations that define a problem, that returns new differential equations defining the gradient of the original problem's solution.

- **checkpointing**: General trade-off technique, used in the reverse mode of AD, that trades duplicate execution of a part of the program against the memory space needed to store intermediate results. Checkpointing a code fragment amounts to running this fragment without any storage of intermediate values, thus saving memory space. Later, when such an intermediate value is required, the fragment is run a second time to obtain the required values.

Automatic or Algorithmic Differentiation (AD) differentiates
*programs*. An AD tool takes as input
a source computer program P that, given a vector argument X ∈ IR^{n},
computes some vector function Y = F(X) ∈ IR^{m}.
The AD tool generates a new source program that,
given the argument X, computes some derivatives of F.
In short, AD first assumes that P represents all its possible run-time
sequences of instructions, and it will in fact differentiate these sequences.
Therefore, the *control* of P is put aside temporarily, and AD will
simply reproduce this control into the differentiated program.
In other words, P is differentiated only piecewise.
Experience shows that this is reasonable in most cases, and going further is
still an open research problem.
Then, any sequence of instructions is identified with a composition of
vector functions. Thus, for a given control:

F = f_{p} ∘ f_{p-1} ∘ ... ∘ f_{1} ,    (1)

where each f_{k} is the elementary function implemented by instruction I_{k}.
Finally, AD simply applies the chain rule to obtain derivatives of F.
Let us call X_{k} the values of all variables after each
instruction I_{k}, i.e. X_{0} = X and X_{k} = f_{k}(X_{k-1}).
The chain rule gives the Jacobian of F:

F'(X) = f'_{p}(X_{p-1}) . f'_{p-1}(X_{p-2}) . ... . f'_{1}(X_{0}) ,    (2)

which can be mechanically translated back into a sequence of instructions I'_{k}, and these sequences inserted back into the control of P, yielding the differentiated program P'. This can be generalized to higher-order derivatives, Taylor series, etc.
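As a concrete illustration, consider a tiny straight-line program computing y = sin(x * x). The following hand-built sketch (illustrative only, not tool output) identifies each instruction with an elementary function f_{k} and multiplies their partial derivatives, as the chain rule prescribes:

```python
import math

# A tiny straight-line program P: y = sin(x * x),
# viewed as the composition f2 o f1 of two elementary instructions.
def f1(x):            # I_1 : v1 = x * x
    return x * x

def f2(v1):           # I_2 : y = sin(v1)
    return math.sin(v1)

def d_f1(x):          # partial derivative of I_1
    return 2.0 * x

def d_f2(v1):         # partial derivative of I_2
    return math.cos(v1)

def jacobian_of_P(x):
    """Chain rule: F'(X) = f2'(X_1) . f1'(X_0), with X_0 = x, X_1 = f1(x)."""
    x0 = x
    x1 = f1(x0)
    return d_f2(x1) * d_f1(x0)

# Sanity check against a finite-difference approximation:
x, h = 0.7, 1e-6
fd = (f2(f1(x + h)) - f2(f1(x))) / h
assert abs(jacobian_of_P(x) - fd) < 1e-4
```

Here the "Jacobian" is a 1×1 matrix, so the matrix products of the chain rule reduce to scalar multiplications; the structure is the same for vector-valued instructions.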

In practice, the above Jacobian is often far too expensive to compute and store.
Notice for instance that equation (2) repeatedly multiplies matrices, whose size
is of the order of m×n. Moreover, some problems are solved using
only some projections of F'(X). For example, one may need only *sensitivities*,
which are F'(X).Ẋ for a given direction Ẋ in the input space.
Using equation (2), the sensitivity is

F'(X).Ẋ = f'_{p}(X_{p-1}) . f'_{p-1}(X_{p-2}) . ... . f'_{1}(X_{0}) . Ẋ ,    (3)

which is easily computed from right to left, interleaved with the original
program instructions. This is the principle of the *tangent mode* of AD,
the most straightforward mode, which is of course available in Tapenade.
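The interleaving can be sketched by hand for the same toy program y = sin(x * x) (a minimal illustration, not Tapenade output): each original instruction is preceded by a tangent instruction propagating the direction xd.

```python
import math

# Tangent-mode sketch: the derivative statement for each instruction
# comes just before the original statement, so that it reads the
# not-yet-overwritten values it needs.
def P_tangent(x, xd):
    # I_1 : v1 = x * x    tangent: v1d = 2 * x * xd
    v1d = 2.0 * x * xd
    v1 = x * x
    # I_2 : y = sin(v1)   tangent: yd = cos(v1) * v1d
    yd = math.cos(v1) * v1d
    y = math.sin(v1)
    return y, yd
```

One run of P_tangent yields one directional derivative; a full gradient of an n-input program would take n such runs, one per input direction.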

However, in optimization, data assimilation [41],
adjoint problems [35], or inverse problems,
the appropriate derivative is the *gradient*.
Using equation (2), the gradient is

F'*(X).Ȳ = f'*_{1}(X_{0}) . f'*_{2}(X_{1}) . ... . f'*_{p}(X_{p-1}) . Ȳ ,    (4)
which is most efficiently computed from right to left,
because matrix×vector products are so much cheaper
than matrix×matrix products.
This is the principle of the *reverse mode* of AD.

This turns out to make a very efficient program, at least
theoretically [37]. The computation time
required for the gradient is only a small multiple of the run time of P.
In particular, it is independent of the number of input parameters n.
In contrast, notice that computing the same gradient with the *tangent mode*
would require running the tangent differentiated program n times.
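A hand-written reverse-mode sketch for a two-input toy program y = sin(x1 * x2) (illustrative only, not Tapenade output) shows the forward sweep followed by the backward sweep applying the transposed partial derivatives in reverse instruction order:

```python
import math

# Reverse-mode sketch: the forward sweep computes the original program
# and keeps what the backward sweep will need (here, v1); the backward
# sweep then propagates the output adjoint yb right to left.
def P_reverse(x1, x2, yb=1.0):
    # forward sweep
    v1 = x1 * x2                 # I_1
    y = math.sin(v1)             # I_2
    # backward sweep, in reverse instruction order
    v1b = math.cos(v1) * yb      # adjoint of I_2 : y = sin(v1)
    x1b = x2 * v1b               # adjoint of I_1 : v1 = x1 * x2
    x2b = x1 * v1b
    return y, (x1b, x2b)
```

A single run returns the whole gradient (x1b, x2b), regardless of the number of inputs, which is precisely the cost advantage of the reverse mode.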

We can observe that the X_{k} are required in
the *reverse* of their computation order. If the
original program *overwrites* a part of X_{k},
the differentiated program must restore X_{k} before it is used
by f'*_{k+1}(X_{k}). There are two strategies for that:

- **Recompute All (RA):** the X_{k} is recomputed when needed, restarting P on input X_{0} until instruction I_{k}. The TAF [33] tool uses this strategy. The brute-force RA strategy has a quadratic time cost with respect to the total number of run-time instructions p.
- **Store All (SA):** the X_{k} are restored from a stack when needed. This stack is filled during a preliminary run of P, that additionally stores variables on the stack just before they are overwritten. The Adifor [28] and Tapenade tools use this strategy. The brute-force SA strategy has a linear memory cost with respect to p.
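The SA mechanism can be sketched as follows (hypothetical helper code, not Tapenade's generated code), for a program whose second instruction overwrites the variable v:

```python
import math

# Store-All sketch: before an instruction overwrites a variable, the
# forward sweep pushes the old value on a stack; the backward sweep
# pops it to restore the state X_k that the adjoint instructions need.
stack = []

def P_forward(x):
    v = x * x             # I_1 : v = x * x
    stack.append(v)       # push v, about to be overwritten
    v = math.sin(v)       # I_2 : v = sin(v)  (overwrites v)
    return v

def P_backward(x, vb):
    v = stack.pop()           # restore v as it was before I_2
    vb = math.cos(v) * vb     # adjoint of I_2 : v = sin(v)
    xb = 2.0 * x * vb         # adjoint of I_1 : v = x * x
    return xb
```

The stack grows during the forward sweep and shrinks back during the backward sweep, which is what makes the brute-force SA memory cost linear in p.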

Both RA and SA strategies need a special storage/recomputation trade-off
in order to be really profitable, and this makes them
become very similar. This trade-off is called *checkpointing*.
Since tapenade uses the SA strategy, let us describe checkpointing
in this context. The plain SA strategy applied to instructions I_{1} to I_{p}
builds the differentiated program sketched on figure 1, where an initial
"forward sweep" runs the original program and stores intermediate values
(black dots), and is followed by a "backward sweep" that computes the
derivatives in the reverse order, using the stored values when necessary
(white dots).
Checkpointing a fragment **C** of the program is illustrated on figure 2.
During the forward sweep, no value is stored while in **C**.
Later, when the backward sweep needs values from **C**, the fragment is run again,
this time with storage. One can see that the peak storage space is roughly
halved. Checkpointing also requires some extra storage (a "snapshot")
to restore the initial context of **C**. This snapshot is shown on figure 2
by slightly bigger black and white dots.

Checkpoints can be nested. In that case, a clever choice of checkpoints can make both the memory size and the number of extra recomputations grow only like the logarithm of the size of the program.
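The nested case can be sketched as a recursive bisection (a toy illustration of the idea, not Tapenade's actual checkpointing scheme; all names here are hypothetical): to reverse a range of instructions, snapshot the state at its start, advance to the midpoint, reverse the second half first, then the first half from the snapshot. Live snapshots and repeated forward sweeps both grow like log2(p).

```python
import math

# Reverse instructions [lo, hi) by recursive bisection checkpointing.
# `state` is X_{lo}; `step(i, s)` runs instruction I_{i+1} on state s;
# `adjoint_step(i, s, bar)` applies the transposed derivative of that
# instruction, evaluated at the state s it read, to the adjoint bar.
def reverse_range(lo, hi, state, step, adjoint_step, bar):
    if hi - lo == 1:
        return adjoint_step(lo, state, bar)    # leaf: state is X_{lo}
    mid = (lo + hi) // 2
    snapshot = state                           # checkpoint of X_{lo}
    for i in range(lo, mid):                   # advance to X_{mid}
        state = step(i, state)
    bar = reverse_range(mid, hi, state, step, adjoint_step, bar)
    return reverse_range(lo, mid, snapshot, step, adjoint_step, bar)

# Example: p = 8 repeated applications of sin; the adjoint of each
# instruction multiplies the incoming bar by cos of the value it read.
def step(i, x):
    return math.sin(x)

def adjoint_step(i, x, bar):
    return bar * math.cos(x)

grad = reverse_range(0, 8, 0.5, step, adjoint_step, 1.0)
```

At any moment at most log2(8) = 3 snapshots are live, instead of the 8 stored states a plain SA reversal would keep.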