Inria / Raweb 2004
Project: tropics

Section: Scientific Foundations

Keywords : program transformation, automatic differentiation, scientific computing, simulation, optimization, adjoint models.

Automatic Differentiation

Participants : Mauricio Araya-Polo, Benjamin Dauvergne, Laurent Hascoët, Christophe Massol, Valérie Pascual.

automatic differentiation

(AD) Automatic transformation of a program, returning a new program that computes some derivatives of the given initial program, i.e. some combination of the partial derivatives of the program's outputs with respect to its inputs.

adjoint model

Mathematical manipulation of the partial differential equations that define a problem, returning new differential equations that define the gradient of the original problem's solution.


checkpointing

General trade-off technique, used in the reverse mode of AD, that trades duplicate execution of a part of the program for the memory space that would otherwise hold its intermediate results. Checkpointing a code fragment amounts to running this fragment without storing any intermediate values, thus saving memory space. Later, when such an intermediate value is required, the fragment is run a second time to obtain it.

Automatic or Algorithmic Differentiation (AD) differentiates programs. An AD tool takes as input a source computer program P that, given a vector argument $X \in {\rm I\!R}^n$, computes some vector function $Y = F(X) \in {\rm I\!R}^m$. The AD tool generates a new source program that, given the argument X, computes some derivatives of F. In short, AD first assumes that P represents all its possible run-time sequences of instructions, and it will in fact differentiate these sequences. Therefore, the control of P is put aside temporarily, and AD will simply reproduce this control into the differentiated program. In other words, P is differentiated only piecewise. Experience shows that this is reasonable in most cases, and going further is still an open research problem. Then, any sequence of instructions is identified with a composition of vector functions. Thus, for a given control:

$$F = f_p \circ f_{p-1} \circ \cdots \circ f_1 \qquad (1)$$

where each $f_k$ is the elementary function implemented by instruction $I_k$. Finally, AD simply applies the chain rule to obtain derivatives of F. Let us call $X_k$ the values of all variables after each instruction $I_k$, i.e. $X_0 = X$ and $X_k = f_k(X_{k-1})$. The chain rule gives the Jacobian $F'(X)$ of F

$$F'(X) = f'_p(X_{p-1}) \cdot f'_{p-1}(X_{p-2}) \cdot\; \cdots \;\cdot f'_1(X_0) \qquad (2)$$

which can be mechanically translated back into a sequence of instructions $I'_k$, and these sequences inserted back into the control of P, yielding program $P'$. This can be generalized to higher-order derivatives, Taylor series, etc.
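To make the chain rule of equation (2) concrete, here is a small sketch in Python; the two-instruction program, its elementary Jacobians, and all names below are hypothetical illustrations, not part of tapenade:

```python
import numpy as np

# Hypothetical two-instruction program P acting on the state X = (x1, x2):
#   I1:  x1 := x1 * x2          (elementary function f1)
#   I2:  x2 := sin(x1) + x2     (elementary function f2)
# so that F = f2 o f1, as in equation (1).

def f1(X):
    x1, x2 = X
    return np.array([x1 * x2, x2])

def f1_prime(X):                      # Jacobian of f1 at X
    x1, x2 = X
    return np.array([[x2, x1],
                     [0.0, 1.0]])

def f2_prime(X):                      # Jacobian of f2 at X
    x1, _ = X
    return np.array([[1.0, 0.0],
                     [np.cos(x1), 1.0]])

X0 = np.array([0.5, 2.0])
X1 = f1(X0)                           # state after instruction I1

# Equation (2): F'(X) = f2'(X1) . f1'(X0)
J = f2_prime(X1) @ f1_prime(X0)

# Check against the hand-derived Jacobian of F(x1,x2) = (x1*x2, sin(x1*x2)+x2)
x1, x2 = X0
J_exact = np.array([[x2, x1],
                    [np.cos(x1 * x2) * x2, np.cos(x1 * x2) * x1 + 1.0]])
assert np.allclose(J, J_exact)
```

The product of the two small elementary Jacobians, taken at the intermediate states $X_0$ and $X_1$, reproduces the full Jacobian of F.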

In practice, the above Jacobian $F'(X)$ is often far too expensive to compute and store. Notice for instance that equation (2) repeatedly multiplies matrices whose size is of the order of m×n. Moreover, some problems are solved using only some projections of $F'(X)$. For example, one may need only sensitivities, which are $F'(X)\cdot\dot{X}$ for a given direction $\dot{X}$ in the input space. Using equation (2), the sensitivity is

$$F'(X)\cdot\dot{X} = f'_p(X_{p-1}) \cdot f'_{p-1}(X_{p-2}) \cdot\; \cdots \;\cdot f'_1(X_0) \cdot \dot{X}, \qquad (3)$$

which is easily computed from right to left, interleaved with the original program instructions. This is the principle of the tangent mode of AD, which is the most straightforward mode, and is of course available in tapenade.
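The interleaving can be sketched as follows, on the same hypothetical two-instruction program as before (names and code are illustrative, not tapenade output): each original instruction is preceded by the statement that propagates the directional derivative, evaluating equation (3) from right to left.

```python
import numpy as np

# Tangent-mode sketch for the hypothetical program
#   I1:  x1 := x1 * x2
#   I2:  x2 := sin(x1) + x2

def F_tangent(x1, x2, x1d, x2d):
    """Return (F(X), F'(X).Xdot) for the direction Xdot = (x1d, x2d)."""
    # derivative statement for I1, then I1 itself
    x1d = x1d * x2 + x1 * x2d
    x1 = x1 * x2
    # derivative statement for I2, then I2 itself
    x2d = np.cos(x1) * x1d + x2d
    x2 = np.sin(x1) + x2
    return (x1, x2), (x1d, x2d)

# Directional derivative along Xdot = (1, 0), checked by central differences
X = (0.5, 2.0)
(_, _), (y1d, y2d) = F_tangent(*X, 1.0, 0.0)
h = 1e-7
(p1, p2), _ = F_tangent(X[0] + h, X[1], 0.0, 0.0)
(m1, m2), _ = F_tangent(X[0] - h, X[1], 0.0, 0.0)
assert abs(y1d - (p1 - m1) / (2 * h)) < 1e-6
assert abs(y2d - (p2 - m2) / (2 * h)) < 1e-6
```

Note that each derivative statement uses only values available at that point of the original execution, which is why the tangent program runs in a single forward pass.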

However, in optimization, data assimilation [41], adjoint problems [35], or inverse problems, the appropriate derivative is the gradient $F'^{*}(X)\cdot\bar{Y}$. Using equation (2), the gradient is

$$F'^{*}(X)\cdot\bar{Y} = f'^{*}_1(X_0) \cdot f'^{*}_2(X_1) \cdot\; \cdots \;\cdot f'^{*}_{p-1}(X_{p-2}) \cdot f'^{*}_p(X_{p-1}) \cdot \bar{Y}, \qquad (4)$$

which is most efficiently computed from right to left, because matrix×vector products are so much cheaper than matrix×matrix products. This is the principle of the reverse mode of AD.

This turns out to make a very efficient program, at least theoretically [37]. The computation time required for the gradient is only a small multiple of the run time of P, and it is independent of the number n of input parameters. In contrast, computing the same gradient with the tangent mode would require running the tangent differentiated program n times.
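A reverse-mode sketch for the same hypothetical two-instruction program illustrates equation (4); the code below is an illustration of the Store-All principle, not generated tapenade code. A forward sweep runs P while stacking the values that are about to be overwritten, and a backward sweep applies the transposed derivative statements in reverse order:

```python
import numpy as np

# Reverse-mode (Store-All) sketch for the hypothetical program
#   I1:  x1 := x1 * x2
#   I2:  x2 := sin(x1) + x2

def F_gradient(x1, x2, y1b, y2b):
    """Return the gradient F'*(X).Ybar for the output weights (y1b, y2b)."""
    stack = []
    # --- forward sweep: run P, storing values about to be overwritten
    stack.append(x1)              # I1 overwrites x1
    x1 = x1 * x2                  # I1
    y2 = np.sin(x1) + x2          # I2 (result kept in y2 so the original
                                  #     x2 stays available for the backward sweep)
    # --- backward sweep: transposed derivative statements, reverse order
    x1b, x2b = y1b, y2b
    x1b = x1b + np.cos(x1) * x2b  # adjoint of I2, uses the post-I1 value of x1
    x1 = stack.pop()              # restore x1 before its adjoint statement
    x1b, x2b = x2 * x1b, x1 * x1b + x2b   # adjoint of I1
    return x1b, x2b

# Gradient of the second output: Ybar = (0, 1)
g1, g2 = F_gradient(0.5, 2.0, 0.0, 1.0)
# Analytic check: dF2/dx1 = cos(x1*x2)*x2,  dF2/dx2 = cos(x1*x2)*x1 + 1
assert abs(g1 - np.cos(1.0) * 2.0) < 1e-12
assert abs(g2 - (np.cos(1.0) * 0.5 + 1.0)) < 1e-12
```

One backward pass yields the adjoints of both inputs at once, whereas the tangent sketch above would need one run per input direction.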

We can observe that the $X_k$ are required in the inverse of their computation order. If the original program overwrites a part of $X_k$, the differentiated program must restore $X_k$ before it is used by $f'^{*}_{k+1}(X_k)$. There are two strategies for that:

- Recompute-All (RA): recompute the required $X_k$ on demand, by re-running the original program from $X_0$ up to instruction $I_k$;
- Store-All (SA): store each $X_k$ in memory during a preliminary run of the program, and restore it when needed.

Both the RA and SA strategies need a special storage/recomputation trade-off in order to be really profitable, and this trade-off makes them very similar. It is called checkpointing. Since tapenade uses the SA strategy, let us describe checkpointing in this context. The plain SA strategy applied to instructions I1 to Ip builds the differentiated program sketched on figure 1, where

Figure 1. The ``Store-All'' tactic

an initial ``forward sweep'' runs the original program and stores intermediate values (black dots), and is followed by a ``backward sweep'' that computes the derivatives in the reverse order, using the stored values when necessary (white dots). Checkpointing a fragment C of the program is illustrated on figure 2. During the forward sweep, no value is stored while in C. Later, when the backward sweep needs values from C, the fragment is run again, this time with storage. One can see that the maximum storage space is roughly halved. This also requires some extra memorization (a ``snapshot'') to restore the initial context of C. This snapshot is shown on figure 2 by slightly bigger black and white dots.

Figure 2. Checkpointing C with the ``Store-All'' tactic

Checkpoints can be nested. In that case, a clever choice of checkpoints can make both the memory size and the extra recomputations grow like only the logarithm of the size of the program.
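The logarithmic behaviour of nested checkpoints can be sketched on a toy example; the scalar one-instruction "program", its midpoint checkpointing scheme, and all names below are hypothetical illustrations, not the scheme tapenade actually applies:

```python
from math import cos, sin

# Toy recursive-checkpointing sketch: the "program" applies the same scalar
# instruction x := step(x) n times. Instead of storing all n intermediates
# (plain Store-All), we checkpoint at the midpoint and recurse: each
# recursion level keeps exactly one snapshot live, so the number of live
# snapshots grows like log2(n), at the price of re-running fragments.

def step(x):
    return sin(x) + 0.5 * x           # one elementary instruction

def step_adjoint(x, xb):
    return (cos(x) + 0.5) * xb        # transposed derivative of step at x

def run(x, n):
    """Forward sweep of n steps with no storage at all."""
    for _ in range(n):
        x = step(x)
    return x

def adjoint(x, n, xb):
    """Adjoint of the input of n steps, given the snapshot x of their input."""
    if n == 1:
        return step_adjoint(x, xb)
    h = n // 2
    mid = run(x, h)                   # recompute forward, storing nothing
    xb = adjoint(mid, n - h, xb)      # reverse the second half first
    return adjoint(x, h, xb)          # then the first half, from snapshot x

# Check against central differences for n = 8 steps
x0, n = 0.3, 8
g = adjoint(x0, n, 1.0)
eps = 1e-7
fd = (run(x0 + eps, n) - run(x0 - eps, n)) / (2 * eps)
assert abs(g - fd) < 1e-5
```

With this midpoint scheme the recursion depth, hence the snapshot memory, is about log2(n), while the total amount of recomputation grows like n·log(n), which is the trade-off mentioned above.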

