
Page 1: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Robin Hogan, University of Reading

Fast reverse-mode automatic differentiation using expression templates in C++

Page 2: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Overview
• Spaceborne radar and lidar
• Adjoint coding
• Automatic differentiation
• New approach
• Testing with lidar multiple-scattering forward models

Page 3: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Spaceborne radar, lidar and radiometers

The A-Train
– NASA
– 700-km orbit
– CloudSat 94-GHz radar (launch 2006)
– Calipso 532/1064-nm depol. lidar
– MODIS multi-wavelength radiometer
– CERES broad-band radiometer
– AMSR-E microwave radiometer

EarthCARE: launch 2015(?)
– ESA+JAXA
– 400-km orbit: more sensitive
– 94-GHz Doppler radar
– 355-nm HSRL/depol. lidar
– Multispectral imager
– Broad-band radiometer
– Heart-warming name

EarthCARE

Page 4: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

What do CloudSat and Calipso see?

CloudSat radar

CALIPSO lidar

Target classification: Insects, Aerosol, Rain, Supercooled liquid cloud, Warm liquid cloud, Ice and supercooled liquid, Ice, Clear, No ice/rain but possibly liquid, Ground

Delanoe and Hogan (2008, 2010)

• Radar: ~D⁶, detects whole profile, surface echo provides integral constraint

• Lidar: ~D², more sensitive to thin cirrus and liquid but attenuated

• Radar-lidar ratio provides size D

Page 5: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Unified retrieval

Key to the boxes below: Ingredients developed | Implement previous work | Not yet developed

1. New ray of data: define state vector

Use classification to specify variables describing each species at each gate:
– Ice: extinction coefficient, N0’, lidar extinction-to-backscatter ratio
– Liquid: extinction coefficient and number concentration
– Rain: rain rate, drop diameter and melting ice
– Aerosol: extinction coefficient, particle size and lidar ratio

3a. Radar model

Including surface return and multiple scattering

3b. Lidar model

Including HSRL channels and multiple scattering

3c. Radiance model

Solar and IR channels

4. Compare to observations

Check for convergence

6. Iteration method

Derive a new state vector: adjoint of full forward model, quasi-Newton scheme

3. Forward model

Not converged

Converged

Proceed to next ray of data

2. Convert state vector to radar-lidar resolution

Often the state vector will contain a low resolution description of the profile

7. Calculate retrieval error

Error covariances and averaging kernel

Page 6: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Unified retrieval: Forward model

• From state vector x to forward modelled observations H(x)...

Ice & snow Liquid cloud Rain Aerosol

Ice/radar

Liquid/radar

Rain/radar

Ice/lidar

Liquid/lidar

Rain/lidar

Aerosol/lidar

Ice/radiometer

Liquid/radiometer

Rain/radiometer

Aerosol/radiometer

Radar scattering profile

Lidar scattering profile

Radiometer scattering profile

Lookup tables to obtain profiles of extinction, scattering & backscatter coefficients, asymmetry factor

Sum the contributions from each constituent

x

Radar forward modelled obs

Lidar forward modelled obs

Radiometer fwd modelled obs

H(x)

Radiative transfer models

Adjoint of radar model (vector)

Adjoint of lidar model (vector)

Adjoint of radiometer model

Gradient of cost function (vector)

∇xJ = Hᵀ R⁻¹ [y − H(x)]

Vector-matrix multiplications: around the same cost as the original forward operations

Adjoint of radiative transfer models

∇yJ = R⁻¹ [y − H(x)]

Page 7: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Radiative transfer models

Observation | Model | Speed | Status
Radar reflectivity factor | Multiscatter: single-scattering option | N | OK
Radar reflectivity factor in deep convection | Multiscatter: single scattering plus TDTS MS model (Hogan and Battaglia 2008) | N² | OK
Radar Doppler velocity | Single scattering OK if no NUBF; fast MS model with Doppler does not exist | N² | Not available for MS
HSRL lidar in ice and aerosol | Multiscatter: PVC model (Hogan 2008) | N | OK
HSRL lidar in liquid cloud | Multiscatter: PVC plus TDTS models | N² | OK
Lidar depolarization | Multiscatter: under development | N² | In progress
Infrared radiances | Delanoe and Hogan (2008) two-stream source function method | N | No adjoint
Infrared radiances | RTTOV (EUMETSAT license) | N | Disappointing accuracy for clouds
Solar radiances | LIDORT (permissive license) | N | Testing

• After much pain, a hand-coded adjoint now exists for the multiscatter model (in C), but an adjoint is still needed for all the rest of the algorithm (in C++)

Page 8: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Adjoint and Jacobian coding
• Variational retrieval methods are posed as:

– “find the vector x that minimises the cost function J(x)”
• Two common minimization methods:

– The quasi-Newton method requires the “adjoint code” to compute the gradient ∂J/∂x for any x

– The Gauss-Newton method writes the observational part of the cost function as the sum of squared deviations of the observations from their forward-modelled counterparts y = H(x), and requires a code to compute the Jacobian matrix H = ∂y/∂x

• Since J(x) is complicated (containing all of our radiative transfer models), the code to generate ∂J/∂x or ∂y/∂x is even more complicated
– Can it be generated automatically?

Page 9: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Approaches to adjoint coding
• Do it by hand (e.g. ECMWF)

– Painful and time-consuming to debug
– Generates the most efficient code

• Do it numerically: perturb each element of x one by one (see the sketch after this list)
– Inefficient and infeasible for large x
– Subject to round-off error
– What I’m using at the moment with the Unified Algorithm

• Automatic differentiation 1: Use a source-to-source compiler
– E.g. TAPENADE/TAF/TAC++ generate an adjoint source file from the algorithm file: generates quite efficient code
– Commercial: 5k/year for a TAF/TAC++ academic license, and permission is needed to distribute generated source code
– TAPENADE requires uploading the file to a server
– Limited support for C++ classes and no support for C++ templates

• Automatic differentiation 2: Use an operator-overloading technique
– E.g. CppAD, ADOL-C; in principle can work with any language features
– Typically 25 times slower than a hand-coded adjoint!
– Can we do better?
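For concreteness, a minimal sketch of the numerical (finite-difference) approach listed above, applied to the toy algorithm introduced on the next slide; the function names and step size here are illustrative, not from the talk:

#include <cmath>

// Toy forward model from the next slide: y = 4 sin(2*x0 + 3*x1^2)
static double forward_model(const double x[2]) {
  return 4.0 * std::sin(2.0*x[0] + 3.0*x[1]*x[1]);
}

// "Do it numerically": perturb each element of x one by one.
// One extra model run per input, and subject to round-off error.
static void fd_gradient(const double x[2], double dydx[2]) {
  const double y0 = forward_model(x);
  for (int j = 0; j < 2; ++j) {
    double xp[2] = { x[0], x[1] };
    const double h = 1.0e-6 * (std::fabs(x[j]) + 1.0e-6);  // illustrative step size
    xp[j] += h;
    dydx[j] = (forward_model(xp) - y0) / h;                // one-sided difference
  }
}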

Page 10: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Simple example
• Consider a simple algorithm y(x0, x1) = 4 sin(2x0 + 3x1²), contrived for didactic purposes:

• Implemented in C or Fortran90 as:

• Task: given ∂J/∂y, we want to compute ∂J/∂x0 and ∂J/∂x1

double algorithm(const double x[2]) {
  double y = 4.0;
  double s = 2.0*x[0] + 3.0*x[1]*x[1];
  y *= sin(s);
  return y;
}

function algorithm(x) result(y)
  implicit none
  real, intent(in) :: x(2)
  real :: y
  real :: s
  y = 4.0
  s = 2.0*x(1) + 3.0*x(2)*x(2)
  y = y * sin(s)
  return
end function

Page 11: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

• Differentiate the algorithm:
  δs = 2 δx0 + 6 x1 δx1
  δy = sin(s) δy + y cos(s) δs

• Write each statement in matrix form:

• Transpose the matrix to get equivalent adjoint statement:

Creating the adjoint code 1

– In reverse (adjoint) mode, consider the adjoint y_AD as dJ/dy

– In forward (tangent-linear) mode, consider δy as the derivative of y with respect to some chosen input

Page 12: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Creating the adjoint code 2
• Apply adjoint statements in reverse order:

double algorithm_AD(const double x[2], double y_AD[1], double x_AD[2]) {
  double y = 4.0;
  double s = 2.0*x[0] + 3.0*x[1]*x[1];
  double y_in = y;                  /* store value of y before it is overwritten */
  y *= sin(s);
  /* Adjoint part: */
  double s_AD = 0.0;
  s_AD += y_in * cos(s) * y_AD[0];
  y_AD[0] = sin(s) * y_AD[0];
  x_AD[0] += 2.0 * s_AD;
  x_AD[1] += 6.0 * x[1] * s_AD;
  s_AD = 0.0;
  y_AD[0] = 0.0;
  return y;
}

Note: we need to store intermediate values for the reverse pass (here, the value of y before it is overwritten). Hand-coding is time-consuming and error-prone for large codes.

Forward mode:
  δs = 2 δx0 + 6 x1 δx1
  δy = sin(s) δy + y cos(s) δs

Reverse mode:
  s_AD += y cos(s) y_AD
  y_AD = sin(s) y_AD
  x0_AD += 2 s_AD
  x1_AD += 6 x1 s_AD

Page 13: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Automatic differentiation
• We want something like this (now in C++):

• Operators (e.g. +, −, *, /) and functions (e.g. sin, exp, log) applied to adouble objects are overloaded not only to return the result of the operation, but also to store the gradient information on a stack

• Libraries CppAD, SACADO and ADOL-C do this but the result is around 25 times slower than hand-coded adjoints… why?

adouble algorithm(const adouble x[2]) {
  adouble y = 4.0;
  adouble s = 2.0*x[0] + 3.0*x[1]*x[1];
  y *= sin(s);
  return y;
}

// Main code
Stack stack;                    // Object where info will be stored
adouble x[2] = {…, …};          // Set algorithm inputs
adouble y = algorithm(x);       // Run algorithm and store info in stack
y.set_gradient(y_AD);           // Set dJ/dy
stack.reverse();                // Run adjoint code from stored info
x_AD[0] = x[0].get_gradient();  // Save resulting values of dJ/dx0
x_AD[1] = x[1].get_gradient();  // ... and dJ/dx1

Simple change: label “active” variables as a new type

Page 14: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Minimum necessary storage
• What is the minimum storage necessary to hold these statements?

• If we label each gradient by an integer (since their values are unknown in the forward pass), then we need two stacks that can be added to as the algorithm progresses:

• We can then run backwards through the stack to compute adjoints

Statement stack

Index to LHS gradient | Index to first operation
2 (y) | 0
3 (s) | 0
2 (y) | 2
… | …

Operation stack

# | Multiplier | Index to RHS gradient
0 | 2.0 | 0 (x0)
1 | 6.0 x1 | 1 (x1)
2 | sin(s) | 2 (y)
3 | y cos(s) | 3 (s)
4 | … | …
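A minimal C++ sketch of data structures with this layout (names are illustrative, not Adept’s internals):

#include <vector>

// One entry per differential statement: which gradient is on the left-hand side,
// and where its operations start in the operation stack
struct Statement {
  int lhs_index;      // index of the LHS gradient (e.g. 2 for y, 3 for s)
  int first_op;       // index of this statement's first entry in the operation stack
};

// One entry per term on the right-hand side of a differential statement
struct Operation {
  double multiplier;  // e.g. 2.0, 6.0*x1, sin(s), y*cos(s)
  int rhs_index;      // index of the gradient it multiplies
};

std::vector<Statement> statement_stack;
std::vector<Operation> operation_stack;
std::vector<double>    gradient;        // working adjoint for each labelled gradient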

Page 15: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Adjoint algorithm is simple
• Need to cope with three different types of differential statement:

General differential statement (forward mode):
  δy = Σ (i = 0 to n) mᵢ δxᵢ

Equivalent adjoint statements (reverse mode):
  a = y_AD
  y_AD = 0
  for i = 0 to n:  xᵢ_AD += mᵢ a

Page 16: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

…which can be coded as follows

• This does the right thing in our three cases:
– Zero on RHS
– One or more gradients on RHS
– Same gradient on LHS and RHS

1. Loop over derivative statements in reverse order
2. Save gradient
3. Skip if gradient equals 0 (big optimization)
4. Loop over operations
5. Update an adjoint
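The code box itself has not survived in this transcript; the following is a minimal sketch of the reverse pass, following steps 1-5 above and reusing the illustrative Statement/Operation structures sketched after the previous slide (the end-of-operations convention is an assumption):

#include <vector>

void reverse_pass(const std::vector<Statement>& statements,
                  const std::vector<Operation>& operations,
                  std::vector<double>& gradient) {
  for (int ist = (int)statements.size() - 1; ist >= 0; --ist) {  // 1. reverse order
    const Statement& st = statements[ist];
    double a = gradient[st.lhs_index];                 // 2. save gradient of the LHS
    gradient[st.lhs_index] = 0.0;                      //    (the LHS was overwritten)
    if (a == 0.0) continue;                            // 3. skip if zero: big optimization
    int end_op = (ist + 1 < (int)statements.size())    //    operations for this statement end
                   ? statements[ist + 1].first_op      //    where the next statement's begin
                   : (int)operations.size();
    for (int iop = st.first_op; iop < end_op; ++iop) { // 4. loop over operations
      gradient[operations[iop].rhs_index] += operations[iop].multiplier * a;  // 5. update an adjoint
    }
  }
}

Because the LHS adjoint is saved and zeroed before the operations are applied, the same loop handles all three cases listed above, including a gradient appearing on both sides.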

Page 17: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

“Dual numbers” approach
• How can these stacks be created?
• Consider what happens when the compiler sees this line:

• Compiler splits this up into two parts with temporary t:

• We could define adouble as the “dual number” [x, δx] (invented by Clifford in 1873) and then overload sin and operator*:

[sin(s), cos(s)·δs] = sin([s, δs])
[y·t, t·δy + y·δt] = [y, δy] * [t, δt]

• This would correctly apply the chain rule, but only if the gradient terms on the right-hand side are known!

• This is not useful for the reverse (adjoint) mode, where we want to store a symbolic representation of the gradient on the forward sweep that is then filled in on the reverse sweep
– Dual numbers are used in some forward-mode-only (tangent-linear) automatic differentiation tools

y = y * sin(s)

adouble t = sin(s);
y = operator*(y, t);
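A minimal sketch of the dual-number idea for forward mode only (a plain struct with illustrative names; not part of any library):

#include <cmath>

// Dual number [value, derivative]: the derivative is propagated alongside the value
struct Dual {
  double val;   // x
  double der;   // dx/d(chosen input)
};

inline Dual sin(const Dual& a) {                      // [sin(a), cos(a)*da]
  return { std::sin(a.val), std::cos(a.val) * a.der };
}

inline Dual operator*(const Dual& a, const Dual& b) { // [a*b, b*da + a*db]
  return { a.val*b.val, b.val*a.der + a.val*b.der };
}

// With x0 = {value, 1} and x1 = {value, 0}, running the algorithm on Duals
// yields y.der = dy/dx0; a second run with the seeds swapped gives dy/dx1.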

Page 18: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

So how do CppAD & ADOL-C work?

• In the forward pass they store the whole algorithm symbolically, not just the derivative form!

• This means every operator and function needs to be stored symbolically (e.g. 0 for plus, 1 for minus, 42 for atan etc)

• The stored algorithm can then be analysed to generate an adjoint function

• This all happens behind the scenes, so it is easy to use, but it is not surprising that it is around 25 times slower than a hand-coded adjoint

Page 19: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Computational graphs
• The basic problem is that standard operator overloading can only pass information from the most nested operation outwards

Computational graph for y = y*sin(s): an operator* node with children y and sin(s), and a sin node with child s
– sin passes the value of sin(s) up to operator*
– operator* passes y·sin(s) up to become the new y

Page 20: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Implementing the chain rule
– Differentiate the multiply operator: d(y·t) = t·dy + y·dt
– Differentiate the sine function: d(sin s) = cos(s)·ds

Page 21: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Computational graph 2
• Clearly differentiation most naturally involves passing information in the opposite sense

The same graph traversed for differentiation:
– operator* passes y down the sin branch and sin(s) down the y branch
– sin multiplies by cos(s), passing y·cos(s) down to s
– At the ends of the chain, the entries [sin(s), index of y] and [y·cos(s), index of s] are added to the stack

Each node representing an arbitrary function or operator y(a) needs to be able to take a real number w and pass w·dy/da down the chain

A binary function or operator y(a,b) would pass w·dy/da to one argument and w·dy/db to the other

At the end of the chain, store the result on the stack

But how do we implement this?

Page 22: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

What is a template?
• Templates are a key ingredient of generic programming in C++
• Imagine we have a function like this:

• We want it to work with any numerical type (single precision, complex numbers etc) but don’t want to laboriously define a new overloaded function for each possible type

• Can use a function template:

double cube(const double x) {
  double y = x*x*x;
  return y;
}

template <typename Type>
Type cube(Type x) {
  Type y = x*x*x;
  return y;
}

double a = 1.0;
double b = cube(a);            // compiler creates function cube<double>

complex<double> c(1.0, 2.0);   // c = 1 + 2i
complex<double> d = cube(c);   // compiler creates function cube<complex<double> >

Page 23: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

What is an expression template?

• C++ also supports class templates
– Veldhuizen (1995) used this feature to introduce the idea of Expression Templates to optimize array operations and make C++ as fast as Fortran-90 for array-wise operations

• We use it as a way to pass information in both directions through the expression tree:
– sin(A) for an argument of arbitrary type A is overloaded to return an object of type Sin<A>
– operator*(A,B) for arguments of arbitrary type A and B is overloaded to return an object of type Multiply<A,B>
• Now when we compile the statement “y=y*sin(x)”:
– The right-hand side resolves to an object “RHS” of type Multiply<adouble,Sin<adouble> >
– The overloaded assignment operator first calls RHS.value() to get y
– It then calls RHS.calc_gradient() to add entries to the operation stack
– Multiply and Sin are defined with member functions so that they can correctly pass information up and down the expression tree
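A minimal sketch of what the Expression base class hinted at here (and used by the Sin<A> code two slides later) might look like; the cast() helper and exact signatures are assumptions, not Adept’s actual source:

class Stack;   // the recording stack, defined elsewhere

// CRTP base class: every expression node derives from Expression<ConcreteType>,
// so overloaded operators can accept "any expression" while the concrete type
// is still known at compile time (no virtual calls)
template <class A>
class Expression {
public:
  // Recover the concrete type
  const A& cast() const { return static_cast<const A&>(*this); }
  // Forward the two member functions every node must provide
  double value() const { return cast().value(); }
  void calc_gradient(Stack& stack, double multiplier) const {
    cast().calc_gradient(stack, multiplier);
  }
};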

Page 24: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

New approach
• The following types are passed up the chain at compile time:

– y and s are of type adouble
– sin(s) returns an object of type Sin<adouble>
– y*sin(s) returns an object of type Multiply<adouble,Sin<adouble> >

• At run time, gradient information is passed back down the expression tree as before:
– operator* passes y down the sin branch and sin(s) down the y branch
– sin multiplies by cos(s), passing y·cos(s) down to s
– At the ends of the chain, the entries [sin(s), index of y] and [y·cos(s), index of s] are added to the stack

Each function and operator y(a) implements a function calc_gradient that takes a real number w and passes w·dy/da down the chain:


Page 25: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Implementation of Sin<A>

…Adept library has done this for all operators and functions

// Definition of Sin class
template <class A>
class Sin : public Expression<Sin<A> > {
public:
  // Member functions
  // Constructor: store reference to a and its numerical value
  Sin(const Expression<A>& a)
    : a_(a), a_value_(a.value()) { }
  // Return the value
  double value() const { return sin(a_value_); }
  // Compute derivative and pass to a
  void calc_gradient(Stack& stack, double multiplier) const {
    a_.calc_gradient(stack, cos(a_value_)*multiplier);
  }
private:
  // Data members
  const A& a_;        // A reference to the object
  double a_value_;    // The numerical value of object
};

// Overload the sin function: it returns a Sin<A> object
template <class A>
inline
Sin<A> sin(const Expression<A>& a) {
  return Sin<A>(a);
}
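For comparison, a sketch (not from the talk) of how the Multiply<A,B> node mentioned earlier might follow the same pattern; the cast() helper comes from the Expression sketch above, and d(ab)/da = b, d(ab)/db = a supply the multipliers:

template <class A, class B>
class Multiply : public Expression<Multiply<A,B> > {
public:
  // Constructor: store references to both arguments and their values
  Multiply(const Expression<A>& a, const Expression<B>& b)
    : a_(a.cast()), b_(b.cast()), a_value_(a.value()), b_value_(b.value()) { }
  // Return the value of the product
  double value() const { return a_value_ * b_value_; }
  // Pass w*d(ab)/da = w*b down one branch and w*d(ab)/db = w*a down the other
  void calc_gradient(Stack& stack, double multiplier) const {
    a_.calc_gradient(stack, b_value_ * multiplier);
    b_.calc_gradient(stack, a_value_ * multiplier);
  }
private:
  const A& a_;
  const B& b_;
  double a_value_;
  double b_value_;
};

// Overload operator*: it returns a Multiply<A,B> object
template <class A, class B>
inline Multiply<A,B> operator*(const Expression<A>& a, const Expression<B>& b) {
  return Multiply<A,B>(a, b);
}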

Page 26: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Optimizations

• Why are expression templates fast?
– Compound types representing complex expressions are known at compile time
– C++ automatically inlines function calls between objects in an expression, leaving little more than the operations you would put in a hand-coded application of the chain rule

• Further optimizations:
– The Stack object keeps memory allocated between calls, to avoid time spent allocating incrementally more memory
– If the Jacobian is computed, it is done in strips to exploit vectorization (SSE/SSE2 on Intel) and loop unrolling
– The current stack is accessed via a global but thread-local variable, rather than storing a link to the stack in every adouble object (as in CppAD and ADOL-C)
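A minimal sketch of the “global but thread-local” stack access mentioned in the last bullet (names illustrative):

class Stack;                                    // recording stack, defined elsewhere

// One active stack per thread, found via a global rather than via a
// pointer stored in every adouble object
static thread_local Stack* current_stack = nullptr;

inline Stack* active_stack()              { return current_stack; }
inline void   set_active_stack(Stack* s)  { current_stack = s; }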

Page 27: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Testing using lidar multiple-scattering models

• Photon Variance-Covariance method for small-angle multiple scattering
– Hogan (JAS 2008)
– Somewhat similar to a monochromatic radiance model
– Four coupled ODEs are integrated forward in space
– Several variables at N gates give N output signals
– Computational cost proportional to N

• Time-dependent two-stream method for wide-angle multiple scattering
– Hogan and Battaglia (JAS 2008)
– Similar to a time-dependent 1D advection model
– Four coupled PDEs are integrated forward in time
– Several variables at N gates give N output signals
– Computational cost proportional to N²

Page 28: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Simulation of 3D photon transport

• Animation of scalar flux (I⁺ + I⁻)
– Colour scale is logarithmic
– Represents 5 orders of magnitude
• Domain properties:
– 500 m thick
– 2 km wide
– Optical depth of 20
– No absorption

• In this simulation the lateral distribution is Gaussian at each height and each time

Page 29: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Benchmark results

Adjoint | PVC N=50 | TDTS N=50
Hand-coded adjoint | 3.0 (1.0+2.0) | 3.6 (1.0+2.6)
New C++ library: Adept | 3.5 (2.7+0.8) | 3.8 (2.6+1.2)
ADOL-C | 25 (18+7) | 20 (15+5)
CppAD | 29 (15+7+7) | 34 (17+8+9)

• Time relative to original code, gcc-4.4, Pentium 2.5 GHz, 2 MB cache

• Only 5-20% slower than hand-coded adjoint

• 5-9 times faster than leading libraries providing the same functionality

• 4-20 times faster for a 50×350 Jacobian

Page 30: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Outlook
• New library Adept (Automatic Differentiation using Expression Templates) produces an adjoint with minimum difficulty for the user
– No knowledge of templates required by the user at all
– Simple and efficient to compute the Jacobian matrix as well
– Freely available at http://www.met.reading.ac.uk/clouds/adept/

• Typically 5-20% slower than hand-coded adjoints
– But immeasurably faster in terms of programmer time

• Code is complete for applying to any C code with real numbers
• Further development desirable:
– Complex numbers
– Use within C++ matrix/vector libraries, particularly those that already use Expression Templates (like the one I use for the Unified Algorithm)
– Easily facilitate checkpointing so large codes don’t exhaust memory
– Automatically compute higher-order derivatives (e.g. the Hessian matrix)

• Potential for student projects to get small data assimilation systems up and running and efficient quickly

• Impossible to apply in Fortran: no template capability!

Page 31: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++
Page 32: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

and its 2nd derivative (the Hessian matrix): ∂²J/∂x² = Hᵀ R⁻¹ H + B⁻¹

Gradient Descent methods

– Fast adjoint method to calculate ∇xJ means we don’t need to calculate the Jacobian
– Disadvantage: more iterations needed since we don’t know the curvature of J(x)
– Quasi-Newton method to get the search direction (e.g. L-BFGS used by ECMWF): builds up an approximate inverse Hessian A for improved convergence
– Scales well for large x
– Poorer estimate of the error at the end

Minimizing the cost function

Gradient of cost function (a vector): ∇xJ = Hᵀ R⁻¹ [H(x) − y] + B⁻¹ (x − xₐ)

Gauss-Newton method

– Rapid convergence (instant for linear problems)

– Get solution error covariance “for free” at the end

– Levenberg-Marquardt is a small modification to ensure convergence

– Need the Jacobian matrix H of every forward model: can be expensive for larger problems as forward model may need to be rerun with each element of the state vector perturbed

Cost function (xₐ is the a priori state vector, B its error covariance, R the observational error covariance):
  J = ½ [y − H(x)]ᵀ R⁻¹ [y − H(x)] + ½ (x − xₐ)ᵀ B⁻¹ (x − xₐ)

Gauss-Newton iteration:
  xᵢ₊₁ = xᵢ − (∂²J/∂x²)⁻¹ ∇xJ

Quasi-Newton / gradient-descent iteration:
  xᵢ₊₁ = xᵢ − A ∇xJ

Page 33: Robin Hogan University of Reading Fast reverse-mode automatic differentiation using expression templates in C++

Time-dependent 2-stream approx.
• Describe diffuse flux in terms of outgoing stream I⁺ and incoming stream I⁻, and numerically integrate the following coupled PDEs:

• These can be discretized quite simply in time and space (no implicit methods or matrix inversion required)

(1/c) ∂I⁺/∂t + ∂I⁺/∂r = −γ₁ I⁺ + γ₂ I⁻ + S⁺

(1/c) ∂I⁻/∂t − ∂I⁻/∂r = −γ₁ I⁻ + γ₂ I⁺ + S⁻

– Time derivative: remove this and we have the time-independent two-stream approximation
– Spatial derivative: transport of radiation from upstream
– Loss by absorption or scattering: some of the lost radiation will enter the other stream
– Gain by scattering: radiation scattered from the other stream
– Source: scattering from the quasi-direct beam into each of the streams

Hogan and Battaglia (2008, J. Atmos. Sci.)
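A minimal sketch of the kind of simple explicit discretization referred to above, using first-order upwind differences for equations of this two-stream form; the grid, boundary treatment and coefficient names are illustrative, and this is not the actual scheme of Hogan and Battaglia (2008):

#include <vector>

// One explicit time step of coupled equations of the form
//   (1/c) dI+/dt + dI+/dr = -gamma1*I+ + gamma2*I- + S+
//   (1/c) dI-/dt - dI-/dr = -gamma1*I- + gamma2*I+ + S-
void tdts_step(std::vector<double>& Iplus, std::vector<double>& Iminus,
               const std::vector<double>& Splus, const std::vector<double>& Sminus,
               double gamma1, double gamma2, double c, double dt, double dr) {
  const int n = (int)Iplus.size();
  std::vector<double> Ip_new(n), Im_new(n);
  const double adv = c * dt / dr;   // Courant number: must be <= 1 for stability
  for (int i = 0; i < n; ++i) {
    // Upwind spatial differences (zero inflow assumed at the boundaries)
    double dIp = Iplus[i] - (i > 0 ? Iplus[i-1] : 0.0);       // I+ travels towards +r
    double dIm = (i < n-1 ? Iminus[i+1] : 0.0) - Iminus[i];   // I- travels towards -r
    Ip_new[i] = Iplus[i]  - adv * dIp
                + c * dt * (-gamma1*Iplus[i]  + gamma2*Iminus[i] + Splus[i]);
    Im_new[i] = Iminus[i] + adv * dIm
                + c * dt * (-gamma1*Iminus[i] + gamma2*Iplus[i]  + Sminus[i]);
  }
  Iplus.swap(Ip_new);
  Iminus.swap(Im_new);
}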