

INDEX

I. THE N-BODY PROBLEM
   1. History
   2. Governing Equations and Exact Solutions
      2.1 The Hamiltonian system
      2.2 Two body problem
      2.3 Three body problem
   3. Numerical Methods for ODEs
      3.1 Basic ODE theory
      3.2 The Euler Method
          The approximation
          Local and global truncation errors
          Stability and absolute stability
      3.3 The Backward Euler Method
      3.4 Second order methods
      3.5 The Runge-Kutta method
   4. Numerical solution of the N-body problem in MATLAB

II. PERFORMANCE AND PARALLEL COMPUTING
   1. Floating point operation count
   2. Computer Architectures
      2.1 CPU (Central Processing Unit) types
          Vector processors
          Cache based microprocessors
      2.2 Parallel Architectures
          Shared memory
          Distributed memory
          Combined models
      2.3 Top 500 supercomputers
   3. Beginning MPI
      3.1 Example: getting started (Init, Finalize, Comm_size, Comm_rank, Barrier)
          Using Fortran
          Using C
          Using C++
      3.2 Example: computing π (Bcast, Reduce, Allreduce)
          Using Fortran
          Using C
      3.3 Parallel performance: timing your code, scalability (Wtime)
      3.4 Example: Matrix-vector multiplication

III. POISSON EQUATION
   1. Given u, compute approximate ∆u on grid
      1.1 Finite difference approximation of ∆u
      1.2 Evaluate ∆u on grid in serial
      1.3 Evaluate ∆u on grid in parallel (using blocking or nonblocking send/recv)
   2. Iterative Methods to solve Ax = b
      2.1 The Jacobi algorithm
      2.2 Gauss-Seidel and SOR
      2.3 Convergence criteria
      2.4 Implementing Jacobi to solve ∆u = f in parallel
   3. Conjugate Gradient method
      3.1 Positive definite matrices
      3.2 The CG algorithm
      3.3 Convergence criteria
      3.4 Implementation to solve ∆u = f in parallel
   4. Input/Output for parallel codes

IV. FOURIER TRANSFORM
   1. The DFT
      1.1 Vector spaces, basis, inner products
      1.2 Derivation of DFT
      1.3 Signal processing
      1.4 Approximation and Interpolation
      1.5 Solving differential equations in 1D (periodic case)
   2. The FFT
   3. The DFT in 2D
      3.1 Derivation of the 2D DFT
      3.2 Computing the 2D DFT in MATLAB
      3.3 Solving differential equations in 2D (periodic case)
   4. Solving discrete Poisson Equation exactly using FFT (nonperiodic case)
      4.1 1D: Solving u'' = f in [a, b], u(a) = ua, u(b) = ub
      4.2 2D: Solving ∆u = f in D, u = g on ∂D

REFERENCES

I. THE N-BODY PROBLEM

2. Governing Equations

The governing equations for the motion of N objects with mass m_i positioned at x_i, under the action of the forces they induce on each other, follow from Newton's second law of motion (force = mass × acceleration):

\[ m_i \frac{d^2 x_i}{dt^2} = -\sum_{j=1,\ j\neq i}^{N} G\, m_i m_j\, \frac{x_i - x_j}{|x_i - x_j|^3}, \qquad i = 1, \dots, N, \]

or

\[ \frac{d^2 x_i}{dt^2} = -\sum_{j=1,\ j\neq i}^{N} G\, m_j\, \frac{x_i - x_j}{|x_i - x_j|^3} . \qquad (1) \]

The exclusion j ≠ i accounts for the fact that an object does not induce any motion on itself. Eq (1) can be written in terms of a summation kernel K,

\[ \frac{d^2 x_i}{dt^2} = \sum_{j=1,\ j\neq i}^{N} m_j K(x_i - x_j), \qquad \text{where } K(x) = -G\, \frac{x}{|x|^3} . \]

Solving equations of this form, involving sums with large N, remains an important and difficult problem on which much current research is focused (a direct-summation sketch follows the list below). Some applications are:

• Celestial mechanics remains an important field of applications. A description of one of LANL's science runs last year follows; essentially, they are trying to model the dynamics of galaxies to get information about the big bang. "Cosmology is the study of the large-scale structure and evolution of the Universe. Observations of the Universe have revealed a wealth of structure in the form of galaxy clusters, filaments and voids. These objects reflect the global evolution of the Universe as well as the physics of the very early epochs during which cosmic structure arose. The origin of large-scale structure and the evolution of the Universe can be probed by picking a set of cosmological parameters, modeling the growth of structure, and then comparing the model to the observations. This work models the evolution of dark matter in large regions of the Universe using more than a billion particles. The simulations used an advanced parallel treecode algorithm to solve the gravitational N-body problem in unprecedented detail. The image represents the density of dark matter in the Universe. Our galaxy lives in a clump of dark matter similar to the size of the medium-sized objects in the image. We simulated more than 100 different cosmological models during the run on Q. These simulations can be compared with observations of extragalactic structure such as the Sloan Digital Sky Survey."

• Molecular Dynamics (especially protein folding, chemical reactions)

• Electrostatics

• Fluid dynamics (vortex dynamics; the kernel is different, but the same difficulties arise in computing the sum)
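To make the computational difficulty concrete, here is a minimal serial sketch in C of the direct O(N²) evaluation of the accelerations in Eq (1). This is illustrative and not from the notes; the array layout, the function name, and the scaled units (G = 1) are assumptions.

    #include <math.h>

    /* Direct O(N^2) evaluation of the accelerations in Eq (1).
       m: masses; x,y,z: positions; ax,ay,az: output accelerations.
       Scaled units with G = 1 are assumed here. */
    void accelerations(int N, const double *m,
                       const double *x, const double *y, const double *z,
                       double *ax, double *ay, double *az)
    {
        const double G = 1.0;
        for (int i = 0; i < N; i++) {
            ax[i] = ay[i] = az[i] = 0.0;
            for (int j = 0; j < N; j++) {
                if (j == i) continue;            /* exclusion j != i */
                double dx = x[i] - x[j];
                double dy = y[i] - y[j];
                double dz = z[i] - z[j];
                double r  = sqrt(dx*dx + dy*dy + dz*dz);
                double s  = G * m[j] / (r*r*r);
                ax[i] -= s * dx;                 /* minus sign from Eq (1) */
                ay[i] -= s * dy;
                az[i] -= s * dz;
            }
        }
    }

The double loop makes the O(N²) cost explicit; the fast summation algorithms mentioned above reduce this cost.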


2.1. Two body problem

Problem statement:

\[ \frac{d^2 x_1}{dt^2} = -m_2 G\, \frac{x_1 - x_2}{|x_1 - x_2|^3}, \qquad \frac{d^2 x_2}{dt^2} = -m_1 G\, \frac{x_2 - x_1}{|x_2 - x_1|^3} . \]

Subtracting the bottom equation from the top one gives

\[ \frac{d^2 r}{dt^2} = -(m_1 + m_2) G\, \frac{r}{r^3}, \qquad (2) \]

where r = x₁ − x₂ and r = |r|. The solution to this IVP is uniquely determined by initial conditions on position and velocity,

\[ r(0) = r_0 \quad \text{and} \quad v(0) = v_0 . \qquad (3) \]

This equation can be solved analytically (see website). An outline of the solution is as follows.

Step 1: Prove that the angular momentum

\[ h = r \times v \]

is conserved. This has the important consequence that the motion of the two bodies lies in a plane.

Step 2: Take the cross product of Eq (2) with h. Manipulate using vector analysis identities and the fact that dh/dt = 0 to write both sides as a derivative, then integrate to get

\[ v \times h = \mu\, \frac{r}{r} + k, \]

where μ = (m₁ + m₂)G. Take the dot product with r on both sides and solve for r, which gives the solution to (2):

\[ r = \frac{h^2/\mu}{1 + (k/\mu)\cos\gamma}, \]

which is a conic section with eccentricity e = k/μ.

The constants h (magnitude of the angular momentum) and k (magnitude of the constant of integration) can be found from the initial conditions. The solution shows that the only possibility is that r travels on an elliptical (0 ≤ e < 1), hyperbolic (e > 1), or parabolic (e = 1) trajectory. Which shape the trajectory has depends on the initial conditions.

2.2. Three body problem

In contrast to the two body problem, the equations for the three body problem (Eq 1 with N = 3) are generally unsolvable, meaning that in general there exists no analytic solution. Of course, for certain initial conditions one can find exact solutions, such as for the restricted three body problem solved by Lagrange. Here one assumes one body has negligible mass and that the three bodies lie in a plane. One finds solutions in which the massless body sits at a steady equilibrium with respect to the motion of the other two bodies, at the points L1, L2, L3, L4. Satellites are sometimes placed at these equilibria. Example: SOHO, placed near L1. In 1998 it lost control and went into a spin, and contact was lost because it was not pointing in the right direction. It was found again by looking near L1.

Lagrange (1736-1813): claimed by the French (although they almost killed him for being a foreigner), born in Turin, Italy; lived there 30 years, then 20 years in Berlin (where he took over Euler's lead position when Euler went to Russia). A contemporary of, and in contact with, Euler, d'Alembert, Bernoulli, Laplace, Legendre, and Fourier.


3. System of ODEs

    3.1 Basic ODE theory

Given a system of ordinary differential equations (ODEs), what is the correct problem to solve? One of you mentioned the boundary value problem, as opposed to the initial value problem. Consider

Example: y″ = −y, or as a system: y′ = z, z′ = −y. This has solutions y = A sin(t + α), z = A cos(t + α): two parameters, corresponding to a second order system, so two conditions are needed. Note that the two boundary conditions y(0) = y(π) = 0 specify α = 0 but do not specify A. (Plot figure.) Thus this is not a well-defined problem.

While it is certainly possible that boundary conditions uniquely specify the problem, we will consider only IVPs.

Theorem: Consider the IVP

\[ \dot{x} = f(x), \qquad x(0) = x_0 . \qquad (1) \]

Suppose f and all its partial derivatives are bounded and continuous for x in some closed connected set D ⊂ ℝⁿ. Then the IVP (1) has a unique solution, at least on some time interval about 0.

at a fixed time grows arbitrarily large, and does not approach zero. The solution therefore does not depend continuously on the initial data.

Theorem: If f is continuous with bounded continuous derivatives in D, then the IVP (1) is well-posed with respect to any initial condition in D.

Note that often we cannot satisfy the condition in the whole domain, but only in a subregion of the y-space. [For example, if f(y, t) = y², then ∂f/∂y exists and is bounded in any finite region.] Theorem 1 can be used to guarantee a unique solution while y remains in that region. Theorem 2 is valid as long as the perturbation remains in that region. [In the example, perturbations of y(0) > 0 are bounded as long as the perturbation does not reduce y below 0.] Note that the solution may leave the domain in finite time [as in the example].

Example: Write the N-body problem as a system of ODEs. The condition is satisfied in any closed domain in which x_i ≠ x_j.

Refs: Garabedian; John.

    3.2 The Euler Method

We will consider difference methods to solve ODEs. We approximate the solution on t ∈ [0, b] at a sequence of discrete points t_j, called the mesh points. We will assume that these points are equally spaced,

\[ t_j = j\,\Delta t, \quad j = 1, \dots, m, \qquad \Delta t = b/m . \]

A difference method, also called a step-by-step method, provides a rule for computing the approximation at step j,

\[ y_j = \text{approximation of } y(t_j), \]

in terms of the approximation at the previous time, y_{j−1}, and possibly preceding values (in the case of multistep methods).

There are two types of errors that appear: truncation (or discretization) error, from approximating the differential equation by a finite difference equation (local and global truncation error), and roundoff error.

We would like the method to converge: any desired degree of accuracy can be achieved, for an appropriate IVP, by choosing sufficiently small ∆t.

If the IVP is well posed, we would also like the method to be stable: a method is stable if there exists a ∆t₀ such that a change in the starting condition produces a bounded change in the numerical solution at a fixed time, for all 0 < ∆t < ∆t₀.

3.2.1 The Euler Approximation

Using Taylor's Theorem we get

\[ y(h) \approx y(0) + h\, y'(0), \]

or

\[ y_1 \approx y_0 + h f(y_0) . \]

In general, define the Euler approximation:

\[ y_{n+1} = y_n + h f(y_n) . \]


Geometric interpretation: each step advances the solution along the tangent line at (t_n, y_n).

Convergence? Yes: if f is Lipschitz in y and continuous for t ∈ [a, b], then as y₀ → y(0) and h → 0, we have y_n → y(t_n) (with t_n = nh held fixed).
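As an illustration (a sketch, not from the notes; the function pointer interface and the fixed maximum dimension are assumptions), forward Euler for a system y′ = f(y, t) takes only a few lines of C:

    /* Forward Euler: y_{n+1} = y_n + h f(y_n, t_n), for m steps.
       f writes dy/dt into dydt; dim is the system dimension. */
    void euler(void (*f)(const double *y, double t, double *dydt),
               double *y, int dim, double t0, double h, int m)
    {
        double dydt[64];                 /* assumes dim <= 64 */
        for (int n = 0; n < m; n++) {
            f(y, t0 + n*h, dydt);
            for (int k = 0; k < dim; k++)
                y[k] += h * dydt[k];
        }
    }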

3.2.2 Local and global truncation errors

Local truncation error (LTE): the amount by which the solution fails to satisfy the difference method,

\[ y(t_{n+1}) = y(t_n) + h f(y(t_n), t_n) + \text{LTE} . \]

That is, the error that results when a single step is performed with exact input data. If this error vanishes as h → 0, the method is consistent. Lax: consistency plus stability = convergence. From the Taylor expansion: LTE = O(h²).

Global truncation error: if y is three times continuously differentiable, then

\[ y(t_n) - y_n = O(h) . \]

A method with global error O(hᵏ) is called a method of order k. Thus, the Euler method is a first order method. In general, for a method of order k, the LTE is O(h^{k+1}).

For the Euler method it can be shown specifically that

\[ y(t_n) - y_n = h\, g(t_n) + O(h^2) . \]

Richardson extrapolation:

\[ e_h = y(t_n) - y_n^h = h\, g(t_n) + O(h^2), \]
\[ e_{2h} = y(t_n) - y_n^{2h} = 2h\, g(t_n) + O(h^2) . \]

Then

\[ 2e_h - e_{2h} = y(t_n) - \left( 2y_n^h - y_n^{2h} \right) = O(h^2), \]

so 2yₙʰ − yₙ²ʰ is a second order approximation of the solution.
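To see the extrapolation at work, here is a small self-contained C experiment (illustrative, not from the notes) for y′ = −y, y(0) = 1, integrated to t = 1 with steps h and 2h:

    #include <stdio.h>
    #include <math.h>

    /* m steps of forward Euler with step h for y' = -y, y(0) = 1 */
    static double euler_scalar(double h, int m)
    {
        double y = 1.0;
        for (int n = 0; n < m; n++)
            y += h * (-y);
        return y;
    }

    int main(void)
    {
        double h   = 0.01;
        double yh  = euler_scalar(h,   100);   /* step h,  t = 1 */
        double y2h = euler_scalar(2*h,  50);   /* step 2h, t = 1 */
        double ex  = exp(-1.0);
        printf("error(h)    = %.2e\n", fabs(yh - ex));          /* O(h)   */
        printf("error(2h)   = %.2e\n", fabs(y2h - ex));         /* O(h)   */
        printf("error(Rich) = %.2e\n", fabs(2*yh - y2h - ex));  /* O(h^2) */
        return 0;
    }

Halving h should roughly halve the first two errors but quarter the extrapolated one.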

3.2.3 Stability and Absolute Stability

To test stability, we need to check whether small changes grow, assuming f satisfies a Lipschitz condition (or has bounded continuous derivatives). Consider a change from y_n to z_n, so that we are solving

\[ z_{m+1} = z_m + h f(z_m, t_m) \]

instead of

\[ y_{m+1} = y_m + h f(y_m, t_m) . \]

Subtracting and setting e_m = z_m − y_m, we get

\[ |e_{m+1}| \le |e_m| + hL\, |e_m| . \]

Thus

\[ |e_m| \le (1 + hL)^{m-n}\, |e_n| \le e^{bL}\, |e_n| : \]

a bounded change, independent of h.

Stability and convergence are concerned with the limiting process as ∆t → 0.


In practice, we must compute with a finite number of steps, and are really concerned with the size of the errors for such nonzero ∆t. In particular, we want to know whether the errors we introduce at each step (truncation and roundoff) have a small or large effect on the answer. We would therefore like to define absolute stability as follows: "A method is absolutely stable for a given step size h and a given differential equation if the change due to a perturbation of size δ in one of the mesh values y_n is no larger than δ in subsequent values y_m, m > n."

However, this definition is too problem dependent, so we utilize a test equation. We define absolute stability for the differential equation y′ = λy, where λ is a complex constant: the region of absolute stability of a method is that set of values of h and λ for which a perturbation in a single value y_n produces a change in subsequent values which does not increase from step to step.

Euler's method applied to the test equation:

\[ y_{n+1} = y_n + \lambda h\, y_n = (1 + \lambda h)\, y_n . \]

Region of absolute stability: |1 + λh| < 1, a circle in the complex λh plane centered at (−1, 0).

Difference between stability and absolute stability: stability means the error grows by a bounded amount as h → 0; absolute stability means the error does not grow for the test equation.

[Figure: Region of absolute stability for the forward Euler method — the disk |1 + λh| < 1 in the complex λh plane.]

Test case: Solve y′ = −1000(y − t²) + 2t, y(0) = 1, for t ∈ [0, 1] using Euler's method, with ∆t = 1, 0.1, 0.01, 0.001, 0.0001, 0.00001. Explain your results.

3.3 The Backward Euler Method

Instead of using a Taylor series expansion about t_n to approximate y_{n+1} (forward), we can use the expansion about t_{n+1} to approximate y_n (backward):

\[ y(t_n) = y(t_{n+1}) - h f(y(t_{n+1}), t_{n+1}) + O(h^2), \]

or

\[ y(t_{n+1}) = y(t_n) + h f(y(t_{n+1}), t_{n+1}) + O(h^2), \]

which suggests the following method:

\[ y_{n+1} = y_n + h f(y_{n+1}, t_{n+1}) . \]


The LTE is O(h²) (the solution satisfies the difference equation to second order), so this gives a global error of O(h) ⇒ a first order method, called the Backward Euler Method. This method has a much improved region of absolute stability. Applied to y′ = λy, we get

\[ y_{n+1} = y_n + h\lambda\, y_{n+1}, \]

or

\[ y_{n+1} = \frac{1}{1 - h\lambda}\, y_n . \]

The amplification factor |1/(1 − hλ)| is < 1 whenever Re(hλ) < 0, so the region of absolute stability contains the entire left half-plane. However, the problem with this method is that it is implicit: we need to solve a typically nonlinear equation for y_{n+1} at each time step (a sketch follows the figure below). In general, implicit schemes have better absolute stability properties, but are more costly to implement.

[Figure: Region of absolute stability for the backward Euler method — the exterior of the disk |1 − λh| < 1, which contains the entire left half of the complex λh plane.]
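To illustrate the implicit solve, here is a sketch under assumptions not in the notes: a scalar ODE and a user-supplied derivative ∂f/∂y for a Newton iteration.

    #include <math.h>

    /* One backward Euler step for a scalar ODE y' = f(y,t):
       solve y = y_n + h f(y, t_{n+1}) for y by Newton iteration,
       starting from the previous value y_n. fp is df/dy. */
    double backward_euler_step(double (*f)(double y, double t),
                               double (*fp)(double y, double t),
                               double yn, double tnp1, double h)
    {
        double y = yn;                             /* initial guess */
        for (int it = 0; it < 20; it++) {
            double g  = y - yn - h * f(y, tnp1);   /* residual */
            double gp = 1.0 - h * fp(y, tnp1);     /* dg/dy */
            double dy = g / gp;
            y -= dy;
            if (fabs(dy) < 1e-12) break;           /* converged */
        }
        return y;
    }

For the linear test equation f(y, t) = λy this reduces, in a single Newton step, to y_{n+1} = y_n/(1 − hλ), as derived above.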

3.4 Second order methods

We saw that Euler + Richardson extrapolation gives a second order approximation. A second order method (second order results at each timestep) is obtained by applying Euler + Richardson extrapolation at each timestep. What do we get?

\[ y_{n+1/2}^{h/2} = y_n + \tfrac{h}{2}\, f(y_n, t_n), \]
\[ y_{n+1}^{h/2} = y_{n+1/2}^{h/2} + \tfrac{h}{2}\, f(y_{n+1/2}^{h/2}, t_{n+1/2}), \]
\[ y_{n+1}^{h} = y_n + h f(y_n, t_n), \]
\[ y_{n+1} = 2\, y_{n+1}^{h/2} - y_{n+1}^{h} . \]

This can be summarized as

\[ k_1 = h f(y_n, t_n), \]
\[ k_2 = h f(y_n + k_1/2,\; t_n + h/2), \]
\[ y_{n+1} = y_n + k_2 . \]

This is called the midpoint method, since it is an approximation to

\[ \frac{y_{n+1} - y_n}{h} = y'_{n+1/2}, \]

which is a second order approximation to the derivative, as opposed to the first order approximation we used to derive the Euler method.


In general, one can look for a scheme of the form

\[ k_1 = h f(y_n, t_n), \]
\[ k_2 = h f(y_n + \alpha k_1,\; t_n + \beta h), \]
\[ y_{n+1} = y_n + a k_1 + b k_2 . \qquad (3.1) \]

Choose α, β, a, b such that the LTE is as small as possible: write the Taylor expansion for the actual solution, write the Taylor expansion for the method, and get them to match. For LTE = O(h³), we get

\[ a + b = 1, \qquad \alpha = \beta = \frac{1}{2b} . \]

For b = 1, a = 0, α = β = 1/2, we get the midpoint method. For b = a = 1/2, α = β = 1, we get the Heun method, or modified trapezoidal method, since it approximates

\[ \frac{y_{n+1} - y_n}{h} = \frac{y'_n + y'_{n+1}}{2} \]

(convince yourself of this by writing the method out).

3.5 The Runge-Kutta Method

By a similar approach, one can derive 4th order methods. One starts with an Ansatz of the form (3.1), but with four intermediate steps k₁, k₂, k₃, k₄. The condition that the LTE be O(h⁵) yields a set of equations for the unknowns αᵢ, βᵢ, aᵢ, bᵢ which has infinitely many solutions. The most common 4th order scheme derived this way is the 4th order Runge-Kutta scheme (RK4):

\[ k_1 = h f(y_n, t_n), \]
\[ k_2 = h f(y_n + k_1/2,\; t_n + h/2), \]
\[ k_3 = h f(y_n + k_2/2,\; t_n + h/2), \]
\[ k_4 = h f(y_n + k_3,\; t_n + h), \]
\[ y_{n+1} = y_n + (k_1 + 2k_2 + 2k_3 + k_4)/6 . \]

Region of stability:

[Figure: Region of absolute stability for the 4th order Runge-Kutta method.]
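For reference, one RK4 step translates directly into C (a sketch, not from the notes; the interface and the fixed maximum dimension are assumptions; the kᵢ of the scheme equal h times the stored f-values):

    /* One RK4 step for a system y' = f(y,t); f writes dy/dt into out. */
    void rk4_step(void (*f)(const double *y, double t, double *out),
                  double *y, int dim, double t, double h)
    {
        double f1[64], f2[64], f3[64], f4[64], tmp[64];  /* assumes dim <= 64 */
        int i;
        f(y, t, f1);                                     /* k1 = h*f1 */
        for (i = 0; i < dim; i++) tmp[i] = y[i] + 0.5*h*f1[i];
        f(tmp, t + 0.5*h, f2);                           /* k2 = h*f2 */
        for (i = 0; i < dim; i++) tmp[i] = y[i] + 0.5*h*f2[i];
        f(tmp, t + 0.5*h, f3);                           /* k3 = h*f3 */
        for (i = 0; i < dim; i++) tmp[i] = y[i] + h*f3[i];
        f(tmp, t + h, f4);                               /* k4 = h*f4 */
        for (i = 0; i < dim; i++)
            y[i] += h*(f1[i] + 2*f2[i] + 2*f3[i] + f4[i])/6.0;
    }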

3.6 Adams-Bashforth Methods

These are obtained by integrating polynomial approximations to the integrand. The 4th order method is faster (only two function evaluations per timestep) but has a worse region of stability.

[Figure: Region of absolute stability for the 4th order Adams-Bashforth method.]

II. PERFORMANCE AND PARALLEL COMPUTING

1. Floating point operation count

One floating point operation: one addition, subtraction, multiplication, or division. More difficult to count: function calls such as exponentials or trig functions — how many operations are they?

Often reported: FLOPS, the number of floating point operations per second. To obtain it: count the number of operations, get the CPU time, and divide. Try this for your Euler and RK codes (a sketch of such a measurement follows below).
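A minimal sketch of such a measurement in C (not from the notes; the loop and constants are illustrative), counting one multiply and one add per iteration and timing with the standard clock():

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        int    n = 100000000;            /* 1e8 iterations = 2e8 flops */
        double s = 0.0, x = 1.0000001;
        clock_t t0 = clock();
        for (int i = 0; i < n; i++)
            s = s*x + 1.0;               /* one multiply + one add */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("s = %g, %.2f MFlops\n", s, 2.0*n/secs/1e6);
        return 0;
    }

(Printing s prevents the compiler from optimizing the loop away.)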

Caveat: the FLOPS count is easily manipulated. It depends on how efficiently the code is written, how good the compiler is at optimization, and how efficiently the CPU's memory cache is used. For example:

Example: a(b+c) is 2 operations; ab+ac is 3 operations. If both take the same time, the second gives a higher flops count.

Example: Direct summation algorithms are relatively simple and may give a high flops count. Faster and more efficient fast summation algorithms (see later in the course) may give lower flops, although they have a better time-to-solution.

The peak flop count is obtained by assuming the most efficient floating point instructions (usually 1 mult/add per clock cycle) and no memory access costs (all data in cache). For example: a Pentium can do 1 mult/add instruction per clock cycle, so a 3.2 GHz processor has a peak speed of 6.4 GFlops. The peak is generally not achieved: efficient vector machines may get 75%, while a Pentium, on average, gets 10%. Typically, to find the fastest machine for a given application, you need to benchmark.

Cache: memory to which the processor has ready access. An oversimplified model of a cache based CPU: it is expensive to access data from main memory (8 or more clock ticks for one read or write), but every time data is read from or written to main memory, a copy is stored in the memory cache, so future references to this data are much quicker. More details in the next section.

2. Computer Architectures

Supercomputers are used by computational scientists to simulate large scale complex phenomena. There are physical limits on the speed of a single computer (heat dissipation, speed of light). Also, their cost increases more rapidly than their power ⇒ the development of parallel computers. These can come in the form of one machine with many processors (e.g., the SGI Origin series), or in the form of clusters of many machines combined with intercommunication networks, often called switches (e.g., Beowulf). Increasing performance and decreasing prices make it possible for PC clusters to compete with traditional supercomputers.

    2.1 CPU (Central Processing Unit) types

2.1.1 Vector processors:

Vector machines introduced the age of supercomputing, with CRAY, then others; they are still in use today (see below). Idea: operate on an array of similar data items at the same time.

Vector processors are designed to process loops of the form

    for i=1:N
       a(i) = b(i) + h*RHS(i)
    end

(such as in our RK integrators, for example) at close to peak performance. To obtain peak performance, let's assume a "mult/add" instruction, which takes one clock cycle. For every mult/add, we also need to read two data values, b(i) and RHS(i), from main memory, and write the answer a(i) back to main memory (h is reused and will stay in cache after first being read in). Thus we also need to sustain 3 floating point reads or writes per clock cycle. So to run at peak processor performance, the memory speed needs to be at least 3/2 of the CPU speed. This is the defining characteristic of vector processors: very high memory bandwidth for data with regular, vector-like access patterns.


Vector processors typically cannot obtain such memory bandwidth for irregular "indirect addressing", such as if b(i) were replaced by b(index(i)). Vector machines are specially designed for scientific computing applications, and are typically very expensive.

2.1.2 Cache based microprocessors:

In a cache based processor design, such as typically found in PCs, the main memory speed is typically much lower than the processor speed. Hence a small amount of very high speed cache memory (≈ 1 MB) is added to the processor. The cache is used by the processor to keep a copy of frequently accessed data, so that future access is very fast (after the first time the data is read from main memory).

[Diagram: CPU — cache — main memory.]

Memory access speed is measured by memory bandwidth. For example, my latest desktop PC ($1000-2000, depending on how much you can do yourself and how much service/warranty it comes with) has a processor speed of 3.2 GHz (6.4 GFlops peak) and 1 GB of memory at 400 MHz, so the memory is 8 times slower than the processor. This is oversimplified, but for illustration purposes, let's assume it takes 8 clock cycles to read a floating point number from main memory, while the processor can access data in cache memory at no cost.

Returning to our time stepping loop,

    for i=1:N
       a(i) = b(i) + h*RHS(i)
    end

note a couple of things. There is no cache reuse: as the data b(i) and RHS(i) are read in from main memory for the first time, they are used once and never used again. Furthermore, if N is large, the cache will soon fill up, and as the arrays continue to be read from memory, the newer values will replace the older values that were saved in cache. Thus there is little chance for cache reuse in whatever operations are done after this timestepping loop. This loop could therefore potentially run 24x slower than the peak processor performance (3 reads/writes, at 8 clock cycles each = 24 clock cycles per mult/add). It is not possible to obtain anywhere near peak performance, since that requires executing 1 mult/add per clock cycle.

In practice, there are some mitigating factors, such as data prefetching, but our simplified explanation is the main reason cache based microprocessors cannot obtain anywhere near peak performance on large vector loops.

The SSE instruction sets inside a Pentium (mentioned in class), in which 4 operations can be done at one time, can further increase peak performance, but do not solve the memory access problem. For applications written in vector style (in which there is little or no cache reuse), performance is dominated by memory bandwidth. Also, I believe the SSE instructions do only single precision operations, and most scientific codes require double precision.


2.2 Parallel Architectures

Shared memory

Each processor has access to all of a single, shared address space. See the schematic diagram. Example: SGI Origin series. Problem: difficult and expensive to make with more than a few hundred processors. Not considered a cluster.

[Diagram: several processes sharing a single address space.]

Distributed Memory (Clusters)

A set of processors that have only local memory but are able to communicate with other processes by sending and receiving messages. Here, transferring data from the local memory of one process to the local memory of another requires operations to be performed by both processes. A big obstacle in the early 90's: no portable libraries, therefore no portable codes — different vendors, different libraries. In response to this problem, MPI was designed, by a group of researchers that met in 1992, to be portable and efficient. Today it is the standard. The MPI Standard was completed in 1994 and updated in 1997 to include additional features such as parallel I/O and dynamic process management.

[Diagram: nodes with local memory connected by a network.]


Combined models

Groups of processes with shared memory, and message passing between groups. These machines can be treated just like a cluster by the user, with MPI used for message passing. When sending a message between two processors which have access to the same piece of shared memory, MPI will make use of that shared memory to send the message, and will use the network for other messages. A hybrid programming model is also possible: writing code that uses both shared memory and message passing. It can give a small improvement, but usually not enough to justify the extra effort over writing a pure MPI code. We are going to focus only on the MPI programming model.

2.3. Top 500 supercomputers (http://www.top500.org)

Most of them are clusters of small (2-8 CPUs) shared memory "nodes".

[1] Japanese Earth Simulator. 640 nodes; each node has 8 vector processors and shared memory. Achieves 36 TFlops on the Linpack test case. Nodes are connected via a proprietary interconnect.

[2] Thunder (Linux cluster), Livermore. 2048 nodes, 2 processors each. Intel Itanium chip. Interface between nodes: Quadrics. Achieves 20 TF.

[3] Unix cluster, LANL. 2048 nodes, 4 processors each. CPU: Alpha chip. Interface: Quadrics. Achieves 14 TF.

[4] BlueGene/L. 2048 nodes, 2 processors each. IBM PowerPC processors. Proprietary torus. 12 TF.

[5] Tungsten, NCSA (National Center for Supercomputing Applications). 2500 Pentium 4 processors. Myrinet interconnect. 10 TF.

[6] Azul, HPC, UNM. 8 nodes, each 4 CPUs. Intel Xeon chip, 550 MHz. Interconnect: ethernet (conventional). Switched network.

Switched network: communications go up through a tree of switches; nodes are the leaves of the tree.
Ring: nodes are connected to each other in a line.
Mesh: nodes are connected to each other in a 2D mesh.
Torus: nodes are connected to each other in 3D, in a torus shape.

3. Beginning MPI

MPI is a library. It specifies the names, calling sequences, and results of the subroutines to be called from Fortran programs, the functions to be called from C programs, and the classes and methods that make up the MPI C++ library. The programs that users write in Fortran, C, and C++ are compiled with ordinary compilers and linked with the MPI library.

We will introduce MPI functions a few at a time, and use them. Start with:

    MPI_INIT         Initiate an MPI computation
    MPI_FINALIZE     Terminate MPI
    MPI_COMM_SIZE    Find out how many processes there are
    MPI_COMM_RANK    Find out which process I am
    MPI_BARRIER      Stop until all processes have arrived here

3.1 Example: getting started (Init, Finalize, Comm_size, Comm_rank, Barrier)

Using Fortran

The way to use them in Fortran:

    call MPI_INIT(ierr)
    call MPI_FINALIZE(ierr)
    call MPI_COMM_SIZE(comm,numprocs,ierr)
    call MPI_COMM_RANK(comm,myid,ierr)
    call MPI_BARRIER(comm,ierr)

MPI_INIT is required in every MPI program and must be the first MPI call (it can be called only once). Its only argument is an error code. Every Fortran MPI subroutine returns an error code in its last argument, which is either MPI_SUCCESS or an implementation-defined code. We will be sloppy and not test the return codes from our MPI routines, assuming that they will always be MPI_SUCCESS. MPI_SUCCESS is an integer variable defined in 'mpif.h'.

In MPI it is possible to group some of the processes into subgroups. For example, there is a function that computes the sum of a certain variable over all processes in a group; it may be convenient to sum over only a subset of all processes. These subgroups of processes are identified by a communicator handle, the input variable comm in the above examples. All but the first two calls take this communicator handle as an argument. The communicator identifies the process group considered. The default communicator is MPI_COMM_WORLD, which identifies the set of all processes. MPI_COMM_WORLD is another one of the items defined in 'mpif.h'.

MPI_COMM_SIZE returns the number numprocs of processes in the communicator comm. MPI_COMM_RANK returns the id of the current process. We think of the processes in any group as being numbered with consecutive integers beginning with 0, called ranks. By calling MPI_COMM_RANK, each process finds out its rank in the group associated with a communicator. Remember: each processor is going to run the same code, but different processors should be doing different things. So you have to tell them: find out who you are; if you are such-and-such, do this; etc.

A simple Fortran code that uses these commands (in file simple.f):

    program main
    use mpi                  ! alternatively: include 'mpif.h'
    implicit none
    integer numprocs,myid,ierr
    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD,numprocs,ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierr)
    print *, "I am", myid, "of", numprocs
    call MPI_FINALIZE(ierr)
    stop
    end


What does it do? Try it. This is how to compile and run it. To be able to compile your code, you need to make some modifications to the file .login.linux in your home directory (see the instructions in a file on the course website). After you have made these changes, log out and log in, and you are ready to use MPI.

To compile, use

    mpif90 simple.f

or, specifying the include and library directories explicitly:

    pgf90 -I /usr/parallel/mpich-p4/include simple.f -L/usr/parallel/mpich-p4/lib -lmpich

To run on the frontend machine on 4 processors:

    mpirun -np 4 a.out

Now let us use the MPI_BARRIER(MPI_COMM_WORLD,ierr) function in the above example to make sure the print statements are performed in order. The MPI_BARRIER function does not return until all processes in MPI_COMM_WORLD have called it. It forces the processes to wait until all processes have reached the barrier. It is not usually needed in a program, but is good for debugging: if one process is "stuck" somewhere, then none of the other processes will get through the barrier.

Using C

The primary difference between C and Fortran is that in C, error codes are returned as the value of the MPI functions instead of in a separate argument. The included file is, of course, different: 'mpi.h' instead of the mpi module. Finally, the arguments to MPI_Init are different: in C they are the addresses of the usual main arguments argc and argv. A version of the above sample code in C (in file simple.c):

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char *argv[] )
    {
        int numprocs, myid;
        MPI_Init(&argc,&argv);
        MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD,&myid);
        printf("I am %i of %i\n",myid,numprocs);
        MPI_Finalize();
        return 0;
    }

Careful: commands in C are case sensitive. To compile the C code, use mpicc simple.c; to run it, do the same as described earlier.


Using C++

Most C functions become members of C++ classes that one can identify informally in the C bindings as objects. The call names change. Here is the above program coded in C++ (in file simple.cc):

    #include "mpi.h"
    #include <cstdio>

    int main( int argc, char *argv[] )
    {
        int rank, size;
        MPI::Init(argc,argv);
        size = MPI::COMM_WORLD.Get_size();
        rank = MPI::COMM_WORLD.Get_rank();
        printf("I am %i of %i\n",rank,size);
        MPI::Finalize();
        return 0;
    }

To compile the C++ code, use mpiCC simple.cc.

3.2 Example: Computing π (Bcast, Reduce, Allreduce)

Here we introduce the commands

    MPI_BCAST        Sends the value of a variable to all other processes
    MPI_REDUCE       Performs an operation on data in all processes, returns result to one
    MPI_ALLREDUCE    Performs an operation on data in all processes, returns result to all

All of these operations are so-called collective operations: they are called by all the processes in a communicator. In general you want to minimize collective operations, since they require all processes to synchronize at some level; that is, they have to wait for all processes to go through this step, similar to MPI_BARRIER. In parallel programming you don't want tight synchronization; you don't want everyone to be waiting around for something.

MPI_BCAST is an operation used to move data: the sender (identified by its rank) sends a given variable var, with count many items of datatype type, to all processes in the communicator comm. As a result of this call, all processes end up with a copy of var. The data type specification differs in Fortran and C and is given below. Notes: all processes must make the call to Bcast; if a process does not join in the Bcast, then the rest of the processes will wait. The root process will send the same message to all other processes.

To understand how BCAST works, imagine you have something very important to say to everyone. There are a number of ways the information can be disseminated. You could tell everyone individually, but clearly, while you communicate with each person in turn, everyone else is twiddling their thumbs. Alternatively, you can start off a chain of communications which can be thought of as a tree-like sequence.

BCAST is essentially a bunch of send and receive commands (which we will talk about later), and you can implement it yourself using individual send and receive calls (see the sketch after the figure below). Such an implementation may be as efficient as the BCAST call, although generally more sophisticated MPI libraries can make use of special properties of the hardware, together with good algorithms, to be more efficient.


[Figure: Example tree communication for BCAST among 8 processes, rooted at process 0.]
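A sketch of such a hand-rolled tree broadcast from rank 0 (illustrative, not from the notes; it uses MPI_Send/MPI_Recv, which are introduced in section 3.4, and works for any number of processes):

    #include "mpi.h"

    /* Tree broadcast of count doubles from rank 0, following the figure:
       in the round with stride step, ranks < step send to rank + step. */
    void tree_bcast(double *var, int count, int myid, int numprocs)
    {
        MPI_Status status;
        for (int step = 1; step < numprocs; step *= 2) {
            if (myid < step && myid + step < numprocs)
                MPI_Send(var, count, MPI_DOUBLE, myid + step, 0,
                         MPI_COMM_WORLD);
            else if (myid >= step && myid < 2*step)
                MPI_Recv(var, count, MPI_DOUBLE, myid - step, 0,
                         MPI_COMM_WORLD, &status);
        }
    }

Each process does at most one send or receive per round, so the broadcast completes in about log₂(numprocs) rounds.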

Note: there are different MPI libraries. On azul we are using MPICH, a free version of MPI that comes from Argonne National Laboratory. There are other free versions (LAM-Purdue, LAMPI-Los Alamos) and also commercial versions of MPI (MST, HP-Alaska, Cray, SGI, Quadrics, Myrinet); high performance computing vendors usually write their own versions of MPI, customized for their hardware. All MPI libraries are supposed to adhere to the MPI Standard and have the same API (application programming interface). Different versions of MPI can be optimized for different networking fabrics (hardware). For example, the MPI written by Quadrics for use on Quadrics networks has a very efficient broadcast that uses special features of the network switches.

MPI_REDUCE does a collective computation: it applies the operation oper to the data myvar in each process in the communicator comm. The output of the operation is placed in var, in the receiving process receiver only. The arguments count and type are the number of items sent and their data type. The most common operations are:

    MPI_SUM     sum over all myvar
    MPI_MAX     max of all myvar
    MPI_MIN     min of all myvar
    MPI_PROD    product of all myvar

MPI_ALLREDUCE does the same as MPI_REDUCE, except that all processes receive the resulting value var. It is a combination of MPI_REDUCE and BCAST, though more efficient: for example, one can use an efficient tree structure that does a REDUCE and then communicates the result to all.

Using Fortran

Here are the statements to call these functions in Fortran:

    MPI_BCAST(var,count,type,sender,comm,ierr)
    MPI_REDUCE(myvar,var,count,type,oper,receiver,comm,ierr)
    MPI_ALLREDUCE(myvar,var,count,type,oper,comm,ierr)

The most common data types are

    MPI_INTEGER
    MPI_DOUBLE_PRECISION
    MPI_REAL
    MPI_CHARACTER
    MPI_LOGICAL


Now let us use these operations to compute the integral

\[ \int_0^1 \frac{4}{1 + x^2}\, dx = 4\left( \arctan(1) - \arctan(0) \right) = \pi \]

using the trapezoid rule, in parallel. This is a "perfect" parallel program: it can be expressed with a minimum of communication, load balancing is automatic, and we can verify the answer. Here is a possible Fortran code (note: your code should have comments).

    program main
    use mpi
    implicit none
    real*8 pi25dt
    parameter(pi25dt=3.141592653589793238462643d0)
    integer i,n,myid,numprocs,ierr
    real*8 a,b,h,x,f,sum,mypi,pi

c   statement function for the integrand
    f(a) = 4.d0/(1.d0+a*a)

    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD,numprocs,ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierr)

c   print *, 'Process ', myid, ' of ', numprocs, ' is alive'

10  if (myid.eq.0) then
       print*,'enter number of points n (divisible by numprocs)'
       read*,n
    endif
    n=n/numprocs
    call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
    if (n.le.0) goto 30

c   each process integrates over [myid/numprocs, (myid+1)/numprocs]
    a = dble(myid)/numprocs
    b = dble(myid+1)/numprocs
    h = (b-a)/n

c   trapezoid rule on [a,b]
    sum = (f(a)+f(b))/2
    x = a
    do i=1,n-1
       x = x+h
       sum = sum+f(x)
    enddo
    mypi = h*sum
    call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
   +                MPI_COMM_WORLD,ierr)

    if (myid.eq.0) write(*,1000) pi,abs(pi-pi25dt)
1000 format('pi is approximately: ',f18.16,'  Error is: ',f18.16)
    goto 10
30  call MPI_FINALIZE(ierr)
    stop
    end


Using C

Besides the usual changes in the input arguments (see the example below), the data types in C differ from the ones in Fortran. The most common C types are

    MPI_INT, MPI_DOUBLE, MPI_FLOAT, MPI_CHAR

Code to approximate π, in C:

    #include "mpi.h"#include int main( int argc, char *argv[] ){

    int n, myid, numprocs, i, ierr;double pi25dt = 3.141592653589793238462643;double mypi, pi, h, sum, x, f, a,b;

    MPI Init(&argc,&argv);MPI Comm size(MPI COMM WORLD,&numprocs);MPI Comm rank(MPI COMM WORLD,&myid);

    while (1) {if (myid == 0) {

    printf(" Enter total number of intervals (0 quits)");scanf("%d",&n);

    }n=n/numprocs;

    MPI Bcast(&n,1,MPI INT,0,MPI COMM WORLD);if (n==0)

    break;else{

    a = (double)myid/numprocs;b = (double)(myid+1)/numprocs;h = (b-a)/n;

    sum = (4/(1+a*a) + 4/(1+b*b))/2;x=a;for (i=1;i

3.3 Parallel performance: Timing your code, scalability calculations

For parallel programs, measuring the speed of execution is part of testing to see whether a program performs as intended. The function

    double precision MPI_WTIME()    (in Fortran)
    double MPI_Wtime()              (in C)
    double MPI::Wtime()             (in C++)

returns a double-precision floating-point number that is the time in seconds since some arbitrary point of time in the past. That point is guaranteed not to change during the lifetime of a process. Thus a time interval can be measured by calling this routine at the beginning and end of a program segment and subtracting the values returned.

The values returned by MPI_Wtime are not synchronized across processes. That is, you cannot compare a value from MPI_Wtime on one process with a value from another; only the difference between values taken on the same process has a meaning.

To measure the speedup of your program, you need to time only the section that does internal communications and computation; you don't want to include time spent waiting for user input, for example. By obtaining the execution times using various numbers of processes, you can measure the speedup, normally defined as

\[ \text{speedup} = \frac{\text{time for 1 process}}{\text{time for } p \text{ processes}} . \]

A nearly perfect speedup would be a phrase like "a speedup of 97.8 with 100 processors".

LAB #8: Time the code that computes π by approximating an integral. Compute the speedup as a function of the number of processors, for a fixed problem size using n points. Plot the speedup vs. the number of processors.

Note: if you add the lines t1=MPI_Wtime() and t2=MPI_Wtime() just before MPI_Bcast and just after MPI_Reduce, every processor will compute its own time. You can choose to print the runtime of only processor 0, the average runtime, or the max runtime. In the present case they should all be the same, since the problem is so perfectly load-balanced. In general, the last of these may be the most honest/accurate one for scalability purposes.

Results: The following two figures show the results using n = 10⁸ and n = 2·10⁹. The line of slope 1 corresponding to perfect speedup is also shown. The results were computed requesting either the minimal number of nodes needed (coloured curves; different colours correspond to different numbers of nodes requested) or a total of 6 nodes (dashed line in Fig 2).

[Figures: speedup vs. number of processors for n = 10⁸ (left) and n = 2·10⁹ (right), requesting the minimal number of nodes; the line of slope 1 (perfect speedup) is also shown.]

Conclusions: From this experiment I learned that
1. There is a problem with azul: it won't let you use 7 or 8 nodes.
2. The correct speedup is achieved only if the problem is sufficiently big.
3. There is no significant difference in the timings if you request more nodes than needed. (It only blocks other people's access to these nodes.)
4. The startup time required to run the code increases the more nodes are requested, and took up to about 20-25 sec.


There is one more item that is used to determine the scalability of your code: the ratio of communication time to calculation time. Assume m processors are used, each calculation takes time Tcalc, and each communication (sending one item from one processor to another) takes time Tcomm. In the above example (approximating π by a sum of N terms) there are a total of N + 8m operations, plus m inside Reduce. While the exact numbers of communications in Bcast and Reduce are not known to us, they can be estimated as m each, assuming a tree like the one shown in the BCAST figure for both. Thus the ratio of communication to calculation time is

\[ \frac{2m\, T_{\mathrm{comm}}}{(N + 8m + 1)\, T_{\mathrm{calc}}} . \]

We can conclude that for fixed m, the ratio → 0 as the problem size N → ∞ (communication costs become insignificant). For fixed N, the ratio → 1/4 as the number of processors m → N. In either case, communication costs do not dominate the problem.

3.4 Example: Matrix-vector multiplication

We'll introduce two point-to-point operations (as opposed to collective operations): the basic blocking send and receive.

    MPI_Send(address,count,datatype,destination,tag,comm)

(address,count,datatype) describes count occurrences of items of the form datatype starting at address. destination is the rank of the destination in the group associated with the communicator comm. tag is an integer used for message matching.

    MPI_Recv(address,maxcount,datatype,source,tag,comm,status)

(address,maxcount,datatype) describes the receive buffer as in the case of MPI_Send. It is allowable for fewer than maxcount occurrences of datatype to be received. The arguments tag and comm are as in MPI_Send, with the addition that a wildcard matching any tag is allowed. source is the rank of the source of the message in the group associated with the communicator comm, or a wildcard matching any source. status holds information about the actual message size, source, and tag.

The following sample code computes a matrix-vector product. The algorithm is not the most efficient way to do this computation, but the example illustrates how to use the calls. The algorithm: processor zero distributes a subset of rows Aᵢ to each other processor, as well as the vector b. Each processor computes Aᵢb and returns it to processor 0. (For simplicity, assume A is square.) The sender sends the row number in the variable tag; the receiver, if using wildcards for tag and source, can find out the actual tag and source values through status. The code is debugged by comparing with the results from a serial version.

Fortran pseudo-code

      program main
      ... declare variables ...
      ... initialize MPI, find rank, numprocs ...
C ALL PROCESSORS KNOW
      rows=64
      cols=64
      n=rows/numprocs
      if (myid.eq.0) then
C PROCESSOR 0 INITIALIZES a,b
         ... set b(j),a(i,j), i=1,rows, j=1,cols ...
C PROCESSOR 0 SENDS b
         call MPI_BCAST(b,cols,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
C PROCESSOR 0 SENDS ROWS OF a
         do i=1,numprocs-1
            do j=i*n+1,(i+1)*n
               ... set buffer(k)=jth row ...
               call MPI_SEND(buffer,cols,MPI_DOUBLE_PRECISION,i,j,
     +              MPI_COMM_WORLD,ierr)
            enddo
         enddo
C PROCESSOR 0 COMPUTES ITS ENTRIES c(1:n), WHERE c=A*b
         ... set c(j)=sum a(j,k)*b(k) ...
C PROCESSOR 0 RECEIVES ENTRIES c(j) COMPUTED BY OTHERS
         do j=n+1,numprocs*n
            call MPI_RECV(ans,1,MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,
     +           MPI_ANY_TAG,MPI_COMM_WORLD,status,ierr)
            sender = status(MPI_SOURCE)
            rownumber = status(MPI_TAG)
            c(rownumber) = ans
         enddo
      else
C PROCESSOR i RECEIVES b
         call MPI_BCAST(b,cols,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
C PROCESSOR i RECEIVES n ROWS OF a
         do j=1,n
            call MPI_RECV(buffer,cols,MPI_DOUBLE_PRECISION,0,MPI_ANY_TAG,
     +           MPI_COMM_WORLD,status,ierr)
            row(j) = status(MPI_TAG)
            do k=1,cols
               myrow(j,k) = buffer(k)
            enddo
         enddo
C PROCESSOR i COMPUTES ITS ENTRIES c(j) AND SENDS THEM BACK
         do j=1,n
            ... compute ans=sum myrow(j,k)*b(k) ...
            call MPI_SEND(ans,1,MPI_DOUBLE_PRECISION,0,row(j),
     +           MPI_COMM_WORLD,ierr)
         enddo
      endif
      ... finalize MPI, stop and end ...


Scalability analysis

Number of computations: each dot product takes n multiplies and about n additions, and there are n of them, for a total of about n(2n + 1) operations. Messages passed (assuming b is already distributed): n² + n. The ratio of communication to calculation time is therefore

    (n² + n) Tcomm / [(2n² + n) Tcalc] = O(1)

That is, the ratio is independent of the number of processors used, and as the problem size n → ∞ the ratio remains constant. Communication costs cannot be decreased by increasing the problem size; however, they also don't become prohibitive as the problem size increases.

Until now, we modelled the communication time to send k items by

    T(k) = k β ,

where β = Tcomm. In reality, a better model for one Send or Recv command is

    T(k) = α + k β ,

where k is the number of items sent. α is called the latency (the time it takes to set up the message before the first item is sent) and β is the time per item after setup (the reciprocal of the bandwidth). Can you come up with an experiment that lets you determine latency and bandwidth?
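A standard experiment (a sketch of our own, not from the notes) is a "ping-pong" test: processor 0 sends k items to processor 1, which immediately sends them back; half the round-trip time estimates T(k) = α + kβ. Repeating for several values of k and fitting a straight line to the measurements gives the latency (intercept) and the per-item time (slope). The fitting step in MATLAB, with hypothetical measured times:

    % Hypothetical one-way ping-pong timings: k(i) items took t(i) seconds
    k = [1 10 100 1000 10000];                % message sizes (items)
    t = [1.1e-4 1.2e-4 1.9e-4 9.0e-4 8.1e-3]; % assumed measurements
    p = polyfit(k,t,1);                       % least-squares line t = alpha + beta*k
    beta  = p(1);                             % slope: time per item
    alpha = p(2);                             % intercept: latency
    fprintf('latency = %.2e s, time per item = %.2e s\n', alpha, beta)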

    III. POISSON EQUATION

1. Computing ∆u in parallel

Given a function of 2 variables u(x, y),

    ∆u = uxx + uyy ,

where subscripts denote partial differentiation, ux = ∂u/∂x, uxx = ∂²u/∂x². This generalizes to higher dimensions: for example, if u(x, y, z) then ∆u = uxx + uyy + uzz, etc.

1.1 Finite difference approximation of ∆u

Introduce a grid. We want to compute ∆u on a square domain (x, y) ∈ [a, b] × [c, d]. Discretize the domain by a mesh (xi, yj), i = 0, . . . , n, j = 0, . . . , m, where xi = a + i∆x, yj = c + j∆y. Often one chooses ∆x = ∆y = h, so that the resolution is equal in both directions.

We now want to estimate ∆u at the gridpoints, using finite difference approximations of the derivatives. To do so, we start with Taylor series. Consider a function f(x) of one variable only. Let's approximate the second derivative f′′(a) using the Taylor series approximations

    f(a + h) = f(a) + h f′(a) + (h²/2) f′′(a) + (h³/6) f′′′(a) + O(h⁴)
    f(a − h) = f(a) − h f′(a) + (h²/2) f′′(a) − (h³/6) f′′′(a) + O(h⁴)

Adding these two we obtain

    f(a + h) + f(a − h) = 2 f(a) + h² f′′(a) + O(h⁴)

or

    f′′(a) = [f(a + h) − 2 f(a) + f(a − h)]/h² + O(h²)

Thus we obtained a second order finite difference approximation for the second derivative. Higher order approximations are obtained by using more Taylor series (about a ± 2h, for example) to remove the O(h⁴) term as well.
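A quick numerical check confirms the second order convergence: halving h should divide the error by about 4. A minimal MATLAB sketch (the test function and the point a are arbitrary choices of ours):

    % Verify the centered difference for f''(a) is O(h^2)
    f = @(x) exp(sin(x));                      % any smooth test function
    a = 0.7;
    exact = (cos(a)^2 - sin(a))*exp(sin(a));   % f''(a) computed by hand
    for h = [0.1 0.05 0.025 0.0125]
        approx = (f(a+h) - 2*f(a) + f(a-h))/h^2;
        fprintf('h = %7.4f   error = %.3e\n', h, abs(approx-exact))
    end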

∆u(xi, yj) can be approximated by using the above 1D approximation in each of the x and y directions. Let vi,j = v(xi, yj) for any function v of two variables. Then

    (∆u)i,j = [ui−1,j − 2ui,j + ui+1,j]/∆x² + [ui,j−1 − 2ui,j + ui,j+1]/∆y² + O(∆x², ∆y²)

1.2 Evaluating ∆u on the grid

Assume a square domain, n = m. We will write a routine that solves: given u on the domain, compute ∆u in the interior. We will need to call this routine repeatedly to update u in the interior, in the iterative methods we will discuss next. We will first write a serial code, then a parallel code. Check the parallel code by comparing to the serial code and by checking that the approximation to the Laplacian is indeed second order.

Serial pseudocode

      program serial
      ... initialize a,b,c,d,m,n; set delx,dely ...
C SET GRIDPOINTS
      do i=0,n+1
         x(i)=a+i*delx
      enddo
      do j=0,m+1
         y(j)=c+j*dely
      enddo
C SET U(X,Y) AT GRIDPOINTS
      call setu0(u,n,m,x,y,nmax)
C COMPUTE LAPLACIAN AT INTERIOR GRIDPOINTS
      call lapu(u,lapu,n,m,delx,dely,nmax)
      call error(lapu,x,y,n,m,nmax)
      end

      subroutine setu0(u,n,m,x,y,nmax)
C WE SET U IN THE INTERIOR AND AT GHOSTPOINTS. IF GHOSTPOINTS LIE ON THE
C BOUNDARY, THE VALUES THERE NEVER CHANGE IN THE ITERATIVE SCHEME.
C PROBLEM: IN THE PARALLEL CODE, GHOSTPOINTS MAY BE INTERIOR.
      do i=0,n+1
         do j=0,m+1
            u(i,j)=u0(x(i),y(j))
         enddo
      enddo
      end

      subroutine lapu(u,lapu,n,m,delx,dely,nmax)
      do i=1,n
         do j=1,m
            lapu(i,j)=(u(i,j+1)+u(i,j-1)-2*u(i,j))/dely**2
     +               +(u(i-1,j)+u(i+1,j)-2*u(i,j))/delx**2
         enddo
      enddo
      end


The subroutine error tests the code by computing the maximum error abs(lapu(i,j) − exactlap(i,j)) over all i,j and printing it. This maximum error should decrease like max(delx², dely²).

Parallel pseudocode

For the parallel code we will break up the data into np × mp = numprocs blocks, to be processed by separate processors. Each block is identified by a pair of integers (ip, jp), where ip = 0, . . . , np − 1 and jp = 0, . . . , mp − 1. We will need a function that
(1) given the rank of a processor, finds its indices (ip, jp);
(2) given its indices (ip, jp), finds the corresponding rank. It will be useful to return a negative rank if the indices are out of bounds, that is ip < 0 or ip > np − 1 or jp < 0 or jp > mp − 1.

      subroutine convert(rank,ip,jp,np,mp,icode)
      if (icode.eq.0) then          ! given ip,jp, find rank
         if (ip or jp out of bounds) then
            set rank=-1 (or MPI_PROC_NULL)
         else
            rank = jp*np+ip
         endif
      else if (icode.eq.1) then     ! given rank, find ip,jp
         jp=rank/np
         ip=rank-jp*np
      else
         stop 'error in convert'
      endif
      end

For example, with np = 3 and mp = 2, the block (ip, jp) = (1, 1) has rank 1·3 + 1 = 4, and conversely rank 4 gives jp = 4/3 = 1 (integer division) and ip = 4 − 3·1 = 1.

We also need to decide how to distribute the points within each block. Just as in the serial code, we want the points to be indexed by i=0,...,n+1, j=0,...,m+1, where the ghostpoints (where boundary data is given) are i=0, i=n+1, j=0, and j=m+1.

So: in the x-direction there are n interior points in each block (boundary points overlap with the neighbour). Thus there are n·np interior points and 2 boundary points, for a total of n·np + 2 points, or n·np + 1 intervals. Thus

    ∆x = (b − a)/(n·np + 1) ,   ∆y = (d − c)/(m·mp + 1)

To set the coordinates in each block we need the bottom left corner coordinates

    xc = a + ip·n·∆x ,   yc = c + jp·m·∆y

Therefore the coordinates are

    x(i) = xc + i∆x ,   i = 0, . . . , n + 1
    y(j) = yc + j∆y ,   j = 0, . . . , m + 1

    Write a routine

    subroutine setcoord(x,y,n,m,ip,jp,np,mp,a,b,c,d,delx,dely)

that computes the coordinates of the (ip, jp)th processor, and also returns delx and dely. Now, to the main code:


      program parallel
C declare all variables
C initialize MPI, rank, numprocs
C initialize np,mp; check that np*mp=numprocs. Find ip,jp.
C find myright = rank of right neighbour (ip+1,jp)
C find myleft, mytop, mybot
C initialize a,b,c,d,n,m
      call setcoord(x,y,n,m,ip,jp,np,mp,a,b,c,d,delx,dely)
      call setu0(u,n,m,x,y,nmax)
C THE FOLLOWING STEPS WILL BE PERFORMED REPEATEDLY IN THE ITERATIVE SCHEMES.
C IN THESE SCHEMES, U(INTERIOR) CHANGES, WHILE U(BOUNDARY) DOES NOT.
      call setghost(u,n,m,ip,jp,np,mp,buffer,nmax)
      call lapu(u,lapu,n,m,delx,dely,nmax)

We have everything except for setghost. Here we need to keep in mind that (later on) we will call lapu repeatedly. That means that u in the interior may change, while u on the boundary does not. So before every call to lapu we need to call setghost to reset the interior ghostpoints.

Here there are two options.

OPTION 1

      subroutine setghost(u,n,m,ip,jp,np,mp,buffer,nmax)
C send column i=n to myright
C receive column i=n+1 from myright
C repeat this process to the left, top and bottom

With this routine everyone is trying to send to the right simultaneously. Since everyone who is trying to send cannot receive (these are blocking sends and receives), the program could lock up. Except that some of the processors cannot send to the right, namely the ones with ip = np − 1. Those will therefore skip this step and move on to the receive stage. This sets off a chain reaction and eventually everyone can execute their send and receive. While this is not very efficient (all are waiting until it is their turn), you may try to get this code to work without locking up.

Note that if I don't have a right neighbour, the routine convert returns a negative rank. If you use MPI_PROC_NULL as the rank in MPI_SEND and MPI_RECV, the call simply has the effect that nothing is sent or received (these routines send and receive only to processors with the specified rank; if this rank does not exist, there is no send/receive). With a plain negative rank, however, I still need an if statement: if rank(myright) is nonnegative, place the column in the buffer, then send and receive:

      call convert(myright,ip+1,jp,np,mp,0)
      if (myright.ge.0) then
         do j=1,m
            buffer(j)=u(n,j)
         enddo
         call MPI_SEND(buffer,m,MPI_DOUBLE_PRECISION,myright,1,
     +        MPI_COMM_WORLD,ierr)
         call MPI_RECV(buffer,m,MPI_DOUBLE_PRECISION,myright,MPI_ANY_TAG,
     +        MPI_COMM_WORLD,status,ierr)
         do j=1,m
            u(n+1,j)=buffer(j)
         enddo
      endif

There is a better alternative that lets everyone work simultaneously and eliminates the chance of lockup: if your ip is even, send first, then receive; else receive first, then send.

Here is this second option.

OPTION 2

      subroutine setghost(u,n,m,ip,jp,np,mp,buffer,nmax)
C if ip is even
C    send column i=n to the right
C    receive column i=n+1 from the right
C    send column i=1 to the left
C    receive column i=0 from the left
C else
C    receive column i=0 from the left
C    send column i=1 to the left
C    receive column i=n+1 from the right
C    send column i=n to the right
C endif
C NOW DO THE SAME IN THE Y-DIRECTION

    2. Iterative Methods to solve Ax = b

2.1 The Jacobi algorithm

... For our purposes, rewrite the algorithm as

    x^(k+1) = x^(k) + D⁻¹ (b − A x^(k))

where D is the diagonal part of A. In this form we only need a routine that evaluates Ax. Stop the iteration when the max-norm of the residual is less than a prescribed tolerance ε.
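As a concrete illustration (a sketch of our own, not part of the notes), here is this form of the Jacobi iteration in MATLAB for a general matrix A:

    % Jacobi iteration in residual form: x <- x + D^{-1}(b - A*x)
    function x = jacobi(A,b,tol,kmax)
    d = diag(A);                  % diagonal of A (assumed nonzero)
    x = zeros(size(b));           % initial guess
    for k = 1:kmax
        r = b - A*x;              % residual
        if max(abs(r)) < tol      % stop when max-norm of residual is small
            return
        end
        x = x + r./d;             % Jacobi update
    end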

    2.2 Gauss-Seidel and SOR algorithms

2.3 Convergence criteria

Define positive definite matrices.

    2.4 Implementing Jacobi to solve ∆u = f

    Discretizing the equation ∆u = f as described in 1.1, with ∆x = ∆y = h, leads to the linear system

    Au = f

    for the approximate solution uij ≈ u(xi, yj), where

    u = (u11, u12, . . . , u1m, u21, u22, . . . , u2m, . . . , un1, . . . , unm)T

    f = (f11, f12, . . . , f1m, f21, f22, . . . , f2m, . . . , fn1, . . . , fnm)T


fij = f(xi, yj), are vectors of length n·m, and A is the matrix

    A (nm × nm) = (1/h²) · [ T  I               ]
                           [ I  T  I            ]
                           [    .  .  .         ]
                           [       I  T  I      ]
                           [          I  T      ]

where T is the n × n tridiagonal matrix

    T = [ -4  1              ]
        [  1 -4  1           ]
        [     .  .  .        ]
        [        1 -4  1     ]
        [           1 -4     ]

and I = diag(1, 1, . . . , 1) is the n × n identity matrix. We will solve this system with Jacobi's method. However, to do so, note that we never need to assemble the full matrix A, nor do we need to know its specific structure. We only need:

(1) A routine that evaluates Au, which we have: the routine lapu (p27) returns the variable

        lapu = Au + c

    where c contains terms given by the boundary values u0,k, un+1,k, uj,0, uj,m+1. Remember that these values are not part of the unknown vector u, but are known.

(2) To know that A is not positive definite (you can see this from the fact that the diagonal elements Ajj = ejT A ej < 0), but −A is. We will not prove this; just note that this is analogous to the discretization of the 1D Poisson equation −uxx = f considered earlier. For convergence, to solve lapu = f we need to apply Jacobi's method to

        −Au = −f + c .

    The residual is then

        residual = (−f + c) − (−Au) = (−f + c) − (−lapu + c) = lapu − f

(3) To compute D⁻¹ we need the values of −A on the diagonal. They are

        −Ajj = 4/h²

Jacobi's method thus reduces to

    u^(k+1) = u^(k) + (h²/4) (lapu − f)

Following is an outline of a code in serial and in parallel to solve ∆u = f in the interior of a domain D, given u = g on the domain boundary.

Serial pseudocode:

    Initialize u on boundary
    Initialize f in interior
    for k=1:kmax
        call lapu                         % obtain lapu = Au + c
        set uij = uij + (h²/4)(lapuij − fij)
        if ||residual||∞ < ε  DONE

Parallel pseudocode:

    Initialize u on boundary, if exterior boundary
    Initialize f in interior
    for k=1:kmax
        call setghost                     % set u on interior boundaries
        call lapu                         % obtain lapu = Au + c
        set uij = uij + (h²/4)(lapuij − fij)
        if ||residual||∞ < ε  DONE
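To make the serial version concrete, here is a minimal MATLAB sketch of our own (the test problem and parameters are arbitrary choices), solving ∆u = f on the unit square with u = 0 on the boundary:

    % Jacobi for the 5-point Laplacian: u <- u + (h^2/4)(lapu - f)
    n = 50; h = 1/(n+1); tol = 1e-6; kmax = 100000;
    [X,Y] = meshgrid(h*(0:n+1));            % grid including boundary
    f = -2*pi^2*sin(pi*X).*sin(pi*Y);       % rhs; exact u = sin(pi x)sin(pi y)
    u = zeros(n+2);                          % u = 0 on the boundary
    I = 2:n+1;                               % interior indices
    for k = 1:kmax
        lapu = (u(I-1,I)+u(I+1,I)+u(I,I-1)+u(I,I+1)-4*u(I,I))/h^2;
        r = lapu - f(I,I);                   % residual = lapu - f
        u(I,I) = u(I,I) + (h^2/4)*r;         % Jacobi update
        if max(abs(r(:))) < tol, break, end
    end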

3. Conjugate gradient method

3.1 Positive definite matrices

Definition: A symmetric n × n matrix A is positive definite if xT A x > 0 for all nonzero vectors x ∈ ℝⁿ.

(2) The search directions are A-conjugate:

        pkT A pj = 0 , for j < k

As a result:

(3) xk ∈ Kk = Span{p0, p1, p2, . . . , pk−1}, the subspace spanned by p0 through pk−1. At each step, the error ||ek||A is minimized over xk ∈ Kk, where ek = xk − x, x is the exact solution, and ||e||A = √(eT A e). (See picture.)

⇒ The algorithm converges if A is symmetric positive definite, in at most n steps!

3.4 Implementation to solve ∆u = f in parallel

    4. Input/output for parallel codes


IV. FOURIER TRANSFORM

    1. Discrete Fourier Transform (DFT)

    1.1 Review: Vector spaces, basis, inner products

    A vector space V is a set of objects, called vectors, that is closed under addition and multiplication bya scalar, where these two rules have to be defined and have to satisfy certain conditions. (For details,check your linear algebra book.) A basis for a vector space is a set of vectors B = {b1,b2, . . . ,bn}such that any vector a in V can be written uniquely as a linear combination of the elements of B,

    a = c1b1 + c2b2 + . . . + cnbn

    where the ci are scalars, usually either real or complex numbers. Every basis for a vector spacehas the same number of elements, and this number is called the dimension of the space. Examplesof common vector spaces are

Since e^{ikx} = cos kx + i sin kx, this is a space of trigonometric polynomials with period 2π. The parameter k is the wavenumber of the basis function, and equals the frequency of the oscillation in cos kx or sin kx, that is, the number of oscillations per period. Define the inner product

    ⟨f, g⟩ = (1/2π) ∫₀^{2π} f(x) g̅(x) dx

where the overline denotes the complex conjugate. Note that the inner product of any two elements of the basis is

    ⟨ e^{ijx}/√N , e^{ikx}/√N ⟩ = (1/(2πN)) ∫₀^{2π} e^{ijx} e^{−ikx} dx = . . . = { 0    if k ≠ j
                                                                                   { 1/N  if k = j .

Thus the basis is orthogonal: any two basis elements are orthogonal to each other. This is an important property which we will make use of.

Let f be a function in this vector space, that is

    f(x) = Σ_{k=−N/2}^{N/2−1} ck e^{ikx}/√N .        (1)

The Fourier coefficients ck/√N measure the energy that f has at the frequency k. What are the ck? Take the inner product of f with a basis element:

    ⟨ f(x), e^{ijx}/√N ⟩ = ⟨ Σ_{k=−N/2}^{N/2−1} ck e^{ikx}/√N , e^{ijx}/√N ⟩
                         = Σ_{k=−N/2}^{N/2−1} ck ⟨ e^{ikx}/√N , e^{ijx}/√N ⟩ = cj/N

For the second equality we used the linearity of the inner product. For the last equality we used the fact that the inner product of basis elements is zero unless k = j, in which case it equals 1/N. Thus

    cj = √N ⟨ f(x), e^{ijx} ⟩ = (√N/2π) ∫₀^{2π} f(x) e^{−ijx} dx        (2)

The cj are called the Fourier transform of f(x).

Now we come to the discrete part of the DFT. In real applications such as signal processing, we do not have a mathematical formula for the signal function f(x). Rather, the signal is sampled at discrete values xk. Can we find values xk such that the integral (2) can be evaluated exactly? This is the question that quadrature methods, such as Gauss quadrature, address. In this case the answer is YES: if the xk are equally spaced, xk = 2πk/N, k = 0, . . . , N − 1, then the trapezoid rule evaluates the integral exactly! (Note that we excluded xN since by periodicity f(x0) = f(xN).)

Theorem: If f(x) is in the space spanned by BN, that is f(x) = Σ_{l=−N/2}^{N/2−1} cl e^{ilx}/√N, then

    ∫₀^{2π} f(x) e^{−ijx} dx = Σ_{k=0}^{N−1} f(xk) e^{−2πijk/N} · (2π/N)

This is a strong result. For any other choice of the xk we could only approximate the integral numerically. Consequently, for f ∈ V,

    cj = (1/√N) Σ_{k=0}^{N−1} f(xk) e^{−2πijk/N}        (3)


Proof: Because of the linearity of the integral, it is sufficient to show that the statement holds for the basis functions. Since

    ⟨e^{ilx}, e^{ijx}⟩ = ∫₀^{2π} e^{ilx} e^{−ijx} dx = { 0   if l ≠ j
                                                        { 2π  if l = j ,

we need to show that

    Σ_{k=0}^{N−1} e^{2πilk/N} e^{−2πijk/N} · (2π/N) = { 0   if l ≠ j
                                                       { 2π  if l = j

or equivalently

    Σ_{k=0}^{N−1} e^{2πimk/N} = { 0  if m ≠ 0
                                 { N  if m = 0        (4)

where m = l − j. Note that since both l and j range from −N/2 to N/2 − 1, the maximum value m attains is N − 1 and the minimum value is −N + 1. The second equation in (4) is easy to show, since if m = 0

    Σ_{k=0}^{N−1} e^{2πimk/N} = Σ_{k=0}^{N−1} 1 = N

For the first equation in (4) we need a fact from calculus that is easy to show: if

    s = Σ_{k=0}^{N−1} r^k = 1 + r + r² + . . . + r^{N−1}

then

    rs = r + r² + . . . + r^{N−1} + r^N

and s − rs = 1 − r^N, yielding

    s = (1 − r^N)/(1 − r) .

Applying this to our sum we get

    Σ_{k=0}^{N−1} e^{2πimk/N} = Σ_{k=0}^{N−1} (e^{2πim/N})^k = [1 − (e^{2πim/N})^N]/[1 − e^{2πim/N}] = [1 − e^{2πim}]/[1 − e^{2πim/N}] = 0

but only if the denominator is not equal to zero! As long as −N + 1 ≤ m ≤ N − 1 and m ≠ 0, which we showed to be the case, the denominator is not zero, and the result holds. Note that if m is outside this range, that is l not in [−N/2, N/2 − 1], that is f not in Span BN, then the result does not hold.
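A one-line MATLAB check of identity (4), with N and the values of m chosen arbitrarily by us:

    N = 8;
    for m = [-3 0 2 5]
        s = sum(exp(2i*pi*m*(0:N-1)/N));
        fprintf('m = %2d   sum = %8.2e %+8.2ei\n', m, real(s), imag(s))
    end
    % prints approximately 0 for m ~= 0, and N = 8 for m = 0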

Note that this theorem states that the trapezoid rule with N points integrates the 2N − 1 basis functions e^{imx}, m = −N + 1, . . . , N − 1, exactly. Such a result is typical of quadrature rules: exactness on a space of higher dimension than the number of points.


Let fk = f(xk). Then the results so far state that

    fk = (1/√N) Σ_{j=−N/2}^{N/2−1} cj e^{2πijk/N} ,   k = 0, . . . , N − 1        (5a)

    cj = (1/√N) Σ_{k=0}^{N−1} fk e^{−2πijk/N} ,   j = −N/2, . . . , N/2 − 1        (5b)

Note that equation (5a) specifies a linear map Ac = f that maps the N values c = (c−N/2, . . . , cN/2−1) to the N values f = (f0, . . . , fN−1). Equation (5b) states that this map is invertible and specifies the inverse. Let us write these maps, called the inverse DFT (5a) and the DFT (5b), in matrix notation.

But first let us rewrite these equations somewhat. Note that

    fk = (1/√N) ( Σ_{j=−N/2}^{−1} cj e^{2πijk/N} + Σ_{j=0}^{N/2−1} cj e^{2πijk/N} )

       = (1/√N) ( Σ_{j′=N/2}^{N−1} cj′−N e^{2πi(j′−N)k/N} + Σ_{j=0}^{N/2−1} cj e^{2πijk/N} )

       = (1/√N) ( Σ_{j′=N/2}^{N−1} cj′ e^{2πij′k/N} + Σ_{j=0}^{N/2−1} cj e^{2πijk/N} ) ,

where j′ = j + N. The last equation follows if we extend the cj periodically, cj′ = cj′−N, which is consistent with (5b) since e^{2πiNk/N} = e^{2πik} = 1:

    cj′−N = (1/√N) Σ_{k=0}^{N−1} fk e^{−2πi(j′−N)k/N} = (1/√N) Σ_{k=0}^{N−1} fk e^{−2πij′k/N} = cj′

Now we just drop the prime on the summation index to get the inverse and forward DFTs:

    fk = (1/√N) Σ_{j=0}^{N−1} cj e^{2πijk/N} ,   k = 0, . . . , N − 1        (6a)

    cj = (1/√N) Σ_{k=0}^{N−1} fk e^{−2πijk/N} ,   j = 0, . . . , N − 1        (6b)

To write these in matrix notation, let ω = e^{2πi/N}, that is, a primitive Nth root of unity, ω^N = 1. Then equations (6a,b) are rewritten as

    f = F⁻¹ c ,   c = F f

where F and F⁻¹ are the N × N matrices


    F⁻¹ = (1/√N) [ 1   1         1           . . .  1
                   1   ω         ω²          . . .  ω^{N−1}
                   1   ω²        ω⁴          . . .  ω^{2(N−1)}
                   :   :         :                  :
                   1   ω^{N−1}   ω^{2(N−1)}  . . .  ω^{(N−1)²} ]

    F = (1/√N) [ 1   1         1           . . .  1
                 1   ω̄         ω̄²          . . .  ω̄^{N−1}
                 1   ω̄²        ω̄⁴          . . .  ω̄^{2(N−1)}
                 :   :         :                  :
                 1   ω̄^{N−1}   ω̄^{2(N−1)}  . . .  ω̄^{(N−1)²} ]

where ω̄ = e^{−2πi/N} is the complex conjugate of ω. Notice that F⁻¹ = F̄; since F is also symmetric, Fᵀ = F, its inverse equals its conjugate transpose, and such matrices are called unitary.
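A quick MATLAB check (a sketch of our own) that F really is the DFT matrix, up to the 1/√N normalization used here (MATLAB's fft is unnormalized):

    % Build F and compare F*f with fft(f)/sqrt(N)
    N = 8;
    [J,K] = meshgrid(0:N-1);           % index grids for rows and columns
    w = exp(2i*pi/N);                  % primitive Nth root of unity
    F = w.^(-J.*K)/sqrt(N);            % entries (1/sqrt(N)) * conj(w)^(j*k)
    f = randn(N,1);
    max(abs(F*f - fft(f)/sqrt(N)))     % should be of order 1e-15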

The discrete Fourier transform relates discrete values fk to the coefficients ck. We remark that at the gridpoints, the basis functions

    BN = { e^{ikx}/√N ,  k = −N/2, . . . , N/2 − 1 }

span the same space as the basis

    B′N = { cos kx, sin jx ,  k = 0, . . . , N/2 ,  j = 1, . . . , N/2 − 1 } .

You can see that the two bases have the same number of elements and that all the elements in B′N can be recovered from BN using cos kx = (e^{ikx} + e^{−ikx})/2, sin kx = (e^{ikx} − e^{−ikx})/(2i), and using the fact that for k = N/2, sin kx vanishes at the gridpoints. (The two bases are not equivalent in the continuous case.) Thus the discrete Fourier series (6a) is equivalent to the alternate form you may also be familiar with:

    fk = a0 + Σ_{j=1}^{N/2} aj cos jxk + Σ_{j=1}^{N/2−1} bj sin jxk

where of course the coefficients aj, bj are different, given by some linear combination of the cj.

Operation count for the DFT: Given the N function values fk, computing the Fourier coefficients cj/√N by direct matrix multiplication requires O(N²) operations: a sum of N terms for each j = 0, . . . , N − 1. Can we compute these sums any faster? YES. One can take advantage of the special structure of the matrix F to obtain a fast algorithm called the Fast Fourier Transform (FFT).

Before going on to this, let's look at some examples of the DFT of discrete data values.

1.3 Signal processing: frequency analysis

If

    f(x) = (1/√N) Σ_{k=−N/2}^{N/2−1} ck e^{ikx} = Σ_{k=−N/2}^{N/2−1} f̂k e^{ikx}


then the f̂k = ck/√N are the Fourier coefficients. On the grid,

    f(xj) = (1/√N) Σ_{k=0}^{N−1} ck e^{2πikj/N} = Σ_{k=0}^{N−1} f̂k e^{2πikj/N}        (7a)

where

    f̂j = (1/N) Σ_{k=0}^{N−1} fk e^{−2πijk/N} ,   j = 0, . . . , N − 1        (7b)

The MATLAB routine fft(f,n) returns the sum in (7b) without the 1/N factor, and ifft(c,n) returns the sum in (7a) divided by N. Consider for example the function f(x) = sin kx = (e^{ikx} − e^{−ikx})/(2i). It has

    f̂k = 1/(2i) ,   f̂−k = f̂N−k = −1/(2i) ,   f̂j = 0 for j ≠ ±k

The figure below shows the output of the following MATLAB script:

    clear;clf
    n=64; k=16;
    dx=2*pi/n; x=0:dx:2*pi; x=x(1:n);

    f=sin(k*x); c=fft(f,n)/n;
    subplot(2,2,1); plot(x,f),axis([0,2*pi,-2,2]); title(' f(x)')
    subplot(2,2,2); plot(0:n-1,abs(c)),axis([0,n-1,-0.1,0.6]); title(' abs(fhat(k))')

    f=sin(k*x)+0.5*randn(1,n); c=fft(f,n)/n;
    subplot(2,2,3); plot(x,f),axis([0,2*pi,-2,2]); title(' f(x)')
    subplot(2,2,4); plot(0:n-1,abs(c)),axis([0,n-1,-0.1,0.6]); title(' abs(fhat(k))')

[Figure: top row, the clean signal f(x) = sin 16x and abs(fhat(k)), with spikes of height 0.5 at k = 16 and k = 48; bottom row, the noisy signal and its spectrum, in which the same two spikes still stand out above the noise.]

1.4 Interpolation and approximation

[Figure: two sets of four panels showing trigonometric interpolation/approximation of a function with n = 4, 8, 16, 32 points.]

2. The Fast Fourier Transform (FFT)

The FFT should really be called the FDFT: it is a fast way to compute the discrete Fourier transform of a set of data, and its inverse.

Both the discrete Fourier transform (given fk, find f̂k) and its inverse (given f̂k, find fk) require computing N sums of the form

    cj = Σ_{k=0}^{N−1} ak e^{±2πijk/N} = Σ_{k=0}^{N−1} ak ω^{jk} = f(ω^j)        (2.1)

for all j = 0, . . . , N − 1. Here ω = e^{±2πi/N} is an Nth root of unity, ω^N = 1, and f is the polynomial with coefficients ak. Using Horner's algorithm, one can evaluate f with O(N) floating point operations per coefficient, that is, O(N²) operations in total. With the fast Fourier transform one needs only N(r1 + r2 + . . . + rp) operations, where N = r1 r2 . . . rp. To put it another way, the cost of calculating all N values of a polynomial f at the Nth roots of unity is much less than N times the cost of one such evaluation. For example, if N = 2¹⁰ = 1024, the number of operations becomes 2¹⁰ · 20 = 20,480 instead of 2²⁰ = 1,048,576, a speedup factor of 51. The speedup factor increases as N increases. The algorithm was developed in 1965 by Cooley and Tukey, but related ideas can be found in many previous works. In many areas of application this algorithm has caused a complete change of attitude toward what can be done using Fourier methods on a computer.

We illustrate the basic idea for the special case where N is a power of 2, say N = 2^r. Let's break the sum (2.1) into two sums, containing respectively the terms where k is even and those where k is odd:

    cj = Σ_{k=0}^{N/2−1} a2k ω^{2jk} + Σ_{k=0}^{N/2−1} a2k+1 ω^{j(2k+1)}

       = Σ_{k=0}^{N/2−1} a2k ω^{2jk} + ω^j Σ_{k=0}^{N/2−1} a2k+1 ω^{2jk}        (2.2)

Thus we reduced the problem to computing two sums of half the length. Notice that each of the sums is a Fourier transform of shorter length. If we had to evaluate each sum for all j = 0, . . . , N − 1, no gain would have occurred, since the total cost would still be 2 · N · (N/2) = N². However, let Q(j) denote the first of the sums in (2.2). Then Q(j) is a periodic function of j with period N/2, since

    Q(j + N/2) = Σ_{k=0}^{N/2−1} a2k ω^{2(j+N/2)k} = Σ_{k=0}^{N/2−1} a2k ω^{2jk+Nk} = Σ_{k=0}^{N/2−1} a2k ω^{2jk} (ω^N)^k = Q(j)

for all integers j, since ω^N = 1. Thus we need to evaluate each sum only N/2 times: if we need the value of Q for some j > N/2 − 1, we can get it by asking for Q(j mod N/2). Computing all the required values of the two sums therefore requires 2 · (N/2)² = N²/2 operations. This process can now be repeated recursively for each of the two sums.

The following algorithm implements the FFT using recursive function calls. It is not the way the FFT is implemented in practice, since the recursive calls are expensive, but it illustrates the point.


    function FFT(n:integer, alf:complexarray):complexarray
    % computes the fast Fourier transform of n = 2^r numbers alf
    if n=1 then
        FFT(0) = alf(0)
    else
        evenarray = alf(0:2:n-2)
        oddarray  = alf(1:2:n-1)
        u(0:n/2-1) = FFT(n/2, evenarray)
        v(0:n/2-1) = FFT(n/2, oddarray)
        for j=0:n-1
            tau = exp(2*pi*i*j/n)
            FFT(j) = u(j mod n/2) + tau*v(j mod n/2)
        end
    end
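The same recursion as runnable MATLAB (a sketch of our own; with the + sign in the exponent it computes the unnormalized sum in (2.1), so fftrec(a) agrees with n*ifft(a) for a column vector a):

    function c = fftrec(a)
    % recursive radix-2 FFT of a column vector a; length(a) must be a power of 2
    n = length(a);
    if n == 1
        c = a;
    else
        u = fftrec(a(1:2:n));           % transform of even-indexed terms a_0, a_2, ...
        v = fftrec(a(2:2:n));           % transform of odd-indexed terms a_1, a_3, ...
        tau = exp(2i*pi*(0:n-1).'/n);   % twiddle factors tau_j = omega^j
        c = [u;u] + tau.*[v;v];         % c_j = u(j mod n/2) + tau_j * v(j mod n/2)
    end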

What is the total number of floating point operations needed to compute the FFT using this routine? Let y(k) denote the number of multiplications of complex numbers performed when we call FFT on an array of length n = 2^k. The call FFT(n/2,evenarray) costs y(k − 1) multiplications, as does the call FFT(n/2,oddarray). The loop for j=0:n-1 requires n more multiplications. Hence

    y(k) = 2 y(k − 1) + 2^k

If we change variables by writing y(k) = 2^k zk, then we find that zk = zk−1 + 1. For k = 0, that is n = 1, no operations are needed, so y(0) = z0 = 0. The solution of

    zk = zk−1 + 1 ,   z0 = 0

is zk = k for all k ≥ 0, and therefore y(k) = k 2^k = n log₂(n). This proves:

Theorem: The Fourier transform of a sequence of n complex numbers is computed using only O(n log n) multiplications of complex numbers by means of the procedure FFT, if n is a power of 2.

Reference: [5]. You can see that a similar algorithm can be implemented for any factor m other than 2. Some FFT codes only allow you to use n = 2^k; others allow more prime factors, such as n = 2^{k1} 3^{k2} 5^{k3}; others allow any value of n. Read the documentation of any particular code to find out.

The manipulations above that lead to the FFT exploit a special property of the Fourier transform. For other transforms no such "fast algorithm" is available. For example, the spherical harmonic transform consists of writing functions defined on the surface of the sphere in terms of special basis functions, and is needed in climate models. The absence of a "fast spherical harmonic transform" limits its use. This is one of the open problems.

In MATLAB, the commands we have been using, fft(f) and ifft(f), use a fast Fourier transform algorithm that allows any value of n. For an FFT code in Fortran or C, see http://www.fftw.org/ for the Fastest Fourier Transform in the West.


3. The DFT in 2D

3.1 Derivation of the 2D DFT

Let us recall the 1D DFT, then derive it for 2D functions. Consider a function f(x) of one variable. Let fj = f(xj) be a sequence of N function values, where xj = jT/N, j = 0, . . . , N − 1. Then

    fj = f(xj) = Σ_{k=−N/2}^{N/2−1} ck e^{2πikxj/T} = Σ_{k=0}^{N−1} ck e^{2πikxj/T}

if and only if

    ck = (1/N) Σ_{j=0}^{N−1} fj e^{−2πikxj/T}

The coefficients f̂k = ck are the Fourier coefficients of the function fp(x) = Σ_{k=−N/2}^{N/2−1} f̂k e^{2πikx/T}.

Now consider a function of two variables, f(x, y). Let fj,l = f(xj, yl) be a set of N · M function values, where xj = jTx/N, j = 0, . . . , N − 1, and yl = lTy/M, l = 0, . . . , M − 1. Then

    fj,l = f(xj, yl) = Σ_{k=−N/2}^{N/2−1} ck(yl) e^{2πikxj/Tx} = Σ_{k=0}^{N−1} ck(yl) e^{2πikxj/Tx}

and

    ck,l = ck(yl) = Σ_{m=−M/2}^{M/2−1} dk,m e^{2πimyl/Ty} = Σ_{m=0}^{M−1} dk,m e^{2πimyl/Ty}

if and only if

    ck,l = (1/N) Σ_{j=0}^{N−1} fj,l e^{−2πikxj/Tx}

    dk,m = (1/M) Σ_{l=0}^{M−1} ck,l e^{−2πimyl/Ty}

This is equivalent to the following statement, in final form:

    fj,l = f(xj, yl) = Σ_{k=−N/2}^{N/2−1} Σ_{m=−M/2}^{M/2−1} dk,m e^{2πikxj/Tx} e^{2πimyl/Ty} = Σ_{k=0}^{N−1} Σ_{m=0}^{M−1} dk,m e^{2πikxj/Tx} e^{2πimyl/Ty}

if and only if

    dk,m = (1/(MN)) Σ_{l=0}^{M−1} Σ_{j=0}^{N−1} fj,l e^{−2πikxj/Tx} e^{−2πimyl/Ty}

The coefficients f̂k,m = dk,m are the Fourier coefficients of the function

    fp(x, y) = Σ_{k=−N/2}^{N/2−1} Σ_{m=−M/2}^{M/2−1} f̂k,m e^{2πikx/Tx} e^{2πimy/Ty}        (2)


Example: Consider f(x, y) = 3 e^{2πix} e^{6πiy} + i e^{−2πix} e^{2πiy}. This function is written in the form (2), with periods Tx = Ty = 1, from which we can read off its Fourier coefficients:

    f̂1,3 = 3 ,   f̂−1,1 = i ,   f̂k,m = 0 otherwise
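Checking this in MATLAB (a sketch of our own; it uses the index correspondence described in the next subsection):

    % Verify the coefficients of the example with fft2
    N = 8; h = 1/N;
    [x,y] = meshgrid(h*(0:N-1));      % period T = 1 in each direction
    f = 3*exp(2i*pi*x).*exp(6i*pi*y) + 1i*exp(-2i*pi*x).*exp(2i*pi*y);
    d = fft2(f)/N^2;
    d(3+1, 1+1)      % fhat_{1,3}: row m+1, column k+1 -> should be 3
    d(1+1, N-1+1)    % fhat_{-1,1}: k = -1 is stored at column N -> should be i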

3.2 Computing the 2D DFT in MATLAB

Let f = [fj,l] be a matrix. The call d = fft2(f)/(N*M) returns the 2-dimensional DFT of f, computed with an FFT algorithm. That is, d = [dk,m] is a matrix containing the corresponding Fourier coefficients. The call f = N*M*ifft2(d) returns the inverse 2-dimensional transform: given the Fourier coefficients, it returns the function values. Remember that MATLAB indexes the variables differently, using the following correspondence for the coefficient indices k, m:

    Matlab index              real index
    1 ≤ k, m ≤ N/2       ←→   0 ≤ k, m ≤ N/2 − 1
    N/2 + 1 ≤ k, m ≤ N   ←→   −N/2 ≤ k, m ≤ −1

and for the function-value indices j, l:

    Matlab index              real index
    1 ≤ j, l ≤ N         ←→   0 ≤ j, l ≤ N − 1

Exercise: Let f(x, y) = cos πx sin πy. Find its Fourier coefficients f̂k,m, 0 ≤ k, m ≤ 3, analytically (using known identities to write cos πx and sin πy in terms of exponentials), and in MATLAB, making sure they agree.

3.3 Solving differential equations in 2D (periodic case)

We already saw in section 1.5 that the Fourier transform can be used to solve differential equations quickly and accurately, if the solution is known to be periodic. In particular, we solved the 1D problem u′′ = f(x) with periodic boundary conditions using the Fourier transform. Here, we will consider the equivalent 2D problem

    ∆u = f(x, y) ,   (x, y) ∈ [0, T ] × [0, T ] ,

with periodic boundary conditions. Here f(x, y) is a given periodic function, and we wish to find the unknown u(x, y). For simplicity, we assume the domain is square; extensions to rectangular domains are obvious. We already solved this problem using the Jacobi and the conjugate gradient methods, which work for general f and general boundary conditions. For periodic f and periodic boundary conditions the method in this section, using Fourier transforms, is much more accurate and faster.

Discretize the domain by xj = yj = jT/N, j = 0, . . . , N − 1. Assume Fourier series representations for u and f:

    u(x, y) = Σ_{k=−N/2}^{N/2−1} Σ_{m=−N/2}^{N/2−1} ûk,m e^{2πikx/T} e^{2πimy/T} ,
    f(x, y) = Σ_{k=−N/2}^{N/2−1} Σ_{m=−N/2}^{N/2−1} f̂k,m e^{2πikx/T} e^{2πimy/T}


where f̂ is obtained using fft2(f). Then

    ∆u = Σ_{k=−N/2}^{N/2−1} Σ_{m=−N/2}^{N/2−1} −(2π/T)² (k² + m²) ûk,m e^{2πikx/T} e^{2πimy/T}

Note that by differentiating the Fourier series we assume that the series approximates u well in between gridpoints! This will be the case if u is periodic.