Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen, 2004 Pearson Education Inc. All rights reserved.
Numerical Algorithms
Chapter 11
Numerical Algorithms
Topics covered in the textbook:
• Matrix multiplication
• Solving a system of linear equations
Matrices — A Review

An n × m matrix (n rows, m columns):

         column 0   column 1   ...   column m-2   column m-1
row 0     a0,0       a0,1      ...    a0,m-2       a0,m-1
row 1     a1,0       a1,1      ...    a1,m-2       a1,m-1
  .         .          .                 .            .
row n-2   an-2,0     an-2,1    ...    an-2,m-2     an-2,m-1
row n-1   an-1,0     an-1,1    ...    an-1,m-2     an-1,m-1
Matrix Addition
Involves adding corresponding elements of each matrix to form the result matrix.

Given the elements of A as ai,j and the elements of B as bi,j, each element of C is computed as

ci,j = ai,j + bi,j    (0 ≤ i < n, 0 ≤ j < m)
Matrix Multiplication
Multiplication of two matrices, A and B, produces the matrix C
whose elements, ci,j (0 ≤ i < n, 0 ≤ j < m), are computed as follows:

ci,j = Σ (k = 0 to l−1) ai,k bk,j

where A is an n × l matrix and B is an l × m matrix.
Figure: Matrix multiplication, C = A × B. Row i of A and column j of B are multiplied element by element and the results summed to form ci,j.
Matrix-Vector Multiplication, c = A × b

Figure: row i of A is multiplied with the vector b and the products summed to give element ci.

Matrix-vector multiplication follows directly from the definition of
matrix-matrix multiplication by making B an n × 1 matrix (vector).
The result is an n × 1 matrix (vector).
Relationship of Matrices to Linear Equations
A system of linear equations can be written in matrix form:
Ax = b
Matrix A holds the constants ai,j,
x is a vector of the unknowns, and
b is a vector of the constants bi.
Implementing Matrix Multiplication
Sequential Code
Assume throughout that the matrices are square (n × n matrices).
The sequential code to compute A × B could simply be

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        c[i][j] = 0;
        for (k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

This algorithm requires n3 multiplications and n3 additions, leading
to a sequential time complexity of O(n3). Very easy to parallelize.
Parallel Code
With n processors (and n × n matrices), can obtain:

• Time complexity of O(n2) with n processors
  Each instance of the inner loop is independent and can be done by a separate processor.

• Time complexity of O(n) with n2 processors
  One element of A and B assigned to each processor.
  Cost optimal since O(n3) = n × O(n2) = n2 × O(n).

• Time complexity of O(log n) with n3 processors
  By parallelizing the inner loop. Not cost-optimal since O(n3) ≠ n3 × O(log n).

O(log n) is the lower bound for parallel matrix multiplication.
Partitioning into Submatrices
Suppose the matrix is divided into s2 submatrices. Each submatrix has n/s × n/s elements. Using the notation Ap,q as the submatrix in submatrix row p and submatrix column q:

for (p = 0; p < s; p++)
    for (q = 0; q < s; q++) {
        Cp,q = 0;                      /* clear elements of submatrix */
        for (r = 0; r < s; r++)        /* submatrix multiplication and */
            Cp,q = Cp,q + Ap,r * Br,q; /* add to accumulating submatrix */
    }

The line

Cp,q = Cp,q + Ap,r * Br,q;

means multiply submatrix Ap,r by Br,q using matrix multiplication and add to submatrix Cp,q using matrix addition. Known as block matrix multiplication.
Block Matrix Multiplication

Figure: submatrix row p of A and submatrix column q of B are multiplied and the submatrix results summed to form Cp,q.
(a) Matrices: A and B are 4 × 4 matrices (elements a0,0 … a3,3 and b0,0 … b3,3), each partitioned into four 2 × 2 submatrices; e.g. A0,0 holds a0,0, a0,1, a1,0, a1,1.

(b) Multiplying A0,0 × B0,0 + A0,1 × B1,0 to obtain C0,0:

A0,0 × B0,0 =
| a0,0b0,0+a0,1b1,0   a0,0b0,1+a0,1b1,1 |
| a1,0b0,0+a1,1b1,0   a1,0b0,1+a1,1b1,1 |

A0,1 × B1,0 =
| a0,2b2,0+a0,3b3,0   a0,2b2,1+a0,3b3,1 |
| a1,2b2,0+a1,3b3,0   a1,2b2,1+a1,3b3,1 |

Adding the two submatrix products:

C0,0 =
| a0,0b0,0+a0,1b1,0+a0,2b2,0+a0,3b3,0   a0,0b0,1+a0,1b1,1+a0,2b2,1+a0,3b3,1 |
| a1,0b0,0+a1,1b1,0+a1,2b2,0+a1,3b3,0   a1,0b0,1+a1,1b1,1+a1,2b2,1+a1,3b3,1 |

Submatrix multiplication to obtain C0,0.
Direct Implementation

Figure: processor Pi,j holds row i of A (a[i][]) and column j of B (b[][j]) and computes c[i][j].

One processor computes each element of C, so n2 processors would be needed. Each needs one row of elements of A and one column of elements of B; some of the same elements are sent to more than one processor. Submatrices can be used instead of single elements.
Performance Improvement

Using a tree construction, n numbers can be added in log n steps using n processors.

Figure: for c0,0, processors P0 … P3 form the products a0,0 × b0,0, a0,1 × b1,0, a0,2 × b2,0, a0,3 × b3,0 in parallel, then sum them pairwise in a tree (P0 + P1, P2 + P3, then the final sum).

Computational time complexity of O(log n) using n3 processors.
Recursive Implementation

Figure: each matrix is divided into four submatrices (App, Apq, Aqp, Aqq, and similarly for B and C). The eight submatrix products P0 … P7 are computed and added pairwise (P0 + P1, P2 + P3, P4 + P5, P6 + P7) to form the four submatrices of C.

Apply the same algorithm to each submatrix recursively.

Excellent algorithm for a shared-memory system because of the locality of operations.
Recursive Algorithm

mat_mult(App, Bpp, s)
{
    if (s == 1)         /* if submatrix has one element */
        C = A * B;      /* multiply elements */
    else {              /* continue to make recursive calls */
        s = s/2;        /* no of elements in each row/column */
        P0 = mat_mult(App, Bpp, s);
        P1 = mat_mult(Apq, Bqp, s);
        P2 = mat_mult(App, Bpq, s);
        P3 = mat_mult(Apq, Bqq, s);
        P4 = mat_mult(Aqp, Bpp, s);
        P5 = mat_mult(Aqq, Bqp, s);
        P6 = mat_mult(Aqp, Bpq, s);
        P7 = mat_mult(Aqq, Bqq, s);
        Cpp = P0 + P1;  /* add submatrix products to */
        Cpq = P2 + P3;  /* form submatrices of final matrix */
        Cqp = P4 + P5;
        Cqq = P6 + P7;
    }
    return (C);         /* return final matrix */
}
Mesh Implementations
• Cannon’s algorithm
• Fox’s algorithm (not in textbook but similar complexity)
• Systolic array
All involve processors arranged in a mesh, shifting elements of the
arrays through the mesh, and accumulating the partial sums at
each processor.
Mesh Implementations
Cannon's Algorithm

Uses a mesh of processors with wraparound connections (a torus) to shift the A elements (or submatrices) left and the B elements (or submatrices) up.

1. Initially processor Pi,j has elements ai,j and bi,j (0 ≤ i < n, 0 ≤ j < n).
2. Elements are moved from their initial position to an "aligned" position. The complete ith row of A is shifted i places left and the complete jth column of B is shifted j places upward. This has the effect of placing the element ai,j+i and the element bi+j,j in processor Pi,j. These elements are a pair of those required in the accumulation of ci,j.
3. Each processor, Pi,j, multiplies its elements.
4. The ith row of A is shifted one place left, and the jth column of B is shifted one place upward. This has the effect of bringing together the adjacent elements of A and B, which will also be required in the accumulation.
5. Each processor, Pi,j, multiplies the elements brought to it and adds the result to the accumulating sum.
6. Steps 4 and 5 are repeated until the final result is obtained (n − 1 shifts with n rows and n columns of elements).
Figure: Movement of A and B elements. The A elements shift left along the rows and the B elements shift up along the columns, passing through each processor Pi,j.
Figure: Step 2 — alignment of elements of A and B. Row i of A is shifted i places left, bringing ai,j+i to Pi,j; column j of B is shifted j places up, bringing bi+j,j to Pi,j.
Figure: Step 4 — one-place shift of the elements of A (left) and B (up) through processor Pi,j.
Systolic array

Figure: a 4 × 4 systolic array holding the result elements c0,0 … c3,3. The elements of A are pumped in from the left (row i staggered by i cycles) and the elements of B from the top (column j staggered by j cycles), with a one-cycle delay between cells; each cell accumulates its ci,j.
Matrix-Vector Multiplication

Figure: a linear systolic array holding c0 … c3. The elements of A are pumped in from the left and the elements of b from the top; cell i accumulates ci.
Solving a System of Linear Equations
an−1,0x0 + an−1,1x1 + an−1,2x2 … + an−1,n−1xn−1 = bn−1
    .
    .
    .
a2,0x0 + a2,1x1 + a2,2x2 … + a2,n−1xn−1 = b2
a1,0x0 + a1,1x1 + a1,2x2 … + a1,n−1xn−1 = b1
a0,0x0 + a0,1x1 + a0,2x2 … + a0,n−1xn−1 = b0

which, in matrix form, is

Ax = b

The objective is to find values for the unknowns, x0, x1, …, xn−1, given values for a0,0, a0,1, …, an−1,n−1, and b0, …, bn−1.
Solving a System of Linear Equations
Dense matrices
Gaussian Elimination - parallel time complexity O(n2)
Sparse matrices
By iteration - depends upon the iteration method and the number of iterations, but typically O(log n)

• Jacobi iteration
• Gauss-Seidel relaxation (not good for parallelization)
• Red-Black ordering
• Multigrid
Gaussian Elimination
Convert the general system of linear equations into a triangular system of equations, which can then be solved by back substitution.

Uses the characteristic of linear equations that any row can be replaced by that row added to another row multiplied by a constant.

Starts at the first row and works toward the bottom row. At the ith row, each row j below the ith row is replaced by row j + (row i)(−aj,i/ai,i). The constant used for row j is −aj,i/ai,i. This has the effect of making all the elements in the ith column below the ith row zero because

aj,i = aj,i − ai,i (aj,i / ai,i) = 0
Gaussian elimination

Figure: stepping through the rows, the elements aj,i below row i in column i are cleared to zero; the columns to the left of column i are already cleared.
Partial Pivoting
If ai,i is zero or close to zero, we will not be able to compute the
quantity −aj,i/ai,i.
The procedure must be modified by so-called partial pivoting: swap

the ith row with the row below it that has the largest absolute

element in the ith column, if there is one. (Reordering the

equations will not affect the system.)

In the following, we will not consider partial pivoting.
Sequential Code
Without partial pivoting:
for (i = 0; i < n-1; i++)           /* for each row, except last */
    for (j = i+1; j < n; j++) {     /* step through subsequent rows */
        m = a[j][i]/a[i][i];        /* compute multiplier */
        for (k = i; k < n; k++)     /* last n-i elements of row j */
            a[j][k] = a[j][k] - a[i][k] * m;
        b[j] = b[j] - b[i] * m;     /* modify right side */
    }
The time complexity is O(n3).
Parallel Implementation

Figure: the processor holding row i broadcasts the n − i + 1 active elements of that row (including b[i]) to the processors holding the rows below it; the columns to the left are already cleared to zero.
Analysis

Communication

n − 1 broadcasts performed sequentially. The ith broadcast contains n − i + 1 elements. Time complexity of O(n2) (see textbook).

Computation

After the row broadcast, each processor Pj beyond the broadcast processor Pi will compute its multiplier and operate upon n − j + 2 elements of its row. Ignoring the computation of the multiplier, there are n − j + 2 multiplications and n − j + 2 subtractions. Time complexity of O(n2) (see textbook).

Efficiency will be relatively low because all the processors before the processor holding row i do not participate in the computation again.
Pipeline implementation of Gaussian elimination

Figure: rows are broadcast through the processors P0, P1, P2, …, Pn−1 in pipeline fashion.
Strip Partitioning

Figure: rows 0 to n/p − 1 assigned to P0, rows n/p to 2n/p − 1 to P1, rows 2n/p to 3n/p − 1 to P2, and so on.

Poor processor allocation! Processors do not participate in the computation after their last row is processed.
Cyclic-Striped Partitioning

An alternative which equalizes the processor workload:

Figure: row strips are allocated to the processors in a cyclic (round-robin) fashion, so each processor holds rows scattered through the matrix.
Iterative Methods
The time complexity of the direct method, O(N2) with N processors, is significant.

The time complexity of an iterative method depends upon:

• the type of iteration,
• the number of iterations,
• the number of unknowns, and
• the required accuracy

but can be less than that of the direct method, especially when each equation involves only a few of the unknowns, i.e., a sparse system of linear equations.
Jacobi Iteration
Iteration formula - the ith equation is rearranged to have the ith unknown on the left side:

xi^k = (1/ai,i) [ bi − Σ (j ≠ i) ai,j xj^(k−1) ]

The superscript indicates the iteration: xi^k is the kth iteration of xi, and xj^(k−1) is the (k−1)th iteration of xj.
Example of a Sparse System of Linear Equations

Laplace's Equation

∂²f/∂x² + ∂²f/∂y² = 0

Solve for f over the two-dimensional x-y space.

For a computer solution, finite difference methods are appropriate.
The two-dimensional solution space is "discretized" into a large number
of solution points.
Finite Difference Method

Figure: the solution space for f(x, y) is discretized into points spaced ∆ apart in the x and y directions.
If the distance between points, ∆, is made small enough:

∂²f/∂x² ≈ (1/∆²)[f(x + ∆, y) − 2f(x, y) + f(x − ∆, y)]

∂²f/∂y² ≈ (1/∆²)[f(x, y + ∆) − 2f(x, y) + f(x, y − ∆)]

Substituting into Laplace's equation, we get

(1/∆²)[f(x + ∆, y) + f(x − ∆, y) + f(x, y + ∆) + f(x, y − ∆) − 4f(x, y)] = 0

Rearranging, we get

f(x, y) = [f(x − ∆, y) + f(x, y − ∆) + f(x + ∆, y) + f(x, y + ∆)] / 4

Rewritten as an iterative formula:

f^k(x, y) = [f^(k−1)(x − ∆, y) + f^(k−1)(x, y − ∆) + f^(k−1)(x + ∆, y) + f^(k−1)(x, y + ∆)] / 4

where f^k(x, y) is the kth iteration and f^(k−1)(x, y) the (k−1)th iteration.
Natural Order

Figure: the mesh points x1 … x100 numbered row by row (x1–x10 across the first row, x11–x20 across the second, and so on), surrounded by boundary points (see text).
Relationship with a General System of Linear Equations
Using natural ordering, the ith point is computed from the ith equation:

xi = (xi−n + xi−1 + xi+1 + xi+n) / 4

or

xi−n + xi−1 − 4xi + xi+1 + xi+n = 0

which is a linear equation with five unknowns (except those with boundary points).

In general form, the ith equation becomes:

ai,i−nxi−n + ai,i−1xi−1 + ai,ixi + ai,i+1xi+1 + ai,i+nxi+n = 0

where ai,i = −4, and ai,i−n = ai,i−1 = ai,i+1 = ai,i+n = 1.
Figure: the system written in matrix form, A × x = 0, with entries to include boundary values and some zero entries (see text). Each row of A corresponding to an interior point has −4 on the diagonal (ai,i) and 1 in positions ai,i−n, ai,i−1, ai,i+1, and ai,i+n; x = (x1, x2, …, xN).

Those equations with a boundary point on the diagonal are unnecessary for the solution.
Gauss-Seidel Relaxation

Uses some newly computed values to compute other values in that iteration.

Figure: points are computed in sequential order; the point to be computed uses neighbouring points already computed in the current iteration.

The basic form is not suitable for parallelization.
Gauss-Seidel Iteration Formula

xi^k = (1/ai,i) [ bi − Σ (j = 1 to i−1) ai,j xj^k − Σ (j = i+1 to N) ai,j xj^(k−1) ]

where the superscript indicates the iteration.

With natural ordering of the unknowns, the formula reduces to

xi^k = (−1/ai,i) [ ai,i−n xi−n^k + ai,i−1 xi−1^k + ai,i+1 xi+1^(k−1) + ai,i+n xi+n^(k−1) ]

At the kth iteration, two of the four values (before the ith element) are taken from the kth iteration and two values (after the ith element) from the (k−1)th iteration. In terms of the mesh points, we have:

f^k(x, y) = [f^k(x − ∆, y) + f^k(x, y − ∆) + f^(k−1)(x + ∆, y) + f^(k−1)(x, y + ∆)] / 4
Red-Black Ordering

The points are coloured alternately red and black, like a checkerboard. First, the red points are computed, then the black points. All red points can be computed simultaneously, and all black points can be computed simultaneously.
Red-Black Parallel Code
forall (i = 1; i < n; i++)
    forall (j = 1; j < n; j++)
        if ((i + j) % 2 == 0)  /* compute red points */
            f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);

forall (i = 1; i < n; i++)
    forall (j = 1; j < n; j++)
        if ((i + j) % 2 != 0)  /* compute black points */
            f[i][j] = 0.25*(f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);
Higher-Order Difference Methods
More distant points could be used in the computation, for example the following update formula:

f^k(x, y) = (1/60) [ 16f^(k−1)(x − ∆, y) + 16f^(k−1)(x, y − ∆) + 16f^(k−1)(x + ∆, y) + 16f^(k−1)(x, y + ∆)
                   − f^(k−1)(x − 2∆, y) − f^(k−1)(x, y − 2∆) − f^(k−1)(x + 2∆, y) − f^(k−1)(x, y + 2∆) ]
Figure: Nine-point stencil. The updated point uses its four nearest neighbours at distance ∆ and four further points at distance 2∆.
Overrelaxation
Improved convergence can be obtained by adding the factor (1 − ω)xi^(k−1) to the Jacobi or Gauss-Seidel formulas. The factor ω is the overrelaxation parameter.

Jacobi overrelaxation formula:

xi^k = (ω/ai,i) [ bi − Σ (j ≠ i) ai,j xj^(k−1) ] + (1 − ω) xi^(k−1)

where 0 < ω < 1.

Gauss-Seidel successive overrelaxation:

xi^k = (ω/ai,i) [ bi − Σ (j = 1 to i−1) ai,j xj^k − Σ (j = i+1 to N) ai,j xj^(k−1) ] + (1 − ω) xi^(k−1)

where 0 < ω ≤ 2. If ω = 1, we obtain the Gauss-Seidel method.
Multigrid Method
First, a coarse grid of points is used. With these points, the iteration process will start to converge quickly.

At some stage, the number of points is increased to include the points of the coarse grid and extra points between them. The initial values of the extra points are found by interpolation. Computation continues with this finer grid.

The grid can be made finer and finer as the computation proceeds, or the computation can alternate between fine and coarse grids.

Coarser grids take distant effects into account more quickly and provide a good starting point for the next finer grid.
Multigrid processor allocation

Figure: each processor is allocated one coarsest-grid point together with the surrounding finer-grid points.
(Semi) Asynchronous Iteration
As noted earlier, synchronizing on every iteration causes significant
overhead; it is best to synchronize only after a number of iterations.