Loughborough University Institutional Repository

Multiprocessor computer architectures: algorithmic design and applications

This item was submitted to Loughborough University's Institutional Repository by the/an author.

Additional Information:
• A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy of Loughborough University.

Metadata Record: https://dspace.lboro.ac.uk/2134/10872
Publisher: © A.S. Roomi
Please cite the published version.

This item was submitted to Loughborough University as a PhD thesis by the author and is made available in the Institutional Repository (https://dspace.lboro.ac.uk/) under the following Creative Commons Licence conditions. For the full text of this licence, please go to: http://creativecommons.org/licenses/by-nc-nd/2.5/
MULTIPROCESSOR COMPUTER ARCHITECTURES:
ALGORITHMIC DESIGN AND APPLICATIONS
BY
AKEEL S. ROOMI, B.Sc., M.Sc.
A Doctoral Thesis
Submitted in Partial Fulfilment of the Requirements
For the Award of Doctor of Philosophy
of Loughborough University of Technology
August, 1989.
SUPERVISOR: PROFESSOR D.J. EVANS, PH.D., D.SC.
Department of Computer Studies
© by A.S. Roomi, 1989.
CERTIFICATE OF ORIGINALITY
This is to certify that I am responsible for the work submitted in this thesis, that the original work is my own except as specified in acknowledgements or in footnotes, and that neither the thesis nor the original work contained therein has been submitted to this or any other institution for a higher degree.
A.S. ROOMI.
To
My Parents,
My Wife, Berjual,
and my Children,
Amar, Maytham and Heatror,
with love,
Akeel
ACKNOWLEDGEMENTS
It is the special privilege of authorship that, at the end of one's exertions, it is possible to pause, look back and formally thank the many people without whose active participation the scheduled completion of a doctoral thesis would have been impossible.
Professor D.J. Evans, Head of the Parallel Processing Centre, Loughborough University of Technology, in addition to being an effective and extremely conscientious supervisor and Director, is also endowed with a gently persuasive manner. This combination was ideal in guiding me through the perilous area of parallel computers. I am most grateful to him for his painstaking comments and useful suggestions throughout the last three years.
My profound gratitude goes to the Ministry of Higher Education, Iraqi Government, for the award of a three-year scholarship to enable me to undertake this research.
I am grateful to the staff and research students of the Department of Computer Studies for their kind cooperation during my research. I wish to thank in particular Dr. W.S. Yousif, for his useful suggestions during the early stage of this work.
Finally, I would like to express my appreciation to my wife, Berjual, who provided an environment and the encouragement essential to the completion of the work.
ABSTRACT
The contents of this thesis are concerned with the implementation of parallel algorithms for solving partial differential equations (PDEs) by the Alternating Group Explicit (AGE) method and an investigation into the numerical inversion of the Laplace transform on the Balance 8000 MIMD system.
Parallel computer architectures are introduced with different types of existing parallel computers, including the Data-Flow computer and VLSI technology, which are described from both the hardware and implementation points of view. The main characteristics of the Sequent parallel computer system at Loughborough University are presented, and performance indicators, i.e., the speedup and efficiency factors, are defined for the measurement of parallelism in the system. Basic ideas of programming such computers are also outlined.
Basic mathematical definitions and a general description and classification of PDE's and their related discretised matrices are introduced in Chapter 3.
In Chapter 4, the parallel version of the AGE method is developed for one and two dimensional elliptic PDE's. The AGE method is suitable for parallel computers as it possesses separate and independent tasks. Therefore three synchronous and asynchronous strategies have been used in the implementation of the method. The timing results and the efficiency of these implementations were compared. A computational complexity analysis of the parallel AGE method is also included.
The eigenvalues and the corresponding eigenvectors of the Sturm-Liouville problem are found by using the AGE method with different boundary conditions.
In Chapter 5, the three parallel AGE strategies are also implemented on time dependent PDE's. The parallel AGE method was applied to the second order parabolic equation with Dirichlet boundary conditions and to the diffusion-convection equation, with a comparison presented. Then the parallel AGE method on a two dimensional parabolic equation and a hyperbolic equation is discussed. A parabolic PDE with derivative boundary conditions is also solved using the AGE method. A new AGE formula based on a D'Yakonov splitting of the matrix is used to solve one and two dimensional parabolic PDE's. Comparing and analysing the results yields an algorithm with reduced computational complexity and greater accuracy for multi-dimensional problems.
Finally, in Chapter 6, the problem of numerically inverting the Laplace transform has been investigated and numerical results were obtained to compare the different methods. An idea to improve the accuracy by imposing a Romberg integration is suggested. Attempts to accelerate the convergence of the slowly converging series are also investigated. A parallel algorithmic form of an accurate method for the numerical inversion of the Laplace transform is implemented.
The thesis concludes by summarizing the main results obtained and suggesting further work.
CONTENTS
PAGE
CHAPTER 1: PARALLEL COMPUTER ARCHITECTURES, AN INTRODUCTION
1.1 Introduction 1
1.2 Via Parallelism 4
1.3 Architectural Classification Schemes 8
1.3.1 Flynn's Parallel Computers Classification 8
1.3.2 Feng's Parallel Computers Classification 10
1.3.3 Shore's Parallel Computers Classification 13
1.3.4 Handler's Parallel Computers Classification 16
1.3.5 Other Parallel Computers Classification 17
1.4 Pipeline Computers 19
1.5 SIMD 21
1.6 MIMD 23
1.7 Data-Flow Computers 32
1.8 VLSI Systems and Transputers 35
1.8.1 Transputer System 38
1.9 The Balance 8000 System 41
CHAPTER 2: PARALLEL PROGRAMMING AND LANGUAGES
2.1 Introduction 47
2.2 Parallel Programming 49
2.2.1 Implicit Parallelism 52
2.2.2 Explicit Parallelism 54
2.3 Programming the Balance System 60
2.3.1 Multitasking Terms and Concepts 61
2.3.2 Data Partitioning with DYNIX 66
2.3.3 Function Partitioning with DYNIX 70
2.4 Parallel Algorithms 73
2.4.1 The Structure of Algorithms for Multiprocessor Systems 78
CHAPTER 3: BASIC MATHEMATICS, GENERAL BACKGROUND
3.1 Introduction 80
3.2 Classification of Partial Differential Equations 81
3.3 Types of Boundary Conditions 84
3.4 Basic Matrix Algebra 87
3.4.1 Vectors and Matrix Norms 91
3.4.2 Eigenvalues and Eigenvectors 93
3.5 Numerical Solution of PDE's by Finite Difference Methods 95
3.5.1 Finite Difference Approximation 95
3.5.2 Derivation of Finite Difference Approximations 103
3.5.3 Consistency, Efficiency, Accuracy and Stability 107
3.6 Methods of Solution 109
3.6.1 The Direct Methods 109
3.6.2 The Iterative Methods 111
3.6.3 The Block Iterative Methods 116
3.6.4 Alternating Direction Implicit (ADI) Methods 116
3.6.5 Alternating Group Explicit (AGE) Method 123
CHAPTER 4: STEADY STATE PROBLEMS, PARALLEL EXPLORATIONS
4.1 Introduction 131
4.2 Parallel AGE Exploration 132
4.3 Experimental Results for the One-Dimensional Problem 137
4.4 Experimental Results for the Two-Dimensional Problem 150
4.5 The AGE Method for Solving Boundary Value Problems with Neumann Boundary Conditions 172
4.5.1 Formulation of the Method 172
4.5.2 Numerical Results 177
4.6 The AGE Method for Solving the Sturm-Liouville Problem 181
4.6.1 Method of Solution 182
4.6.2 Numerical Results 185
4.7 Conclusions 187
CHAPTER 5: TIME DEPENDENT PROBLEMS, PARALLEL EXPLORATIONS
5.1 Introduction 189
5.2 Experimental Results for the Diffusion-Convection Equation 190
5.3 Experimental Results for the Two-Dimensional Parabolic Problem 202
5.4 Experimental Results for the Second Order Wave Equation 220
5.5 The Numerical Solution of One-Dimensional Parabolic Equations by the AGE Method with D'Yakonov Splitting 224
5.5.1 The AGE Method 225
5.5.2 Numerical Results 229
5.6 The Numerical Solution of the Two-Dimensional Parabolic Equation by the AGE Method with D'Yakonov Splitting 234
5.6.1 Numerical Results 240
5.7 A New Strategy for the Numerical Solution of the Schrodinger Equation 244
5.7.1 Outline of the Method 245
5.7.2 Numerical Results 248
5.8 Conclusions 251
CHAPTER 6: NUMERICAL INVERSION OF THE LAPLACE TRANSFORMATIONS, SOME INVESTIGATIONS AND PARALLEL EXPLORATIONS
6.1 Introduction 253
6.2 The Numerical Inversion of the Laplace Transform 255
6.3 Numerical Experiments 258
6.3.1 The Implementation of the Fast Fourier Transform Technique 268
6.4 Parallel Implementation of the Numerical Inversion of the Laplace Transform 271
6.5 Conclusions 275
CHAPTER 7: CONCLUSIONS AND FINAL REMARKS 276
REFERENCES 281
APPENDIX A: A LIST OF SOME SELECTED PROGRAMS 290
CHAPTER 1
PARALLEL COMPUTER ARCHITECTURES,
AN INTRODUCTION
Anyone who says he knows how computers
should be built should have his head examined.
Computer Architecture
J.E. Thornton.
1.1 INTRODUCTION
High-performance, flexible and reliable computers are increasingly in demand from many scientific and engineering applications, which may be required to be solved in real time. Since conventional computers have limited speed and reliability, the satisfaction of these requirements can only be achieved by a high-performance computer system. The achievement of high performance depends not only on using faster and more reliable hardware devices, but also on different computer architectures and processing techniques. Therefore, parallel computer systems need to be developed further.
In earlier times, relays (in the 1940s) and vacuum tubes (in the 1950s) were used as switching devices and they were interconnected with wires and solder joints. The Central Processing Unit (CPU) structure was bit-serial and arithmetic was done on a bit-by-bit fixed point basis. By the early 1960's, transistors (invented in 1948) were used in computer circuits. Passive components such as resistors and capacitors were also included in these circuits. All of these devices were mounted on some kind of circuit board, the most complex of which consisted of a number of layers of conductors and insulating material. These provided interconnections between the elementary devices as well as their mechanical support. Many improvements to computer architectures were subsequently carried out. For example, Sperry Rand built a computer system with an independent I/O processor which operated in parallel with one or two processing units. Core memory was still used in many computer systems. Then, solid-state memories replaced the core memories.
By the late 1960's, Integrated Circuits (ICs) were in use, followed by Large Scale Integration (LSI) techniques, providing on one silicon chip several transistors, the required resistors and capacitors as well as interconnection paths.
Following the rapid advance in LSI technology, Very Large Scale Integration (VLSI) circuits have been developed, with which enormously complex digital electronic systems can be fabricated on a single chip of silicon. Devices which once required many complex components can now be built with just a few VLSI chips, reducing the difficulties in reliability, performance and heat dissipation that arise from standard small-scale and medium-scale integration.
Until a few years ago, the state of electronic technology was such that all factors affecting computational speed were almost minimized, and any further increase in computational speed could only be achieved through both increased switching speed and increased circuit density. Due to basic physical laws, the intended breakthrough seemed unlikely to be achieved, mainly because we are fast approaching the limits of optical resolution. Hence, even if switching times are almost instantaneous, distances between any two points may not be small enough to minimize the propagation delays and thus improve computational speed. Therefore, the achievement of even faster computers is conditioned by the use of new approaches that do not depend on breakthroughs in device technology, but rather on imaginative applications of the skills involved in computer architecture.
Obviously, one approach to increasing speed is through Parallelism. The ideal objective is to create a computer system containing p processors, connected in some cooperating fashion, so that it is p times faster than a computer with a single processor. These parallel computer systems, or multiprocessors as they are commonly known, not only increase the potential processing speed, but they also increase the overall throughput and flexibility, and improve reliability by providing fault tolerance in case of processor failures.
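The ideal p-fold objective above is what the speedup and efficiency indicators (defined later in the thesis) measure. A minimal sketch, using the standard definitions S_p = T_1/T_p and E_p = S_p/p; the timings below are invented for illustration, not measurements from the Balance 8000:

```python
def speedup(t1, tp):
    """Speedup S_p = T_1 / T_p: one-processor time over p-processor time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Efficiency E_p = S_p / p; E_p = 1 corresponds to the ideal p-fold speedup."""
    return speedup(t1, tp) / p

# Illustrative timings (seconds) for p = 4 processors.
t1, tp, p = 100.0, 30.0, 4
print(speedup(t1, tp))        # about 3.33, short of the ideal 4
print(efficiency(t1, tp, p))  # about 0.83
```

In practice synchronization and communication overheads keep S_p below p, which is why both indicators are reported together.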
1.2 VIA PARALLELISM
Parallelism, the notion of the parallel way of thinking, was conceived long before the emergence of truly parallel computers. It is thought that the earliest reference to parallelism is in L.F. Menabrea's publication*, entitled "Sketch of the Analytical Engine", invented by C. Babbage. There, reporting on the utility of the conceived machine, he wrote:
"Likewise when a long series of identical computations is to be performed, the machine can be brought into play so as to give several results at the same time, which will greatly abridge the whole amount of the processes".
Babbage's notion was implemented neither in the final design of his calculating engine nor elsewhere, due to the lack of the necessary technological development; the notion of the parallel way of thinking, though, had been conceived.
The division of computer systems into generations is determined by the device technology, system architecture, processing mode and languages used. We are currently in the fourth generation, while the fifth generation is on the horizon.
The first generation (1938-1953). The first electronic digital computer, ENIAC (Electronic Numerical Integrator And Computer), in 1946 marked the beginning of the first generation of computers. These machines used vacuum tubes, magnetic drums as central memories and electronic valves as their switching components, with gate delay times of approximately 1 μs.
* Following Babbage's lecture in Turin, describing his "difference engine", a young Italian engineer wrote a detailed account of the machine in French (published in October 1842). Ada, Lady Lovelace translated the paper into English.
The second generation (1952-1963). Transistors were invented in 1948, while the first transistorized digital computer (TRADIC) was built by Bell Laboratories in 1954. The propagation delay time of the germanium transistor is approximately 0.3 μs. Assembly languages were used until the development of the high-level languages FORmula TRANslation (FORTRAN) in 1956 and ALGOrithmic Language (ALGOL) in 1960.
The third generation (1962-1975). This generation was marked by the use of Small-Scale Integrated (SSI) and Medium-Scale Integrated (MSI) circuits as the basic building blocks. High-level languages were greatly enhanced with intelligent compilers during this period. Multiprogramming was well developed to allow the simultaneous execution of many program segments interleaved with I/O operations. Virtual memory was developed by using hierarchically structured memory systems. The propagation delay was about 10 ns, and later, around the 1970's, it became slightly less than 1 ns.
The fourth generation (1972-present). This generation is characterised by enhanced levels of circuit integration through the use of LSI circuits for both logic and memory sections. High-level languages were extended to handle both scalar and vector data. Most operating systems were time-sharing, using virtual memories. Vectorizing compilers appeared in the second generation of vector machines like the Cray-1 (1976) and the Cyber-205 (1982). High-speed mainframes and supercomputers appeared as multiprocessor systems, like the Univac 1100/80 (1976), the Fujitsu M-382 (1981), the IBM 3081 (1980), and the Cray X-MP (1983). A high degree of pipelining and multiprocessing is greatly emphasized in commercial supercomputers. A Massively Parallel Processor (MPP) was custom-designed in 1982.
All these various multiple processor architectures can be categorized into four distinct organizations: Associative, Parallel, Pipelined and Multiprocessors.
An attempt by Hockney and Jesshope [Hockney, 1981] to summarize the principal ways to introduce the notion of parallel processing at the hardware level of the various computer architectures results in:
1. the application of pipelining (assembly-line) techniques in order to improve the performance of the arithmetic or control units. A process is decomposed into a certain number of elementary subprocesses, each of which is capable of execution on a dedicated autonomous unit;
2. the arrangement of several independent units, operating in parallel, to perform some basic principal functions such as logic, addition or multiplication;
3. the arrangement of an array of processing elements (PE's) executing concurrently the same instruction on a set of different data, where the data is stored in the PE's private memories;
4. the arrangement of several independent processors, working in a cooperative manner towards the solution of a single task by communicating via a shared or common memory, each of them being a complete computer, obeying its own stored instructions.
To illustrate alternative hardware and software approaches, in the following sections we shall select principal significant architectures which differ sufficiently from each other. Specifically, for the Multiprocessor class, the Balance 8000 parallel processing system at Loughborough University of Technology is described in more detail, since it was extensively used in the present research.
1.3 ARCHITECTURAL CLASSIFICATION SCHEMES
To date, many classification schemes have been proposed. In this section we shall briefly present the theoretical concepts of the architecture taxonomies given by Flynn (1966), Feng (1972), Shore (1973) and Handler (1977).
1.3.1 Flynn's Parallel Computers Classification
In 1966, M.J. Flynn [Flynn, 1966] classified computer organizations into four categories according to the multiplicity of instruction and data streams. For convenience he adopted two definitions: the instruction stream, as a sequence of instructions which are to be executed by the system, and the data stream, as a sequence of data called for by the instruction stream.
Flynn's four machine organizations as shown in Figure 1.1 are:
1. Single Instruction Single Data (SISD);
2. Single Instruction Multiple Data (SIMD);
3. Multiple Instruction Single Data (MISD);
4. Multiple Instruction Multiple Data (MIMD).
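Since the four categories follow mechanically from the two stream multiplicities, the scheme can be sketched in a few lines; the function name below is my own, chosen for illustration:

```python
def flynn_class(instr_streams: int, data_streams: int) -> str:
    """Map instruction/data stream multiplicities to Flynn's four organizations."""
    i = "S" if instr_streams == 1 else "M"  # Single or Multiple instruction streams
    d = "S" if data_streams == 1 else "M"   # Single or Multiple data streams
    return f"{i}I{d}D"

print(flynn_class(1, 1))  # SISD: the classical von Neumann model
print(flynn_class(1, 8))  # SIMD: array and pipeline processors
print(flynn_class(8, 8))  # MIMD: multiprocessors such as the Balance 8000
```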
SISD: This is the classical von Neumann model. A single stream of instructions operates on a single data stream. It may have more than one functional unit operating under the supervision of one control unit. Examples are the IBM 360/91, CDC NASF, Fujitsu FACOM-230/75.
SIMD: This is the class to which array processors and pipeline processors belong. All the processors execute the same instructions and perform them on different data. Because of
FIGURE 1.1: Flynn's Parallel Computer Classification
(block diagrams of the SISD, SIMD, MISD and MIMD organisations; C: control unit; P: processor; N: data organisation network; S: store)
their simple form, machines of this kind can have a large number of processors. Examples are the ICL DAP, Illiac-IV, STARAN.
MISD: (Chains of processors.) There are n processor units, each receiving distinct instructions operating over the same data stream. No real embodiment of this class exists.
MIMD: This is the multiple processor version of SIMD. All processors execute different instructions and operate on different data. Most multiprocessor systems and multiple computer systems can be classified in this category. Examples are C.mmp, Balance 8000/21000, Cray-2.
1.3.2 Feng's Parallel Computers Classification
In his classification, Tse-Yun Feng [Feng, 1972] has proposed the use of degrees of parallelism in various computer architectures. The maximum parallelism degree P is defined as the maximum number of binary digits (bits) that can be processed within a unit time by a computer system. If P_i is the number of bits that can be processed within the i-th processor cycle, the cycles being indexed by i = 1,2,...,T, then the average parallelism degree P_a is defined by

    P_a = ( Σ_{i=1}^{T} P_i ) / T .                  (1.3.1)

Typically, P_a ≤ P. Accordingly, the utilization rate μ of a computer system within T cycles is

    μ = P_a / P = ( Σ_{i=1}^{T} P_i ) / (T·P) .      (1.3.2)
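Equations (1.3.1) and (1.3.2) translate directly into a short sketch; the per-cycle bit counts in the trace below are invented for illustration:

```python
def average_parallelism(bits_per_cycle):
    """P_a = (sum of P_i over T cycles) / T, equation (1.3.1)."""
    return sum(bits_per_cycle) / len(bits_per_cycle)

def utilization(bits_per_cycle, p_max):
    """mu = P_a / P = (sum of P_i) / (T * P), equation (1.3.2)."""
    return average_parallelism(bits_per_cycle) / p_max

# Hypothetical trace: bits actually processed in each of T = 4 processor
# cycles on a machine whose maximum parallelism degree is P = 64.
trace = [64, 32, 64, 16]
print(average_parallelism(trace))  # 44.0
print(utilization(trace, 64))      # 0.6875
```

The utilization rate is 1 only when every cycle processes the full P bits, i.e. P_a = P.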
Figure 1.2 emphasizes the classification of computers by their maximum parallelism degrees, where the horizontal axis shows the word length n, while the vertical axis corresponds to the bit-slice* length m. If C is a given computer, then the maximum parallelism degree P(C) is represented by the product of the word length n and the bit-slice length m; that is,

    P(C) = n·m .                  (1.3.3)

Obviously, P(C) is equal to the area of the rectangle defined by the integers n and m.
There are four types of processing methods that can be seen from Figure 1.2:
1. Word-Serial and Bit-Serial (WSBS);
2. Word-Parallel and Bit-Serial (WPBS);
3. Word-Serial and Bit-Parallel (WSBP);
4. Word-Parallel and Bit-Parallel (WPBP).
WSBS: (n=1, m=1), the conventional serial computer (von Neumann); one bit is processed at a time.
WPBS: (n=1, m>1), since an m-bit slice is processed at a time, it is termed bit-slice processing. Examples are STARAN, MPP.
WSBP: (n>1, m=1), found in most existing computers. Since one word of n bits is processed at a time, it has been called word-slice processing. Examples are IBM 370/168UP, CDC 6600.
WPBP: (n>1, m>1), known as fully parallel processing, in which an array of n·m bits is processed at one time. Examples are TI-ASC, C.mmp, Balance 8000/21000.

* The bit-slice is a string of bits, one from each of the words at the same vertical bit position. As an example, the TI-ASC has a word length of sixty-four and four arithmetic pipelines, each pipe having eight pipeline stages, so there are thirty-two bits per bit-slice in the four pipes.

FIGURE 1.2: Feng's Parallel Computer Classification System
(computers plotted by word length n against bit-slice length m; e.g. MPP (1,16384), STARAN (1,256), C.mmp (16,16), PDP-11 (16,1), PEPE (32,32), IBM 370/168 (32,1), Illiac IV (64,64), Cray-1 (64,1))
1.3.3 Shore's Parallel Computers Classification
In 1973, Shore [Shore, 1973] presented a classification of parallel computer systems based on their constituent hardware components. There are six different types of machines according to his proposal, and all existing computers could belong to one of them. Figure 1.3 shows the six different types.
1. Machine I consists of an Instruction Memory (IM), a single Control Unit (CU), a Processing Unit (PU), and a Data Memory (DM). Examples are the Cray-1, CDC 7600, etc.
2. Machine II is obtained from Machine I by simply changing the way the data is read from the DM. Machine II reads a bit from every word in the memory, instead of reading all bits of a single word. Examples are the ICL DAP, STARAN, etc.
3. Machine III is derived from the combination of Machines I and II. There are two PUs, one horizontal and one vertical. An example is Sanders Associates' OMEN-60.
4. Machine IV consists of a single CU and as many independent PE's as possible, each of which has a PU and a DM. Communication between these components is restricted to take place only through the CU. An example is PEPE.
5. Machine V is derived from Machine IV by adding interconnections between processors. An example is the ILLIAC IV computer.
FIGURE 1.3: Shore's Parallel Computer Classification
(block diagrams of Machines I-VI, built from instruction memory IM, control unit CU, horizontal/vertical processing units PU and data memory DM, read word-slice or bit-slice wise)
6. Machine VI: the difference between this machine and the previous machines is that the PU's and the DM are no longer individual hardware components; instead they are constructed on the same IC board. Examples are the associative memories and associative processors.
1.3.4 Handler's Parallel Computers Classification
Wolfgang Handler [Handler, 1977] suggested his classification outline for identifying the parallelism degree and pipelining degree built into the hardware structures of a computer system. There are three subsystem levels of parallel-pipeline processing according to his classification:
1. Processor Control Unit (PCU);
2. Arithmetic Logic Unit (ALU);
3. Bit-Level Circuit (BLC).
PCU and ALU are well defined. Each PCU corresponds to one processor. The ALU is equivalent to the PE in SIMD array processors. The BLC corresponds to the combinational logic circuitry needed to perform 1-bit operations in the ALU. Let C be defined as a computer system; then C can be characterized by a triple containing six independent entities, as defined below:

    T(C) = <K×K', D×D', W×W'>                  (1.3.4)

where,
K = the number of processors (PCU's) within the computer;
K' = the number of PCU's that can be pipelined;
D = the number of ALU's (or PE's) under the control of one PCU;
D' = the number of ALU's that can be pipelined;
W = the word length of an ALU (or PE);
W' = the number of pipeline stages in all ALU's (or PE's).
As an example, the Texas Instruments Advanced Scientific Computer (TI-ASC) has one controller controlling four arithmetic pipelines, each with a 64-bit word length and eight pipeline stages. Thus, we have

    T(ASC) = <1×1, 4×1, 64×8> = <1, 4, 64×8> .    (1.3.5)
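The triple of (1.3.4) is just a record of six integers, so it can be represented directly; the field names below are my own shorthand for Handler's six entities, and the TI-ASC values come from equation (1.3.5):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HandlerTriple:
    """Handler's characterization T(C) = <K x K', D x D', W x W'>."""
    k: int         # K:  processors (PCUs) in the computer
    k_pipe: int    # K': PCUs that can be pipelined
    d: int         # D:  ALUs (or PEs) under the control of one PCU
    d_pipe: int    # D': ALUs that can be pipelined
    w: int         # W:  word length of an ALU (or PE)
    w_stages: int  # W': pipeline stages in all ALUs (or PEs)

    def __str__(self):
        return f"<{self.k}x{self.k_pipe}, {self.d}x{self.d_pipe}, {self.w}x{self.w_stages}>"

# TI-ASC from equation (1.3.5): one controller, four arithmetic pipelines,
# 64-bit words, eight pipeline stages.
ti_asc = HandlerTriple(k=1, k_pipe=1, d=4, d_pipe=1, w=64, w_stages=8)
print(ti_asc)  # <1x1, 4x1, 64x8>
```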
1.3.5 Other Parallel Computers Classification
There are some other classification approaches, less significant than the former four, based mainly upon the notion of parallelism. Hobbs et al. [Hobbs, 1970] in 1970 distinguished the parallel architectures into Multiprocessors, Associative processors, Network or Array processors and Functional machines.
Murtha and Beadles [Murtha, 1964] based their taxonomy view upon the parallelism properties, attempting to underline the differences between multiprocessors and a highly parallel organization. According to them, the parallel organization could be classified into the general-purpose network computers, the special-purpose network computers with global parallelism and, finally, the non-global semi-independent network computers with local parallelism.
1.4 PIPELINE COMPUTERS
In this section, the structures of pipeline computers and vector processing principles are studied. Pipelining offers an economical way to realize temporal parallelism in digital computers. To achieve pipelining, one must subdivide the input task (process) into a sequence of subtasks, each of which can be executed on a dedicated facility, called a stage or station. Stations are connected via buffers or latches. Pipeline computers have distinct pipeline processing capabilities, e.g. there can be pipelining between the processor and the I/O unit. Within the processor, there can be pipelining between instructions.
A pipeline processor consists of a sequence of processing circuits, called stages, through which a data stream passes. Each stage does some partial processing on the data, and a final result is obtained after the data has passed through all the stages of the pipeline. Figure 1.4 exemplifies a typical pipeline computer. This diagram shows both scalar arithmetic pipelines and vector arithmetic pipelines. The instruction processing unit is itself pipelined with three stages.
As an example, consider the four steps of processing an instruction: fetch the instruction (F), decode the instruction (D), fetch the operand (O), and finally execute it (E).
In a non-pipelined computer, the above steps must be completed before the next instruction can be issued, as shown in Figure 1.5.

FIGURE 1.5: Non-Pipelined Processor.
FIGURE 1.4: Functional Structure of a Modern Pipeline Computer with Scalar and Vector Capabilities.
(IS: instruction stream; O: operand fetch; K: control signal; main memory feeds an instruction fetch/decode/operand-fetch pipeline, which dispatches to scalar pipelines SP1...SPn with scalar registers and to vector pipelines VP1...VPm with vector registers)
While in the pipelined computer, the four stages F, D, O and E are executed in an overlapped manner (Figure 1.6). After constant time intervals, the output of one stage is switched to the next station. A new instruction is fetched (F) in every time cycle, and stage (E) produces an output in every time cycle, even though performing a single instruction takes multiple pipeline cycles.
Pipelining can be presented at more than one level in the design of computers. Ramamoorthy and Li [Ramamoorthy, 1977] introduced many theoretical considerations of pipelining and presented a survey of comparisons between various pipeline machine designs.
FIGURE 1.6: A Pipelined Processor (stations S1-S4 executing F, D, O, E in an overlapped manner).
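Assuming one time cycle per stage, the cycle counts for the two execution styles of Figures 1.5 and 1.6 can be sketched as follows (a toy model for illustration, ignoring hazards and stalls):

```python
def nonpipelined_cycles(n_instructions, n_stages=4):
    """Each instruction completes all stages (F, D, O, E) before the next issues."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=4):
    """Stages overlap: after the pipeline fills (n_stages cycles),
    one instruction completes in every subsequent cycle."""
    return n_stages + (n_instructions - 1)

n = 100
print(nonpipelined_cycles(n))  # 400
print(pipelined_cycles(n))     # 103
```

For long instruction streams the ratio approaches the number of stages, here a roughly four-fold throughput gain, which is the temporal parallelism pipelining buys.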
1.5 SIMD
The SIMD (array processor) is a synchronous parallel computer with multiple arithmetic logic units (ALUs), called processing elements (PEs). These identical PEs are arranged in an array and controlled by a single control unit, which decodes and broadcasts the instructions to all processors within the array. Each PE has its own private memory which provides it with its own data stream. Hence the PEs are synchronized to perform the same function at the same time. There are two essential reasons for building array processors. The first is economic: it is cheaper to build P processors with only a single control unit than P similar computers. The second concerns interprocessor communication: the communication bandwidth can be more fully utilized. An array processor computer is depicted in Figure 1.7.
The advantages of SIMD computers seem to be greatest when applied to the solution of problems in matrix algebra or finite difference methods for the solution of partial differential equations. It is known that most algorithms in this area require the same type of operations to be repeated on large numbers of data items. Hence, the problem can be distributed to processors that can run simultaneously.
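The lock-step behaviour described above can be mimicked in a few lines; real array processors broadcast the instruction in hardware, so this pure-Python loop is only an illustration:

```python
# Each PE holds one datum in its private memory; the control unit broadcasts
# a single instruction, which every PE applies to its own datum in lock-step.
def simd_step(private_memories, instruction):
    """Apply one broadcast instruction across all PEs simultaneously (in model)."""
    return [instruction(x) for x in private_memories]

# Example: the same scaling operation applied everywhere, as would occur in a
# finite difference sweep where each PE owns one grid point.
pes = [0.0, 1.0, 4.0, 9.0]
halved = simd_step(pes, lambda x: x / 2)
print(halved)  # [0.0, 0.5, 2.0, 4.5]
```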
Obviously, a complete interconnection network, where each processor
is connected to all other processors, is expensive and unmanageable by
both the designer and the user of the system. Therefore other
interconnection patterns have been proposed, and these are of primary
importance in determining the power of parallel computers.
The interconnection pattern used in ILLIAC IV arranges the
processors in a two-dimensional array where each processor is connected
to its nearest four neighbours. Processors can also be arranged as a
regular mesh in a perfect shuffle pattern, or in various
special-purpose configurations for merging, sorting or other specific
applications.
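The ILLIAC IV style of working can be sketched in a few lines (a simulation only; the mesh size, data and function names are our illustration): one operation is broadcast, and every PE applies it in lockstep to its own datum and its four nearest neighbours, here a Jacobi-like averaging step of the finite-difference kind the text says suits SIMD machines.

```python
def simd_step(grid, op):
    """Apply the same operation at every PE of a 2-D mesh in lockstep.
    Each PE sees its own value and its four nearest neighbours
    (PEs on an edge simply have fewer neighbours)."""
    n, m = len(grid), len(grid[0])
    def nb(i, j):
        return grid[i][j] if 0 <= i < n and 0 <= j < m else None
    new = [[0.0] * m for _ in range(n)]
    for i in range(n):          # conceptually, all PEs execute at once
        for j in range(m):
            vals = [v for v in (nb(i-1, j), nb(i+1, j), nb(i, j-1), nb(i, j+1))
                    if v is not None]
            new[i][j] = op(grid[i][j], vals)
    return new

# One averaging step on a 3x3 mesh with a single hot spot in the middle:
average = lambda centre, neighbours: sum(neighbours) / len(neighbours)
grid = [[0, 0, 0], [0, 4, 0], [0, 0, 0]]
print(round(simd_step(grid, average)[0][1], 3))  # 1.333: heat reaches an edge PE
```

The sequential loops here only stand in for what the array does in a single synchronized beat.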
FIGURE 1.7: Array Processor Computer (a control unit with control memory for scalar processing; an array of PEs, each with its own processor and memory; and an inter-PE connection network for data routing).
1.6 MULTIPROCESSING SYSTEMS (MIMD)
In this section our emphasis is on multiprocessing computers. The
American National Standards Institute (ANSI) defined the multiprocessor
as: "A computer employing two or more processing units under integrated
control", which is hardly complete, since the two most significant
features of this type of computer, i.e. the sharing and interaction
concepts, were not included.
In 1977 Enslow [Enslow, 1977] offered a more complete definition of a
multiprocessor system based on its characteristics. In addition to the
ANSI definition, he included the following conditions:
1. All the processors, with approximately comparable capabilities, must share
access to a common memory, I/O channels, control units and devices.
2. The entire complex is under the control of a single operating system
providing the interaction between processors and their programs at
the task, instruction and data levels.
The basic diagram of a MIMD machine of P processors is illustrated in
Figure 1.8. Although each processor includes its own CU, a high-level
CU may be used to control the transfer of data and to assign tasks and
sequences of operations between the different processors.

FIGURE 1.8: A Typical MIMD System.

The universal class of MIMD computers may be classified, depending
on the amount of interaction, into two main classes:
1. A tightly-coupled system;
2. A loosely-coupled system.
In a tightly-coupled system, as illustrated in Figure 1.9a, the
processors operate under the strict control of the bus
assignment scheme, which is implemented in hardware at the bus/processor
interface.
In a loosely-coupled system, as shown in Figure 1.9b, the
communication and interaction between processors take place on the
basis of information exchange.
The tightly-coupled multiprocessor has a noticeable performance
advantage over the loosely-coupled multiprocessor.
Examples of some multiprocessor systems which we shall briefly
discuss to exemplify the machines' characteristics are:
1. The S.1 system;
2. The Neptune system.
The S.1 multiprocessor system can be described as a high-speed
general-purpose multiprocessor developed at the Lawrence Livermore
National Laboratory. The S.1 is implemented with S.1 uniprocessors
called Mark IIAs, as illustrated in Figure 1.10. This structure
consists of 16 independent Mark IIA uniprocessors which share 16 memory
banks via a crossbar switch. Each processor has a private cache which
is transparent to the user. Each uniprocessor, crossbar switch and
memory bank is connected to a diagnostic processor which can probe,
report and change the internal state of all modules that it monitors.
For the S.1 system there exist a single-user operating system, a
multi-user operating system and an advanced operating system. It also
supports multitasking by the division of problems into cooperating
tasks.
The Neptune system is another system, built at Loughborough in 1981
under Professor Evans, and used extensively for the development of
parallel MIMD algorithms (see Barlow et al [Barlow, 1981]).
FIGURE 1.9: Multiprocessor Systems: (a) tightly-coupled, with processors sharing one memory; (b) loosely-coupled, with processors holding separate memories.
FIGURE 1.10: The Structure of the S.1 Mark IIA Multiprocessor (16 uniprocessors, each with data and instruction caches, connected through a crossbar switch to 16 memory banks; diagnostic processors on every module; I/O processors, mass storage, real-time units and peripheral equipment).
The Neptune system consists of four Texas Instruments 990/10
minicomputers configured as illustrated in Figure 1.11. The
instruction set includes both 16-bit word and 8-bit byte addressing
capability. The system comprises four processors, each with its
private memory of 128 Kb, and one processor (P0) has a separate 10 Mb
disc drive. Access to the local memory is made by a local bus
(or TILINE) attached to each processor. Since the TILINE
coupler is designed so that the shared memory follows contiguously
from the local memory of each processor, each processor can
access 192 Kb of memory. Each processor runs under the powerful DX10
uniprocessor operating system, which is a general-purpose, multitasking
system. The DX10 operating system has been adapted to enable the
processors to run independently and yet to cooperate in running programs
with data in the shared memory.
Although the processors are identical in many hardware features,
they differ in their speed. The relative speeds of processors
P0, P1, P2 and P3 are 1.0, 1.037, 1.006 and 0.978 respectively. This
will, however, reduce the efficiency of the system and decrease the
measured performance of an algorithm with synchronization.
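The cost of this speed imbalance can be estimated with a back-of-envelope sketch (our assumption, not a measurement from the thesis: equal work per processor and a barrier at the end of each step, so every step lasts as long as the slowest processor needs).

```python
speeds = [1.0, 1.037, 1.006, 0.978]  # relative speeds of P0..P3

# With one unit of work W = 1 per processor and a barrier per step,
# the step time is set by the slowest processor.
step_time = max(1.0 / s for s in speeds)   # time needed by the slowest
useful_work = len(speeds) * 1.0            # 4 units of work actually done
capacity = step_time * sum(speeds)         # work the pool could have done
efficiency = useful_work / capacity

print(round(efficiency, 3))  # about 0.973: roughly 2.7% lost to imbalance
```

Even a 6% spread between the fastest and slowest processor thus shows up directly in the synchronized efficiency.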
FIGURE 1.11: The Neptune Configuration (P: processor; M: memory).

Various interconnection networks, the main factor in multiprocessor
hardware organization, with different characteristics such as bandwidth,
delay and cost, ranging from the shared common bus to the crossbar
switch, have been suggested. Enslow in 1977 [Enslow, 1977] identified
three intrinsically different organizations, namely:
1. The time-shared common bus;
2. Crossbar switch networks;
3. Multiport memory.
The time-shared common bus represents the simplest interconnection
system for either single or multiple processors. It consists
of a common communication path connecting all the functional units:
a number of processors, memories, and I/O devices. The system
capacity is limited by the bus bandwidth, and system performance may be
degraded by adding new functional units.
To overcome the insufficiency of the time-shared bus organization,
the crossbar switch is used. The crossbar switch provides a separate
path for every processor, memory module, and I/O unit, so that if the
multiprocessor system contains P processors and M memories, the
crossbar needs (P×M) switches. In fact, it is difficult to build a large
system based on the crossbar switch concept because the complexity
grows at the rate of O(n²) for n devices. The important characteristics
of these systems are the extreme simplicity of the switch-to-functional-unit
interfaces and the ability to support concurrent transfers to all
memory modules.
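The quadratic growth is easy to see in numbers (a trivial sketch; the system sizes are illustrative):

```python
def crossbar_switches(p, m):
    """A full crossbar needs one switch at every (processor, memory)
    crosspoint, i.e. P x M switches in total."""
    return p * m

# Doubling a square P = M system quadruples the switch count: O(n^2).
for n in (4, 8, 16, 32):
    print(n, crossbar_switches(n, n))   # 16, 64, 256, 1024 switches
```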
In multiport memory systems, the functions of control,
switching between processors, and priority arbitration are centralized
at the memory interface. Hence every processor has a private bus to
every passive unit, i.e. memory and I/O units.
The principal distinguishing features of such a system are
expensive memory control, expansion from a uniprocessor to a
multiprocessor system using the same hardware, a large number of cables
and connectors, and system limitation by the memory port design.
Besides the three interconnection networks presented, there are
many others which can be valuable for multiprocessor organization, such
as the Omega network [Lawrie, 1975], the Augmented Data Manipulator
[Siegel, 1979] and the Delta network [Patel, 1981].
In Section 1.9 we will also present the Balance 8000 system as a
multiprocessor system.
1.7 DATAFLOW COMPUTERS
A new approach to parallel processing, i.e. Data Flow, is briefly
outlined in this section. The computer architectures reviewed in the
previous sections are known as control flow (CF) (von Neumann)
machines. In the CF computer, the program is stored in the memory as a
serial procession of instructions, and this is one of the essential
difficulties in exploiting the parallelism of algorithms in the
CF model of computation.
An alternative architectural model for computer systems has been
proposed to exploit the parallelism of algorithms: the
Data-Flow (DF) (also known as Data-Driven) model of computation. In
the DF system, the course of computation is controlled by the flow of
data in the program. In other words, an operation is executed as and
when its operands are available. This means that the sequence of
operations in the DF system responds to the precedence constraints
imposed by the algorithm used, rather than to the location of the
instructions in the memory. On that account, the DF computer can
perform simultaneously in parallel as many instructions as it is given,
and it distributes each result to all subtask instructions which make
use of this partial result as an operand.
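The firing rule just described, an instruction executing as soon as all of its operands have arrived, can be sketched as a tiny interpreter (the node names and example expression are our own illustration, not any particular DF machine's instruction set):

```python
import operator

# Each node: (operation, [operand names]).  A node may fire only
# when every one of its operands has a value.
graph = {
    "add": (operator.add, ["a", "b"]),
    "sub": (operator.sub, ["c", "d"]),
    "mul": (operator.mul, ["add", "sub"]),  # consumes both partial results
}

def run(graph, inputs):
    values = dict(inputs)
    pending = dict(graph)
    while pending:
        # Fire every node whose operands are all present; on a real DF
        # machine these would execute simultaneously on the operating units.
        ready = [name for name, (_, args) in pending.items()
                 if all(a in values for a in args)]
        if not ready:
            raise ValueError("deadlock: some operands never arrive")
        for name in ready:
            op, args = pending.pop(name)
            values[name] = op(*(values[a] for a in args))
    return values

result = run(graph, {"a": 2, "b": 3, "c": 10, "d": 4})
print(result["mul"])  # (2 + 3) * (10 - 4) = 30
```

Note that "add" and "sub" become ready in the same pass: the data, not a program counter, decides the order of execution.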
DF computers can be grouped, depending on the problems tackled,
into two main classes:
1. The static structure;
2. The dynamic structure.
In the static structure, the loops and subroutine calls are
unfolded at compile time so that each instruction is performed only
once. Figure 1.12 illustrates the static-structure DF machine, which
consists of the following components: a store which holds the
instruction cells (packets), each having space for the operation, the
operands and pointers to the successors, and a set of operating units
to execute the operations. The maximum throughput is determined by the
speed and number of operating units, the memory bandwidth and the
interconnection system. The most significant factors that reduce
the throughput are the degree of concurrency available in the program,
the memory access and interconnection network conflicts, and finally
the broadcasting of the results.
In the dynamic structure, the operands are labelled so that a
single copy of the same instruction can be used several times for
different instances of a loop or subroutine. Figure 1.13 shows the
dynamic-structure DF machine. The main components of the Manchester
DF computer [Gurd, 1985] are the token queue that stores computed
results, the token matching unit that combines the corresponding tokens
into instruction arguments, the instruction store that contains the
ready-to-execute instructions, the operating units, and the I/O switch
for communication with the host. The degradation factors are similar
to those of the static case, except for the additional overhead of
token label matching.
Because of the degradation factors recounted above, DF systems are
only attractive for cases in which the concurrency exhibited is
several hundred instructions long. The significant advantage of
DF computers is the exploitation of concurrency at a low level of
the performance hierarchy, since this allows the maximum utilization of
all the available concurrency.
FIGURE 1.12: The Static Data-Flow Computer (a store of instruction cells feeding a set of operating units through distribution and arbitration networks).

FIGURE 1.13: The Dynamic Data-Flow Computer (token queue, matching unit, overflow unit, instruction store and processing units, with an I/O switch to and from the host).
1.8 VLSI SYSTEMS AND TRANSPUTERS
Owing to recent advances in hardware technology, Large
Scale Integrated (LSI) circuitry has become so dense that a single
silicon LSI chip may contain tens of thousands of transistors. The
rapid advance in LSI technology leads to Very Large Scale
Integrated (VLSI) circuits, where the number of transistors per chip
will be increased by another factor of 10 to 100. The advent of LSI
and VLSI chips has given a large boost to the research and development
of array processor and multiprocessor architectures.
The key factors of VLSI technology are its capacity to implement
enormous numbers of devices on a chip, its low cost and its high degree
of integration, while the main VLSI problem is to overcome the design
complexity. The sizes of wires and transistors approach the limits of
the photolithographic process, so it becomes literally impossible to
achieve further miniaturization, and actual circuit area becomes a key
issue. In addition, the chip area is also limited in order to maintain
a high chip yield, and the number of pins is limited by the finite size
of the chip perimeter. These restrictions form the basis of the VLSI
paradigm.
The separation of the processor from its memory and the
limited opportunities for synchronous processing are the principal
problems in conventional (von Neumann) computers. VLSI designs
offer more flexibility than conventional (von Neumann) systems to
overcome these difficulties, since memory and processing architectures
can be implemented with the same technology and in close vicinity.
Many authors have investigated the requirements of parallel
architectures for VLSI, among them Kung [Kung, 1982], Dew [Dew, 1982]
and Seitz [Seitz, 1982].
Dew classified VLSI architectures by the following requirements:
1. Simple and regular design, accommodating a few modules which are
replicated many times, while the grain of the modules depends on
the application.
2. A very high degree of parallelism, through both pipelining and
multiprocessing.
3. Communication and switching: a major difference between VLSI design
and the earlier digital technologies is that the communication paths
will dominate both the area and the time delay. This is because the
speed of the devices increases as the feature size decreases, while
the propagation time along a wire does not.
One of the novel ideas to emerge from VLSI research into parallel
architectures is that of systolic array processors. The concept of
systolic architectures, pioneered by Kung [Kung, 1982], is basically
a general methodology for directly mapping algorithms onto an array of
PEs. The significant differences between a systolic array processor
and a multiprocessor lattice are that in a systolic array the
communication is only between neighbouring processing cells (i.e. no
global bus), communication with the outside world occurs only at the
boundary cells, and the processing cells are "hardwired" and not
programmed from the host. The fundamental principle of a systolic
architecture, the systolic array in particular, is illustrated in
Figure 1.14. By replacing a single PE with an array of PEs, a higher
computation throughput can be achieved without increasing the memory
bandwidth.
FIGURE 1.14: Systolic Design Principle: (a) the conventional organization, memory feeding a single PE; (b) a systolic array processor, memory feeding a chain of PEs.
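A classic instance of this principle is Kung's systolic FIR filter: each cell holds one weight and performs one multiply-accumulate per beat as the data is pumped through the array. The sketch below (weights and input stream are illustrative) computes the same outputs a systolic array would produce; in hardware the per-cell products for neighbouring outputs overlap in time.

```python
def systolic_fir(weights, stream):
    """Convolution y[i] = sum over k of w[k] * x[i-k], arranged as a
    1-D systolic array would compute it: cell k holds weight w[k] and
    contributes w[k] * x[i-k] to output i as the sample passes through."""
    k = len(weights)
    out = []
    for i in range(len(stream)):
        acc = 0
        for cell in range(k):       # contributions of cells 0..k-1
            if i - cell >= 0:
                acc += weights[cell] * stream[i - cell]
        out.append(acc)
    return out

print(systolic_fir([1, 2], [3, 4, 5]))  # [3, 10, 13]
```

Each input sample is fetched from memory once yet is used by every cell, which is exactly the bandwidth saving Figure 1.14 illustrates.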
One problem associated with systolic array systems is that the
data and control movements are governed by global timing
reference beats, so in order to synchronize the cells, extra delays are
often used to ensure correct timing. To overcome this difficulty, Kung
[Kung, 1985] suggested taking advantage of the data and control flow
locality inherently possessed by most algorithms. This enables a
data-driven, self-timed approach to array processing. Ideally, such an
approach replaces the requirement of correct timing by correct
sequencing.
1.8.1 Transputer System
Another important development, which is set to make a major impact
in the field of multiprocessor systems, is the INMOS Transputer [INMOS,
1984]. What distinguishes the INMOS transputer from its competitors is
that it has been designed to exploit VLSI. The transputer is a single-chip
microprocessor containing a memory, a processor and communication
links for connection to other transputers, and it affords direct
hardware support for the parallel language OCCAM. The structure of a
transputer is given in Figure 1.15. In the transputer, INMOS have taken
as many components of a traditional von Neumann computer as possible
and implemented them on a single 1 cm² chip, while at the same time
providing a high level of support for a synchronous view of computation.
Communication between VLSI devices has a very much lower bandwidth
than communication between subsystems on-chip. Transputer systems
communicate asynchronously, and therefore they can be synthesized from
any number of processors. Within the transputer, all components perform
concurrently; each of the four links and the floating-point coprocessor
can execute useful work while the processor is performing other
instructions.
Since the transputer is designed for the OCCAM language,
concurrency may be described between transputers in the system or
within a single transputer, which means the transputer can support
internal concurrency. The processor contains a scheduler which
enables any number of processes to run on a single transputer, sharing
the processor time, while each link provides two unidirectional channels
for point-to-point communication. Processes are held on two process
queues. The active queue holds the process being executed and
any other active processes waiting to be executed, while the inactive
queue holds those processes waiting on an input, output or timer
interrupt. Since there is little status to be saved, the
process swap times are very small, depending on the instruction being
executed. Two further microinstructions, start process and end process,
add and delete processes from the active process queue. Input message
and output message instructions provide communication between processes
on the same transputer or between processes on different transputers.

FIGURE 1.15: Transputer Architecture (system services, on-chip RAM, processor, four link interfaces, event pin and external memory interface).
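The two-queue scheme can be mimicked in a few lines (the process behaviour, action strings and event names are our illustration, not INMOS microcode): the scheduler round-robins over the active queue, and a process that waits moves to the inactive queue until its event arrives.

```python
from collections import deque

def schedule(processes, events):
    """Round-robin over an active queue; a process that asks to wait
    moves to the inactive queue until its named event is delivered."""
    active = deque(processes)       # entries: (name, list of actions)
    inactive = {}                   # event name -> waiting process
    trace = []
    while active or inactive:
        if not active:
            ev = events.pop(0)      # deliver the next external event
            active.append(inactive.pop(ev))
            continue
        name, actions = active.popleft()
        act = actions.pop(0)
        trace.append((name, act))
        if act.startswith("wait:"):         # like an input/timer wait
            inactive[act.split(":")[1]] = (name, actions)
        elif actions:                       # work remaining: requeue
            active.append((name, actions))
    return trace

procs = [("P1", ["run", "wait:io", "run"]), ("P2", ["run", "run"])]
trace = schedule(procs, ["io"])
print(trace[-1])  # ('P1', 'run'): P1 finishes only after its io event
```

While P1 waits on its event, P2 keeps the processor busy, which is the point of keeping blocked processes off the active queue.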
1.9 THE BALANCE 8000 SYSTEM
The Balance 8000 is an expandable, high-performance MIMD parallel
computer that employs from 2 to 12 32-bit CPUs in a tightly-coupled
manner, using a new processor pool architecture. By sharing its
processing load among up to 12 architecturally identical microprocessors
and employing a single copy of a Unix-based operating system, the Balance
8000 system eliminates the barriers associated with multiprocessor
systems. It delivers up to 5 million instructions per second (MIPS), and
its power grows almost linearly as more processors are added. To make
the most efficient use of its multiprocessing power, the system
dynamically balances its load; that is, it automatically and continuously
assigns jobs to run on any processor that is currently idle or busy
with an assigned lower-priority job.
At the same time, the system is easily extendible. We can add
CPUs, memory, and I/O subsystems within a node, or more nodes within
a distributed network, or more distributed and local-area networks,
all with no changes in software. From the hardware point of view,
the system consists of a pool of 2 to 12 processors, a high-bandwidth
bus, up to 28 Mbytes of primary storage, a diagnostic processor, up to
4 high-performance I/O channels, and up to 4 IEEE-796 (Multibus) bus
couplers. It is managed by a version of the UNIX 4.2 BSD operating
system, enhanced to provide compatibility with UNIX System V and to
exploit the Balance parallel architecture. Figure 1.16 shows the
main functional blocks of the Balance 8000 system.
Each processor in the pool is a subsystem containing three VLSI
parts: a 32-bit CPU, a hardware floating-point accelerator, and a paged
virtual memory management unit. Each pair of subsystems is on one
circuit board.

FIGURE 1.16: Balance 8000 Block Diagram (2-12 32-bit CPUs and 2-28 Mbytes of memory on the SB8000 bus, with Multibus interfaces to disks and terminal multiplexers, an SCED board with Ethernet and system console, and optional special-purpose accelerators and controllers).

To reduce processor wait periods and to minimize bus traffic in the
system, each processor contains a cache memory. The two-way
set-associative cache consists of 8 Kbytes of very high-speed memory
holding accessed instructions and data, which means that requests for
the same data are satisfied from the cache rather than from the
primary storage.
Designing a cache for a processor pool architecture is difficult
for several reasons. Since the data in each cache represents a copy of
some data in the primary memory, it is important that all copies and
the original remain the same, even when a cache is updated. To ensure
that, the system employs a write-through mechanism, in which each write
cycle goes through to the bus and memory, in addition to updating the
appropriate cache. Also, if two processors have both recently read
the same data into their respective caches, and one of them updates
its cache, the second processor cannot use its now stale data,
because of the cache's bus-watching logic. As illustrated in Figure
1.17, this logic continuously monitors all write cycles on the bus
and compares their addresses with those in its own cache, to see if any
writes affect its own contents. When such an address appears, the cache
invalidates the entry in question.
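The write-through plus bus-watching behaviour can be modelled directly (a toy model; real hardware snoops cache lines, not single words, and the class and variable names are our own):

```python
class Cache:
    def __init__(self, memory, bus):
        self.lines = {}            # address -> value (valid entries only)
        self.memory = memory
        self.bus = bus
        bus.append(self)           # join the shared bus

    def read(self, addr):
        if addr not in self.lines:         # miss: fetch from primary memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value          # write-through to primary memory
        for other in self.bus:             # every cache watches the bus
            if other is not self:
                other.snoop(addr)

    def snoop(self, addr):
        self.lines.pop(addr, None)         # invalidate a now-stale copy

memory = {0x10: 7}
bus = []
c1, c2 = Cache(memory, bus), Cache(memory, bus)

c1.read(0x10); c2.read(0x10)   # both caches now hold address 0x10
c1.write(0x10, 99)             # write-through; c2's copy is invalidated
print(c2.read(0x10))           # 99: the miss refetches the fresh value
```

Because the stale entry was invalidated, the second processor's next read misses and picks up the written-through value instead of the old one.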
The last component of the processor subsystem is the "System Link
and Interrupt Controller" (SLIC) chip. A SLIC appears with every
processor in the system, as well as on every memory controller, I/O
channel, and bus controller board. Communication between the SLICs is
accomplished with a simple command-response packet carried over a
dedicated bus. The controller serves several functions. Firstly, it is
the key element of the system's global interrupt system. Secondly, its
purpose is to manage a cache of single-bit semaphores. Finally, the
controller serves as a convenient communication path among modules.
For example, system diagnostic and debugging routines take modules on-
and off-line using the SLIC bus, which carries error management
information.

FIGURE 1.17: Bus-Watching Logic Monitor (a block updated in main memory by CPU 1 is marked invalid in CPU 2's cache).

FIGURE 1.18: SLIC Chips (each SLIC contains a processor interface, an interrupt controller, a semaphore cache, error control, and a receiver/transmitter with contention resolution on the shared SLIC bus).
The DYNIX operating system is an enhanced version of UNIX 4.2 BSD
that can also emulate UNIX System V at the system-call and command
levels. To support the Balance multiprocessing architecture, the DYNIX
operating system kernel has been made completely shareable, so that
multiple CPUs can execute identical system calls and other kernel code
simultaneously. The DYNIX kernel also adjusts the memory allocation
for each process to moderate the process's paging rate and to tune
virtual memory performance for the entire system.
Applications programming on the Balance 8000 system is supported
by compilers for the main programming languages. We can use a single
language or a combination of languages to suit our application. Later,
in Chapter 2, we will explain in detail the language tools available
for use with the DYNIX operating system.
CHAPTER 2

PARALLEL PROGRAMMING AND LANGUAGES

One must have a good memory to be able
to keep the promises one makes.
F.W. Nietzsche.
2.1 INTRODUCTION
The recent advances in hardware technology and computer
architecture have led to faster and more powerful parallel computer
systems, from which we benefit with considerable throughput and
speed when they are applied to solve large problems. Programs for
parallel computer systems require some extra programming facilities,
which come under the heading of parallel programming, to distinguish
it from the conventional programming of single-processor computers.
As demonstrated by the various architectures of existing parallel
computers, parallelism can be achieved in a variety of ways.
Attempting to summarize all these known ways of achieving
parallelism and to categorize them into several distinct levels, we
obtain:
a. Job level: between jobs; between phases of a job.
b. Program level: between parts of a program; within DO-loops.
c. Instruction level: between phases of instruction execution.
d. Arithmetic and bit level: between elements of a vector operation; within arithmetic logic circuits.
The design of algorithms for parallel computers is greatly
influenced by the computer architectures and the high-level
languages which have been used.
This chapter will elucidate parallel programming and parallel
algorithms.
2.2 PARALLEL PROGRAMMING
The two new concepts behind the recent ideas of parallel
programming theory are parallelism and asynchronism of programs.
Gill [Gill, 1958] defined parallel programming as the control of
two or more operations which are performed virtually simultaneously,
each of which entails following a stream of instructions.
There seem to have been two trends in the development of high-level
languages: those that owe their existence to an application or
class of applications, such as FORTRAN, COBOL and C, and those that
have been developed to further the art of computer science, such as
ALGOL, LISP, PASCAL and PROLOG. The development of the former has to
some extent been stifled by the establishment of standards.
Conversely, to some extent, the lack of standards and the desire to
invent have led to the proliferation of versions of the latter. An
attempt to produce a definitive language, incorporating the 'best'
features of the known art, and to bind these into an all-embracing
standard was made by the U.S. Department of Defense, and the
resulting language, ADA [Tedd et al, 1984], has been adopted.
Although concurrency is addressed in ADA, there is now far more
practical experience of concurrency, and new languages have been
developed, such as OCCAM, which treat concurrency in a simpler,
more consistent, and more formal manner.
The numerous and vastly different applications and underlying
models of parallelism will require radically different language
structures. Hockney and Jesshope [Hockney, 1988] suggested three
major divisions in language development:
1. Imperative languages.
2. Declarative languages.
3. Objective languages.
An imperative language is one in which the program instructs the
computer to perform sequences of operations, or, if the system allows
it, disjoint sequences of instructions operating concurrently.
Imperative languages have really evolved from early machine code,
by successive abstraction away from the hardware and its limited
control structures. This has had beneficial effects, namely the
improvement of programmer productivity and the portability obtained
by defining a machine-independent programming environment. An
imperative language, however, even at its highest level of
abstraction, will still reflect the algorithmic steps in reaching a
solution. In addition to the retention of this notion of sequences,
these languages also retain a strong flavour of the linear address
space still found in most machines. Harland [Harland, 1985]
introduced concurrency into this model, where many disjoint sequences
of instructions may proceed in parallel. By abstracting concurrency,
the non-deterministic sharing of the CPU's cycles can be obtained.
The declarative style of programming has had the most profound
effect on computer architecture research during this period. This
style of programming, with its heavy use of dynamic data structures,
does not map well onto the classical von Neumann architecture. It is
also based on a more mathematical foundation, with the aim of moving
away from descriptions of algorithms towards a rigorous specification
of the problem. These languages are based either on the calculus of
functions, the lambda calculus, or on a subset of predicate logic. Since
declarative languages are based on mathematics, it is possible to
formally verify the software systems created with them. A further
advantage of the declarative approach is that such languages can supply
implicit parallelism, as well as implicit sequencing [Shapiro, 1984].
Objective languages are based on two main techniques, encapsulation
and inheritance, which form a more pragmatic foundation than the rigour
of logic or functional languages. Hence they can provide a potential
solution to the software problem. They also provide a model of
computation which can be implemented on a distributed system.
Encapsulation is the more straightforward of the two and is often used
as a good programming technique [Booch, 1986]. Encapsulation hides data
and gives access to it only through the shared methods or procedures
provided for that purpose.
Because encapsulation constrains the programmer's objects in this
way, a mechanism must be provided to enhance an object; this is the
second technique of objective languages. Inheritance allows the
programmer to create classes of objects, where those classes may share
common access mechanisms or common data formats. The mechanism for
implementing inheritance is to replace the procedure call to an object
by a mechanism involving message passing between objects.
There are at least three emerging parallel software design
approaches, distinguished by the extent to which the parallelism is
concealed by the hardware structure. In other words, for some
architectures the parallelism is hidden by the hardware itself, whilst
for others it is revealed to the user so that appropriate decisions are
made as and when needed. The first of these approaches is the automatic
translation of sequential programs, or implicit parallelism.
The second approach is explicit parallelism, in which the programmer
manages the concurrency of the application by coding directly in a
concurrent language. The third approach, advocated by Backus and
Dennis [Backus, 1978], is based on the functional language model and
is implemented on most DF computers. Since it relies on the
programmer's ability, the explicit method could rapidly become
unworkable, as it is impossible to keep "juggling" with a large number
of tasks. The functional approach, which is the most natural form of
handling parallelism, can achieve the highest degree of concurrency,
since the instructions are scheduled for execution directly by the
availability of their operands. However, the high cost of implementing
this unstructured low-level concurrency makes this method of less
importance, at least for the present moment.
The explicit and implicit parallelism detection approaches will
be discussed in Sections 2.2.1 and 2.2.2.
2.2.1 Implicit Parallelism
Much existing sequential software naturally exhibits some form of
concurrency which needs only to be identified and then exploited in
the design of parallel algorithms. The implicit approach to
parallelism relies on the automatic detection of parallel processable
tasks within a sequential algorithm. This approach requires
sophisticated compiling and supervisory programs, with their related
overheads. Its effectiveness lies in the fact that it is independent
of the programmer: existing programs need not be modified to take
advantage of inherent parallelism. However, with implicit parallelism
it is necessary to analyse the program to see how it can be divided
into tasks, so that the compiler can detect parallel relationships
between tasks as well as carrying out the normal compiling work for a
program to be run on a serial computer.
Different methods have been developed for automatically recognising
parallelism in computer programs. Bernstein [Bernstein, 1966] proposed
a method based on set theory. His theory distinguishes four different
ways in which a memory location may be utilized by a sequence of
instructions or tasks. These conditions are:
1. The location is only fetched during the execution of a task.
2. The location is only stored during the execution of a task.
3. The first operation involving this location is a fetch.
One of the succeeding operations stores in this location.
4. The first operation involving this location is a store. One
of the succeeding operations fetches from this location.
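From these conditions Bernstein derived the well-known requirement that two tasks may execute in parallel only if neither writes a location the other reads or writes. The following sketch illustrates that test; the names (`task_t`, `bernstein_parallel`) are illustrative, not from the thesis.

```c
#include <stdbool.h>
#include <stddef.h>

/* A task is modelled by the set of memory locations it fetches (reads)
   and the set it stores (writes); locations are identified by integers. */
typedef struct {
    const int *reads;  size_t n_reads;
    const int *writes; size_t n_writes;
} task_t;

/* True if the two integer sets share at least one element. */
static bool intersects(const int *a, size_t na, const int *b, size_t nb)
{
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i] == b[j])
                return true;
    return false;
}

/* Bernstein's conditions: tasks T1 and T2 may run in parallel iff
   W1 ∩ R2, R1 ∩ W2 and W1 ∩ W2 are all empty. */
bool bernstein_parallel(const task_t *t1, const task_t *t2)
{
    return !intersects(t1->writes, t1->n_writes, t2->reads,  t2->n_reads) &&
           !intersects(t1->reads,  t1->n_reads,  t2->writes, t2->n_writes) &&
           !intersects(t1->writes, t1->n_writes, t2->writes, t2->n_writes);
}
```

Two tasks that only read a common location pass the test; a task that writes a location another task reads fails it.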
Following this work, Evans and Williams [Evans, 1978] described a
method of locating parallelism within an ALGOL-type programming
language, in which constructs such as loops, if statements, and
assignment statements are studied.
One of the most studied detection schemes is the implicit detection
of the inherent parallelism within the computation of arithmetic
expressions. Because of the sequential nature of most uniprocessor
systems, the run-time of any arithmetic expression computation is
proportional to the number of operations. This run-time can be
reduced on a parallel system by concurrently processing many parts of
the expression. In fact, commutativity and associativity have been
extensively used in order to reduce the height of the computational
tree representation. For example, consider the expression,
(a*b*c*d*e*f*g*h)
which can be rearranged in a form suitable for parallel processing,
(((a*b)*(c*d))*((e*f)*(g*h)))
As can be seen in Figures 2.1 and 2.2, which depict the tree
representation of the above expression for a sequential and a parallel
processor respectively, the run-time is reduced by four time units.
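The balanced-tree rearrangement above can be sketched as a pairwise reduction: each pass multiplies adjacent pairs, and on a parallel machine all pairs in a pass are independent, so an n-term product needs only ceil(log2 n) levels instead of n-1 sequential multiplications. The function name is illustrative.

```c
#include <stddef.h>

/* Pairwise ("balanced tree") reduction of the product a[0]*...*a[n-1].
   Each pass combines adjacent pairs, halving the problem size; on a
   parallel processor every pass's pairs could run concurrently.
   The array is overwritten as workspace. */
double tree_product(double *a, size_t n)
{
    while (n > 1) {
        size_t half = n / 2;
        for (size_t i = 0; i < half; i++)   /* independent pairs */
            a[i] = a[2 * i] * a[2 * i + 1];
        if (n % 2)                          /* odd element carried up */
            a[half] = a[n - 1];
        n = half + (n % 2);
    }
    return a[0];
}
```

For the eight-term expression of Figure 2.2 this performs three passes (4, 2, then 1 multiplication), matching the three-level parallel tree.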
Many algorithms have previously been proposed for recognizing
parallelism at the expression level, including those suggested by
Squire [Squire, 1963], Hellerman [Hellerman, 1966],
Stone [Stone, 1967], Baer and Bovet [Baer, 1968], Kuck [Kuck, 1977],
and Wang and Liu [Wang, 1980].
2.2.2 Explicit Parallelism
In explicit parallelism, the programmer has to specify explicitly
those tasks that can be executed concurrently, by means of special
parallel constructs added to a high-level programming language.
Although these programming constructs can be time consuming and
difficult to implement, they offer significant algorithm-design
flexibility.
FIGURE 2.1: Binary Tree Representation of the Expression (a*b*c*d*e*f*g*h) for a Serial Computer.
FIGURE 2.2: Binary Tree Representation of the Expression (a*b*c*d*e*f*g*h) for a Parallel Computer.
Significant research has been done on this approach, with particular
interest in parallel task issues such as task declaration, activation,
termination, synchronisation, and communication. In other terms, a
parallel program consists of sequential processes that are carried out
simultaneously. These processes cooperate on common tasks by
exchanging data through shared variables.
Dijkstra [Dijkstra, 1965] proposed the utilization of semaphores
and introduced two new primitives (P and V) that greatly simplified
the processes of synchronisation and communication. A software
implementation of these two primitives in terms of an indivisible
instruction, the test-and-set instruction, was installed in many
systems.
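A binary semaphore built on an indivisible test-and-set can be sketched with the C11 atomic_flag type, whose test-and-set operation is a modern analogue of the instruction described; this is an illustrative sketch, not the historical implementation, and the names (`semaphore_t`, `sem_P`, `sem_V`) are hypothetical.

```c
#include <stdatomic.h>

/* A binary semaphore built on an indivisible test-and-set primitive.
   atomic_flag_test_and_set atomically sets the flag and returns its
   previous value: clear = free, set = held. */
typedef atomic_flag semaphore_t;

#define SEMAPHORE_INIT ATOMIC_FLAG_INIT

/* P (wait): spin until the test-and-set finds the flag clear. */
void sem_P(semaphore_t *s)
{
    while (atomic_flag_test_and_set(s))
        ;   /* busy wait */
}

/* V (signal): release the semaphore by clearing the flag. */
void sem_V(semaphore_t *s)
{
    atomic_flag_clear(s);
}
```

A critical region is then bracketed by `sem_P(&s); ... sem_V(&s);`, exactly as Dijkstra's P and V bracket it.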
Dennis and van Horn [Dennis, 1966] proposed a very straightforward
mutual exclusion lock-out mechanism. Critical regions are enclosed
within a LOCK W ... UNLOCK W pair, where W is an arbitrary one-bit
variable.
The parallelism in the explicit approach can also be indicated
by using language constructs that exploit the parallelism in
algorithms. Anderson [Anderson, 1965] introduced five parallel
constructs, the FORK, JOIN, TERMINATE, OBTAIN and RELEASE statements,
which are presented below in an ALGOL 68 format:
label: FORK L1,L2,...,Ln;
label: JOIN L1,L2,...,Ln;
label: TERMINATE L1,L2,...,Ln;
The FORK statement initiates a separate control path for each of the
n segments with the labels Li. Only local labels may be used, and
their scope is defined as the block in which the statement is
declared. The next sequence of paths may only be initiated when
all the forked paths of the previous level have completed their
execution.
The JOIN statement, which is associated with the FORK statement
and must occur at the same level in the program, is used to terminate
the parallel processes that have been forked. This action is
implemented by including code that causes test bits to be available,
thus allowing the forked paths to be synchronised after they are
completed (Figure 2.3).
The TERMINATE statement is used to explicitly terminate program
paths which have been dynamically activated by the FORK statement,
thus avoiding the creation of a backlog of meaningless incomplete
activations. In effect, the JOIN and TERMINATE statements maintain
control counters, decremented by one after the execution of each
statement and compared each time to zero; if the counter is non-zero,
the path is terminated and the processor is free to execute the next
path in the queue; otherwise, the processor proceeds to the next
program segment.
The OBTAIN statement is used to provide exclusive access to the
listed variables by a single process. It is used to avoid mutual
interference by locking out other parallel program paths from the use
of these variables. If this statement occurs in a block, then the
variables should be the same variables occurring in higher-level
blocks.
FIGURE 2.3: The Fork/Join Technique.
The RELEASE statement is implemented together with the OBTAIN
statement: it releases those variables that have been locked out by
an OBTAIN statement.
More specifically, the OBTAIN/RELEASE concept is an approach
implemented to assist in solving the synchronization problem.
The above statements are directly implemented as library
functions and supplied with enough information to control parallel and
multiprogramming activities. In general, the FORK statement would be
substituted at compilation time by special code that, when executed,
would create as many parallel paths as the number of labels following
the FORK statement. Each of these paths is assigned to the available
processors, and usually the first path is carried out by the same
processor that executes the FORK statement itself. If the number of
created processes is greater than the number of available processors,
the excess paths are kept in a queue until a processor becomes free.
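The FORK/JOIN scheme just described can be sketched with POSIX threads; this is an illustrative analogue, not Anderson's implementation, and the names (`fork_paths`, `join_paths`, `demo_fork_join`) are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* FORK: start one control path per "label" (here, a function pointer).
   As the text describes, the first path is carried out by the same
   processor that executes the FORK itself. */
void fork_paths(void *(*path[])(void *), size_t n, pthread_t tid[])
{
    for (size_t i = 1; i < n; i++)
        pthread_create(&tid[i], NULL, path[i], NULL);
    path[0](NULL);                      /* first path runs here */
}

/* JOIN: wait until every forked path has completed, playing the role
   of the JOIN statement's control counter and test bits. */
void join_paths(size_t n, pthread_t tid[])
{
    for (size_t i = 1; i < n; i++)
        pthread_join(tid[i], NULL);
}

/* Demonstration: fork two paths that each mark completion, then join. */
static int mark[2];
static void *path_a(void *arg) { (void)arg; mark[0] = 1; return NULL; }
static void *path_b(void *arg) { (void)arg; mark[1] = 1; return NULL; }

int demo_fork_join(void)
{
    void *(*paths[2])(void *) = { path_a, path_b };
    pthread_t tid[2];
    fork_paths(paths, 2, tid);
    join_paths(2, tid);
    return mark[0] + mark[1];           /* 2 when both paths ran */
}
```

`pthread_join` supplies the counting-to-zero behaviour of JOIN: control passes the join point only after every forked path has checked in.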
The labels used in a parallel program are cross-referenced at
compilation time by arranging them on a forward reference list which
is loaded with all the labels contained in the label list of an
instruction.
To recapitulate, in the explicit approach the process of
parallelization is the entire responsibility of the programmer,
a fact which jeopardizes program determinacy.
2.3 PROGRAMMING THE BALANCE SYSTEM
The hardware configuration and the operating system of the Balance
MIMD system was described in Chapter 1. This section describes the
programming languages supported by Sequent for use with the DYNIX
operating system. These languages include C, FORTRAN77, pascal; and
assembly languages.
The Balance system supports the two basic kinds of parallel
programming: multiprogramming and multitasking.
Multiprogramming is an operating system feature that allows a
computer to execute multiple unrelated programs concurrently. The
multiuser, multiprogramming UNIX environment adapts quite naturally
to the Balance multiprocessing architecture and automatically
schedules processes for optimal throughput. In other versions of the
UNIX operating system, executable processes wait in a run queue; when
the CPU suspends or terminates execution of one process, it switches
to the process at the head of the run queue. DYNIX balances the
system load among the available processors, keeping all processors
busy as long as there is enough work available, thus using the full
computing capability of each processor.
Multitasking is a programming technique that allows a single
application to consist of multiple processes executing concurrently.
The DYNIX operating system automatically does multitasking for some
applications.
The Balance language software includes multitasking extensions
to C, Pascal, and FORTRAN. The DYNIX Parallel Programming Library
(PPL) includes routines to create, synchronize, and terminate parallel
processes from C, Pascal and FORTRAN programs. The DYNIX gprof
utility creates a program's execution profile, a listing that shows
which subprograms (subroutines or functions) account for most of a
program's execution time.
2.3.1 Multitasking Terms and Concepts
In the DYNIX operating system, a new process is created by using
a system call called a fork. The new (or child) process is a
duplicate copy of the old (or parent) process, with the same data,
register contents, and program counter. If the parent has files open
or has access to shared memory, the child has access to the same files
and shared memory.
A UNIX forking operation is relatively expensive (about 55
milliseconds). Therefore, a parallel application typically forks as
many processes as it is likely to need at the beginning of the program,
and does not terminate any process until the program is complete,
since a process can wait in a busy loop during certain code sequences.
Typically, multitasking programs include both shared and private
data. Shared data is accessible by both parent and child processes,
while private data is accessible by only one process. There are
several advantages to sharing data. Firstly, it uses less memory than
having multiple copies. Secondly, it avoids the overhead of making
copies of the data for each process. Finally, it provides a simple
and efficient mechanism for communication between processes. If the
program includes any shared data, the process's virtual memory space
also contains a shared data area and a shared heap (Figure 2.4).
Tasks can be scheduled among processes using three types of
algorithm:
FIGURE 2.4: Process Virtual Memory Contents.
1. prescheduling;
2. static scheduling;
3. dynamic scheduling.
In prescheduling, the task division must be determined before the
program is compiled. Prescheduled programs cannot automatically
balance the computing load according to the data or the number of
CPUs in the system. Therefore, this method is appropriate only for
applications where each process performs a different task.
In static scheduling, the tasks are scheduled by the processes at
run time, but they are divided in some predetermined way. For example,
if a program includes a loop of 100 iterations and uses 10 processes,
then under static scheduling each process might execute 10 iterations
of the loop.
In dynamic scheduling, each process schedules its own tasks at
run time by checking a task queue or a "do-me-next" array index.
For example, a dynamically scheduled program might perform a matrix
multiplication, with each process computing three matrix elements and
then returning for more until all the work is done.
Dynamic scheduling produces dynamic load balancing, i.e. all
processes keep working as long as there is work to be done, while
static scheduling produces static load balancing: because the division
of tasks is statically determined, several processes may stand idle
while one process completes its share of the job.
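The "do-me-next" style of dynamic scheduling can be sketched with POSIX threads and a shared atomic counter; this illustrates the pattern only, not the DYNIX library itself, and the names (`worker`, `demo_dynamic_schedule`) are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>

#define N_ITER  100                    /* total loop iterations */
#define N_PROCS 4                      /* cooperating processes */

static atomic_int  next_iter;          /* the shared "do-me-next" index */
static atomic_long total;              /* combined result */

/* Each process repeatedly claims the next unclaimed iteration until the
   work is exhausted, so faster processes automatically do more work. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&next_iter, 1);   /* claim iteration i */
        if (i >= N_ITER)
            break;                                 /* queue is empty */
        atomic_fetch_add(&total, (long)i);         /* the "task": sum i */
    }
    return NULL;
}

/* Run the dynamically scheduled loop and return the combined result. */
long demo_dynamic_schedule(void)
{
    atomic_store(&next_iter, 0);
    atomic_store(&total, 0);
    pthread_t tid[N_PROCS];
    for (int p = 0; p < N_PROCS; p++)
        pthread_create(&tid[p], NULL, worker, NULL);
    for (int p = 0; p < N_PROCS; p++)
        pthread_join(tid[p], NULL);
    return atomic_load(&total);        /* 0 + 1 + ... + 99 */
}
```

Whichever process is fastest simply claims more iterations, which is precisely the dynamic load balancing the text describes.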
Any place where two or more parallel processes can read and
write the same data structure constitutes a dependency, because the
results of the program depend on when a given process references that
data structure. To ensure correct results, code sections containing
dependencies must not be executed simultaneously by multiple
processes; such code sections are critical regions.
There are two basic types of dependencies:
1. access dependencies;
2. order dependencies.
Access dependencies can yield incorrect results if two or more
processes try to access a shared data structure at the same time,
whilst order dependencies can yield incorrect results if two or
more processes access a shared data structure in the wrong order.
The best way of handling dependencies in a program is to rewrite
the code to eliminate them. However, dependencies are sometimes
inherent in the application. In these instances, we must set up the
processes so that they communicate with each other to execute the
dependent code sections. This communication can be set up by using
mechanisms called semaphores and locks, where a semaphore is a shared
data structure used to synchronize the actions of multiple cooperating
processes, while a lock ensures that only one process at a time can
access a shared data structure. A lock has two values: locked and
unlocked. Before attempting to access a shared data structure, a
process waits until the lock associated with the data structure is
unlocked, indicating that no other process is accessing the data
structure. The process then locks the lock, accesses the data
structure and unlocks the lock. While a process is waiting for a
lock to become unlocked, it spins in a tight loop, producing no work;
hence the name "spin-lock". This spinning is also referred to as a
busy wait.
Figure 2.5 illustrates how a lock is used to prevent multiple
processes from executing a dependent section simultaneously.
FIGURE 2.5: Role of Lock in Protecting Dependent Sections.
An event is something that must happen before a task or process
can proceed. Events have two values: posted and cleared. One or
more processes wait for an event until another process posts the
event, whereupon the waiting processes proceed. The event must then
be cleared, either by a waiting process, by a master process, or by
another process.
I/O in parallel programs is complicated. These complications
can usually be reduced by performing I/O only during sequential
phases of the program or by designating one process as a server to
perform all I/O.
2.3.2 Data Partitioning with DYNIX
This subsection explains how we structure FORTRAN programs for
data partitioning. It also explains how we shall use the DYNIX PPL
to execute loops in parallel. The data partitioning method is
sometimes called microtasking. Microtasking programs create multiple
independent processes to execute loop iterations in parallel. It has
the following characteristics:
1. the parallel processes share some data and cr"eate their
own private copies of other data;
2. the division of the computing load adjusts automatically
to the number of available processes;
3. the program controls data flow and synchronization by using
the tools specially designed for data partitioning.
The microtasking program works as follows:
a. Each loop to be executed in parallel is contained in a
subprogram.
b. For each loop, the program calls a special function which
forks a set of child processes and assigns an identical
copy of the subprogram to each process for parallel
execution. The special function creates a copy of any
private data for each process.
c. Each copy of the subprogram executes some of the loop
iterations.
d. If the loop being executed in parallel is not completely
independent, the subprogram may contain calls to functions
that synchronize the parallel processes at critical points
by using locks, barriers, and other semaphores.
e. When all the loop iterations have been executed, control
returns from the subprogram. At this point, the program
either terminates the parallel processes or leaves them to
spin in a busywait state until they are needed again.
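Steps (a)-(e) can be sketched with POSIX threads standing in for the PPL's fork-and-execute call; this is a pattern illustration under that assumption, not actual PPL code, and the names (`loop_body`, `run_microtasked`, `demo_microtask`) are hypothetical.

```c
#include <pthread.h>

#define NPROCS 4                       /* number of child processes */
#define N      1000                    /* loop iterations */

static double a[N], b[N], c[N];        /* shared data */

/* (a) The loop to be run in parallel lives in a subprogram.  Each copy
   executes iterations id, id+NPROCS, id+2*NPROCS, ..., so the division
   of work adjusts automatically to the number of processes. */
static void *loop_body(void *arg)
{
    long id = (long)arg;               /* private data: this copy's id */
    for (long i = id; i < N; i += NPROCS)
        c[i] = a[i] + b[i];            /* independent iterations */
    return NULL;
}

/* (b)-(e) A stand-in for the library's fork-and-execute call: create
   the child processes, hand each one a copy of the subprogram, and
   return when all loop iterations have been executed. */
static void run_microtasked(void)
{
    pthread_t tid[NPROCS];
    for (long p = 0; p < NPROCS; p++)
        pthread_create(&tid[p], NULL, loop_body, (void *)p);
    for (long p = 0; p < NPROCS; p++)
        pthread_join(tid[p], NULL);
}

/* Fill the operands, run the parallel loop, and return one element. */
double demo_microtask(void)
{
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
    run_microtasked();
    return c[N - 1];                   /* 999 + 1998 */
}
```

Because the loop iterations here are fully independent (step d), no locks or barriers are needed inside the subprogram.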
There are three sets of routines in the DYNIX PPL: a microtasking
library, a set of routines for general use with data partitioning
programs, and a set of routines for memory allocation in data
partitioning programs.
The microtasking library routines allow us to fork a set of
child processes, assign the processes to execute loop iterations in
parallel, and synchronize the processes as necessary to provide proper
data flow between loop iterations. Table 2.1 lists the microtasking
routines in the PPL.
The data partitioning routines include a routine to determine
the number of available CPUs and several process synchronization
routines that are more flexible than those available in the
microtasking library. Table 2.2 illustrates the general-purpose data
partitioning routines in the PPL.
The memory allocation routines allow a data partitioning program
to allocate and deallocate shared memory and to change the amount of
shared and private memory assigned to a process. Table 2.3 shows the
memory allocation routines in the PPL. More detail on the PPL can be
found in Section 3p of Volume 1 of the DYNIX Programmer's Manual.
Before we convert a loop into a subprogram for data partitioning,
we must analyze all the variables in the loop and determine two things:
ROUTINES         DESCRIPTIONS
m_fork           Execute a subprogram in parallel
m_get_myid       Return process identification number
m_get_numprocs   Return number of child processes
m_kill_procs     Terminate child processes
m_lock           Lock a lock
m_multi          End single-process code section
m_next           Increment global counter
m_park_procs     Suspend child process execution
m_rele_procs     Resume child process execution
m_set_procs      Set number of child processes
m_single         Begin single-process code section
m_sync           Check in at barrier
m_unlock         Unlock a lock
TABLE 2.1: Parallel Programming Library Microtasking Routines
ROUTINES            DESCRIPTIONS
cpus_online         Return number of CPUs on-line
s_init_barrier      Initialize a barrier
s_init_lock         Initialize a lock
s_lock or s_clock   Lock a lock
S_LOCK              C macro for s_lock
s_unlock            Unlock a lock
S_UNLOCK            C macro for s_unlock
s_wait_barrier      Wait at a barrier
TABLE 2.2: Parallel Programming Library Data-Partitioning Routines
ROUTINES          DESCRIPTIONS
brk or sbrk       Change private data segment size
shbrk or shsbrk   Change shared data segment size
shfree            Deallocate shared data memory
shmalloc          Allocate shared data memory
TABLE 2.3: Parallel Programming Library Memory-Allocation Routines
a. Which data can be shared between parallel processes and
which must be local to each parallel process.
b. Which variables cause dependencies or critical regions,
i.e. code sections which can yield incorrect results when
executed in parallel.
To complete the development of our data partitioning program, we
need to do the following things:
1. Invoke the appropriate compiler with the proper options to
link the program with the PPL.
2. Execute the program and check the results.
3. If necessary, use the DYNIX parallel symbolic debugger,
Pdbx, to debug the program.
Thus, if we need to compile and link a FORTRAN program, we enter
the following command:
fortran -e -F SHCOM program.name -lpps
This command compiles a FORTRAN source file and links the object code
with the PPL, producing an executable file named a.out. It also
places all COMMON blocks declared with the (F) option into shared
memory. The (e) option makes the FORTRAN code compatible with the C
subroutines in the PPL. To execute the program, simply enter the name
of the executable file as a DYNIX command. The default file name is
a.out.
2.3.3 Function Partitioning with DYNIX
This subsection describes the facilities provided by the Balance
system to support function partitioning applications. Function
partitioning involves creating multiple processes and having them
perform different operations on the same data set. The processes may
be created within a single program or they may be independent programs
created at the operating system level. The DYNIX PPL contains several
routines that can be used for function partitioning applications.
The fork() system call creates a duplicate copy of the current
process. The parent process sets up a shared memory region and one or
more locks, then forks one or more child processes to share the work.
The children inherit the parent's complete memory image, including
access to shared memory and locks. Child processes are identical to
the parent, and they can be designed to choose their own tasks based
on the order of their creation. A new process image is created from a
file that contains either executable object code or a shell script.
The PPL routines s_init_barrier and s_wait_barrier initialize a
barrier and cause processes to spin until all related processes arrive
at the synchronization point. Processes can send and receive signals
among themselves, and handle special events such as terminal
interrupts.
If a child process determines that the parent will not need any help
for a significant amount of time, the child can relinquish its
processor for use by other applications. The parent can send the child
a wakeup signal when required. Since a Balance system can have
multiple processes running simultaneously, some programs that use UNIX
signals may behave differently on the Balance system than on a
uniprocessor. For example, if each of P child processes sends a signal
to its parent, the parent will not necessarily receive P signals. This
type of race condition is also possible on uniprocessors, but may not
manifest itself until the program is ported to a multiprocessor.
The simplest and most efficient mechanism for interprocess
communication is a semaphore in the Balance shared memory. The
interprocess communication subsystem also provides the ability to
transfer data directly between processes using system calls. These
interprocess communication facilities are extremely useful for certain
types of applications.
2.4 PARALLEL ALGORITHMS
Parallel algorithms were studied by many researchers (since the
1960's) long before parallel computers were constructed. However,
designing parallel algorithms became more important and interesting as
the development of parallel computer architectures advanced.
Consequently, a variety of algorithms have been designed from
different viewpoints and for the various parallel architectures which
were described in Chapter 1.
Ideally, parallel algorithms should be structured as fully
independent computations. Although this cannot always be achieved,
some algorithms already contain independent computations that do not
need to be reorganized. Such algorithms are said to have inherent
parallelism.
Stone [Stone, 1973] highlights some of the problem areas in
parallel computation. These include the necessity to rearrange the
data in memory for efficient parallel computation; the recognition
that efficient sequential algorithms are not necessarily efficient on
parallel computers and, conversely, that sometimes inefficient
sequential algorithms can lead to very efficient parallel algorithms;
and lastly the possibility of applying transformations to sequential
algorithms to yield new algorithms suitable for parallel execution.
Kung [Kung, 1980] identified three orthogonal dimensions of the
space of parallel algorithms: concurrency control, module granularity
and communication geometry. Concurrency control is needed in
parallel algorithms to ensure the correctness of the concurrent
execution, because more than one task module can be executed at a
time. The module granularity of a parallel algorithm reflects whether
or not the algorithm tends to be communication intensive. This must
be taken into consideration for efficiency reasons. If the task
modules of a parallel algorithm are connected to represent intermodule
communication, then a geometric layout of the resulting network is
referred to as the communication geometry of the algorithm.
As was exemplified by the various architectures of existing
parallel computers, parallelism can be achieved in a variety of ways.
The same holds for parallel algorithms, since a close correspondence
must exist between architectures and algorithms. Consequently, since
there exists such a variety, the question to answer now is, "How does
one choose amongst the different alternatives in order to solve a
specific problem, or what type of problem is better adapted to a given
architecture?"
Since, in most cases, performance is the reason why parallelism
is being investigated, it must be considered a very critical issue.
The study of how to design algorithms for different parallel
architectures might reveal that an algorithm requires a peculiar
feature of an architecture in order to run efficiently.
As mentioned previously, in SIMD computers the number of
processors tends to be large compared with that of MIMD computers.
In general, we can say that an algorithm designed for a SIMD computer
requires a high degree of parallelism, because this type has up to
order p^m, i.e. O(p^m), processors, while a MIMD computer has up to
O(p) processors, where p is the number of subtasks of the problem and
m is an integer index representing problem complexity. This does not
mean that an algorithm designed for a MIMD computer cannot be run on a
SIMD computer.
Since the processors of a MIMD computer are asynchronous, they
need not necessarily be involved on the same problem. On the other
hand, the processors of a SIMD computer are synchronous; hence they
cannot be used to run independent computations that are not identical,
and they must remain idle when not required.
The performance of an algorithm is quantified by some relevant
measures, such as the computation time, speedup, and efficiency of the
algorithm. In real machines, actual computation times are often
proportional to the total number of arithmetic operations in the
program, whilst in programs with little arithmetic they are
proportional to the number of memory accesses or the number of I/O
transmissions. We will use T_p to denote the computation time on a
computer with p processors, and T_1 to denote the computation time on
a uniprocessor computer.
The Speedup (S_p) of a p-processor computer over a sequential
computer is defined as,
    S_p = T_1 / T_p >= 1 ,                                (2.4.1)
and the Efficiency (E_p) is defined as,
    E_p = S_p / p <= 1 ,                                  (2.4.2)
and to compare two parallel algorithms for the same problem the
following measure of effectiveness (F_p) is introduced,
    F_p = S_p / C_p ,                                     (2.4.3)
where,
    C_p = p T_p >= T_1 ,                                  (2.4.4)
measures the cost of the algorithm.
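Equations (2.4.1)-(2.4.4) translate directly into a small routine; the names (`perf_t`, `measure`) are illustrative only.

```c
/* Performance measures from equations (2.4.1)-(2.4.4): given the serial
   time t1 and the p-processor time tp, return speedup, efficiency,
   cost, and F_p scaled by t1 (note F_p * T_1 = S_p * E_p). */
typedef struct {
    double speedup;     /* S_p = T_1 / T_p       */
    double efficiency;  /* E_p = S_p / p         */
    double cost;        /* C_p = p * T_p         */
    double f_times_t1;  /* F_p * T_1 = S_p * E_p */
} perf_t;

perf_t measure(double t1, double tp, int p)
{
    perf_t m;
    m.speedup    = t1 / tp;
    m.efficiency = m.speedup / p;
    m.cost       = (double)p * tp;
    m.f_times_t1 = m.speedup * m.efficiency;
    return m;
}
```

Feeding in T_1 = 31 and T_2 = 16 from the inner-product example below reproduces the p = 2 row of Table 2.4 (S_2 ≈ 1.93, E_2 ≈ 0.96, C_2 = 32).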
The following simple example is given for illustration. Suppose
that it is required to form the inner or scalar product,
    A = sum(i=1 to 16) a_i b_i .
To carry this out with a single processor sequentially requires 16
multiplications and 15 additions. If we take the number of additions
(assuming 1 multiplication ≡ 1 addition) as a measure of the time
needed, then, normalising by taking the time for a single addition as
the unit, we have T_1 = 31. If two processors were available we would
form the two partial sums,
    c_1 = a_1 b_1 + a_2 b_2 + ... + a_8 b_8 ,
    c_2 = a_9 b_9 + a_10 b_10 + ... + a_16 b_16 ,
simultaneously, requiring 15 time units, and then form,
    d = c_1 + c_2 = A
at the next stage, requiring a further time unit. Thus A would be
obtained in T_2 = 15+1 = 16 time units.
If now we had three processors, A could be formed in the following
three stages:
    c_1 = a_1 b_1 + a_2 b_2 + a_3 b_3 + a_4 b_4 + a_5 b_5
    c_2 = a_6 b_6 + a_7 b_7 + a_8 b_8 + a_9 b_9 + a_10 b_10
    c_3 = a_11 b_11 + a_12 b_12 + a_13 b_13 + a_14 b_14 + a_15 b_15
    d_1 = c_1 + c_2 ,   d_2 = c_3 + a_16 b_16 ,
    e_1 = d_1 + d_2 = A .
This requires T_3 = 9+3+1 = 13 time units.
If four processors were available then A could be formed in the
following stages:
    c_1 = a_1 b_1 + ... + a_4 b_4 ,   c_2 = a_5 b_5 + ... + a_8 b_8 ,
    c_3 = a_9 b_9 + ... + a_12 b_12 , c_4 = a_13 b_13 + ... + a_16 b_16 ,
    d_1 = c_1 + c_2 ,   d_2 = c_3 + c_4 ,
    e_1 = d_1 + d_2 = A .
This requires T_4 = 7+1+1 = 9 time units.
A table of the performance measures can now be constructed.
Table 2.4 shows that with increasing p, S_p increases steadily, while
E_p decreases. F_p T_1, however, has a maximum when p=8, which
indicates that p=8 is the optimal choice of the number of processors
for this calculation.
 p    T_p   C_p   S_p    E_p    F_p T_1 = S_p E_p
 1    31    31    1      1      1
 2    16    32    1.93   0.96   1.85
 3    13    39    2.38   0.79   1.88
 4    10    40    3.10   0.77   2.38
 8     6    48    5.16   0.64   3.30
16     5    80    6.2    0.38   2.4
TABLE 2.4: The Performance Measures of A = sum(i=1 to 16) a_i b_i .
Many algorithms in the literature are suitable for SIMD computers.
Examples of such algorithms have been introduced by Miranker
[Miranker, 1971], Stone [Stone, 1971, 1973b] and Wyllie
[Wyllie, 1979]. On the other hand, although implementing algorithms
on MIMD computers is more difficult than on SIMD computers, many
parallel algorithms have been developed to run on MIMD computers.
For example, Muraoka [Muraoka, 1971] showed how parallelism is
exploited in evaluating algebraic expressions on a MIMD computer, and
Baudet et al. [Baudet, 1980] developed a variety of parallel iterative
algorithms suitable for MIMD computers.
2.4.1 The Structure of Algorithms for Multiprocessor Systems
A parallel algorithm for a multiprocessor is a set of n concurrent
processes which may operate simultaneously and cooperatively to solve
a given problem. Synchronization and the exchange of data are needed
between processes to ensure that the parallel algorithm works
correctly and effectively. Therefore, at some stage in the execution
of a process there may be points where the process communicates with
other processes. These points are called "interaction points". The
interaction points divide a process into stages. Hence, at the end of
each stage, a process may communicate with some other processes before
the next stage of the computation is initiated.
Parallel algorithms for multiprocessors may be classified into
synchronous and asynchronous parallel algorithms. Due to the
interactions between the processes, some processes may be blocked at
certain times. A parallel algorithm in which some processes have to
wait on other processes is called a synchronized algorithm. The
weakness of a synchronized algorithm is that all the processes that
have to synchronize at a given point wait for the slowest amongst
them. To overcome this problem, an asynchronous algorithm may be
used. In a parallel asynchronous algorithm, processes are generally
not required to wait for each other, and communication is achieved by
using global variables stored in shared memory. Small delays may
occur because of concurrent accesses to the shared memory.
CHAPTER 3
BASIC MATHEMATICS, GENERAL BACKGROUND
Most numerical analysts have
no interest in arithmetic.
B. Parlett.
3.1 INTRODUCTION
In the present chapter we shall examine the necessary
preliminary mathematical background of differential equations, and
review the related notations, conditions, and concepts essential
for their proper use. The mathematical formulation of any important
scientific or engineering problem involving rates of change with
respect to two or more independent variables leads to Partial
Differential Equations (PDE's) or a set of such equations. Another
class of differential equations which govern physical systems is
Ordinary Differential Equations (ODE's), in which only one independent
variable is present in the differential equations.
With the use of automatic digital computers becoming widespread,
numerical methods are found to be an attractive alternative, since
the analytical solution of the majority of these equations is
extremely difficult or too inconvenient to obtain. A number of
approaches have been developed over the years for the treatment of
PDE's; the most important and widely used of these are the methods of
finite elements and finite differences. A complementary description of
the finite difference methods will now be given.
3.2 CLASSIFICATION OF PARTIAL DIFFERENTIAL EQUATIONS
The most general mathematical form of the second-order PDE in
two independent variables, x and y (these may both be space
coordinates, or one may be a space coordinate and the other the time
variable), with a dependent variable U, can be expressed as,

$$A\frac{\partial^2 U}{\partial x^2} + B\frac{\partial^2 U}{\partial x\,\partial y} + C\frac{\partial^2 U}{\partial y^2} + D\frac{\partial U}{\partial x} + E\frac{\partial U}{\partial y} + FU + G = 0. \qquad (3.2.1)$$
Equation (3.2.1) can be further classified according to the nature
of its coefficients:
1. Linear, if the coefficients A,B,C,...,G are constants or
functions of one or both independent variables x and y.
2. Nonlinear, if any of the coefficients A,B,C,...,G are
functions of the dependent variable U or its derivatives.
3. Semi-linear, if the coefficients A,B,...,G are functions of
the independent variables x and y only.
4. Quasi-linear, if the coefficients A, B, and C are functions
of x, y, U, ∂U/∂x and ∂U/∂y, but not of second-order derivatives.
5. Homogeneous, if G=0; otherwise it is called inhomogeneous.
6. Self-adjoint, if it can be replaced by

$$\frac{\partial}{\partial x}\Big(A(x)\frac{\partial U}{\partial x}\Big) + \frac{\partial}{\partial y}\Big(C(y)\frac{\partial U}{\partial y}\Big) + FU + G = 0.$$
Additionally, equation (3.2.1) can be classified into three
particular types according to whether the discriminant (B² − 4AC) is
greater than, equal to, or less than zero. The general second-order
quasi-linear PDE has the form,

$$A\frac{\partial^2 U}{\partial x^2} + B\frac{\partial^2 U}{\partial x\,\partial y} + C\frac{\partial^2 U}{\partial y^2} + G = 0, \qquad (3.2.2)$$
where A, B, C, and G may be functions of x, y, U, ∂U/∂x and ∂U/∂y, but
not of the second-order derivatives. In this section, let us adopt the
following notation for the first and second-order derivatives,

$$\frac{\partial U}{\partial x} = M; \quad \frac{\partial U}{\partial y} = N; \quad \frac{\partial^2 U}{\partial x^2} = P; \quad \frac{\partial^2 U}{\partial x\,\partial y} = Q; \quad \frac{\partial^2 U}{\partial y^2} = S.$$
Let R be a curve in the x-y plane on which the values of the derivatives
above satisfy equation (3.2.2). Then the differentials of M and
N in directions tangential to R satisfy the equations,

$$dM = \frac{\partial M}{\partial x}dx + \frac{\partial M}{\partial y}dy = P\,dx + Q\,dy, \qquad (3.2.3)$$

and

$$dN = \frac{\partial N}{\partial x}dx + \frac{\partial N}{\partial y}dy = Q\,dx + S\,dy, \qquad (3.2.4)$$

where

$$AP + BQ + CS + G = 0, \qquad (3.2.5)$$

and dy/dx is the slope of the tangent to R at the point p(x,y).

Elimination of P and S from (3.2.5) using (3.2.3) and (3.2.4)
results in,

$$\frac{A}{dx}(dM - Q\,dy) + BQ + \frac{C}{dy}(dN - Q\,dx) + G = 0,$$

i.e.,

$$Q\left\{A\Big(\frac{dy}{dx}\Big)^2 - B\Big(\frac{dy}{dx}\Big) + C\right\} - \left\{A\frac{dM}{dx}\frac{dy}{dx} + C\frac{dN}{dx} + G\frac{dy}{dx}\right\} = 0. \qquad (3.2.6)$$

Now, by choosing the curve R so that the slope of the tangent,
at every point on it, is a root of the equation,

$$A\Big(\frac{dy}{dx}\Big)^2 - B\Big(\frac{dy}{dx}\Big) + C = 0, \qquad (3.2.7)$$

the Q term is eliminated. Therefore, equation (3.2.6) leads to,

$$A\frac{dM}{dx}\frac{dy}{dx} + C\frac{dN}{dx} + G\frac{dy}{dx} = 0. \qquad (3.2.8)$$
Consequently, it is apparent that at every point p(x,y) of the
solution domain there are two directions, given by the roots of equation
(3.2.7), along which there is a relationship, given by equation (3.2.8),
between the total differentials dM and dN with respect to x and y.
The directions given by the roots of equation (3.2.7) are called
characteristic directions, and the PDE is considered to be hyperbolic,
parabolic, or elliptic according to whether these roots are real and
distinct, equal, or complex, respectively, i.e. according to whether
B² − 4AC > 0, = 0, or < 0.

The above classification scheme is rather interesting, since the
coefficients A, B and C are functions of the independent variables x and
y and/or the dependent variable U; thus this classification depends in
general on the region in which the PDE is defined.
For example, a differential equation whose coefficients give the
discriminant B² − 4AC = x² − y² is hyperbolic in the region where
x² − y² > 0, parabolic along the boundary x² − y² = 0, and elliptic in
the region where x² − y² < 0.
Typically, parabolic and hyperbolic PDE's result from diffusion,
equalization or oscillatory processes, and the usual independent
variables are time and space. On the other hand, the elliptic PDE is
generally associated with steady-state or equilibrium problems.
3.3 TYPES OF BOUNDARY CONDITIONS
The solution of a PDE has to satisfy, in particular, some
boundary conditions arising from the formulation of the problem
itself. Usually, the elliptic PDE's are classified as boundary value
problems since, as in Figure 3.1, the boundary conditions are given
around the (closed) region.
FIGURE 3.1: Boundary Value Problem (U = f given on the closed boundary ∂R).
The parabolic and hyperbolic types of equations are either
initial value problems or initial boundary value problems, where the
initial and/or boundary conditions are supplied on the sides of the open
region, and the solution proceeds towards the open side (see Figures
3.2a,b and 3.3a,b).
According to the specific type of boundary conditions defined on the
boundary ∂R of the region R, several basic categories of problem can be
distinguished. These are:
1. Dirichlet problem, where the solution U is specified at each
point on ∂R.
FIGURE 3.2: Initial Boundary Value Problem for a Parabolic Equation
(U(0,t) and U(a,t) given for t>0; U(x,0) given on [0,a] in (a) and on the
entire initial line in (b); the region is open in the t-direction).
FIGURE 3.3: Initial Boundary Value Problem for a Hyperbolic Equation
(U(0,t) and U(a,t) given for t>0; U and ∂U/∂t given on t=0 × [0,a] in (a)
and on t=0 × [0,∞) in (b); the region is open in the t-direction).
2. Neumann problem, where values of the normal derivative ∂U/∂n are
given on ∂R, where ∂U/∂n denotes the directional derivative of U
along the outward normal to ∂R.
3. Mixed problem, where the solution U is specified on part of ∂R
and ∂U/∂n is specified on the remainder of ∂R.
4. Periodic problem, where the solution U has to satisfy the periodicity
conditions, for example,

$$U\big|_{x} = U\big|_{x+\ell}, \qquad \frac{\partial U}{\partial n}\Big|_{x} = \frac{\partial U}{\partial n}\Big|_{x+\ell},$$

where ℓ is the period, and x and x+ℓ are on ∂R.
5. Robin's problem, where a combination of U and its derivatives is
given along the boundary, i.e., αU + β∂U/∂n is given on ∂R.
3.4 BASIC MATRIX ALGEBRA
Numerical approaches such as the finite difference and finite
element methods for solving ODE's and PDE's yield a system of linear
simultaneous equations which can be represented as a matrix system.
Methods of solving such systems depend on certain matrix properties, for
example, irreducibility, diagonal dominance, and positive definiteness
of the coefficient matrix of the system.
A matrix is defined as a two-dimensional array with each element
denoted as a_{i,j}, where i and j represent, respectively, the row and
the column of the array in which the element appears. A matrix A,
say, is of size (n×m) if it has n rows and m columns, and can be
denoted by,

$$A = [a_{i,j}] = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{bmatrix}. \qquad (3.4.1)$$

When n=1 we have a row vector, and for m=1 a column vector. Vectors
are usually denoted by a small underlined letter with a single
subscript, so that b_i represents the i-th element of the vector b. The
vector b, for example, whose elements are b_1, b_2, ..., b_n, is of
order n and is denoted by,

$$b = (b_1, b_2, \ldots, b_n)^T. \qquad (3.4.2)$$
A square matrix is a matrix of order n, where n=m. In this
thesis, all matrices used are square matrices, unless otherwise
stated. The set of elements a_{i,i}, where i=1,2,...,n, of the matrix
A in (3.4.1) is the diagonal of A. The transpose of the matrix A is
denoted by Aᵀ and is obtained by interchanging the rows and columns
of A. The determinant of A is denoted by det(A) or |A|. An inverse,
A⁻¹, of a given matrix A, if it exists, is a square matrix such that,

$$A^{-1}A = AA^{-1} = I,$$

where I is the identity (unit) matrix whose order is the same as that
of A and is defined by,

a_{i,i} = 1, for all i=1,2,...,n,
a_{i,j} = 0, for all i,j=1,2,...,n and i≠j.

If A possesses an inverse then it is nonsingular, otherwise it
is singular. Equivalently, A is singular if |A|=0, and nonsingular
if |A|≠0.
If the entries of a matrix A are complex numbers, the conjugate
of A is the matrix Ā whose entries are the conjugates of the
corresponding entries of A, i.e., if A = [a_{i,j}] then Ā = [ā_{i,j}]. The
Hermitian transpose (conjugate transpose) of A, denoted by Aᴴ, is the
transpose of Ā, i.e.,

$$A^H = (\bar{A})^T.$$

The sum of the diagonal elements of a matrix A is called the trace of
A, denoted by tr(A), i.e.,

$$\operatorname{tr}(A) = \sum_{i=1}^{n} a_{i,i}.$$
A permutation matrix P = [p_{i,j}] is a matrix which has elements of
zeros and ones only, with exactly one nonzero element in each row and
column. For any permutation matrix P we have,

$$P^T P = I, \quad \text{hence} \quad P^{-1} = P^T.$$
Definition 3.1:
The matrix A = [a_{i,j}] is said to be:
1. Symmetric, if A = Aᵀ;
2. Orthogonal, if A⁻¹ = Aᵀ;
3. Hermitian, if Aᴴ = A;
4. Null, if a_{i,j} = 0 for all i,j = 1,2,...,n;
5. Sparse, if a relatively large number of its elements a_{i,j} are
zero;
6. Dense (full), if a relatively large number of its elements a_{i,j}
are nonzero;
7. Diagonal, if a_{i,j} = 0 for i≠j;
8. Lower triangular, if a_{i,j} = 0 for i<j;
9. Upper triangular, if a_{i,j} = 0 for i>j.

Definition 3.2:
The matrix A = [a_{i,j}] is called:
1. Banded, if a_{i,j} = 0 for |i−j|>r, where (2r+1) is the bandwidth of A;
2. Tridiagonal, if r=1;
3. Quindiagonal, if r=2;
4. Block diagonal, if each Dᵢ, where i=1,2,...,r (Figure 3.4d), is a
square matrix.
FIGURE 3.4: Types of Banded Matrices (where x denotes a nonzero element):
(a) diagonal matrix; (b) tridiagonal matrix, r=1; (c) quindiagonal
matrix, r=2; (d) block diagonal matrix with square blocks D₁,D₂,...,Dₙ.
Definition 3.3:
An (n×n) matrix A is diagonally dominant if,

$$|a_{i,i}| \geq \sum_{\substack{j=1 \\ j\neq i}}^{n} |a_{i,j}|, \quad \text{for all } 1 \leq i \leq n,$$

and for at least one i,

$$|a_{i,i}| > \sum_{\substack{j=1 \\ j\neq i}}^{n} |a_{i,j}|.$$
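Definition 3.3 translates directly into a short test. A minimal sketch follows; the tridiagonal matrix used to exercise it is a hypothetical example (the familiar second-difference matrix), not one drawn from this chapter.

```python
def is_diagonally_dominant(A):
    """Check Definition 3.3: |a_ii| >= sum of |a_ij| over j != i for
    every row, with strict inequality in at least one row."""
    n = len(A)
    strict = False
    for i in range(n):
        off_diag = sum(abs(A[i][j]) for j in range(n) if j != i)
        if abs(A[i][i]) < off_diag:
            return False          # dominance fails in row i
        if abs(A[i][i]) > off_diag:
            strict = True         # strict dominance seen in some row
    return strict

# Second-difference tridiagonal matrix: dominant (strictly so in the
# first and last rows, with equality in the middle row).
A = [[2, -1, 0],
     [-1, 2, -1],
     [0, -1, 2]]
print(is_diagonally_dominant(A))  # True
```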
Definition 3.4:
If a matrix A is Hermitian, and xᴴAx > 0 for all x ≠ 0, then A is
positive definite. A is non-negative definite if xᴴAx ≥ 0 for all x.
Definition 3.5:
A sequence of matrices A⁽¹⁾, A⁽²⁾, A⁽³⁾, ..., of the same dimension
is said to converge to a matrix A if and only if,

$$\lim_{k \to \infty} \|A - A^{(k)}\| = 0,$$

where ||·|| is a suitable norm.

Let A be a square matrix; then A converges to zero if the sequence
of powers A¹, A², A³, ..., converges to the null matrix, and is
divergent otherwise.
Definition 3.6:
A matrix A = [a_{i,j}] of order n has property A if there exist two
disjoint subsets S₁ and S₂ of W = {1,2,...,n} such that if i≠j and if
either a_{i,j}≠0 or a_{j,i}≠0, then i ∈ S₁ and j ∈ S₂, or else i ∈ S₂ and
j ∈ S₁.
Definition 3.7:
A matrix A of order n is consistently ordered if for some t
there exist disjoint subsets S₁,S₂,...,Sₜ of W = {1,2,...,n} such that
∪ₖ₌₁ᵗ Sₖ = W, and such that if i and j are associated, then j ∈ Sₖ₊₁ if
j>i and j ∈ Sₖ₋₁ if j<i, where Sₖ is the subset containing i.
3.4.1 Vectors and Matrix Norms
The measure of the size or magnitude of a vector or matrix is called
its norm and is denoted by ||·||. If u is a vector then its norm is
denoted by ||u||, and it is a non-negative number satisfying the
following three axioms:
1. ||u|| = 0 for u = 0, and ||u|| > 0 if u ≠ 0;
2. ||αu|| = |α|·||u|| for any complex scalar α;
3. ||u+v|| ≤ ||u|| + ||v|| for any vectors u and v.

Similarly, we can define the matrix norm. The norm of
a matrix A of order n, denoted by ||A||, satisfies the following
axioms:
1. ||A|| ≥ 0, with equality only when A is the null matrix;
2. ||αA|| = |α|·||A||, for any complex scalar α;
3. ||A+B|| ≤ ||A|| + ||B||, for any matrices A and B (the triangle
inequality);
4. ||AB|| ≤ ||A||·||B||, for any matrices A and B (the Schwarz
inequality).
There are several common types of norms. Some of them are:
i. The Euclidean norm, denoted by ||A||_E, given by,

$$\|A\|_E = \Big(\sum_{i,j} |a_{i,j}|^2\Big)^{1/2}.$$

ii. The spectral or L₂ norm, denoted by ||A||_s, given by the square
root of the largest eigenvalue of AᴴA.
iii. The L₁ norm, denoted by ||A||₁, given by,

$$\|A\|_1 = \max_{j} \sum_{i=1}^{n} |a_{i,j}|.$$

iv. The L∞ norm, denoted by ||A||_∞, given by,

$$\|A\|_\infty = \max_{i} \sum_{j=1}^{n} |a_{i,j}|.$$
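As a minimal sketch, the Euclidean, L₁ and L∞ norms above can be computed directly from their definitions; the 2×2 example matrix is an illustrative assumption.

```python
def norm_E(A):
    """Euclidean norm: square root of the sum of |a_ij|^2."""
    return sum(abs(x) ** 2 for row in A for x in row) ** 0.5

def norm_1(A):
    """L1 norm: maximum absolute column sum."""
    n_rows, n_cols = len(A), len(A[0])
    return max(sum(abs(A[i][j]) for i in range(n_rows))
               for j in range(n_cols))

def norm_inf(A):
    """L-infinity norm: maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)

A = [[1, -2],
     [3, 4]]
print(norm_E(A))    # sqrt(1 + 4 + 9 + 16) = sqrt(30)
print(norm_1(A))    # max(1 + 3, 2 + 4) = 6
print(norm_inf(A))  # max(1 + 2, 3 + 4) = 7
```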
3.4.2 Eigenvalues and Eigenvectors
Suppose that A is an (n×n) matrix and u is a nonzero column vector
of order n. If there exists a scalar λ such that,

$$Au = \lambda u, \qquad (3.4.3)$$

then λ is called an eigenvalue of A and u its corresponding eigenvector.

Equation (3.4.3) can be written as,

$$(A - \lambda I)u = 0. \qquad (3.4.4)$$

A nontrivial solution, u ≠ 0, to this matrix equation exists if and
only if the matrix of the system is singular, i.e.,

$$\det(A - \lambda I) = 0. \qquad (3.4.5)$$

Equation (3.4.5) is called the characteristic equation of A and
its left-hand side is called the characteristic polynomial of A, which
can be written as,

$$a_0 + a_1\lambda + \cdots + a_{n-1}\lambda^{n-1} + (-1)^n\lambda^n = 0. \qquad (3.4.6)$$

Since the coefficient of λⁿ is not zero, equation (3.4.6) always
has n roots (complex or real), which are the n eigenvalues of the
matrix A, namely λ₁, λ₂, ..., λₙ (not necessarily all distinct), each
of them possessing a corresponding eigenvector. Wilkinson
[Wilkinson, 1965] described in detail many methods for obtaining the
eigenvalues, along with the corresponding eigenvectors.
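One of the simplest such methods, the power method, can be sketched as follows. It estimates only the dominant eigenvalue and its eigenvector, and assumes that eigenvalue is positive and strictly largest in magnitude; the 2×2 symmetric matrix used below is an illustrative assumption, not an example from this thesis.

```python
def power_iteration(A, iters=200):
    """Estimate the dominant eigenvalue/eigenvector pair of A by
    repeatedly applying A and normalising in the infinity norm.
    Assumes a positive dominant eigenvalue, strictly largest in
    magnitude (pure-Python sketch)."""
    n = len(A)
    u = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        v = [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in v)   # eigenvalue estimate
        u = [x / lam for x in v]       # normalised eigenvector estimate
    return lam, u

# [[2,1],[1,2]] has eigenvalues 3 and 1; the dominant eigenvector
# is proportional to (1, 1).
lam, u = power_iteration([[2.0, 1.0], [1.0, 2.0]])
print(lam)  # 3.0
```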
3.5 NUMERICAL SOLUTION OF PDE'S BY THE FINITE DIFFERENCE METHOD
Numerical methods for the solution of PDE's have been studied
extensively for many years and a number of approaches have been
developed. The most widely used are the finite element and finite
difference methods. Finite difference methods are discrete techniques
in which the domain of interest is represented by a set of points
(nodes); the information between these points is commonly obtained
using Taylor series expansions.
3.5.1 Finite Difference Approximation
To understand the finite difference technique it is necessary
to consider the nomenclature and fundamental concepts encountered in
this form of approximation theory. The basic concept is to subdivide
the domain of the solution of the given PDE by a net with a finite
number of mesh points. The derivative at each point is then replaced
by a finite difference approximation. In this subsection, we shall
develop certain finite difference representations for one- and two-
independent-variable systems using Taylor series expansions.

Let us first consider u(x), in which u is a continuous function
of the single independent variable x. By discretizing the x domain
into a set of points (nodes) such that,

$$u(x_i) \equiv u(ih) \equiv u_i, \quad i = 0,1,2,3,\ldots, \qquad (3.5.1)$$

and replacing the location x_i by ih, the nodal coordinates are
specified as the product of the integer i and the grid spacing h, where
h is assumed constant and to be a small quantity less than unity (see
Figure 3.5).
FIGURE 3.5: Finite Difference Discretization of u=u(x) using Constant
Mesh Spacing h.
FIGURE 3.6: Two-Dimensional Finite Difference Grid: (a) mesh spacing h
in the x-direction and k in the y-direction; (b) the neighbouring points
(i−1,j), (i+1,j), (i,j+1) and (i,j−1) of the point (i,j).
In the two-dimensional case the function u(x,y) may be
specified at any nodal location as,

$$u(x_i, y_j) \equiv u(ih, jk) \equiv u_{i,j}, \quad i = 0,1,2,\ldots, \;\; j = 0,1,2,\ldots \qquad (3.5.2)$$

The spacing in the x-direction is h and in the y-direction, k. The
integers i and j refer to the location of u along the x and y
coordinates, respectively (see Figure 3.6).
The Taylor series expansion for u(x) can be written at the
point x_i as,

$$u(x_i+h) = u(x_i) + h\,u_x|_i + \frac{h^2}{2!}u_{xx}|_i + \frac{h^3}{3!}u_{xxx}|_i + \cdots \qquad (3.5.3a)$$

$$u(x_i-h) = u(x_i) - h\,u_x|_i + \frac{h^2}{2!}u_{xx}|_i - \frac{h^3}{3!}u_{xxx}|_i + \cdots \qquad (3.5.3b)$$

Rearranging these equations, we may write,

$$u_x|_i = \frac{u(x_i+h) - u(x_i)}{h} - \frac{h}{2!}u_{xx}|_i - \frac{h^2}{3!}u_{xxx}|_i - \cdots \qquad (3.5.4a)$$

$$u_x|_i = \frac{u(x_i) - u(x_i-h)}{h} + \frac{h}{2!}u_{xx}|_i - \frac{h^2}{3!}u_{xxx}|_i + \cdots \qquad (3.5.4b)$$

Therefore, two possible approximations to the first derivative of u
at x_i are given by (3.5.5a,b):

$$u_x|_i \approx \frac{u(x_i+h) - u(x_i)}{h} = \frac{u_{i+1} - u_i}{h}, \qquad (3.5.5a)$$

or

$$u_x|_i \approx \frac{u(x_i) - u(x_i-h)}{h} = \frac{u_i - u_{i-1}}{h}. \qquad (3.5.5b)$$
Because the series has been arbitrarily truncated, there is clearly
an error, E_i say, associated with this approximation. This error
can be characterized by the first and largest term of the truncated
series, which, for (3.5.5a), yields,

$$E_i \approx -\frac{h}{2!}u_{xx}|_i = O(h).$$

We say that this error is of order h, O(h). The O(h) error is
in absolute value smaller than ch (c a constant) for sufficiently
small h.
Adding (3.5.4a) and (3.5.4b) and solving for u_x|_i results in,

$$u_x|_i = \frac{u_{i+1} - u_{i-1}}{2h}, \qquad (3.5.6)$$

with a truncation error of order O(h²). Subtracting (3.5.4b) from
(3.5.4a) and solving for u_xx|_i, we obtain,

$$u_{xx}|_i = \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2}, \qquad (3.5.7)$$

also with a truncation error of order O(h²).

We can proceed further to derive many finite difference
approximations for u(x,y). The new concepts involve the use of u_{i,j}
instead of u_i, and the fact that a partial derivative with respect to x
implies that y is held constant. Using (3.5.5a) in conjunction with
Figure 3.6b as an illustration, we can write directly,

$$\frac{\partial u}{\partial x}\Big|_{i,j} \approx \frac{u_{i+1,j} - u_{i,j}}{h} + O(h), \qquad (3.5.8a)$$

$$\frac{\partial u}{\partial y}\Big|_{i,j} \approx \frac{u_{i,j+1} - u_{i,j}}{k} + O(k). \qquad (3.5.8b)$$

So, in the approximation to u_x|_{i,j} we hold the subscript j
constant, while in u_y|_{i,j} we hold i constant. The two-dimensional
equivalent of (3.5.7),

$$u_{xx}|_{i,j} = \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} + O(h^2), \qquad (3.5.9a)$$

is obtained in the same way, as is,

$$u_{yy}|_{i,j} = \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{k^2} + O(k^2). \qquad (3.5.9b)$$
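The stated orders of the truncation errors can be checked numerically: halving h should roughly halve the forward-difference error of (3.5.5a) and roughly quarter the central-difference error of (3.5.6). A minimal sketch, using sin(x) as an arbitrary test function (an illustrative assumption):

```python
import math

f = math.sin   # test function: f'(x) = cos(x), f''(x) = -sin(x)
x = 1.0

def forward(h):
    """Forward difference (3.5.5a): error O(h)."""
    return (f(x + h) - f(x)) / h

def central(h):
    """Central difference (3.5.6): error O(h^2)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second(h):
    """Second-derivative formula (3.5.7): error O(h^2)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

for h in (0.1, 0.05):
    print(h,
          abs(forward(h) - math.cos(x)),   # roughly halves with h
          abs(central(h) - math.cos(x)))   # roughly quarters with h
```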
An important extension is the derivation of mixed derivatives. If we
need the finite difference representation for u_xy,

$$u_{xy}\big|_{i,j} = \frac{\partial}{\partial x}\left(\frac{\partial u}{\partial y}\right)\bigg|_{i,j}, \qquad (3.5.10)$$

a finite difference representation of (3.5.10) is readily
obtained using (3.5.6) as follows,

$$\frac{\partial}{\partial x}\left(\frac{\partial u}{\partial y}\right)\bigg|_{i,j} = \frac{1}{2h}\left[\frac{\partial u}{\partial y}\Big|_{i+1,j} - \frac{\partial u}{\partial y}\Big|_{i-1,j}\right]$$

$$= \frac{1}{2h}\left[\frac{u_{i+1,j+1} - u_{i+1,j-1}}{2k} - \frac{u_{i-1,j+1} - u_{i-1,j-1}}{2k}\right] + O(h^2) + O(k^2)$$

$$= \frac{u_{i+1,j+1} - u_{i-1,j+1} - u_{i+1,j-1} + u_{i-1,j-1}}{4hk} + O(h^2) + O(k^2),$$

and when h and k are equal, this equation becomes,

$$u_{xy}\big|_{i,j} = \frac{u_{i+1,j+1} - u_{i-1,j+1} - u_{i+1,j-1} + u_{i-1,j-1}}{4h^2} + O(h^2). \qquad (3.5.11)$$
In Table 3.1, the most frequently used finite difference
approximations for u(x,y) are listed.

TABLE 3.1: Finite difference approximations in two independent
variables with k=h

  Derivative    Finite Difference Approximation                                Order of Error
  u_x|_{i,j}    (u_{i+1,j} − u_{i,j})/h                                        O(h)
  u_x|_{i,j}    (u_{i,j} − u_{i−1,j})/h                                        O(h)
  u_x|_{i,j}    (u_{i+1,j} − u_{i−1,j})/2h                                     O(h²)
  u_x|_{i,j}    (−u_{i+2,j} + 4u_{i+1,j} − 3u_{i,j})/2h                        O(h²)
  u_x|_{i,j}    (u_{i+1,j+1} − u_{i−1,j+1} + u_{i+1,j−1} − u_{i−1,j−1})/4h     O(h²)
  u_xx|_{i,j}   (u_{i+1,j} − 2u_{i,j} + u_{i−1,j})/h²                          O(h²)
  u_xx|_{i,j}   (−u_{i+2,j} + 16u_{i+1,j} − 30u_{i,j} + 16u_{i−1,j} − u_{i−2,j})/12h²   O(h⁴)
  u_xy|_{i,j}   (u_{i+1,j+1} − u_{i+1,j−1} − u_{i−1,j+1} + u_{i−1,j−1})/4h²    O(h²)
The two-dimensional finite difference approximations can be
extended in a straightforward manner to three space dimensions, or
two space dimensions and time. The approximation of derivatives in
the third space dimension is analogous to that presented earlier.
To conclude this subsection, we now illustrate how the finite
difference approximation is used to represent PDE's. Let us consider
the heat flow parabolic equation,

$$\frac{\partial u}{\partial t} = u_{xx}, \qquad (3.5.12)$$

and the Laplace or Poisson elliptic equation,

$$u_{xx} + u_{yy} = 0, \qquad (3.5.13a)$$
or
$$u_{xx} + u_{yy} = f(x,y). \qquad (3.5.13b)$$

Temporarily, we shall ignore the initial and boundary conditions
and concentrate on the derivation of some finite difference forms for
the PDE which will hold within the domain of interest.
A finite difference representation of (3.5.12) is readily
obtained using information from Table 3.1. A term-by-term
substitution yields,

$$\frac{u_{i,j+1} - u_{i,j}}{k} = \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} + O(k, h^2).$$

Neglecting the truncation error, the finite difference equation at
the level (j+1) becomes,

$$u_{i,j+1} = \lambda u_{i+1,j} + (1-2\lambda)u_{i,j} + \lambda u_{i-1,j}, \qquad (3.5.14)$$

where λ = k/h². The error associated with (3.5.14) is of order
O(k,h²). The discrete diagram for (3.5.14) appears in Figure 3.7a.
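A minimal sketch of the explicit scheme (3.5.14) applied to the model problem u_t = u_xx on [0,1] follows. The homogeneous boundary values and the initial condition u(x,0) = sin(πx), whose exact solution decays like exp(−π²t), are illustrative assumptions.

```python
import math

def explicit_heat_step(u, lam):
    """One time step of (3.5.14): u_{i,j+1} = lam*u_{i+1,j}
    + (1 - 2*lam)*u_{i,j} + lam*u_{i-1,j} at the interior points;
    the boundary values u[0] and u[-1] are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = lam * u[i + 1] + (1 - 2 * lam) * u[i] + lam * u[i - 1]
    return new

# u_t = u_xx on [0,1] with u(0,t) = u(1,t) = 0 and u(x,0) = sin(pi*x);
# the exact solution is exp(-pi^2 t) * sin(pi*x).
n = 10
h = 1.0 / n
lam = 0.4                 # lam = k/h^2 <= 1/2, so the scheme is stable
k = lam * h * h           # time step 0.004
u = [math.sin(math.pi * i * h) for i in range(n + 1)]
for _ in range(50):       # advance to t = 50*k = 0.2
    u = explicit_heat_step(u, lam)
exact_mid = math.exp(-math.pi ** 2 * 0.2) * math.sin(math.pi * 0.5)
print(abs(u[n // 2] - exact_mid))  # small discretisation error
```

Choosing λ > 1/2 here makes the computed values oscillate and grow, which is the instability discussed in Section 3.5.3.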
The Laplace equation (3.5.13a) can be approximated in exactly
the same manner using Table 3.1,

$$\frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{h^2} + \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{k^2} + O(h^2, k^2) = 0,$$

which can be rearranged to yield,

$$2\left(\frac{1}{h^2} + \frac{1}{k^2}\right)u_{i,j} = \frac{1}{h^2}\big[u_{i+1,j} + u_{i-1,j}\big] + \frac{1}{k^2}\big[u_{i,j+1} + u_{i,j-1}\big]. \qquad (3.5.15)$$

The discrete diagram for equation (3.5.15) is given in Figure 3.7b.
When k=h, equation (3.5.15) reduces to,

$$u_{i,j} = \tfrac{1}{4}\big[u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}\big], \qquad (3.5.16)$$

which is also second-order accurate.

FIGURE 3.7: Computational Molecules for Finite Difference
Approximations: (a) the explicit molecule for (3.5.14); (b) the
five-point molecule for (3.5.15).
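Equation (3.5.16) suggests an obvious fixed-point computation: repeatedly replace each interior value by the average of its four neighbours until the values settle. The 5×5 grid and the boundary data (u = 1 on one edge, 0 elsewhere) are illustrative assumptions:

```python
def five_point_sweep(u):
    """One sweep of (3.5.16): replace each interior value by the
    average of its four neighbours; boundary values stay fixed."""
    n = len(u)
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (u[i + 1][j] + u[i - 1][j]
                                + u[i][j + 1] + u[i][j - 1])
    return new

# Laplace's equation on a 5x5 grid: u = 1 on the top edge, 0 elsewhere.
n = 5
u = [[0.0] * n for _ in range(n)]
u[0] = [1.0] * n
for _ in range(200):
    u = five_point_sweep(u)
# By symmetry (one hot edge out of four) the centre value is 1/4.
print(round(u[2][2], 4))  # 0.25
```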
3.5.2 Derivation of Finite Difference Approximations
Previously, we have shown that by using truncated Taylor series
expansions we can replace a PDE with a finite difference approximation.
The truncated terms lead to the truncation error, which provides a
measure of the accuracy of the finite difference approximation. For
any PDE there are many possible finite difference representations, and
one would expect the schemes with smaller truncation error to be
preferred. This is not always true, since other features in addition
to the truncation error make a particular finite difference
approximation a feasible candidate for a computation.

Let us consider equation (3.5.12) as a model for developing the
finite difference approximations. There are two types of approximation,
explicit and implicit. Table 3.2 shows some of the well-known finite
difference methods. The truncation error and the stability requirement
of each method are also included.
TABLE 3.2: Well-known Finite Difference Methods for ∂u/∂t = ∂²u/∂x²,
with λ = k/h²

1. Classical Explicit:
   u_{i,j+1} = (1−2λ)u_{i,j} + λ(u_{i+1,j} + u_{i−1,j})
   Stability: λ ≤ 1/2.  Truncation error: O(k,h²).

2. DuFort-Frankel Explicit:
   (1+2λ)u_{i,j+1} = 2λ(u_{i+1,j} + u_{i−1,j}) + (1−2λ)u_{i,j−1}
   Stability: unconditionally stable.  Truncation error: O(k²,h²).

3. Richardson Explicit:
   u_{i,j+1} = u_{i,j−1} + 2λ(u_{i+1,j} + u_{i−1,j}) − 4λu_{i,j}
   Stability: unstable.  Truncation error: O(k²,h²).

4. Special Explicit (λ = 1/6):
   u_{i,j+1} = u_{i,j} + (1/6)(u_{i+1,j} − 2u_{i,j} + u_{i−1,j})
   Stability: stable.  Truncation error: O(k²,h⁴).

5. Backwards (Fully) Implicit:
   (1+2λ)u_{i,j+1} − λ(u_{i−1,j+1} + u_{i+1,j+1}) = u_{i,j}
   Stability: unconditionally stable.  Truncation error: O(k,h²).

6. Crank-Nicolson Implicit:
   (2+2λ)u_{i,j+1} − λ(u_{i−1,j+1} + u_{i+1,j+1})
      = (2−2λ)u_{i,j} + λ(u_{i−1,j} + u_{i+1,j})
   Stability: unconditionally stable.  Truncation error: O(k²,h²).

7. Weighted Implicit (0 ≤ θ ≤ 1):
   (1+2λθ)u_{i,j+1} − λθ(u_{i−1,j+1} + u_{i+1,j+1})
      = λ(1−θ)(u_{i−1,j} + u_{i+1,j}) + (1−2λ(1−θ))u_{i,j}
   Stability: λ ≤ 1/(2−4θ) for 0 ≤ θ < 1/2; unconditionally stable for
   1/2 ≤ θ ≤ 1.  Truncation error: O(k²,h²) for θ = 1/2; O(k,h²) for θ ≠ 1/2.

8. Douglas Implicit:
   (5/6+λ)u_{i,j+1} − (λ/2 − 1/12)(u_{i−1,j+1} + u_{i+1,j+1})
      = (λ/2 + 1/12)(u_{i−1,j} + u_{i+1,j}) + (5/6−λ)u_{i,j}
   Stability: unconditionally stable.  Truncation error: O(k²,h⁴).

9. Variation of Douglas Implicit:
   Stability: unconditionally stable.  Truncation error: O(k²,h⁴).

10. Saul'ev's Alternating:
    (a) u_{i,j+1} − u_{i,j} = λ(u_{i+1,j} − u_{i,j} − u_{i,j+1} + u_{i−1,j+1})
    (b) u_{i,j+2} − u_{i,j+1} = λ(u_{i+1,j+2} − u_{i,j+2} − u_{i,j+1} + u_{i−1,j+1})
    Stability: unconditionally stable.  Truncation error: O(k,h²).
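As a sketch of one implicit method from Table 3.2, the Crank-Nicolson scheme (entry 6) requires a tridiagonal solve at each time step. The Thomas algorithm used below is the standard tridiagonal elimination, not a method described in this chapter, and the sine initial condition is the same illustrative assumption used earlier.

```python
import math

def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-, main and super-diagonals
    a, b, c and right-hand side d (standard Thomas algorithm)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def crank_nicolson_step(u, lam):
    """Entry 6 of Table 3.2, with u[0] = u[-1] = 0 held fixed:
    (2+2*lam)u_{i,j+1} - lam*(u_{i-1,j+1} + u_{i+1,j+1})
        = (2-2*lam)u_{i,j} + lam*(u_{i-1,j} + u_{i+1,j})."""
    m = len(u) - 2                       # number of interior points
    a = [-lam] * m
    b = [2 + 2 * lam] * m
    c = [-lam] * m
    d = [(2 - 2 * lam) * u[i] + lam * (u[i - 1] + u[i + 1])
         for i in range(1, m + 1)]
    return [0.0] + thomas(a, b, c, d) + [0.0]

n, h = 10, 0.1
lam = 2.0                 # k = lam*h^2 = 0.02; stable for any lam
u = [math.sin(math.pi * i * h) for i in range(n + 1)]
for _ in range(10):       # advance to t = 0.2
    u = crank_nicolson_step(u, lam)
exact = math.exp(-math.pi ** 2 * 0.2)
print(abs(u[5] - exact))  # small discretisation error
```

Note that λ = 2 here is well beyond the λ ≤ 1/2 limit of the classical explicit scheme, illustrating the unconditional stability claimed in the table.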
3.5.3 Consistency, Efficiency, Accuracy and Stability
When a PDE (say, equation (3.5.12)) is approximated by a finite
difference analogue, one naturally expects that the difference scheme
indeed represents the differential system in some sense. By this, we
mean that a difference system is consistent with a differential system
when the former becomes identical with the latter in the limit as h,k → 0.
Clearly, consistency is a fundamental requirement.

Efficiency refers to the amount of computational work done by the
computer in solving a problem over a unit time-length.

Accuracy of a numerical solution depends on two major classes of
errors, i.e., round-off and truncation errors. Round-off errors
characterise the differences between the solution furnished by the
computer and the exact solution of the difference equations.
Truncation errors are caused by the approximations involved in
representing differential equations, and depend upon the spatial grid
size h and the step size k. Intuitively, one would expect the accuracy
of any finite difference solution to be decreased by increasing the
grid sizes since, evidently, if the grid spacing is reduced to zero,
the discretized equivalent becomes identical to the continuous field.
The magnitude of truncation errors can be estimated using Taylor
series expansions.

In recommending a numerical method, we need to strike a balance
between efficiency and accuracy. This is because a method might incur
more work to attain accuracy. On the other hand, one may settle for a
less accurate method in favour of its simplicity and computing cost
effectiveness.
The final concept to be studied is stability. If u(x,t) is the
exact solution and u_{i,j} is the solution of the finite difference
equations, the error of the approximation at the point (i,j) is
(u_{i,j} − u(ih,jk)). One is interested in the behaviour of
|u_{i,j} − u(ih,jk)| as j → ∞ for fixed h,k; that is, whether the
solution is bounded (stable) as the index j → ∞. Also of interest is
the behaviour of |u_{i,j} − u(ih,jk)| as h,k → 0; that is, whether the
difference scheme is convergent.

It is clear in both cases that as the number of cycles of
calculation becomes large there is a possibility of unlimited
amplification of errors, and hence the total accumulated error may
quickly swamp the solution, rendering it worthless. It can therefore
be said that a numerical method is stable if a small error at any
stage produces a smaller cumulative error.
Lax's equivalence theorem gives a relation between consistency,
stability and convergence of the approximations of linear initial
value problems by finite difference equations.

Theorem 3.1
Given a properly posed initial boundary value problem and a
finite difference approximation to it that satisfies the consistency
condition, then stability is the necessary and sufficient condition
for convergence.

Various techniques are available for a quantitative treatment of
the stability of finite difference schemes. Among those commonly used
are the linearised Fourier analysis method (von Neumann criterion),
the matrix method, the maximum principle and the energy method.
3.6 METHODS OF SOLUTION
As mentioned in the previous section, the application of the
finite difference methods for solving PDE's (say, equation (3.5.12))
yields a system of linear simultaneous equations which can be
represented in matrix notation as,

$$Au = b, \qquad (3.6.1)$$

where A is a coefficient matrix of order (n×n), the order of the
matrix A equals the number of interior mesh points, b is a column
vector containing known source and boundary terms, and u is the
unknown column vector.
This section deals with well-known methods of solving the system
(3.6.1). Usually the methods used lie in two classes, the class of
direct methods (or elimination methods) and the class of iterative
methods (or indirect methods), the choice depending mainly upon the
structure of the coefficient matrix A. So, if A is a large sparse
matrix, iterative methods are usually used, since these will not change
the structure of the original matrix and therefore preserve sparsity.
Another advantage of iterative methods, not possessed by direct
methods, is their frequent extension to the solution of sets of
nonlinear equations.
3.6.1 Direct Methods
The direct methods are based ultimately on the process of the
elimination of variables. The most widely used methods are:
1. Gaussian Elimination (GE) method, which involves a finite number
of transformations of a given system of equations into an upper
triangular system which is much more easily solved. Precisely, the
number of transformations is one less than the size of the given
system. If any of the diagonal elements of the matrix A in the
system (3.6.1) becomes zero during the elimination process, we
reorder the equations. The reorderings are referred to as pivoting,
which is normally employed to preserve stability against rounding
error.

There are two basic well-known pivoting schemes: partial pivoting
and complete pivoting. In the partial pivoting strategy we choose
an element of largest magnitude in the column of each reduced matrix
as the pivot, elements of rows which have previously been pivotal
being excluded from consideration. In complete pivoting,
the pivot at each stage of the reduction is chosen as the element
of largest magnitude in the submatrix of rows which have not been
pivotal so far, regardless of the position of the element in the
matrix. This may require both row and column interchanges. Complete
pivoting is time-consuming in execution and is not
frequently used.
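The partial pivoting strategy described above can be sketched as follows; the 3×3 system is a hypothetical example with the known solution x = 2, y = 3, z = −1.

```python
def gauss_solve(A, b):
    """Gaussian elimination with partial pivoting: at each stage pick
    the largest-magnitude element in the current column of the reduced
    matrix as pivot, reduce to upper triangular form, then
    back-substitute."""
    n = len(b)
    A = [row[:] for row in A]     # work on copies
    b = b[:]
    for k in range(n - 1):
        # partial pivoting: largest |a_ik| among the unreduced rows
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n                 # back substitution
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

A = [[2.0, 1.0, -1.0],
     [-3.0, -1.0, 2.0],
     [-2.0, 1.0, 2.0]]
b = [8.0, -11.0, -3.0]
print(gauss_solve(A, b))  # [2.0, 3.0, -1.0] up to rounding
```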
2. Gauss-Jordan (GJ) method, which is an alternative to GE that leads
to a diagonal matrix, rather than a triangular one, at the end of the
process. In this method, the elements above the diagonal are made zero
at the same time that zeros are created below the diagonal, and hence
the solution can be obtained by dividing the components of the
right-hand side vector, b, by the corresponding diagonal elements,
i.e., there is no need for the back substitution stage as in GE.
3. Triangular, or LU, Decomposition, which is a modification of GE. The
matrix A is transformed into the product of two matrices L and U,
where L is a lower triangular matrix and U is an upper triangular matrix
with ones on its diagonal entries. Then equation (3.6.1) can be
written as,

$$LUu = b. \qquad (3.6.2)$$

The solution of (3.6.1) by this algorithm follows from (3.6.2) by
introducing an auxiliary vector, y (say), such that the system
(3.6.2) is split into two triangular systems,

$$Ly = b, \qquad (3.6.3a)$$
$$Uu = y. \qquad (3.6.3b)$$

The two vectors y and u can be obtained from (3.6.3a,b) by forward
and backward substitution processes, respectively.

The amount of work in LU and GE is the same, while the GJ
method requires almost 50% more operations than the GE method.
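A minimal sketch of the decomposition described above, in the Crout form matching the text (L lower triangular, U with unit diagonal), followed by the forward and backward substitutions of (3.6.3a,b). No pivoting is performed, so nonzero pivots are assumed; the 3×3 system is a hypothetical example.

```python
def crout_lu(A):
    """Crout factorisation A = L*U with L lower triangular and U upper
    triangular with unit diagonal, as in (3.6.2). Assumes the pivots
    L[j][j] are nonzero (no pivoting performed)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        U[j][j] = 1.0
        for i in range(j, n):       # column j of L
            L[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))
        for i in range(j + 1, n):   # row j of U
            U[j][i] = (A[j][i]
                       - sum(L[j][k] * U[k][i] for k in range(j))) / L[j][j]
    return L, U

def lu_solve(L, U, b):
    """Forward substitution L y = b (3.6.3a), then backward
    substitution U u = y (3.6.3b); U has unit diagonal."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    u = [0.0] * n
    for i in range(n - 1, -1, -1):
        u[i] = y[i] - sum(U[i][k] * u[k] for k in range(i + 1, n))
    return u

A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
L, U = crout_lu(A)
print(lu_solve(L, U, b))  # [1.0, 2.0, 3.0] up to rounding
```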
3.6.2 The Iterative Methods
It is well known that iterative methods can exploit the great
speeds of modern-day computers in large scale computations for
solving the matrix equations which arise from finite difference
approximations to PDE's.

In an iterative method, a sequence of approximate solution vectors
{u⁽ᵏ⁾} is generated for the nonsingular system (3.6.1) such that
u⁽ᵏ⁾ → A⁻¹b as k → ∞.
Without loss of generality, let us assume that the (n×n)
nonsingular coefficient matrix A of the system (3.6.1) can be expressed as,

$$A = Q - S, \qquad (3.6.4)$$

where Q and S are also (n×n) matrices and Q is nonsingular. This
expression represents a splitting of the matrix A. Equation
(3.6.1) then becomes,

$$Qu = Su + b. \qquad (3.6.5)$$

Different splittings Q and S will clearly give
different iterative methods. Some of these methods are:
1. The Jacobi Method. In this method we assume that Q=D and S=E+F,
   where D is the diagonal of A, and E and F are strictly
   lower and upper (nxn) triangular matrices respectively. Hence
   equation (3.6.5) can be written as,
                    Du = (E+F)u + b .                          (3.6.6)
   By the assumption that A is nonsingular, D^{-1} exists and we
   can replace the system (3.6.6) by the equivalent system,
                    u = D^{-1}(E+F)u + D^{-1}b .               (3.6.7)
   The Jacobi iterative method is then defined by,
                    u^(k+1) = B u^(k) + g ,                    (3.6.8)
   where B is the Jacobi iterative matrix associated with the matrix A
   and is given by,
                    B = D^{-1}(E+F) ,
   and g = D^{-1}b. In this method the components of the vector u^(k)
   must be saved while computing the components of u^(k+1).
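The Jacobi iteration (3.6.8) can be sketched as follows (an illustrative NumPy fragment; the function name and the test matrix are our own assumptions, not thesis code):

```python
import numpy as np

def jacobi(A, b, u0, iters):
    """Jacobi iteration (3.6.8): u^(k+1) = B u^(k) + g, with
    B = D^{-1}(E+F) = I - D^{-1}A and g = D^{-1}b, D = diag(A)."""
    D = np.diag(A)
    B = np.eye(len(b)) - A / D[:, None]   # B = I - D^{-1}A
    g = b / D
    u = u0
    for _ in range(iters):
        u = B @ u + g    # every component uses only the old iterate u^(k)
    return u
```

Note that the whole of u^(k) must be retained until the new iterate is complete, exactly as the text states.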
2. The Gauss-Seidel (GS) Method, which is based on the immediate use
   of the improved values u_i^(k+1) instead of u_i^(k). By setting Q=D-E
   and S=F, with the matrices D, E and F as defined before, equation (3.6.5)
   becomes,
                    (D-E)u = Fu + b .
   Then, the GS iterative method is defined by,
                    Du^(k+1) = Eu^(k+1) + Fu^(k) + b .         (3.6.9)
   By multiplying both sides by D^{-1}, equation (3.6.9) can be written
   as,
                    u^(k+1) = Lu^(k+1) + Ru^(k) + g ,          (3.6.10)
   where L = D^{-1}E, R = D^{-1}F, and g = D^{-1}b.
   Since L is a strictly lower triangular matrix, det(I-L) = 1. Hence
   (I-L) is a nonsingular matrix and (I-L)^{-1} exists, and the GS
   iterative method takes the form,
                    u^(k+1) = ℒ u^(k) + f ,                    (3.6.11)
   where ℒ is the GS iterative matrix, given by,
                    ℒ = (I-L)^{-1} R ,
   and
                    f = (I-L)^{-1} g .
   The computational advantage of this method is that it does not
   require the simultaneous storage of the two approximations u_i^(k+1)
   and u_i^(k) in the course of the computations, as does the Jacobi
   iterative method.
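The in-place use of improved values can be sketched as follows (again an illustrative NumPy fragment, not thesis code); only one vector is stored because each component is overwritten as soon as it is computed:

```python
import numpy as np

def gauss_seidel(A, b, u0, iters):
    """GS iteration (3.6.9): the improved values u_j^(k+1), j < i,
    are used immediately when computing u_i^(k+1)."""
    n = len(b)
    u = u0.astype(float).copy()
    for _ in range(iters):
        for i in range(n):
            # u[:i] already holds new values, u[i+1:] still holds old ones
            s = A[i, :i] @ u[:i] + A[i, i + 1:] @ u[i + 1:]
            u[i] = (b[i] - s) / A[i, i]
    return u
```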
3. The Simultaneous Overrelaxation (JOR) Method is a modification of
   the Jacobi method. If we assume that ũ^(k+1) is the vector obtained
   from the Jacobi method, then from (3.6.8),
                    ũ^(k+1) = B u^(k) + g ,                    (3.6.12)
   and by choosing a real parameter w, the actual vector u^(k+1) of
   this iteration method is determined from,
                    u^(k+1) = w ũ^(k+1) + (1-w) u^(k) .        (3.6.13)
   Elimination of ũ^(k+1) between equations (3.6.12) and (3.6.13) leads
   to,
                    u^(k+1) = B_w u^(k) + w g ,                (3.6.14)
   where B_w is the iteration matrix of the JOR method and is given
   by,
                    B_w = w D^{-1}(E+F) + (1-w) I .
   The real parameter w is called the relaxation factor. If w=1, we
   have the Jacobi method. If w>1 (<1) then we are in a sense
   carrying out the operation of "overrelaxation" ("underrelaxation")
   at each of the nodal points. Both the Jacobi and JOR methods are
   clearly independent of the order in which the mesh points are
   scanned.
4. The Successive Overrelaxation (SOR) Method, which is the same as
   the JOR method except that one uses the values of u_i^(k+1)
   whenever possible. So, from equation (3.6.10) the SOR method is
   defined as,
                    u^(k+1) = w(L u^(k+1) + R u^(k) + g) + (1-w) u^(k) .   (3.6.15)
   Equation (3.6.15) can be written in the form,
                    (I-wL) u^(k+1) = [wR + (1-w)I] u^(k) + w g .           (3.6.16)
   But (I-wL) is nonsingular for any choice of w, since det(I-wL) = 1.
   So we can solve (3.6.16) for u^(k+1), obtaining,
                    u^(k+1) = L_w u^(k) + (I-wL)^{-1} w g ,                (3.6.17)
   where L_w is the SOR iteration matrix and is given by,
                    L_w = (I-wL)^{-1} (wR + (1-w)I) .                      (3.6.18)
   For w=1, we have the GS method, and w>1 (<1) corresponds to the
   cases of overrelaxation (underrelaxation) at each of the nodal points.
   The GS and SOR methods are both dependent upon the order in which
   the points are scanned.
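One SOR sweep (3.6.15) can be sketched as follows (an illustrative NumPy fragment, not thesis code; the relaxation factor w is passed in, and w = 1 reproduces the GS method):

```python
import numpy as np

def sor(A, b, u0, w, iters):
    """SOR sweep (3.6.15): a Gauss-Seidel update relaxed by the factor w."""
    n = len(b)
    u = u0.astype(float).copy()
    for _ in range(iters):
        for i in range(n):
            # improved values u[:i] are used immediately, as in GS
            s = A[i, :i] @ u[:i] + A[i, i + 1:] @ u[i + 1:]
            u[i] = (1.0 - w) * u[i] + w * (b[i] - s) / A[i, i]
    return u
```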
5. The Symmetric SOR (SSOR) Method, which involves two
   half iterations using the SOR method. The first half iteration is
   the ordinary SOR method, while the second half iteration is the SOR
   method using the mesh points in reverse order. Hence, we can define
   the SSOR iterative method by,
                    u^(k+1/2) = L_w u^(k) + (I-wL)^{-1} w g ,              (3.6.19a)
   and
                    u^(k+1) = K_w u^(k+1/2) + (I-wR)^{-1} w g ,            (3.6.19b)
   where u^(k+1/2) is an intermediate approximation to the solution,
   L_w is given in (3.6.18) and K_w is given by,
                    K_w = (I-wR)^{-1} [wL + (1-w)I] .                      (3.6.20)
   By eliminating u^(k+1/2) between equations (3.6.19a) and (3.6.19b)
   we have,
                    u^(k+1) = H_w u^(k) + d_w ,                            (3.6.21)
   where H_w is the SSOR iteration matrix and is given by,
                    H_w = K_w L_w
                        = (I-wR)^{-1} (I-wL)^{-1} [wL+(1-w)I] [wR+(1-w)I] ,  (3.6.22a)
   and
                    d_w = w(2-w) (I-wR)^{-1} (I-wL)^{-1} g .               (3.6.22b)
3.6.3 Block Iterative Methods
In our previous discussion of the iterative methods for solving
equation (3.6.1), we dealt with point iterative methods, that is, at
each step the approximate solution is modified at a single point of the
domain. An extension of these methods is the block (group) iterative
methods in which several unknowns are connected together in the
iteration formula in such a way that a linear system must be solved
before any one of them can be determined. Equation (3.6.1) can be
written as,
                    Au = c .                                   (3.6.23)
In the system (3.6.23), assume that the equations and the unknowns u_i
are partitioned into q groups such that u_i, i=1,2,...,n_1, constitute
the first group; u_i, i=n_1+1, n_1+2,...,n_2, constitute the second group;
and, in general, u_i, n_{s-1} < i <= n_s, constitute the s-th group,
1 <= s <= q, with n_q = n.
Young [Young, 1971] defines an ordered grouping π of W = {1,2,3,...,n}
as a subdivision of W into disjoint subsets G_1, G_2,...,G_q such that
G_1 ∪ G_2 ∪ ... ∪ G_q = W. Two ordered groupings π and π', defined by
G_1, G_2,...,G_q and G_1', G_2',...,G_{q'}' respectively, are identical
if q = q' and if G_1 = G_1', G_2 = G_2',..., G_q = G_q'.
Evidently, this partitioning π imposes a partitioning of the matrix
A into blocks (groups) of the form,

          [ A_{1,1}  A_{1,2}  ...  A_{1,q} ]
          [ A_{2,1}  A_{2,2}  ...  A_{2,q} ]
    A =   [   .         .             .    ]                   (3.6.24)
          [ A_{q,1}  A_{q,2}  ...  A_{q,q} ]

where the diagonal blocks A_{i,i}, 1 <= i <= q, are square, nonsingular
matrices. From this partitioning of the matrix A, we define the
matrices,

    D = diag( A_{1,1}, A_{2,2}, ..., A_{q,q} ) ,               (3.6.25)

E, the strictly lower block triangular matrix with blocks -A_{i,j}
(i > j), and F, the strictly upper block triangular matrix with blocks
-A_{i,j} (i < j); D is thus a block diagonal matrix, and,
                    A = D - E - F .                            (3.6.26)
Now, for the column vector u in equation (3.6.23) we define
column vectors U_1, U_2,...,U_q, where U_s is formed from u by deleting all
elements of u except those corresponding to group s. Similarly, we
define column vectors C_1, C_2,...,C_q for the given vector c. We also
define the submatrices A_{i,j}, for i,j=1,2,...,q, such that each A_{i,j}
is formed from the matrix A by deleting all rows except those
corresponding to G_i and all columns except those corresponding to G_j.
The system (3.6.23) can now evidently be written in the equivalent
form,
                     q
                     Σ  A_{i,j} U_j = C_i ,   i=1,...,q .      (3.6.27)
                    j=1
If we suppose that all the submatrices A_{i,i} are nonsingular, then
the block Jacobi iterative method is defined by,

    A_{i,i} U_i^(k+1) = - Σ_{j=1, j≠i}^{q} A_{i,j} U_j^(k) + C_i ,      (3.6.28)

or equivalently,

    U_i^(k+1) = Σ_{j=1, j≠i}^{q} B_{i,j} U_j^(k) + V_i ,                (3.6.29)

where,
                { -A_{i,i}^{-1} A_{i,j}   if i≠j ,
    B_{i,j} =   {                                                       (3.6.30)
                {  0                      if i=j ,
and
    V_i = A_{i,i}^{-1} C_i .                                            (3.6.31)

We may write (3.6.28) in the matrix form,
    u^(k+1) = B^(π) u^(k) + v ,
where,
    B^(π) = D^{-1}(E+F)   and   v = D^{-1} c ,
and E and F are again strictly lower and upper block triangular matrices.
In a similar manner to the point case, we can derive the block
GS, JOR, SOR and SSOR methods.
The determination of a suitable value for the relaxation factor
w of the SOR method is of paramount importance, and in particular the
optimum value of w, denoted by w_b, which minimizes the spectral radius
of the SOR iteration matrix and thereby maximizes the rate of
convergence of the method.
Young [Young, 1954] proved that when a matrix possesses property
A, then it can be transformed into what he termed a consistently
ordered matrix. Under this condition the eigenvalues λ of the SOR
iteration matrix L_w associated with A are related to the eigenvalues μ
of the corresponding Jacobi iteration matrix B by the equation,
                    (λ + w - 1)^2 = λ w^2 μ^2 ,                (3.6.32)
from which it can be seen that,
                    λ - w μ λ^{1/2} + w - 1 = 0 .              (3.6.33)
If we assume that A is symmetric, then the eigenvalues of B are real
and occur in pairs ±μ. Let μ̄ denote the largest eigenvalue of B; it
can be shown that w_b, defined by,
                    w_b = 2(1 - sqrt(1 - μ̄^2)) / μ̄^2 ,        (3.6.34a)
or equivalently,
                    w_b = 2 / (1 + sqrt(1 - μ̄^2)) ,           (3.6.34b)
is the value of w which minimizes ρ(L_w), i.e. if w ≠ w_b, then,
                    ρ(L_w) > ρ(L_{w_b}) .                      (3.6.35)
Using this optimum value of w, it can also be shown that,
                    ρ(L_{w_b}) = w_b - 1
                               = (1 - sqrt(1 - μ̄^2)) / (1 + sqrt(1 - μ̄^2)) .   (3.6.36)
Moreover, for any w in the range 0 < w < 2, we have,

                { (1/4) [w μ̄ + (w^2 μ̄^2 - 4(w-1))^{1/2}]^2 ,   0 < w <= w_b ,
    ρ(L_w) =    {                                                          (3.6.37)
                { w - 1 ,                                      w_b <= w < 2 .
The estimation of w_b depends on whether ρ(B) or ρ(L_w) can be estimated.
Several methods have been suggested by Carré [Carré, 1961], Varga
[Varga, 1962] and Hageman and Kellogg [Hageman, 1968].
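When μ̄ = ρ(B) is available, (3.6.34b) and (3.6.36) can be checked numerically. The following sketch (our own, not from the thesis) uses the model matrix A = tridiag(-1, 2, -1), which is consistently ordered, forms L_w directly from (3.6.18), and compares ρ(L_{w_b}) with w_b - 1:

```python
import numpy as np

def omega_b(mu_bar):
    """Optimum SOR relaxation factor (3.6.34b)."""
    return 2.0 / (1.0 + np.sqrt(1.0 - mu_bar ** 2))

n = 10
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
B = np.eye(n) - A / 2.0                   # Jacobi matrix, D = 2I
mu_bar = max(abs(np.linalg.eigvals(B)))   # equals cos(pi/(n+1)) here
wb = omega_b(mu_bar)

def rho_sor(w):
    """Spectral radius of L_w = (I-wL)^{-1}(wR+(1-w)I) of (3.6.18)."""
    E = -np.tril(A, -1)
    F = -np.triu(A, 1)
    L, R = E / 2.0, F / 2.0               # L = D^{-1}E, R = D^{-1}F
    Lw = np.linalg.inv(np.eye(n) - w * L) @ (w * R + (1 - w) * np.eye(n))
    return max(abs(np.linalg.eigvals(Lw)))
```

Running `rho_sor(wb)` recovers w_b - 1 up to rounding, and any other choice of w gives a larger spectral radius, in line with (3.6.35).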
3.6.4 Alternating Direction Implicit (ADI) Methods
To increase convergence, another form of splitting was suggested,
which consists of an initial first half iteration in the row direction
followed by a second half iteration in the column direction. Such
methods are aptly designated Alternating Direction Implicit methods, or
ADI methods for short.
The first ADI methods were developed by Peaceman and Rachford (PR)
[Peaceman, 1955] for solving the matrix equation,
                    Au = b ,                                   (3.6.38)
where the (nxn) matrix A is nonsingular and can be represented as the
sum of the three (nxn) matrices,
                    A = H + V + Σ .                            (3.6.39)
We can make the following assertions about these matrices:
1. A is a real, symmetric, positive definite, irreducible matrix
   with nonpositive off-diagonal entries, derived from the
   discretisation of a two dimensional elliptic operator.
2. H and V are real, symmetric, diagonally dominant matrices with
   positive diagonal entries and nonpositive off-diagonal entries,
   derived from the row and column discretisation of the elliptic
   operator.
3. Σ is a nonnegative diagonal matrix.
In a typical situation H and V would be tridiagonal or could be
made so by a permutation of the rows and corresponding columns.
By using (3.6.39) we can write the matrix equation (3.6.38) as a
pair of matrix equations,
                    (H + ½Σ + rI) u = b - (V + ½Σ - rI) u ,
                                                               (3.6.40)
                    (V + ½Σ + rI) u = b - (H + ½Σ - rI) u ,
for any positive scalar r. If we let,
                    H_1 = H + ½Σ   and   V_1 = V + ½Σ ,
then the Peaceman-Rachford Alternating Direction Implicit method is
defined by,
                    (H_1 + r_{k+1} I) u^(k+½) = b - (V_1 - r_{k+1} I) u^(k) ,
                                                               (3.6.41)
                    (V_1 + r_{k+1} I) u^(k+1) = b - (H_1 - r_{k+1} I) u^(k+½) ,
where the r_k's are positive acceleration parameters chosen to make the
process converge rapidly.
Since the matrices (H_1 + r_{k+1}I) and (V_1 + r_{k+1}I) are, after suitable
permutations, tridiagonal nonsingular matrices, the above implicit
process can be directly carried out by a simple algorithm based on
GE. Indeed, the ADI method is derived from the observation that
for the first equation of (3.6.40) we solve first along horizontal mesh
lines, and then for the second equation of (3.6.40) we solve along
vertical mesh lines. The vector u^(k+½) is treated as an auxiliary
vector which is discarded as soon as it has been used in the
calculation of u^(k+1).
The two equations of (3.6.41) are now combined to give,
    u^(k+1) = T_{r_{k+1}} u^(k) + g_{r_{k+1}}(b) ,   k >= 0 ,             (3.6.42a)
where,
    g_r(b) = (V_1 + rI)^{-1} [I - (H_1 - rI)(H_1 + rI)^{-1}] b ,          (3.6.42b)
and
    T_r = (V_1 + rI)^{-1} (H_1 - rI)(H_1 + rI)^{-1} (V_1 - rI) .          (3.6.42c)
The convergence of the ADI scheme can easily be shown for the
stationary case with constant parameters r_i = r. If we let,
    T̃_r = (V_1 + rI) T_r (V_1 + rI)^{-1}
        = (H_1 - rI)(H_1 + rI)^{-1} (V_1 - rI)(V_1 + rI)^{-1} ,
then by similarity, T_r and T̃_r have the same eigenvalues. Hence from
(3.6.42c) we obtain,
    ρ(T_r) = ρ(T̃_r) <= ||T̃_r||
           <= ||(H_1 - rI)(H_1 + rI)^{-1}|| ||(V_1 - rI)(V_1 + rI)^{-1}|| ,
where ρ(T_r) is the spectral radius of T_r. Since H_1 and V_1 are
symmetric and positive definite, then in the L_2 norm we find,
    ||(H_1 - rI)(H_1 + rI)^{-1}||_2 = max_i |μ_i - r| / (μ_i + r) < 1 ,
where μ_i, 1 <= i <= n, are the (positive) eigenvalues of H_1. A similar
argument applies to the norm involving V_1. Hence, ρ(T_r) < 1 for all r > 0,
and therefore the PR iteration (3.6.41) converges.
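The key inequality ||(H_1 - rI)(H_1 + rI)^{-1}||_2 = max_i |μ_i - r|/(μ_i + r) < 1 is easy to verify numerically; the following sketch (our own, with an assumed small SPD matrix H_1 and an arbitrary r > 0) compares the computed 2-norm with the eigenvalue bound:

```python
import numpy as np

# An assumed symmetric positive definite H1 and acceleration parameter r
H1 = np.array([[2.0, -1.0, 0.0],
               [-1.0, 2.0, -1.0],
               [0.0, -1.0, 2.0]])
r = 1.5
I = np.eye(3)

M = (H1 - r * I) @ np.linalg.inv(H1 + r * I)
norm2 = np.linalg.norm(M, 2)                 # L2 (spectral) norm of the factor
mu = np.linalg.eigvalsh(H1)                  # positive eigenvalues of H1
bound = max(abs(m - r) / (m + r) for m in mu)
```

Since M is a rational function of the symmetric matrix H_1, its 2-norm equals the eigenvalue bound exactly, and both are below 1 for any r > 0.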
3.6.5 Alternating Group Explicit (AGE) Method
As we have already seen in subsection (3.6.4), the ADI method
was developed to obtain the solution implicitly in the horizontal and
vertical directions.
Evans [Evans, 1985], however, derived another method, the
analysis of which is analogous to the ADI scheme. This explicit
iterative method employs the fractional splitting strategy on tri-
diagonal systems of difference schemes and has proved to be stable.
Its rate of convergence is governed by the acceleration parameter r.
Let us recall equation (3.6.1), where the matrix A and the two
vectors u and b are given by,

        [ d  c                   ]
        [ a  d  c          0     ]
    A = [    .  .  .             ]                             (3.6.43)
        [   0   .  .  .          ]
        [          a  d  c       ]
        [             a  d       ] (nxn)

    u = (u_1, u_2, ..., u_n)^T   and   b = (b_1, b_2, ..., b_n)^T .
We now consider a class of methods for solving the system (3.6.1),
which is based on a new splitting of the matrix A into the sum of
the matrices,
                    A = G_1 + G_2 ,                            (3.6.44)
where G_1 and G_2 satisfy the conditions that G_1 + rI and G_2 + rI are
nonsingular for any r > 0. G_1 and G_2 are given by,

    G_1 = diag( B, B, ..., B )           (n/2 blocks) ,        (3.6.45a)
    G_2 = diag( ½d, B, ..., B, ½d )      ((n-2)/2 blocks) ,    (3.6.45b)

if n is even (odd number of intervals), and,

    G_1 = diag( ½d, B, B, ..., B )       ((n-1)/2 blocks) ,    (3.6.46a)
    G_2 = diag( B, B, ..., B, ½d )       ((n-1)/2 blocks) ,    (3.6.46b)

if n is odd (even number of intervals), where ½d = d/2 and each
(2 x 2) block B has the form,

            [ ½d  c  ]
    B   =   [ a   ½d ] .

It is assumed that the following conditions are satisfied:
i.  G_1 + rI and G_2 + rI are nonsingular for any r > 0;
ii. for any vectors b_1 and b_2 and for any r > 0, the systems,
        (G_1 + rI) x_1 = b_1   and   (G_2 + rI) x_2 = b_2 ,
    are easily solved in explicit form since they consist
    of only (2 x 2) subsystems.
We shall be concerned here with the situation where G_1 and G_2 are
either small (2 x 2) block systems or can be made so by a suitable
permutation of their rows and corresponding columns. This procedure
is convenient in the sense that the work required is much less than
would be required to solve the original system (3.6.1) directly.
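Because each diagonal block of (G_1 + rI) is only (2 x 2), its inverse is available in closed form, so the system can be solved pairwise and fully in parallel. A sketch (our own helper, assuming the even-n block structure of (3.6.45a)):

```python
import numpy as np

def solve_2x2_blocks(d, c, a, r, rhs):
    """Solve (G + rI)x = rhs where G = diag(B, B, ..., B) with
    B = [[d/2, c], [a, d/2]], as in (3.6.45a) for even n.
    Each (2x2) block is inverted explicitly, so every pair is
    independent of every other pair."""
    w = d / 2.0 + r
    det = w * w - a * c                  # det of the shifted block [[w,c],[a,w]]
    x = np.empty_like(rhs)
    for i in range(0, len(rhs), 2):      # pairs (x_i, x_{i+1})
        b1, b2 = rhs[i], rhs[i + 1]
        x[i] = (w * b1 - c * b2) / det
        x[i + 1] = (-a * b1 + w * b2) / det
    return x
```

This is the "explicit" character of the AGE method: no forward elimination couples the pairs together.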
By using (3.6.44) we can write the matrix equation (3.6.1) in the
form,
                    (G_1 + G_2) u = b .                        (3.6.47)
Following a strategy similar to the ADI method, u^(k+½) and u^(k+1)
can be determined implicitly by,
                    (G_1 + rI) u^(k+½) = b - (G_2 - rI) u^(k) ,
                                                               (3.6.48a)
                    (G_2 + rI) u^(k+1) = b - (G_1 - rI) u^(k+½) ,
or explicitly by,
                    u^(k+½) = (G_1 + rI)^{-1} [b - (G_2 - rI) u^(k)] ,
                                                               (3.6.48b)
                    u^(k+1) = (G_2 + rI)^{-1} [b - (G_1 - rI) u^(k+½)] .
The above two equations can be combined into the form,
                    u^(k+1) = T_r u^(k) + b_r ,                (3.6.49)
where,
                    T_r = (G_2 + rI)^{-1} (G_1 - rI)(G_1 + rI)^{-1} (G_2 - rI) ,
                                                               (3.6.50)
and
                    b_r = (G_2 + rI)^{-1} [I - (G_1 - rI)(G_1 + rI)^{-1}] b .
The matrix T_r is called the AGE iteration matrix.
We now seek to analyse the convergence properties of the AGE
method. Let us assume that ū is the exact solution; then,
                    (G_1 + rI) ū = b - (G_2 - rI) ū ,
                                                               (3.6.51)
                    (G_2 + rI) ū = b - (G_1 - rI) ū .
Let e^(k) = u^(k) - ū be the error vector associated with the vector
iterate u^(k). Therefore from (3.6.48b) and (3.6.51) we have,
                    (G_1 + rI) e^(k+½) = -(G_2 - rI) e^(k) ,
and similarly,
                    (G_2 + rI) e^(k+1) = -(G_1 - rI) e^(k+½) ,
and hence,
                    e^(k+1) = T_r e^(k) ,
where T_r is given in (3.6.50).
To indicate the convergence properties of T_r, we have the
following theorem.
Theorem 3.2
If G_1 and G_2 are real positive definite matrices and if r > 0, then
ρ(T_r) < 1, where ρ(T_r) is the spectral radius of T_r.
Proof: If we define the matrix T̃_r as,
    T̃_r = (G_2 + rI) T_r (G_2 + rI)^{-1}
        = (G_1 - rI)(G_1 + rI)^{-1} (G_2 - rI)(G_2 + rI)^{-1} ,
then it is evident that T̃_r is similar to T_r, and hence from the
properties of the matrix norms we have,
    ρ(T_r) = ρ(T̃_r) <= ||(G_1 - rI)(G_1 + rI)^{-1}|| ||(G_2 - rI)(G_2 + rI)^{-1}|| .
However, since G_1 and G_2 are symmetric and since (G_1 - rI) commutes
with (G_1 + rI)^{-1}, we have,
    ||(G_1 - rI)(G_1 + rI)^{-1}|| = ρ((G_1 - rI)(G_1 + rI)^{-1})
                                  = max_λ |λ - r| / (λ + r) ,
where λ ranges over all eigenvalues of G_1. But since G_1 is positive
definite, its eigenvalues are positive. Therefore,
    ||(G_1 - rI)(G_1 + rI)^{-1}|| < 1 .
The same argument applied to the corresponding matrix product with G_2
shows that ||(G_2 - rI)(G_2 + rI)^{-1}|| < 1, and we therefore conclude
that,
    ρ(T_r) = ρ(T̃_r) <= ||T̃_r|| < 1 ,
hence the convergence follows.
It is possible to determine the optimum parameter r such that the
bound for ρ(T_r) is minimized.
Let us assume that G_1 and G_2 are real positive definite matrices
and that the eigenvalues λ of G_1 and μ of G_2 lie in the ranges,
                    0 < a <= λ, μ <= b .                       (3.6.52)
Evidently, if r > 0 we have,
    ||T̃_r|| <= ( max_{a<=λ<=b} |λ-r|/(λ+r) ) ( max_{a<=μ<=b} |μ-r|/(μ+r) )
            = [ max_{a<=y<=b} |y-r|/(y+r) ]^2 = φ(a,b;r) .     (3.6.53)
Since (y-r)/(y+r) is an increasing function of y, we have,
    max_{a<=y<=b} |y-r|/(y+r) = max( |a-r|/(a+r) , |b-r|/(b+r) ) .
When r = sqrt(ab), then,
    |a-r|/(a+r) = |b-r|/(b+r) = (sqrt(b) - sqrt(a)) / (sqrt(b) + sqrt(a)) .
Moreover, if 0 < r < sqrt(ab), we have,
    (b-r)/(b+r) - (sqrt(b)-sqrt(a))/(sqrt(b)+sqrt(a))
        = 2 sqrt(b) (sqrt(ab) - r) / [(b+r)(sqrt(b)+sqrt(a))] > 0 ,
and if sqrt(ab) < r, then,
    (r-a)/(r+a) - (sqrt(b)-sqrt(a))/(sqrt(b)+sqrt(a))
        = 2 sqrt(a) (r - sqrt(ab)) / [(r+a)(sqrt(b)+sqrt(a))] > 0 .
Thus, r = sqrt(ab) is optimum in the sense that the bound φ(a,b;r) for
ρ(T_r) is minimized.
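The bound φ(a,b;r) of (3.6.53) and its minimiser r = sqrt(ab) are easy to check numerically; the interval [a,b] below is an arbitrary example of our own choosing:

```python
import numpy as np

def phi(a, b, r):
    """Bound (3.6.53): phi(a,b;r) = [max_{a<=y<=b} |y-r|/(y+r)]^2,
    where the inner maximum is attained at one of the endpoints."""
    g = max(abs(a - r) / (a + r), abs(b - r) / (b + r))
    return g * g

a, b = 0.5, 8.0                      # assumed eigenvalue range of G1, G2
r_opt = np.sqrt(a * b)               # claimed minimiser r = sqrt(ab)
min_val = ((np.sqrt(b) - np.sqrt(a)) / (np.sqrt(b) + np.sqrt(a))) ** 2
```

Scanning r over a grid confirms that phi(a, b, r) is never smaller than its value at r = sqrt(ab).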
For an efficient implementation of the AGE algorithm, it is
essential to vary the acceleration parameters r_k from iteration to
iteration. This will result in a substantial improvement in the rate
of convergence of the AGE method; equation (3.6.48a) can then be
written as,
    (G_1 + r_{k+1} I) u^(k+½) = b - (G_2 - r_{k+1} I) u^(k) ,
                                                               (3.6.54)
    (G_2 + r_{k+1} I) u^(k+1) = b - (G_1 - r_{k+1} I) u^(k+½) ,   k >= 0 .
The best values of r_{k+1} can be ascertained provided G_1 and G_2 are
commutative (see Evans, 1985b).
CHAPTER 4
STEADY STATE PROBLEMS,
PARALLEL EXPLORATIONS
It isn't really
Anywhere!
It's somewhere else
Instead!
A.A. Milne.
4.1 INTRODUCTION
Previously, in section (3.6), we concluded that the point (explicit)
methods have natural extensions to block iterative processes in which
groups of components of u^(k) are modified simultaneously. A faster
rate of convergence may be obtained if a group of points is
evaluated at once in one iteration step rather than solving for the
individual points.
In this chapter, a method involving a (2x2) block is implemented
in parallel. Three parallel strategies for the AGE method are
presented; the AGE method is an ideal parallel algorithm since it can be
subdivided into a number of small independent tasks which can be
executed at the same time without interfering with each other. These
strategies, for solving one, two or more dimensional boundary value
problems, were developed and implemented on the Balance 8000
system, and include synchronous and asynchronous versions of the
algorithms. The results from these implementations are compared,
and the timing results and the performance analysis of the best
strategy are presented.
A one dimensional boundary value problem with a boundary condition
involving a derivative is also solved using the AGE iterative method
with the D'Yakonov splitting formula. The same three parallel
strategies were also implemented on this kind of problem.
The Sturm-Liouville problem,
    d/dx ( p(x) dU/dx ) - q(x) U + λ ρ(x) U = 0 ,
where p, q and ρ are real functions of x in the interval a <= x <= b,
subject to boundary conditions at the points a and b, is also solved
using the AGE algorithm, where the parameter λ and the solution vector
U are determined.
4.2 PARALLEL AGE EXPLOITATION
The basic concepts of the AGE scheme for solving a system of
equations were presented in subsection (3.6.5). In this section we
proceed with the parallel implementation of this method.
Three strategies are investigated and implemented to solve one
and two dimensional boundary value problems. The strategies are
mainly concerned with the way the problem to be solved is decomposed
into many tasks that can be run in parallel. In the first two
strategies the problem is solved by decomposing its interval into
subsets and assigning each subset to a different processor so that
they can run in parallel, whilst in the third strategy the problem is
solved by decomposing its domain into partitions, each partition being
assigned to a different processor.
These three strategies are programmed on the Balance MIMD system
using both the synchronous and asynchronous approaches. The results
from the implementations of these approaches, such as the time
needed to solve the problem, the number of iterations required and the
'speed-up' ratios, are obtained and compared.
In these strategies, shared memory is used to hold the input,
the results from the first sweep and the final output component values.
These values can then be accessed by the different processes. Before a
process iterates on its task, it needs to read all its components
first into private memory; it then releases all the values of the
components for the next iteration. In the different parallel versions,
different mesh sizes are evaluated. The results shown in this chapter
are an average of many runs.
To demonstrate these strategies, let us consider the differential
equation,
                    -d²U/dx² + q(x) U = f(x) ,                 (4.2.1)
subject to the two-point boundary conditions,
                    U(a) = α ,   U(b) = β ,                    (4.2.2)
where α and β are given real constants, and f(x) and q(x) are given
real continuous functions in a <= x <= b, with q(x) >= 0.
For simplicity, we place a uniform mesh of size h, where
                    h = (b-a) / (n+1) ,
on the interval a <= x <= b, and we denote the mesh points of the
discrete problem by,
                    x_i = a + ih ,   0 <= i <= n+1 ,
as illustrated in Figure 4.1 below.

    x_0=a   x_1   x_2   ...   x_{n-1}   x_n   x_{n+1}=b
    FIGURE 4.1

Using Table 3.1, the finite difference replacement of equation
(4.2.1) is given by,
    -u_{i-1} + (2 + h² q_i) u_i - u_{i+1} = h² f_i ,           (4.2.3)
which can be written in matrix notation as,
                    Au = b ,                                   (4.2.4)
where A is as in (3.6.43), with d = 2 + h² q_i and a = c = -1.
If we split A into the component matrices G_1 and G_2, where G_1 and
G_2 are as defined in (3.6.45a,b), we can find u^(k+½) and u^(k+1)
explicitly, as in (3.6.48b).
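The assembly of the system (4.2.4) can be sketched as follows (a NumPy fragment with hypothetical data q(x) = 1, f(x) = x and zero boundary values, chosen only for illustration; the boundary values α, β are folded into the first and last components of b):

```python
import numpy as np

n = 9
h = 1.0 / (n + 1)                    # mesh size on an assumed interval [0,1]
x = h * np.arange(1, n + 1)
q = np.ones(n)                       # q(x) = 1  (assumed)
f = x.copy()                         # f(x) = x  (assumed)
alpha, beta = 0.0, 0.0               # assumed boundary values

d = 2.0 + h * h * q                  # diagonal entries d_i = 2 + h^2 q_i
A = np.diag(d) - np.eye(n, k=1) - np.eye(n, k=-1)   # a = c = -1
rhs = h * h * f
rhs[0] += alpha                      # boundary values move to the RHS
rhs[-1] += beta
u = np.linalg.solve(A, rhs)          # direct solve, for reference only
```

The AGE iteration operates on exactly this A and rhs, replacing the direct solve with the two explicit sweeps.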
Figures 4.2 and 4.3 show the computational molecules for the
(k+½) and (k+1) sweeps (the first and second sweeps).

FIGURE 4.2: Explicit Computational Molecule for the (k+½)th Sweep.

FIGURE 4.3: Explicit Computational Molecule for the (k+1)th Sweep.
In the first strategy, the mesh of points (Figure 4.1) is
decomposed into subsets of points, each of which is assigned to a
processor. Each processor then computes its own subset in two sweeps.
In the first sweep, two successive points (in their natural order) are
evaluated at a time, starting from the first two points and
terminating after evaluating the last two points. The second
sweep is started after the first sweep has been completed. In the
second sweep we evaluate the first point on its own, then each two
successive points at a time, and finally the last point on its own.
As an example, consider the interval shown in Figure 4.1. If we
start evaluating the mesh points by taking a pair of points at a time,
then the order of the first sweep is,
    (x_1,x_2), (x_3,x_4), ..., (x_{n-1},x_n) .
After the completion of the first sweep, the ordering of the mesh
points in the second sweep will be,
    (x_1), (x_2,x_3), (x_4,x_5), ..., (x_{n-2},x_{n-1}), (x_n) .
So, a single iteration is terminated after evaluating all the
points in the given interval in both the first and second sweeps. A
test of convergence is carried out by one processor. If all the
components of the mesh are obtained with the required accuracy then
the procedure terminates; otherwise further iterations are needed
until all the components have converged.
As in the first strategy, in the second strategy the problem mesh
of points (Figure 4.1) is decomposed into subsets, each of which is
assigned to a processor. Each processor computes its own subset in two
sweeps. In the first sweep, we evaluate each two successive points in
an odd-even (red-black) manner. That is, we first evaluate all the
odd points, followed by all the even points. In the second sweep,
the evaluation is carried out in the same manner as the first sweep,
i.e., the odd points are evaluated first, followed by the even ones.
As an example, suppose that n=10 (in Figure 4.1). Then in the
first sweep we start evaluating the odd mesh points in the order
x_1, x_3, x_5, x_7 and x_9, followed by the even points in the order
x_2, x_4, x_6, x_8 and x_10. Likewise in the second sweep, we first
evaluate the odd points in the order x_1, x_3, x_5, x_7 and x_9,
followed by the even points in the order x_2, x_4, x_6, x_8 and x_10.
Then, a test of convergence is carried out by one processor, and if
all the components of the mesh points are obtained with the required
accuracy then the procedure terminates; otherwise further iterations
are needed until convergence is achieved.
The third strategy is completely different from the other two
strategies in the manner of partitioning the domain (domain
decomposition) over the processors. The problem domain is partitioned
into subtasks, and each of these tasks is assigned to a processor.
If p is the number of processors and n is the number of points in the
interval (Figure 4.1), then each partition contains n/p points.
Each processor then computes its own n/p points in two sweeps,
asynchronously, without waiting for the other processors to complete
their computations. At the end of each iteration, each processor
checks for convergence. If convergence is obtained, the processor
sets its flag and tests the remaining flags to ensure that the other
groups have also set their flags; otherwise further iterations are
required.
4.3 EXPERIMENTAL RESULTS FOR THE ONE DIMENSIONAL PROBLEM
Consider the linear problem,
                    d²U/dx² + U = x ,                          (4.3.1)
subject to the boundary conditions,
                    U(0) = 1 ,   U(π/2) = π/2 - 1 .            (4.3.2)
The exact solution of this problem is given by,
                    U(x) = cos x - sin x + x .                 (4.3.3)
By following the finite difference discretisation procedure
given in Table 3.1, equation (4.3.1) can be approximated to obtain the
linear difference equation,
    -u_{i-1} + (2 - h²) u_i - u_{i+1} = -h² x_i ,              (4.3.4)
where x_i = ih, for i=1,2,...,n.
The boundary conditions are replaced by the values,
    u_0 = 1 ,   u_{n+1} = π/2 - 1 ,                            (4.3.5)
where h = (π/2 - 0) / (n+1).
The linear system (4.3.4) can be represented in matrix notation as,
                    Au = b ,                                   (4.3.6)
where A is as in (3.6.43), with
    u = (u_1, u_2, ..., u_n)^T   and   b = (b_1, b_2, ..., b_n)^T ,   (4.3.7)
where d = 2 - h², c = a = -1 and b_i = -h² x_i, for i=1,2,...,n.
If we take n even and split A into G_1 and G_2, equation (4.3.6) can
be written as,
                    (G_1 + G_2) u = b ,                        (4.3.8)
where G_1 and G_2 are as given in equation (3.6.45a,b), with
                    ½d = 1 - h²/2 .
Hence, by applying the AGE method, u^(k+½) and u^(k+1) can be
determined successively by,
    u^(k+½) = (G_1 + rI)^{-1} [b - (G_2 - rI) u^(k)] ,
                                                               (4.3.9)
    u^(k+1) = (G_2 + rI)^{-1} [b - (G_1 - rI) u^(k+½)] ,
where r is the iteration parameter. It is obvious that the (2x2) sub-
matrices of (G_1+rI), (G_2+rI), (G_1-rI) and (G_2-rI) can be determined,
and (G_1+rI)^{-1}, (G_2+rI)^{-1} are easily invertible, as shown below.
With w = ½d + r and v = ½d - r, the (2x2) submatrices of (G_1+rI)
and (G_2+rI) have the form,

    [  w  -1 ]                                 1   [ w  1 ]
    [ -1   w ] ,   with inverse   G* = ------------[ 1  w ] ,   det = w² - 1 ,   (4.3.10a,b)
                                            det

while the (2x2) submatrices of (G_1-rI) and (G_2-rI) have the form,

    [  v  -1 ]
    [ -1   v ] .                                               (4.3.10c,d)

In addition, (G_2+rI) has the scalar entries w in its first and last
diagonal positions, with corresponding inverse entries 1/w.
Therefore, the vector u^(k+1) can be determined from u^(k) in two
steps. We first determine u^(k+½) explicitly: with the right hand side
vector c = b - (G_2 - rI) u^(k), whose components are,
    c_1 = b_1 - v u_1 ,
    c_{2i}   = b_{2i} - v u_{2i} + u_{2i+1} ,
                                                  i=1,...,(n-2)/2 ,
    c_{2i+1} = b_{2i+1} + u_{2i} - v u_{2i+1} ,
    c_n = b_n - v u_n ,
the pairs of new components are,
    u_{2i-1}^(k+½) = (w c_{2i-1} + c_{2i}) / det ,
                                                  i=1,...,n/2 .        (4.3.11a)
    u_{2i}^(k+½)   = (c_{2i-1} + w c_{2i}) / det ,
Now, by using the values of u^(k+½) obtained above, we can determine
u^(k+1) explicitly: with c = b - (G_1 - rI) u^(k+½), whose components
are,
    c_{2i-1} = b_{2i-1} - v u_{2i-1} + u_{2i} ,
                                                  i=1,...,n/2 ,
    c_{2i}   = b_{2i} + u_{2i-1} - v u_{2i} ,
the new components are,
    u_1^(k+1) = c_1 / w ,
    u_{2i}^(k+1)   = (w c_{2i} + c_{2i+1}) / det ,
                                                  i=1,...,(n-2)/2 ,    (4.3.11b)
    u_{2i+1}^(k+1) = (c_{2i} + w c_{2i+1}) / det ,
    u_n^(k+1) = c_n / w .
The convergence test used was the average test,
    abs(u_i^(k+1) - u_i^(k)) / (1 + abs(u_i^(k))) < ε ,
where ε = 0.1 x 10^{-4}.
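The complete two-sweep process for this model problem can be sketched as follows (a NumPy illustration of our own, using full matrix solves for clarity instead of the explicit (2x2) formulas (4.3.11a,b); n = 12 and r = 0.495 are taken from Table 4.1, and the iteration count is simply a generous fixed number rather than the convergence test):

```python
import numpy as np

# Model problem (4.3.1): U'' + U = x on [0, pi/2], U(0)=1, U(pi/2)=pi/2-1
n = 12                                   # n even, as the splitting assumes
h = (np.pi / 2) / (n + 1)
x = h * np.arange(1, n + 1)
d = 2.0 - h * h
A = d * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = -h * h * x
b[0] += 1.0                              # boundary value U(0) = 1
b[-1] += np.pi / 2 - 1.0                 # boundary value U(pi/2) = pi/2 - 1

# Splitting (3.6.45a,b): G1 carries the pairs (1,2),(3,4),...;
# G2 the half-diagonals plus the pairs (2,3),(4,5),...
G1 = (d / 2) * np.eye(n)
G2 = (d / 2) * np.eye(n)
for i in range(0, n - 1, 2):
    G1[i, i + 1] = G1[i + 1, i] = -1.0
for i in range(1, n - 2, 2):
    G2[i, i + 1] = G2[i + 1, i] = -1.0

r = 0.495                                # the experimental optimum of Table 4.1
I = np.eye(n)
u = np.zeros(n)
for _ in range(300):                     # AGE sweeps (4.3.9)
    half = np.linalg.solve(G1 + r * I, b - (G2 - r * I) @ u)
    u = np.linalg.solve(G2 + r * I, b - (G1 - r * I) @ half)
```

In a real AGE code the two `solve` calls are replaced by the pairwise explicit formulas, which is what makes the sweeps embarrassingly parallel.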
This problem was implemented in parallel on the Balance 8000 system
using the three strategies described previously. In all these parallel
implementations a different number of points within the given interval
was taken. The optimal iteration parameter r was also obtained from the
numerical experiments by choosing the value that gives the smallest
number of iterations.
The results from the first parallel synchronous strategy are
illustrated in Table 4.1. From this table, it is clear that the optimal
timing results are obtained when the number of subsets is equal to the
number of available processors.
Table 4.2 shows the results obtained from implementing the second
parallel synchronous strategy. By comparing the results of Tables 4.1
and 4.2, we notice that the times for the first strategy are less than
those of the second strategy, i.e., evaluating the points using the
first strategy takes less time to converge than the second strategy.
This is mainly due to the distribution of the components within each
strategy. Firstly, we notice from the implementation of the second
strategy that its number of computational operations is higher than
that of the first strategy. Secondly, there is a possibility in the
second strategy that during the evaluation of its components the old
values may be used, which means extra iterations will be needed.
Whilst in the first strategy, the most recent values of the components
will be used in the evaluation process and a greater rate of
convergence is achieved.
  No. of   No. of       Elapsed time     r       No. of       Speed-up
  points   processors   in seconds               iterations
    12         1            2.293      0.495        17           1
               2            1.325                   17           1.730
               3            1.177                   17           1.948
               4            1.079                   17           2.125
               5            0.971                   17           2.361
               6            1.040                   17           2.204
               7            1.134                   17           2.022
               8            1.207                   17           1.899
               9            1.273                   17           1.801
    48         1            7.423      0.513        25           1
               2            3.791                   25           1.958
               3            2.595                   25           2.595
               4            2.000                   25           3.711
               5            1.721                   25           4.313
               6            1.431                   25           5.187
               7            1.435                   25           5.172
               8            1.152                   25           6.443
               9            1.182                   25           6.280

TABLE 4.1: The Results from the First Synchronous Strategy
  No. of   No. of       Elapsed time     r       No. of       Speed-up
  points   processors   in seconds               iterations
    72         1           13.661      0.515        32           1
               2            6.890                   32           1.982
               3            4.643                   32           2.942
               4            3.533                   32           3.866
               5            3.128                   32           4.367
               6            2.419                   32           5.647
               7            2.403                   32           5.684
               8            2.052                   32           6.657
               9            1.855                   32           7.364
   120         1           17.339      0.560        39           1
               2            8.928                   39           1.942
               3            6.112                   39           2.836
               4            4.544                   39           3.815
               5            3.750                   39           4.623
               6            3.336                   39           5.197
               7            2.947                   39           5.883
               8            2.567                   39           6.754
               9            2.202                   39           7.874

TABLE 4.1: The Results from the First Synchronous Strategy (continued)
  No. of   No. of       Elapsed time     r       No. of       Speed-up
  points   processors   in seconds               iterations
    12         1            2.640      0.495        19           1
               2            1.559                   21           1.693
               3            1.280                   21           2.062
               4            1.221                   21           2.162
               5            1.102                   21           2.395
               6            1.178                   21           2.241
               7            1.258                   21           2.098
               8            1.296                   21           2.037
               9            1.306                   21           2.021
    48         1            7.614      0.513        25           1
               2            4.013                   28           1.897
               3            2.993                   28           2.543
               4            2.547                   28           2.989
               5            2.001                   28           3.805
               6            1.854                   28           4.106
               7            1.772                   28           4.296
               8            1.431                   28           5.320
               9            1.322                   28           5.759

TABLE 4.2: The Results from the Second Synchronous Strategy
  No. of   No. of       Elapsed time     r       No. of       Speed-up
  points   processors   in seconds               iterations
    72         1           13.992      0.515        33           1
               2            7.133                   35           1.961
               3            5.042                   35           2.775
               4            4.001                   35           3.497
               5            3.828                   35           3.655
               6            2.772                   35           5.047
               7            2.512                   36           5.570
               8            2.152                   35           6.501
               9            1.927                   35           7.261
   120         1           17.443      0.560        44           1
               2            9.128                   44           1.910
               3            7.113                   44           2.452
               4            5.544                   44           3.146
               5            4.750                   44           3.672
               6            4.336                   44           4.022
               7            3.947                   44           4.419
               8            2.567                   44           6.795
               9            2.403                   44           7.258

TABLE 4.2: The Results from the Second Synchronous Strategy (continued)
The third strategy was implemented using the asynchronous
approach. The results from this implementation are shown in Table 4.3.
By comparing the results obtained from the first and third strategies,
we notice that the asynchronous implementation takes less time to
converge than the synchronous approach, and this is due to the
synchronization overheads needed after each iteration in the
synchronous implementation. Also, from both Tables 4.1 and 4.3,
it is clear that better efficiency can be obtained by using the
third strategy. This is because the speed-up ratios of the asynchronous
implementation are higher than those of the synchronous one.
To conclude from the results of the 3 strategies, we can say that
in the implementation of the one dimensional boundary value problem
using the parallel AGE method, the best results are obtained when the
problem domain is decomposed into a number of subsets, each of which
is assigned to a processor, where the number of processors is equal to
the number of subsets.
There are extra overheads incurred by the system which degrade
the parallel algorithm performance in both the synchronous and
asynchronous implementations. These overheads are the generation of
the parallel paths and the synchronization at the end of each iteration
cycle.
The timing results obtained from these strategies are
diagrammatically illustrated in Figure 4.4.
The computational complexity of the sequential algorithm for each
point in each iteration is equal to (8 additions + 8 multiplications).
Now, for a mesh of n points, each processor will evaluate n/p points,
with a total computational complexity equal to
T = [(8 additions + 8 multiplications) * n/p] operations per iteration.
  No. of   No. of       Elapsed time     r       No. of       Speed-up
  points   processors   in seconds               iterations
    12         1            2.451      0.495        14           1
               2            1.244                   14           1.970
               3            1.068                   14           2.294
               4            0.821                   14           2.985
               5            0.699                   14           3.506
               6            0.599                   14           4.091
               7            0.506                   14           4.843
               8            0.473                   14           5.181
               9            0.426                   14           5.753
    48         1            6.188      0.513        21           1
               2            3.109                   21           1.990
               3            2.088                   21           2.963
               4            1.574                   21           3.931
               5            1.245                   21           4.970
               6            1.052                   21           5.882
               7            0.989                   21           6.256
               8            0.936                   21           6.611
               9            0.893                   21           6.929

TABLE 4.3: The Results from the Third Asynchronous Strategy
No. of   No. of       Elapsed time    r       No. of       Speedup
points   processors   in seconds              iterations

  72         1           13.915      0.515        28           1
             2            6.996                   29           1.988
             3            4.701                   29           2.960
             4            3.519                   29           3.954
             5            2.807                   29           4.957
             6            2.342                   29           5.941
             7            2.119                   29           6.566
             8            1.989                   29           6.995
             9            1.785                   29           7.795

 120         1           19.663      0.563        37           1
             2            9.904                   37           1.985
             3            6.574                   38           2.991
             4            4.990                   38           3.940
             5            3.948                   38           4.980
             6            3.058                   38           6.430
             7            2.819                   37           6.975
             8            2.711                   37           7.253
             9            2.493                   37           7.887

TABLE 4.3: The Results from the Third Asynchronous Strategy (continued)
[Figure: elapsed time in seconds against number of processors for the first, second and third strategies, n = 48.]

FIGURE 4.4: The Timing Results for the One Dimensional Boundary Value Problem.
4.4 EXPERIMENTAL RESULTS FOR THE TWO DIMENSIONAL PROBLEM
The concept of the AGE method is now extended to the case of the
two dimensional problem. Consider the Dirichlet problem on the region
R,

    d2U/dx2 + d2U/dy2 = 0 ,   (x,y) in R ,                       (4.4.1)

subject to the boundary conditions,

    U(0,y) = U(1,y) = 0 ,        0<=y<=1 ,                       (4.4.2a)
    U(x,0) = U(x,1) = sin πx ,   0<=x<=1 .                       (4.4.2b)

The exact solution of this problem is given by,

    U(x,y) = sech(π/2) * cosh(π(y - 1/2)) * sin πx .             (4.4.3)
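As a quick illustrative check (not part of the thesis itself), the stated exact solution can be verified numerically against the boundary conditions (4.4.2a,b) and the differential equation (4.4.1):

```python
import math

# Check that U(x,y) = sech(pi/2) cosh(pi(y - 1/2)) sin(pi x) satisfies
# the boundary conditions and has a vanishing Laplacian.
def U(x, y):
    sech = 1.0 / math.cosh(math.pi / 2)
    return sech * math.cosh(math.pi * (y - 0.5)) * math.sin(math.pi * x)

assert abs(U(0.0, 0.3)) < 1e-12 and abs(U(1.0, 0.7)) < 1e-12   # U(0,y)=U(1,y)=0
for x in (0.25, 0.5, 0.75):
    assert abs(U(x, 0.0) - math.sin(math.pi * x)) < 1e-12       # U(x,0) = sin(pi x)
    assert abs(U(x, 1.0) - math.sin(math.pi * x)) < 1e-12       # U(x,1) = sin(pi x)

# Central-difference Laplacian residual should be O(h^2)
h = 1e-3
x0, y0 = 0.4, 0.6
lap = (U(x0+h, y0) + U(x0-h, y0) + U(x0, y0+h) + U(x0, y0-h) - 4*U(x0, y0)) / h**2
assert abs(lap) < 1e-4
print("exact solution verified")
```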
By following the finite difference discretisation procedure given
in Table 3.1, equation (4.4.1) can be approximated by,

    -u_{i-1,j} - u_{i+1,j} + 4u_{i,j} - u_{i,j-1} - u_{i,j+1} = 0 ,   1<=i,j<=n ,   (4.4.4)

where x_i = ih, y_j = jh, for 0<=i,j<=n+1, and n is assumed even. If we
assume that R is a regular region and we order the (n x n) internal mesh
points rowwise, i.e. 1,2,...,n on the first row, n+1,n+2,...,2n on the
second, and so on up to n^2 on the last row, we obtain the numbering
shown in Figure 4.5.

[Figure: the unit square with corners (0,0) and (1,1); the boundary value is 0 on the vertical sides x=0 and x=1 and sin πx on the horizontal sides y=0 and y=1; the internal points are numbered rowwise from 1 to n^2.]

FIGURE 4.5
Applying (4.4.4) at each mesh point yields the system Au = b, with

    A = tridiag(-I, B, -I) ,   B = tridiag(-1, 4, -1) ,          (4.4.5)

where A is of order (n^2 x n^2), the blocks B and I are of order n,
u = (u_1, u_2, ..., u_{n^2})^T, and the right hand side vector b holds
the known boundary values, i.e. b_i = sin πx_i for the points of the
first and last mesh rows and b_i = 0 elsewhere.
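The assembly of this system can be sketched as follows (an illustration, not the thesis code; the helper name `assemble` is hypothetical):

```python
import numpy as np

# Illustrative assembly of the system Au = b of (4.4.5) for the model
# problem, with n internal points per direction.
def assemble(n):
    T = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D second difference
    I = np.eye(n)
    A = np.kron(I, T) + np.kron(T, I)                    # diagonal entries are 4
    x = np.arange(1, n + 1) / (n + 1)
    b = np.zeros(n * n)
    b[:n] += np.sin(np.pi * x)      # mesh row adjacent to y = 0
    b[-n:] += np.sin(np.pi * x)     # mesh row adjacent to y = 1
    return A, b

A, b = assemble(4)
assert np.allclose(np.diag(A), 4.0)       # 4 on the diagonal
assert np.allclose(A, A.T)                # symmetric
print(A.shape)
```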
If we split A into the sum of its constituent symmetric and
positive definite matrices G1, G2, G3 and G4, we have,

    A = G1 + G2 + G3 + G4 ,                                      (4.4.6)

where G1 and G2 are the differences in the x plane and G3 and G4
are the differences in the y plane, each containing one quarter of the
diagonal of A. Then, with E denoting the (2 x 2) matrix

    E = [  1  -1 ]
        [ -1   1 ] ,

G1 is the (n^2 x n^2) block diagonal matrix with the blocks E pairing
the points (1,2),(3,4),...,(n-1,n) along each mesh row,               (4.4.7a)

and G2 has unit diagonal entries at the first and last point of each
mesh row together with the blocks E pairing the points
(2,3),(4,5),...,(n-2,n-1).                                            (4.4.7b)
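The splitting above can be sketched and verified numerically (an illustration under the structure just described; the Kronecker-product construction is a convenience, not the thesis's notation):

```python
import numpy as np

# Build G1..G4 for n even and verify A = G1 + G2 + G3 + G4.
def splitting(n):
    E = np.array([[1.0, -1.0], [-1.0, 1.0]])
    H1 = np.zeros((n, n))                      # pairs (1,2),(3,4),...
    for i in range(0, n - 1, 2):
        H1[i:i+2, i:i+2] = E
    H2 = np.zeros((n, n))                      # unit ends, pairs (2,3),...
    H2[0, 0] = H2[-1, -1] = 1.0
    for i in range(1, n - 2, 2):
        H2[i:i+2, i:i+2] = E
    I = np.eye(n)
    G1, G2 = np.kron(I, H1), np.kron(I, H2)    # x-direction differences
    G3, G4 = np.kron(H1, I), np.kron(H2, I)    # y-direction differences
    return G1, G2, G3, G4

n = 4
T = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A = np.kron(np.eye(n), T) + np.kron(T, np.eye(n))
assert np.allclose(sum(splitting(n)), A)
print("A = G1 + G2 + G3 + G4 verified")
```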
and, in the rowwise ordering, G3 and G4 couple each mesh point with
its vertical neighbours, which lie n positions apart:

G3 consists of the blocks E, applied at the distance n, linking the
mesh rows (1,2),(3,4),...,(n-1,n),                                    (4.4.8a)

and G4 consists of unit diagonal entries for the points of the first
and last mesh rows together with the blocks E, again applied at the
distance n, linking the mesh rows (2,3),(4,5),...,(n-2,n-1).          (4.4.8b)

Also, by reordering the points columnwise, i.e. along the y
direction, we find that G3 and G4 take the same structure as G1
and G2 respectively.                                                  (4.4.8c,d)
Figures 4.6 and 4.7 illustrate the way the points are grouped in
G1, G2, G3 and G4 in the x and y directions respectively.

[Figure: the mesh points grouped in (2 x 2) pairs along the x direction, as used by G1 and G2.]

FIGURE 4.6

[Figure: the mesh points grouped in (2 x 2) pairs along the y direction, as used by G3 and G4.]

FIGURE 4.7
The Douglas-Rachford formula for the AGE fractional scheme then
takes the form,

    (G1+rI)u^(k+1/4) = (rI - G1 - 2G2 - 2G3 - 2G4)u^(k) + 2b ,    (4.4.9a)
    (G2+rI)u^(k+1/2) = G2 u^(k) + r u^(k+1/4) ,                   (4.4.9b)
    (G3+rI)u^(k+3/4) = G3 u^(k) + r u^(k+1/2) ,                   (4.4.9c)
    (G4+rI)u^(k+1)   = G4 u^(k) + r u^(k+3/4) .                   (4.4.9d)

Then we can obtain the values of u^(k+1/4), u^(k+1/2), u^(k+3/4) and
u^(k+1) respectively from,

    u^(k+1/4) = (G1+rI)^(-1)[(rI - G1 - 2G2 - 2G3 - 2G4)u^(k) + 2b] ,  (4.4.10a)
    u^(k+1/2) = (G2+rI)^(-1)[G2 u^(k) + r u^(k+1/4)] ,                 (4.4.10b)
    u^(k+3/4) = (G3+rI)^(-1)[G3 u^(k) + r u^(k+1/2)] ,                 (4.4.10c)
    u^(k+1)   = (G4+rI)^(-1)[G4 u^(k) + r u^(k+3/4)] .                 (4.4.10d)
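The four fractional sweeps (4.4.10a-d) can be sketched as below. This is an illustration on a small model problem: the splitting and right-hand side are rebuilt inline, and dense solves stand in for the explicit (2 x 2) block inversions used in the thesis.

```python
import numpy as np

# Small model problem: A = G1+G2+G3+G4, with sin(pi x) boundary data.
n = 4
E = np.array([[1.0, -1.0], [-1.0, 1.0]])
H1 = np.zeros((n, n)); H2 = np.zeros((n, n))
for i in range(0, n - 1, 2):
    H1[i:i+2, i:i+2] = E
H2[0, 0] = H2[-1, -1] = 1.0
for i in range(1, n - 2, 2):
    H2[i:i+2, i:i+2] = E
I = np.eye(n)
G1, G2 = np.kron(I, H1), np.kron(I, H2)
G3, G4 = np.kron(H1, I), np.kron(H2, I)
A = G1 + G2 + G3 + G4
x = np.arange(1, n + 1) / (n + 1)
b = np.zeros(n * n); b[:n] = np.sin(np.pi * x); b[-n:] = np.sin(np.pi * x)

r = 1.0
Id = np.eye(n * n)
def sweep(u):
    u14 = np.linalg.solve(G1 + r*Id, (r*Id - G1 - 2*G2 - 2*G3 - 2*G4) @ u + 2*b)
    u12 = np.linalg.solve(G2 + r*Id, G2 @ u + r*u14)
    u34 = np.linalg.solve(G3 + r*Id, G3 @ u + r*u12)
    return np.linalg.solve(G4 + r*Id, G4 @ u + r*u34)

u_exact = np.linalg.solve(A, b)
assert np.allclose(sweep(u_exact), u_exact)   # the solution is a fixed point
u = np.zeros(n * n)
for _ in range(200):
    u = sweep(u)
assert np.linalg.norm(u - u_exact) < np.linalg.norm(u_exact) * 1e-3
print("Douglas-Rachford AGE sweep converges on the model problem")
```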
Since the matrices (Gi+rI), i=1,2,3,4, all consist of (2 x 2) block
submatrices, they are all easily invertible, as shown below. To find
the inverses of the above systems for G1, we let w = 1+r, so that each
block of (G1+rI) is

    [  w  -1 ]                             [ w  1 ]
    [ -1   w ] ,   with inverse   (1/det)  [ 1  w ] ,

where det = (w^2 - 1).
Hence,

    (G1+rI)^(-1) = (1/det) diag( [w 1; 1 w], ..., [w 1; 1 w] ) ,       (4.4.11a)

and,

    (G2+rI)^(-1) = diag( 1/w, (1/det)[w 1; 1 w], ..., (1/det)[w 1; 1 w], 1/w ) .   (4.4.11b)

The inverses (G1+rI)^(-1) and (G2+rI)^(-1) are numbered in the order
1,2,3,...,n x n, in the x direction. Also, (G3+rI)^(-1) is similar to
(4.4.11a) and (G4+rI)^(-1) is similar to (4.4.11b), but they have a
different ordering and are applied in the y direction.
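A small illustrative check of the (2 x 2) block inverse used above:

```python
import numpy as np

# For w = 1 + r, the inverse of [[w, -1], [-1, w]] is
# (1/det) [[w, 1], [1, w]] with det = w**2 - 1.
r = 0.7
w = 1.0 + r
block = np.array([[w, -1.0], [-1.0, w]])
det = w**2 - 1.0
inv_formula = np.array([[w, 1.0], [1.0, w]]) / det
assert np.allclose(inv_formula, np.linalg.inv(block))
print("block inverse formula verified")
```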
We now consider the iterative formulae, equation (4.4.10), at each
of the four intermediate levels:

(i) For the solution at the first intermediate level (the (k+1/4)th
iterate), equation (4.4.10a) is rearranged as,

    u^(k+1/4) = (G1+rI)^(-1)[(rI - (G1+2G2+2G3+2G4))u^(k) + 2b] .

Since the coefficients of u^(k) are all matrices of the same size, we
can add up their elements, paying attention to their direction, to
obtain the single matrix C = G1 + 2G2 + 2G3 + 2G4:
C is the (n^2 x n^2) matrix with the value 7 along its diagonal;
within each mesh row its sub- and super-diagonal entries alternate
between -1 (the pairs belonging to G1) and -2 (the pairs belonging to
2G2), and every coupling to a vertical neighbour, n positions away,
is -2.                                                               (4.4.12)
then,

    u^(k+1/4) = (G1+rI)^(-1)[(rI - C)u^(k) + 2b] .

Now let t = r - 7, where 7 is the value along the diagonal of C.
Then the new coefficient matrix of u^(k) will have the same structure
as C, but will have the value t along the diagonal. Hence,

    u^(k+1/4) = (1/det) diag( [w 1; 1 w], ..., [w 1; 1 w] ) *

        [ t u_1^(k) + u_2^(k) + 2u_{n+1}^(k) + 2 sin πx_1
          u_1^(k) + t u_2^(k) + 2u_3^(k) + 2u_{n+2}^(k) + 2 sin πx_2
          ...
          2u_{n-2}^(k) + t u_{n-1}^(k) + u_n^(k) + 2u_{2n-1}^(k) + 2 sin πx_{n-1}
          u_{n-1}^(k) + t u_n^(k) + 2u_{2n}^(k) + 2 sin πx_n
          ...
          2u_{n^2-n}^(k) + u_{n^2-1}^(k) + t u_{n^2}^(k) + 2 sin πx_n ] ,   (4.4.13a)

where the components for the interior mesh rows follow the same
pattern but without the boundary terms 2 sin πx_i, which arise only in
the first and last mesh rows.
By carrying out these matrix vector operations, we obtain the
values of u^(k+1/4) at the first intermediate level.

(ii) For the solution at the second intermediate level (the (k+1/2)th
iterate), equation (4.4.10b),

    u^(k+1/2) = (G2+rI)^(-1)[G2 u^(k) + r u^(k+1/4)] ,

can be written in matrix form as,
    u^(k+1/2) = (G2+rI)^(-1) *

        [ u_1^(k) + r u_1^(k+1/4)
          u_2^(k) - u_3^(k) + r u_2^(k+1/4)
          -u_2^(k) + u_3^(k) + r u_3^(k+1/4)
          ...
          u_n^(k) + r u_n^(k+1/4)
          ...
          u_{n^2}^(k) + r u_{n^2}^(k+1/4) ] ,                        (4.4.13b)

where within each mesh row the first and last components involve only
the point itself (the unit entries of G2) and the paired components
involve the difference of the two points of the pair. The
multiplication of the above matrix by this vector will give the
values of u^(k+1/2).
(iii) Now for the third intermediate level (the (k+3/4)th iterate).
If we reorder the mesh points columnwise, parallel to the y axis,
equation (4.4.10c) is transformed to,

    u_c^(k+3/4) = (G1+rI)^(-1)[G1 u_c^(k) + r u_c^(k+1/2)] ,

where the suffix c stands for columnwise ordering, in which G3 takes
the structure of G1. Then we have,

    u_c^(k+3/4) = (1/det) diag( [w 1; 1 w], ..., [w 1; 1 w] ) *

        [ u_1^(k) - u_{n+1}^(k) + r u_1^(k+1/2)
          -u_1^(k) + u_{n+1}^(k) + r u_{n+1}^(k+1/2)
          ... ] ,                                                    (4.4.13c)

where each pair couples a point with its vertical neighbour. Hence
the values of u^(k+3/4) will be obtained from the above matrix vector
multiplication.
(iv) At the fourth and final level (the (k+1)th iterate). In a
similar manner, equation (4.4.10d) is transformed to,

    u_c^(k+1) = (G2+rI)^(-1)[G2 u_c^(k) + r u_c^(k+3/4)] .

Then we have,

    u_c^(k+1) = (G2+rI)^(-1) *

        [ u_1^(k) + r u_1^(k+3/4)
          u_{n+1}^(k) - u_{2n+1}^(k) + r u_{n+1}^(k+3/4)
          -u_{n+1}^(k) + u_{2n+1}^(k) + r u_{2n+1}^(k+3/4)
          ...
          u_{n^2}^(k) + r u_{n^2}^(k+3/4) ] ,                        (4.4.13d)

where within each mesh column the first and last points carry only
their own entries and the interior points are paired with a vertical
neighbour; the values of u^(k+1) will then be obtained from the above
matrix vector multiplication.
Hence, the AGE scheme corresponds to sweeping through the mesh
parallel to the coordinate x and y axes, involving at each stage the
solution of (2 x 2) block systems. The AGE iterative procedure is
continued until convergence is reached.

Again, in all these parallel implementations the optimal
iteration parameter r was obtained from the experiments by choosing
the value that gives the best execution time.
Tables 4.4, 4.5 and 4.6 present the results obtained from the
implementation of these strategies. As in the one dimensional
problem, the asynchronous strategy performs better than the other two
strategies, giving much smaller elapsed timings and showing a near
linear speedup. This is due to the absence of any synchronization
points, which implies fewer overheads and no contention for shared
data.

Also, from these three tables it is clear that better efficiency
can be obtained by using the asynchronous approach rather than the
synchronous approach, because the speedup ratios of the asynchronous
implementation are higher than those of the synchronous one.

Figure 4.8 shows the run time results obtained from using the
three parallel AGE strategies when the matrix size of the problem is
equal to 12x12.

In conclusion, we can say from Figure 4.8 that the best results
are obtained when the third strategy of the parallel AGE iterative
method is used to solve the two dimensional boundary value problem.
Matrix   No. of       Elapsed time    r       No. of       Speedup
size     processors   in seconds              iterations

12x12        1           32.992      1.600        11           1
             2           16.849                   11           1.959
             3           11.535                   11           2.860
             4            8.888                   11           3.711
             5            7.650                   11           4.312
             6            6.362                   11           5.185
             7            6.380                   11           5.171
             8            5.122                   11           6.441
             9            5.256                   11           6.277

24x24        1          151.796      1.443        17           1
             2           76.555                   17           1.982
             3           51.596                   17           2.942
             4           39.263                   17           3.866
             5           34.731                   17           4.370
             6           26.879                   17           5.647
             7           23.845                   17           6.365
             8           21.211                   17           7.156
             9           19.877                   17           7.636

TABLE 4.4: The Results from the First Synchronous Strategy
Matrix   No. of       Elapsed time    r       No. of       Speedup
size     processors   in seconds              iterations

36x36        1          288.991      1.037        23           1
             2          148.803                   23           1.942
             3          101.867                   23           2.836
             4           75.747                   23           3.815
             5           62.502                   23           4.623
             6           55.607                   23           5.197
             7           49.118                   23           5.883
             8           42.792                   23           6.753
             9           36.713                   23           7.871

48x48        1          500.095      0.900        27           1
             2          252.324                   27           1.981
             3          172.438                   27           2.900
             4          132.491                   27           3.774
             5          103.213                   27           4.845
             6           93.055                   27           5.374
             7           82.777                   27           6.041
             8           73.408                   27           6.812
             9           64.208                   27           7.788

TABLE 4.4: The Results from the First Synchronous Strategy (continued)
Matrix   No. of       Elapsed time    r       No. of       Speedup
size     processors   in seconds              iterations

12x12        1           38.339      1.600        13           1
             2           19.912                   14           1.925
             3           13.475                   14           2.845
             4           10.549                   14           3.634
             5            9.266                   14           4.137
             6            7.533                   14           5.089
             7            7.217                   14           5.312
             8            6.181                   14           6.202
             9            6.072                   14           6.314

24x24        1          175.358      1.447        21           1
             2           89.104                   23           1.968
             3           60.367                   23           2.904
             4           46.396                   23           3.779
             5           40.191                   23           4.363
             6           31.936                   23           5.490
             7           27.635                   23           6.345
             8           24.744                   23           7.086
             9           25.001                   23           7.623

TABLE 4.5: The Results from the Second Synchronous Strategy
Matrix   No. of       Elapsed time    r       No. of       Speedup
size     processors   in seconds              iterations

36x36        1          335.788      1.033        27           1
             2          174.223                   29           1.927
             3          119.875                   29           2.801
             4           88.776                   29           3.782
             5           73.784                   29           4.550
             6           63.418                   30           5.294
             7           52.647                   30           6.378
             8           50.158                   30           6.694
             9           48.182                   30           6.969

48x48        1          580.361      0.893        33           1
             2          293.139                   33           1.979
             3          201.763                   33           2.876
             4          156.271                   33           3.713
             5          125.217                   33           4.634
             6          109.466                   35           5.301
             7           96.504                   35           6.013
             8           83.092                   35           6.984
             9           76.452                   35           7.591

TABLE 4.5: The Results from the Second Synchronous Strategy (continued)
Matrix   No. of       Elapsed time    r       No. of       Speedup
size     processors   in seconds              iterations

12x12        1           30.699      1.600         9           1
             2           15.602                    9           1.967
             3           10.631                    9           2.887
             4            7.922                    9           3.875
             5            7.131                    9           4.305
             6            6.021                    9           5.098
             7            4.723                    9           6.235
             8            4.103                    9           7.482
             9            3.929                    9           7.813

24x24        1          132.702      1.441        17           1
             2           66.795                   17           1.989
             3           44.069                   17           3.011
             4           34.081                   17           3.893
             5           29.014                   17           4.573
             6           23.568                   17           5.630
             7           20.259                   17           6.550
             8           18.256                   17           7.268
             9           17.235                   17           7.699

TABLE 4.6: The Results from the Third Asynchronous Strategy
Matrix   No. of       Elapsed time    r       No. of       Speedup
size     processors   in seconds              iterations

36x36        1          231.932      1.030        20           1
             2          116.612                   20           1.988
             3           78.355                   20           2.960
             4           58.662                   20           3.953
             5           46.786                   20           4.957
             6           39.045                   20           5.940
             7           33.662                   20           6.890
             8           31.968                   20           7.255
             9           29.037                   20           7.987

48x48        1          493.327      0.903        25           1
             2          245.332                   25           2.010
             3          168.241                   25           2.932
             4          127.493                   25           3.869
             5           98.244                   25           5.021
             6           88.243                   25           5.590
             7           78.763                   25           6.263
             8           69.447                   25           7.103
             9           62.027                   25           7.953

TABLE 4.6: The Results from the Third Asynchronous Strategy (continued)
[Figure: timing results against number of processors for the first, second and third strategies.]

FIGURE 4.8: The Timing Results for the Two-Dimensional Boundary Value Problem.
4.5 THE AGE METHOD FOR SOLVING BOUNDARY VALUE PROBLEMS WITH NEUMANN
BOUNDARY CONDITIONS
In this section, the AGE iterative method as discussed previously
in Section (3.5), is used to solve a boundary value problem with
boundary conditions involving derivative or Neumann boundary conditions.
Here another formulation of the AGE method using the D'Yakonov
[D'Yakonov, 1963] splitting formula is used.
4.5.1 Formulation of the Method
Consider the differential equation,

    -d2U/dx2 + q(x)U = f(x) ,                                    (4.5.1)

over the line segment a<=x<=b, subject to one of the boundary
conditions,

    i.   U(a) = α ,        dU/dx|b = β ,
    ii.  dU/dx|a = α ,     U(b) = β ,                            (4.5.2)
    iii. dU/dx|a = α ,     dU/dx|b = β .

Here, α and β are given real constants, and q(x) and f(x) are given
real continuous functions on a<=x<=b, with q(x)>=0.
For the differential equation (4.5.1), the strategy of the finite
difference method is to replace the above equation by a difference
equation. Hence, we cover the interval a<=x<=b by a uniform mesh of
size,

    h = (b-a)/(n+1) ,

and denote the mesh points of the discrete problem by,

    x_i = a + ih ,   i = 0,1,...,n+1 .                           (4.5.3)

Therefore, the finite difference replacement of equation (4.5.1) can
be obtained (using Table 3.1) and is given by,

    -u_{i-1} + (2 + h^2 q_i)u_i - u_{i+1} = h^2 f(x_i) ,         (4.5.4)

with a truncation error of order O(h^2). In matrix notation (4.5.4)
can be written as,

    Au = b ,                                                     (4.5.5)
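The assembly and solution of (4.5.4)-(4.5.5) can be sketched as follows. This is an illustration only, assuming for simplicity prescribed values at both ends; with q = 0 and f = 2 on [0,1] and U(0) = U(1) = 0, the exact solution U(x) = x(1-x) is reproduced exactly by the second difference.

```python
import numpy as np

# Illustrative assembly of the tridiagonal system (4.5.4) with both end
# values prescribed (a hypothetical simplified case).
def solve_bvp(n, q, f, alpha=0.0, beta=0.0, a=0.0, b=1.0):
    h = (b - a) / (n + 1)
    x = a + h * np.arange(1, n + 1)
    A = (np.diag(2.0 + h**2 * q(x))
         - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1))
    rhs = h**2 * f(x)
    rhs[0] += alpha          # known boundary value folded into b_1
    rhs[-1] += beta          # known boundary value folded into b_n
    return x, np.linalg.solve(A, rhs)

x, u = solve_bvp(9, q=lambda x: 0*x, f=lambda x: 0*x + 2.0)
assert np.allclose(u, x * (1 - x))
print("discrete solution matches U(x) = x(1-x)")
```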
where A and b take the following forms according to the boundary
conditions:

i.  A is the (n x n) tridiagonal matrix

        A = tridiag(a_i, d_i, c_i) ,   d_i = 2 + h^2 q_i ,
        a_i = c_i = -1 ,   i = 1,2,...,n ,   a_1 = c_n = 0 ,      (4.5.6a)

    with

        b_1 = α + h^2 f(x_1) ,   b_i = h^2 f(x_i) , i = 2,3,...,n-1 ,
        b_n = h^2 f(x_n) + 2hβ .

ii. A has the same tridiagonal structure, with the first row modified
    to incorporate the derivative condition at x = a, and         (4.5.6b)

        b_i = h^2 f(x_i) , i = 2,3,...,n-1 ,   b_n = h^2 f(x_n) + β .

iii. A has both its first and last rows modified to incorporate the
     derivative conditions at x = a and x = b.                    (4.5.6c)
Let us proceed with equation (4.5.6b), and follow the approach of
Evans [Evans, 1985], by splitting A into,

    A = G1 + G2 ,                                                (4.5.7)

where, writing hd_i = d_i/2 and taking n even (representing an odd
number of intervals),

    G1 = diag( [hd_1 c_1; a_2 hd_2], [hd_3 c_3; a_4 hd_4], ...,
               [hd_{n-1} c_{n-1}; a_n hd_n] ) ,                  (4.5.8a)

i.e. the (2 x 2) blocks pair the points (1,2),(3,4),...,(n-1,n), and

    G2 = diag( hd_1, [hd_2 c_2; a_3 hd_3], ...,
               [hd_{n-2} c_{n-2}; a_{n-1} hd_{n-1}], hd_n ) ,    (4.5.8b)

i.e. the scalar entries hd_1 and hd_n together with the blocks
pairing the points (2,3),(4,5),...,(n-2,n-1).

By using equation (4.5.7), the matrix equation (4.5.5) can be written
in the form,

    (G1 + G2)u = b .                                             (4.5.9)
Another formulation, similar to (3.6.48b) and with the same accuracy,
can be derived from the AGE method using the D'Yakonov [D'Yakonov, 1963]
splitting formula, which can be written explicitly as,

    u^(k+1/2) = (G1+rI)^(-1)[(G1-rI)(G2-rI)u^(k) + b] ,          (4.5.10a)
    u^(k+1) = (G2+rI)^(-1)[u^(k+1/2)] ,                          (4.5.10b)

where u^(k+1/2) is an intermediate value and r is the iteration
parameter. Hence, from equation (4.5.10), u^(k+1/2) and u^(k+1) are
given by,
    u^(k+1/2) = (G1+rI)^(-1)[(G1-rI)(G2-rI)u^(k) + b] ,          (4.5.11a)

where (G1+rI)^(-1) is the block diagonal matrix whose (2 x 2) blocks
are

    (1/det) [ w_{i+1}  -c_i ; -a_{i+1}  w_i ] ,   i = 1,3,...,n-1 ,

and the components of the bracketed vector are formed from products of
the v_i with u^(k); for example, the last pair of components is

    a_{n-1}v_{n-1}u_{n-2}^(k) + v_{n-1}^2 u_{n-1}^(k) + c_{n-1}v_n u_n^(k) + b_{n-1} ,
    a_n a_{n-1}u_{n-2}^(k) + a_n v_{n-1}u_{n-1}^(k) + v_n^2 u_n^(k) + b_n .

Similarly,

    u^(k+1) = diag( 1/w_1, (1/det)[w_3 -c_2; -a_3 w_2], ..., 1/w_n ) u^(k+1/2) ,   (4.5.11b)

where v_i = hd_i - r, w_i = hd_i + r, i = 1,2,...,n, and
det = w_i w_{i+1} - c_i a_{i+1}.

By carrying out the matrix vector multiplications in equations
(4.5.11a,b) we obtain the values of u^(k+1/2) and u^(k+1).
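The D'Yakonov AGE sweep can be sketched as below on a small system (G1 + G2)u = b. One caveat in this hedged illustration: for the fixed point of the two-step iteration to satisfy (G1 + G2)u = b exactly, the source term must carry a factor 2r, which we include here; the normalisation of b may be absorbed differently elsewhere.

```python
import numpy as np

# Small system with d_i = 2 (q = 0) and the pairing of (4.5.8a,b).
n = 6
d = np.full(n, 2.0)
A = np.diag(d) - np.eye(n, k=1) - np.eye(n, k=-1)
G1 = np.zeros((n, n)); G2 = np.zeros((n, n))
for i in range(0, n - 1, 2):            # G1: blocks pairing (1,2),(3,4),...
    G1[i, i] = G1[i+1, i+1] = d[i] / 2
    G1[i, i+1] = G1[i+1, i] = -1.0
G2[0, 0] = G2[-1, -1] = d[0] / 2        # G2: scalar ends plus blocks (2,3),...
for i in range(1, n - 2, 2):
    G2[i, i] = G2[i+1, i+1] = d[i] / 2
    G2[i, i+1] = G2[i+1, i] = -1.0
assert np.allclose(G1 + G2, A)

r, Id = 1.0, np.eye(n)
b = np.ones(n)
u = np.zeros(n)
for _ in range(500):
    half = np.linalg.solve(G1 + r*Id, (G1 - r*Id) @ (G2 - r*Id) @ u + 2*r*b)
    u = np.linalg.solve(G2 + r*Id, half)
assert np.linalg.norm(A @ u - b) < 1e-4
print("D'Yakonov AGE iteration converged")
```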
4.5.2 Numerical Results
A number of experiments were conducted to demonstrate the
application of the AGE algorithm with the D'Yakonov splitting on
boundary value problems. The iteration parameter r was chosen so as
to provide the most rapid convergence, and the convergence criterion
was taken as ε = 10^-6.
Consider the following problem, taken from Fox [Fox, 1957],

    d2U/dx2 - U = 0 ,   0<=x<=1 ,                                (4.5.12)

subject to the boundary conditions,

    dU/dx|x=0 = -1 ,   U(1) = 0 .                                (4.5.13)

The analytical solution is given by,

    U = Ae^x + Be^-x ,                                           (4.5.14)

where,

    A = -1/(1+e^2)   and   B = e^2/(1+e^2) .
The following tables show the results obtained by implementing
this method for different numbers of points.

n=10, r=0.560

Numerical Sol.   Exact Sol.   Absolute Error   Relative Error   Number of iterations
0.67375 0.67371 0.00004 0.000059 23
0.59144 0.59140 0.00004 0.000067
0.51401 0.51397 0.00004 0.000077
0.44083 0.44080 0.00003 0.000068 .
0.37130 0.37127 0.00003 0.000080
0.30485 0.30482 0.00003 0.000098
0.24090 0.24088 0.00002 0.000083
0.17898 0.17894 0.00004 0.000223
0.11851 0.11848 0.00003 0.000253
0.05901 0.05900 0.00001 0.000169
TABLE 4.7
n=16, r=0.590

Numerical Sol.   Exact Sol.   Absolute Error   Relative Error   Number of iterations
0.70408 0.70405 0.00003 0.000042 28
0.64899 0.64895 0.00004 0.000046
0.59616 0.59610 0.00006 0.000100
0.54533 0.54530 0.00003 0.000055
0.49645 0.49640 0.00005 0.000100
0.44925 0.44921 0.00004 0.000089
0.40363 0.40358 0.00005 0.000123
0.35938 0.35934 0.00004 0.000111
0.31639 0.31635 0.00004 0.000129
0.27448 0.27445 0.00003 0.000109
0.23353 0.23350 0.00003 0.000128
0.19340 0.19336 0.00004 0.000206
0.15392 0.15389 0.00003 0.000194
0.11499 0.11496 0.00003 0.000260
0.07647 0.07642 0.00005 0.000654
0.03816 0.03814 0.00002 0.000524
TABLE 4.8
Again, the same three parallel strategies discussed earlier were
implemented. The timing results from this implementation show no
significant difference from the AGE implementation with the
Peaceman-Rachford formulation (Section 4.3).
[Figure: timing results against number of processors for the first, second and third strategies.]

FIGURE 4.9: The Timing Results for the One-Dimensional Boundary Value Problem with D'Yakonov.
4.6 THE AGE METHOD FOR SOLVING THE STURM-LIOUVILLE PROBLEM

Consider the differential equation,

    d/dx(p(x) dU/dx) - q(x)U + λρU = 0 ,                         (4.6.1)

where p, q and ρ are real functions of x in the interval a<=x<=b,
subject to the boundary conditions,

    U(a) = α ,   U(b) = β ,                                      (4.6.2)

where α and β are given real constants.

The problem is to determine the unknown function U and the unknown
parameter λ using the AGE iterative method.
[Figure: the interval a<=x<=b divided into equal subintervals with mesh points x_0 = a, x_1, ..., x_{n-1}, x_n, x_{n+1} = b.]

FIGURE 4.10

Now, when the interval a<=x<=b is divided up into equal subintervals
of size h (Figure 4.10) and the derivatives are approximated by the
central difference expressions (Table 3.1), we obtain,

    (p_i/h^2)(u_{i-1} - 2u_i + u_{i+1}) - q_i u_i = -λρ_i u_i ,
    i = 1,2,...,n ,   h = (b-a)/(n+1) .                          (4.6.3)
In matrix notation, taking ρ(x) = 1, (4.6.3) can be written in the
form,

    Au = λu ,                                                    (4.6.4)

where A is the (n x n) tridiagonal matrix,

    A = tridiag(a_i, d_i, c_i) ,                                 (4.6.5)

with

    d_i = (2p_i + q_i h^2)/h^2 ,      1<=i<=n ,
    a_i = -(p_i - p'_i h/2)/h^2 ,     2<=i<=n ,                  (4.6.6)
    c_i = -(p_i + p'_i h/2)/h^2 ,     1<=i<=n-1 .
The tridiagonal matrix A is real and diagonally dominant (if
d_i >= |a_i| + |c_i|) with positive diagonal entries, which are ideal
conditions for the iterative methods of solution considered here
[Varga, 1962].
4.6.1 Method of Solution

By splitting the matrix A into the sum of two submatrices,

    A = G1 + G2 ,                                                (4.6.7)

where, with hd_i = d_i/2 and n even,

    G1 = diag( [hd_1 c_1; a_2 hd_2], ...,
               [hd_{n-1} c_{n-1}; a_n hd_n] ) ,                  (4.6.8a)

and

    G2 = diag( hd_1, [hd_2 c_2; a_3 hd_3], ...,
               [hd_{n-2} c_{n-2}; a_{n-1} hd_{n-1}], hd_n ) .    (4.6.8b)
By using (4.6.7), the matrix equation (4.6.4) can now be written
in the form,

    (G1 + G2)u = λu ,                                            (4.6.9)

and by following a strategy similar to the ADI method [Peaceman, 1955],
u^(k+1/2) and u^(k+1) can be determined implicitly by,

    (G1+rI)u^(k+1/2) = λu^(k) - (G2-rI)u^(k) ,
    (G2+rI)u^(k+1) = λu^(k+1/2) - (G1-rI)u^(k+1/2) ,             (4.6.10)

or explicitly by,

    u^(k+1/2) = (G1+rI)^(-1)[λu^(k) - (G2-rI)u^(k)] ,
    u^(k+1) = (G2+rI)^(-1)[λu^(k+1/2) - (G1-rI)u^(k+1/2)] ,      (4.6.11)

where r is the iteration parameter, given in terms of μ and ν, the
minimum and maximum eigenvalues of the submatrices of G1 and G2.
From equation (4.6.11), u^(k+1/2) and u^(k+1) are given by,

    u^(k+1/2) = (1/det) diag( [w_2 -c_1; -a_2 w_1], [w_4 -c_3; -a_4 w_3], ... ) *

        [ λu_1^(k) - v_1 u_1^(k)
          λu_2^(k) - v_2 u_2^(k) - c_2 u_3^(k)
          λu_3^(k) - a_3 u_2^(k) - v_3 u_3^(k)
          ...
          λu_{n-1}^(k) - a_{n-1}u_{n-2}^(k) - v_{n-1}u_{n-1}^(k)
          λu_n^(k) - v_n u_n^(k) ] ,                             (4.6.12a)

and

    u^(k+1) = diag( 1/w_1, (1/det)[w_3 -c_2; -a_3 w_2], ..., 1/w_n ) *

        [ λu_1^(k+1/2) - v_1 u_1^(k+1/2) - c_1 u_2^(k+1/2)
          ...
          λu_{n-1}^(k+1/2) - v_{n-1}u_{n-1}^(k+1/2) - c_{n-1}u_n^(k+1/2)
          λu_n^(k+1/2) - a_n u_{n-1}^(k+1/2) - v_n u_n^(k+1/2) ] ,   (4.6.12b)

where v_i = hd_i - r, w_i = hd_i + r, and det = w_i w_{i+1} - c_i a_{i+1}.
Hence, the numerical procedure is as follows. Given an initial
guess eigenvalue λ^(0) and an initial eigenvector u^(0), we determine
a new solution u^(1) by using the AGE algorithm. To determine a new
value λ^(1), we use Rayleigh's quotient [Fox, 1957],

    λ^(1) = (u^(1) . A u^(0)) / (u^(1) . u^(0)) .                (4.6.13)

We then continue until the procedure gives no further change in the
value of λ. This procedure is only suitable for deriving the smallest
eigenvalue λ, otherwise the diagonal dominance of A will be lost and
the AGE algorithm will diverge.
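A small illustrative check (not from the thesis) that an exact eigenpair is a fixed point of the two-sweep formula (4.6.11), using the simple test matrix A = tridiag(-1, 2, -1), whose smallest eigenpair is known in closed form, together with the splitting (4.6.8a,b):

```python
import numpy as np

n = 4
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
G1 = np.zeros((n, n)); G2 = np.zeros((n, n))
for i in range(0, n - 1, 2):
    G1[i:i+2, i:i+2] = [[1.0, -1.0], [-1.0, 1.0]]
G2[0, 0] = G2[-1, -1] = 1.0
for i in range(1, n - 2, 2):
    G2[i:i+2, i:i+2] = [[1.0, -1.0], [-1.0, 1.0]]
assert np.allclose(G1 + G2, A)

# Known smallest eigenpair: u_j = sin(j pi/(n+1)), lam = 2 - 2 cos(pi/(n+1)).
j = np.arange(1, n + 1)
u = np.sin(j * np.pi / (n + 1))
lam = 2 - 2*np.cos(np.pi / (n + 1))
r, Id = 1.0, np.eye(n)
half = np.linalg.solve(G1 + r*Id, lam*u - (G2 - r*Id) @ u)
new = np.linalg.solve(G2 + r*Id, lam*half - (G1 - r*Id) @ half)
assert np.allclose(new, u)                 # the eigenpair is a fixed point
rq = (new @ (A @ u)) / (new @ u)           # Rayleigh quotient (4.6.13)
assert abs(rq - lam) < 1e-12
print("fixed point and Rayleigh quotient verified")
```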
4.6.2 Numerical Results

In order to explore this method, we present here some numerical
experiments. Consider the problem of Mathieu's equation,

    d2U/dx2 + (λ - 2q cos 2x)U = 0 ,                             (4.6.14)

where q is a constant and λ the eigenvalue, subject to the boundary
conditions,

    U(0) = 0 ,   U(2π) = 0 .                                     (4.6.15)

The following tables show the results obtained by solving this
problem by the AGE method.
n=9, ε=0.1*10^-5, r=1.500, q=1
λ=0.890461, Number of Iterations = 17

The Solution Vector   (Au)_i     λu_i
0.133369              0.11877    0.11876
0.257145              0.22900    0.22898
0.359889              0.32047    0.32047
0.428808              0.38183    0.38184
0.453195              0.40354    0.40355
0.428808              0.38182    0.38184
0.359889              0.32048    0.32047
0.257145              0.22900    0.22898
0.133369              0.11877    0.11876

TABLE 4.9
n=13, ε=0.1*10^-5, r=1.500, q=1
λ=0.894524, Number of iterations = 22

The Solution Vector   (Au)_i     λu_i
0.081095              0.07269    0.07254
0.159266              0.14243    0.14247
0.231265              0.20710    0.20687
0.293353              0.26238    0.26241
0.341572              0.30557    0.30554
0.372259              0.33288    0.33299
0.383806              0.34232    0.34243
0.372260              0.33292    0.33300
0.341571              0.30548    0.30554
0.293356              0.26259    0.26241
0.231261              0.20682    0.20687
0.159269              0.14272    0.14247
0.081091              0.07246    0.07254

TABLE 4.10
4.7 CONCLUSIONS

This chapter included the study of synchronous and asynchronous
AGE iterative methods, where the mesh of points of the problem to be
solved was partitioned into n processes. The implemented parallel AGE
methods have been used to solve one and two dimensional boundary value
problems, using three parallel strategies.

In both the one and two dimensional problems, the third strategy
showed a clear improvement over the other two, since in the other
two strategies we need to synchronize the processes at the end of each
iteration step, which in turn degrades the performance of the
algorithms.

A comparison between the first and second strategies shows that
better results are obtained when using the first strategy. This is due
to the total number of computational operations in the second strategy
being higher than that of the first strategy, and also because of the
reorderings needed in the second strategy.

From the experimental results it can be seen that the shared data
overhead in the third strategy implementation is less than that of the
first and second strategy implementations.
When solving a boundary value problem involving derivatives at its
boundaries by the AGE algorithm, we noticed from the experimental
results that the absolute errors and the number of iterations are
smaller than for many other iterative methods, whilst the
implementation of the three parallel strategies gives almost the same
speedups.

Finally, it can be seen that the parallel AGE method is well
suited to parallel implementation on an MIMD computer, as shown by the
almost linear speedups obtained from the implementations.

In finding the eigenvalues and the corresponding eigenvectors of
the Sturm-Liouville problem by the AGE method, we noticed
experimentally that the results obtained by applying this method
compare favourably with many other methods.
CHAPTER 5

TIME DEPENDENT PROBLEMS:
PARALLEL EXPLORATIONS

We look before and after
and pine for what is not.
                    Shelley.
5.1 INTRODUCTION

The parallel AGE iterative method was developed and implemented to
solve one and two dimensional boundary value problems in Chapter 4.
In this chapter, we will implement the same three parallel strategies
presented earlier in Section 4.2 to solve one and two dimensional
parabolic equations, as well as the one dimensional wave equation.

The parallel AGE method was developed and implemented on the
Balance 8000 system using its available 5 processors, in both
synchronous and asynchronous versions of the algorithm. The results
from these implementations are compared, and a performance analysis
of the best method is presented.

The AGE iterative method with the D'Yakonov splitting formula for
solving the one and two dimensional diffusion equation is also
presented in this chapter. This strategy yields an algorithm with
reduced computational complexity and greater accuracy for
multidimensional problems. The same three parallel strategies were
also implemented for this problem formulation.

A new method for solving the (n x n) complex tridiagonal matrices
derived from the finite difference/element discretization of the
Schrödinger equation is also presented and confirmed by numerical
experiments.
5.2 EXPERIMENTAL RESULTS FOR THE DIFFUSION-CONVECTION EQUATION

This experiment involves the parallel implementation of the AGE
method for solving the diffusion-convection equation. Consider the
following problem,

    dU/dt = ε d2U/dx2 - p dU/dx ,   0<x<1 ,   t>=0 ,             (5.2.1)

with the initial condition,

    U(x,0) = 0 ,   0<=x<=1 ,                                     (5.2.2)

and the Dirichlet boundary conditions,

    U(0,t) = 0 ,   U(1,t) = 1 ,   t>=0 .                         (5.2.3)

In this example the coefficients ε and p both assume the value 1.
The exact solution is given by,

    U(x,t) = (e^(px/ε) - 1)/(e^(p/ε) - 1)
             + 2 Σ_{n=1}^∞ [ (-1)^n nπ / ((nπ)^2 + (p/2ε)^2) ]
               e^(p(x-1)/2ε) sin(nπx) e^(-[(nπ)^2 ε + p^2/4ε]t) .   (5.2.4)
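The series solution (5.2.4) can be evaluated numerically as an illustrative check. The truncation at N terms is an assumption of this sketch; for large t the series vanishes and U approaches the steady state (e^(px/ε)-1)/(e^(p/ε)-1).

```python
import math

# Truncated evaluation of the series solution (5.2.4) with eps = p = 1.
def U(x, t, eps=1.0, p=1.0, N=2000):
    steady = (math.exp(p*x/eps) - 1.0) / (math.exp(p/eps) - 1.0)
    s = 0.0
    for n in range(1, N + 1):
        npi = n * math.pi
        coeff = ((-1)**n) * npi / (npi**2 + (p/(2*eps))**2)
        s += coeff * math.sin(npi * x) * math.exp(-(npi**2 * eps + p*p/(4*eps)) * t)
    return steady + 2.0 * math.exp(p*(x - 1.0)/(2*eps)) * s

assert abs(U(0.0, 0.5)) < 1e-12                    # U(0,t) = 0
assert abs(U(1.0, 0.5) - 1.0) < 1e-12              # U(1,t) = 1
steady = (math.exp(0.5) - 1.0) / (math.e - 1.0)
assert abs(U(0.5, 10.0) - steady) < 1e-10          # transient has died away
assert abs(U(0.25, 0.0)) < 0.05                    # initial condition U(x,0) = 0
print("series solution consistent with (5.2.2)-(5.2.3)")
```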
A uniformly-spaced network whose mesh points are x_i = ih,
t_j = jk, for i = 0,1,2,...,n+1 and j = 0,1,2,...,m+1, is used, with
h = 1/(n+1), k = T/(m+1) and λ = k/h^2, the mesh ratio. The real
line 0<=x<=1 is thus divided into the points x_0 = 0, x_1, ...,
x_n, x_{n+1} = 1.
At the point P(x_i, t_{j+1/2}) the derivatives in equation (5.2.1)
are approximated by the finite difference discretizations given in
Table 3.1. Thus we have,

    -1/2(E+K)u_{i-1,j+1} + (1+E)u_{i,j+1} - 1/2(E-K)u_{i+1,j+1}
        = 1/2(E+K)u_{i-1,j} + (1-E)u_{i,j} + 1/2(E-K)u_{i+1,j} ,   (5.2.5)

which is the Crank-Nicolson formula with O(h^2,k^2) accuracy, where
E = ελ and K = 1/2 pλh. When all the points 1<=i<=n within a line are
considered, equation (5.2.5) generates a tridiagonal system of linear
equations of the form,

    A u = b ,   A = tridiag(a, d, c) ,                           (5.2.6a,b)

where a = -1/2(E+K), d = (1+E) and c = -1/2(E-K). Also,

    b_1 = 1/2(E+K)u_{0,j} + 1/2(E+K)u_{0,j+1} + (1-E)u_{1,j} + 1/2(E-K)u_{2,j} ,
    b_i = 1/2(E+K)u_{i-1,j} + (1-E)u_{i,j} + 1/2(E-K)u_{i+1,j} ,   i = 2,3,...,n-1 ,
    b_n = 1/2(E+K)u_{n-1,j} + (1-E)u_{n,j} + 1/2(E-K)u_{n+1,j} + 1/2(E-K)u_{n+1,j+1} .
Let us now assume that we have an even number of intervals
(corresponding to an odd number of internal points, i.e. n odd) on
the real line 0<=x<=1. We can then perform the following splitting of
the coefficient matrix A,

    A = G1 + G2 ,                                                (5.2.7)

where, with hd = d/2,

    G1 = diag( hd, [hd c; a hd], ..., [hd c; a hd] ) ,           (5.2.8a)

i.e. the scalar entry hd at the first point followed by the (2 x 2)
blocks pairing the points (2,3),(4,5),...,(n-1,n), and

    G2 = diag( [hd c; a hd], ..., [hd c; a hd], hd ) ,           (5.2.8b)

i.e. the blocks pairing the points (1,2),(3,4),...,(n-2,n-1) followed
by the scalar entry hd at the last point.
Hence, the AGE iterative method with the Peaceman-Rachford formula
can be applied to determine u^(k+1/2) and u^(k+1) explicitly by,

    u^(k+1/2) = (G1+rI)^(-1)[(rI-G2)u^(k) + b] ,
    u^(k+1) = (G2+rI)^(-1)[(rI-G1)u^(k+1/2) + b] ,               (5.2.9)

where, with w = hd + r and v = r - hd,

    (G1+rI) = diag( w, [w c; a w], ..., [w c; a w] ) ,           (5.2.10a)
    (G2+rI) = diag( [w c; a w], ..., [w c; a w], w ) ,           (5.2.10b)
    (rI-G1) = diag( v, [v -c; -a v], ..., [v -c; -a v] ) ,       (5.2.10c)
    (rI-G2) = diag( [v -c; -a v], ..., [v -c; -a v], v ) .       (5.2.10d)

It is clear that (G1+rI) and (G2+rI) are block diagonal matrices.
All the diagonal elements except the first (or the last for (G2+rI))
are (2 x 2) submatrices. Therefore, (G1+rI) and (G2+rI) can be easily
inverted by merely inverting their (2 x 2) block diagonal entries.
Then, from equation (5.2.9), u^(k+1/2) and u^(k+1) are given by,

    u^(k+1/2) = diag( 1/w, (1/det)[w -c; -a w], ... ) *

        [ v u_1^(k) - c u_2^(k) + b_1
          -a u_1^(k) + v u_2^(k) + b_2
          v u_3^(k) - c u_4^(k) + b_3
          ...
          v u_{n-2}^(k) - c u_{n-1}^(k) + b_{n-2}
          -a u_{n-2}^(k) + v u_{n-1}^(k) + b_{n-1}
          v u_n^(k) + b_n ] ,                                    (5.2.11a)

and

    u^(k+1) = diag( (1/det)[w -c; -a w], ..., 1/w ) *

        [ v u_1^(k+1/2) + b_1
          v u_2^(k+1/2) - c u_3^(k+1/2) + b_2
          -a u_2^(k+1/2) + v u_3^(k+1/2) + b_3
          ...
          v u_{n-1}^(k+1/2) - c u_n^(k+1/2) + b_{n-1}
          -a u_{n-1}^(k+1/2) + v u_n^(k+1/2) + b_n ] ,           (5.2.11b)

where det = w^2 - ac.
The corresponding explicit expressions for the AGE equations are
obtained by carrying out the multiplications in (5.2.11a,b). Thus
we have,

(i) At level (k+1/2),

    u_1^(k+1/2) = (v u_1^(k) - c u_2^(k) + b_1)/w ,
    u_i^(k+1/2) = (A u_{i-1}^(k) + B u_i^(k) + C u_{i+1}^(k) + D u_{i+2}^(k) + E_i)/det ,
    u_{i+1}^(k+1/2) = (A~ u_{i-1}^(k) + B~ u_i^(k) + C~ u_{i+1}^(k) + D~ u_{i+2}^(k) + E~_i)/det ,
        i = 2,4,...,n-1 ,

where,

    A = -aw , B = vw , C = -cv , D = { 0 for i=n-1 ; c^2 otherwise } ,
    E_i = w b_i - c b_{i+1} ,

and

    A~ = a^2 , B~ = -av , C~ = vw , D~ = { 0 for i=n-1 ; -cw otherwise } ,
    E~_i = w b_{i+1} - a b_i ,

with the following computational molecules (Figure 5.1).

[Figure: computational molecules relating the points i-1, i, i+1, i+2 at level k to the pair (i, i+1) at level k+1/2.]

FIGURE 5.1: The AGE Method At Level (k+1/2)
(ii) At level (k+1)

For i = 1,3,...,n-2,

    u_i^(k+1)     = ( -P u_{i-1}^(k+1/2) + Q u_i^(k+1/2) - R u_{i+1}^(k+1/2) + S u_{i+2}^(k+1/2) + T_i ) / det ,
    u_{i+1}^(k+1) = (  P' u_{i-1}^(k+1/2) - Q' u_i^(k+1/2) + R' u_{i+1}^(k+1/2) - S' u_{i+2}^(k+1/2) + T'_i ) / det ,

and
    u_n^(k+1) = ( -a u_{n-1}^(k+1/2) + v u_n^(k+1/2) + b_n ) / w ,

where,
    P  = { 0 for i=1 ; aw otherwise } ,  Q  = vw , R  = cv , S  = c^2 , T_i  = w b_i - c b_{i+1} ,
    P' = { 0 for i=1 ; a^2 otherwise } , Q' = av , R' = vw , S' = cw ,  T'_i = w b_{i+1} - a b_i ,

with its computational molecules given by Figure 5.2.

FIGURE 5.2: The AGE Method At Level (k+1). [Diagram omitted.]
The computational complexity per iteration per point is 42 multiplications + 28 additions, while usually only two iterations are required for convergence for a small time step k.
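The two sweeps of (5.2.9) can be sketched with dense matrices; the function name, the dense solves (which stand in for the explicit 2x2 inverses of (5.2.11a,b)) and the test system are illustrative only:

```python
import numpy as np

def age_peaceman_rachford(a, d, c, b, r, tol=1e-8, max_iter=300):
    """Solve the constant tridiagonal system Au = b by the AGE method
    with the Peaceman-Rachford formula (5.2.9).  G1 keeps the first
    diagonal entry as a scalar and G2 the last, as in (5.2.8a,b);
    n is assumed odd so the pairing works out."""
    n = len(b)
    hd = d / 2.0
    G1 = np.diag([hd] * n)
    G2 = np.diag([hd] * n)
    for i in range(1, n - 1, 2):      # G1 couples the point pairs (2,3),(4,5),...
        G1[i, i + 1], G1[i + 1, i] = c, a
    for i in range(0, n - 1, 2):      # G2 couples the point pairs (1,2),(3,4),...
        G2[i, i + 1], G2[i + 1, i] = c, a
    I = np.eye(n)
    u = np.zeros(n)
    for k in range(1, max_iter + 1):
        u_half = np.linalg.solve(G1 + r * I, (r * I - G2) @ u + b)
        u_next = np.linalg.solve(G2 + r * I, (r * I - G1) @ u_half + b)
        if np.max(np.abs(u_next - u)) < tol:
            return u_next, k
        u = u_next
    return u, max_iter
```

For a diagonally dominant example such as a = c = -1, d = 4, the split matrices are symmetric positive definite and the iteration converges for any fixed r > 0.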
In order to compare the three parallel strategies, we list the timing results with the speedup ratios of the three strategies in Tables 5.1, 5.2 and 5.3, where different mesh sizes are chosen. From these tables, we notice that the asynchronous strategy (Table 5.3) achieves better speedup ratios than the other strategies. These speedups are mostly linear, i.e., of order p, where p is the number of processors in use. However, if we consider the timing results in Tables 5.1 and 5.2, we see that the algorithm of the first strategy requires less time.
The difference in the running times for the three strategies is due to the fact that in the first two strategies we need to synchronise the processors, while in the third strategy each processor works on its subtasks without waiting for the other processors. Also, the difference in the running times between the first two strategies is due to the reordering needed in the second strategy.
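The speedup figures quoted in Tables 5.1-5.3 are simply the one-processor time divided by the p-processor time; for instance, recomputing the first block of Table 5.1 (the printed table truncates rather than rounds the last digit):

```python
# Times in seconds for 1..5 processors, first block of Table 5.1.
times = [10.330, 5.695, 4.096, 3.296, 2.556]
speedups = [round(times[0] / t, 3) for t in times]
print(speedups)   # [1.0, 1.814, 2.522, 3.134, 4.041]
```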
The timing results of the three versions are represented in
Figure 5.3.
k=0.001, t=0.1, λ=0.1, ε=5.0x10^-7 ; No. of points = 9 ; r = 0.5

   No. of processors    Time (second)    Speedup
          1                10.330          1
          2                 5.695          1.813
          3                 4.096          2.521
          4                 3.296          3.134
          5                 2.556          4.041

k=0.005, t=0.5, λ=0.7, ε=5.0x10^-7 ; No. of points = 11 ; r = 0.5

   No. of processors    Time (second)    Speedup
          1                12.512          1
          2                 6.890          1.815
          3                 5.995          2.087
          4                 4.339          2.883
          5                 3.229          3.874

k=0.005, t=1.0, λ=1, ε=5.0x10^-7 ; No. of points = 13 ; r = 0.5

   No. of processors    Time (second)    Speedup
          1                14.791          1
          2                 8.968          1.649
          3                 7.348          2.012
          4                 6.352          2.328
          5                 5.181          2.854

TABLE 5.1: The Results From Implementing The First Strategy.
k=0.001, t=0.1, λ=0.1, ε=5.0x10^-7 ; No. of points = 9 ; r = 0.5

   No. of processors    Time (second)    Speedup
          1                11.117          1
          2                 6.146          1.808
          3                 4.454          2.495
          4                 3.771          2.948
          5                 2.993          3.714

k=0.005, t=0.5, λ=0.8, ε=5.0x10^-7 ; No. of points = 11 ; r = 0.5

   No. of processors    Time (second)    Speedup
          1                13.364          1
          2                 7.321          1.825
          3                 6.562          2.036
          4                 4.511          2.962
          5                 4.068          3.285

k=0.005, t=1.0, λ=1, ε=5.0x10^-7 ; No. of points = 13 ; r = 0.5

   No. of processors    Time (second)    Speedup
          1                15.821          1
          2                 9.479          1.669
          3                 7.993          1.979
          4                 6.897          2.293
          5                 5.433          2.912

TABLE 5.2: The Results From Implementing The Second Strategy.
k=0.001, t=0.1, λ=0.1, ε=5.0x10^-7 ; No. of points = 9 ; r = 0.5

   No. of processors    Time (second)    Speedup
          1                 9.603          1
          2                 5.247          1.830
          3                 3.779          2.541
          4                 2.833          3.389
          5                 2.297          4.180

k=0.005, t=0.5, λ=0.8, ε=5.0x10^-7 ; No. of points = 11 ; r = 0.5

   No. of processors    Time (second)    Speedup
          1                11.480          1
          2                 6.152          1.866
          3                 4.977          2.306
          4                 3.788          3.030
          5                 2.993          3.835

k=0.005, t=1.0, λ=1, ε=5.0x10^-7 ; No. of points = 13 ; r = 0.5

   No. of processors    Time (second)    Speedup
          1                13.544          1
          2                 7.971          1.699
          3                 6.804          1.990
          4                 5.707          2.373
          5                 4.669          2.900

TABLE 5.3: The Results From Implementing The Third Strategy.
200
o 2 3
No. of proc8. 4
 First Strategy ... Second Strategy ... Third Strategy
h:r13
FIGURE 5.3:The Timing Results for the Diffusion Convection Equation.
201
6
202
5.3 EXPERIMENTAL RESULTS FOR THE TWO-DIMENSIONAL PARABOLIC PROBLEM

Now consider the two-dimensional heat equation,

    ∂U/∂t = ∂²U/∂x² + ∂²U/∂y² + H(x,y,t) ,  0 <= x,y <= 1 , t <= T ,      (5.3.1)

with H(x,y,t) = sin x sin y e^-t - 4, where the theoretical solution is given by,

    U(x,y,t) = sin x sin y e^-t + x² + y² ,  0 <= x,y <= 1 , t >= 0 .      (5.3.2)

The initial and boundary conditions are defined so as to agree with the exact solution. At the point P(x_i,y_j,t_r) in the solution domain, the value of U(x,y,t) is denoted by u_{i,j,r}, where x_i = ih, y_j = jh for 0 <= i,j <= n+1 and h = 1/(n+1). The increment k in the time t is chosen such that t_r = rk for r = 0,1,2,.... The mesh ratio is defined as λ = k/h².

A weighted finite difference approximation to (5.3.1) at the point (i,j,r+1/2) (Table 3.1) leads to the five-point formula,

    -λθ u_{i-1,j,r+1} + (1+4λθ) u_{i,j,r+1} - λθ u_{i+1,j,r+1} - λθ u_{i,j-1,r+1} - λθ u_{i,j+1,r+1}
      = λ(1-θ) u_{i-1,j,r} + (1-4λ(1-θ)) u_{i,j,r} + λ(1-θ) u_{i+1,j,r} + λ(1-θ) u_{i,j-1,r}
        + λ(1-θ) u_{i,j+1,r} + k H_{i,j,r+1/2} ,  for i,j = 1,2,...,n.     (5.3.3)

We notice that when θ takes the values 0, 1/2 and 1, we obtain the classical explicit, the Crank-Nicolson and the fully implicit schemes, whose truncation errors are O(h²,k), O(h²,k²) and O(h²,k) respectively.
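The structure of the matrices that result for θ = 1/2 can be sketched directly: with the mesh points ordered rowwise, A and B share the five-point adjacency pattern and differ only in their diagonal and neighbouring weights. The helper below is an illustrative construction, not the implementation used in the experiments:

```python
import numpy as np

def five_point_adjacency(n):
    """0/1 adjacency of the five-point stencil on an n x n grid,
    with the mesh points ordered rowwise."""
    N = n * n
    adj = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            p = i * n + j
            for (ii, jj) in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= ii < n and 0 <= jj < n:
                    adj[p, ii * n + jj] = 1.0
    return adj

def crank_nicolson_matrices(n, lam):
    """A and B of the compact matrix form for theta = 1/2:
    diagonals 1 + 2*lam and 1 - 2*lam, neighbouring entries
    -lam/2 and +lam/2 respectively."""
    adj = five_point_adjacency(n)
    A = (1 + 2 * lam) * np.eye(n * n) - 0.5 * lam * adj
    B = (1 - 2 * lam) * np.eye(n * n) + 0.5 * lam * adj
    return A, B
```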
Let us proceed with the Crank-Nicolson scheme, since it is more accurate and unconditionally stable. Hence, the weighted finite difference equation (5.3.3) can be expressed in the more compact matrix form,

    A u^(k+1) = B u^(k) + b + g ,                                   (5.3.4)

where A is the (n² x n²) block tridiagonal matrix of the five-point molecule, with d = 1+2λ on the diagonal and -λ/2 in the four positions coupling each mesh point to its neighbours,                        (5.3.5a)

and B has the same block tridiagonal structure, with e = 1-2λ on the diagonal and +λ/2 in the neighbouring positions.                       (5.3.5b)
The vector b consists of the boundary values: the first block (j=1) collects the weighted boundary values along y=0 together with the adjacent corner terms, the interior blocks are,

    b_j = ( (λ/2)(u_{0,j,r} + u_{0,j,r+1}) , 0 , ... , 0 , (λ/2)(u_{n+1,j,r} + u_{n+1,j,r+1}) ) ,  for j=2,3,...,n-1 ,

and the last block (j=n) likewise collects the weighted boundary values along y=1. The vector g contains the source term of (5.3.3), given by,

    g_j = k ( H_{1,j,r+1/2} , H_{2,j,r+1/2} , ... , H_{n,j,r+1/2} ) ,  for j=1,2,...,n.
We observe from (5.3.5) that A has a block structure which can be split. If we split A into the sum of its constituent symmetric and positive definite matrices G1, G2, G3 and G4, we have,

    A = G1 + G2 + G3 + G4 ,                                         (5.3.6)

where diag(G_i) = (1/4) diag(A) for each i. With the mesh points ordered rowwise (along the x-direction), G1 and G2 take the same block diagonal forms as in Section 5.2, acting within each grid row:

    G1 = diag( hd , [hd c; a hd] , ... , [hd c; a hd] ) ,           (5.3.7a)
    G2 = diag( [hd c; a hd] , ... , [hd c; a hd] , hd ) .           (5.3.7b)
Also, by reordering the points columnwise, i.e. along the y-direction, we find that G3 and G4 have the same structure as G1 and G2 respectively, i.e.,

    G3 = diag( hd , [hd c; a hd] , ... , [hd c; a hd] )  (columnwise ordering) ,   (5.3.7c)
    G4 = diag( [hd c; a hd] , ... , [hd c; a hd] , hd )  (columnwise ordering) ,   (5.3.7d)

with hd = d/4.
The Douglas-Rachford formula for the AGE fractional scheme then has the form,

    (G1+rI) u^(k+1/4) = (rI - G1 - 2G2 - 2G3 - 2G4) u^(k) + 2B u^(k) + 2b + 2g ,
    (G2+rI) u^(k+1/2) = G2 u^(k) + r u^(k+1/4) ,
    (G3+rI) u^(k+3/4) = G3 u^(k) + r u^(k+1/2) ,
    (G4+rI) u^(k+1)   = G4 u^(k) + r u^(k+3/4) .                    (5.3.8)
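The four sweeps of (5.3.8) can be organised as successive solves with the block diagonal factors. The following is a schematic sketch with dense solves standing in for the explicit (2x2) block inverses; the function name is illustrative, and the split matrices are assumed to be assembled already:

```python
import numpy as np

def douglas_rachford_age_step(G, Bm, b, g, u, r):
    """One Douglas-Rachford AGE iteration of (5.3.8) for the system
    A u = Bm u + b + g with A = G1 + G2 + G3 + G4; G = (G1, G2, G3, G4)."""
    G1, G2, G3, G4 = G
    I = np.eye(len(u))
    rhs = (r * I - G1 - 2 * (G2 + G3 + G4)) @ u + 2 * (Bm @ u + b + g)
    v = np.linalg.solve(G1 + r * I, rhs)       # level k+1/4
    for Gi in (G2, G3, G4):                    # levels k+1/2, k+3/4, k+1
        v = np.linalg.solve(Gi + r * I, Gi @ u + r * v)
    return v
```

Setting u^(k+1) = u^(k) in (5.3.8) shows that a fixed point of this sweep satisfies A u = B u + b + g, as required.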
In a similar manner to equation (4.4.9) (Section 4.4), we now examine the above iterative formula at each of the four intermediate levels:

(i) At the first intermediate level (the (k+1/4)th step)

Using the expression (5.3.8a), we obtain,

    u^(k+1/4) = (G1+rI)^-1 [ ((rI+G1) - 2A) u^(k) + 2B u^(k) + 2(b+g) ] ,    (5.3.9)

where, within each grid row,

    (G1+rI) = diag( w , [w c; a w] , ... , [w c; a w] ) ,                    (5.3.10)

with w = r+hd.
By letting C = (rI+G1) - 2A, we obtain a block tridiagonal matrix with the same five-point structure as A: its diagonal entries are w-2d, and the entries coupling neighbouring points are -2a, -a and -2c, -c according to whether the pair lies outside or inside a (2x2) block of G1.

Therefore, u^(k+1/4) is obtained by applying the block diagonal inverse (G1+rI)^-1 (with det = w^2 - ac for each (2x2) block and 1/w for the leading scalar) to the vector

    C u^(k) + 2B u^(k) + 2(b+g) .

By carrying out these matrix-vector multiplications, we obtain the values of u^(k+1/4) at the first intermediate level.
(ii) At the second intermediate level (the (k+1/2)th step)

From equation (5.3.8b), we have,

    u^(k+1/2) = (G2+rI)^-1 [ G2 u^(k) + r u^(k+1/4) ] .             (5.3.11)

Hence, equation (5.3.11) yields u^(k+1/2): the block diagonal inverse (G2+rI)^-1 = (1/det) diag( [w -c; -a w] , ... , [w -c; -a w] , det/w ) is multiplied by the vector G2 u^(k) + r u^(k+1/4). The multiplication of these matrix-vector operands gives the values of u^(k+1/2).
(iii) At the third intermediate level (the (k+3/4)th step)

By reordering the mesh points columnwise, parallel to the y-axis, we find that

    G3 u = G1 u_(c)  and  G4 u = G2 u_(c) ,                         (5.3.12)

where the suffix (c) stands for a columnwise ordering of the mesh points. Therefore, equation (5.3.8c) can be transformed to give,

    u_(c)^(k+3/4) = (G1+rI)^-1 [ G1 u_(c)^(k) + r u_(c)^(k+1/2) ] ,   (5.3.13)

and the values of u^(k+3/4) are obtained from the multiplication of the corresponding matrix-vector operands.
(iv) At the fourth (and final) intermediate level (the (k+1)th step)

Similarly, by reordering the mesh points columnwise parallel to the y-axis we have,

    G3 u = G1 u_(c) ,  G4 u = G2 u_(c) .                            (5.3.14)

Then, the last equation of (5.3.8) is transformed to give,

    u_(c)^(k+1) = (G2+rI)^-1 [ G2 u_(c)^(k) + r u_(c)^(k+3/4) ] ,   (5.3.15)

and the values of u^(k+1) are obtained from multiplying the corresponding matrix-vector operands.
Thus, the AGE scheme corresponds to sweeping through the mesh parallel to the coordinate x and y axes. The iterative procedure is continued until the requirement |u_{i,j}^(k+1) - u_{i,j}^(k)| <= ε is satisfied, where ε is the convergence criterion.
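The stopping test can be wrapped around any complete AGE sweep; a generic sketch, where the function step is a stand-in for one full iteration:

```python
import numpy as np

def iterate_to_convergence(step, u0, eps=1e-7, max_iter=500):
    """Repeat an AGE sweep until max |u^(k+1) - u^(k)| <= eps."""
    u = np.asarray(u0, dtype=float)
    for k in range(1, max_iter + 1):
        u_new = step(u)
        if np.max(np.abs(u_new - u)) <= eps:
            return u_new, k
        u = u_new
    return u, max_iter
```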
Now, we concentrate our attention on the implementation of this problem on the parallel Balance 8000 system. We restrict our discussion in this section to the first and the third strategies, in order to compare the synchronous and asynchronous approaches.
Tables 5.4 and 5.5 present the results obtained from these implementations respectively. From these tables, we notice that the asynchronous version achieves better speedup ratios, and these speedups are mostly linear.

The difference in the running times for the two strategies is due to the fact that the synchronous version generates a large amount of communication overhead in the system. Also, the synchronisation cost will be large if the number of tasks is greater than the number of available processors. This is because the tasks that complete their job remain idle for a long time whilst waiting for the other tasks to complete theirs. The same situation occurs as many times as the algorithm iterates to achieve convergence. Clearly, this generates a significant amount of overhead that degrades the performance of the algorithm.

On the other hand, in the asynchronous strategy, the number of subtasks generated is equal to the number of cooperating processors, and each task iterates to evaluate its points without any synchronisation. Therefore, the asynchronous strategy, which has no synchronisation delays, is better than the synchronous version.

In Figure 5.4, the timing results of the two strategies are illustrated.
k=0.0002, t=0.0018, λ=0.06, ε=10^-5 ; matrix size 7x7 ; r = 1.5

   No. of processors    Time (second)    Speedup
          1                23.118          1
          2                13.061          1.770
          3                10.141          2.279
          4                 8.036          2.876
          5                 7.447          3.104

k=0.0004, t=0.0036, λ=0.04, ε=10^-5 ; matrix size 9x9 ; r = 1.5

   No. of processors    Time (second)    Speedup
          1                26.684          1
          2                14.845          1.797
          3                 9.622          2.773
          4                 8.426          3.166
          5                 7.420          3.596

k=0.001, t=0.1, λ=0.005, ε=10^-5 ; matrix size 13x13 ; r = 1.5

   No. of processors    Time (second)    Speedup
          1                29.261          1
          2                16.634          1.759
          3                14.598          2.004
          4                10.817          2.705
          5                 8.822          3.316

TABLE 5.4: The Results From Implementing The First Strategy.
k=0.0002, t=0.0018, λ=0.06, ε=10^-5 ; matrix size 7x7 ; r = 1.5

   No. of processors    Time (second)    Speedup
          1                22.908          1
          2                12.876          1.779
          3                 9.818          2.333
          4                 7.742          2.958
          5                 6.085          3.764

k=0.0004, t=0.0036, λ=0.04, ε=10^-5 ; matrix size 9x9 ; r = 1.5

   No. of processors    Time (second)    Speedup
          1                24.843          1
          2                13.377          1.857
          3                 9.420          2.637
          4                 8.095          3.068
          5                 6.309          3.937

k=0.001, t=0.1, λ=0.005, ε=10^-5 ; matrix size 13x13 ; r = 1.5

   No. of processors    Time (second)    Speedup
          1                27.905          1
          2                14.134          1.974
          3                13.004          2.145
          4                 9.556          2.920
          5                 8.099          3.445

TABLE 5.5: The Results From Implementing The Third Strategy.
FIGURE 5.4: The Timing Results for the Two-Dimensional Heat Equation. [Graph omitted: time against number of processors for the first and third strategies.]
5.4 EXPERIMENTAL RESULTS FOR THE SECOND ORDER WAVE EQUATION

Consider the one-dimensional wave equation,

    ∂²U/∂t² = ∂²U/∂x² ,  0 <= x <= 1 , 0 <= t <= T ,                (5.4.1)

subject to the initial conditions,

    U(x,0) = (1/8) sin(πx) ,                                        (5.4.2a)
    ∂U/∂t (x,0) = 0 ,                                               (5.4.2b)

and the boundary conditions,

    U(0,t) = U(1,t) = 0 .                                           (5.4.2c)

The exact solution is given by,

    U(x,t) = (1/8) sin πx cos πt .                                  (5.4.3)

From the expectation of achieving stability advantages, a general implicit finite difference discretisation to (5.4.1) at the j+1, j and j-1 time levels (Table 3.1) is,

    -αλ² u_{i-1,j+1} + (1+2αλ²) u_{i,j+1} - αλ² u_{i+1,j+1}
      = (1-2α)λ² u_{i-1,j} + 2(1-(1-2α)λ²) u_{i,j} + (1-2α)λ² u_{i+1,j}
        + αλ² u_{i-1,j-1} - (1+2αλ²) u_{i,j-1} + αλ² u_{i+1,j-1} ,  i=1,...,n ,    (5.4.4)

with α = 1 for unconditional stability, λ = k/h, and a truncation error of order O(h²,k²), with its computational molecule given by Figure 5.5.
FIGURE 5.5 [Diagram omitted: the computational molecule of (5.4.4), coupling the points i-1, i, i+1 at the time levels j-1, j and j+1.]

Equation (5.4.4) gives a tridiagonal system of equations at the (j+1)th time level, which can be displayed in matrix form as,
    tridiag[ a , d , c ] u = b ,                                    (5.4.5a)
or
    A u = b ,                                                       (5.4.5b)

where a = c = -αλ² and d = 1+2αλ². The vector b is defined by,

    b_1 = 2(1-(1-2α)λ²) u_{1,j} + (1-2α)λ² u_{2,j} - (1+2αλ²) u_{1,j-1} + αλ² u_{2,j-1}
          + αλ² [u_{0,j+1} + u_{0,j-1}] + (1-2α)λ² u_{0,j} ,

    b_i = (1-2α)λ² u_{i-1,j} + 2(1-(1-2α)λ²) u_{i,j} + (1-2α)λ² u_{i+1,j}
          + αλ² u_{i-1,j-1} - (1+2αλ²) u_{i,j-1} + αλ² u_{i+1,j-1} ,  i=2,3,...,n-1 ,

    b_n = (1-2α)λ² u_{n-1,j} + 2(1-(1-2α)λ²) u_{n,j} + αλ² u_{n-1,j-1} - (1+2αλ²) u_{n,j-1}
          + αλ² [u_{n+1,j+1} + u_{n+1,j-1}] + (1-2α)λ² u_{n+1,j} .
The u values on the first time level are given by the initial condition (5.4.2a). Values on the second time level are obtained by applying the forward finite difference approximation to equation (5.4.2b) at t=0,

    ∂U/∂t (x_i,0) ≈ (u_{i,1} - u_{i,0})/k = 0 ,

or,
    u_{i,1} = u_{i,0} ,  which is a first order approximation.

Solutions on the third and subsequent time levels are generated iteratively by applying the AGE algorithm along the levels (k+1/2) and (k+1).

When we implemented the AGE algorithm on equation (5.4.5b), we arrived at the same form of equations along the levels (k+1/2) and (k+1) as those derived in Section 5.2. These explicit equations are given by (5.2.11a,b).
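The time-stepping procedure just described can be sketched as follows; a direct tridiagonal solve stands in for the AGE iteration, homogeneous boundary values are assumed as in (5.4.2c), and the grid parameters in the usage below are illustrative:

```python
import numpy as np

def wave_time_levels(n, lam, alpha, steps, u0):
    """March the implicit scheme (5.4.4): level 0 is the initial
    condition, level 1 repeats it (the first-order start that follows
    from dU/dt(x,0) = 0), and each later level solves a tridiagonal
    system.  A direct solve stands in for the AGE iteration."""
    a = alpha * lam ** 2
    A = ((1 + 2 * a) * np.eye(n)
         - a * (np.eye(n, k=1) + np.eye(n, k=-1)))
    q = (1 - 2 * alpha) * lam ** 2
    # Coefficients of the level-j terms on the right of (5.4.4).
    C = (2 - 2 * q) * np.eye(n) + q * (np.eye(n, k=1) + np.eye(n, k=-1))
    u_prev, u_curr = u0.copy(), u0.copy()        # levels j-1 and j
    for _ in range(steps):
        # The level-(j-1) terms of (5.4.4) are exactly -A u_prev.
        u_prev, u_curr = u_curr, np.linalg.solve(A, C @ u_curr - A @ u_prev)
    return u_curr
```

With α = 1 the scheme is unconditionally stable, so the computed amplitude remains bounded by the initial amplitude up to the discretisation error.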
By implementing this algorithm in parallel using the first and third strategies, we obtain similar results to those obtained previously in Section 5.2. This can be predicted, because the computational load on each of the processors is roughly the same. Figure 5.6 illustrates the results obtained from implementing these two strategies.
FIGURE 5.6: The Timing Results for the Second Order Wave Equation. [Graph omitted: time against number of processors for the first and third strategies, n = 13.]
5.5 THE NUMERICAL SOLUTION OF ONE-DIMENSIONAL PARABOLIC EQUATIONS BY THE AGE METHOD WITH THE D'YAKONOV SPLITTING

In this section, the AGE iterative method for solving the one-dimensional parabolic problem is introduced using a D'Yakonov splitting strategy. It is known that the diffusion equation, when solved by the AGE method, yields an algorithm with reduced computational complexity and greater accuracy for multidimensional problems.

Consider the heat conduction problem,

    ∂U/∂t = ∂²U/∂x² ,  t >= 0 ,                                     (5.5.1)

with the boundary conditions,

    U(0,t) = g1 ,  U(1,t) = g2 ,                                    (5.5.2)

and the initial condition,

    U(x,0) = f(x) .                                                 (5.5.3)

Let us assume that the rectangular solution domain R is covered by a rectangular grid with grid spacings h and k in the x and t directions respectively. The grid points (x,t) are given by,

    x_i = ih  for i=1,2,...,n ,  and  t_j = jk  for j=0,1,2,... .

A Crank-Nicolson implicit finite difference approximation to (5.5.1) at the jth and (j+1)th time levels, with principal truncation error O(h²,k²) (Table 3.1), can be written as,

    -λ u_{i-1,j+1} + (2+2λ) u_{i,j+1} - λ u_{i+1,j+1} = λ u_{i-1,j} + (2-2λ) u_{i,j} + λ u_{i+1,j} ,   (5.5.4)

for i=1,2,...,n; j=0,1,2,...; and λ = k/h².
For the totality of points i=1,2,...,n, equation (5.5.4) can be written in matrix form as,

    A u = b ,                                                       (5.5.5)

where A = tridiag[ a , d , c ] with c = a = -λ and d = 2λ+2,        (5.5.6)

and the column vector b is,

    b_1 = λ u_{0,j} + (2-2λ) u_{1,j} + λ u_{2,j} + λ g_1 ,
    b_i = λ u_{i-1,j} + (2-2λ) u_{i,j} + λ u_{i+1,j} ,  i=2,...,n-1 ,
    b_n = λ u_{n-1,j} + (2-2λ) u_{n,j} + λ u_{n+1,j} + λ g_2 .
5.5.1 Formulation of the AGE Method

We split the matrix A into components G1 and G2 such that,

    A = G1 + G2 ,                                                   (5.5.7)

where G1 and G2 are as defined in (5.2.8a) and (5.2.8b) respectively, with hd = (2+2λ)/2.

By using equation (5.5.7), the matrix equation (5.5.6) can be written in the form,

    (G1 + G2) u = b .                                               (5.5.8)

Another formulation, similar to the Peaceman-Rachford splitting and with the same accuracy, can be derived by using the D'Yakonov splitting formula. This can be written explicitly as,

    u^(k+1)* = (G1+rI)^-1 [ (G1-rI)(G2-rI) u^(k) + b ] ,            (5.5.9a)
    u^(k+1)  = (G2+rI)^-1 u^(k+1)* ,                                (5.5.9b)
where u^(k+1)* is an intermediate value and r is the iteration parameter.

Since (G1+rI) and (G2+rI) are easily invertible, then from equation (5.5.9), u^(k+1)* and u^(k+1) are given by,

    u^(k+1)* = (1/det) diag( det/w , [w -c; -a w] , ... , [w -c; -a w] ) [ (G1-rI)(G2-rI) u^(k) + b ] ,   (5.5.10a)

    u^(k+1)  = (1/det) diag( [w -c; -a w] , ... , [w -c; -a w] , det/w ) u^(k+1)* ,                       (5.5.10b)

where v = hd-r, w = r+hd and det = w² - ac. Each entry of the vector (G1-rI)(G2-rI)u^(k) + b couples at most four neighbouring components; for example, its second entry is

    va u_1^(k) + v² u_2^(k) + vc u_3^(k) + c² u_4^(k) + b_2 .
By carrying out the multiplications in equation (5.5.10a,b) we have,

(i) At level (k+1)*,

    u_1^(k+1)* = ( v² u_1^(k) + vc u_2^(k) + b_1 ) / w ,

and, for i = 2,4,...,n-1,

    u_i^(k+1)*     = ( A1 u_{i-1}^(k) + B1 u_i^(k) + C1 u_{i+1}^(k) + D1 u_{i+2}^(k) + E_i ) / det ,
    u_{i+1}^(k+1)* = ( A2 u_{i-1}^(k) + B2 u_i^(k) + C2 u_{i+1}^(k) + D2 u_{i+2}^(k) + E'_i ) / det ,

where,
    A1 = (vwa - a²c) ,  B1 = (v²w - vac) ,  C1 = (vwc - v²c) ,  D1 = { 0 for i=n-1 ; (wc² - vc²) otherwise } ,
    A2 = (wa² - va²) ,  B2 = (vwa - v²a) ,  C2 = (v²w - vac) ,  D2 = { 0 for i=n-1 ; (vwc - ac²) otherwise } ,
    E_i = (w b_i - c b_{i+1}) ,  E'_i = (w b_{i+1} - a b_i) ,

with the following computational molecules (Figure 5.7).

FIGURE 5.7: The AGE Method At Level (k+1)*. [Diagram omitted: each molecule couples the points i-1, i, i+1, i+2 at level k with the pair i, i+1 at level (k+1)*.]
(ii) At level (k+1), for i = 1,3,...,n-2,

    u_i^(k+1)     = ( w u_i^(k+1)* - c u_{i+1}^(k+1)* ) / det ,
    u_{i+1}^(k+1) = ( -a u_i^(k+1)* + w u_{i+1}^(k+1)* ) / det ,

and
    u_n^(k+1) = u_n^(k+1)* / w ,

with computational molecules given by Figure 5.8.

FIGURE 5.8: The AGE Method At Level (k+1). [Diagram omitted.]

The computational complexity is easily derived: 6 multiplications and 6 additions per point per AGE iteration, plus some precomputation of quantities such as c², w², etc. Normally, because the previous solution is a good approximation, only two iterations are usually required.
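A dense-matrix sketch of one D'Yakonov sweep (5.5.9) follows; the dense solves stand in for the explicit 2x2 block inverses, and the 2r scaling of b shown here is the normalisation under which the sweep is algebraically equivalent to the Peaceman-Rachford iteration (in the text the right-hand side vector is assumed to absorb this factor):

```python
import numpy as np

def dyakonov_age_sweep(G1, G2, b, u, r):
    """One AGE iteration with the D'Yakonov splitting (5.5.9):
    an intermediate value u* from the first factor, then a single
    solve with the second factor.  The 2r factor on b makes the
    sweep equivalent to the Peaceman-Rachford form, so the fixed
    point solves (G1 + G2) u = b."""
    I = np.eye(len(u))
    u_star = np.linalg.solve(G1 + r * I,
                             (G1 - r * I) @ (G2 - r * I) @ u + 2.0 * r * b)
    return np.linalg.solve(G2 + r * I, u_star)
```

The fixed-point property follows from the identity (G1+rI)(G2+rI) - (G1-rI)(G2-rI) = 2r(G1+G2).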
5.5.2 Numerical Results

A number of numerical experiments were conducted on a model problem to demonstrate the application of the AGE algorithm with the D'Yakonov splitting strategy to parabolic problems.

We considered the following problem,

    ∂U/∂t = ∂²U/∂x² ,

subject to the initial condition,

    U(x,0) = sin x ,  0 <= x <= π ,

and the boundary conditions,

    U(0,t) = 0 ,  U(π,t) = 0 ,  t >= 0 .

The exact solution is given by,

    U(x,t) = e^-t sin x .

Tables 5.6, 5.7 and 5.8 present the numerical solution, the absolute and relative errors, and the exact solution of this problem at appropriate grid points for several time levels nt. The results confirm that the accuracy given by the D'Yakonov splitting is approximately equivalent to that of the Peaceman-Rachford splitting and better than that of the Douglas-Rachford splitting, as expected.

By implementing this algorithm in parallel using the three strategies discussed earlier, we found that the speedups are slightly better than those found in Section 5.2. Figure 5.9 shows the running times achieved by applying these three strategies.
TABLE 5.6: Time level nt=5; k=0.005, h=0.1π, r=0.5, ε=10^-6

    x       Exact       D                      DR                     PR
   0.1π   0.302898   0.302948 (5.0x10^-5)   0.302963 (6.5x10^-5)   0.302948 (5.0x10^-5)
   0.2π   0.576146   0.576241 (9.4x10^-5)   0.576269 (1.2x10^-4)   0.576241 (9.5x10^-5)
   0.3π   0.792997   0.793127 (1.3x10^-4)   0.793166 (1.6x10^-4)   0.793127 (1.3x10^-4)
   0.4π   0.932224   0.932377 (1.5x10^-4)   0.932423 (1.9x10^-4)   0.932377 (1.5x10^-4)
   0.5π   0.980199   0.980359 (1.6x10^-4)   0.980407 (2.0x10^-4)   0.980360 (1.6x10^-4)
   0.6π   0.932224   0.932380 (1.5x10^-4)   0.932423 (1.9x10^-4)   0.932377 (1.5x10^-4)
   0.7π   0.792997   0.793127 (1.3x10^-4)   0.793166 (1.6x10^-4)   0.793128 (1.3x10^-4)
   0.8π   0.576146   0.576241 (9.4x10^-5)   0.576269 (1.2x10^-4)   0.576241 (9.4x10^-5)
   0.9π   0.302890   0.302948 (5.0x10^-5)   0.302963 (6.4x10^-5)   0.302948 (5.0x10^-5)

Entries give the numerical solution with the absolute error in parentheses. The relative errors are 1.6x10^-4 (D), 2.1x10^-4 (DR) and 1.6x10^-4 (PR) at every grid point; the iteration counts are 2 (D), 4 (DR) and 2 (PR).

LEGEND  D: D'Yakonov splitting   PR: Peaceman-Rachford splitting   DR: Douglas-Rachford splitting
TABLE 5.7: Time level nt=13; k=0.005, h=0.1π, r=0.5, ε=10^-6

    x       Exact       D                      DR                     PR
   0.1π   0.291021   0.291164 (1.4x10^-4)   0.291207 (1.8x10^-4)   0.291165 (1.4x10^-5)
   0.2π   0.553555   0.553838 (2.7x10^-4)   0.553909 (3.5x10^-4)   0.553828 (2.7x10^-5)
   0.3π   0.761903   0.762278 (3.7x10^-4)   0.762390 (4.8x10^-4)   0.762279 (3.7x10^-5)
   0.4π   0.895671   0.896112 (4.4x10^-4)   0.896244 (5.7x10^-4)   0.896113 (4.4x10^-5)
   0.5π   0.941765   0.942228 (4.6x10^-4)   0.942366 (6.0x10^-4)   0.942280 (4.6x10^-5)
   0.6π   0.895671   0.896112 (4.4x10^-4)   0.896244 (5.7x10^-4)   0.896113 (4.4x10^-5)
   0.7π   0.761904   0.762278 (3.7x10^-4)   0.762391 (4.8x10^-4)   0.762279 (3.7x10^-5)
   0.8π   0.553444   0.553828 (2.7x10^-4)   0.553909 (3.5x10^-4)   0.553828 (2.7x10^-5)
   0.9π   0.291022   0.291164 (1.4x10^-4)   0.291207 (1.8x10^-4)   0.291165 (1.4x10^-5)

Entries give the numerical solution with the absolute error in parentheses. The relative errors are 4.2x10^-4 (D), 6.3x10^-4 (DR) and 4.9x10^-5 (PR) at every grid point; the iteration counts are 2 (D), 4 (DR) and 2 (PR).
TABLE 5.8: Time level nt=21; k=0.005, h=0.1π, r=0.5, ε=10^-6

    x       Exact       D                      DR                     PR
   0.1π   0.285446   0.279839 (2.2x10^-4)   0.279908 (2.9x10^-4)   0.279840 (2.2x10^-4)
   0.2π   0.542951   0.532286 (4.3x10^-4)   0.532417 (5.6x10^-4)   0.532287 (4.3x10^-4)
   0.3π   0.747307   0.732629 (6.0x10^-4)   0.732809 (7.8x10^-4)   0.732630 (6.0x10^-4)
   0.4π   0.878513   0.861257 (7.0x10^-4)   0.861468 (9.1x10^-4)   0.861258 (7.0x10^-4)
   0.5π   0.923723   0.905576 (7.4x10^-4)   0.905801 (9.6x10^-4)   0.905580 (7.4x10^-4)
   0.6π   0.878513   0.861257 (7.0x10^-4)   0.861468 (9.1x10^-4)   0.861258 (7.0x10^-4)
   0.7π   0.747308   0.732629 (6.0x10^-4)   0.732809 (7.8x10^-4)   0.732630 (6.0x10^-4)
   0.8π   0.542951   0.532286 (4.3x10^-4)   0.532417 (5.6x10^-4)   0.532287 (4.3x10^-4)
   0.9π   0.285446   0.279839 (2.3x10^-4)   0.279908 (2.9x10^-4)   0.279840 (2.2x10^-4)

Entries give the numerical solution with the absolute error in parentheses. The relative errors are 8.2x10^-4 (D), 1.0x10^-3 (DR) and 8.2x10^-4 (PR) at every grid point; the iteration counts are 2 (D), 4 (DR) and 2 (PR).
FIGURE 5.9: The Timing Results for the One-Dimensional Heat Equation with D'Yakonov. [Graph omitted: time against number of processors for the first, second and third strategies, n = 13.]
5.6 THE NUMERICAL SOLUTION OF TWO-DIMENSIONAL PARABOLIC EQUATIONS BY THE AGE METHOD WITH D'YAKONOV SPLITTING

The AGE method with D'Yakonov splitting for solving a one-dimensional parabolic differential equation, described in Section 5.5, can be readily extended to problems involving higher space dimensions. Consider the two-dimensional heat equation,

    ∂U/∂t = ∂²U/∂x² + ∂²U/∂y² ,  0 <= x,y <= N and t >= 0 ,        (5.6.1)

with the initial condition,

    U(x,y,0) = F(x,y) ,                                            (5.6.2)

and the boundary conditions U(x,0,t), U(x,N,t), U(0,y,t) and U(N,y,t) prescribed on the boundary of the region.                             (5.6.3)

At the point P(x_i,y_j,t_r) in the solution domain R, the value of U(x,y,t) is denoted by u_{i,j,r}, where x_i = ih, y_j = jh for 0 <= i,j <= n+1 and h = N/(n+1). The increment k in the time t is chosen such that t_r = rk, for r = 0,1,2,.... A weighted finite difference approximation to (5.6.1) (see Table 3.1) at the point (i,j,r+1/2) is given by,

    -λθ u_{i-1,j,r+1} + (1+4λθ) u_{i,j,r+1} - λθ u_{i+1,j,r+1} - λθ u_{i,j-1,r+1} - λθ u_{i,j+1,r+1}
      = λ(1-θ) u_{i-1,j,r} + (1-4λ(1-θ)) u_{i,j,r} + λ(1-θ) u_{i+1,j,r} + λ(1-θ) u_{i,j-1,r} + λ(1-θ) u_{i,j+1,r} ,

for i,j = 1,2,...,n, r = 0,1,2,..., λ = k/h² and 0 <= θ <= 1. This approximation can be displayed in the more compact matrix form,

    A u^(k+1) = B u^(k) + b ,                                      (5.6.4)
where the vector b consists of the boundary values: the first block (j=1) collects the weighted boundary values along y=0 together with the adjacent corner terms, the interior blocks are,

    b_j = ( λ(1-θ) u_{0,j,r} + λθ u_{0,j,r+1} , 0 , ... , 0 , λ(1-θ) u_{n+1,j,r} + λθ u_{n+1,j,r+1} ) ,  for j=2,3,...,n-1 ,

and the last block (j=n) likewise collects the weighted boundary values along y=N. The coefficient matrices A and B in equation (5.6.4) take the same block tridiagonal forms as in (5.3.5a) and (5.3.5b) respectively.
If we choose θ = 1/2 and split A into the sum of constituent symmetric and positive definite matrices G1, G2, G3 and G4, such that,

    A = G1 + G2 + G3 + G4 ,                                        (5.6.5)

where G1, G2, G3 and G4 are the same as in (5.3.7a), (5.3.7b), (5.3.7c) and (5.3.7d) respectively.

Analogous to Section 5.5, the D'Yakonov splitting for equation (5.6.4) then takes the form,

    (G1+rI) u^(k+1/4) = [ (G1-rI)(G2-rI)(G3-rI)(G4-rI) ] u^(k) + b ,   (5.6.6a)
    (G2+rI) u^(k+1/2) = u^(k+1/4) ,                                    (5.6.6b)
    (G3+rI) u^(k+3/4) = u^(k+1/2) ,                                    (5.6.6c)
    (G4+rI) u^(k+1)   = u^(k+3/4) .                                    (5.6.6d)
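One four-stage sweep of (5.6.6) can be sketched schematically, with dense operations standing in for the explicit block sweeps; G here is an illustrative tuple (G1, G2, G3, G4):

```python
import numpy as np

def dyakonov_4stage_sweep(G, b, u, r):
    """One four-stage D'Yakonov AGE iteration as in (5.6.6): the
    product (G1-rI)(G2-rI)(G3-rI)(G4-rI) acts once on u^(k), then
    each factor (Gi+rI) is solved for in turn."""
    I = np.eye(len(u))
    v = u.copy()
    for Gi in reversed(G):            # apply (G4-rI) first, ..., (G1-rI) last
        v = (Gi - r * I) @ v
    v = v + b
    for Gi in G:                      # levels k+1/4, k+1/2, k+3/4, k+1
        v = np.linalg.solve(Gi + r * I, v)
    return v
```

At a fixed point u of this sweep, the product of the four factors (Gi+rI) applied to u equals the product of the four factors (Gi-rI) applied to u, plus b, which is the stationary form of (5.6.6).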
Let us consider the above iterative formulae at each of the four intermediate levels.

(i) At the first intermediate level (the (k+1/4)th step)

If we reorder the mesh points columnwise parallel to the y-axis, we find that G3 u = G1 u_(c) and G4 u = G2 u_(c), where the suffix (c) stands for columnwise ordering. Therefore, equation (5.6.6a) can be written as,

    u^(k+1/4) = (G1+rI)^-1 [ (G1-rI)(G2-rI)(G3-rI)(G4-rI) u^(k) + b ] ,   (5.6.7a)

where (G1+rI)^-1 is the block diagonal matrix (1/det) diag( det/w , [w -c; -a w] , ... ) as before; the two rowwise factors are applied in the rowwise ordering and the two columnwise factors in the columnwise ordering. By carrying out the necessary algebra, the values of u^(k+1/4) can be derived.
(ii) At the second intermediate level (the (k+1/2)th step)

From equation (5.6.6b) we have,

    u^(k+1/2) = (G2+rI)^-1 u^(k+1/4) ,

and hence we can find u^(k+1/2) by carrying out this matrix-vector multiplication with the block diagonal inverse (G2+rI)^-1.
(iii) At the third intermediate level (the (k+3/4)th step)

By reordering the mesh points columnwise parallel to the y-axis, equation (5.6.6c) can be written as,

    u_(c)^(k+3/4) = (G1+rI)^-1 u_(c)^(k+1/2) ,

and by carrying out the necessary algebra, we find the values of u^(k+3/4).
(iv) At the fourth (and final) intermediate level (the (k+1)th step)

In a similar manner to the third step, equation (5.6.6d) can be expressed as,

    u_(c)^(k+1) = (G2+rI)^-1 u_(c)^(k+3/4) ,

which yields the values of u^(k+1).
It can be readily shown that the method formulated above is a second order accurate AGE algorithm in k. Hence, the D'Yakonov split gives a second order AGE method. The computational work involved at each stage is the solution of (2x2) block systems, or, in explicit form, simple recurrence relations. The iterative procedure is continued until convergence to a specified level of accuracy ε is achieved.
5.6.1 Numerical Results
A number of numerical experiments were conducted on the model problem, i.e. the diffusion equation, to demonstrate the application of the AGE algorithm with the D'Yakonov splitting strategy on a two-dimensional parabolic problem.
Consider the following problem,

∂U/∂t = ∂²U/∂x² + ∂²U/∂y² + H(x,y,t) ,   (5.6.8)

where H(x,y,t) = 3 sin x sin y e^t − 4, defined in the region 0≤x,y≤1, t≥0.

We choose the exact solution to be,

U(x,y,t) = sin x sin y e^t + x² + y² ,  0≤x,y≤1, t≥0.

The initial and boundary conditions are defined so as to agree with the exact solution.
The numerical results for different values of x and y were compared with the corresponding results obtained from the different formulae. It is generally observed from Tables 5.9, 5.10 and 5.11 that the AGE algorithm using D'Yakonov splitting produces more accurate results with fewer iterations.
x=0.1, k=0.001, h=0.1, t=1.0, λ=0.1, r=1.5, ε=10⁻⁴

   y      0.1        0.2        0.3        0.4        0.5        0.6        0.7        0.8        0.9      No. of iters.
AGE-D    1.0×10⁻⁵   1.0×10⁻⁵   2.1×10⁻⁵   3.2×10⁻⁵   3.7×10⁻⁵   4.0×10⁻⁵   4.1×10⁻⁵   4.0×10⁻⁵   3.1×10⁻⁵       3
AGE-CN   1.75×10⁻⁵  3.4×10⁻⁵   4.98×10⁻⁵  6.29×10⁻⁵  7.24×10⁻⁵  7.69×10⁻⁵  7.42×10⁻⁵  6.21×10⁻⁵  3.78×10⁻⁵      3
AGE-IMP  2.0×10⁻⁵   3.9×10⁻⁵   5.7×10⁻⁵   7.3×10⁻⁵   8.4×10⁻⁵   9.0×10⁻⁵   8.9×10⁻⁵   7.6×10⁻⁵   4.8×10⁻⁵       4
Exact    0.029018   0.067946   0.126695   0.205177   0.303308   0.421006   0.558194   0.714801   0.890760

TABLE 5.9: The Absolute Errors of the Numerical Solution of Problem (5.6.8)

x=0.5, k=0.001, h=0.1, t=1.0, λ=0.1, r=1.5, ε=10⁻⁴

   y      0.1        0.2        0.3        0.4        0.5        0.6        0.7        0.8        0.9      No. of iters.
AGE-D    1.7×10⁻⁵   2.3×10⁻⁵   3.3×10⁻⁵   3.6×10⁻⁵   5.0×10⁻⁵   5.1×10⁻⁵   4.9×10⁻⁵   5.0×10⁻⁵   4.3×10⁻⁵       3
AGE-CN   *.24×10⁻⁵  1.42×10⁻⁴  2.07×10⁻⁴  2.62×10⁻⁴  3.04×10⁻⁴  3.25×10⁻⁴  3.16×10⁻⁴  2.68×10⁻⁴  1.65×10⁻⁴      3
AGE-IMP  8.41×10⁻⁵  1.66×10⁻⁴  2.41×10⁻⁴  3.07×10⁻⁴  3.58×10⁻⁴  3.87×10⁻⁴  3.38×10⁻⁴  3.34×10⁻⁴  2.15×10⁻⁴      4
Exact    0.303308   0.376183   0.468197   0.578931   0.707976   0.854943   1.019463   1.201191   1.399808

(* leading digit illegible in the original)

TABLE 5.10: The Absolute Errors of the Numerical Solution of Problem (5.6.8)
x=0.9, k=0.001, h=0.1, t=1.0, λ=0.1, r=1.5, ε=10⁻⁴

   y      0.1        0.2        0.3        0.4        0.5        0.6        0.7        0.8        0.9      No. of iters.
AGE-D    3.1×10⁻⁵   4.9×10⁻⁵   8.0×10⁻⁵   1.0×10⁻⁴   1.2×10⁻⁴   1.3×10⁻⁴   1.5×10⁻⁴   1.4×10⁻⁴   1.0×10⁻⁵       3
AGE-CN   3.78×10⁻⁵  7.45×10⁻⁵  1.09×10⁻⁴  1.4×10⁻⁴   1.65×10⁻⁴  1.81×10⁻⁴  1.83×10⁻⁴  1.63×10⁻⁴  1.09×10⁻⁴      3
AGE-IMP  4.84×10⁻⁵  9.59×10⁻⁵  1.4×10⁻⁴   1.81×10⁻⁴  2.15×10⁻⁴  2.4×10⁻⁴   2.48×10⁻⁴  2.29×10⁻⁴  1.63×10⁻⁴      4
Exact    0.890760   0.990813   1.109460   1.246013   1.399809   1.570209   1.756611   1.958450   2.175209

TABLE 5.11: The Absolute Errors of the Numerical Solution of Problem (5.6.8)
5.7 A NEW STRATEGY FOR THE NUMERICAL SOLUTION OF THE SCHRODINGER EQUATION

The numerical solution of the one-space-dimensional Schrodinger equation, a well-known equation in Quantum Mechanics, is obtained by a new direct method. The method depends on separating the real and imaginary parts of the discretized complex tridiagonal matrices, which arise from a Crank-Nicolson formulation of the differential equation, into a new decoupled form.
Given the equation,

i ∂U/∂t = −∂²U/∂x² ,   (5.7.1)

with the initial condition,

U(x,0) = H(x) ,  t=0 ,   (5.7.2)

and the boundary conditions,

U(0,t) = … , U(ℓ,t) = … .   (5.7.3)

A uniformly spaced network whose mesh points are x_r = rh, t_j = jk is used, for r=0,1,2,...,n+1 and j=0,1,2,...,m+1, with h = ℓ/(n+1), k = T/(m+1) and the mesh ratio λ = k/h².

A weighted approximation to the differential equation (5.7.1) at the point (x_r, t_{j+1/2}) is given by (Table 3.1),

−λθ u_{r-1,j+1} + (2λθ−i) u_{r,j+1} − λθ u_{r+1,j+1} = λ(1−θ) u_{r-1,j} − (2λ(1−θ)+i) u_{r,j} + λ(1−θ) u_{r+1,j} ,
r=1,2,...,n , j=0,1,2,... and 0≤θ≤1.   (5.7.4)
This approximation can be written in the more compact matrix form,

    | 2λθ−i   −λθ                     | | u₁ |   | b₁ |
    |  −λθ   2λθ−i   −λθ              | | u₂ |   | b₂ |
    |         ...    ...    ...       | | .. | = | .. |   (5.7.5a)
    |              −λθ   2λθ−i   −λθ  | | .. |   | .. |
    |                     −λθ   2λθ−i | | uₙ |   | bₙ |

i.e.,

A u = b ,   (5.7.5b)

where,

b₁ = λ(1−θ)(u_{0,j} + u_{2,j}) + λθ u_{0,j+1} − (2λ(1−θ)+i) u_{1,j} ,
b_r = λ(1−θ)(u_{r-1,j} + u_{r+1,j}) − (2λ(1−θ)+i) u_{r,j} ,  r=2,3,...,n−1 ,
b_n = λ(1−θ)(u_{n-1,j} + u_{n+1,j}) − (2λ(1−θ)+i) u_{n,j} + λθ u_{n+1,j+1} .
Here b is a column vector of order n consisting of the boundary values as well as the known u values at time level j, while u contains the values at time level (j+1) which we seek. If we let θ=½, we see that (5.7.4) corresponds to the well-known Crank-Nicolson method, with accuracy of order O(h²,k²).
5.7.1 Outline of the Method

The complex tridiagonal system (5.7.5b) can be rewritten as,

(C + iD) u = b ,   (5.7.6)

where C is the real symmetric tridiagonal matrix,

    | d  a          |
    | a  d  a       |
C = |    .. .. ..   | ,   (5.7.7a)
    |      a  d  a  |
    |         a  d  |

and D is the diagonal matrix,

D = cI ,   (5.7.7b)

with a = −λθ, d = 2λθ and c = −1 (so that c² = 1). Now let,

u = (v + iw) ,   (5.7.7c)
b = (f + ig) .   (5.7.7d)

The above system (5.7.7) can further be rewritten in the real variable form,

| C  −D | |v|   |f|
| D   C | |w| = |g| ,   (5.7.8)

which is obtained by separating out the real and imaginary parts of the complex coefficient matrix, i.e.,

(C + iD)(v + iw) = (f + ig) ,   (5.7.9)

to give,

Cv − Dw = f ,   (5.7.10a)
Dv + Cw = g .   (5.7.10b)
The system (5.7.10a,b) can further be reduced to uncoupled form by premultiplication by the matrix,

| C   D |
| −D  C |

to give,

| C²+D²    0    | |v|   |  Cf + Dg |
|   0    C²+D²  | |w| = | −Df + Cg | .   (5.7.11)
247
which represents two linear systems, with the same coefficient matrix
for the unknown components v and w, which need to be solved only once,
but with different right hand sides.
Since C was originally of tridiagonal form, then the matrix c2
will be quindiagonal and of the form,
2 2 2
l d +a 2ad a
d2+2a
2 2 2ad 2ad a I
I
2 2 2 2 a 2ad d +2a 2ad a , C " "
... " " " "" " "
"
I " " "
" " " ... ... " "" " ... ... ... " ... ... ... ...
c2 J ... ... ... ... " ... "
" " ... "
... .(5.7.12) " ..... 2 2'..... ..... ... 2 " 2.
a 2ad d +2a 2ad a ,
0 2 2ad d
2+2a
2 2adl
L a
d2
+a 2 J 2
a 2ad
whilst D 2
will remain in diagonal form.
Thus, the solution of the complex system has been reduced to solving two real quindiagonal systems with one and the same coefficient matrix,

(C² + D²) v = Cf + Dg   (Real Sol., RHS1) ,
(C² + D²) w = −Df + Cg  (Imag. Sol., RHS2) ,   (5.7.13)

where C² + D² is the quindiagonal matrix of (5.7.12) with 1 added to each diagonal entry (since D² = c²I = I), i.e. with diagonal entries d²+a²+1 at the two corners and d²+2a²+1 elsewhere. These systems can be solved directly by efficient elimination procedures involving no interchanges.

Since the coefficient matrix is unchanged, the Gaussian elimination algorithm need only be applied once to the matrix, whilst the operations on the two right-hand sides can be carried out simultaneously.
By the use of the quindiagonal solver [Conte, 1965] the computational complexity of the algorithm is now of order O(n) multiplications, whereas full Gaussian elimination is of order O(n³) multiplications.
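The reduction above can be sketched numerically as follows (a Python/NumPy sketch; the values of n and λ and the use of a dense solver are illustrative assumptions, since the thesis uses a quindiagonal elimination instead):

```python
import numpy as np

n = 8                                # number of interior mesh points (illustrative)
lam = 0.5                            # hypothetical mesh ratio lambda, theta = 1/2

# C: real symmetric tridiagonal part; D = cI: imaginary (diagonal) part
d, a, c = lam, -lam / 2.0, -1.0      # illustrative entries only
C = (np.diag(np.full(n, d)) +
     np.diag(np.full(n - 1, a), 1) +
     np.diag(np.full(n - 1, a), -1))
D = c * np.eye(n)

rng = np.random.default_rng(0)
f, g = rng.standard_normal(n), rng.standard_normal(n)   # b = f + i g

# Uncoupled real form (5.7.13): (C^2 + D^2) v = Cf + Dg, (C^2 + D^2) w = -Df + Cg
Q = C @ C + D @ D                    # quindiagonal since C is tridiagonal
v = np.linalg.solve(Q, C @ f + D @ g)
w = np.linalg.solve(Q, -D @ f + C @ g)

# Check against the direct complex solve of (C + iD) u = b
u = np.linalg.solve(C + 1j * D, f + 1j * g)
print(np.allclose(v + 1j * w, u))    # True
```

In a production code the dense `np.linalg.solve` on Q would be replaced by one quindiagonal factorisation reused for both right-hand sides, which is what yields the O(n) operation count quoted above.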
5.7.2 Numerical Results

Consider the problem of solving the Schrodinger equation,

i ∂u/∂t = −∂²u/∂x² ,  0≤x≤π/2 ,   (5.7.14)

with the initial condition,

u(x,0) = sin x + i cos x ,

and the boundary conditions,

U(0,t) = i e^{−it} ,   (5.7.15a)
U(π/2,t) = e^{−it} .   (5.7.15b)

The following tables illustrate the results obtained by solving problem (5.7.14) by this method.
 Numerical Solution        Exact Solution
  Real      Imag.           Real      Imag.
0.139508  0.970233        0.139497  0.970222
0.276164  0.940504        0.276154  0.940494
0.407198  0.891629        0.407189  0.891620
0.529943  0.824604        0.529935  0.824596
0.641904  0.740795        0.641894  0.740785
0.740794  0.641903        0.740785  0.641894
0.824604  0.529943        0.824596  0.529935
0.891631  0.407200        0.891620  0.407189
0.940504  0.276164        0.940494  0.276154
0.970231  0.139506        0.970222  0.139497

TABLE 5.12: Solution at Time Level nt=5, t=0.002
 Numerical Solution        Exact Solution
  Real      Imag.           Real      Imag.
0.131382  0.913729        0.131373  0.913720
0.260082  0.885734        0.260072  0.885724
0.383487  0.839707        0.383476  0.839696
0.499082  0.776583        0.499074  0.776575
0.604522  0.697645        0.604513  0.697645
0.697655  0.604523        0.697645  0.604513
0.776583  0.499082        0.776575  0.499074
0.839705  0.383485        0.839696  0.383476
0.885734  0.260082        0.885724  0.260072
0.913731  0.131383        0.913720  0.131373

TABLE 5.13: Solution at Time Level nt=17, t=0.08
 Numerical Solution        Exact Solution
  Real      Imag.           Real      Imag.
0.128728  0.895636        0.128772  0.895628
0.254931  0.868194        0.254922  0.868185
0.375894  0.823080        0.375883  0.823069
0.489203  0.761209        0.489192  0.761198
0.592553  0.683841        0.592543  0.683831
0.683838  0.592551        0.683830  0.592543
0.761207  0.489201        0.761198  0.489192
0.823079  0.375893        0.823069  0.375883
0.868204  0.254931        0.868185  0.254922
0.895636  0.128780        0.895628  0.128772

TABLE 5.14: Solution at Time Level nt=21, t=0.1
5.8 CONCLUSIONS

The parallel implementations of the AGE iterative method for solving one- and two-dimensional parabolic problems, as well as the second-order hyperbolic problem, were investigated in this chapter. Using the three parallel strategies described in Chapter 4, we notice that the asynchronous version achieves the better speed-up ratios, and these speed-ups are mostly linear.

This is because of the overheads generated in the other two strategies, and also because of the idle time spent waiting for the other processors to complete their computations. There is also the time needed to order the points in the second strategy, which further degrades the efficiency of the algorithm.
From the results obtained with the first two strategies, we notice that there is insufficient parallelism available to enable the processors to work efficiently. This is because the granularity of the (2x2) blocks is too small to keep the processors always busy.
The numerical experiments carried out on the solution of the model problems of one- and two-dimensional parabolic differential equations using the AGE method with D'Yakonov splitting yield an algorithm with reduced computational complexity and greater accuracy for multidimensional problems.

The (nxn) complex tridiagonal systems were solved by the new direct method presented in Section 5.7. This direct method enables us to reduce the complex system to two real quindiagonal systems with identical coefficient matrix. Since the coefficient matrix is unchanged, the Gaussian elimination algorithm need only be applied once to the matrix, whilst the operations on the two right-hand sides can be carried out simultaneously.
CHAPTER 6

NUMERICAL INVERSION OF THE LAPLACE TRANSFORMATION: SOME INVESTIGATIONS AND PARALLEL EXPLORATIONS

You can observe a lot
just by watching.
Y. Berra.
6.1 INTRODUCTION
The Laplace transform is an important technique which is used to
find the solution of differential equations with assigned boundary and
initial conditions in applied sciences and engineering. Already,
extensive tables of the Laplace transforms are available. However, it
is interesting to investigate and obtain numerically the inverse
Laplace transforms.
Another technique, that of integral transforms, which had its
origin in Heaviside's work, has been developed during the last few
years and has certain advantages over the classical methods.
The integral transform F(s) of a given function f(t) in the range a≤t≤b is defined as follows,

F(s) = ∫ₐᵇ k(s,t) f(t) dt ,   (6.1.1)

where k(s,t), a known function of s and t, is called the kernel of the transform, provided that the integral exists.
In the application of integral transforms to the solution of differential equations, use has so far been made of five different kernels. These five transforms are:

(a) The Laplace transform,

F(s) = ∫₀^∞ f(t) e^{−st} dt ,   (6.1.2)

(b) The Fourier sine and cosine transforms,

F(s) = ∫₀^∞ f(t) sin(st) dt  or  F(s) = ∫₀^∞ f(t) cos(st) dt ,   (6.1.3)

(c) The complex Fourier transform,

F(s) = ∫₋∞^∞ f(t) e^{ist} dt ,   (6.1.4)

(d) The Hankel transform,

F(s) = ∫₀^∞ f(t) t Jₙ(st) dt ,   (6.1.5)

where Jₙ(t) is the Bessel function of the first kind of order n.

(e) The Mellin transform,

F(s) = ∫₀^∞ f(t) t^{s−1} dt .   (6.1.6)

The effect of applying an integral transform to a partial
differential equation is to exclude temporarily a chosen independent
variable and to leave a partial differential equation in one less
variable for solution. The solution of this equation will be a function
of s and the remaining variables. When this solution has been
obtained, it has to be inverted to recover the lost variable. Thus,
if t is the variable eliminated and F(s) is one of the transforms given
above, we first obtain auxiliary equations giving F in terms of s and
the remaining independent variables. These are solved for F and then
inverted to obtain f(t).
In this chapter, we shall briefly describe the Laplace transform and its numerical inversion. A comparison between different methods suggested for the numerical inversion of the Laplace transform is given, in order to choose an appropriate method. Then an attempt to obtain more accurate results by applying extrapolation techniques in a Romberg quadrature strategy is considered. Finally, we implement the numerical inversion of the Laplace transform suggested by Crump [Crump, 1976] on the Balance 8000 MIMD parallel system, incorporating some parallelism.
6.2 THE NUMERICAL INVERSION OF THE LAPLACE TRANSFORM

The Laplace transform of a real function f:R → R, with f(t)=0 for t<0, and its inversion formula are defined as,

F(s) = L[f(t)] = ∫₀^∞ f(t) e^{−st} dt ,   (6.2.1)

f(t) = L⁻¹[F(s)] = (1/2πi) ∫_{v−i∞}^{v+i∞} F(s) e^{st} ds ,   (6.2.2)

with s = v+iw; v,w ∈ R and i = √(−1). v ∈ R is arbitrary, but greater than the real parts of all the singularities of F(s). The integrals in (6.2.1) and (6.2.2) exist for Re(s) > a ∈ R if,

(a) f is locally integrable,
(b) there exists a t₀≥0 and k,a ∈ R such that |f(t)| ≤ k e^{at} for all t>t₀ (i.e. f is of exponential order),
(c) for all t ∈ (0,∞) there is a neighbourhood in which f is of bounded variation.   (6.2.3)

In the following we always assume that f fulfils the above conditions and, in addition, that there are no singularities of F(s) to the right of the origin (for the case of singularities to the right, a suitable translation of the imaginary axis can be performed).
Although there exist extensive tables of transforms and their inverses, it is highly desirable to have methods of approximate numerical inversion suitable for computer implementation. A large number of different methods have been devised for the numerical evaluation of the Laplace inversion integral. In this section we briefly survey some of these methods, many of which are either orthogonal series expansions or weighted sums of values of the transform at a set of points.
Schmittroth [Schmittroth, 1960] has described a method in which
the inverse transform is obtained from the complex inversion integral
by use of numerical quadrature. This method gives good results, but
if the inverse transform is required for a large number of values of
the independent variable, the quadrature procedure must be repeated
for each value of the independent variable.
In cases where the inverse is required for many values of the independent variable, it is convenient to obtain the inverse as a series expansion in terms of a set of linearly independent functions.
Norden [Norden, 1955] has described two such methods in which the
expansion coefficients are calculated by solving a system of
simultaneous equations. The problem of solving simultaneous linear
equations can be reduced to one of solving a triangular system.
Salzer [Salzer, 1958] described a method using orthogonal polynomials (functions), which was later refined by Shirtlifte and Stephenson [Shirtlifte, 1961]. They attempted an approximate evaluation of the inversion integral using Gaussian quadrature in the complex plane. In this method it is necessary to find all the roots, real and complex, of a polynomial of high degree.
Lanczos and Papoulis [Lanczos, 1956] describe methods in which
the inverse transform is obtained as a series expansion in terms of
trigonometric functions, Legendre polynomials or Laguerre polynomials.
Papoulis obtained the inverse transform as a series expansion in
terms of Laguerre functions. The expansion coefficients are obtained
from the coefficients of the Taylor series expansion of the Laplace
transform by solving a triangular system of linear equations. The
disadvantage of this method is the necessity of obtaining the Taylor series expansion of the Laplace transform. Lanczos found the inverse transform as a series expansion in terms of Laguerre functions, by applying a conformal mapping to the Laplace transform and then developing the resulting function in a Taylor series expansion.
Weeks [Weeks, 1966] further refined the ideas of Lanczos and
Papoulis. He obtained the coefficients in the expansion of the inverse
transform directly by trigonometric interpolation applied to the Laplace
transform. The resulting algorithm is quite suitable for automatic
computation.
Longman [Longman, 1975] described a method in which the function F(s) is replaced by Padé approximants φ_{n,m}(s) in the inversion integral. Unfortunately, the construction of the rational function φ_{n,m}(s) requires a knowledge of the Taylor expansion of F(s) about the origin, and this makes it impossible to implement a procedure which uses only values of F(s).
Finally, Dubner and Abate [Dubner, 1968] showed that the inversion integral may be approximated by a certain Fourier series. This method will be described in detail in the next section.
6.3 NUMERICAL EXPERIMENTS

As stated in the previous section, a number of numerical inversion methods have been developed during the last few years. In this section, we shall confine ourselves to the methods using Fourier series approximations.

We shall describe the derivation of the Dubner and Abate method [Dubner, 1968]. Then, the natural continuation of this method as suggested by Durbin [Durbin, 1974] and Crump [Crump, 1976] is described. A comparison between these methods is given by a number of experimental results. Finally, we attempt to increase the accuracy by using the Romberg integration method.
Let f(t) be a real function of t satisfying the conditions (6.2.3). Then, by expanding equation (6.2.1) we have,

F(s) = ∫₀^∞ e^{−vt} f(t) cos wt dt − i ∫₀^∞ e^{−vt} f(t) sin wt dt   (6.3.1a)
     = Re{F(v+iw)} + i Im{F(v+iw)} .   (6.3.1b)

Also, by expanding equation (6.2.2), we have,

f(t) = (1/2πi) ∫_{v−i∞}^{v+i∞} F(s) e^{st} ds   (6.3.2a)

or

f(t) = (e^{vt}/2π) [ ∫₋∞^{+∞} (Re{F(s)} cos wt − Im{F(s)} sin wt) dw
       + i ∫₋∞^{+∞} (Im{F(s)} cos wt + Re{F(s)} sin wt) dw ] .   (6.3.2b)

The imaginary part in (6.3.2b) cancels out because of the parity of Re{F(s)} and Im{F(s)}. Again by the same argument, we have,

f(t) = (e^{vt}/π) ∫₀^∞ (Re{F(s)} cos wt − Im{F(s)} sin wt) dw .   (6.3.2c)

For t<0, f(t)=0, which means that,

∫₀^∞ (Re{F(s)} cos wt + Im{F(s)} sin wt) dw = 0 .   (6.3.2d)

Consequently, we obtain three formulae for the Laplace inverse f(t) corresponding to F(s). These are:

(a) f(t) = (2e^{vt}/π) ∫₀^∞ Re{F(s)} cos wt dw ,   (6.3.3a)

(b) f(t) = −(2e^{vt}/π) ∫₀^∞ Im{F(s)} sin wt dw ,   (6.3.3b)

(c) f(t) = (e^{vt}/π) ∫₀^∞ (Re{F(s)} cos wt − Im{F(s)} sin wt) dw .   (6.3.3c)
Fourier series were first used by Dubner-Abate, who suggested the following. Let h(t) be a real function of t with h(t)=0 for t<0. Consider sections of h(t) in intervals (nT,(n+1)T) and construct an infinite set of even 2T-periodic functions gₙ(t) (Figure 6.1), where,

        | h(nT−t) ,      −T≤t≤0    (a)
gₙ(t) = | h(nT+t) ,       0≤t≤T    (b)    (6.3.4)
        | h((n+2)T−t) ,   T≤t≤2T   (c)     n=0,2,4,... ,

        | h((n+1)T+t) ,  −T≤t≤0    (a)
gₙ(t) = | h((n+1)T−t) ,   0≤t≤T    (b)    (6.3.5)
        | h((n−1)T+t) ,   T≤t≤2T   (c)     n=1,3,5,... .

FIGURE 6.1: The 2T-periodic extensions gₙ(t) of h(t) = e^{−vt} f(t) on (−T,3T).
Then, by developing each gₙ(t) into a cosine Fourier series we have,

gₙ(t) = ½Aₙ,₀ + Σ_{k=1}^∞ Aₙ,ₖ cos kπt/T ,   (6.3.6)

where,

Aₙ,ₖ = (2/T) ∫_{nT}^{(n+1)T} h(t) cos kπt/T dt .   (6.3.7)

Since it is always possible to write,

h(t) = e^{−vt} f(t) ,   (6.3.8a)
or
f(t) = e^{vt} h(t) ,   (6.3.8b)

we have,

Σ_{n=0}^∞ Aₙ,ₖ = (2/T) ∫₀^∞ e^{−vt} f(t) cos kπt/T dt   (6.3.9a)
             = (2/T) Re{F(v+ikπ/T)} ,   (6.3.9b)

so that,

Σ_{n=0}^∞ e^{vt} gₙ(t) = (2e^{vt}/T) [ ½Re{F(v)} + Σ_{k=1}^∞ Re{F(v+ikπ/T)} cos kπt/T ] .   (6.3.9c)

By using the expressions (6.3.4b), (6.3.4c), (6.3.5b), (6.3.5c), (6.3.8a) and (6.3.8b) we have,

Σ_{n=0}^∞ e^{vt} gₙ(t) = f(t) + Σ_{k=1}^∞ e^{−2vkT} (f(2kT+t) + e^{2vt} f(2kT−t))
                      = f(t) + error1(v,t,T) .

In conclusion, for any 0≤t≤2T, we can write,

f(t) + error1(v,t,T) = (2e^{vt}/T) [ ½Re{F(v)} + Σ_{k=1}^∞ Re{F(v+ikπ/T)} cos kπt/T ] .   (6.3.10)

Equation (6.3.10) is the Dubner-Abate formula, where the error is a function of v, t and T. The factor Σ_{k=1}^∞ e^{−2vkT} e^{2vt} f(2kT−t) is the most disturbing one, since it increases exponentially with t. Numerically, the Dubner-Abate method is only valid for t≤T/2.
Similar numerical inversion techniques can be developed that utilise Im{F(s)} rather than Re{F(s)}. Durbin [Durbin, 1973] suggested the following. Consider h(t) in the interval (nT,(n+1)T) with an infinite set of odd 2T-periodic functions kₙ(t) (Figure 6.2).

FIGURE 6.2: The 2T-periodic odd extensions kₙ(t) of h(t) = e^{−vt} f(t) on (−T,3T).

By definition we have,

kₙ(t) = |  h(t) ,         nT≤t≤(n+1)T   (a)    (6.3.11)
        | −h(2nT−t) ,  (n−1)T≤t≤nT      (b)     n=0,1,2,... .

Similarly, on the intervals (−T,T), (0,T) and (T,2T), we have,

kₙ(t) = | −h(nT−t) ,      −T≤t≤0    (a)
        |  h(nT+t) ,       0≤t≤T    (b)    (6.3.12)
        | −h((n+2)T−t) ,   T≤t≤2T   (c)     n=0,2,4,... ,

kₙ(t) = |  h((n+1)T+t) ,  −T≤t≤0    (a)
        | −h((n+1)T−t) ,   0≤t≤T    (b)    (6.3.13)
        |  h((n−1)T+t) ,   T≤t≤2T   (c)     n=1,3,5,... .

Hence, the Fourier representation for each odd function kₙ(t) is,

kₙ(t) = Σ_{k=1}^∞ Bₙ,ₖ sin kπt/T ,   (6.3.14)

where we find that,

Bₙ,ₖ = (2/T) ∫_{nT}^{(n+1)T} e^{−vt} f(t) sin kπt/T dt .   (6.3.15)
Then, by summing (6.3.15) over n and comparing it with equation (6.3.1a), we have,

Σ_{n=0}^∞ Bₙ,ₖ = (2/T) ∫₀^∞ e^{−vt} f(t) sin kπt/T dt   (6.3.16a)
             = −(2/T) Im{F(v+ikπ/T)} .   (6.3.16b)

Again, summing (6.3.14) over n and multiplying both sides by e^{vt}, we obtain an expression similar to equation (6.3.9c),

Σ_{n=0}^∞ e^{vt} kₙ(t) = −(2e^{vt}/T) Σ_{k=1}^∞ Im{F(v+ikπ/T)} sin kπt/T .   (6.3.17)

Likewise, on the interval (0,2T), using the equations (6.3.8a), (6.3.8b), (6.3.12c), (6.3.13b) and (6.3.13c), we find,

Σ_{n=0}^∞ e^{vt} kₙ(t) = f(t) + Σ_{k=1}^∞ e^{−2vkT} [f(2kT+t) − e^{2vt} f(2kT−t)] .   (6.3.18)
Hence, another representation for f(t) is,

f(t) + error2(v,t,T) = −(2e^{vt}/T) Σ_{k=1}^∞ Im{F(v+ikπ/T)} sin kπt/T .   (6.3.19)

The advantages of the Durbin method are twofold. First, the error bound on the inverse f(t) becomes independent of t, instead of being exponential in t. Second, the trigonometric series obtained for f(t) in terms of F(s) is valid on the whole period 2T of the series.
Kenny S. Crump [Crump, 1976] utilised a combination of both the real and the imaginary parts of the Fourier series. His proposal is as follows.

For n=0,1,2,..., define gₙ(t), −∞<t<∞, by,

gₙ(t) = e^{−vt} f(t) ,  2nT≤t≤2(n+1)T ,

where gₙ(t) is a periodic function with period 2T. The Fourier series representation of gₙ(t) is then given by,

gₙ(t) = ½Aₙ,₀ + Σ_{k=1}^∞ {Aₙ,ₖ cos(kπt/T) + Bₙ,ₖ sin(kπt/T)} ,   (6.3.20)

where the Fourier coefficients Aₙ,ₖ and Bₙ,ₖ are given by,

Aₙ,ₖ = (1/T) ∫_{2nT}^{2(n+1)T} e^{−vt} f(t) cos(kπt/T) dt
and
Bₙ,ₖ = (1/T) ∫_{2nT}^{2(n+1)T} e^{−vt} f(t) sin(kπt/T) dt .   (6.3.21)

By summing equation (6.3.20) with respect to n, we have,

Σ_{n=0}^∞ gₙ(t) = (1/T) [ ½F(v) + Σ_{k=1}^∞ Re{F(v+ikπ/T)} cos(kπt/T) − Im{F(v+ikπ/T)} sin(kπt/T) ] ,

which can be written in the form of the inverse Laplace transform as,

f(t) + error3(v,t,T) = (e^{vt}/T) [ ½F(v) + Σ_{k=1}^∞ {Re{F(v+ikπ/T)} cos(kπt/T) − Im{F(v+ikπ/T)} sin(kπt/T)} ] ,   (6.3.22)

where,

error3(v,t,T) = e^{vt} Σ_{n=1}^∞ gₙ(t) = e^{vt} Σ_{n=1}^∞ exp{−v(2nT+t)} f(2nT+t) ,

or

error3(v,t,T) = Σ_{n=1}^∞ e^{−2nvT} f(2nT+t) .   (6.3.23)
However, by comparing error1, error2 and error3, we notice that,

error3 = (error1 + error2)/2 .
To increase the rate of convergence of equation (6.3.22), and thereby reduce the truncation error, Crump suggested the use of either the Euler transformation (first used by Simon, Stroot and Weiss [Simon, 1972]) or the epsilon algorithm [Wynn, 1962]. He also showed the superiority of the epsilon algorithm in speeding up the rate of convergence.
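As an illustration of formula (6.3.22), the following is a minimal Python sketch of the truncated series without any convergence acceleration (the choices of T, v and the truncation point K are illustrative assumptions):

```python
import numpy as np

def crump_inverse(F, t, T, v, K):
    """Evaluate Crump's Fourier-series formula (6.3.22),
    truncated after K terms (no epsilon algorithm applied)."""
    k = np.arange(1, K + 1)
    s = v + 1j * k * np.pi / T
    Fk = F(s)
    series = 0.5 * F(v + 0j).real + np.sum(
        Fk.real * np.cos(k * np.pi * t / T)
        - Fk.imag * np.sin(k * np.pi * t / T))
    return np.exp(v * t) / T * series

# Example: F(s) = 1/(s+1)  <->  f(t) = exp(-t)
F = lambda s: 1.0 / (s + 1.0)
T = 10.0                        # half-period; the series is valid on 0 <= t <= 2T
v = -np.log(1e-8) / (2 * T)     # shift chosen so that error3 is of order 1e-8
print(crump_inverse(F, 1.0, T, v, K=20000))   # ~ exp(-1) = 0.3679
```

The slow O(1/k) decay of the terms is why plain truncation needs many thousands of terms here, and why the epsilon algorithm pays off in practice.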
In 1984, Honig and Hirdes [Honig, 1984] attempted to accelerate the convergence of equation (6.3.19) by using three acceleration methods: the epsilon algorithm, the minimum-maximum method and a method based on curve fitting. They also tested other acceleration methods, such as the Euler transformation and Aitken's extrapolation procedure, but these turned out to be less efficient.

To speed up the computation and to increase accuracy, Dubner-Abate and Durbin introduced the use of Fast Fourier Transform (FFT) techniques [Cooley and Tukey, 1965] (this technique is described in the next section).
To complete our comparison between these methods, we list the results obtained for solving the two problems,

(1) F(s) = s/(s²+1)² ,  f(t) = (t/2) sin t ,

(2) F(s) = 1/(s²+s+1) ,  f(t) = (2/√3) exp(−t/2) sin(t√3/2) .

The results for these problems are illustrated in Tables 6.1 and 6.2.

The trapezoidal rule lacks the degree of accuracy which is generally required of a quadrature formula. However, Romberg integration is a method with wide applications, because it uses the trapezoidal rule to give preliminary approximations and then applies the Richardson extrapolation process to obtain improvements to these approximations.
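A generic sketch of this Romberg scheme (not the thesis program): trapezoidal estimates on successively halved meshes are combined by Richardson extrapolation.

```python
import numpy as np

def romberg(f, a, b, levels):
    """Romberg table: R[i][j] uses 2**i subintervals of [a, b]
    and j Richardson extrapolation steps."""
    R = [[0.0] * levels for _ in range(levels)]
    h = b - a
    R[0][0] = 0.5 * h * (f(a) + f(b))
    for i in range(1, levels):
        h /= 2.0
        # refine the trapezoidal estimate by adding the new midpoints
        mids = sum(f(a + (2 * m - 1) * h) for m in range(1, 2 ** (i - 1) + 1))
        R[i][0] = 0.5 * R[i - 1][0] + h * mids
        for j in range(1, i + 1):
            # each extrapolation step cancels the next power of h^2
            R[i][j] = R[i][j - 1] + (R[i][j - 1] - R[i - 1][j - 1]) / (4 ** j - 1)
    return R[levels - 1][levels - 1]

print(abs(romberg(np.exp, 0.0, 1.0, 6) - (np.e - 1)))  # error is tiny (~1e-12)
```

The triangular array produced by this recurrence has exactly the shape of Table 6.3 below: each row adds one mesh halving, each column one extrapolation.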
To improve the accuracy of the final results for a reduced number of terms n, we applied Romberg integration to the numerical inversion methods suggested by Dubner-Abate, Durbin and Crump.

Although the Romberg strategy appears to give more accurate results for small time intervals, these intervals are too small to be of practical use over the full range of the transform. Table 6.3 shows the numerical inversion of the Laplace transform with Romberg integration, using the Durbin method.
F(s) = s/(s²+1)² ,  f(t) = (t/2) sin t.

 t   Exact Solution   Dubner and Abate   Durbin       Crump
 1     0.4207355        0.420732         0.4207785    0.4207348
 2     0.9092975        0.909297         0.9092443    0.9092966
 3     0.2116800        0.211683         0.2116644    0.2116795
 4    −1.5136050       −1.513614        −1.512939    −1.5136050
 5    −2.3973107       −2.397300        −2.397216    −2.3973100
 6    −0.8382466       −0.838262        −0.8382199   −0.8382477
 7     2.2994530        2.299529         2.299371     2.2994370
 8     3.9574329        3.957423         3.957483     3.9574310
 9     1.8545331        1.854535         1.854252     1.854533
10    −2.7201055       −2.720956        −2.719565    −2.7201030

TABLE 6.1
F(s) = 1/(s²+s+1) ,  f(t) = (2/√3) exp(−t/2) sin(t√3/2).

 t   Exact Solution   Dubner and Abate   Durbin       Crump
 1     0.5335073        0.5335338        0.5335343    0.5335066
 2     0.4192797        0.4192651        0.4192649    0.4192776
 3     0.1332426        0.1332654        0.1332644    0.1332419
 4    −0.0495299       −0.0494962       −0.0494953   −0.0495370
 5    −0.0879424       −0.0879581       −0.0879563   −0.0879449
 6    −0.0508923       −0.0508995       −0.0508996   −0.0508929
 7    −0.0076437       −0.0076448       −0.0076477   −0.0076439
 8     0.0127151        0.0127121        0.0127137    0.0127149
 9     0.0128046        0.0128020        0.0128079    0.0128046
10     0.0053854        0.0053849        0.0053881    0.0053859

TABLE 6.2
F(s)=(2f;1/(s+l»; f(t)=2exp(t).
Increment in t=O.OOOOl
0.4686862
0.4687000 0.4687046
0.4686812 0.4686749 0.4686729
0.4686969 0.4687022 0.4687039 0.4687044
0.4686954 0.4686948· 0.4686943 0.4686942 0.4686942
TABLE 6.3
6.3.1 The Implementation of the Fast Fourier Transform (FFT) Technique

Although the earliest discovery of the FFT technique was in 1942 (Danielson and Lanczos), the technique became generally known only in the mid-1960s from the work of Cooley and Tukey [Cooley, 1965]. This technique reduces the computational time of the Fourier transform from N² to less than N log₂N, where N is usually taken to be a power of 2.

An attempt to introduce the use of the FFT for numerically inverting the Laplace transform was first suggested by Wing [Wing, 1968]. Dubner-Abate also applied the FFT technique to their method. Their implementation is as follows.
The finite Fourier transform pair X(j) ↔ A(k) is defined by,

X(j) = Σ_{k=0}^{N−1} A(k) exp(−(2πi/N) jk) ,   (6.3.24)

for j=0,1,2,...,N−1. Also, equation (6.3.10) may be written as,

f(t) = (e^{vt}/T) Σ_{n=−∞}^{∞} Re{F(v + iπn/T)} exp(iπnt/T) .   (6.3.25)

Now suppose we require the value of f(t) at the equidistant points t=jΔt, where j=0,1,2,...,N−1, i.e. Δt is the desired sampling interval and the maximum t-value is t_max = NΔt. Then we have either T = 2t_max or T = NΔt. From these definitions we can write equation (6.3.25) as,

f(jΔt) = b(j) Σ_{k=0}^{N−1} A(k) exp(−(2πi/N) jk) ,   (6.3.26a)

where,

A(k) = (1/(NΔt)) Σ_{n=−∞}^{∞} Re{F(v + (2πi/(NΔt))(k+nN))} ,   (6.3.26b)

and,

b(j) = 2 exp(vjΔt) .   (6.3.26c)

Hence, the right-hand side of equation (6.3.26a) is of the same form as equation (6.3.24), thereby permitting the use of the FFT technique. In a similar manner, Durbin applied the FFT technique to his method.
In 1987, Hsu and Dranoff [Hsu, 1987] implemented the FFT technique for the method described by Crump (equation 6.3.22). By applying the trapezoidal rule approximation to equation (6.3.2b), with w = kπ/T and Δw = π/T, we obtain,

f(t) ≈ (e^{vt}/2T) Σ_{k=−∞}^{∞} F(v+ikπ/T) exp(ikπt/T) .   (6.3.27)

Assuming t=jΔT, equation (6.3.27) can be written as,

f(jΔT) = (exp(vjΔT)/2T) Σ_{k=0}^{N−1} {A(k) exp(2πikj/N)} ,   (6.3.28)

where,

A(k) = Σ_{n=−∞}^{∞} F{v + iπ(k+nN)/T} ,   (6.3.29)

with NΔT=2T and j=0,1,2,...,N−1. Now equations (6.3.28) and (6.3.29) can be computed by the FFT technique as follows. First, A(k) is obtained from F(s) by the use of equation (6.3.29). Then the complex conjugate of A(k) is the input to the FFT subroutine, and the output is X̄(j) for j=0,1,2,...,N−1. The inverse function is then given by,

f(jΔT) = (exp(vjΔT)/2T) X̄(j) ,  j=0,1,2,...,N−1 ,   (6.3.30)

where X̄(j) is the complex conjugate of X(j). Since the imaginary part of X(j) is zero, X̄(j)=X(j).
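The equivalence that makes this implementation work can be checked directly: the sums in (6.3.24) are exactly what an FFT routine computes. A small NumPy sketch (the random A(k) values are only a stand-in for the coefficients of (6.3.29)):

```python
import numpy as np

N = 64
rng = np.random.default_rng(1)
A = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # stand-in for A(k)

# Direct evaluation of X(j) = sum_k A(k) exp(-(2*pi*i/N) jk): O(N^2) operations
ks = np.arange(N)
X_direct = np.array([(A * np.exp(-2j * np.pi * jj * ks / N)).sum()
                     for jj in range(N)])

# The same N sums at once via the FFT: O(N log N) operations
X_fft = np.fft.fft(A)   # NumPy's forward FFT uses this same sign convention

print(np.allclose(X_direct, X_fft))   # True
```

The saving over the direct O(N²) evaluation is what motivates the FFT route whenever f(t) is wanted at many equidistant points at once.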
There are some disadvantages of the FFT technique when applied to the numerical inversion of Laplace transforms. These disadvantages are:

(i) When only a small number of transform values are required, it is not economic to use the FFT, since the direct Fourier transform needs only O(N) operations for each transformed value required.

(ii) N must be a highly composite number (i.e. N = 2^m); otherwise it is necessary to add a string of zero values to the data to make the length a power of 2.

(iii) If a reordering has to be carried out, no real saving in computing time is achieved for N<50.

Thus, from the remarks above, we can say that when N is too small, or only a small number of transformed values are required, the FFT algorithm is not economic.
6.4 PARALLEL IMPLEMENTATION OF THE NUMERICAL INVERSION OF THE LAPLACE TRANSFORM

In this section we give a parallel treatment of Crump's expression for numerically inverting the Laplace transform,

f(t) ≈ (e^{vt}/T) [ ½F(v) + Σ_{k=1}^∞ {Re{F(v+ikπ/T)} cos(kπt/T) − Im{F(v+ikπ/T)} sin(kπt/T)} ] .

Numerical experiments with this parallel algorithm have been carried out on the Balance 8000 multiprocessor MIMD system, where we used all the available 5 processors.

The speed-up values represent the ratio of the time taken by the sequential version, which contained none of the special programming constructs required for parallel processing, to the time taken by the parallel version.
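The decomposition used for the parallel version can be sketched as follows: the k-summation is split into one contiguous chunk per processor, and the partial sums are combined at the end. The sketch below performs the chunked evaluation serially (the chunk bounds, parameter values and example transform are illustrative assumptions; on the Balance 8000 each chunk ran on its own processor):

```python
import numpy as np

def partial_sum(F, t, T, v, k_lo, k_hi):
    """One processor's share of Crump's series: terms k_lo..k_hi-1."""
    k = np.arange(k_lo, k_hi)
    s = v + 1j * k * np.pi / T
    Fk = F(s)
    return np.sum(Fk.real * np.cos(k * np.pi * t / T)
                  - Fk.imag * np.sin(k * np.pi * t / T))

F = lambda s: 1.0 / (s + 1.0)          # example transform of f(t) = exp(-t)
t, T, v, K, P = 1.0, 10.0, 0.9, 20000, 5

# Static decomposition: processor p sums the chunk [1 + p*K//P, 1 + (p+1)*K//P)
bounds = [1 + p * K // P for p in range(P + 1)]
chunks = [partial_sum(F, t, T, v, bounds[p], bounds[p + 1]) for p in range(P)]
# (each chunk would be evaluated on a separate processor; in modern Python,
#  multiprocessing.Pool.starmap could stand in for the forked children)

f_par = np.exp(v * t) / T * (0.5 * F(v).real + sum(chunks))
f_seq = np.exp(v * t) / T * (0.5 * F(v).real + partial_sum(F, t, T, v, 1, K + 1))
print(np.isclose(f_par, f_seq))   # True; the decomposition changes nothing
```

Because every term of the series is independent, the only sequential work is the final reduction of the P partial sums, which is why near-linear speed-ups are attainable.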
Tables 6.4 and 6.5 list the elapsed execution times (in seconds), minus the cost of forking the child processes and of the child page-table build-up, since child processes do not automatically inherit the parent's page table when they are created. The speed-up values are plotted in Figures 6.3 and 6.4.

F(s) = exp(−4√s) ;  f(t) = 2 exp(−4/t)/√(πt³)

No. of Points   No. of Processors   Elapsed time in seconds   Speed-up
     10                1                   55.008               1
                       2                   34.942               1.574
                       3                   27.828               1.976
                       4                   22.422               2.453
                       5                   17.522               3.139

TABLE 6.4

F(s) = s^{−1/2} exp(−s^{−1}) ;  f(t) = (πt)^{−1/2} cos(2√t)

No. of Points   No. of Processors   Elapsed time in seconds   Speed-up
     10                1                   31.994               1
                       2                   16.628               1.924
                       3                   10.784               2.966
                       4                    8.828               3.624
                       5                    7.989               4.004

TABLE 6.5

FIGURE 6.3: The Timing Results for the First Problem.

FIGURE 6.4: The Timing Results for the Second Problem.
6.5 CONCLUSION

In this chapter, we have presented different methods for numerically inverting the Laplace transform. Each of these methods has its own shortcomings. Although we recommend Crump's method, there is no general technique to invert the Laplace transform numerically.

The proper formula for the numerical inversion of the Laplace transform based on the Fourier series method should contain both the sine and cosine function terms. Eliminating either the sine or cosine function and replacing the integration by a discrete formula will distort the original function into an even or odd function, with a concomitant decrease in the working interval of the inverted function from (0,2T) to (0,T).
275
An attempt to increase the accuracy by applying Romberg
integration gave accurate results only for small time intervals,
which are too small to be of practical use over the full range of
the transform.
Although the direct FFT technique was found to be easy to apply
to the numerical inversion of the Laplace transform, it has been
shown to have the many disadvantages outlined earlier.
The improvement in performance attained by the parallel
implementation of the Laplace inversion is considerable, which means
that the numerical inversion of the Laplace transform algorithm is a
viable parallel algorithm.
CHAPTER 7
CONCLUSIONS AND FINAL REMARKS
I am no prophet 
and here's no great matter.
T.S. Eliot.
276
In the past, computers have gained processing power, firstly,
through an increase in the density of integration and, secondly,
through the operational acceleration of their basic switching elements.
Both these methods, however, lead to higher power dissipation per unit
area, to which there are fundamental thermodynamic limits.
For any further improvement in the performance attainable by
present-day computers, it is necessary that radical new developments
take place.
Parallel processing, therefore, is widely viewed as the only
natural and feasible way forward to achieve a significant increase in
processing power. Parallel computers, as discussed in the
introductory chapters of this thesis, are classified into various
types, each of which has an effective range of problems for which it
is most suitable.
To date, with regard to the imminent next generation of
computer systems, the epitome of contemporary computing, the
supercomputer, consists of banks of high-speed processors, operating
in parallel on arrays of numbers or closely coupled in a pipeline,
with further processors to pass information to and from secondary
storage. The total assembly achieves a very high throughput of data
and makes use of densely packed VLSI chips.
In general, programming parallel systems is more difficult than
programming uniprocessor systems, and this has led to the parallelism
being concealed on most existing MIMD systems. This kind of system
constitutes the primary testbed for all the numerical algorithms
designed and analysed herein. The actual prototype on which the
extensive bulk
277
of our experiments and measurements were carried out was the Balance
8000 MIMD system at Loughborough University of Technology.
The problem which arises here lies in making the p processors
cooperate, so that one problem can be appropriately partitioned
amongst them and solved with greater speed than it could be on a
uniprocessor.
The analysis of an algorithm is important for different reasons.
The most straightforward reason is to discover its vital statistics
in order to evaluate its suitability for various applications or to
compare it with other algorithms. Generally, the vital statistics of
interest are time and space, most often time. Indeed, for all
the algorithms presented in this thesis, we were interested in
determining the times for an implementation of a particular algorithm
on the Balance 8000 parallel system. From that analysis, we were able
to measure the relative speedup ratio of each algorithm, which was the
main criterion for assessing the efficiency gained by using parallelism.
In this thesis, we have demonstrated a characterization of the
architecture of current parallel computers, such as SIMD and MIMD
computers, and how research in this area is developing as technology
advances to where a whole computer can be assembled on a silicon VLSI
chip, which may contain millions of transistors.
In the second chapter, we have shown how to program parallel
systems. However, as we have demonstrated, different applications
require different language structures. One of the major divisions in
language development is that between the imperative and declarative
styles of programming. The other major development is the object-oriented
approach to programming. We have also shown that some specific
languages have been used on parallel computers by adding
parallel constructs.
In Chapters 4 and 5, the parallel AGE iterative method, to solve
one- and two-dimensional elliptic and parabolic equations as well as
278
the wave equation, was developed and implemented on the Balance 8000
parallel system. The implementation of the parallel AGE iterative
method was programmed using different versions and strategies, involving
synchroneity and asynchroneity together with natural or red-black
ordering. It is clear that the different strategies exhibited
different timings and losses when they were run on the Balance 8000
parallel system.
It can be seen from the experimental results that the best
results were obtained when the problem was solved using the asynchronous
strategy of the parallel AGE iterative method. This is due to the
synchronization overheads occurring at the end of each iteration in
the synchronous versions. Also, we noticed that the results obtained
by applying the first strategy are better than those of the second
strategy. This is because the total number of computer operations in
the second strategy is higher than in the first, since the old values
are used while evaluating the next point in the second strategy.
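The two-sweep AGE iteration discussed above can be sketched as follows (a dense-matrix toy version in place of the thesis's Fortran implementation; the splitting A = G1 + G2 into 2x2 diagonal blocks, the test system and the choice r = 1 are assumptions made for illustration):

```python
def solve(M, b):
    """Tiny dense Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[p] = A[p], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def age_iterate(a, d, c, b, r=1.0, sweeps=200):
    """AGE iteration for a tridiagonal system (n even): subdiagonal a,
    diagonal d, superdiagonal c.  A is split as A = G1 + G2, where each
    part carries half the diagonal plus alternate off-diagonal
    couplings, so both are block-diagonal with easily inverted 2x2 blocks."""
    n = len(d)
    G1 = [[0.0] * n for _ in range(n)]
    G2 = [[0.0] * n for _ in range(n)]
    for i in range(n):
        G1[i][i] = G2[i][i] = d[i] / 2.0
    for j in range(0, n - 1, 2):          # couplings (1,2),(3,4),... in G1
        G1[j][j + 1] = c[j]
        G1[j + 1][j] = a[j + 1]
    for j in range(1, n - 1, 2):          # couplings (2,3),(4,5),... in G2
        G2[j][j + 1] = c[j]
        G2[j + 1][j] = a[j + 1]
    def plus_r(G, s):
        return [[G[i][k] + (s if i == k else 0.0) for k in range(n)] for i in range(n)]
    u = [0.0] * n
    for _ in range(sweeps):
        # first sweep : (G1+rI) u* = b - (G2-rI) u
        rhs = [b[i] - sum(G2[i][k] * u[k] for k in range(n)) + r * u[i] for i in range(n)]
        uh = solve(plus_r(G1, r), rhs)
        # second sweep: (G2+rI) u  = b - (G1-rI) u*
        rhs = [b[i] - sum(G1[i][k] * uh[k] for k in range(n)) + r * uh[i] for i in range(n)]
        u = solve(plus_r(G2, r), rhs)
    return u
```

For a diagonally dominant model system the iterates converge to the direct solution for any r > 0, since both G1 and G2 are then symmetric positive definite.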
The Sturm-Liouville problem is also solved using the AGE
iterative method in Chapter 4, where the eigenvalues and the
corresponding eigenvectors were determined.
The AGE iterative method with the D'Yakonov splitting strategy to
solve the diffusion equation is presented in Chapter 5. The new
strategy yields an algorithm with reduced computational complexity and
greater accuracy for multidimensional problems. The same three
parallel strategies were implemented for this kind of formulation.
279
A new strategy for solving the (n×n) complex tridiagonal matrices
derived from the finite difference/element discretisation of the
Schrödinger equation is presented in Chapter 5. In this strategy, the
solution of the complex system is reduced to the solution of two
identical quindiagonal systems in real variables. These can then be
solved directly by efficient elimination procedures involving no
interchanges. By the use of the quindiagonal solver, the computational
complexity of the algorithm is O(n) multiplications, whilst the
Gaussian elimination procedure for the original complex matrix
requires considerably more.
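The reduction can be illustrated for the common case in which the complex matrix has the form T + iσI, with T real tridiagonal and σ a real scalar (an assumption made here for illustration; the thesis treats the general discretised Schrödinger system, and a dense solver stands in for the quindiagonal elimination). Multiplying by T - iσI gives the real quindiagonal matrix T² + σ²I, and the real and imaginary parts of the solution each satisfy a system with this one matrix:

```python
def solve(M, b):
    """Dense Gaussian elimination (works for real or complex entries)."""
    n = len(b)
    A = [list(row) + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[p] = A[p], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def matvec(M, v):
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(v))]

def solve_complex_tridiagonal(T, sigma, p, q):
    """Solve (T + i*sigma*I) z = p + i*q using only real arithmetic.

    (T - i*sigma*I)(T + i*sigma*I) = T^2 + sigma^2*I is real and
    quindiagonal when T is tridiagonal; Re z and Im z each solve a
    system with this same matrix."""
    n = len(p)
    M = [[sum(T[i][k] * T[k][j] for k in range(n)) + (sigma ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    Tp, Tq = matvec(T, p), matvec(T, q)
    x = solve(M, [Tp[i] + sigma * q[i] for i in range(n)])  # real part
    y = solve(M, [Tq[i] - sigma * p[i] for i in range(n)])  # imaginary part
    return x, y
```

The two real solves share one matrix factorisation, which is the source of the O(n) multiplication count when a banded quindiagonal solver is used instead of the dense one sketched here.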
The fundamental importance of the Laplace transform resides in its
ability to lower the transcendence level of an equation. Ordinary
differential equations involving f(t) are reduced to algebraic
equations for F(s) .
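A standard one-line illustration of this lowering of transcendence (a textbook example, not taken from the thesis):

```latex
\text{ODE: } f'(t) + f(t) = 0,\quad f(0)=1.
\quad\text{Using } \mathcal{L}\{f'\}(s) = sF(s) - f(0),
\text{ the ODE becomes the algebraic equation}
\quad sF(s) - 1 + F(s) = 0
\;\Rightarrow\; F(s) = \frac{1}{s+1}
\;\Rightarrow\; f(t) = e^{-t}.
```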
The evaluation of inverse Laplace transforms is a problem of
fundamental importance in pure and applied mathematics.
Thus, it is often desirable to have a convenient means of computing
f(t) numerically for various values of t from the given values of F(s),
where the function F(s) might be either a complicated expression
whose poles and residues are too difficult to obtain, or might
itself be known only numerically as a computed function of s, without
any knowledge of its explicit analytic form.
A large number of different methods for numerically inverting the
Laplace transform have been developed during the last few years. In
Chapter 6, we briefly mentioned some of these methods, whilst
restricting ourselves to the methods using Fourier series approximations.
We carried out a number of experimental tests to demonstrate the
simplicity and the accuracy of these methods. To speed these
methods up, we implemented Crump's method on the Balance 8000 MIMD
parallel system, where the resulting times are represented
diagrammatically.
Further work is required to include more than one iteration
parameter r in the AGE iterative method for which an improvement in
the rate of convergence is predicted.
280
The study in Chapter 5 of the new strategy for the numerical
solution of the Schrödinger equation can easily be extended to two-
dimensional equations as well as to the nonlinear Schrödinger equation.
Finally, more extensive work is required on the numerical
inversion of the Laplace transform, in particular on techniques for
its efficient application to the numerical solution of certain classes
of linear parabolic and hyperbolic equations, especially those
involving the independent variables time t and two spatial variables
x and y. The basic idea follows the usual Laplace transform approach
in that, after applying the transform to the differential equation,
the number of independent variables is reduced by one and the
resulting subsidiary equation involves a complex parameter s instead
of the variable t.
The numerical application of a suitable inversion procedure
means that the subsidiary equation has to be solved for several
different values of the parameter s in order to obtain the solution
of a parabolic or hyperbolic problem at some specified time t.
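To make the procedure concrete, here is a small sketch for the model problem u_t = u_xx on (0,π) with u(0,t) = u(π,t) = 0 and u(x,0) = sin x (a hypothetical example chosen because its subsidiary equation can be solved in closed form, U(x,s) = sin x/(s+1), rather than one of the thesis's test problems); the subsidiary solution is evaluated at the complex points s = v + ikπ/T required by the Fourier-series inversion:

```python
import math

def invert_series(F, t, T, v, nterms=4000):
    """Truncated Fourier-series (Crump-type) inversion of F(s) at time t."""
    total = 0.5 * F(complex(v, 0.0)).real
    for k in range(1, nterms + 1):
        w = k * math.pi / T
        Fk = F(complex(v, w))
        total += Fk.real * math.cos(w * t) - Fk.imag * math.sin(w * t)
    return math.exp(v * t) / T * total

def heat_solution(x, t, T=5.0, v=2.0):
    """u(x,t) for u_t = u_xx, u(0,t)=u(pi,t)=0, u(x,0)=sin x.

    The Laplace transform removes t: the subsidiary two-point boundary
    value problem U'' - sU = -sin x has the closed-form solution
    U(x,s) = sin x / (s + 1), which is inverted numerically at each x."""
    return invert_series(lambda s: math.sin(x) / (s + 1.0), t, T, v)
```

The exact solution is u(x,t) = e^(-t) sin x, which the sketch reproduces to two or three decimal places; for a genuinely numerical subsidiary problem, the boundary value problem would be solved afresh at each required complex value of s.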
REFERENCES
281
Anderson, J.P. [1965]: "Program Structures for Parallel Processing",
        Comm. of ACM, Vol.8, No.12, pp.786-788.
Baer, J.L. [1982]: "Techniques to Exploit Parallelism", in Parallel
        Processing Systems, An Advanced Course, ed. Evans, D.J.,
        Cambridge Univ. Press, pp.75-99.
Barlow, R.H., Evans, D.J., Newman, I.A. and Woodward, M.C. [1981]:
        "A Guide to Using the Neptune Parallel Processing System",
        Internal Report, Computer Studies Dept., L.U.T.
Baudet, G.M., Brent, R.P. and Kung, H.T. [1980]: "Parallel Execution
        of a Sequence of Tasks on an Asynchronous Multiprocessor",
        The Australian Computer J., Vol.12, No.3, pp.105-112.
Bernstein, A.J. [1966]: "Analysis of Programs for Parallel Processing",
        IEEE Trans. on Electronic Computers, Vol. EC-15, No.5,
        pp.757-763.
Booch, G. [1986]: "Object Oriented Development", IEEE Trans. Software
        Eng., Vol. SE-12, pp.211-221.
Carre, B.A. [1961]: "The Determination of the Optimum Acceleration
        Factor for Successive Over-relaxation", The Comput.J., Vol.4,
        pp.73-78.
Conte, S.D. [1965]: "Elementary Numerical Analysis", McGraw-Hill Inc.,
        New York.
Cooley, J.W. and Tukey, J.W. [1965]: "An Algorithm for the Machine
        Calculation of Complex Fourier Series", Maths.Comp.J.,
        Vol.19, pp.297-301.
Crump, K.S. [1976]: "Numerical Inversion of Laplace Transforms Using a
        Fourier Series Approximation", J. ACM, Vol.23, No.1,
        pp.89-96.
282
Davies, B. and Martin, B. [1979]: "Numerical Inversion of the Laplace
        Transform: A Survey and Comparison of Methods", J. of Comput.
        Phys., Vol.33, pp.1-32.
Dennis, J.B. and Van Horn, E.C. [1966]: "Programming Semantics for
        Multiprogrammed Computations", Comm. of the ACM, Vol.9,
        pp.143-155.
Dijkstra, E.W. [1975]: "Guarded Commands, Nondeterminacy and Formal
        Derivation of Programs", Comm.Assoc.Comput., Vol.18, pp.453-457.
Douglas, J. and Rachford, H.H. [1956]: "On the Numerical Solution of
        Heat Conduction Problems in Two and Three Space Variables",
        Trans.Amer.Math.Soc., Vol.82, pp.421-439.
Dubner, H. and Abate, J. [1968]: "Numerical Inversion of Laplace
        Transforms by Relating Them to the Finite Fourier Cosine
        Transform", J.ACM, Vol.15, No.1, pp.115-123.
Durbin, F. [1973]: "Numerical Inversion of Laplace Transforms: An
        Effective Improvement of Dubner and Abate's Method",
        Comput.J., Vol.17, No.4, pp.371-376.
D'Yakonov, Y. [1963]: "On the Application of Disintegrating Difference
        Operators", Z.Vycisl.Mat.i Mat.Fiz., Vol.3, pp.385-388.
Enslow, P.H. [1977]: "Multiprocessor Organization - A Survey",
        Comput. Surveys, Vol.9, No.1, pp.103-129.
Evans, D.J. and Williams, S.A. [1978]: "Analysis and Detection of
        Parallel Processable Code", The Comput.J., Vol.23, No.1,
        pp.66-72.
Evans, D.J. [1985]: "The AGE Matrix Iterative Method", 3rd Franco-
        South Asian Math.Conf., Kuala Lumpur, Malaysia.
Feng, T.Y. [1974]: "Data Manipulation Functions in Parallel Processors
        and Their Implementations", IEEE Trans. on Comp., Vol. C-23,
        No.3, pp.309-318.
Flynn, M.J. [1966]: "Very High Speed Computing Systems", Proc. IEEE,
        Vol.54, pp.1901-1909.
283
Fox, L. [1957]: "The Numerical Solution of Two-Point Boundary
        Problems in Ordinary Differential Equations",
        The Clarendon Press, Oxford.
Gill, S. [1958]: "Parallel Programming", Comp.J., Vol.1, pp.2-10.
Gurd, J.R., Kirkham, C.C. and Watson, I. [1985]: "The Manchester
        Prototype Dataflow Computer", Comm. ACM, No.1, pp.34-52.
Hageman, L.A. and Kellogg, R.B. [1968]: "Estimating Optimum Over-
        relaxation Parameters", Maths. of Comp., Vol.22, pp.60-68.
Handler, W. [1977]: "The Impact of Classification Schemes on Computer
        Architecture", Int.Conf. on Parallel Proc., pp.7-15.
Harland, D.M. [1985]: "Towards a Language for Concurrent Processes",
        Software Practice and Experience, Vol.15, pp.839-888.
Hellerman, H. [1966]: "Parallel Processing of Algebraic Expressions",
        IEEE Trans. on Electronic Computers, Vol. EC-15, pp.82-91.
Hobbs, L.C. and Theis, D.J. [1970]: "Survey of Parallel Processor
        Approaches and Techniques", in Parallel Systems: Technology
        and Applications, ed. by Hobbs, Spartan Books, New York,
        pp.3-20.
284
Honig, G. and Hirdes, U. [1984]: "A Method for the Numerical
        Inversion of Laplace Transforms", J.Comput. and Appl.Math.,
        Vol.10, pp.113-132.
Hockney, R.W. and Jesshope, C.R. [1988]: "Parallel Computers:
        Architecture, Programming and Algorithms", Adam Hilger,
        Bristol.
285
Hsu, J.T. and Dranoff, J.S. [1987]: "Numerical Inversion of Certain
        Laplace Transforms by the Direct Application of FFT Algorithm",
        Comp.Chem.Engng., Vol.11, No.2, pp.101-110.
INMOS [1984]: "Occam Programming Manual", Prentice-Hall,
        Englewood Cliffs, N.J.
Kuck, D.J. [1977]: "A Survey of Parallel Machine Organization and
        Programming", Comp. Surveys, Vol.9, No.1, pp.29-59.
Kung, H.T. [1980]: "The Structure of Parallel Algorithms", Advances
        in Computers, Vol.19, pp.65-112.
Kung, H.T. [1982]: "Notes on VLSI Computation", in Parallel Processing
        Systems, ed. Evans, D.J., Cambridge Univ. Press, pp.339-356.
Lanczos, C. [1956]: "Applied Analysis", Prentice-Hall, Englewood Cliffs,
        New Jersey.
Lawrie, D.H., Layman, T., Baer, D. and Randal, J.M. [1975]: "Glypnir -
        A Programming Language for ILLIAC IV", Comm. ACM, Vol.18,
        pp.157-164.
286
Miranker, W.L. [1971]: "A Survey of Parallelism in Numerical Analysis",
        SIAM Review, Vol.13, pp.524-547.
Muraoka, Y. [1971]: "Parallelism Exposure and Exploitation in Programs",
        Ph.D. Thesis, Univ. of Illinois at Urbana-Champaign.
Murtha, J. and Beadles, R. [1964]: "Survey of the Highly Parallel
        Information Processing Systems", prepared by Westinghouse
        Electric Corp., Aerospace Division, ONR Rep. No. 4755.
Patel, J.H. [1981]: "Performance of Processor-Memory Interconnections
        for Multiprocessors", IEEE Trans. Comp., Vol. C-30, No.10,
        pp.771-780.
Peaceman, D.W. and Rachford, H.H. [1955]: "The Numerical Solution of
        Parabolic and Elliptic Differential Equations", J.Soc.Indust.
        Appl.Math., Vol.3, pp.28-41.
Ramamoorthy, C.V. and Li, H.F. [1977]: "Pipeline Architecture",
        Comp. Surveys, Vol.9, No.1, pp.61-102.
Salzer, H.E. [1958]: "Tables for the Numerical Calculation of Inverse
        Laplace Transforms", J.Math. and Phys., Vol.37, pp.89-109.
Saul'yev, V.K. [1964]: "Integration of Equations of Parabolic Type by
        the Method of Nets", G.J. Tee, Transl., Pergamon, New York.
Schmittroth, L.A. [1960]: "Numerical Inversion of Laplace Transforms",
        Comm. ACM, Vol.3, pp.171-173.
Schapery, R.A. [1962]: "Proc. 4th U.S. Nat. Congr., Appl.Mech.",
        ASME 2, pp.1075-1085.
Siegel, H.J. [1979]: "Interconnection Networks for SIMD Machines",
        IEEE Comp.J., pp.57-65.
Seitz, C.L. [1982]: "Ensemble Architecture for VLSI - A Survey and
        Taxonomy", in Proc. of MIT Conf. on Advanced Research in
        VLSI, ed. by Penfield, P., Jr., Artech House, pp.130-135.
Shapiro, E.Y. [1984]: "A Subset of Concurrent Prolog and its
        Interpreter", ICOT Inst. for New Generation Technology,
        Technical Rep. TR-003.
287
Shirtliffe, C.J. and Stephenson, D.G. [1961]: "A Computer Oriented
        Adaption of Salzer's Method for Inverting Laplace Transforms",
        Ibid., Vol.40, pp.135-141.
Shore, J.E. [1973]: "Second Thoughts on Parallel Processing", Comput.
        Elect.Engng., pp.95-109.
Simon, R.M., Stroot, M.T. and Weiss, G.H. [1972]: "Numerical
        Inversion of Laplace Transforms with Application to
        Percentage Labeled Experiments", Comput. Biomed. Res.,
        Vol.6, pp.596-607.
288
Stone, H.S. [1967]: "One-Pass Compilation of Arithmetic Expressions
        for a Parallel Processor", Comm. ACM, Vol.10, No.4, pp.220-223.
Stone, H.S. [1971]: "Parallel Processing with the Perfect Shuffle",
        IEEE Trans.Comp., Vol. C-20, pp.153-161.
Stone, H.S. [1973]: "Problems of Parallel Computation", in Complexity
        of Sequential and Parallel Numerical Algorithms, ed. Traub, J.F.,
        Academic Press, New York.
Stone, H.S. [1973]: "An Efficient Parallel Algorithm for the Solution
        of a Tridiagonal Linear System of Equations", J. of ACM,
        Vol.20, No.1, pp.27-38.
Squire, J.S. [1963]: "A Translation Algorithm for Multiprocessor
        Computers", Proc. 15th ACM Natl.Conf.
Tedd, M., Crespi-Reghizzi, S. and Natali, A. [1984]: "Ada for
        Multiprocessors", Ada Companion Series, Cambridge University
        Press.
Varga, R.S. [1962]: "Matrix Iterative Analysis", Prentice-Hall,
        Englewood Cliffs.
Wilkinson, J.H. [1965]: "The Algebraic Eigenvalue Problem", Oxford
        University Press, Oxford.
Wing, O. [1967]: "An Efficient Method of Numerical Inversion of
        Laplace Transforms", Archs.Electron.Comp., Vol.2, pp.153.
Wyllie, J.C. [1979]: "The Complexity of Parallel Computation",
        Ph.D. Thesis, Comp.Sci. Dept., Cornell Univ., U.S.A.
Wynn, P. [1962]: "Upon a Second Confluent Form of the ε-Algorithm",
        Proc. Glasgow Math.Assoc., Vol.5, pp.160.
Young, D. [1954]: "Iterative Methods for Solving Partial Difference
        Equations of Elliptic Type", Trans.Amer.Math.Soc., Vol.76,
        pp.92-111.
Zakian, V. [1969]: Electron.Lett., Vol.5, pp.120-121.
289
APPENDIX A
A LIST OF SOME SELECTED PROGRAMS
c Program 4.1
c One-dimensional boundary value problem solved using the
c parallel AGE iterative method with the third strategy.
c
c     d**2U/dx**2 + U = x
c Boundary conditions: U(0) = 1 , U(pi/2) = pi/2 - 1
c Exact solution:      U(x) = cos(x) - sin(x) + x
c
c n : the order of the system     m : the iteration counter
c a : the subdiagonal of A        c : the superdiagonal of A
c d : the diagonal of A           hd: half the diagonal of A
c b : the right hand side vector  r : the iteration parameter
c
c Set up common block to hold shared variables
      COMMON /SHCOM/ a,b,c,hd,m,n,r,v,w,eps,det,u
c Declare subroutine age as argument to a function
      EXTERNAL age
c Declare system calls
      integer m_fork, m_killprocs, m_setprocs
      integer time1, time2, bite_size
      parameter(jmax=50)
      dimension a(jmax),b(jmax),c(jmax),d(jmax),hd(jmax),det(jmax),
     *          u(0:jmax),u1(0:jmax),u2(0:jmax),u3(0:jmax),
     *          w(jmax),v(jmax)
      data eps,pi/0.000001,3.1415926/
      r=1.44
      h=(pi/2.0)/real(n+1)
      hh=h*h
c Input matrix size
      do 10 i=1,n
        a(i)=-1.0
        c(i)=a(i)
        d(i)=(2.0-hh)
        hd(i)=d(i)/2.0
        b(i)=-hh*i*h
        u(i)=0.1
290
   10 continue
      a(1)=0.0
      c(n)=0.0
      u(0)=1.0
      u(n+1)=pi/2.0 - 1.0
      b(1)=1.0 - hh*h
      b(n)=(pi/2.0-1.0) - hh*h*n
c Divide work up into [n/bite_size] chunks
      x=real(n)/real(bite_size)
      if(real(int(x)).eq.x)then
        noprocs=x
      else
        noprocs=int(x)+1
      endif
c Set the number of processors to be used and therefore the
c number of chunks to be evaluated at the same time
      i=m_setprocs(noprocs)
c Start timing
      call _clock_time(time1)
c Fork offspring then assist them to evaluate subroutine
      i=m_fork(age)
c Finish time
      call _clock_time(time2)
      write(*,100)(time2-time1)/100.0
  100 format(5x,'The time = ',f8.5)
      write(*,101)(u(i),i=1,n)
  101 format(5x,f8.5)
      stop
      end

      subroutine age
c Common block contains shared variable list
      COMMON /SHCOM/ a,b,c,hd,m,n,r,v,w,eps,det,u
      parameter(jmax=50)
      dimension a(jmax),b(jmax),c(jmax),hd(jmax),u(0:jmax),
     *          u1(0:jmax),u2(0:jmax),det(jmax)
c System call
      INTEGER m_next
      integer path,start,finish,bite_size
c Get process path numbers
291
      path=m_next()
      start=(path-1)*bite_size+1
c Whilst there is more work loop
  102 if(start.le.n)then
        finish=start+bite_size-1
c If this is the last item of work, then set finish pointing
c to the end of the last item
        if(path.eq.noprocs) finish=n
        do 103 k=start,finish
          n1=n-1
          n2=n-2
          do 11 i=1,n
            w(i)=hd(i)+r
            v(i)=hd(i)-r
   11       det(i)=(w(i)*w(i+1)-a(i+1)*c(i))
          m=0
   30     m=m+1
          iflag=0
          u1(0)=u(0)
          do 12 i=1,n
   12       u1(i)=u(i)
c ..... The first sweep .....
c u(k+1/2) = (G1+rI)**(-1) * [b-(G2-rI)*u(k)]
          do 13 i=1,n1,2
            i1=i+1
            i2=i+2
            i3=i-1
            v1=b(i)-a(i)*u1(i3)-v(i)*u1(i)
            v2=b(i1)-v(i1)*u1(i1)-c(i1)*u1(i2)
            u2(i)=(w(i1)*v1-c(i)*v2)/det(i)
   13       u2(i1)=(-a(i1)*v1+w(i)*v2)/det(i)
c ..... The second sweep .....
c u(k+1) = (G2+rI)**(-1) * [b-(G1-rI)*u(k+1/2)]
          u(1)=(b(1)-v(1)*u2(1)-c(1)*u2(2))/w(1)
          do 15 i=2,n1,2

292

            i1=i+1
            i2=i+2
            i3=i-1
            v1=b(i)-a(i)*u2(i3)-v(i)*u2(i)
            v2=b(i1)-v(i1)*u2(i1)-c(i1)*u2(i2)
            u(i)=(w(i1)*v1-c(i)*v2)/det(i)
   15       u(i1)=(-a(i1)*v1+w(i)*v2)/det(i)
          u(n)=(b(n)-a(n)*u2(n-1)-v(n)*u2(n))/w(n)
          do 17 i=1,n
            if(abs(u(i)-u1(i)).gt.eps) iflag=1
   17     continue
          if(iflag.eq.1) goto 30
  103   continue
        path=m_next()
        start=bite_size*(path-1)+1
        goto 102
      endif
c All done so return to main program
      return
      end
293
c Program 4.2
c This program is to calculate the eigenvalues and the
c corresponding eigenvectors by the use of the AGE
c iterative method.
c
c     d**2U/dx**2 + (lemda - 2*q*cos(2x))*U = 0
c Boundary conditions: U(0) = 0 , U(pi) = 0
      parameter(jmax=50)
      dimension a(jmax),b(jmax),c(jmax),d(jmax),u(0:jmax),hd(jmax),
     *          da(jmax),ud(jmax),le(jmax),ude(jmax)
      real lemda,lemda2,le
      data eps,pi/0.000001,3.1415926/
      read(*,*)n
c Choose the acceleration parameter
      r=1.50
      h=pi/real(n+1)
      hh=h*h
      q=1.0
      do 10 i=1,n
        xi=i*h
        d(i)=(2.0/hh+2.0*q*cos(2.0*xi))
   10   hd(i)=d(i)/2.0
      do 11 i=1,n
        a(i)=-1.0/hh
        c(i)=a(i)
   11   u(i)=0.1
      u(0)=0.0
      u(n+1)=0.0
      a(1)=0.0
      c(n)=0.0
      sum1=0.0
      sum2=0.0
      ud(1)=hd(1)*u(1)+c(1)*u(2)
      do 13 i=2,n-1
   13   ud(i)=a(i)*u(i-1)+hd(i)*u(i)+c(i)*u(i+1)
      ud(n)=a(n)*u(n-1)+hd(n)*u(n)
      do 14 i=1,n
294
        sum1=sum1+ud(i)*u(i)
   14   sum2=sum2+u(i)*u(i)
      lemda2=sum1/sum2
      l=0
   15 lemda=lemda2
   30 l=l+1
      do 16 i=1,n
   16   b(i)=lemda*u(i)
c Call the AGE subroutine
      call age(a,b,c,hd,m,n,eps,u,r)
      sum1=0.0
      sum2=0.0
      ud(1)=hd(1)*u(1)+c(1)*u(2)
      do 17 i=2,n-1
   17   ud(i)=a(i)*u(i-1)+hd(i)*u(i)+c(i)*u(i+1)
      ud(n)=a(n)*u(n-1)+hd(n)*u(n)
      do 18 i=1,n
        sum1=sum1+u(i)*ud(i)
   18   sum2=sum2+u(i)*u(i)
c Calculate a new value for lemda
      lemda2=sum1/sum2
      if(abs((lemda2-lemda)/lemda).gt.eps) goto 30
      lemda=lemda2
      write(*,19) n,eps,m,r
      write(*,20)(u(i),i=1,n)
      write(*,21) lemda,l
   19 format(/5x,'n = ',i3,5x,'eps = ',f9.6,
     *       /5x,'No. of iterations = ',i4,5x,'r = ',f9.6,
     *       /5x,'The solution vector is '/5x,22('-'))
   20 format(5x,f10.6)
   21 format(5x,'lemda = ',f9.6,4x,'l = ',i3)
      ude(1)=hd(1)*u(1)+c(1)*u(2)
      do 22 i=2,n-1
   22   ude(i)=a(i)*u(i-1)+hd(i)*u(i)+c(i)*u(i+1)
      ude(n)=a(n)*u(n-1)+hd(n)*u(n)
      do 23 i=1,n
295
   23   le(i)=lemda*u(i)
      write(*,24)
      write(*,25)
   24 format(//2x,'A*u(i)',8x,'lemda*u(i)')
   25 format(2x,8('-'),8x,10('-'))
      do 26 i=1,n
   26   write(*,27) ude(i),le(i)
   27 format(2x,f9.6,11x,f9.6)
      stop
      end

      subroutine age(a,b,c,hd,m,n,eps,u,r)
      parameter(jmax=50)
      dimension a(jmax),b(jmax),c(jmax),hd(jmax),u(0:jmax),
     *          u1(0:jmax),u2(0:jmax),det(jmax),w(jmax),v(jmax)
      real length
      n1=n-1
      n2=n-2
      do 28 i=1,n
        w(i)=hd(i)+r
        v(i)=hd(i)-r
   28   det(i)=(w(i)*w(i+1)-a(i+1)*c(i))
      m=0
   29 m=m+1
      u1(0)=u(0)
      do 31 i=1,n
   31   u1(i)=u(i)
c ..... The first sweep .....
      do 32 i=1,n1,2
        i1=i+1
        i2=i+2
        i3=i-1
        v1=b(i)-a(i)*u1(i3)-v(i)*u1(i)
        v2=b(i1)-v(i1)*u1(i1)-c(i1)*u1(i2)
        u2(i)=(w(i1)*v1-c(i)*v2)/det(i)
   32   u2(i1)=(-a(i1)*v1+w(i)*v2)/det(i)
c ..... The second sweep .....
296
297
      u(1)=(b(1)-v(1)*u2(1)-c(1)*u2(2))/w(1)
      do 33 i=2,n2,2
        i1=i+1
        i2=i+2
        i3=i-1
        v1=b(i)-a(i)*u2(i3)-v(i)*u2(i)
        v2=b(i1)-v(i1)*u2(i1)-c(i1)*u2(i2)
        u(i)=(w(i1)*v1-c(i)*v2)/det(i)
   33   u(i1)=(-a(i1)*v1+w(i)*v2)/det(i)
      u(n)=(b(n)-a(n)*u2(n-1)-v(n)*u2(n))/w(n)
      do 34 i=1,n
        if(abs(u(i)-u1(i)).gt.eps) goto 29
   34 continue
      sumsq=0.0
      do 35 i=1,n
   35   sumsq=sumsq+u(i)**2
      length=sqrt(sumsq)
      do 36 i=1,n
   36   u(i)=u(i)/length
      return
      end
c Program 5.1
c One-dimensional diffusion-convection equation solved using
c the parallel AGE algorithm with the second strategy.
c
c     dU/dt = e*d**2U/dx**2 - k*dU/dx ,  0<x<1 , 0<t<oo
c Boundary conditions: U(0,t)=0 , U(1,t)=1
c Initial condition:   U(x,0)=0 , 0<x<1
c
c n : the order of the system    a : the subdiagonal of A
c c : the superdiagonal of A     d : the diagonal of A
c hd: half the diagonal of A     r : the acceleration parameter
c tl: maximum value in the t-direction
c k : increment in time
c sx: maximum value in the x-direction
c h : increment in the x-direction
c nt: number of time levels
      parameter(jmax=50)
      dimension u(0:jmax),u1(0:jmax),u2(0:jmax),u3(0:jmax),
     *          u4(0:jmax)
      integer*4 noprocs,status,time1,time2
c noprocs: the number of processors for each run
c status : this variable is used to store the return value of
c          the library routine that actually sets the number
c          of processors
      data eps,kmax,pi/0.00005,300,3.1415926/
      write(*,*) 'Number of procs, n'
      read(*,*) noprocs,n
      tl=0.1
      sx=1.0
      r=0.5
      k=0.001
      nt=21
      i1=1
      i2=1
298
299

      h=sx/(n+1)
      hh=h*h
      nt1=nt-1
      rl=k/hh
c Input matrix size
      alf=0.5*rl*h
      a=0.5*(-rl-alf)
      c=0.5*(-rl+alf)
      d=(1.0+rl)
      hd=d/2.0
      v=r-hd
      w=r+hd
      det=w*w-a*c
c
      do 10 i=0,n
        u(i)=0.0
   10   u3(i)=u(i)
      a1=c*w
      b1=rl*w
      c1=a*rl
      d1=a*a
      a2=c*c
      b2=c*rl
      d2=a*w
      s1=d1
      s2=d2
      do 11 jt=1,nt1
        t1=jt*k
        jt2=jt+1
        u(n+1)=1.0
        u3(n+1)=u(n+1)
        ea=0.5*(rl+alf)
        f=1.0-rl
        g=0.5*(rl-alf)
        do 12 k=1,kmax
          ic=1
c ..... First sweep .....
c u(k+1/2) = (G1+rI)**(-1) * [(rI-G2)*u(k)+b]
          b11=ea*u3(0)-c*u3(0)+f*u3(1)+g*u3(2)
          u1(1)=(-v*u(1)-a*u(2)+b11)/w
c Set the number of processors. Marks the beginning of the
c timed section of code, which includes everything except I/O.
          status=m_set_procs(noprocs)
          call _clock_time(time1)
c$doacross share(u1,u,ea,u3,f,g,w,a,c,a1,b1,c1,d1,det,a2,b2,
c$&       d2),local(b11,b22,r1)
          do 13 i=3,n-2,2
            b11=ea*u3(i-1)+f*u3(i)+g*u3(i+1)
            b22=ea*u3(i)+f*u3(i+1)+g*u3(i+2)
            r1=w*b11-a*b22
   13       u1(i)=(a1*u(i-1)+b1*u(i)+c1*u(i+1)+d1*u(i+2)+r1)/det
c$doacross share(u1,u,ea,u3,f,g,w,a,c,a1,b1,c1,d1,det,a2,b2,d2)
c$&       ,local(b11,b22,r2)
          do 14 i=2,n-3,2
            b11=ea*u3(i-1)+f*u3(i)+g*u3(i+1)
            b22=ea*u3(i)+f*u3(i+1)+g*u3(i+2)
            r2=w*b22-c*b11
   14       u1(i)=(a2*u(i-1)+b2*u(i)+b1*u(i+1)+d2*u(i+2)+r2)/det
          b11=ea*u3(n-2)+f*u3(n-1)+g*u3(n)
          b22=ea*u3(n-1)+f*u3(n)+g*u3(n+1)-a
          r1=w*b11-a*b22
          r2=w*b22-c*b11
          u1(n-1)=(a1*u(n-2)+b1*u(n-1)+c1*u(n)+r1)/det
          u1(n)=(a2*u(n-2)+b2*u(n-1)+b1*u(n)+r2)/det
c ..... Second sweep .....
c u(k+1) = (G2+rI)**(-1) * [(rI-G1)*u(k+1/2)+b]

300

          b11=ea*u3(0)+f*u3(1)+g*u3(2)
          b22=ea*u3(1)+f*u3(2)+g*u3(3)
          r11=w*b11-a*b22
          r22=w*b22-c*b11
          u2(1)=(b1*u1(1)+c1*u1(2)+s1*u1(3)+r11)/det
          u2(2)=(b2*u1(1)+b1*u1(2)+s2*u1(3)+r22)/det
c$doacross share(u2,ea,u3,f,g,w,a,c,a1,b1,c1,s1,u1,det,a2,b2,
c$&       s2),local(b11,b22,r11)
          do 15 i=3,n-2,2
            b11=ea*u3(i-1)+f*u3(i)+g*u3(i+1)
            b22=ea*u3(i)+f*u3(i+1)+g*u3(i+2)
            r11=w*b11-a*b22
   15       u2(i)=(a1*u1(i-1)+b1*u1(i)+c1*u1(i+1)+s1*u1(i+2)
     *            +r11)/det
c$doacross share(u2,ea,u3,f,g,w,a,c,a1,u1,b1,c1,s1,det,a2,b2,
c$&       s2),local(b11,b22,r22)
          do 16 i=4,n-1,2
            b11=ea*u3(i-1)+f*u3(i)+g*u3(i+1)
            b22=ea*u3(i)+f*u3(i+1)+g*u3(i+2)
            r22=w*b22-c*b11
   16       u2(i)=(a2*u1(i-1)+b2*u1(i)+b1*u1(i+1)+s2*u1(i+2)
     *            +r22)/det
          bn=ea*u3(n-1)+f*u3(n)+g*u3(n+1)-a
          u2(n)=(-c*u1(n-1)-v*u1(n)+bn)/w
c Generate solutions on each time level. Set ic=1 for
c successful convergence and 0 otherwise.
c Begin iterative process.
          do 17 i=1,n
            if(abs(u2(i)-u(i))-eps)17,17,18
   18       ic=0
   17     continue
          do 19 i=1,n
   19       u(i)=u2(i)
          if(ic.ne.1) goto 12
          do 20 i=1,n
   20       u3(i)=u2(i)
c Terminate the timing. Terminate the child processes
c created by the c$doacross.
          call _clock_time(time2)
          call m_kill_procs
          print*,'The time = ',(time2-time1)/100.0

301

          do 21 jw=i1,nt1,i1
            if(jt.eq.jw)then
              write(*,23)jt2,t1
              write(*,22)(u(i),i=1,n)
   23         format(/,'The AGE iterative solutions at time level',
     *               ' nt = ',i3/' time t = ',f10.6)
   22         format(2x,10f10.6)
              goto 21
            else
              goto 21
            endif
   21     continue
          goto 11
   12   continue
        write(*,24)kmax,jt2
   24   format('Method fails to converge in ',i4,' iterations',/
     *         'at time level nt = ',i3)
   11 continue
      stop
      end
302
c Program 5.2
c One-dimensional heat equation solved using the parallel AGE
c algorithm with the first strategy (AGE with D'Yakonov
c splitting).
c
c     dU/dt = d**2U/dx**2 ,  0<x<pi , 0<t<nt
c Boundary conditions: U(0,t)=0 , U(pi,t)=0 , 0<t<nt
c Initial condition:   U(x,0)=sin(x) , 0<x<pi
c
c n : the order of the system    a : the subdiagonal of A
c c : the superdiagonal of A     d : the diagonal of A
c hd: half the diagonal of A     r : the acceleration parameter
c tl: maximum value in the t-direction
c k : increment in time
c sx: maximum value in the x-direction
c h : increment in the x-direction
c nt: number of time levels
      parameter(jmax=100)
      dimension u(0:jmax),u1(0:jmax),u2(0:jmax),u3(0:jmax),
     *          u4(0:jmax),er(jmax),re(jmax)
      integer*4 noprocs,status,time1,time2
c noprocs: the number of processors for each run
c status : this variable is used to store the return value of
c          the library routine that actually sets the number
c          of processors
      data pi,eps,kmax/3.1415926,0.000001,300/
      write(*,*)'Number of procs, n'
      read(*,*)noprocs,n
      tl=0.1
      k=0.005
      sx=pi
      r=0.5
      nt=21
      i1=4
      i2=1
      h=pi/(n+1)
303
41 42 43 44 c 45 c 46 47 48 49
.50 51 52 53 54 55 56 57 10 58 c 59 c 60 61 62 63 64 65 66 67
.68 c 69 70 71 72 73 74 75 14 76 77 c 78 c 79
hh=h·h nt1 =nt1 rl=klhh Input data
a=rl c=a hd=1.0+r1 w=hd+r v=rhd aa=a·a cc=c·c w=v·v det=w·wa·c do10i=1,n u(i)=sin(i·h) u4(i)=u(i)
a1 =(v·w·c+a·cc) b1 =(w·w+v·a·c) c1 =(v·w·a+w·a) d1 =(aa'w+v'aa) a2=(v·cc+w·cc) b2=(w·c+v·w·c) c2=(v·a·c+w·w) d2=(aa·c+v·w·a)
do 12 jt=1,nt1 t1 =jt*k jt2=jt+ 1 do 13 k=1 ,kmax ic=1 do 14 i=1 ,n u3(i)=exp(t1 )*sin(i·h) ••••• First Sweep .....
========== u(k+ 1/2)=(G1 +ri)<1 >·[(riG2)·u(k)+bl b11 =(2.02.0·rl)·u4(1 )+rI·u4(2)
      u1(1)=(vv*u(1)-v*a*u(2)+b11)/w
c
c     Set the number of processors.
c     Marks the beginning of the timed section of code,
c     which includes everything except I/O.
c
      status=m_set_procs(nprocs)
      call clock_time(time1)
c$doacross share(u4,rl,u1,u,a1,b1,c1,d1,a2,b2,c2,d2,det,w,c,a),
c$&        local(b11,b22,r1,r2)
      do 15 i=2,n-3,2
      b11=rl*(u4(i-1)+u4(i+1))+(2.0-2.0*rl)*u4(i)
      b22=rl*(u4(i)+u4(i+2))+(2.0-2.0*rl)*u4(i+1)
      r1=(w*b11-a*b22)
      r2=(-c*b11+w*b22)
      u1(i)=(a1*u(i-1)+b1*u(i)+c1*u(i+1)+d1*u(i+2)+r1)/det
   15 u1(i+1)=(a2*u(i-1)+b2*u(i)+c2*u(i+1)+d2*u(i+2)+r2)/det
      b11=rl*(u4(n-2)+u4(n))+(2.0-2.0*rl)*u4(n-1)
      b22=rl*u4(n-1)+(2.0-2.0*rl)*u4(n)
      r1=(w*b11-a*b22)
      r2=(-c*b11+w*b22)
      u1(n-1)=(a1*u(n-2)+b1*u(n-1)+c1*u(n)+r1)/det
      u1(n)=(a2*u(n-2)+b2*u(n-1)+c2*u(n)+r2)/det
c
c     ***** Second Sweep *****
c     ========================
c     u(k+1) = (G2+rI)**(-1)*u(k+1/2)
c
      u2(1)=(w*u1(1)-a*u1(2))/det
      u2(2)=(-c*u1(1)+w*u1(2))/det
c$doacross share(u2,u1,a,c,w,det),local(i)
      do 16 i=3,n-1,2
      u2(i)=(w*u1(i)-a*u1(i+1))/det
   16 u2(i+1)=(-c*u1(i)+w*u1(i+1))/det
      u2(n)=u1(n)/w
c
c     Generate solutions on each time level.
c     Set ic=1 for successful convergence and 0 otherwise.
c     Begin iterative process.
c
      do 17 i=1,n
      if(abs(u2(i)-u(i))-eps) 17,17,18
   18 ic=0
   17 continue
      do 19 i=1,n
   19 u(i)=u2(i)
      do 20 i=1,n
   20 er(i)=abs(u(i)-u3(i))
      do 21 i=1,n
   21 re(i)=abs((u3(i)-u(i))/u3(i))
      if(ic.ne.1) goto 13
      do 22 i=1,n
   22 u4(i)=u2(i)
c
c     Terminates the timing.
c     Terminates the child processes created by the c$doacross.
c
      call clock_time(time2)
      call m_kill_procs
      print *,'The time =',(time2-time1)/100.0
      do 23 jw=i1,nt1,i1
      if(jt.eq.jw) then
      write(*,24) jt2,t1
      write(*,25) l
      write(*,26)
      write(*,27)
      write(*,28)(u(i),i=1,n,i2)
      write(*,29)
      write(*,27)
      write(*,28)(u3(i),i=1,n,i2)
      write(*,30)
      write(*,27)
      write(*,28)(er(i),i=1,n,i2)
      write(*,31)
      write(*,27)
      write(*,28)(re(i),i=1,n,i2)
   24 format(/,' AGE iterative solution at time level nt = ',i3,/,
     *       ' Time t = ',f10.6)
   25 format(' Method converges with l = ',i4,' iterations ',/)
   26 format(' The numerical solution is ')
   27 format(' ==========================================')
   28 format(10f10.6)
   29 format(' The exact solution is ')
   30 format(' The absolute error is ')
   31 format(' The relative error is ')
      goto 23
      else
      goto 23
      endif
   23 continue
      goto 12
   13 continue
      write(*,32) kmax,jt2
   32 format(' Method fails to converge in ',i4,' iterations',/,
     *       ' At time level nt = ',i4)
   12 continue
      stop
      end
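The two sweeps of the listing can be sketched in plain Python. This is a minimal reconstruction under my reading of the program, not the thesis code: the Fortran precomputes the composite coefficients a1..d2 of (G1+rI)**(-1)(rI-G1)(rI-G2) for each 2x2 block, whereas the sketch applies the three factors step by step; and the D'Yakonov right-hand side carries a factor 2r, which with the listing's r = 0.5 equals 1 and so never appears explicitly in the Fortran. All function and variable names here are mine.

```python
import math

def age_step(u, b, rho, r):
    """One D'Yakonov AGE iteration for A*u = b, where A is the
    Crank-Nicolson matrix tridiag(-rho, 2+2*rho, -rho), split as
    A = G1 + G2 (each G holds half the diagonal plus alternate 2x2
    off-diagonal blocks; n must be odd):
        (G1 + r*I) u_half = (r*I - G1)(r*I - G2) u + 2*r*b
        (G2 + r*I) u_new  = u_half
    """
    n = len(u)
    assert n % 2 == 1
    hd = 1.0 + rho            # half the diagonal of A
    a = c = -rho              # sub- and super-diagonal of A
    w = hd + r
    v = r - hd
    det = w * w - a * c       # determinant of each 2x2 block of G + r*I
    # z = (r*I - G2) u;  G2 has pairs (0,1),(2,3),... and a singleton at n-1
    z = [0.0] * n
    for i in range(0, n - 1, 2):
        z[i]     = v * u[i] - c * u[i + 1]
        z[i + 1] = -a * u[i] + v * u[i + 1]
    z[n - 1] = v * u[n - 1]
    # y = (r*I - G1) z + 2*r*b;  G1 has a singleton at 0 and pairs (1,2),...
    y = [0.0] * n
    y[0] = v * z[0] + 2.0 * r * b[0]
    for i in range(1, n - 1, 2):
        y[i]     = v * z[i] - c * z[i + 1] + 2.0 * r * b[i]
        y[i + 1] = -a * z[i] + v * z[i + 1] + 2.0 * r * b[i + 1]
    # First sweep: solve the block-diagonal system (G1 + r*I) u_half = y.
    uh = [0.0] * n
    uh[0] = y[0] / w
    for i in range(1, n - 1, 2):
        uh[i]     = (w * y[i] - c * y[i + 1]) / det
        uh[i + 1] = (-a * y[i] + w * y[i + 1]) / det
    # Second sweep: solve (G2 + r*I) u_new = u_half.
    un = [0.0] * n
    for i in range(0, n - 1, 2):
        un[i]     = (w * uh[i] - c * uh[i + 1]) / det
        un[i + 1] = (-a * uh[i] + w * uh[i + 1]) / det
    un[n - 1] = uh[n - 1] / w
    return un

# One Crank-Nicolson step of dU/dt = d**2U/dx**2 on (0, pi), U(x,0) = sin x.
n, k = 9, 0.005
h = math.pi / (n + 1)
rho = k / (h * h)
r = 0.5                       # acceleration parameter, as in the listing
u_old = [math.sin((i + 1) * h) for i in range(n)]
# Right-hand side b = tridiag(rho, 2-2*rho, rho) u_old (zero boundaries).
b = [rho * ((u_old[i - 1] if i else 0.0) +
            (u_old[i + 1] if i < n - 1 else 0.0)) +
     (2.0 - 2.0 * rho) * u_old[i] for i in range(n)]
u = u_old[:]
for _ in range(300):
    u_new = age_step(u, b, rho, r)
    if max(abs(p - q) for p, q in zip(u_new, u)) < 1e-12:
        u = u_new
        break
    u = u_new
# The converged iterate satisfies the Crank-Nicolson equations A u = b.
residual = max(abs(-rho * ((u[i - 1] if i else 0.0) +
                           (u[i + 1] if i < n - 1 else 0.0)) +
                   (2.0 + 2.0 * rho) * u[i] - b[i]) for i in range(n))
assert residual < 1e-9
```

Because G1 and G2 are symmetric positive definite, each sweep factor is a contraction for any r > 0, so the iteration converges for every time step; the fixed point satisfies (G1+rI)(G2+rI)u = (rI-G1)(rI-G2)u + 2r*b, which reduces to A u = b.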