7/23/2019 (Advances in Computational Economics 15) Erricos John Kontoghiorghes (Auth.)-Parallel Algorithms for Linear Mode
1/195
PARALLEL ALGORITHMS FOR LINEAR MODELS
Advances in Computational Economics
VOLUME 15

SERIES EDITORS
Hans Amman, University of Amsterdam, Amsterdam, The Netherlands
Anna Nagurney, University of Massachusetts at Amherst, USA

EDITORIAL BOARD
Anantha K. Duraiappah, European University Institute
John Geweke, University of Minnesota
Manfred Gilli, University of Geneva
Kenneth L. Judd, Stanford University
David Kendrick, University of Texas at Austin
Daniel McFadden, University of California at Berkeley
Ellen McGrattan, Duke University
Reinhard Neck, University of Klagenfurt
Adrian R. Pagan, Australian National University
John Rust, University of Wisconsin
Berc Rustem, University of London
Hal R. Varian, University of Michigan

The titles published in this series are listed at the end of this volume.
Parallel Algorithms for Linear Models
Numerical Methods and Estimation Problems

by
Erricos John Kontoghiorghes
Université de Neuchâtel, Switzerland

Springer Science+Business Media, LLC
Library of Congress Cataloging-in-Publication Data

Kontoghiorghes, Erricos John.
Parallel algorithms for linear models : numerical methods and estimation problems / by Erricos John Kontoghiorghes.
p. cm. -- (Advances in computational economics ; v. 15)
Includes bibliographical references and indexes.
ISBN 978-1-4613-7064-2    ISBN 978-1-4615-4571-2 (eBook)
DOI 10.1007/978-1-4615-4571-2
1. Linear models (Statistics)--Data processing. 2. Parallel algorithms. I. Title. II. Series.
QA276 .K645 2000
519.5'35--dc21
99-056040

Copyright © 2000 by Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers, New York in 2000
Softcover reprint of the hardcover 1st edition 2000

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.
To Laurence and Louisa
Contents

List of Figures ix
List of Tables xi
List of Algorithms xiii
Preface xv

1. LINEAR MODELS AND QR DECOMPOSITION 1
1 Introduction 1
2 Linear model specification 1
2.1 The ordinary linear model 2
2.2 The general linear model 7
3 Forming the QR decomposition 10
3.1 The Householder method 11
3.2 The Givens rotation method 13
3.3 The Gram-Schmidt orthogonalization method 16
4 Data parallel algorithms for computing the QR decomposition 17
4.1 Data parallelism and the MasPar SIMD system 17
4.2 The Householder method 19
4.3 The Gram-Schmidt method 21
4.4 The Givens rotation method 22
4.5 Computational results 23
5 QRD of large and skinny matrices 23
5.1 The CPP GAMMA SIMD system 24
5.2 The Householder QRD algorithm 25
5.3 QRD of skinny matrices 27
6 QRD of a set of matrices 29
6.1 Equal size matrices 29
6.2 Matrices with different numbers of columns 34

2. OLM NOT OF FULL RANK 39
1 Introduction 39
2 The QLD of the coefficient matrix 40
2.1 SIMD implementation 41
3 Triangularizing the lower trapezoid 43
3.1 The Householder method 43
3.2 The Givens method 46
4 Computing the orthogonal matrices 49
5 Discussion 54

3. UPDATING AND DOWNDATING THE OLM 57
1 Introduction 57
2 Adding observations 58
2.1 The hybrid Householder algorithm 60
2.2 The Bitonic and Greedy Givens sequences 67
2.3 Updating with a block lower-triangular matrix 75
2.4 QRD of structured banded matrices 82
2.5 Recursive and linearly constrained least-squares 87
3 Adding exogenous variables 90
4 Deleting observations 92
4.1 Parallel strategies 94
5 Deleting exogenous variables 99

4. THE GENERAL LINEAR MODEL 105
1 Introduction 105
2 Parallel algorithms 108
3 Implementation and performance analysis 111

5. SURE MODELS 117
1 Introduction 117
2 The generalized linear least squares method 121
3 Triangular SURE models 123
3.1 Implementation aspects 127
4 Covariance restrictions 129
4.1 The QLD of the block bi-diagonal matrix 133
4.2 Parallel strategies 138
4.3 Common exogenous variables 140

6. SIMULTANEOUS EQUATIONS MODELS 147
1 Generalized linear least squares 149
1.1 Estimating the disturbance covariance matrix 151
1.2 Redundancies 152
1.3 Inconsistencies 153
2 Modifying the SEM 154
3 Linear Equality Constraints 157
3.1 Basis of the null space and direct elimination methods 158
4 Computational Strategies 160

References 163
Author Index 177
Subject Index 179
List of Figures

1.1 Geometric interpretation of least-squares for the OLM problem. 4
1.2 Illustration of Algorithm 1.2, where m = 4 and n = 3. 15
1.3 The column and diagonally based Givens sequences for computing the QRD. 15
1.4 Cyclic mapping of a matrix and a vector on the MasPar MP-1208. 18
1.5 Examples of Givens rotations schemes for computing the QRD. 22
1.6 Execution time ratio between 2-D and 3-D algorithms for computing the QRDs, where G = 16. 34
1.7 Stages of computing the QRDs (1.47). 36
2.1 Annihilation pattern of (2.4) using Householder reflections. 44
2.2 Givens sequences for computing the orthogonal factorization (2.4). 47
2.3 Illustration of the implementation phases of PGS, where es = 4. 49
2.4 The fill-in of the submatrix P_{1:n,1:n} at each phase of Algorithm 2.4. 53
3.1 Updating Givens sequences for computing the orthogonal factorizations (3.6), where k = 8 and n = 4. 59
3.2 Ratio of the execution times produced by the models of the cyclic-layout and column-layout implementations. 63
3.3 Computing (3.21) using Givens rotations. 71
3.4 The bitonic algorithm, where n = 6, k = 18 and p1 = p2 = p3 = 6. 72
3.5 The Greedy sequence for computing (3.6a), where n = 6 and k = 18. 73
3.6 Computing the factorization (3.23) using the diagonally-based method, where G = 5. 76
3.7 Parallel strategies for computing the factorization (3.24). 77
3.8 Computing factorization (3.23). 78
3.9 The column-based method using the UGS-2 scheme. 80
3.10 The column-based method using the Greedy scheme. 81
3.11 Illustration of the annihilation patterns of method-1. 83
3.12 Computing (3.31) for b = 8, ϑ* = 3 and j = 2. Only the affected matrices are shown. 85
3.13 Illustration of method-3, where p = 4 and g = 1. 86
3.14 Givens parallel strategies for downdating the QRD. 96
3.15 Illustration of the SK-based scheme for computing the QRD of RS. 102
3.16 Greedy-based schemes for computing the QRD of RS. 104
4.1 Sequential Givens sequences for computing the QLD (4.3a). 107
4.2 The SK sequence. 109
4.3 G(16)B with e(16, 18, 8) = 8. 109
4.4 The application of the SK sequence to compute (4.3) on a 2-D SIMD computer. 109
4.5 Examples of the MSK(p) sequence for computing the QLD. 110
5.1 The correlations ρ_{i,j} in the SURE-CC model for ϑ_j = i and ϑ_j = 1/i. 131
5.2 Factorization process for computing the QLD (5.35) using Algorithm 5.3. 136
5.3 Annihilation sequences of computing the factorization (5.40). 137
5.4 Givens sequences of computing the factorization (5.45). 138
5.5 Number of CDGRs for computing the orthogonal factorization (5.40) using the PDS. 139
5.6 Annihilation sequence of triangularizing (5.55). 144
6.1 Givens sequence for computing the QRD of RS_i. 161
List of Tables

1.1 Times (in seconds) of computing the QRD of a 128M × 64N matrix. 23
1.2 Execution times (in seconds) of the CPP LALIB QR_FACTOR subroutine and the BPHA. 27
1.3 Execution times (in seconds) of the CPP LALIB QR_FACTOR subroutine and Algorithm 1.9. 28
1.4 Times (in seconds) of simultaneously computing the QRDs (1.47). 33
1.5 The task-farming and scattering methods for computing the QRDs (1.47). 38
2.1 Computing the QLD (2.3) (in seconds), where m = Mes and n = Nes. 44
2.2 The CDGRs of the PGS for computing the factorization (2.4). 47
2.3 Computing (2.4) (in seconds), where k = Kes and n − k = ηes. 50
2.4 Times (in seconds) of reconstructing the orthogonal matrices Q^T and P on the DAP. 54
3.1 Execution times (msec) of the Householder and Givens methods for updating the QRD on the DAP. 60
3.2 Execution times (in seconds) for k = 11264. 63
3.3 Execution times (in seconds) of the RLS Householder algorithm on the MasPar. 65
3.4 Execution times (in seconds) of the RLS Householder algorithm on the GAMMA. 66
3.5 Number of CDGRs required to compute the factorization (3.6a). 74
3.6 Times (in seconds) for computing the orthogonal factorization (3.6a). 74
3.7 Computing the QRD of a structured banded matrix using method-3. 87
3.8 Estimated time (msec) required to compute x(i) (i = 2, 3, ...), where m_i = 96, n = 32N and k = 32K. 91
3.9 Execution time (in seconds) for downdating the OLM. 98
4.1 Execution times (in seconds) of the MSK(λes/2). 114
4.2 Computing (4.3) (in seconds) without explicitly constructing Q^T and P. 115
5.1 Computing (5.24), where T − k − 1 = τes, G − 1 = ηes and es = 32. 128
5.2 Execution times of Algorithm 5.2 for solving RΓ = Δ. 129
List of Algorithms

1.1 Computing the QRD of A ∈ ℝ^{m×n} using Householder transformations. 12
1.2 The column-based Givens sequence for computing the QRD of A ∈ ℝ^{m×n}. 14
1.3 The diagonally-based Givens sequence for computing the QRD of A ∈ ℝ^{m×n}. 16
1.4 The Classical Gram-Schmidt method for computing the QRD of A ∈ ℝ^{m×n}. 16
1.5 The Modified Gram-Schmidt method for computing the QRD. 17
1.6 QR factorization by Householder transformations on SIMD systems. 20
1.7 The MGS method for computing the QRD on SIMD systems. 21
1.8 The CPP LALIB method for computing the QR decomposition. 26
1.9 Householder with parallelism in the first dimension. 28
1.10 The Householder algorithm. 30
1.11 The Modified Gram-Schmidt algorithm. 31
1.12 The task-farming approach for computing the QRDs (1.47) on p (p ≤ G) processors using a SPMD paradigm. 37
2.1 The QL decomposition of A. 43
2.2 Triangularizing the lower trapezoid using Householder reflections. 45
2.3 The reconstruction of the orthogonal matrix Q in (2.3). 51
2.4 The reconstruction of the orthogonal matrix P in (2.4). 53
3.1 The data-parallel Householder algorithm. 61
3.2 The bitonic algorithm for updating the QRD, where R ≡ R_{i−1}. 70
3.3 The computation of (3.63) using Householder transformations. 97
5.1 An iterative algorithm for solving tSURE models. 126
5.2 The parallel solution of the triangular system RΓ = Δ. 129
5.3 Computing the QLD (5.35). 135
Preface

The monograph provides a complete and detailed account of the design, analysis and implementation of parallel algorithms for solving large-scale linear models. It investigates and presents efficient, numerically stable algorithms for computing the least-squares estimators and other quantities of interest on massively parallel systems.

The least-squares computations are based on orthogonal transformations, in particular the QR and QL decompositions. Parallel algorithms employing Givens rotations and Householder transformations have been designed for various linear model estimation problems. Some of the algorithms presented are parallel versions of serial methods while others are original designs. The implementation of the major parallel algorithms is described. The necessary techniques and insights needed for implementing efficient parallel algorithms on multiprocessor systems are illustrated in detail. Although most of the algorithms have been implemented on SIMD systems, the data parallel computations of these algorithms should, in general, be applicable to any massively parallel computer.

The monograph is in two parts. The first part consists of four chapters and deals with the computational aspects of solving linear models that have applicability in diverse areas. The remaining two chapters form the second part, which concentrates on numerical and computational methods for solving various problems associated with seemingly unrelated regression equations (SURE) and simultaneous equations models.

Chapter 1 provides a brief introduction to linear models and considers various forms for computing the QR decomposition on serial and parallel systems. Emphasis is given to the design and efficient implementation of the parallel algorithms. The second chapter investigates the performance and practical issues of solving the ordinary linear model (OLM), with the exogenous matrix being ill-conditioned or having deficient rank, on a SIMD system.
Chapter 3 is devoted to methods for up- and down-dating the OLM. It provides the necessary computational tools and techniques that are often required in econometrics and optimization. The efficient parallel strategies for modifying the OLM can be used as primitives for designing fast econometric algorithms. For example, the Givens and Householder algorithms used to compute the QR decomposition after rows have been added or columns have been deleted from the original matrix have been efficiently employed in the solution of the SURE and simultaneous equations models. The updating methods are also employed to solve the recursive ordinary linear model with linear equality constraints. The numerical methods based on the basis of the null space and direct elimination methods are in turn adopted for the solution of linearly constrained simultaneous equations models.

The fourth chapter investigates parallel algorithms for solving the general linear model - the parent model of econometrics - when it is considered as a generalized linear least-squares problem. This approach has subsequently been efficiently used to compute solutions of SURE and simultaneous equations models without having as prerequisite the non-singularity of the variance-covariance matrix of the disturbances. Chapter 5 presents a parallel algorithm for solving triangular SURE models. The problem of computing estimates of parameters in SURE models with variance inequalities and positivity of correlations constraints is also considered. Finally, chapter 6 presents algorithms for computing the three-stage least-squares estimator of simultaneous equations models (SEMs). Numerical and computational methods for solving SEMs with separable linear equality constraints and when the SEM has been modified by deleting or adding new observations or variables are discussed. Expressions revealing linear combinations between the observations which become redundant are also presented.

These novel computational methods for solving SURE and simultaneous equations models provide new insights that can be useful to econometric modelling. Furthermore, the computationally and numerically efficient treatment of these models, which are regarded as the core of econometric theory, can be considered as the basis for future research. The algorithms can be extended or modified to deal with models that occur in particular econometric applications and have specific characteristics that need to be taken into account.

The practical issues of the parallel algorithms and the theoretical aspects of the numerical methods will be of interest to a broad range of researchers working in the areas of numerical and computational methods in statistics and econometrics, parallel numerical algorithms, parallel computing and numerical linear algebra. The aim of this monograph is to promote research in the interface of econometrics, computational statistics, numerical linear algebra and parallelism.
The research described in this monograph is based on the work that I have pursued in the last ten years. During this period I was privileged to have the opportunity to discuss various issues related to my work with Maurice Clint. His numerous suggestions and constructive comments have been both inspiring and invaluable. I am grateful to Dennis Parkinson for the valuable information that he has provided on many occasions on various aspects related to SIMD systems, David A. Belsley for his constructive comments and advice on the solution of SURE and simultaneous equations models, Hans-Heinrich Nägeli for his comments and constructive criticism on performance issues of parallel algorithms, and the late Mike R.B. Clarke for his suggestions on Givens sequences and matrix computations. I am indebted to Paolo Foschi and Manfred Gilli for their comments on this monograph and to Sharon Silverne for proofreading the manuscript. The author accepts full responsibility for any errors that may be found in this work.

Some of the results of this monograph were originally published in various papers [69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 84, 85, 86, 87, 88] and are reproduced by kind permission of Elsevier Science Publishers B.V. © 1993, 1994, 1995, 1999; Gordon and Breach Publishers © 1993, 1995; John Wiley & Sons Limited © 1996, 1999; IEEE © 1993; Kluwer Academic Publishers © 1997, 1999; Principia Scientia © 1996, 1997; SAP-Slovak Academic Press Ltd. © 1995; and Springer-Verlag © 1993, 1996, 1999.
Chapter 1

LINEAR MODELS AND QR DECOMPOSITION

1 INTRODUCTION
A common problem in statistics is that of estimating parameters of some assumed relationship between one or more variables. One such relationship is

y = f(a_1, ..., a_n),   (1.1)

where y is the dependent (endogenous, explained) variable and a_1, ..., a_n are the independent (exogenous, explanatory) variables. Regression analysis estimates the form of the relationship (1.1) by using the observed values of the variables. This attempt at describing how these variables are related to each other is known as model building.

Exact functional relationships such as (1.1) are inadequate descriptions of statistical behavior. Thus, the specification of the relationship (1.1) is extended as

y = f(a_1, ..., a_n) + ε,   (1.2)

where ε is the disturbance term or error, whose specific value in any single observation cannot be predicted. The purpose of ε is to characterize the discrepancies that emerge between the actual observed value of y and the values that would be assigned by an exact functional relationship. The difference between the observed and predicted value of y is called the residual.
2 LINEAR MODEL SPECIFICATION

A linear model is one in which y, or some transformation of y, can be expressed as a linear function of a_i, or some transformation of a_i (i = 1, ..., n). Here only linear models where endogenous and exogenous variables do not require any transformations will be considered. In this case, the relationship
(1.2) can be written as

y = a_1 x_1 + a_2 x_2 + ... + a_n x_n + ε,   (1.3)

where x_i (i = 1, ..., n) are unknown constants.

If there are m (m > n) sample observations, the linear model (1.3) gives rise to the following set of m equations

y_1 = a_11 x_1 + a_12 x_2 + ... + a_1n x_n + ε_1
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2n x_n + ε_2
 ⋮
y_m = a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n + ε_m,

or

( y_1 )   ( a_11  a_12  ...  a_1n ) ( x_1 )   ( ε_1 )
( y_2 ) = ( a_21  a_22  ...  a_2n ) ( x_2 ) + ( ε_2 )
(  ⋮  )   (  ⋮     ⋮          ⋮   ) (  ⋮  )   (  ⋮  )
( y_m )   ( a_m1  a_m2  ...  a_mn ) ( x_n )   ( ε_m ).   (1.4)

In compact form the latter can be written as

y = Ax + ε,   (1.5)

where y, ε ∈ ℝ^m, A ∈ ℝ^{m×n} and x ∈ ℝ^n.
To complete the description of the linear model (1.5), characteristics of the error term ε and the matrix A must be specified. The first assumption is that the expected value of ε is zero, that is, E(ε) = 0. The second assumption is that the various values of ε are normally distributed. The final assumption is that A is a non-stochastic matrix, which implies E(A^T ε) = 0. In summary, the complete mathematical specification of the (general) linear model which is being considered is

y = Ax + ε,   ε ~ N(0, σ²Ω).   (1.6)

The notation ε ~ N(0, σ²Ω) indicates that the error vector is assumed to come from a normal distribution with mean zero and variance-covariance (or dispersion) matrix σ²Ω, where Ω is a symmetric non-negative definite matrix and σ is an unknown scalar [124].

2.1 THE ORDINARY LINEAR MODEL

Consider the Ordinary Linear Model (OLM):

y = Ax + ε,   ε ~ N(0, σ² I_m).   (1.7)
The OLM assumptions are that each ε_i has the same variance and all disturbances are pairwise uncorrelated. That is, Var(ε_i) = σ² and ∀ i ≠ j: E(ε_i ε_j) = 0. The first assumption is known as homoscedasticity (homogeneous variances).

The most frequently used estimating technique for the OLM (1.7) is least-squares. Least-squares (LS) estimation involves minimizing the sum of squares of residuals: that is, finding an n element vector x which minimizes

e^T e = (y − Ax)^T (y − Ax).   (1.8)

For the minimization of (1.8), e^T e is differentiated with respect to x, which is treated as a variable vector, and the differentials are equated to zero. Thus,

∂(e^T e)/∂x = −2 y^T A + 2 x^T A^T A

and, setting ∂(e^T e)/∂x = 0, gives the least-squares normal equations

A^T A x = A^T y.   (1.9)

Assuming that A is of full column rank, that is, (A^T A)^{−1} exists, the least-squares estimator can be computed as

x̂ = (A^T A)^{−1} A^T y   (1.10)

and the variance-covariance matrix of the estimator x̂ is given by

Var(x̂) = σ² (A^T A)^{−1}.   (1.11)
The terminology of normal equations is expressed in terms of the following geometric interpretation of least-squares. The columns of A span a subspace in ℝ^m which is referred to as the manifold of A and is denoted by M(A). The dimension of M(A) cannot exceed n and can only be equal to n if A is of full column rank. The vector Ax resides in M(A), but the vector y lies outside M(A), where it is assumed that ε ≠ 0 in (1.7). For each different vector x there is a corresponding vector of residuals e, so that y is the sum of the two vectors Ax and e. The length of e needs to be minimized, and this is achieved by making the residual vector e perpendicular to M(A) (see Fig. 1.1). This implies that e = y − Ax must be orthogonal to any linear combination of the columns of A. If Ac is any such linear combination, where c is non-zero, then the orthogonality condition gives c^T A^T (y − Ax) = 0, from which the least-squares normal equations (1.9) are derived.
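The normal equations (1.9)-(1.10) and the orthogonality of the residual to M(A) can be checked numerically. The following is a minimal NumPy sketch; the matrix sizes, the noise level and the vector x_true are illustrative choices, not values from the text.

```python
import numpy as np

# Least squares via the normal equations (1.9)-(1.10) on synthetic data.
rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.standard_normal((m, n))          # full column rank a.s.
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.01 * rng.standard_normal(m)

# x_hat = (A^T A)^{-1} A^T y, computed with a linear solve rather
# than an explicit inverse for numerical reliability.
x_hat = np.linalg.solve(A.T @ A, A.T @ y)

# The residual e = y - A x_hat is orthogonal to the columns of A,
# which is exactly the geometric condition c^T A^T (y - Ax) = 0.
e = y - A @ x_hat
print(np.max(np.abs(A.T @ e)))           # numerically zero
```

The same x_hat is returned by `np.linalg.lstsq`, which is preferable in practice because it avoids forming A^T A.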
Among econometricians, Maximum Likelihood (ML) is another popular technique for deriving estimators of linear models. The likelihood function of the observed sample is the probability density function of y, namely,

L(x, σ²) = (2πσ²)^{−m/2} e^{−(y − Ax)^T (y − Ax) / (2σ²)},
Figure 1.1. Geometric interpretation of least-squares for the OLM problem.
where e denotes the exponential function. Maximizing L(x, σ²) is equivalent to maximizing ln L(x, σ²), where

ln L(x, σ²) = −(m/2) ln(2π) − (m/2) ln(σ²) − (1/(2σ²)) (y − Ax)^T (y − Ax).

Setting ∂ ln L/∂x = 0 and ∂ ln L/∂σ² = 0, the ML estimators are obtained as

x̂_ML = (A^T A)^{−1} A^T y   (1.12)

and

σ̂²_ML = (y − A x̂_ML)^T (y − A x̂_ML) / m.   (1.13)

The ML estimator x̂_ML is identical to the least-squares estimator x̂, which is the best linear unbiased estimator (BLUE) of x in (1.7). If x̂ is the BLUE of x, it follows that E(x̂) = x and ∀ q ∈ ℝ^n, Var(q^T x̂) ≤ Var(q^T x̄), where x̄ is any linear unbiased estimator of x. However, the ML estimator σ̂²_ML differs from the unbiased estimator of σ², which is given by

σ̂² = (y − A x̂)^T (y − A x̂) / (m − n).   (1.14)
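The downward bias of the ML variance estimator relative to (1.14) is easy to see numerically: both share the same residual sum of squares but divide by m and m − n respectively. A small sketch on synthetic data (sizes and seed are illustrative):

```python
import numpy as np

# Contrast the ML variance estimator (1.13), RSS/m, with the
# unbiased estimator (1.14), RSS/(m - n).
rng = np.random.default_rng(1)
m, n = 100, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = A @ x + rng.standard_normal(m)       # true sigma^2 = 1

x_hat, res, rank, sv = np.linalg.lstsq(A, y, rcond=None)
rss = float(res[0])                      # residual sum of squares
sigma2_ml = rss / m                      # biased downwards
sigma2_unbiased = rss / (m - n)          # unbiased
print(sigma2_ml, sigma2_unbiased)        # ML estimate is the smaller
```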
Numerous methods exist for solving the least-squares problem. Some of the best known methods are Gaussian elimination, Gauss-Jordan elimination, LU decomposition, Cholesky factorization, Singular Value Decomposition (SVD) and QR decomposition (QRD). When the coefficient matrix is large and sparse, these methods, which are called direct methods, suffer from fill-in and can be impracticable. In such cases, iterative methods are more efficient, even though there are intelligent adaptations of the direct methods which minimize the fill-in. Iterative methods (e.g. Conjugate Gradient) have the advantage that minimal storage space is required for implementation, since no fill-in of the zero positions of the coefficient matrix occurs during computation. Furthermore, if a good initial guess is known and preconditioning can be used, the iterative methods can converge in an acceptable number of steps. Full details of the direct and iterative methods are given in textbooks such as [7, 40, 51, 93, 98].
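As a minimal illustration of the iterative approach mentioned above, the sketch below applies a hand-rolled, unpreconditioned conjugate-gradient iteration to the normal equations A^T A x = A^T y (the so-called CGNR approach); it is a generic textbook scheme on synthetic data, not an algorithm developed in this monograph.

```python
import numpy as np

def cg(matvec, b, tol=1e-10, maxiter=200):
    """Conjugate gradient for a symmetric positive definite operator."""
    x = np.zeros_like(b)
    r = b - matvec(x)                 # initial residual
    p = r.copy()                      # initial search direction
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)         # step length
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p     # conjugate direction update
        rs = rs_new
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
# Only matrix-vector products with A and A^T are needed: no fill-in.
x_cg = cg(lambda v: A.T @ (A @ v), A.T @ y)
```

Note that only the products A v and A^T u are required, which is why the zero structure of a sparse A is never destroyed.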
Here the numerically reliable direct method of QR decomposition (QRD) is used under the assumption that the coefficient matrix is dense. The QRD of the explanatory data matrix A is given by

A = Q ( R )
      ( 0 ),   (1.15)

where Q ∈ ℝ^{m×m} is orthogonal, i.e. it satisfies Q^T Q = Q Q^T = I_m, and R ∈ ℝ^{n×n} is upper triangular. Substituting (1.15) in the normal equations (1.9) gives

R^T R x = R^T y_1,   where   Q^T y = ( y_1 ) } n
                                     ( y_2 ) } m − n.

Under the assumption that A is of full column rank, which implies that R is non-singular, the least-squares estimator of the OLM (1.7) is computed by solving the upper triangular system of equations

R x = y_1.   (1.16)
Another approach to deriving (1.16) is to use the property of orthogonal transformation matrices which leave the Euclidean length of a vector invariant. The Euclidean length or 2-norm of z ∈ ℝ^m is given by ‖z‖ = (z^T z)^{1/2} and so ‖Qz‖² = z^T Q^T Q z = z^T z = ‖z‖². Hereafter, the Euclidean norm will be denoted by ‖·‖. In the context of minimizing ‖e‖², it follows that

x̂ = argmin_x ‖e‖²
  = argmin_x ‖Q^T e‖²
  = argmin_x ‖Q^T y − Q^T A x‖²
  = argmin_x ‖( y_1 − Rx ; y_2 )‖²
  = argmin_x ( ‖y_1 − Rx‖² + ‖y_2‖² ),

which gives x̂ = R^{−1} y_1. The quantity ‖y − A x̂‖² = ‖y_2‖² is termed the residual sum of squares (RSS).
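The derivation above can be sketched numerically. The following uses NumPy's built-in QR factorization in place of the Householder and Givens constructions developed later in the chapter, and verifies that the RSS equals ‖y_2‖²; the data are synthetic.

```python
import numpy as np

# QRD-based least squares (1.15)-(1.16): with A = Q [R; 0],
# x_hat solves R x = y1, and the RSS is ||y2||^2.
rng = np.random.default_rng(2)
m, n = 10, 4
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

Q, R = np.linalg.qr(A, mode="complete")   # Q is m x m, R is m x n
qty = Q.T @ y
y1, y2 = qty[:n], qty[n:]                 # the partition of Q^T y
x_hat = np.linalg.solve(R[:n, :], y1)     # triangular system R x = y1

rss_qr = float(y2 @ y2)                   # ||y2||^2
rss_direct = float(np.sum((y - A @ x_hat) ** 2))
print(rss_qr, rss_direct)                 # the two RSS values agree
```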
2.1.1 OLM WITH LINEAR EQUALITY RESTRICTIONS

Many regression models incorporate additional information in the form of restrictions (constraints) on the parameters of the model. In these cases the models are called restricted or constrained models. If the vector x of the OLM (1.7) is subject to k consistent restrictions expressed as

Cx = d,   (1.17)

where C is a k × n matrix and d is of order k, then the restricted least squares (RLS) solution is an n element vector x* satisfying

x* = argmin_{Cx=d} ‖y − Ax‖².   (1.18)

The assumptions for the matrix C are rank(C) = k and k < n.
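One standard way to solve (1.18), not necessarily the method developed in this book, is via the bordered (KKT) system obtained from the Lagrangian of the constrained problem. A minimal NumPy sketch on synthetic data (the sizes, the Lagrange-multiplier formulation and the variable names are illustrative assumptions):

```python
import numpy as np

# Restricted least squares (1.18) via the bordered normal equations:
#   [ A^T A  C^T ] [ x* ]   [ A^T y ]
#   [ C      0   ] [ l  ] = [ d     ]
# where l is the vector of Lagrange multipliers.
rng = np.random.default_rng(3)
m, n, k = 12, 4, 2
A = rng.standard_normal((m, n))           # full column rank a.s.
y = rng.standard_normal(m)
C = rng.standard_normal((k, n))           # rank(C) = k, k < n
d = rng.standard_normal(k)

K = np.block([[A.T @ A, C.T],
              [C, np.zeros((k, k))]])
rhs = np.concatenate([A.T @ y, d])
sol = np.linalg.solve(K, rhs)
x_star = sol[:n]                          # the RLS solution
print(C @ x_star - d)                     # the restrictions hold
```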
Using regression analysis, the estimated execution time (in seconds × 10⁻²) of Algorithm 1.6 is found to be

T1(M, N) = N(14.15 + 3.09N − 0.62N² + 5.71M + 3.67MN).
The above timing model includes the overheads which arise mainly from the reference to the submatrix A_{i:,i:} in line 3. This matrix reference results in the assignment of an array section of A to a temporary array and then, when the procedure transform in line 6 has been completed, the reassignment of the temporary array to A. The overheads can be reduced by referencing a submatrix of A only if it uses fewer memory layers than a previously extracted submatrix (see for example the Modified Gram-Schmidt algorithm). This slight modification improves significantly the execution time of the algorithm, which now becomes

T2(M, N) = N(14.99 + 2.09N − 0.20N² + 3.19M + 1.17MN).

The accuracy of the timing models is illustrated in Table 1.1.
4.3 THE GRAM-SCHMIDT METHOD

As in the case of the Householder algorithm, the performance of the straightforward implementation of the MGS method will be significantly impaired by the overheads. Therefore, the n = Nes2 steps of the MGS method are applied in N stages. At the ith stage, es2 steps are used to orthogonalize columns (i − 1)es2 + 1 to i·es2 of A and also to construct the corresponding rows of R. Each step of the ith (i = 1, ..., N) stage has the same execution time, namely Φ1(Mes1, (N − i + 1)es2). Thus, the execution time to apply all Nes2 steps of the MGS method is given by

T̃3(Mes1, Nes2) = es2 Σ_{i=1}^{N} Φ1(Mes1, (N − i + 1)es2).

The data parallel MGS orthogonalization method is given in Algorithm 1.7, where A is overwritten by Q1 - the orthogonal basis of A. The total execution time of Algorithm 1.7 is given by

T3(M, N) = N(9.15 + 3.12N − 0.01N² + 4.95M + 1.31MN).
Algorithm 1.7 The MGS method for computing the QRD on SIMD systems.
1: def MGS_QRD(A, Mes1, Nes2) =
2:   for i = 1, ..., Nes2 with steps es2 do
3:     apply orthogonal(A_{:,i:}, R_{i:,i:}, Mes1, (N − i + 1)es2)
4:   end for
5: end def
6: def orthogonal(A, R, m, n) =
7:   for i = 1, ..., es2 do
8:     R_{i,i} := sqrt(sum(A_{:,i} * A_{:,i}))
9:     A_{:,i} := A_{:,i} / R_{i,i}
10:    forall (j = i + 1 : n)  W_{:,j} := A_{:,i} * A_{:,j}
11:    R_{i,i+1:} := sum(W_{:,i+1:}, 1)
12:    forall (j = i + 1 : n)  A_{:,j} := A_{:,j} − R_{i,j} * A_{:,i}
13:  end for
14: end def
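A serial NumPy transcription of the inner loop of Algorithm 1.7 (the orthogonal procedure) may clarify the column operations; the SIMD staging over es2-wide column blocks and the intermediate array W are omitted here, but the arithmetic per column is the same.

```python
import numpy as np

def mgs_qrd(A):
    """Modified Gram-Schmidt QRD: A is overwritten by the orthogonal
    basis Q1 while the rows of R are built one at a time."""
    A = A.astype(float).copy()
    m, n = A.shape
    R = np.zeros((n, n))
    for i in range(n):
        # Lines 8-9 of Algorithm 1.7: normalize column i.
        R[i, i] = np.sqrt(np.sum(A[:, i] * A[:, i]))
        A[:, i] /= R[i, i]
        # Lines 10-12: project column i out of the remaining columns.
        R[i, i + 1:] = A[:, i] @ A[:, i + 1:]
        A[:, i + 1:] -= np.outer(A[:, i], R[i, i + 1:])
    return A, R

rng = np.random.default_rng(5)
A = rng.standard_normal((9, 4))
Q1, R = mgs_qrd(A)
print(np.max(np.abs(Q1 @ R - A)))   # reconstruction error, near zero
```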
It can be seen from Table 1.1 that the (improved) Householder method performs better than the MGS method. The difference in the performance of the two methods arises mainly because, at the ith step, the MGS and Householder methods work with m × (n − i + 1) and (m − i + 1) × (n − i + 1) matrices, respectively. An analysis of T2(M, N) and T3(M, N) reveals that for M > N, the MGS algorithm is expected to perform better than the Householder algorithm only when N = 1 and M = 2.
4.4 THE GIVENS ROTATION METHOD

A Givens rotation, when applied from the left of a matrix, affects only two of its rows: thus a number of them can be applied simultaneously. This particular feature underpins the development of parallel Givens algorithms for solving a range of matrix factorization problems [29, 30, 69, 94, 95, 102, 103, 129].
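The two-row action of a single rotation can be sketched as follows; the function name, the choice of pivot rows and the data are illustrative, and compound disjoint rotations would simply apply several such calls to non-overlapping row pairs.

```python
import numpy as np

def apply_givens(A, i, j, col):
    """Apply one Givens rotation from the left so that A[j, col]
    becomes zero; only rows i and j of A are touched."""
    a, b = A[i, col], A[j, col]
    r = np.hypot(a, b)                 # sqrt(a^2 + b^2), overflow-safe
    c, s = a / r, b / r                # cosine and sine of the rotation
    rows = A[[i, j], :].copy()
    A[i, :] = c * rows[0] + s * rows[1]
    A[j, :] = -s * rows[0] + c * rows[1]
    return A

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 3))
norms_before = np.linalg.norm(A, axis=0)
A = apply_givens(A, 0, 1, 0)           # annihilate A[1, 0] against row 0
print(A[1, 0])                         # numerically zero
```

Because the rotation is orthogonal, the column norms (and hence the least-squares problem) are unchanged.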
The orthogonal matrix QT in (1.32a) is the product of a sequence of Com
pound Disjoint Givens Rotations (CDGRs), with each compound rotation re
ducing to zero elements
of A
below the main diagonal while preserving pre
viously annihilated elements. Figure 1.5 shows two sequences of CDGRs for
computing the QRD
of
a 12 x 6 matrix, where a numerical entry denotes an el
ement annihilated by the corresponding CDGR. The first Givens sequence was
developed by Sameh and Kuck [129]. This sequence - the SK sequence - ap
plies a total ofm+n-2 CDGRs to triangularize an
m
x
n
matrix (m >
n),
com
pared to
n(2m - n
-1) Givens rotations needed when the serial Algorithm 1.2
is used. The elements are annihilated by rotating adjacent rows. The second
Givens sequence - the Greedy sequence - applies fewer CDGRs than the SK
sequence but, when it comes to implementation, the advantage
of
the
Greedy
sequence is offset by the communication overheads arising from the construc
tion and application
of
the compound rotations [30,67, 102, 103]. For
m
n,
the Greedy sequence applies approximately log m + (n - 1) log log m CDGRs
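The SK pattern can be made concrete in a few lines of Python. The closed form used for the step index is an assumption consistent with the adjacent-row, bottom-up description and with the m + n − 2 total:

```python
def sk_step(i, j, m):
    """Step at which the SK sequence annihilates element (i, j), 1-based,
    with i > j: each column is swept bottom-up with adjacent-row rotations,
    and column j starts two steps after column j - 1 (closed form assumed)."""
    return m - i + 2 * j - 1

def sk_total_steps(m, n):
    """Total number of CDGRs applied to an m x n matrix (m > n): the
    last element annihilated is (n + 1, n)."""
    return sk_step(n + 1, n, m)
```

The rotations grouped into one step touch disjoint row pairs, which is what makes each step a valid compound rotation.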
Figure 1.5. Examples of Givens rotation schemes for computing the QRD: (a) the SK sequence; (b) the Greedy sequence.
The adaptation, implementation and performance evaluation of the SK sequence to compute various forms of orthogonal factorizations on SIMD systems will be discussed in the subsequent chapters. On the MP-1208, the execution time of computing the QRD of an Mes1 × Nes2 matrix using the SK sequence is found to be

T4(M,N) = N(25.64 + 5.51N − 7.94N² + 11.1M + 15.99MN) + 41.96M.
4.5 COMPUTATIONAL RESULTS
The Householder factorization method is found to be the most efficient in
terms of speed, followed by the MGS algorithm which is only slightly slower
than the data parallel Householder algorithm. Use of the SK sequence produces
by far the worst performance.
The comparison of the performances of the data parallel implementations
was made using accurate timing models. These models provide an effective
tool for measuring the computational speed of algorithms and they can also
be used to reveal inefficiencies of parallel implementations [80]. Comparisons
with performance models of various algorithms implemented on other similar
SIMD systems, demonstrate the scalability of the execution time models [67,
75]. If the dimensions of the data matrix A
do
not satisfy the assumption that
they are multiples of the size of the physical array processor, then the timing
models can be used to give a range of the expected execution times of the
algorithms.
Table 1.1. Times (in seconds × 10⁻²) of computing the QRD of a 128M × 64N matrix.

          Algor. 1.6        Improved Algor. 1.6   Algor. 1.7         Algor. SK
M   N     Exec.   T1(M,N)   Exec.    T2(M,N)      Exec.    T3(M,N)   Exec.    T4(M,N)
10   3     5.48     5.55      2.58     2.59         3.21     3.22     21.16    21.04
10   7    22.15    22.34      9.28     9.34        12.07    12.08     67.73    67.59
10   9    33.80    34.09     13.80    13.92        18.49    18.48     92.65    92.62
14   5    17.48    17.54      7.36     7.35         9.30     9.30     62.27    62.36
14   9    47.86    48.03     18.75    18.86        24.52    24.50    150.05   150.11
14  13    90.30    90.56     34.38    34.54        46.64    46.64    242.63   242.68
18   5    22.34    22.35      9.14     9.15        11.60    11.59     82.03    82.25
18   9    61.90    61.98     23.77    23.79        30.68    30.52    207.47   207.61
18  17   189.16   189.03     69.42    69.31        94.41    94.26    503.44   503.51
22   7    48.61    48.71     19.10    18.90        23.98    23.93    175.85   175.96
22  15   188.79   188.49     68.88    68.57        89.86    89.82    585.56   585.64
22  19   287.23   286.33    103.31   102.81       138.59   138.27    805.45   805.71
5 QRD OF LARGE AND SKINNY MATRICES

The development of SIMD algorithms to compute the QRD when matrices do not have dimensions which are multiples of the physical array processor size is considered [90]. Implementation aspects of the QRD algorithm from the Cambridge Parallel Processing (CPP) linear algebra library (LALIB) are investigated [19]. The LALIB QRD algorithm is a data-parallel version of the serial Householder algorithm proposed by Bowgen and Modi [17]. The performances of Algorithm 1.6 and the QRD LALIB routine are compared. A second Householder algorithm, which is efficient for skinny matrices, is also proposed.
5.1 THE CPP GAMMA SIMD SYSTEM
The Cambridge Parallel Processing (CPP) GAMMA series has a Master Control Unit (MCU) and 1024 or 4096 Processing Elements (PEs) arranged in a 2-D square array. It has an interconnection network for PE-to-PE communication and for broadcast between the MCU and the PEs. The GAMMA SIMD systems are based on the fine-grain massively parallel computer systems known as the AMT DAP (Distributed Array of Processors) [116, 118].

A macro assembler called APAL (Array of Processors Assembly Language) is available to support low-level programming of the GAMMA-I. Two high-level language systems are also available for the GAMMA-I. These are extended versions of Fortran (called Fortran-Plus enhanced or, for short, F-PLUS) and C++. These languages interact with the language that the user selects to run on the host machine, typically Fortran or C [1, 2]. Both high-level languages allow the programmer to assume the availability of a virtual processor array of arbitrary size. As in the MasPar, using the default cyclic distribution, an m × n matrix is mapped on the PEs using ⌈m/es⌉⌈n/es⌉ layers of memory, while an m-element vector is mapped on the PEs using ⌈m/es²⌉ layers of memory, where es × es (es = 32 or es = 64) is the dimension of the SIMD array processor. An m × n matrix can also be considered as an array of n m-element column vectors (parallelism in the first dimension) or m n-element row vectors (parallelism in the second dimension), requiring respectively n⌈m/es²⌉ and m⌈n/es²⌉ layers of memory to map the matrices onto the PEs [25].
In most non-trivial cases, the complexity of performing a computation on an array is not reduced if some of the PEs are disabled, since the disabled PEs become idle only during the assignment process. In such cases the programmer is responsible for avoiding computations on unaffected submatrices. To illustrate this, let h = (h1, ..., hm) and u = (u1, ..., un) be real vectors, L ≡ (l1, ..., ln) a logical vector and A an m × n real matrix. The F-PLUS statement

u(L) = sumr(matc(h,n) * A)     (1.45)

is equivalent to the HPF statement

forall(i = 1 : n, li = true) ui = sum(h * A:,i)

which computes the inner product ui = hT A:,i for all i where li has the value true.
In Fortran-90 the F-PLUS functions sumr(A), matc(h,n) and matr(u,m) can be expressed as sum(A,1), spread(h,2,n) and spread(u,1,m), respectively. The main difference, however, between F-PLUS and HPF is that the F-PLUS statement computes all the inner products hT A and then simultaneously assigns the results to those elements of u for which the corresponding elements of L have the value true. This difference may cause a degradation of performance with respect to execution speed if the logical vector L has a significant number of false values. Consider, for example, the three cases where (i) all elements of L have a true value, (ii) the first n/2 elements of L have a true value, and (iii) only the first element of L has a true value. For m = 1000 and n = 500, the execution time in msec for computing (1.45) on the 1024-processor GAMMA-I (hereafter abbreviated to GAMMA-I) for all three cases is 249.7, while, without masking, the time required to compute all inner products is 247.79. Explicitly performing operations only on the affected elements of u, the execution times (including overheads) in cases (ii) and (iii) are found to be 147.84 and 13.21, respectively. This example shows the degradation in performance that might occur when implementing an algorithm without taking into consideration the systems software of the particular parallel computer.
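The contrast between the two execution models can be sketched in plain Python (illustrative function names; this is not F-PLUS):

```python
def masked_inner_products(h, A, mask):
    """F-PLUS-style semantics of (1.45): every inner product h^T A[:,i]
    is computed, and the logical vector only gates the assignment, so
    the cost does not depend on the number of true entries."""
    m, n = len(A), len(A[0])
    full = [sum(h[k] * A[k][i] for k in range(m)) for i in range(n)]
    return [full[i] if mask[i] else None for i in range(n)]

def selective_inner_products(h, A, mask):
    """Explicitly operating only on the affected elements, as in cases
    (ii) and (iii): work is proportional to the true entries of the mask."""
    m, n = len(A), len(A[0])
    return [sum(h[k] * A[k][i] for k in range(m)) if mask[i] else None
            for i in range(n)]
```

Both produce the same values on the masked positions; only the amount of arithmetic differs, which mirrors the 249.7 vs 13.21 msec timings above.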
5.2 THE HOUSEHOLDER QRD ALGORITHM

The CPP LALIB implementation of the QRD Householder algorithm could be considered a straightforward one. Initially the algorithm was implemented on the AMT DAP using an earlier version of F-PLUS which required the data matrix to be partitioned into submatrices having the same dimensions as the array processor [17]. The re-implementation of the algorithm using the new F-PLUS has removed this constraint. Algorithm 1.8 shows broadly how the Householder method has been implemented in this library routine for computing the QRD. The information needed for generating the orthogonal matrix Q is stored in the annihilated parts of A and in two n-element vectors. For simplicity Algorithm 1.8 ignores this; neither does it emphasize other details of the LALIB QRD subroutine QR_FACTOR, such as those dealing with overflow, that do not play an important role in the performance of the algorithm [19].

Clearly the performance of Algorithm 1.8 is dominated by the computations in the 10th and 11th lines, while computations on logical arrays and scalars are less significant. The computation of the Euclidean norm of the m-element vector h in line 5 is a function of ⌈m/es²⌉ and is therefore important only for large matrices, where m ≫ es². Notice that the first i − 1 elements of h are zero and the corresponding rows and columns of A remain unchanged. Thus the computations in lines 5, 10 and 11 can be written as follows:
ui:n := sumr(matc(hi:m, n − i + 1) * Ai:m,i:n) / pi
Ai:m,i:n := Ai:m,i:n − matc(hi:m, n − i + 1) * matr(ui:n, m − i + 1)
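In 0-based Python the restricted computations read as follows (a sketch; p stands for the scalar divisor written as pi above):

```python
def restricted_update(A, h, p, i):
    """Sketch of the restricted computations: since h[0:i] is zero, only
    rows and columns i.. of A take part (0-based indices; p is the
    scalar divisor).  Computes u := h[i:]^T A[i:, i:] / p and applies
    the rank-one update A[i:, i:] := A[i:, i:] - h[i:] u^T in place."""
    m, n = len(A), len(A[0])
    u = [sum(h[k] * A[k][j] for k in range(i, m)) / p for j in range(i, n)]
    for k in range(i, m):
        for j in range(i, n):
            A[k][j] -= h[k] * u[j - i]
    return u
```

Restricting the index ranges in this way is what avoids wasted work on the rows and columns that the zero leading part of h leaves unchanged.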
Algorithm 1.8 The CPP LALIB method for computing the QR Decomposition.
1: L := true; M := true
2: for i = 1, 2, ..., n do
3:   h := 0
4:   h(L) := A:,i
5:   σ := sqrt(sum(h * h))
6: if hi

where Ai (m × ni, m > ni) is the exogenous full column rank matrix in the ith regression equation, Qi is an m × m orthogonal matrix and Ri is an upper triangular matrix of order ni [78]. The fast simultaneous computation of the QRDs (1.47) is considered.
6.1 EQUAL SIZE MATRICES

Consider, initially, the case where the matrices A1, ..., AG have the same dimension, that is, n1 = ··· = nG = n. The equal-size matrices suggest that a 3-D array could be employed. The m × n data matrices A1, ..., AG and the upper triangular factors R1, ..., RG can be arranged in an m × n × G array A and an n × n × G array R, respectively. Using a 2-D mapping, computations performed on scalars, 1-D and 2-D arrays correspond to computations on 1-D, 2-D and 3-D arrays when a 3-D mapping is used. Thus, in theory, the advantage over a 2-D mapping is that a 3-D arrangement will increase the level of parallelism.

The algorithms have been implemented on the 8192-processor MasPar MP-1208, using the high-level language MasPar-Fortran. On the MasPar, the 3-D arrangement of the equal-size matrices is mapped on the 2-D array of PEs plus memory, with computations over the third dimension being performed serially. This indicates that under a 3-D arrangement the increase in parallelism will not be as large as is theoretically expected.
The indexing expressions of 2-D matrices and the replication and reduction functions can be used in a 3-D framework. That is, the function spread, which replicates an array by adding a dimension, and the function sum, which adds all of the elements of an array along a specified direction, can be used. For example, if B and C are m × n and m × n × G arrays respectively, then C := spread(B,3,G) implies that forall k, C:,:,k = B:,:, and B := sum(C,3) is equivalent to B(i,j) = Σ_{k=1}^{G} Ci,j,k, whereas sum(C) has a scalar value equal to the sum of all of the elements of C.
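The semantics of these two functions over a third dimension can be mimicked with nested lists (a sketch, not MasPar-Fortran):

```python
def spread3(B, G):
    """C := spread(B, 3, G): replicate the m x n array B along a new
    third dimension, so that C[:, :, k] = B for every k."""
    m, n = len(B), len(B[0])
    return [[[B[i][j] for _ in range(G)] for j in range(n)] for i in range(m)]

def sum3(C):
    """B := sum(C, 3): B[i][j] is the sum over k of C[i][j][k]."""
    return [[sum(cell) for cell in row] for row in C]
```

Replicating and then reducing along the third dimension recovers G copies of each element summed together, which is the identity exercised by the test below.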
The first method for computing the QR factorization of a matrix employs a sequence of Householder reflections H = I − hhT/b, where b = hTh/2. The application of H to the data matrix Ai involves the vector-matrix computation zT = hTAi/b and a rank-one update Ai − hzT. Both of these operations can be efficiently computed on an SIMD array processor using the replication function spread and the reduction function sum. The SIMD implementation of the Householder algorithm for computing the QRDs (1.47) simultaneously is illustrated in Algorithm 1.10. A total of n Compound Householder Transformations (CHTs) are applied. The ith CHT produces the ith rows of R1, ..., RG without affecting the first i − 1 columns and rows of A1, ..., AG. The simultaneous data parallel vector-matrix computations and rank-one updates are shown respectively in lines 12-14 of Algorithm 1.10.
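A minimal Python sketch of a single reflection applied through exactly these two operations (the construction of h follows the usual sign convention, which is an assumption here):

```python
import math

def householder_vector(col):
    """h such that H = I - h h^T / b zeroes col below its first entry
    (sign chosen to avoid cancellation)."""
    s = math.sqrt(sum(x * x for x in col))
    h = col[:]
    h[0] += s if col[0] >= 0 else -s
    return h

def apply_householder(A, h):
    """Apply H = I - h h^T / b, b = h^T h / 2, to A via the two
    operations named in the text: z^T = h^T A / b, then A - h z^T."""
    b = sum(x * x for x in h) / 2.0
    m, n = len(A), len(A[0])
    z = [sum(h[k] * A[k][j] for k in range(m)) / b for j in range(n)]
    return [[A[k][j] - h[k] * z[j] for j in range(n)] for k in range(m)]
```

On the SIMD system, z is one sum reduction over replicated copies of h, and the rank-one update is one element-wise operation; a CHT simply performs these for all G matrices at once.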
Algorithm 1.10 The Householder algorithm.
1: def Househ_QRD(A, m, n, G) =
2:   for i = 1, ..., n do
3:     apply transform(Ai:,i:,:, m − i + 1, n − i + 1, G)
4:   end for
5: end def
6: def transform(A, m, n, G) =
7:   H := A:,1,:
8:   S := sqrt(sum(H * H, 1))
9:

where A ∈ ℜ^{m×n} (m > n) is the exogenous data matrix, y ∈ ℜ^m is the response vector and ε ∈ ℜ^m is the noise vector with zero mean and dispersion matrix σ²Im.
The least squares estimator of the parameter vector x ∈ ℜ^n,

x̂ = argmin_x εTε = argmin_x ||Ax − y||²,     (2.2)

has an infinite number of solutions when A does not have full rank. However, a unique minimum 2-norm estimator of x, say x̂, can be computed.
Let the rank of A be given by k (k ≤ n). The solution of (2.2) is computed in two stages. In the first stage the coefficient matrix A is reduced to a lower trapezoidal form; in the second stage the lower trapezoid is triangularized. The orthogonal decompositions of the first and second stages are given, respectively, by:

QT A Π = (   0   ) m−k
         ( L1 L2 ) k                      (2.3)

and
            n−k  k
(L1 L2) P = ( 0  L ),                     (2.4)

where Q ∈ ℜ^{m×m} and P ∈ ℜ^{n×n} are orthogonal, L and L2 are lower-triangular and non-singular, and Π ∈ ℜ^{n×n} is a permutation matrix. That is,

              n−k  k
QT A Π P = (  0    0 ) m−k
           (  0    L ) k                  (2.5)

The orthogonal decompositions (2.3) and (2.5) are called the QL decomposition (QLD) and the complete QLD, respectively.

The minimum 2-norm best linear unbiased estimator of x is given by

x̂ = Π P (      0      )
         ( L⁻¹ Q2T y ),

where the zero block has n − k elements and Q2 comprises the last k columns of Q.
Numerous methods have been proposed for computing the orthogonal factorizations (2.3) and (2.4), on both serial computers and MIMD parallel systems [16, 18, 51, 93].

Algorithms are designed, implemented and analyzed for computing the complete QLD on the CPP DAP 510 massively parallel SIMD computer (abbreviated to DAP) [70]. The algorithms employ Householder reflections and Givens plane rotations. Algorithms are also proposed for reconstructing the orthogonal matrices involved in the decompositions when the data which define the orthogonal transformations are stored in the annihilated parts of the coefficient matrix A. The implementation and execution time models of all algorithms on the DAP are considered in detail. All of the algorithms were implemented on the 1024-processor DAP using double precision arithmetic. The timing models are expressed in msec.
2 THE QLD OF THE COEFFICIENT MATRIX

The computation of the QLD (2.3) using Householder reflections with column pivoting is considered. This method is also used when A is of full column rank but ill-conditioned. Let the elementary permutation matrix Π^(i,μ) denote the identity n × n matrix In with columns n − i + 1 and μ interchanged, and let

QiT = Im − h^(i) h^(i)T / bi              (2.6)

denote an m × m Householder matrix which annihilates the first m − i elements of A:,n−i+1 (the pivot column) when it is multiplied by A on the right. The matrices
Q and Π in (2.3) are defined by

QT = Π_{i=1}^{k} Q_{k−i+1}T = QkT Q_{k−1}T ··· Q1T

and

Π = Π_{i=1}^{k} Π^(i,μi) = Π^(1,μ1) Π^(2,μ2) ··· Π^(k,μk).

To describe briefly the process of computing the QLD (2.3) let, at the ith (0 ≤ i ≤ k) step,

A^(i) = QiT ··· Q1T A Π^(1,μ1) ··· Π^(i,μi) =
          n−i     i
        ( Ã^(i)   0     ) m−i
        (   ·     L^(i) ) i,

where L^(i) is non-singular and lower-triangular with its diagonal elements in increasing order of magnitude. The value of μ_{i+1} is the index of the column of Ã^(i) with maximum Euclidean norm. The criterion used to decide that rank(A) = k is ||A^(k)_{:,μ_{k+1}}||2 < τ, where A^(k)_{:,μ_{k+1}} is the μ_{k+1}th column of Ã^(k) and τ is an absolute tolerance parameter whose value depends on the scaling of A [51, 63, 93]. The value of τ is assumed to be given.
The permutation matrix Π can be stored and computed using two n-element integer vectors, ξ and η. A permutation Π^(i,μi) is equivalent to swapping first the elements ξ_{n−i+1} and ξ_{μi} of the ξ vector and then swapping the elements n − i + 1 and μi of the η vector, where, initially, ξi = ηi = i (i = 1, ..., n).
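A Python sketch of this bookkeeping, with the two vectors named xi and eta for illustration (the original symbols do not survive well in this transcription); recording each interchange this way keeps eta the inverse permutation of xi:

```python
def record_interchange(xi, eta, a, b):
    """Record the interchange of columns a and b (1-based): swap the
    elements at positions a and b of xi, then swap the values a and b
    wherever they occur in eta.  Starting from xi[j] = eta[j] = j + 1,
    eta remains the inverse permutation of xi throughout."""
    xi[a - 1], xi[b - 1] = xi[b - 1], xi[a - 1]
    p, q = eta.index(a), eta.index(b)
    eta[p], eta[q] = eta[q], eta[p]
```

Storing the permutation as two integer vectors avoids forming the n × n matrix Π explicitly, which matters on a fine-grain array processor.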
2.1 SIMD IMPLEMENTATION

The QLD (2.3) has been computed on the DAP under the assumption that n = Nes, m = Mes and 1 <

n and T0(n,n) = 2n − 3.
(3.22) is found empirically to be minimized for pi having values closer to n. For simplicity let λ be an integer such that k = λn and ⌈log2 λ⌉ = ⌈log2(λ + 1)⌉. This implies that (3.22) is minimized if ∀i: pi = n, and T1(n, k, λ, p) is simplified to

T2(n,λ) = 2n − 3 + n⌈log2(λ + 1)⌉.
Figure 3.4 illustrates the computation of (3.6a) using the bitonic algorithm, where n = 6, k = 18 and pi = 6 (i = 1, 2, 3). The bold frames show the partition

(DT RT)T = (D1T D2T D3T RT)T.

Initially the QRDs of D1, D2 and D3 are computed simultaneously and, at stages i = 1, 2, the updating is completed by computing (3.21), where g = 2.
Figure 3.4. The bitonic algorithm, where n = 6, k = 18 and p1 = p2 = p3 = 6: compute (3.20); then compute (3.21) for i = 1 and for i = 2.
The number of CDGRs applied to update the QRD using the UGSs is k + n − 1 (see Table 3.5). Ignoring additive constants, it follows that the bitonic algorithm is the more efficient for computing (3.6a) when λ > 2, compared with using the UGSs.

The second parallel strategy for solving the updating problem is a slight modification of the Greedy annihilation scheme in [30, 103]. Taking as before n = 6 and k = 18, Fig. 3.5 indicates the order in which the elements are annihilated. Observing that the elements in the diagonal of R are annihilated by successive rotations, it follows that at most k + n − 1 CDGRs are required to compute (3.6a). An approximation to the number of CDGRs required to compute (3.6a) when n is fixed and k approaches infinity is given by

T4(n,k) = log2 k + (n − 1) log2 log2 k.

The derivation of this approximation has been given in the context of computing the QRD, and it is also found empirically to be valid for computing (3.6a) [103]. In general, for k ≫ n, the Greedy sequence requires fewer CDGRs than the bitonic method, while for small k (compared with n) the UGSs and Greedy sequence require the same number of CDGRs. Table 3.5 shows the number of CDGRs required to compute (3.6a) using the UGSs, the bitonic method and the Greedy sequence for some n and λ (k = λn and k ≫ n).
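Under the counts quoted above (UGSs: k + n − 1; bitonic: T2(n, λ); Greedy: T4(n, k) truncated to an integer, which is an assumption that reproduces the tabulated entries), the rows of Table 3.5 can be regenerated:

```python
import math

def ugs_cdgrs(n, k):
    """UGS count for updating an n x n triangular factor with k rows."""
    return k + n - 1

def bitonic_cdgrs(n, lam):
    """T2(n, lambda) = 2n - 3 + n * ceil(log2(lambda + 1))."""
    return 2 * n - 3 + n * math.ceil(math.log2(lam + 1))

def greedy_cdgrs(n, k):
    """T4(n, k) = log2 k + (n - 1) log2 log2 k, truncated to an integer
    (truncation assumed; it matches the entries of Table 3.5)."""
    return int(math.log2(k) + (n - 1) * math.log2(math.log2(k)))
```

The linear growth of the UGS count in k, against the logarithmic growth of the Greedy count, is exactly the contrast the table exhibits.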
Figure 3.5. The Greedy sequence for computing (3.6a), where n = 6 and k = 18.
As regards implementation, the efficiency of the Greedy method is expected to be reduced significantly by the organizational overheads, so that the bitonic method is to be preferred [30, 103]. The simultaneous computations performed at each stage of the bitonic method make it suitable for distributed memory architectures [36]. Each processing unit will perform the same matrix computations without requiring any inter-processor communications. The simultaneous QRD of the matrices D1, ..., D_{2^g−1} on a SIMD system has been considered within the context of the SURE model estimation [84]. In this case the performance of the Householder algorithm was found to be superior to that of the Givens algorithm (see Chapter 1). The simultaneous factorizations (3.21) have been implemented on the MasPar within a 3-D framework, using Givens rotations and Householder reflections. The Householder algorithm applies the reflections H^(1,j), ..., H^(n,j), where H^(l,j) annihilates the non-zero elements of the lth column of the corresponding submatrix using the lth row of R_j^(i−1) as a pivot row (l = 1, ..., n).

Table 3.5. Number of CDGRs required to compute the factorization (3.6a).

 n    λ    k = λn   UGSs   bitonic   Greedy
15    5       75      89        72       43
15   10      150     164        87       47
15   20      300     314       102       50
15   40      600     614       117       54
30    5      150     179       147       89
30   10      300     329       177       96
30   20      600     629       207      102
30   40     1200    1229       237      107
60    5      300     359       297      187
60   10      600     659       357      198
60   20     1200    1259       417      208
60   40     2400    2459       477      217
Table 3.6. Times (in seconds) for computing the orthogonal factorization (3.6a).

 n    g   bitonic Householder   Householder   bitonic Givens   UGS-1
 64   2         0.84                0.23            1.29         1.45
 64   3         1.55                0.35            2.34         2.46
 64   5         4.83                0.82            7.97         9.04*
192   2         4.15                1.78            8.98         9.77
192   3         7.78                2.98           18.21        19.45
192   5        27.12               10.27           69.96        76.96*
320   2        10.15                5.51           27.96        32.41
320   3        19.69               10.05           58.78        67.51*
320   5        72.12               37.48          236.13       278.20*

* Estimated times.

Table 3.6 shows the execution times of the various algorithms for computing (3.6a) on the 8192-processor MasPar using single precision arithmetic. Clearly the bitonic algorithm based on Householder transformations performs better than the bitonic algorithm based on CDGRs. However, the straightforward data-parallel implementation of the Householder algorithm is found to be the fastest of all. The degradation in the performance of the bitonic algorithm is due mainly to the large number of simultaneous matrix computations which are performed serially on the 2-D array MasPar processor [84]. The bitonic algorithm based on CDGRs performs better than the direct implementation of UGS-1 because of the initial triangularization of the submatrices D1, ..., D_{2^g−1} using Householder transformations.
2.3 UPDATING WITH A MATRIX HAVING A BLOCK LOWER-TRIANGULAR STRUCTURE

Computational and numerical methods for deriving the estimators of structural equations models require the updating of a lower-triangular matrix with a matrix having a block lower-triangular structure. Within this context the updating problem can be expressed as the computation of the orthogonal factorization

P̄T ( Ā^(1) ) = ( 0 ) E − e1
    ( A^(1) )   ( L̃ ) (G−1)K − E + eG,       (3.23)

where

           K−e1          K−e2         ···   K−e_{G−1}
Ā^(1) = ( Ā^(1)_{2,1}      0          ···      0        ) e2
        ( Ā^(1)_{3,1}   Ā^(1)_{3,2}   ···      0        ) e3
        (     ···           ···                ···      )
        ( Ā^(1)_{G,1}   Ā^(1)_{G,2}   ···  Ā^(1)_{G,G−1}) eG,

           K−e1           K−e2          ···  K−e_{G−1}
A^(1) = (    L1             0           ···     0       ) K−e1
        ( A^(1)_{2,1}       L2          ···     0       ) K−e2
        (     ···           ···                 ···     )
        ( A^(1)_{G−1,1}  A^(1)_{G−1,2}  ···   L_{G−1}   ) K−e_{G−1},

the Li (i = 1, ..., G−1) are lower triangular and E = Σ_{i=1}^{G} ei.
The factorization (3.23) can be computed in G − 1 stages, where each stage annihilates a block-subdiagonal, with the first stage annihilating the main block-diagonal.
At the ith (i = 1, ..., G − 1) stage the orthogonal factorizations (3.24) are computed simultaneously for j = 1, ..., G − i, where the L̃_j^(i+1) matrix is lower triangular and P̄_{i,j} is a (K − ej + e_{i+j}) × (K − ej + e_{i+j}) orthogonal matrix. It follows that the triangular matrix L̃ in (3.23) is given by

      ( L̃1^(G)          0         ···  0 )
L̃ =  ( Ã_{2,1}^(G−1)  L̃2^(G−1)   ···  0 )
      (    ···            ···          ··· )

Therefore, if T_{D1}(e, K, i, j) denotes the number of CDGRs required to compute the factorization (3.24) using this method (hereafter called the diagonally-based method), then the total number of CDGRs needed to compute (3.23) is given by

T_D(e, K, G) = Σ_{i=1}^{G−1} max_{j=1,...,G−i} ( T_{D1}(e, K, i, j) ),       (3.25)

where e = (e1, ..., eG). Figure 3.6 shows the annihilation process for computing the factorization (3.23), where G = 5 and a label i denotes a submatrix eliminated at stage i (i = 1, ..., G − 1).
Figure 3.6. Computing the factorization (3.23) using the diagonally-based method, where G = 5 (stages 1-4).
Figure 3.7 illustrates various annihilation schemes for computing the factorization (3.24) by showing only the zeroed matrix Ā^(i)_{i+j,j} and the lower-triangular L̃_j^(i) matrix, where e_{i+j} = 12 and K − ej = 4. The annihilation schemes are equivalent to those of block-updating the QRD, the only difference being that an upper-triangular matrix is replaced by a lower-triangular matrix [69, 75, 76, 81]. These annihilation schemes can be employed to annihilate different submatrices of Ā^(1); that is, at step i (i = 1, ..., G − 1) of the factorization (3.23) the submatrices Ā^(i)_{i+1,1}, ..., Ā^(i)_{G,G−i} can be zeroed without using the same annihilation scheme. Assuming that only the UGS-2 or UGS-3 schemes are employed to annihilate each submatrix, the number of CDGRs given by (3.25) is

T_{D1}(e, K, i, j) = K − ej + e_{i+j} − 1.

Hence, the total number of CDGRs applied to compute the factorization (3.23) is given by

T_D^{UGS}(e, K, G) = Σ_{i=1}^{G−1} max_{j=1,...,G−i} (K − ej + e_{i+j} − 1)
                   = (G−1)(K−1) + Σ_{i=1}^{G−1} max_{j=1,...,G−i} (e_{i+j} − ej).   (3.26)
Figure 3.7. Parallel strategies for computing the factorization (3.24): the UGS-2, UGS-3, bitonic, Greedy and Householder schemes.
The factorization (3.23) is illustrated in Fig. 3.8 without showing the lower-triangular matrix A^(1), where each submatrix of Ā^(1) is annihilated using only the UGS-2 or Greedy schemes, K = 10, G = 4 and e = (2, 3, 6, 8). This particular example shows that both schemes require the application of the same number of CDGRs to compute the factorization. However, for problems where the number of rows far exceeds the number of columns in each submatrix, the Greedy method will require fewer steps than the other schemes.
Figure 3.8. Computing the factorization (3.23): using only the UGS-2 scheme (top) and using only the Greedy scheme (bottom).
The intrinsically independent annihilation of the submatrices in a block-subdiagonal of Ā^(1) makes this factorization strategy well suited for distributed memory systems, since it does not involve any inter-processor communication.
However, the diagonally-based method has the drawback that the computational complexity at stage i (i = 1, ..., G − 1) is dominated by the maximum number of CDGRs required to annihilate the submatrices Ā^(i)_{i+1,1}, ..., Ā^(i)_{G,G−i}. An alternative approach (called the column-based method), which removes this drawback, is to start by annihilating simultaneously the submatrices Ā1, ..., Ā_{G−1}, where

Āj = ( Ā^(1)T_{j+1,j}  ···  Ā^(1)T_{G,j} )T,   j = 1, ..., G − 1.   (3.27)
Consider the case of using the UGS-2 scheme. Initially UGS-2 is applied to annihilate the matrix Ā^(1) under the assumption that it is dense. As a result, the steps within the zero submatrices are eliminated and the remaining steps are adjusted so that the sequence starts from step 1. Figure 3.9 shows the derivation of this sequence using the same problem dimensions as in Fig. 3.8. Generally, for ρ1 = 1, ρj = ρ_{j−1} + 2ej − K (1 < j < G) and μ = min(ρ1, ..., ρ_{G−1}), the annihilation of the submatrix Āi starts at step

si = ρi − μ + 1,   i = 1, ..., G − 1.

The number of CDGRs needed to compute the factorization (3.23) is given by

T_C^{UGS}(e, K, G, μ) = E + K − 2e1 − μ.    (3.28)

Comparison of T_D^{UGS}(e, K, G) and T_C^{UGS}(e, K, G, μ) shows that, when the UGSs are used, the diagonally-based method never performs better than the column-based method. Both methods need the same number of steps in the exceptional case where G = 2.
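With (3.26) and (3.28) as stated above, the two methods can be compared on the example of Fig. 3.8 (K = 10, G = 4, e = (2, 3, 6, 8)); the sketch below assumes 0-based tuples:

```python
def diag_cdgrs(e, K):
    """T_D^UGS of (3.26): (G-1)(K-1) + sum over i of
    max_j (e[i+j] - e[j])."""
    G = len(e)
    total = (G - 1) * (K - 1)
    for i in range(1, G):
        total += max(e[i + j] - e[j] for j in range(G - i))
    return total

def col_cdgrs(e, K):
    """T_C^UGS of (3.28): E + K - 2*e1 - mu, with rho_1 = 1,
    rho_j = rho_{j-1} + 2*e_j - K and mu = min(rho_1, ..., rho_{G-1})."""
    G = len(e)
    rho = [1]
    for j in range(1, G - 1):
        rho.append(rho[-1] + 2 * e[j] - K)
    return sum(e) + K - 2 * e[0] - min(rho)
```

For this example the diagonally-based count matches the 41 steps of the Fig. 3.8 sequences, while the column-based count is smaller, in line with the comparison above.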
The
column-based
method employing the
Greedy
scheme is illustrated in
Fig. 3.10. The first sequence is the result
of
directly applying the Greedy
-(i) -(i)
scheme on the Ai+l,l' ... ,Ai+G-i,G-i submatrices. Let the columns of each
submatrix be numbered from right to left, that is, in reverse order.
The number of elements annihilated by the $q$th ($q > 0$) CDGR in the $j$th ($j = 1, \ldots, K - e_i$) column of the $i$th submatrix $\bar{A}_i$ is given by

$$r_j^{(i,q)} = \lfloor (a_j^{(i,q)} + 1)/2 \rfloor,$$

where $a_j^{(i,q)}$ is defined as

$$a_j^{(i,q)} = \begin{cases}
0 & \text{if } j > q \text{ and } j > K - e_i, \\
e_i + 1 & \text{if } q = j = 1, \\
a_j^{(i,q-1)} + r_{j-1}^{(i,q-1)} - r_j^{(i,q-1)} + r_{K-e_{i-1}}^{(i-1,q-1)} & \text{if } j = 1 \text{ and } q > 1, \\
a_j^{(i,q-1)} + r_{j-1}^{(i,q-1)} - r_j^{(i,q-1)} & \text{otherwise.}
\end{cases}$$
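The double recursion above can be evaluated with straightforward memoization. The sketch below is only an illustration of how such an evaluation might be coded: the boundary conventions ($r = 0$ whenever $i$, $q$ or $j$ falls outside its range) and the reading of the first case simply as "$j > q$" are my assumptions, since the piecewise definition is ambiguous in the transcript.

```python
from functools import lru_cache

# Illustrative evaluation of r_j^{(i,q)} = floor((a_j^{(i,q)} + 1)/2) with
# the piecewise recursion for a_j^{(i,q)}. Boundary values (r = 0 outside
# the index ranges) and treating the first case as "j > q" are assumptions.

def greedy_counts(e, K):
    """e = [e_1, ..., e_{G-1}]; returns memoized functions r(i, q, j), a(i, q, j)."""

    @lru_cache(maxsize=None)
    def r(i, q, j):
        if i < 1 or q < 1 or j < 1:        # assumed boundary convention
            return 0
        return (a(i, q, j) + 1) // 2

    @lru_cache(maxsize=None)
    def a(i, q, j):
        if q == 1 and j == 1:
            return e[i - 1] + 1            # initial number of column elements
        if j > q:                          # column j not yet active at step q
            return 0
        # spill-over from the last column of the preceding submatrix (j = 1 only)
        spill = r(i - 1, q - 1, K - e[i - 2]) if j == 1 and i > 1 else 0
        return a(i, q - 1, j) + r(i, q - 1, j - 1) - r(i, q - 1, j) + spill

    return r, a

def total_steps(e, K, q_max=1000):
    """Smallest q with r_j^{(i,q)} = 0 for all i, j (the termination step)."""
    r, _ = greedy_counts(e, K)
    G = len(e) + 1
    for q in range(1, q_max):
        if all(r(i, q, j) == 0
               for i in range(1, G)
               for j in range(1, K - e[i - 1] + 1)):
            return q
    return None
```

For a single submatrix ($G = 2$) with $e_1 = 3$ and $K = 5$, this evaluation gives $r_1^{(1,1)} = 2$ and termination at step 6.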
[Figure: grids of CDGR step numbers for the two annihilation sequences; panels labelled UGS-2 and Modified UGS-2.]

Figure 3.9. The column-based method using the UGS-2 scheme.
The sequence terminates at step $q$ if $r_j^{(i,q)} = 0$ for all $i, j$.
The second sequence in Fig. 3.10, called Modified Greedy, is generated from the application of the Greedy algorithm in [69] by employing the same technique used for deriving the column-based sequence with the UGS-2 scheme. Notice, however, that the second Greedy sequence does not correspond to, and is not as efficient as, the former sequence, which applies the Greedy method directly