Parallel Algorithms for Linear Models: Numerical Methods and Estimation Problems (Advances in Computational Economics 15), Erricos John Kontoghiorghes (Auth.)



PARALLEL ALGORITHMS FOR LINEAR MODELS


Advances in Computational Economics
VOLUME 15

SERIES EDITORS
Hans Amman, University of Amsterdam, Amsterdam, The Netherlands
Anna Nagurney, University of Massachusetts at Amherst, USA

EDITORIAL BOARD
Anantha K. Duraiappah, European University Institute
John Geweke, University of Minnesota
Manfred Gilli, University of Geneva
Kenneth L. Judd, Stanford University
David Kendrick, University of Texas at Austin
Daniel McFadden, University of California at Berkeley
Ellen McGrattan, Duke University
Reinhard Neck, University of Klagenfurt
Adrian R. Pagan, Australian National University
John Rust, University of Wisconsin
Berc Rustem, University of London
Hal R. Varian, University of Michigan

The titles published in this series are listed at the end of this volume.


Parallel Algorithms for Linear Models
Numerical Methods and Estimation Problems

by
Erricos John Kontoghiorghes
Université de Neuchâtel, Switzerland

Springer Science+Business Media, LLC


Library of Congress Cataloging-in-Publication Data

Kontoghiorghes, Erricos John.
Parallel algorithms for linear models: numerical methods and estimation problems / by Erricos John Kontoghiorghes.
p. cm. -- (Advances in computational economics; v. 15)
Includes bibliographical references and indexes.
ISBN 978-1-4613-7064-2  ISBN 978-1-4615-4571-2 (eBook)
DOI 10.1007/978-1-4615-4571-2
1. Linear models (Statistics)--Data processing. 2. Parallel algorithms. I. Title. II. Series.
QA276 .K645 2000
519.5'35--dc21
99-056040

Copyright 2000 by Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers, New York in 1992. Softcover reprint of the hardcover 1st edition 1992.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.


    To Laurence and Louisa


Contents

List of Figures  ix
List of Tables  xi
List of Algorithms  xiii
Preface  xv

1. LINEAR MODELS AND QR DECOMPOSITION  1
   1 Introduction  1
   2 Linear model specification  1
      2.1 The ordinary linear model  2
      2.2 The general linear model  7
   3 Forming the QR decomposition  10
      3.1 The Householder method  11
      3.2 The Givens rotation method  13
      3.3 The Gram-Schmidt orthogonalization method  16
   4 Data parallel algorithms for computing the QR decomposition  17
      4.1 Data parallelism and the MasPar SIMD system  17
      4.2 The Householder method  19
      4.3 The Gram-Schmidt method  21
      4.4 The Givens rotation method  22
      4.5 Computational results  23
   5 QRD of large and skinny matrices  23
      5.1 The CPP GAMMA SIMD system  24
      5.2 The Householder QRD algorithm  25
      5.3 QRD of skinny matrices  27
   6 QRD of a set of matrices  29
      6.1 Equal size matrices  29
      6.2 Matrices with different number of columns  34

2. OLM NOT OF FULL RANK  39
   1 Introduction  39
   2 The QLD of the coefficient matrix  40
      2.1 SIMD implementation  41
   3 Triangularizing the lower trapezoid  43
      3.1 The Householder method  43
      3.2 The Givens method  46
   4 Computing the orthogonal matrices  49
   5 Discussion  54

3. UPDATING AND DOWNDATING THE OLM  57
   1 Introduction  57
   2 Adding observations  58
      2.1 The hybrid Householder algorithm  60
      2.2 The Bitonic and Greedy Givens sequences  67
      2.3 Updating with a block lower-triangular matrix  75
      2.4 QRD of structured banded matrices  82
      2.5 Recursive and linearly constrained least-squares  87
   3 Adding exogenous variables  90
   4 Deleting observations  92
      4.1 Parallel strategies  94
   5 Deleting exogenous variables  99

4. THE GENERAL LINEAR MODEL  105
   1 Introduction  105
   2 Parallel algorithms  108
   3 Implementation and performance analysis  111

5. SURE MODELS  117
   1 Introduction  117
   2 The generalized linear least squares method  121
   3 Triangular SURE models  123
      3.1 Implementation aspects  127
   4 Covariance restrictions  129
      4.1 The QLD of the block bi-diagonal matrix  133
      4.2 Parallel strategies  138
      4.3 Common exogenous variables  140

6. SIMULTANEOUS EQUATIONS MODELS  147
   1 Generalized linear least squares  149
      1.1 Estimating the disturbance covariance matrix  151
      1.2 Redundancies  152
      1.3 Inconsistencies  153
   2 Modifying the SEM  154
   3 Linear Equality Constraints  157
      3.1 Basis of the null space and direct elimination methods  158
   4 Computational Strategies  160

References  163
Author Index  177
Subject Index  179


List of Figures

1.1 Geometric interpretation of least-squares for the OLM problem.  4
1.2 Illustration of Algorithm 1.2, where m = 4 and n = 3.  15
1.3 The column and diagonally based Givens sequences for computing the QRD.  15
1.4 Cyclic mapping of a matrix and a vector on the MasPar MP-1208.  18
1.5 Examples of Givens rotations schemes for computing the QRD.  22
1.6 Execution time ratio between 2-D and 3-D algorithms for computing the QRDs, where G = 16.  34
1.7 Stages of computing the QRDs (1.47).  36
2.1 Annihilation pattern of (2.4) using Householder reflections.  44
2.2 Givens sequences for computing the orthogonal factorization (2.4).  47
2.3 Illustration of the implementation phases of PGS, where es = 4.  49
2.4 The fill-in of the submatrix P_{1:n,1:n} at each phase of Algorithm 2.4.  53
3.1 Updating Givens sequences for computing the orthogonal factorizations (3.6), where k = 8 and n = 4.  59
3.2 Ratio of the execution times produced by the models of the cyclic-layout and column-layout implementations.  63
3.3 Computing (3.21) using Givens rotations.  71
3.4 The bitonic algorithm, where n = 6, k = 18 and p_1 = p_2 = p_3 = 6.  72
3.5 The Greedy sequence for computing (3.6a), where n = 6 and k = 18.  73
3.6 Computing the factorization (3.23) using the diagonally-based method, where G = 5.  76
3.7 Parallel strategies for computing the factorization (3.24).  77
3.8 Computing factorization (3.23).  78
3.9 The column-based method using the UGS-2 scheme.  80
3.10 The column-based method using the Greedy scheme.  81
3.11 Illustration of the annihilation patterns of method-1.  83
3.12 Computing (3.31) for b = 8, ϑ* = 3 and j = 2. Only the affected matrices are shown.  85
3.13 Illustration of method-3, where p = 4 and g = 1.  86
3.14 Givens parallel strategies for downdating the QRD.  96
3.15 Illustration of the SK-based scheme for computing the QRD of RS.  102
3.16 Greedy-based schemes for computing the QRD of RS.  104
4.1 Sequential Givens sequences for computing the QLD (4.3a).  107
4.2 The SK sequence.  109
4.3 G(16)B with e(16,18,8) = 8.  109
4.4 The application of the SK sequence to compute (4.3) on a 2-D SIMD computer.  109
4.5 Examples of the MSK(p) sequence for computing the QLD.  110
5.1 The correlations ρ_{i,j} in the SURE-CC model for ϑ_i = i and ϑ_i = 1/i.  131
5.2 Factorization process for computing the QLD (5.35) using Algorithm 5.3.  136
5.3 Annihilation sequences of computing the factorization (5.40).  137
5.4 Givens sequences of computing the factorization (5.45).  138
5.5 Number of CDGRs for computing the orthogonal factorization (5.40) using the PDS.  139
5.6 Annihilation sequence of triangularizing (5.55).  144
6.1 Givens sequence for computing the QRD of RS_i.  161


List of Tables

1.1 Times (in seconds) of computing the QRD of a 128M x 64N matrix.  23
1.2 Execution times (in seconds) of the CPP LALIB QR_FACTOR subroutine and the BPHA.  27
1.3 Execution times (in seconds) of the CPP LALIB QR_FACTOR subroutine and Algorithm 1.9.  28
1.4 Times (in seconds) of simultaneously computing the QRDs (1.47).  33
1.5 The task-farming and scattering methods for computing the QRDs (1.47).  38
2.1 Computing the QLD (2.3) (in seconds), where m = Mes and n = Nes.  44
2.2 The CDGRs of the PGS for computing the factorization (2.4).  47
2.3 Computing (2.4) (in seconds), where k = Kes and n − k = ηes.  50
2.4 Times (in seconds) of reconstructing the orthogonal matrices Q^T and P on the DAP.  54
3.1 Execution times (msec) of the Householder and Givens methods for updating the QRD on the DAP.  60
3.2 Execution times (in seconds) for k = 11264.  63
3.3 Execution times (in seconds) of the RLS Householder algorithm on the MasPar.  65
3.4 Execution times (in seconds) of the RLS Householder algorithm on the GAMMA.  66
3.5 Number of CDGRs required to compute the factorization (3.6a).  74
3.6 Times (in seconds) for computing the orthogonal factorization (3.6a).  74
3.7 Computing the QRD of a structured banded matrix using method-3.  87
3.8 Estimated time (msec) required to compute x(i) (i = 2, 3, ...), where m_i = 96, n = 32N and k = 32K.  91
3.9 Execution time (in seconds) for downdating the OLM.  98
4.1 Execution times (in seconds) of the MSK(Aes/2).  114
4.2 Computing (4.3) (in seconds) without explicitly constructing Q^T and P.  115
5.1 Computing (5.24), where T − k − 1 = τes, G − 1 = μes and es = 32.  128
5.2 Execution times of Algorithm 5.2 for solving RΓ = Δ.  129


List of Algorithms

1.1 Computing the QRD of A ∈ ℝ^{m×n} using Householder transformations.  12
1.2 The column-based Givens sequence for computing the QRD of A ∈ ℝ^{m×n}.  14
1.3 The diagonally-based Givens sequence for computing the QRD of A ∈ ℝ^{m×n}.  16
1.4 The Classical Gram-Schmidt method for computing the QRD of A ∈ ℝ^{m×n}.  16
1.5 The Modified Gram-Schmidt method for computing the QRD.  17
1.6 QR factorization by Householder transformations on SIMD systems.  20
1.7 The MGS method for computing the QRD on SIMD systems.  21
1.8 The CPP LALIB method for computing the QR Decomposition.  26
1.9 Householder with parallelism in the first dimension.  28
1.10 The Householder algorithm.  30
1.11 The Modified Gram-Schmidt algorithm.  31
1.12 The task-farming approach for computing the QRDs (1.47) on p (p ≤ G) processors using a SPMD paradigm.  37
2.1 The QL decomposition of A.  43
2.2 Triangularizing the lower trapezoid using Householder reflections.  45
2.3 The reconstruction of the orthogonal matrix Q in (2.3).  51
2.4 The reconstruction of the orthogonal matrix P in (2.4).  53
3.1 The data-parallel Householder algorithm.  61
3.2 The bitonic algorithm for updating the QRD, where R ≡ R_{i−1}.  70
3.3 The computation of (3.63) using Householder transformations.  97
5.1 An iterative algorithm for solving tSURE models.  126
5.2 The parallel solution of the triangular system RΓ = Δ.  129
5.3 Computing the QLD (5.35).  135


Preface

The monograph provides a complete and detailed account of the design, analysis and implementation of parallel algorithms for solving large-scale linear models. It investigates and presents efficient, numerically stable algorithms for computing the least-squares estimators and other quantities of interest on massively parallel systems.

The least-squares computations are based on orthogonal transformations, in particular the QR and QL decompositions. Parallel algorithms employing Givens rotations and Householder transformations have been designed for various linear model estimation problems. Some of the algorithms presented are parallel versions of serial methods, while others are original designs. The implementation of the major parallel algorithms is described. The necessary techniques and insights needed for implementing efficient parallel algorithms on multiprocessor systems are illustrated in detail. Although most of the algorithms have been implemented on SIMD systems, the data parallel computations of these algorithms should, in general, be applicable to any massively parallel computer.

The monograph is in two parts. The first part consists of four chapters and deals with the computational aspects of solving linear models that have applicability in diverse areas. The remaining two chapters form the second part, which concentrates on numerical and computational methods for solving various problems associated with seemingly unrelated regression equations (SURE) and simultaneous equations models.

Chapter 1 provides a brief introduction to linear models and considers various methods for forming the QR decomposition on serial and parallel systems. Emphasis is given to the design and efficient implementation of the parallel algorithms. The second chapter investigates the performance and practical issues of solving the ordinary linear model (OLM), with the exogenous matrix being ill-conditioned or having deficient rank, on a SIMD system.

Chapter 3 is devoted to methods for up- and down-dating the OLM. It provides the necessary computational tools and techniques that are often required in econometrics and optimization. The efficient parallel strategies for modifying the OLM can be used as primitives for designing fast econometric algorithms. For example, the Givens and Householder algorithms used to compute the QR decomposition after rows have been added or columns have been deleted from the original matrix have been efficiently employed in the solution of the SURE and simultaneous equations models. The updating methods are also employed to solve the recursive ordinary linear model with linear equality constraints. The numerical methods based on the basis of the null space and direct elimination methods are in turn adopted for the solution of linearly constrained simultaneous equations models.

The fourth chapter investigates parallel algorithms for solving the general linear model - the parent model of econometrics - when it is considered as a generalized linear least-squares problem. This approach has subsequently been efficiently used to compute solutions of SURE and simultaneous equations models without having as prerequisite the non-singularity of the variance-covariance matrix of the disturbances. Chapter 5 presents a parallel algorithm for solving triangular SURE models. The problem of computing estimates of parameters in SURE models with variance inequalities and positivity of correlations constraints is also considered. Finally, chapter 6 presents algorithms for computing the three-stage least squares estimator of simultaneous equations models (SEMs). Numerical and computational methods for solving SEMs with separable linear equality constraints, and when the SEM has been modified by deleting or adding new observations or variables, are discussed. Expressions revealing linear combinations between the observations which become redundant are also presented.

These novel computational methods for solving SURE and simultaneous equations models provide new insights that can be useful to econometric modelling. Furthermore, the computationally and numerically efficient treatment of these models, which are regarded as the core of econometric theory, can be considered as the basis for future research. The algorithms can be extended or modified to deal with models that occur in particular econometric applications and have specific characteristics that need to be taken into account.

The practical issues of the parallel algorithms and the theoretical aspects of the numerical methods will be of interest to a broad range of researchers working in the areas of numerical and computational methods in statistics and econometrics, parallel numerical algorithms, parallel computing and numerical linear algebra. The aim of this monograph is to promote research in the interface of econometrics, computational statistics, numerical linear algebra and parallelism.

The research described in this monograph is based on the work that I have pursued over the last ten years. During this period I was privileged to have the opportunity to discuss various issues related to my work with Maurice Clint. His numerous suggestions and constructive comments have been both inspiring and invaluable. I am grateful to Dennis Parkinson for the valuable information that he has provided on many occasions on various aspects related to SIMD systems, David A. Belsley for his constructive comments and advice on the solution of SURE and simultaneous equations models, Hans-Heinrich Nageli for his comments and constructive criticism on performance issues of parallel algorithms, and the late Mike R.B. Clarke for his suggestions on Givens sequences and matrix computations. I am indebted to Paolo Foschi and Manfred Gilli for their comments on this monograph and to Sharon Silverne for proofreading the manuscript. The author accepts full responsibility for any errors that may be found in this work.

Some of the results of this monograph were originally published in various papers [69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 84, 85, 86, 87, 88] and are reproduced by kind permission of Elsevier Science Publishers B.V. 1993, 1994, 1995, 1999; Gordon and Breach Publishers 1993, 1995; John Wiley & Sons Limited 1996, 1999; IEEE 1993; Kluwer Academic Publishers 1997, 1999; Principia Scientia 1996, 1997; SAP-Slovak Academic Press Ltd. 1995; and Springer-Verlag 1993, 1996, 1999.


Chapter 1

LINEAR MODELS AND QR DECOMPOSITION

1 INTRODUCTION

A common problem in statistics is that of estimating parameters of some assumed relationship between one or more variables. One such relationship is

y = f(a_1, ..., a_n),        (1.1)

where y is the dependent (endogenous, explained) variable and a_1, ..., a_n are the independent (exogenous, explanatory) variables. Regression analysis estimates the form of the relationship (1.1) by using the observed values of the variables. This attempt at describing how these variables are related to each other is known as model building.

Exact functional relationships such as (1.1) are inadequate descriptions of statistical behavior. Thus, the specification of the relationship (1.1) is expressed as

y = f(a_1, ..., a_n) + ε,        (1.2)

where ε is the disturbance term or error, whose specific value in any single observation cannot be predicted. The purpose of ε is to characterize the discrepancies that emerge between the actual observed value of y and the values that would be assigned by an exact functional relationship. The difference between the observed and predicted value of y is called the residual.

2 LINEAR MODEL SPECIFICATION

A linear model is one in which y, or some transformation of y, can be expressed as a linear function of a_i, or some transformation of a_i (i = 1, ..., n). Here only linear models where endogenous and exogenous variables do not require any transformations will be considered. In this case, the relationship


(1.2) can be written as

y = a_1 x_1 + a_2 x_2 + ... + a_n x_n + ε,        (1.3)

where x_i (i = 1, ..., n) are unknown constants.

If there are m (m > n) sample observations, the linear model (1.3) gives rise to the following set of m equations

y_1 = a_11 x_1 + a_12 x_2 + ... + a_1n x_n + ε_1
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2n x_n + ε_2
 :
y_m = a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n + ε_m

or

( y_1 )   ( a_11  a_12  ...  a_1n ) ( x_1 )   ( ε_1 )
( y_2 ) = ( a_21  a_22  ...  a_2n ) ( x_2 ) + ( ε_2 )        (1.4)
(  :  )   (  :     :          :   ) (  :  )   (  :  )
( y_m )   ( a_m1  a_m2  ...  a_mn ) ( x_n )   ( ε_m )

In compact form the latter can be written as

y = Ax + ε,        (1.5)

where y, ε ∈ ℝ^m, A ∈ ℝ^{m×n} and x ∈ ℝ^n.

To complete the description of the linear model (1.5), characteristics of the error term ε and the matrix A must be specified. The first assumption is that the expected value of ε is zero, that is, E(ε) = 0. The second assumption is that the various values of ε are normally distributed. The final assumption is that A is a non-stochastic matrix, which implies E(A^T ε) = 0. In summary, the complete mathematical specification of the (general) linear model which is being considered is

y = Ax + ε,   ε ~ N(0, σ²Ω).        (1.6)

The notation ε ~ N(0, σ²Ω) indicates that the error vector is assumed to come from a normal distribution with mean zero and variance-covariance (or dispersion) matrix σ²Ω, where Ω is a symmetric non-negative definite matrix and σ² is an unknown scalar [124].

2.1 THE ORDINARY LINEAR MODEL

Consider the Ordinary Linear Model (OLM):

y = Ax + ε,   ε ~ N(0, σ²I_m).        (1.7)


The OLM assumptions are that each ε_i has the same variance and all disturbances are pairwise uncorrelated. That is, Var(ε_i) = σ² and, for all i ≠ j, E(ε_i ε_j) = 0. The first assumption is known as homoscedasticity (homogeneous variances).

The most frequently used estimating technique for the OLM (1.7) is least squares. Least-squares (LS) estimation involves minimizing the sum of squares of residuals: that is, finding an n element vector x which minimizes

e^T e = (y − Ax)^T (y − Ax).        (1.8)

For the minimization of (1.8), e^T e is differentiated with respect to x, which is treated as a variable vector, and the differentials are equated to zero. Thus,

∂(e^T e)/∂x = −2y^T A + 2x^T A^T A

and, setting ∂(e^T e)/∂x = 0, gives the least-squares normal equations

A^T A x = A^T y.        (1.9)

Assuming that A is of full column rank, that is, (A^T A)^{−1} exists, the least squares estimator can be computed as

x̂ = (A^T A)^{−1} A^T y        (1.10)

and the variance-covariance matrix of the estimator x̂ is given by

Var(x̂) = σ² (A^T A)^{−1}.        (1.11)

The terminology of normal equations is expressed in terms of the following geometric interpretation of least-squares. The columns of A span a subspace in ℝ^m which is referred to as the manifold of A and is denoted by M(A). The dimension of M(A) cannot exceed n and can only be equal to n if A is of full column rank. The vector Ax resides in M(A), but the vector y lies outside M(A), where it is assumed that ε ≠ 0 in (1.7). For each different vector x there is a corresponding vector of residuals e, so that y is the sum of the two vectors Ax and e. The length of e needs to be minimized and this is achieved by making the residual vector e perpendicular to M(A) (see Fig. 1.1). This implies that e = y − Ax must be orthogonal to any linear combination of the columns of A. If Ac is any such linear combination, where c is non-zero, then the orthogonality condition gives c^T A^T (y − Ax) = 0, from which the least squares normal equations (1.9) are derived.

Among econometricians, Maximum Likelihood (ML) is another popular technique for deriving estimators of linear models. The likelihood function of the observed sample is the probability density function of y, namely,

L(x, σ²) = (2πσ²)^{−m/2} e^{−(y−Ax)^T (y−Ax) / (2σ²)},


Figure 1.1. Geometric interpretation of least-squares for the OLM problem.

where e denotes the exponential function. Maximizing L(x, σ²) is equivalent to maximizing ln L(x, σ²), where

ln L(x, σ²) = −(m/2) ln(2π) − (m/2) ln(σ²) − (1/(2σ²)) (y − Ax)^T (y − Ax).

Setting ∂ ln L/∂x = 0 and ∂ ln L/∂σ² = 0, the ML estimators are obtained as

x̂_ML = (A^T A)^{−1} A^T y        (1.12)

and

σ̂²_ML = (y − Ax̂_ML)^T (y − Ax̂_ML) / m.        (1.13)

The ML estimator x̂_ML is identical to the least-squares estimator x̂, which is the best linear unbiased estimator (BLUE) of x in (1.7). If x̂ is the BLUE of x, it follows that E(x̂) = x and, for all q ∈ ℝ^n, Var(q^T x̂) ≤ Var(q^T x̃), where x̃ is any linear unbiased estimator of x. However, the ML estimator σ̂²_ML differs from the unbiased estimator of σ², which is given by

σ̂² = (y − Ax̂)^T (y − Ax̂) / (m − n).        (1.14)

Numerous methods exist for solving the least-squares problem. Some of the best known methods are Gaussian elimination, Gauss-Jordan elimination, LU decomposition, Cholesky factorization, Singular Value Decomposition (SVD) and QR decomposition (QRD). When the coefficient matrix is large and sparse, these methods, which are called direct methods, suffer from fill-in and they can be impracticable. In such cases, iterative methods are more efficient, even though there are intelligent adaptations of the direct methods which minimize the fill-in. Iterative methods (e.g. Conjugate Gradient) have the advantage that minimal storage space is required for implementation, since no fill-in of the zero positions of the coefficient matrix occurs during computation. Furthermore, if a good initial guess is known and preconditioning can be used, the iterative methods can converge in an acceptable number of steps. Full details of the direct and iterative methods are given in textbooks such as [7, 40, 51, 93, 98].


Here the numerically reliable direct method of QR decomposition (QRD) is used under the assumption that the coefficient matrix is dense. The QRD of the explanatory data matrix A is given by

A = Q ( R )
      ( 0 ),        (1.15)

where Q ∈ ℝ^{m×m} is orthogonal, i.e. it satisfies Q^T Q = QQ^T = I_m, and R ∈ ℝ^{n×n} is upper triangular (the zero block has m − n rows). Substituting (1.15) in the normal equations (1.9) gives

R^T R x = R^T y_1,   where   Q^T y = ( y_1 )
                                     ( y_2 ),

with y_1 having n and y_2 having m − n elements. Under the assumption that A is of full column rank, which implies that R is non-singular, the least-squares estimator of the OLM (1.7) is computed by solving the upper triangular system of equations

R x = y_1.        (1.16)

Another approach to deriving (1.16) is to use the property of orthogonal transformation matrices, which leave the Euclidean length of a vector invariant. The Euclidean length or 2-norm of z ∈ ℝ^m is given by ‖z‖ = (z^T z)^{1/2} and so ‖Qz‖² = z^T Q^T Q z = z^T z = ‖z‖². Hereafter, the Euclidean norm will be denoted by ‖·‖. In the context of minimizing ‖e‖², it follows that

x̂ = argmin_x ‖e‖²
   = argmin_x ‖Q^T e‖²
   = argmin_x ‖Q^T y − Q^T A x‖²
   = argmin_x ‖ ( y_1 − Rx ) ‖²
               (    y_2    )
   = argmin_x ( ‖y_1 − Rx‖² + ‖y_2‖² )
   = R^{−1} y_1.

The quantity ‖y − Ax̂‖² = ‖y_2‖² is termed the residual sum of squares (RSS).
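The computation in (1.15)-(1.16) is straightforward to reproduce. The following Python sketch (an illustration added to this transcript, not the monograph's own code; the function name is arbitrary) computes the least-squares estimator and the RSS through the QRD:

    import numpy as np
    from scipy.linalg import solve_triangular

    def olm_qrd(A, y):
        """Least-squares estimator of the OLM y = Ax + e via the QRD (1.15)-(1.16)."""
        m, n = A.shape
        Q, R = np.linalg.qr(A, mode='complete')   # Q is m x m orthogonal, R is m x n
        Qty = Q.T @ y
        y1, y2 = Qty[:n], Qty[n:]                 # conformable partition of Q^T y
        x = solve_triangular(R[:n, :n], y1)       # back-substitution in Rx = y1 (1.16)
        rss = y2 @ y2                             # RSS = ||y2||^2
        return x, rss

For a full column rank A the result coincides with (A^T A)^{−1} A^T y of (1.10), but the QR route avoids forming the cross-product matrix A^T A, whose condition number is the square of that of A.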


2.1.1 OLM WITH LINEAR EQUALITY RESTRICTIONS

Many regression models incorporate additional information in the form of restrictions (constraints) on the parameters of the model. In these cases the models are called restricted or constrained models. If the vector x of the OLM (1.7) is subject to k consistent restrictions expressed as

Cx = d,        (1.17)

where C is a k × n matrix and d is of order k, then the restricted least squares (RLS) solution is an n element vector x* satisfying

x* = argmin_{Cx=d} ‖y − Ax‖².        (1.18)

The assumptions for the matrix C are rank(C) = k and k < n.

[...]

...2(Mes1, Nes2) and, using regression analysis, the estimated execution time (seconds × 10⁻²) of Algorithm 1.6 is found to be

T1(M,N) = N(14.15 + 3.09N − 0.62N² + 5.71M + 3.67MN).

The above timing model includes the overheads, which arise mainly from the reference to the submatrix A_{i:,i:} in line 3. This matrix reference results in the assignment of an array section of A to a temporary array and then, when the procedure transform in line 6 has been completed, the reassignment of the temporary array to A. The overheads can be reduced by referencing a submatrix of A only if it uses fewer memory layers than a previously extracted submatrix (see for example the Modified Gram-Schmidt algorithm). This slight modification improves significantly the execution time of the algorithm, which now becomes

T2(M,N) = N(14.99 + 2.09N − 0.20N² + 3.19M + 1.17MN).

The accuracy of the timing models is illustrated in Table 1.1.
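Since the timing models are simple polynomials in M and N, their accuracy can be checked directly against the measured entries of Table 1.1. A small Python sketch (added here for illustration, using only the coefficients quoted above):

    # Timing models of Algorithm 1.6 on the MasPar MP-1208,
    # in units of 10^-2 seconds as quoted in the text.
    def T1(M, N):
        return N * (14.15 + 3.09*N - 0.62*N**2 + 5.71*M + 3.67*M*N)

    def T2(M, N):
        return N * (14.99 + 2.09*N - 0.20*N**2 + 3.19*M + 1.17*M*N)

    # For a 1280 x 192 matrix (M = 10, N = 3), cf. the first row of Table 1.1:
    print(T1(10, 3) / 100)   # 5.55 s predicted (measured 5.48 s)
    print(T2(10, 3) / 100)   # 2.59 s predicted (measured 2.58 s)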


4.3 THE GRAM-SCHMIDT METHOD

As in the case of the Householder algorithm, the performance of the straightforward implementation of the MGS method will be significantly impaired by the overheads. Therefore, the n = Nes2 steps of the MGS method are used in N stages. At the ith stage, es2 steps are used to orthogonalize the (i − 1)es2 + 1 to ies2 columns of A and also to construct the corresponding rows of R. Each step of the ith (i = 1, ..., N) stage has the same execution time, namely φ1(Mes1, (N − i + 1)es2). Thus, the execution time to apply all Nes2 steps of the MGS method is given by

T̃3(Mes1, Nes2) = es2 Σ_{i=1}^{N} φ1(Mes1, (N − i + 1)es2).

The data parallel MGS orthogonalization method is given in Algorithm 1.7, where A is overwritten by Q1 ≡ Q̃, the orthogonal basis of A. The total execution time of Algorithm 1.7 is given by

T3(M,N) = N(9.15 + 3.12N − 0.01N² + 4.95M + 1.31MN).

Algorithm 1.7 The MGS method for computing the QRD on SIMD systems.
1: def MGS_QRD(A, Mes1, Nes2) =
2:   for i = 1, ..., Nes2 with steps es2 do
3:     apply orthogonal(A:,i:, Ri:,i:, Mes1, (N − i + 1)es2)
4:   end for
5: end def
6: def orthogonal(A, R, m, n) =
7:   for i = 1, ..., es2 do
8:     Ri,i := sqrt(sum(A:,i * A:,i))
9:     A:,i := A:,i / Ri,i
10:    forall (j = i + 1 : n) W:,j := A:,i * A:,j
11:    Ri,i+1: := sum(W:,i+1:, 1)
12:    forall (j = i + 1 : n) A:,j := A:,j − Ri,j * A:,i
13:  end for
14: end def
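For readers who wish to experiment with the method, the following NumPy sketch (an added illustration, not the monograph's F-PLUS code; the function name is ours) mirrors the inner procedure orthogonal of Algorithm 1.7: each column is normalized and the remaining columns are updated with data-parallel array operations rather than explicit inner loops:

    import numpy as np

    def mgs_qrd(A):
        """Modified Gram-Schmidt QRD: A (m x n) is overwritten by Q1; R is returned."""
        A = A.astype(float).copy()
        m, n = A.shape
        R = np.zeros((n, n))
        for i in range(n):
            R[i, i] = np.sqrt(np.sum(A[:, i] * A[:, i]))    # line 8 of Algorithm 1.7
            A[:, i] /= R[i, i]                               # line 9
            W = A[:, i][:, None] * A[:, i+1:]                # line 10, all j at once
            R[i, i+1:] = W.sum(axis=0)                       # line 11
            A[:, i+1:] -= np.outer(A[:, i], R[i, i+1:])      # line 12
        return A, R

Then Q1, R = mgs_qrd(A) satisfies A ≈ Q1 @ R with Q1 having orthonormal columns.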

It can be seen from Table 1.1 that the (improved) Householder method performs better than the MGS method. The difference in the performance of the two methods arises mainly because, at the ith step, the MGS and Householder methods work with m × (n − i + 1) and (m − i + 1) × (n − i + 1) matrices, respectively. An analysis of T2(M,N) and T3(M,N) reveals that, for M > N, the MGS algorithm is expected to perform better than the Householder algorithm only when N = 1 and M = 2.

4.4 THE GIVENS ROTATION METHOD

A Givens rotation, when applied from the left of a matrix, affects only two of its rows; thus a number of them can be applied simultaneously. This particular feature underpins the development of parallel Givens algorithms for solving a range of matrix factorization problems [29, 30, 69, 94, 95, 102, 103, 129].

The orthogonal matrix Q^T in (1.32a) is the product of a sequence of Compound Disjoint Givens Rotations (CDGRs), with each compound rotation reducing to zero elements of A below the main diagonal while preserving previously annihilated elements. Figure 1.5 shows two sequences of CDGRs for computing the QRD of a 12 x 6 matrix, where a numerical entry denotes an element annihilated by the corresponding CDGR. The first Givens sequence was developed by Sameh and Kuck [129]. This sequence - the SK sequence - applies a total of m + n − 2 CDGRs to triangularize an m × n matrix (m > n), compared to the n(2m − n − 1)/2 Givens rotations needed when the serial Algorithm 1.2 is used. The elements are annihilated by rotating adjacent rows. The second Givens sequence - the Greedy sequence - applies fewer CDGRs than the SK sequence but, when it comes to implementation, the advantage of the Greedy sequence is offset by the communication overheads arising from the construction and application of the compound rotations [30, 67, 102, 103]. For m ≫ n, the Greedy sequence applies approximately log m + (n − 1) log log m CDGRs.

Figure 1.5. Examples of Givens rotations schemes for computing the QRD: (a) the SK sequence; (b) the Greedy sequence.
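As a small illustration of the building block involved (added here for clarity; the monograph works with compound rotations on SIMD hardware, and the function names below are ours), the following Python sketch constructs a single Givens rotation and uses it to annihilate one element by rotating two adjacent rows, as in the SK sequence:

    import numpy as np

    def givens(a, b):
        """Return c, s such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
        if b == 0.0:
            return 1.0, 0.0
        r = np.hypot(a, b)
        return a / r, b / r

    def rotate_adjacent_rows(A, i, j):
        """Annihilate A[i, j] by rotating rows i-1 and i of A (in place)."""
        c, s = givens(A[i - 1, j], A[i, j])
        A[[i - 1, i], :] = np.array([[c, s], [-s, c]]) @ A[[i - 1, i], :]
        return A

A CDGR groups many such disjoint rotations (rotations sharing no rows) into one parallel step.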

The adaptation, implementation and performance evaluation of the SK sequence to compute various forms of orthogonal factorizations on SIMD systems will be discussed in the subsequent chapters. On the MP-1208, the execution time of computing the QRD of an Mes1 × Nes2 matrix using the SK sequence is found to be

T4(M,N) = N(25.64 + 5.51N − 7.94N² + 11.1M + 15.99MN) + 41.96M.

4.5 COMPUTATIONAL RESULTS

The Householder factorization method is found to be the most efficient in terms of speed, followed by the MGS algorithm, which is only slightly slower than the data parallel Householder algorithm. Use of the SK sequence produces by far the worst performance.

The comparison of the performances of the data parallel implementations was made using accurate timing models. These models provide an effective tool for measuring the computational speed of algorithms and they can also be used to reveal inefficiencies of parallel implementations [80]. Comparisons with performance models of various algorithms implemented on other similar SIMD systems demonstrate the scalability of the execution time models [67, 75]. If the dimensions of the data matrix A do not satisfy the assumption that they are multiples of the size of the physical array processor, then the timing models can be used to give a range of the expected execution times of the algorithms.

Table 1.1. Times (in seconds) of computing the QRD of a 128M x 64N matrix. For each algorithm the measured execution time and the timing-model prediction (model value × 10⁻²) are given.

            Algor. 1.6        Improved Algor. 1.6   Algor. 1.7         Algor. SK
 M   N    Exec.   T1×10⁻²    Exec.   T2×10⁻²      Exec.   T3×10⁻²    Exec.   T4×10⁻²
10   3     5.48     5.55      2.58     2.59        3.21     3.22      21.16    21.04
10   7    22.15    22.34      9.28     9.34       12.07    12.08      67.73    67.59
10   9    33.80    34.09     13.80    13.92       18.49    18.48      92.65    92.62
14   5    17.48    17.54      7.36     7.35        9.30     9.30      62.27    62.36
14   9    47.86    48.03     18.75    18.86       24.52    24.50     150.05   150.11
14  13    90.30    90.56     34.38    34.54       46.64    46.64     242.63   242.68
18   5    22.34    22.35      9.14     9.15       11.60    11.59      82.03    82.25
18   9    61.90    61.98     23.77    23.79       30.68    30.52     207.47   207.61
18  17   189.16   189.03     69.42    69.31       94.41    94.26     503.44   503.51
22   7    48.61    48.71     19.10    18.90       23.98    23.93     175.85   175.96
22  15   188.79   188.49     68.88    68.57       89.86    89.82     585.56   585.64
22  19   287.23   286.33    103.31   102.81      138.59   138.27     805.45   805.71

5 QRD OF LARGE AND SKINNY MATRICES

The development of SIMD algorithms to compute the QRD when matrices do not have dimensions which are multiples of the physical array processor size is considered [90]. Implementation aspects of the QRD algorithm from the Cambridge Parallel Processing (CPP) linear algebra library (LALIB) are investigated [19]. The LALIB QRD algorithm is a data-parallel version of the serial Householder algorithm proposed by Bowgen and Modi [17]. The performances of Algorithm 1.6 and the QRD LALIB routine are compared. A second Householder algorithm, which is efficient for skinny matrices, is also proposed.

5.1 THE CPP GAMMA SIMD SYSTEM

The Cambridge Parallel Processing (CPP) GAMMA series has a Master Control Unit (MCU) and 1024 or 4096 Processing Elements (PEs) arranged in a 2-D square array. It has an interconnection network for PE-to-PE communication and for broadcast between the MCU and the PEs. The GAMMA SIMD systems are based on fine grain massively parallel computer systems known as the AMT DAP (Distributed Array of Processors) [116, 118].

A macro assembler called APAL (Array of Processors Assembly Language) is available to support low-level programming of the GAMMA-1. Two high level language systems are also available for the GAMMA-1. These are extended versions of Fortran (called Fortran-Plus enhanced or, for short, F-PLUS) and C++. These languages interact with the language that the user selects to run on the host machine, typically Fortran or C [1, 2]. Both high level languages allow the programmer to assume the availability of a virtual processor array of arbitrary size. As in the MasPar, using the default cyclic distribution, an m × n matrix is mapped on the PEs using ⌈m/es⌉⌈n/es⌉ layers of memory, while an m-element vector is mapped on the PEs using ⌈m/es²⌉ layers of memory, where es × es (es = 32 or es = 64) is the dimension of the SIMD array processor. An m × n matrix can also be considered as an array of n m-element column vectors (parallelism in the first dimension) or m n-element row vectors (parallelism in the second dimension), requiring respectively n⌈m/es²⌉ and m⌈n/es²⌉ layers of memory to map the matrices onto the PEs [25].

In most non-trivial cases, the complexity of performing a computation on an array is not reduced if some of the PEs are disabled, since the disabled PEs will become idle only during the assignment process. In such cases the programmer is responsible for avoiding computations on unaffected submatrices. To illustrate this, let h = (h_1, ..., h_m) and u = (u_1, ..., u_n) be real vectors, L ≡ (l_1, ..., l_n) a logical vector and A an m × n real matrix. The F-PLUS statement

u(L) = sumr(matc(h,n) * A)        (1.45)

is equivalent to the HPF statement

forall(i = 1 : n, l_i = true) u_i = sum(h * A:,i)


which computes the inner-product u_i = h^T A:,i for all i where l_i has value true. In Fortran-90 the F-PLUS functions sumr(A), matc(h,n) and matr(u,m) can be expressed as sum(A,1), spread(h,2,n) and spread(u,1,m), respectively. The main difference, however, between F-PLUS and HPF is that the F-PLUS statement computes all the inner-products h^T A and then assigns the results simultaneously to the elements of u where the corresponding elements of L have value true. This difference may cause degradation of the performance with respect to execution speed if the logical vector L has a significant number of false values. Consider, for example, the three cases where (i) all elements of L have a true value, (ii) the first n/2 elements of L have a true value and (iii) only the first element of L has a true value. For m = 1000 and n = 500, the execution time in msec for computing (1.45) on the 1024-processor GAMMA-1 (hereafter abbreviated to GAMMA-1) for all three cases is 249.7, while, without masking, the time required to compute all inner-products is given by 247.79. Explicitly performing operations only on the affected elements of u, the execution times (including overheads) in cases (ii) and (iii) are found to be 147.84 and 13.21, respectively. This example shows the degradation in performance that might occur when implementing an algorithm without taking into consideration the systems software of the particular parallel computer.
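The semantic difference can be reproduced with NumPy boolean masks (an added sketch; the variable names are illustrative, not from the monograph): the first variant computes every inner-product and merely masks the assignment, as F-PLUS does, while the second restricts the computation itself to the selected columns:

    import numpy as np

    m, n = 1000, 500
    h = np.random.rand(m)
    A = np.random.rand(m, n)
    L = np.zeros(n, dtype=bool)
    L[0] = True                   # case (iii): only the first element is selected
    u = np.zeros(n)

    # F-PLUS style: all n inner-products are formed, then the assignment is masked.
    u[L] = (h @ A)[L]

    # Restricted style: compute only the inner-products that are actually needed.
    u[L] = h @ A[:, L]

On a SIMD machine the first form wastes the work performed by the unselected PEs, which is the effect measured above.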

5.2 THE HOUSEHOLDER QRD ALGORITHM

The CPP LALIB implementation of the QRD Householder algorithm could be considered a straightforward one. Initially the algorithm was implemented on the AMT DAP using an earlier version of F-PLUS which required the data matrix to be partitioned into submatrices having the same dimensions as the array processor [17]. The re-implementation of the algorithm using the new F-PLUS has removed this constraint. Algorithm 1.8 shows broadly how the Householder method has been implemented in this library routine for computing the QRD. The information needed for generating the orthogonal matrix Q is stored in the annihilated parts of A and in two n-element vectors. For simplicity Algorithm 1.8 ignores this; neither does it emphasize other details of the LALIB QRD subroutine QR_FACTOR, such as those dealing with overflow, that do not play an important role in the performance of the algorithm [19].

Clearly the performance of Algorithm 1.8 is dominated by the computations in the 10th and 11th lines, while computations on logical arrays and scalars are less significant. The computation of the Euclidean norm of the m-element vector h in line 5 is a function of ⌈m/es²⌉ and is therefore important only for large matrices, where m ≫ es². Notice that the first i − 1 elements of h are zero and the corresponding rows and columns of A remain unchanged. Thus the computations in lines 5, 10 and 11 can be written as follows:

u_{i:n} := sumr(matc(h_{i:m}, n − i + 1) * A_{i:m,i:n}) / b_i
A_{i:m,i:n} := A_{i:m,i:n} − matc(h_{i:m}, n − i + 1) * matr(u_{i:n}, m − i + 1)

Algorithm 1.8 The CPP LALIB method for computing the QR Decomposition.
1: L := true; M := true
2: for i = 1, 2, ..., n do
3:   h := 0
4:   h(L) := A:,i
5:   σ := sqrt(sum(h * h))
6:   if h_i ...

[...]

... where A_i ∈ ℝ^{m×n_i} (m > n_i) is the exogenous full column rank matrix in the ith regression equation, Q_i is an m × m orthogonal matrix and R_i is an upper triangular matrix of order n_i [78]. The fast simultaneous computation of the QRDs (1.47) is considered.

6.1 EQUAL SIZE MATRICES

Consider, initially, the case where the matrices A_1, ..., A_G have the same dimension, that is, n_1 = ... = n_G = n. The equal-size matrices suggest that a 3-D array could be employed. The m × n data matrices A_1, ..., A_G and the upper triangular factors R_1, ..., R_G can be arranged in an m × n × G array A and the n × n × G array R, respectively. Using a 2-D mapping, computations performed on scalars, 1-D and 2-D arrays correspond to computations on 1-D, 2-D and 3-D arrays when a 3-D mapping is used. Thus, in theory, the advantage over a 2-D mapping is that a 3-D arrangement will increase the level of parallelism.

The algorithms have been implemented on the 8192-processor MasPar MP-1208, using the high level language MasPar-Fortran. On the MasPar, the 3-D arrangement of the equal-size matrices is mapped on the 2-D array of PEs plus memory, with computations over the third dimension being performed serially. This indicates that under a 3-D arrangement the increase in parallelism will not be as large as is theoretically expected.

The indexing expressions of 2-D matrices and the replication and reduction functions can be used in a 3-D framework. That is, the function spread, which replicates an array by adding a dimension, and the function sum, which adds all of the elements of an array along a specified direction, can be used. For example, if B and C are m × n and m × n × G arrays respectively, then C := spread(B,3,G) implies that, for all k, C:,:,k = B, and B := sum(C,3) is equivalent to B(i,j) = Σ_{k=1}^{G} C_{i,j,k}, whereas sum(C) has a scalar value equal to the sum of all of the elements of C.
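These replication and reduction primitives correspond directly to NumPy operations; a brief sketch of that correspondence (added for illustration):

    import numpy as np

    m, n, G = 4, 3, 5
    B = np.arange(m * n, dtype=float).reshape(m, n)

    C = np.repeat(B[:, :, None], G, axis=2)   # C := spread(B, 3, G): C[:, :, k] = B for all k
    B2 = C.sum(axis=2)                        # B := sum(C, 3): B2[i, j] = sum over k of C[i, j, k]
    total = C.sum()                           # sum(C): scalar sum of all elements

    assert np.allclose(B2, G * B)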


The first method for computing the QR factorization of a matrix employs a sequence of Householder reflections H = I − hh^T/b, where b = h^T h/2. The application of H to the data matrix A_i involves the vector-matrix computation z^T = h^T A_i / b and a rank-one update A_i − hz^T. Both of these operations can be efficiently computed on an SIMD array processor using the replication function spread and the reduction function sum. The SIMD implementation of the Householder algorithm for computing the QRDs (1.47) simultaneously is illustrated in Algorithm 1.10. A total of n Compound Householder Transformations (CHTs) are applied. The ith CHT produces the ith rows of R_1, ..., R_G without affecting the first i − 1 columns and rows of A_1, ..., A_G. The simultaneous data parallel vector-matrix computations and rank-one updates are shown respectively in lines 12-14 of Algorithm 1.10; a sketch of one such compound step is given below.
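Before the F-PLUS style listing, here is what one compound reflection step looks like in NumPy terms (an added sketch under the 3-D arrangement described above, with the G matrices stacked along a third axis; the function name is ours):

    import numpy as np

    def apply_reflection(A):
        """Apply one compound Householder step to the stacked m x n x G array A (in place).

        For each slice A[:, :, k] the reflection H = I - h h^T / b, with b = h^T h / 2,
        is chosen to annihilate the subdiagonal of the first column.
        """
        m, n, G = A.shape
        H = A[:, 0, :].copy()                  # the G pivot columns (m x G)
        S = np.sqrt((H * H).sum(axis=0))       # their Euclidean norms
        H[0, :] += np.sign(A[0, 0, :]) * S     # h = a_1 + sign(a_11) * ||a_1|| * e_1
        b = (H * H).sum(axis=0) / 2.0          # b = h^T h / 2, one value per matrix
        Z = np.einsum('mk,mnk->nk', H, A) / b  # z^T = h^T A_k / b for every k at once
        A -= np.einsum('mk,nk->mnk', H, Z)     # rank-one updates A_k - h z^T
        return A

Algorithm 1.10 realizes the same computation with the F-PLUS replication and reduction primitives.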

Algorithm 1.10 The Householder algorithm.
1: def Househ_QRD(A, m, n, G) =
2:   for i = 1, ..., n do
3:     apply transform(A_{i:,i:,:}, m − i + 1, n − i + 1, G)
4:   end for
5: end def
6: def transform(A, m, n, G) =
7:   H := A:,1,:
8:   S := sqrt(sum(H * H, 1))
9:   where (H_{1,:} ...

[...]

... where A ∈ ℝ^{m×n} (m > n) is the exogenous data matrix, y ∈ ℝ^m is the response vector and ε ∈ ℝ^m is the noise vector with zero mean and dispersion matrix σ²I_m. The least squares estimator of the parameter vector x ∈ ℝ^n,

argmin_x ε^T ε = argmin_x ‖Ax − y‖²,        (2.2)

has an infinite number of solutions when A does not have full rank. However, a unique minimum 2-norm estimator of x, say x̂, can be computed.

Let the rank of A be given by k (k ≤ n). The solution of (2.2) is computed in two stages. In the first stage the coefficient matrix A is reduced to a lower trapezoidal form; in the second stage the lower trapezoid is triangularized. The orthogonal decompositions of the first and second stages are given, respectively, by

Q^T A Π = ( 0 )  m − k
          ( L̃ )  k        (2.3)

and

(L_1  L_2) P = ( 0  L ),        (2.4)

where L̃ ≡ (L_1  L_2) and the column blocks of (2.4) have n − k and k columns,


where Q ∈ ℜ^{m×m} and P ∈ ℜ^{n×n} are orthogonal, L and L_2 are lower-triangular and non-singular, and Π ∈ ℜ^{n×n} is a permutation matrix. That is,

                     n−k  k
    Q^T A Π P  =  (   0    0 )  m−k                                        (2.5)
                  (   0    L )  k

The orthogonal decompositions (2.3) and (2.5) are called the QL decomposition (QLD) and the complete QLD, respectively. The minimum 2-norm best linear unbiased estimator of x is given by

    x̂ = Π P_2 L^{−1} Q_2^T y,

where P_2 comprises the last k columns of P and Q_2 the last k columns of Q.
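A minimal NumPy/SciPy sketch of this two-stage computation (ours; it uses the more common QR with column pivoting and an RQ factorization in place of the QLD, and a caller-supplied tolerance tau for the rank decision):

    import numpy as np
    from scipy.linalg import qr, rq

    def min_norm_solution(A, y, tau=1e-10):
        m, n = A.shape
        Q, R, piv = qr(A, pivoting=True)           # first stage: A[:, piv] = Q R
        k = int(np.sum(np.abs(np.diag(R)) > tau))  # numerical rank decision
        T, Z = rq(R[:k, :], mode='economic')       # second stage: R[:k, :] = T Z
        w = np.linalg.solve(T, Q[:, :k].T @ y)     # triangular solve
        x = np.empty(n)
        x[piv] = Z.T @ w                           # minimum 2-norm solution
        return x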

Numerous methods have been proposed for computing the orthogonal factorizations (2.3) and (2.4), on both serial computers and MIMD parallel systems [16, 18, 51, 93].

Algorithms are designed, implemented and analyzed for computing the complete QLD on the CPP DAP 510 massively parallel SIMD computer (abbreviated to DAP) [70]. The algorithms employ Householder reflections and Givens plane rotations. Algorithms are also proposed for reconstructing the orthogonal matrices involved in the decompositions when the data which define the orthogonal transformations are stored in the annihilated parts of the coefficient matrix A. The implementation and execution time models of all algorithms on the DAP are considered in detail. All of the algorithms were implemented on the 1024-processor DAP using double precision arithmetic. The timing models are expressed in msec.

2 THE QLD OF THE COEFFICIENT MATRIX

The computation of the QLD (2.3) using Householder reflections with column pivoting is considered. This method is also used when A is of full column rank but ill-conditioned. Let the elementary permutation matrix I_n^{(i,μ_i)} denote the identity n × n matrix I_n with columns n − i + 1 and μ_i interchanged, and let

    Q_i^H = I_m − h^{(i)} h^{(i)T} / b_i                                   (2.6)

denote an m × m Householder matrix which annihilates the first m − i elements of A_{:,n−i+1} (the pivot column) when it premultiplies A.


The matrices Q and Π in (2.3) are defined by

    Q^T = ∏_{i=1}^k Q_{k−i+1}^H = Q_k^H Q_{k−1}^H ... Q_1^H

and

    Π = ∏_{i=1}^k I_n^{(i,μ_i)} = I_n^{(1,μ_1)} I_n^{(2,μ_2)} ... I_n^{(k,μ_k)}.

To describe briefly the process of computing the QLD (2.3) let, at the ith (0 ≤ i ≤ k) step,

    A^{(i)} = Q_i^H ... Q_1^H A I_n^{(1,μ_1)} ... I_n^{(i,μ_i)}
                   n−i        i
            = (  Ã^{(i)}      0       )  m−i
              (  B^{(i)}   L̃^{(i)}   )  i

where L̃^{(i)} is non-singular and lower-triangular with its diagonal elements in increasing order of magnitude. The value of μ_{i+1} is the index of the column of Ã^{(i)} with maximum Euclidean norm. The criterion used to decide that rank(A) = k is ||Ã^{(k)}_{:,μ_{k+1}}||_2 < τ, where Ã^{(k)}_{:,μ_{k+1}} is the μ_{k+1}th column of the (m−k) × (n−k) block Ã^{(k)}, and τ is an absolute tolerance parameter whose value depends on the scaling of A [51, 63, 93]. The value of τ is assumed to be given.

The permutation matrix Π can be stored and computed using one of the two n-element integer vectors ξ and ζ. A permutation I_n^{(i,μ_i)} is equivalent to swapping first the elements ξ_{n−i+1} and ξ_{μ_i} of the ξ vector and then swapping the elements n − i + 1 and μ_i of the ζ vector, where, initially, ξ_i = ζ_i = i (i = 1, ..., n).
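A small sketch (ours; the printed names of the two vectors are illegible in this copy, so p and inv below are placeholders) of recording Π by swaps on an index vector and its inverse rather than by matrix products:

    import numpy as np

    def swap_columns(p, inv, a, b):
        # record the interchange of columns a and b (0-based indices)
        p[a], p[b] = p[b], p[a]
        inv[p[a]], inv[p[b]] = a, b
        return p, inv

    n = 6
    p = np.arange(n)                         # initially p[i] = inv[i] = i
    inv = np.arange(n)
    p, inv = swap_columns(p, inv, n - 1, 2)  # e.g. I_n^{(1,mu_1)} with mu_1 = 3
    Pi = np.eye(n)[:, p]                     # A @ Pi is the same as A[:, p]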

2.1 SIMD IMPLEMENTATION

The QLD (2.3) has been computed on the DAP under the assumption that n = N ES and m = M ES, where ES is the edge size of the processor array. [...]

[...] and T_0(n,n) = 2n − 3. (3.22) is found empirically to be minimized for p_i having values closer to n. For simplicity let λ be an integer such that k = λn and ⌈log₂ λ⌉ = ⌈log₂(λ + 1)⌉. This implies that (3.22) is minimized if, for all i, p_i = n, and T_1(n,k,λ,p) is simplified to

    T_2(n,λ) = 2n − 3 + n⌈log₂(λ + 1)⌉.

Figure 3.4 illustrates the computation of (3.6a) using the bitonic algorithm, where n = 6, k = 18 and p_i = 6 (i = 1,2,3). The bold frames show the partition


    (D^T  R^T)^T = (D_1^T  D_2^T  D_3^T  R^T)^T.

Initially the QRDs of D_1, D_2 and D_3 are computed simultaneously and, at stages i = 1,2, the updating is completed by computing (3.21), where g = 2.
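A minimal sketch of this merging pattern (ours, assuming 2^g − 1 dense blocks so that, together with R, the number of triangular factors halves at every stage; np.linalg.qr stands in for the book's Householder and Givens kernels):

    import numpy as np

    def bitonic_update(R, D_blocks):
        # R: n x n upper triangular; D_blocks: list of 2**g - 1 dense blocks.
        # First triangularize each block (cf. (3.20)), then merge the 2**g
        # triangular factors pairwise in g stages (cf. (3.21)).
        tris = [R] + [np.linalg.qr(D, mode='r') for D in D_blocks]
        while len(tris) > 1:
            tris = [np.linalg.qr(np.vstack(pair), mode='r')
                    for pair in zip(tris[0::2], tris[1::2])]
        return tris[0]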

Figure 3.4. The bitonic algorithm, where n = 6, k = 18 and p_1 = p_2 = p_3 = 6. (Three panels show the annihilation steps: compute (3.20); compute (3.21) for i = 1; compute (3.21) for i = 2.)

The number of CDGRs applied to update the QRD using the UGSs is given by

    T_3(n,k) = n + k − 1.

Ignoring additive constants, it follows that T_3(n,λn)/T_2(n,λ) ≈ (λ + 1)/(2 + ⌈log₂(λ + 1)⌉). This indicates the efficiency of the bitonic algorithm for computing (3.6a) for λ > 2, compared with that when using the UGSs.

The second parallel strategy for solving the updating problem is a slight modification of the Greedy annihilation scheme in [30, 103]. Taking as before n = 6 and k = 18, Fig. 3.5 indicates the order in which the elements are annihilated. Observing that the elements in the diagonal of R are annihilated


by successive rotations, it follows that at most k + n − 1 CDGRs are required to compute (3.6a). An approximation to the number of CDGRs required to compute (3.6a) when n is fixed and k approaches infinity is given by

    T_4(n,k) = log₂ k + (n − 1) log₂ log₂ k.

The derivation of this approximation has been given in the context of computing the QRD and it is also found empirically to be valid for computing (3.6a) [103]. In general, for k ≫ n, the Greedy sequence requires fewer CDGRs than the bitonic method, while for small k (compared with n) the UGSs and Greedy sequence require the same number of CDGRs. Table 3.5 shows the number of CDGRs required to compute (3.6a) using the UGSs, the bitonic method and the Greedy sequence for some n and λ (k = λn and k ≫ n).

Figure 3.5. The Greedy sequence for computing (3.6a), where n = 6 and k = 18.

As regards implementation, the efficiency of the Greedy method is expected to be reduced significantly by the organizational overheads, so that the bitonic method is to be preferred [30, 103]. The simultaneous computations performed at each stage of the bitonic method make it suitable for distributed-memory architectures [36]. Each processing unit will perform the same matrix computations without requiring any inter-processor communications. The simultaneous QRD of the matrices D_1, ..., D_{2^g−1} on a SIMD system has been considered within the context of the SURE model estimation [84].


Table 3.5. Number of CDGRs required to compute the factorization (3.6a).

     n    λ   k = λn   UGSs   bitonic   Greedy
    15    5       75     89        72       43
    15   10      150    164        87       47
    15   20      300    314       102       50
    15   40      600    614       117       54
    30    5      150    179       147       89
    30   10      300    329       177       96
    30   20      600    629       207      102
    30   40     1200   1229       237      107
    60    5      300    359       297      187
    60   10      600    659       357      198
    60   20     1200   1259       417      208
    60   40     2400   2459       477      217
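The table is reproduced by the closed-form counts (our transcription: k + n − 1 for the UGSs, T_2 for the bitonic method, and the Greedy column agreeing with ⌊T_4⌋):

    import math

    def t_ugs(n, k):
        return k + n - 1                      # UGSs column

    def t_bitonic(n, lam):
        return 2 * n - 3 + n * math.ceil(math.log2(lam + 1))   # T_2(n, lambda)

    def t_greedy(n, k):
        return math.floor(math.log2(k) + (n - 1) * math.log2(math.log2(k)))  # T_4

    for n in (15, 30, 60):
        for lam in (5, 10, 20, 40):
            k = lam * n
            print(n, lam, k, t_ugs(n, k), t_bitonic(n, lam), t_greedy(n, k))
    # the first line printed is: 15 5 75 89 72 43, matching Table 3.5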

In this case the performance of the Householder algorithm was found to be superior to that of the Givens algorithm (see Chapter 1). The simultaneous factorizations (3.21) have been implemented on the MasPar within a 3-D framework, using Givens rotations and Householder reflections. The Householder algorithm applies the reflections H^{(1,j)}, ..., H^{(n,j)}, where H^{(l,j)} annihilates the non-zero elements of the lth column of D̃_j^{(i−1)} using the lth row of R̃_j^{(i−1)} as a pivot row (l = 1, ..., n).
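The following NumPy sketch (ours) shows this reflection pattern for a single dense block D and triangular R, using the chapter-1 form H = I − hh^T/b with b = h^T h/2; the sign choice and the zero-column guard are our additions:

    import numpy as np

    def householder_block_update(R, D):
        # the l-th reflection zeroes column l of D, pivoting on row l of R
        R, D = R.copy(), D.copy()
        n = R.shape[0]
        for l in range(n):
            v = np.concatenate(([R[l, l]], D[:, l]))
            alpha = -np.copysign(np.linalg.norm(v), v[0])
            h = v.copy()
            h[0] -= alpha                    # Householder vector
            b = h @ h / 2.0
            if b == 0.0:                     # column of D already zero
                continue
            z = (h[0] * R[l, l:] + h[1:] @ D[:, l:]) / b
            R[l, l:] -= h[0] * z             # pivot row update
            D[:, l:] -= np.outer(h[1:], z)   # rank-one update of the block
        return R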

Table 3.6. Times (in seconds) for computing the orthogonal factorization (3.6a).

      n   g   bitonic Householder   Householder   bitonic Givens    UGS-1
     64   2                  0.84          0.23             1.29     1.45
     64   3                  1.55          0.35             2.34     2.46
     64   5                  4.83          0.82             7.97     9.04*
    192   2                  4.15          1.78             8.98     9.77
    192   3                  7.78          2.98            18.21    19.45
    192   5                 27.12         10.27            69.96    76.96*
    320   2                 10.15          5.51            27.96    32.41
    320   3                 19.69         10.05            58.78    67.51*
    320   5                 72.12         37.48           236.13   278.20*

    * Estimated times.

Table 3.6 shows the execution times of the various algorithms for computing (3.6a) on the 8192-processor MasPar using single precision arithmetic. Clearly the bitonic algorithm based on Householder transformations performs better than the bitonic algorithm based on CDGRs. However, the straightforward


data-parallel implementation of the Householder algorithm is found to be the fastest of all. The degradation in the performance of the bitonic algorithm is due mainly to the large number of simultaneous matrix computations which are performed serially in the 2-D array MasPar processor [84]. The bitonic algorithm based on CDGRs performs better than the direct implementation of UGS-1 because of the initial triangularization of the submatrices D_1, ..., D_{2^g−1} using Householder transformations.

2.3 UPDATING WITH A MATRIX HAVING A BLOCK LOWER-TRIANGULAR STRUCTURE

Computational and numerical methods for deriving the estimators of structural equations models require the updating of a lower-triangular matrix with a matrix having a block lower-triangular structure. Within this context the updating problem can be expressed as the computation of the orthogonal factorization

    P̃^T ( Ã^{(1)} )  =  ( 0 )   E − e_1
         ( A^{(1)} )     ( L̃ )   (G−1)K − E + e_G                          (3.23)

where

    Ã^{(1)} =  ( Ã^{(1)}_{2,1}        0          ...        0           )  e_2
               ( Ã^{(1)}_{3,1}   Ã^{(1)}_{3,2}   ...        0           )  e_3
               (      ...             ...                   ...         )   .
               ( Ã^{(1)}_{G,1}   Ã^{(1)}_{G,2}   ...  Ã^{(1)}_{G,G−1}   )  e_G

with column blocks of widths K − e_1, K − e_2, ..., K − e_{G−1},

    A^{(1)} =  ( L_1                  0            ...      0       )  K − e_1
               ( A^{(1)}_{2,1}        L_2          ...      0       )  K − e_2
               (      ...             ...                   ...     )    .
               ( A^{(1)}_{G−1,1}   A^{(1)}_{G−1,2} ...   L_{G−1}    )  K − e_{G−1}

with the same column partitioning, the L_i (i = 1, ..., G − 1) are lower triangular, and E = Σ_{i=1}^G e_i.

The factorization (3.23) can be computed in G − 1 stages, where each stage annihilates a block-subdiagonal, with the first stage annihilating the main


block-diagonal. At the ith (i = 1, ..., G − 1) stage the orthogonal factorizations

    P_{i,j}^T ( L̃_j^{(i)}       )  =  ( L̃_j^{(i+1)} )                    (3.24)
              ( Ã^{(i)}_{i+j,j} )     (     0       )

are computed simultaneously for j = 1, ..., G − i, where the L̃_j^{(i+1)} matrix is lower triangular (L̃_j^{(1)} = L_j) and P_{i,j} is a (K − e_j + e_{i+j}) × (K − e_j + e_{i+j}) orthogonal matrix.

It follows that the triangular matrix L̃ in (3.23) is given by

    L̃ = ( L̃_1^{(G)}            0             ...        0          )
         ( A^{(G−1)}_{2,1}    L̃_2^{(G−1)}     ...        0          )
         (      ...               ...                    ...        )
         ( A^{(2)}_{G−1,1}    A^{(2)}_{G−1,2}  ...  L̃_{G−1}^{(2)}   )

Therefore, if T_{D1}(e,K,i,j) denotes the number of CDGRs required to compute the factorization (3.24) using this method (hereafter called the diagonally-based method), then the total number of CDGRs needed to compute (3.23) is given by

    T_D(e,K,G) = Σ_{i=1}^{G−1} max_{j=1,...,G−i} T_{D1}(e,K,i,j),          (3.25)

where e = (e_1, ..., e_G). Figure 3.6 shows the annihilation process for computing the factorization (3.23), where G = 5 and each submatrix is labelled by the stage i (i = 1, ..., G − 1) at which it is eliminated.

Figure 3.6. Computing the factorization (3.23) using the diagonally-based method, where G = 5. (Stages 1-4 each annihilate one block-subdiagonal.)
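A one-kernel sketch of (3.24) (ours; it realises the QL-type factorization through a QR of the column-reversed stack rather than the book's Givens or Householder sequences; the G − i factorizations of a stage are mutually independent and can run concurrently):

    import numpy as np

    def merge_lower(Lt, Ab):
        # P^T [Lt; Ab] = [Lt_new; 0] with Lt, Lt_new lower triangular
        S = np.vstack((Lt, Ab))
        Ru = np.linalg.qr(S[:, ::-1], mode='r')   # upper-triangular factor
        return Ru[::-1, ::-1]                     # reversed back: lower triangular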

Figure 3.7 illustrates various annihilation schemes for computing the factorization (3.24) by showing only the zeroed matrix Ã^{(i)}_{i+j,j} and the lower-triangular L̃_j^{(i)} matrix, where e_{i+j} = 12 and K − e_j = 4. The annihilation schemes


are equivalent to those of block-updating the QRD, the only difference being that an upper-triangular matrix is replaced by a lower-triangular matrix [69, 75, 76, 81]. These annihilation schemes can be employed to annihilate different submatrices of Ã^{(1)}; that is, at step i (i = 1, ..., G − 1) of the factorization (3.23) the submatrices Ã^{(i)}_{i+1,1}, ..., Ã^{(i)}_{G,G−i} can be zeroed without using the same annihilation scheme. Assuming that only the UGS-2 or UGS-3 schemes are employed to annihilate each submatrix, the number of CDGRs given by (3.25) is

    T_{D1}(e,K,i,j) = K − e_j + e_{i+j} − 1.

Hence, the total number of CDGRs applied to compute the factorization (3.23) is given by

    T_D^{UGS}(e,K,G) = Σ_{i=1}^{G−1} max_{j=1,...,G−i} (K − e_j + e_{i+j} − 1)
                     = (G−1)(K−1) + Σ_{i=1}^{G−1} max_{j=1,...,G−i} (e_{i+j} − e_j).   (3.26)

Figure 3.7. Parallel strategies for computing the factorization (3.24). (Five panels: the UGS-2, UGS-3, bitonic, Greedy and Householder annihilation sequences.)

The factorization (3.23) is illustrated in Fig. 3.8 without showing the lower-triangular matrix A^{(1)}, where each submatrix of Ã^{(1)} is annihilated using only the UGS-2 or Greedy schemes, K = 10, G = 4 and e = (2,3,6,8). This particular example shows that both the schemes require the application of the same


number of CDGRs to compute the factorization. However, for problems where the number of rows far exceeds the number of columns in each submatrix, the Greedy method will require fewer steps than the other schemes.

Figure 3.8. Computing the factorization (3.23), where K = 10, G = 4 and e = (2,3,6,8). (Two panels: using only the UGS-2 scheme, and using only the Greedy scheme.)

The intrinsically independent annihilation of the submatrices in a block-subdiagonal of Ã^{(1)} makes this factorization strategy well suited for distributed-memory systems since it does not involve any inter-processor communication.


However, the diagonally-based method has the drawback that the computational complexity at stage i (i = 1, ..., G − 1) is dominated by the maximum number of CDGRs required to annihilate the submatrices Ã^{(i)}_{i+1,1}, ..., Ã^{(i)}_{G,G−i}. An alternative approach (called the column-based method) which removes this drawback is to start annihilating simultaneously the submatrices Ã_1, ..., Ã_{G−1}, where

    Ã_j = ( Ã^{(1)}_{j+1,j} )
          ( Ã^{(1)}_{j+2,j} )
          (       ...       )
          ( Ã^{(1)}_{G,j}   ),      j = 1, ..., G − 1.                     (3.27)

Consider the case of using the UGS-2 scheme. Initially UGS-2 is applied to annihilate the matrix Ã^{(1)} under the assumption that it is dense. As a result the steps within the zero submatrices are eliminated and the remaining steps are adjusted so that the sequence starts from step 1. Figure 3.9 shows the derivation of this sequence using the same problem dimensions as in Fig. 3.8. Generally, for ρ_1 = 1, ρ_j = ρ_{j−1} + 2e_j − K (1 < j < G) and η = min(ρ_1, ..., ρ_{G−1}), the annihilation of the submatrix Ã_i starts at step

    s_i = ρ_i − η + 1,      i = 1, ..., G − 1.

The number of CDGRs needed to compute the factorization (3.23) is given by

    T_C^{UGS}(e,K,G,η) = E + K − 2e_1 − η.                                 (3.28)

Comparison of T_D^{UGS}(e,K,G) and T_C^{UGS}(e,K,G,η) shows that, when the UGSs are used, the diagonally-based method never performs better than the column-based method. Both methods need the same number of steps in the exceptional case where G = 2.
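For the running example the two counts can be checked directly (our transcription of (3.26) and (3.28); e is 0-based in the code):

    def t_diag(e, K):                         # T_D via (3.26)
        G = len(e)
        return sum(max(K - e[j] + e[i + j] - 1 for j in range(G - i))
                   for i in range(1, G))

    def t_col(e, K):                          # T_C via (3.28)
        G = len(e)
        rho = [1]
        for j in range(1, G - 1):
            rho.append(rho[-1] + 2 * e[j] - K)
        return sum(e) + K - 2 * e[0] - min(rho)

    e, K = (2, 3, 6, 8), 10
    print(t_diag(e, K), t_col(e, K))          # 41 and 28: cf. Figs 3.8 and 3.9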

The column-based method employing the Greedy scheme is illustrated in Fig. 3.10. The first sequence is the result of directly applying the Greedy scheme on the Ã^{(i)}_{i+1,1}, ..., Ã^{(i)}_{G,G−i} submatrices. Let the columns of each submatrix be numbered from right to left, that is, in reverse order.

The number of elements annihilated by the qth (q > 0) CDGR in the jth (j = 0, ..., K − e_i) column of the ith submatrix Ã_i is given by

    r_j^{(i,q)} = ⌊(a_j^{(i,q)} + 1)/2⌋,

where a_j^{(i,q)} is defined as

    a_j^{(i,q)} = 0,                                                                   if j > q and j > K − e_i;
                = e_{i+1},                                                             if q = j = 1;
                = a_j^{(i,q−1)} + r_{j−1}^{(i,q−1)} − r_j^{(i,q−1)} + r_{K−e_{i−1}}^{(i−1,q−1)},   if j = 1 and q > 1;
                = a_j^{(i,q−1)} + r_{j−1}^{(i,q−1)} − r_j^{(i,q−1)},                   otherwise.


Figure 3.9. The column-based method using the UGS-2 scheme. (Two panels: the UGS-2 sequence for a dense Ã^{(1)}, and the Modified UGS-2 sequence derived from it.)

The sequence terminates at step q if, for all i and j, r_j^{(i,q)} = 0.

The second sequence in Fig. 3.10, called Modified Greedy, is generated from the application of the Greedy algorithm in [69] by employing the same technique used for deriving the column-based sequence from the UGS-2 scheme. Notice, however, that the second Greedy sequence does not correspond to, and is not as efficient as, the former sequence, which applies the Greedy method directly.


Figure 3.10. The column-based method using the Greedy scheme (first sequence) and the Modified Greedy scheme (second sequence).