Parallel Algorithms for Linear Models: Numerical Methods and Estimation Problems (Advances in Computational Economics 15), Erricos John Kontoghiorghes (Auth.)



PARALLEL ALGORITHMS FOR LINEAR MODELS


Advances in Computational Economics
VOLUME 15

SERIES EDITORS
Hans Amman, University of Amsterdam, Amsterdam, The Netherlands
Anna Nagurney, University of Massachusetts at Amherst, USA

EDITORIAL BOARD
Anantha K. Duraiappah, European University Institute
John Geweke, University of Minnesota
Manfred Gilli, University of Geneva
Kenneth L. Judd, Stanford University
David Kendrick, University of Texas at Austin
Daniel McFadden, University of California at Berkeley
Ellen McGrattan, Duke University
Reinhard Neck, University of Klagenfurt
Adrian R. Pagan, Australian National University
John Rust, University of Wisconsin
Berc Rustem, University of London
Hal R. Varian, University of Michigan

The titles published in this series are listed at the end of this volume.


Parallel Algorithms for Linear Models
Numerical Methods and Estimation Problems

by
Erricos John Kontoghiorghes
Université de Neuchâtel, Switzerland

Springer Science+Business Media, LLC


Library of Congress Cataloging-in-Publication Data

Kontoghiorghes, Erricos John.
Parallel algorithms for linear models: numerical methods and estimation problems / by Erricos John Kontoghiorghes.
p. cm. -- (Advances in computational economics; v. 15)
Includes bibliographical references and indexes.
ISBN 978-1-4613-7064-2  ISBN 978-1-4615-4571-2 (eBook)
DOI 10.1007/978-1-4615-4571-2
1. Linear models (Statistics)--Data processing. 2. Parallel algorithms. I. Title. II. Series.
QA276 .K645 2000
519.5'35--dc21
99-056040

Copyright 2000 by Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers, New York in 1992. Softcover reprint of the hardcover 1st edition 1992.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.


    To Laurence and Louisa


Contents

List of Figures  ix
List of Tables  xi
List of Algorithms  xiii
Preface  xv

1. LINEAR MODELS AND QR DECOMPOSITION  1
   1 Introduction  1
   2 Linear model specification  1
      2.1 The ordinary linear model  2
      2.2 The general linear model  7
   3 Forming the QR decomposition  10
      3.1 The Householder method  11
      3.2 The Givens rotation method  13
      3.3 The Gram-Schmidt orthogonalization method  16
   4 Data parallel algorithms for computing the QR decomposition  17
      4.1 Data parallelism and the MasPar SIMD system  17
      4.2 The Householder method  19
      4.3 The Gram-Schmidt method  21
      4.4 The Givens rotation method  22
      4.5 Computational results  23
   5 QRD of large and skinny matrices  23
      5.1 The CPP GAMMA SIMD system  24
      5.2 The Householder QRD algorithm  25
      5.3 QRD of skinny matrices  27
   6 QRD of a set of matrices  29
      6.1 Equal size matrices  29
      6.2 Matrices with different number of columns  34

2. OLM NOT OF FULL RANK  39
   1 Introduction  39
   2 The QLD of the coefficient matrix  40
      2.1 SIMD implementation  41
   3 Triangularizing the lower trapezoid  43
      3.1 The Householder method  43
      3.2 The Givens method  46
   4 Computing the orthogonal matrices  49
   5 Discussion  54

3. UPDATING AND DOWNDATING THE OLM  57
   1 Introduction  57
   2 Adding observations  58
      2.1 The hybrid Householder algorithm  60
      2.2 The Bitonic and Greedy Givens sequences  67
      2.3 Updating with a block lower-triangular matrix  75
      2.4 QRD of structured banded matrices  82
      2.5 Recursive and linearly constrained least-squares  87
   3 Adding exogenous variables  90
   4 Deleting observations  92
      4.1 Parallel strategies  94
   5 Deleting exogenous variables  99

4. THE GENERAL LINEAR MODEL  105
   1 Introduction  105
   2 Parallel algorithms  108
   3 Implementation and performance analysis  111

5. SURE MODELS  117
   1 Introduction  117
   2 The generalized linear least squares method  121
   3 Triangular SURE models  123
      3.1 Implementation aspects  127
   4 Covariance restrictions  129
      4.1 The QLD of the block bi-diagonal matrix  133
      4.2 Parallel strategies  138
      4.3 Common exogenous variables  140

6. SIMULTANEOUS EQUATIONS MODELS  147
   1 Generalized linear least squares  149
      1.1 Estimating the disturbance covariance matrix  151
      1.2 Redundancies  152
      1.3 Inconsistencies  153
   2 Modifying the SEM  154
   3 Linear Equality Constraints  157
      3.1 Basis of the null space and direct elimination methods  158
   4 Computational Strategies  160

References  163
Author Index  177
Subject Index  179


List of Figures

1.1 Geometric interpretation of least-squares for the OLM problem.  4
1.2 Illustration of Algorithm 1.2, where m = 4 and n = 3.  15
1.3 The column and diagonally based Givens sequences for computing the QRD.  15
1.4 Cyclic mapping of a matrix and a vector on the MasPar MP-1208.  18
1.5 Examples of Givens rotations schemes for computing the QRD.  22
1.6 Execution time ratio between 2-D and 3-D algorithms for computing the QRDs, where G = 16.  34
1.7 Stages of computing the QRDs (1.47).  36
2.1 Annihilation pattern of (2.4) using Householder reflections.  44
2.2 Givens sequences for computing the orthogonal factorization (2.4).  47
2.3 Illustration of the implementation phases of PGS, where es = 4.  49
2.4 The fill-in of the submatrix P_{1:n,1:n} at each phase of Algorithm 2.4.  53
3.1 Updating Givens sequences for computing the orthogonal factorizations (3.6), where k = 8 and n = 4.  59
3.2 Ratio of the execution times produced by the models of the cyclic-layout and column-layout implementations.  63
3.3 Computing (3.21) using Givens rotations.  71
3.4 The bitonic algorithm, where n = 6, k = 18 and p_1 = p_2 = p_3 = 6.  72
3.5 The Greedy sequence for computing (3.6a), where n = 6 and k = 18.  73
3.6 Computing the factorization (3.23) using the diagonally-based method, where G = 5.  76
3.7 Parallel strategies for computing the factorization (3.24).  77
3.8 Computing factorization (3.23).  78
3.9 The column-based method using the UGS-2 scheme.  80
3.10 The column-based method using the Greedy scheme.  81
3.11 Illustration of the annihilation patterns of method-1.  83
3.12 Computing (3.31) for b = 8, ϑ* = 3 and j = 2. Only the affected matrices are shown.  85
3.13 Illustration of method-3, where p = 4 and g = 1.  86
3.14 Givens parallel strategies for downdating the QRD.  96
3.15 Illustration of the SK-based scheme for computing the QRD of RS.  102
3.16 Greedy-based schemes for computing the QRD of RS.  104
4.1 Sequential Givens sequences for computing the QLD (4.3a).  107
4.2 The SK sequence.  109
4.3 G(16)B with e(16,18,8) = 8.  109
4.4 The application of the SK sequence to compute (4.3) on a 2-D SIMD computer.  109
4.5 Examples of the MSK(p) sequence for computing the QLD.  110
5.1 The correlations ρ_{i,j} in the SURE-CC model for ϑ_i = i and ϑ_i = 1/i.  131
5.2 Factorization process for computing the QLD (5.35) using Algorithm 5.3.  136
5.3 Annihilation sequences of computing the factorization (5.40).  137
5.4 Givens sequences of computing the factorization (5.45).  138
5.5 Number of CDGRs for computing the orthogonal factorization (5.40) using the PDS.  139
5.6 Annihilation sequence of triangularizing (5.55).  144
6.1 Givens sequence for computing the QRD of RS_i.  161


List of Tables

1.1 Times (in seconds) of computing the QRD of a 128M x 64N matrix.  23
1.2 Execution times (in seconds) of the CPP LALIB QR_FACTOR subroutine and the BPHA.  27
1.3 Execution times (in seconds) of the CPP LALIB QR_FACTOR subroutine and Algorithm 1.9.  28
1.4 Times (in seconds) of simultaneously computing the QRDs (1.47).  33
1.5 The task-farming and scattering methods for computing the QRDs (1.47).  38
2.1 Computing the QLD (2.3) (in seconds), where m = Mes and n = Nes.  44
2.2 The CDGRs of the PGS for computing the factorization (2.4).  47
2.3 Computing (2.4) (in seconds), where k = Kes and n − k = ηes.  50
2.4 Times (in seconds) of reconstructing the orthogonal matrices Q^T and P on the DAP.  54
3.1 Execution times (msec) of the Householder and Givens methods for updating the QRD on the DAP.  60
3.2 Execution times (in seconds) for k = 11264.  63
3.3 Execution times (in seconds) of the RLS Householder algorithm on the MasPar.  65
3.4 Execution times (in seconds) of the RLS Householder algorithm on the GAMMA.  66
3.5 Number of CDGRs required to compute the factorization (3.6a).  74
3.6 Times (in seconds) for computing the orthogonal factorization (3.6a).  74
3.7 Computing the QRD of a structured banded matrix using method-3.  87
3.8 Estimated time (msec) required to compute x(i) (i = 2, 3, ...), where m_i = 96, n = 32N and k = 32K.  91
3.9 Execution time (in seconds) for downdating the OLM.  98
4.1 Execution times (in seconds) of the MSK(Aes/2).  114
4.2 Computing (4.3) (in seconds) without explicitly constructing Q^T and P.  115
5.1 Computing (5.24), where T − k − 1 = τes, G − 1 = μes and es = 32.  128
5.2 Execution times of Algorithm 5.2 for solving RΓ = Δ.  129


List of Algorithms

1.1 Computing the QRD of A ∈ ℝ^{m×n} using Householder transformations.  12
1.2 The column-based Givens sequence for computing the QRD of A ∈ ℝ^{m×n}.  14
1.3 The diagonally-based Givens sequence for computing the QRD of A ∈ ℝ^{m×n}.  16
1.4 The Classical Gram-Schmidt method for computing the QRD of A ∈ ℝ^{m×n}.  16
1.5 The Modified Gram-Schmidt method for computing the QRD.  17
1.6 QR factorization by Householder transformations on SIMD systems.  20
1.7 The MGS method for computing the QRD on SIMD systems.  21
1.8 The CPP LALIB method for computing the QR Decomposition.  26
1.9 Householder with parallelism in the first dimension.  28
1.10 The Householder algorithm.  30
1.11 The Modified Gram-Schmidt algorithm.  31
1.12 The task-farming approach for computing the QRDs (1.47) on p (p ≤ G) processors using a SPMD paradigm.  37
2.1 The QL decomposition of A.  43
2.2 Triangularizing the lower trapezoid using Householder reflections.  45
2.3 The reconstruction of the orthogonal matrix Q in (2.3).  51
2.4 The reconstruction of the orthogonal matrix P in (2.4).  53
3.1 The data-parallel Householder algorithm.  61
3.2 The bitonic algorithm for updating the QRD, where R ≡ R_{i−1}.  70
3.3 The computation of (3.63) using Householder transformations.  97
5.1 An iterative algorithm for solving tSURE models.  126
5.2 The parallel solution of the triangular system RΓ = Δ.  129
5.3 Computing the QLD (5.35).  135


Preface

The monograph provides a complete and detailed account of the design, analysis and implementation of parallel algorithms for solving large-scale linear models. It investigates and presents efficient, numerically stable algorithms for computing the least-squares estimators and other quantities of interest on massively parallel systems.

The least-squares computations are based on orthogonal transformations, in particular the QR and QL decompositions. Parallel algorithms employing Givens rotations and Householder transformations have been designed for various linear model estimation problems. Some of the algorithms presented are parallel versions of serial methods, while others are original designs. The implementation of the major parallel algorithms is described. The necessary techniques and insights needed for implementing efficient parallel algorithms on multiprocessor systems are illustrated in detail. Although most of the algorithms have been implemented on SIMD systems, the data parallel computations of these algorithms should, in general, be applicable to any massively parallel computer.

The monograph is in two parts. The first part consists of four chapters and deals with the computational aspects of solving linear models that have applicability in diverse areas. The remaining two chapters form the second part, which concentrates on numerical and computational methods for solving various problems associated with seemingly unrelated regression equations (SURE) and simultaneous equations models.

Chapter 1 provides a brief introduction to linear models and considers various methods for forming the QR decomposition on serial and parallel systems. Emphasis is given to the design and efficient implementation of the parallel algorithms. The second chapter investigates the performance and practical issues of solving the ordinary linear model (OLM), with the exogenous matrix being ill-conditioned or having deficient rank, on a SIMD system.

Chapter 3 is devoted to methods for up- and down-dating the OLM. It provides the necessary computational tools and techniques that are often required in econometrics and optimization. The efficient parallel strategies for modifying the OLM can be used as primitives for designing fast econometric algorithms. For example, the Givens and Householder algorithms used to compute the QR decomposition after rows have been added or columns have been deleted from the original matrix have been efficiently employed in the solution of the SURE and simultaneous equations models. The updating methods are also employed to solve the recursive ordinary linear model with linear equality constraints. The numerical methods based on the basis of the null space and direct elimination methods are in turn adopted for the solution of linearly constrained simultaneous equations models.

The fourth chapter investigates parallel algorithms for solving the general linear model - the parent model of econometrics - when it is considered as a generalized linear least-squares problem. This approach has subsequently been efficiently used to compute solutions of SURE and simultaneous equations models without having as prerequisite the non-singularity of the variance-covariance matrix of the disturbances. Chapter 5 presents a parallel algorithm for solving triangular SURE models. The problem of computing estimates of parameters in SURE models with variance inequalities and positivity of correlations constraints is also considered. Finally, chapter 6 presents algorithms for computing the three-stage least squares estimator of simultaneous equations models (SEMs). Numerical and computational methods for solving SEMs with separable linear equality constraints, and when the SEM has been modified by deleting or adding new observations or variables, are discussed. Expressions revealing linear combinations between the observations which become redundant are also presented.

These novel computational methods for solving SURE and simultaneous equations models provide new insights that can be useful to econometric modelling. Furthermore, the computationally and numerically efficient treatment of these models, which are regarded as the core of econometric theory, can be considered as the basis for future research. The algorithms can be extended or modified to deal with models that occur in particular econometric applications and have specific characteristics that need to be taken into account.

The practical issues of the parallel algorithms and the theoretical aspects of the numerical methods will be of interest to a broad range of researchers working in the areas of numerical and computational methods in statistics and econometrics, parallel numerical algorithms, parallel computing and numerical linear algebra. The aim of this monograph is to promote research in the interface of econometrics, computational statistics, numerical linear algebra and parallelism.

The research described in this monograph is based on the work that I have pursued over the last ten years. During this period I was privileged to have the opportunity to discuss various issues related to my work with Maurice Clint. His numerous suggestions and constructive comments have been both inspiring and invaluable. I am grateful to Dennis Parkinson for the valuable information that he has provided on many occasions on various aspects related to SIMD systems, David A. Belsley for his constructive comments and advice on the solution of SURE and simultaneous equations models, Hans-Heinrich Nageli for his comments and constructive criticism on performance issues of parallel algorithms, and the late Mike R.B. Clarke for his suggestions on Givens sequences and matrix computations. I am indebted to Paolo Foschi and Manfred Gilli for their comments on this monograph and to Sharon Silverne for proofreading the manuscript. The author accepts full responsibility for any errors that may be found in this work.

Some of the results of this monograph were originally published in various papers [69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 84, 85, 86, 87, 88] and are reproduced by kind permission of Elsevier Science Publishers B.V. 1993, 1994, 1995, 1999; Gordon and Breach Publishers 1993, 1995; John Wiley & Sons Limited 1996, 1999; IEEE 1993; Kluwer Academic Publishers 1997, 1999; Principia Scientia 1996, 1997; SAP-Slovak Academic Press Ltd. 1995; and Springer-Verlag 1993, 1996, 1999.


Chapter 1

LINEAR MODELS AND QR DECOMPOSITION

1 INTRODUCTION

A common problem in statistics is that of estimating parameters of some assumed relationship between one or more variables. One such relationship is

y = f(a_1, ..., a_n),        (1.1)

where y is the dependent (endogenous, explained) variable and a_1, ..., a_n are the independent (exogenous, explanatory) variables. Regression analysis estimates the form of the relationship (1.1) by using the observed values of the variables. This attempt at describing how these variables are related to each other is known as model building.

Exact functional relationships such as (1.1) are inadequate descriptions of statistical behavior. Thus, the specification of the relationship (1.1) is expressed as

y = f(a_1, ..., a_n) + ε,        (1.2)

where ε is the disturbance term or error, whose specific value in any single observation cannot be predicted. The purpose of ε is to characterize the discrepancies that emerge between the actual observed value of y and the values that would be assigned by an exact functional relationship. The difference between the observed and predicted value of y is called the residual.

2 LINEAR MODEL SPECIFICATION

A linear model is one in which y, or some transformation of y, can be expressed as a linear function of a_i, or some transformation of a_i (i = 1, ..., n). Here only linear models where endogenous and exogenous variables do not require any transformations will be considered. In this case, the relationship


(1.2) can be written as

y = a_1 x_1 + a_2 x_2 + ... + a_n x_n + ε,        (1.3)

where x_i (i = 1, ..., n) are unknown constants.

If there are m (m > n) sample observations, the linear model (1.3) gives rise to the following set of m equations

y_1 = a_11 x_1 + a_12 x_2 + ... + a_1n x_n + ε_1
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2n x_n + ε_2
 :
y_m = a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n + ε_m

or

( y_1 )   ( a_11  a_12  ...  a_1n ) ( x_1 )   ( ε_1 )
( y_2 ) = ( a_21  a_22  ...  a_2n ) ( x_2 ) + ( ε_2 )        (1.4)
(  :  )   (  :     :          :   ) (  :  )   (  :  )
( y_m )   ( a_m1  a_m2  ...  a_mn ) ( x_n )   ( ε_m )

In compact form the latter can be written as

y = Ax + ε,        (1.5)

where y, ε ∈ ℝ^m, A ∈ ℝ^{m×n} and x ∈ ℝ^n.

To complete the description of the linear model (1.5), characteristics of the error term ε and the matrix A must be specified. The first assumption is that the expected value of ε is zero, that is, E(ε) = 0. The second assumption is that the various values of ε are normally distributed. The final assumption is that A is a non-stochastic matrix, which implies E(A^T ε) = 0. In summary, the complete mathematical specification of the (general) linear model which is being considered is

y = Ax + ε,   ε ~ N(0, σ²Ω).        (1.6)

The notation ε ~ N(0, σ²Ω) indicates that the error vector is assumed to come from a normal distribution with mean zero and variance-covariance (or dispersion) matrix σ²Ω, where Ω is a symmetric non-negative definite matrix and σ² is an unknown scalar [124].

2.1 THE ORDINARY LINEAR MODEL

Consider the Ordinary Linear Model (OLM):

y = Ax + ε,   ε ~ N(0, σ²I_m).        (1.7)


The OLM assumptions are that each ε_i has the same variance and all disturbances are pairwise uncorrelated. That is, Var(ε_i) = σ² and, for all i ≠ j, E(ε_i ε_j) = 0. The first assumption is known as homoscedasticity (homogeneous variances).

The most frequently used estimating technique for the OLM (1.7) is least squares. Least-squares (LS) estimation involves minimizing the sum of squares of residuals: that is, finding an n element vector x which minimizes

e^T e = (y − Ax)^T (y − Ax).        (1.8)

For the minimization of (1.8), e^T e is differentiated with respect to x, which is treated as a variable vector, and the differentials are equated to zero. Thus,

∂(e^T e)/∂x = −2y^T A + 2x^T A^T A

and, setting ∂(e^T e)/∂x = 0, gives the least-squares normal equations

A^T A x = A^T y.        (1.9)

Assuming that A is of full column rank, that is, (A^T A)^{−1} exists, the least squares estimator can be computed as

x̂ = (A^T A)^{−1} A^T y        (1.10)

and the variance-covariance matrix of the estimator x̂ is given by

Var(x̂) = σ² (A^T A)^{−1}.        (1.11)

The terminology of normal equations is expressed in terms of the following geometric interpretation of least-squares. The columns of A span a subspace in ℝ^m which is referred to as the manifold of A and is denoted by M(A). The dimension of M(A) cannot exceed n and can only be equal to n if A is of full column rank. The vector Ax resides in M(A), but the vector y lies outside M(A), where it is assumed that ε ≠ 0 in (1.7). For each different vector x there is a corresponding vector of residuals e, so that y is the sum of the two vectors Ax and e. The length of e needs to be minimized and this is achieved by making the residual vector e perpendicular to M(A) (see Fig. 1.1). This implies that e = y − Ax must be orthogonal to any linear combination of the columns of A. If Ac is any such linear combination, where c is non-zero, then the orthogonality condition gives c^T A^T (y − Ax) = 0, from which the least squares normal equations (1.9) are derived.

Among econometricians, Maximum Likelihood (ML) is another popular technique for deriving estimators of linear models. The likelihood function of the observed sample is the probability density function of y, namely,

L(x, σ²) = (2πσ²)^{−m/2} e^{−(y−Ax)^T (y−Ax) / (2σ²)},


Figure 1.1. Geometric interpretation of least-squares for the OLM problem.

where e denotes the exponential function. Maximizing L(x, σ²) is equivalent to maximizing ln L(x, σ²), where

ln L(x, σ²) = −(m/2) ln(2π) − (m/2) ln(σ²) − (1/(2σ²)) (y − Ax)^T (y − Ax).

Setting ∂ ln L/∂x = 0 and ∂ ln L/∂σ² = 0, the ML estimators are obtained as

x̂_ML = (A^T A)^{−1} A^T y        (1.12)

and

σ̂²_ML = (y − Ax̂_ML)^T (y − Ax̂_ML) / m.        (1.13)

The ML estimator x̂_ML is identical to the least-squares estimator x̂, which is the best linear unbiased estimator (BLUE) of x in (1.7). If x̂ is the BLUE of x, it follows that E(x̂) = x and, for all q ∈ ℝ^n, Var(q^T x̂) ≤ Var(q^T x̃), where x̃ is any linear unbiased estimator of x. However, the ML estimator σ̂²_ML differs from the unbiased estimator of σ², which is given by

σ̂² = (y − Ax̂)^T (y − Ax̂) / (m − n).        (1.14)

Numerous methods exist for solving the least-squares problem. Some of the best known methods are Gaussian elimination, Gauss-Jordan elimination, LU decomposition, Cholesky factorization, Singular Value Decomposition (SVD) and QR decomposition (QRD). When the coefficient matrix is large and sparse, these methods, which are called direct methods, suffer from fill-in and they can be impracticable. In such cases, iterative methods are more efficient, even though there are intelligent adaptations of the direct methods which minimize the fill-in. Iterative methods (e.g. Conjugate Gradient) have the advantage that minimal storage space is required for implementation, since no fill-in of the zero positions of the coefficient matrix occurs during computation. Furthermore, if a good initial guess is known and preconditioning can be used, the iterative methods can converge in an acceptable number of steps. Full details of the direct and iterative methods are given in textbooks such as [7, 40, 51, 93, 98].


Here the numerically reliable direct method of QR decomposition (QRD) is used under the assumption that the coefficient matrix is dense. The QRD of the explanatory data matrix A is given by

A = Q ( R )
      ( 0 ),        (1.15)

where Q ∈ ℝ^{m×m} is orthogonal, i.e. it satisfies Q^T Q = QQ^T = I_m, and R ∈ ℝ^{n×n} is upper triangular (the zero block has m − n rows). Substituting (1.15) in the normal equations (1.9) gives

R^T R x = R^T y_1,   where   Q^T y = ( y_1 )
                                     ( y_2 ),

with y_1 having n and y_2 having m − n elements. Under the assumption that A is of full column rank, which implies that R is non-singular, the least-squares estimator of the OLM (1.7) is computed by solving the upper triangular system of equations

R x = y_1.        (1.16)

Another approach to deriving (1.16) is to use the property of orthogonal transformation matrices, which leave the Euclidean length of a vector invariant. The Euclidean length or 2-norm of z ∈ ℝ^m is given by ‖z‖ = (z^T z)^{1/2} and so ‖Qz‖² = z^T Q^T Q z = z^T z = ‖z‖². Hereafter, the Euclidean norm will be denoted by ‖·‖. In the context of minimizing ‖e‖², it follows that

x̂ = argmin_x ‖e‖²
   = argmin_x ‖Q^T e‖²
   = argmin_x ‖Q^T y − Q^T A x‖²
   = argmin_x ‖ ( y_1 − Rx ) ‖²
               (    y_2    )
   = argmin_x ( ‖y_1 − Rx‖² + ‖y_2‖² )
   = R^{−1} y_1.

The quantity ‖y − Ax̂‖² = ‖y_2‖² is termed the residual sum of squares (RSS).
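The computation in (1.15)-(1.16) is straightforward to reproduce. The following Python sketch (an illustration added to this transcript, not the monograph's own code; the function name is arbitrary) computes the least-squares estimator and the RSS through the QRD:

    import numpy as np
    from scipy.linalg import solve_triangular

    def olm_qrd(A, y):
        """Least-squares estimator of the OLM y = Ax + e via the QRD (1.15)-(1.16)."""
        m, n = A.shape
        Q, R = np.linalg.qr(A, mode='complete')   # Q is m x m orthogonal, R is m x n
        Qty = Q.T @ y
        y1, y2 = Qty[:n], Qty[n:]                 # conformable partition of Q^T y
        x = solve_triangular(R[:n, :n], y1)       # back-substitution in Rx = y1 (1.16)
        rss = y2 @ y2                             # RSS = ||y2||^2
        return x, rss

For a full column rank A the result coincides with (A^T A)^{−1} A^T y of (1.10), but the QR route avoids forming the cross-product matrix A^T A, whose condition number is the square of that of A.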


2.1.1 OLM WITH LINEAR EQUALITY RESTRICTIONS

Many regression models incorporate additional information in the form of restrictions (constraints) on the parameters of the model. In these cases the models are called restricted or constrained models. If the vector x of the OLM (1.7) is subject to k consistent restrictions expressed as

Cx = d,        (1.17)

where C is a k × n matrix and d is of order k, then the restricted least squares (RLS) solution is an n element vector x* satisfying

x* = argmin_{Cx=d} ‖y − Ax‖².        (1.18)

The assumptions for the matrix C are rank(C) = k and k < n.

[...]

...2(Mes1, Nes2) and, using regression analysis, the estimated execution time (seconds × 10⁻²) of Algorithm 1.6 is found to be

T1(M,N) = N(14.15 + 3.09N − 0.62N² + 5.71M + 3.67MN).

The above timing model includes the overheads, which arise mainly from the reference to the submatrix A_{i:,i:} in line 3. This matrix reference results in the assignment of an array section of A to a temporary array and then, when the procedure transform in line 6 has been completed, the reassignment of the temporary array to A. The overheads can be reduced by referencing a submatrix of A only if it uses fewer memory layers than a previously extracted submatrix (see for example the Modified Gram-Schmidt algorithm). This slight modification improves significantly the execution time of the algorithm, which now becomes

T2(M,N) = N(14.99 + 2.09N − 0.20N² + 3.19M + 1.17MN).

The accuracy of the timing models is illustrated in Table 1.1.
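Since the timing models are simple polynomials in M and N, their accuracy can be checked directly against the measured entries of Table 1.1. A small Python sketch (added here for illustration, using only the coefficients quoted above):

    # Timing models of Algorithm 1.6 on the MasPar MP-1208,
    # in units of 10^-2 seconds as quoted in the text.
    def T1(M, N):
        return N * (14.15 + 3.09*N - 0.62*N**2 + 5.71*M + 3.67*M*N)

    def T2(M, N):
        return N * (14.99 + 2.09*N - 0.20*N**2 + 3.19*M + 1.17*M*N)

    # For a 1280 x 192 matrix (M = 10, N = 3), cf. the first row of Table 1.1:
    print(T1(10, 3) / 100)   # 5.55 s predicted (measured 5.48 s)
    print(T2(10, 3) / 100)   # 2.59 s predicted (measured 2.58 s)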


4.3 THE GRAM-SCHMIDT METHOD

As in the case of the Householder algorithm, the performance of the straightforward implementation of the MGS method will be significantly impaired by the overheads. Therefore, the n = Nes2 steps of the MGS method are used in N stages. At the ith stage, es2 steps are used to orthogonalize the (i − 1)es2 + 1 to ies2 columns of A and also to construct the corresponding rows of R. Each step of the ith (i = 1, ..., N) stage has the same execution time, namely φ1(Mes1, (N − i + 1)es2). Thus, the execution time to apply all Nes2 steps of the MGS method is given by

T̃3(Mes1, Nes2) = es2 Σ_{i=1}^{N} φ1(Mes1, (N − i + 1)es2).

The data parallel MGS orthogonalization method is given in Algorithm 1.7, where A is overwritten by Q1 ≡ Q̃, the orthogonal basis of A. The total execution time of Algorithm 1.7 is given by

T3(M,N) = N(9.15 + 3.12N − 0.01N² + 4.95M + 1.31MN).

Algorithm 1.7 The MGS method for computing the QRD on SIMD systems.
1: def MGS_QRD(A, Mes1, Nes2) =
2:   for i = 1, ..., Nes2 with steps es2 do
3:     apply orthogonal(A:,i:, Ri:,i:, Mes1, (N − i + 1)es2)
4:   end for
5: end def
6: def orthogonal(A, R, m, n) =
7:   for i = 1, ..., es2 do
8:     Ri,i := sqrt(sum(A:,i * A:,i))
9:     A:,i := A:,i / Ri,i
10:    forall (j = i + 1 : n) W:,j := A:,i * A:,j
11:    Ri,i+1: := sum(W:,i+1:, 1)
12:    forall (j = i + 1 : n) A:,j := A:,j − Ri,j * A:,i
13:  end for
14: end def
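For readers who wish to experiment with the method, the following NumPy sketch (an added illustration, not the monograph's F-PLUS code; the function name is ours) mirrors the inner procedure orthogonal of Algorithm 1.7: each column is normalized and the remaining columns are updated with data-parallel array operations rather than explicit inner loops:

    import numpy as np

    def mgs_qrd(A):
        """Modified Gram-Schmidt QRD: A (m x n) is overwritten by Q1; R is returned."""
        A = A.astype(float).copy()
        m, n = A.shape
        R = np.zeros((n, n))
        for i in range(n):
            R[i, i] = np.sqrt(np.sum(A[:, i] * A[:, i]))    # line 8 of Algorithm 1.7
            A[:, i] /= R[i, i]                               # line 9
            W = A[:, i][:, None] * A[:, i+1:]                # line 10, all j at once
            R[i, i+1:] = W.sum(axis=0)                       # line 11
            A[:, i+1:] -= np.outer(A[:, i], R[i, i+1:])      # line 12
        return A, R

Then Q1, R = mgs_qrd(A) satisfies A ≈ Q1 @ R with Q1 having orthonormal columns.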

It can be seen from Table 1.1 that the (improved) Householder method performs better than the MGS method. The difference in the performance of the two methods arises mainly because, at the ith step, the MGS and Householder methods work with m × (n − i + 1) and (m − i + 1) × (n − i + 1) matrices, respectively. An analysis of T2(M,N) and T3(M,N) reveals that, for M > N, the MGS algorithm is expected to perform better than the Householder algorithm only when N = 1 and M = 2.

4.4 THE GIVENS ROTATION METHOD

A Givens rotation, when applied from the left of a matrix, affects only two of its rows; thus a number of them can be applied simultaneously. This particular feature underpins the development of parallel Givens algorithms for solving a range of matrix factorization problems [29, 30, 69, 94, 95, 102, 103, 129].

The orthogonal matrix Q^T in (1.32a) is the product of a sequence of Compound Disjoint Givens Rotations (CDGRs), with each compound rotation reducing to zero elements of A below the main diagonal while preserving previously annihilated elements. Figure 1.5 shows two sequences of CDGRs for computing the QRD of a 12 x 6 matrix, where a numerical entry denotes an element annihilated by the corresponding CDGR. The first Givens sequence was developed by Sameh and Kuck [129]. This sequence - the SK sequence - applies a total of m + n − 2 CDGRs to triangularize an m × n matrix (m > n), compared to the n(2m − n − 1)/2 Givens rotations needed when the serial Algorithm 1.2 is used. The elements are annihilated by rotating adjacent rows. The second Givens sequence - the Greedy sequence - applies fewer CDGRs than the SK sequence but, when it comes to implementation, the advantage of the Greedy sequence is offset by the communication overheads arising from the construction and application of the compound rotations [30, 67, 102, 103]. For m ≫ n, the Greedy sequence applies approximately log m + (n − 1) log log m CDGRs.

Figure 1.5. Examples of Givens rotations schemes for computing the QRD: (a) the SK sequence; (b) the Greedy sequence.
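As a small illustration of the building block involved (added here for clarity; the monograph works with compound rotations on SIMD hardware, and the function names below are ours), the following Python sketch constructs a single Givens rotation and uses it to annihilate one element by rotating two adjacent rows, as in the SK sequence:

    import numpy as np

    def givens(a, b):
        """Return c, s such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
        if b == 0.0:
            return 1.0, 0.0
        r = np.hypot(a, b)
        return a / r, b / r

    def rotate_adjacent_rows(A, i, j):
        """Annihilate A[i, j] by rotating rows i-1 and i of A (in place)."""
        c, s = givens(A[i - 1, j], A[i, j])
        A[[i - 1, i], :] = np.array([[c, s], [-s, c]]) @ A[[i - 1, i], :]
        return A

A CDGR groups many such disjoint rotations (rotations sharing no rows) into one parallel step.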

The adaptation, implementation and performance evaluation of the SK sequence to compute various forms of orthogonal factorizations on SIMD systems will be discussed in the subsequent chapters. On the MP-1208, the execution time of computing the QRD of an Mes1 × Nes2 matrix using the SK sequence is found to be

T4(M,N) = N(25.64 + 5.51N − 7.94N² + 11.1M + 15.99MN) + 41.96M.

4.5 COMPUTATIONAL RESULTS

The Householder factorization method is found to be the most efficient in terms of speed, followed by the MGS algorithm, which is only slightly slower than the data parallel Householder algorithm. Use of the SK sequence produces by far the worst performance.

The comparison of the performances of the data parallel implementations was made using accurate timing models. These models provide an effective tool for measuring the computational speed of algorithms and they can also be used to reveal inefficiencies of parallel implementations [80]. Comparisons with performance models of various algorithms implemented on other similar SIMD systems demonstrate the scalability of the execution time models [67, 75]. If the dimensions of the data matrix A do not satisfy the assumption that they are multiples of the size of the physical array processor, then the timing models can be used to give a range of the expected execution times of the algorithms.

Table 1.1. Times (in seconds) of computing the QRD of a 128M x 64N matrix. For each algorithm the measured execution time and the timing-model prediction (model value × 10⁻²) are given.

            Algor. 1.6        Improved Algor. 1.6   Algor. 1.7         Algor. SK
 M   N    Exec.   T1×10⁻²    Exec.   T2×10⁻²      Exec.   T3×10⁻²    Exec.   T4×10⁻²
10   3     5.48     5.55      2.58     2.59        3.21     3.22      21.16    21.04
10   7    22.15    22.34      9.28     9.34       12.07    12.08      67.73    67.59
10   9    33.80    34.09     13.80    13.92       18.49    18.48      92.65    92.62
14   5    17.48    17.54      7.36     7.35        9.30     9.30      62.27    62.36
14   9    47.86    48.03     18.75    18.86       24.52    24.50     150.05   150.11
14  13    90.30    90.56     34.38    34.54       46.64    46.64     242.63   242.68
18   5    22.34    22.35      9.14     9.15       11.60    11.59      82.03    82.25
18   9    61.90    61.98     23.77    23.79       30.68    30.52     207.47   207.61
18  17   189.16   189.03     69.42    69.31       94.41    94.26     503.44   503.51
22   7    48.61    48.71     19.10    18.90       23.98    23.93     175.85   175.96
22  15   188.79   188.49     68.88    68.57       89.86    89.82     585.56   585.64
22  19   287.23   286.33    103.31   102.81      138.59   138.27     805.45   805.71

5 QRD OF LARGE AND SKINNY MATRICES

The development of SIMD algorithms to compute the QRD when matrices do not have dimensions which are multiples of the physical array processor size is considered [90]. Implementation aspects of the QRD algorithm from the Cambridge Parallel Processing (CPP) linear algebra library (LALIB) are investigated [19]. The LALIB QRD algorithm is a data-parallel version of the serial Householder algorithm proposed by Bowgen and Modi [17]. The performances of Algorithm 1.6 and the QRD LALIB routine are compared. A second Householder algorithm, which is efficient for skinny matrices, is also proposed.

5.1 THE CPP GAMMA SIMD SYSTEM

The Cambridge Parallel Processing (CPP) GAMMA series has a Master Control Unit (MCU) and 1024 or 4096 Processing Elements (PEs) arranged in a 2-D square array. It has an interconnection network for PE-to-PE communication and for broadcast between the MCU and the PEs. The GAMMA SIMD systems are based on fine grain massively parallel computer systems known as the AMT DAP (Distributed Array of Processors) [116, 118].

A macro assembler called APAL (Array of Processors Assembly Language) is available to support low-level programming of the GAMMA-1. Two high level language systems are also available for the GAMMA-1. These are extended versions of Fortran (called Fortran-Plus enhanced or, for short, F-PLUS) and C++. These languages interact with the language that the user selects to run on the host machine, typically Fortran or C [1, 2]. Both high level languages allow the programmer to assume the availability of a virtual processor array of arbitrary size. As in the MasPar, using the default cyclic distribution, an m × n matrix is mapped on the PEs using ⌈m/es⌉⌈n/es⌉ layers of memory, while an m-element vector is mapped on the PEs using ⌈m/es²⌉ layers of memory, where es × es (es = 32 or es = 64) is the dimension of the SIMD array processor. An m × n matrix can also be considered as an array of n m-element column vectors (parallelism in the first dimension) or m n-element row vectors (parallelism in the second dimension), requiring respectively n⌈m/es²⌉ and m⌈n/es²⌉ layers of memory to map the matrices onto the PEs [25].

In most non-trivial cases, the complexity of performing a computation on an array is not reduced if some of the PEs are disabled, since the disabled PEs will become idle only during the assignment process. In such cases the programmer is responsible for avoiding computations on unaffected submatrices. To illustrate this, let h = (h_1, ..., h_m) and u = (u_1, ..., u_n) be real vectors, L ≡ (l_1, ..., l_n) a logical vector and A an m × n real matrix. The F-PLUS statement

u(L) = sumr(matc(h,n) * A)        (1.45)

is equivalent to the HPF statement

forall(i = 1 : n, l_i = true) u_i = sum(h * A:,i)


which computes the inner-product u_i = h^T A:,i for all i where l_i has value true. In Fortran-90 the F-PLUS functions sumr(A), matc(h,n) and matr(u,m) can be expressed as sum(A,1), spread(h,2,n) and spread(u,1,m), respectively. The main difference, however, between F-PLUS and HPF is that the F-PLUS statement computes all the inner-products h^T A and then assigns the results simultaneously to the elements of u where the corresponding elements of L have value true. This difference may cause degradation of the performance with respect to execution speed if the logical vector L has a significant number of false values. Consider, for example, the three cases where (i) all elements of L have a true value, (ii) the first n/2 elements of L have a true value and (iii) only the first element of L has a true value. For m = 1000 and n = 500, the execution time in msec for computing (1.45) on the 1024-processor GAMMA-1 (hereafter abbreviated to GAMMA-1) for all three cases is 249.7, while, without masking, the time required to compute all inner-products is given by 247.79. Explicitly performing operations only on the affected elements of u, the execution times (including overheads) in cases (ii) and (iii) are found to be 147.84 and 13.21, respectively. This example shows the degradation in performance that might occur when implementing an algorithm without taking into consideration the systems software of the particular parallel computer.
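The semantic difference can be reproduced with NumPy boolean masks (an added sketch; the variable names are illustrative, not from the monograph): the first variant computes every inner-product and merely masks the assignment, as F-PLUS does, while the second restricts the computation itself to the selected columns:

    import numpy as np

    m, n = 1000, 500
    h = np.random.rand(m)
    A = np.random.rand(m, n)
    L = np.zeros(n, dtype=bool)
    L[0] = True                   # case (iii): only the first element is selected
    u = np.zeros(n)

    # F-PLUS style: all n inner-products are formed, then the assignment is masked.
    u[L] = (h @ A)[L]

    # Restricted style: compute only the inner-products that are actually needed.
    u[L] = h @ A[:, L]

On a SIMD machine the first form wastes the work performed by the unselected PEs, which is the effect measured above.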

5.2 THE HOUSEHOLDER QRD ALGORITHM

The CPP LALIB implementation of the QRD Householder algorithm could be considered a straightforward one. Initially the algorithm was implemented on the AMT DAP using an earlier version of F-PLUS which required the data matrix to be partitioned into submatrices having the same dimensions as the array processor [17]. The re-implementation of the algorithm using the new F-PLUS has removed this constraint. Algorithm 1.8 shows broadly how the Householder method has been implemented in this library routine for computing the QRD. The information needed for generating the orthogonal matrix Q is stored in the annihilated parts of A and in two n-element vectors. For simplicity Algorithm 1.8 ignores this; neither does it emphasize other details of the LALIB QRD subroutine QR_FACTOR, such as those dealing with overflow, that do not play an important role in the performance of the algorithm [19].

Clearly the performance of Algorithm 1.8 is dominated by the computations in the 10th and 11th lines, while computations on logical arrays and scalars are less significant. The computation of the Euclidean norm of the m-element vector h in line 5 is a function of ⌈m/es²⌉ and is therefore important only for large matrices, where m ≫ es². Notice that the first i − 1 elements of h are zero and the corresponding rows and columns of A remain unchanged. Thus the computations in lines 5, 10 and 11 can be written as follows:

u_{i:n} := sumr(matc(h_{i:m}, n − i + 1) * A_{i:m,i:n}) / b_i
A_{i:m,i:n} := A_{i:m,i:n} − matc(h_{i:m}, n − i + 1) * matr(u_{i:n}, m − i + 1)

Algorithm 1.8 The CPP LALIB method for computing the QR Decomposition.
1: L := true; M := true
2: for i = 1, 2, ..., n do
3:   h := 0
4:   h(L) := A:,i
5:   σ := sqrt(sum(h * h))
6:   if h_i ...

[...]

... where A_i ∈ ℝ^{m×n_i} (m > n_i) is the exogenous full column rank matrix in the ith regression equation, Q_i is an m × m orthogonal matrix and R_i is an upper triangular matrix of order n_i [78]. The fast simultaneous computation of the QRDs (1.47) is considered.

6.1 EQUAL SIZE MATRICES

Consider, initially, the case where the matrices A_1, ..., A_G have the same dimension, that is, n_1 = ... = n_G = n. The equal-size matrices suggest that a 3-D array could be employed. The m × n data matrices A_1, ..., A_G and the upper triangular factors R_1, ..., R_G can be arranged in an m × n × G array A and the n × n × G array R, respectively. Using a 2-D mapping, computations performed on scalars, 1-D and 2-D arrays correspond to computations on 1-D, 2-D and 3-D arrays when a 3-D mapping is used. Thus, in theory, the advantage over a 2-D mapping is that a 3-D arrangement will increase the level of parallelism.

The algorithms have been implemented on the 8192-processor MasPar MP-1208, using the high level language MasPar-Fortran. On the MasPar, the 3-D arrangement of the equal-size matrices is mapped on the 2-D array of PEs plus memory, with computations over the third dimension being performed serially. This indicates that under a 3-D arrangement the increase in parallelism will not be as large as is theoretically expected.

The indexing expressions of 2-D matrices and the replication and reduction functions can be used in a 3-D framework. That is, the function spread, which replicates an array by adding a dimension, and the function sum, which adds all of the elements of an array along a specified direction, can be used. For example, if B and C are m × n and m × n × G arrays respectively, then C := spread(B,3,G) implies that, for all k, C:,:,k = B, and B := sum(C,3) is equivalent to B(i,j) = Σ_{k=1}^{G} C_{i,j,k}, whereas sum(C) has a scalar value equal to the sum of all of the elements of C.
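These replication and reduction primitives correspond directly to NumPy operations; a brief sketch of that correspondence (added for illustration):

    import numpy as np

    m, n, G = 4, 3, 5
    B = np.arange(m * n, dtype=float).reshape(m, n)

    C = np.repeat(B[:, :, None], G, axis=2)   # C := spread(B, 3, G): C[:, :, k] = B for all k
    B2 = C.sum(axis=2)                        # B := sum(C, 3): B2[i, j] = sum over k of C[i, j, k]
    total = C.sum()                           # sum(C): scalar sum of all elements

    assert np.allclose(B2, G * B)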


The first method for computing the QR factorization of a matrix employs a sequence of Householder reflections H = I − hh^T/b, where b = h^T h/2. The application of H to the data matrix A_i involves the vector-matrix computation z^T = h^T A_i / b and a rank-one update A_i − hz^T. Both of these operations can be efficiently computed on an SIMD array processor using the replication function spread and the reduction function sum. The SIMD implementation of the Householder algorithm for computing the QRDs (1.47) simultaneously is illustrated in Algorithm 1.10. A total of n Compound Householder Transformations (CHTs) are applied. The ith CHT produces the ith rows of R_1, ..., R_G without affecting the first i − 1 columns and rows of A_1, ..., A_G. The simultaneous data parallel vector-matrix computations and rank-one updates are shown respectively in lines 12-14 of Algorithm 1.10; a sketch of one such compound step is given below.
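Before the F-PLUS style listing, here is what one compound reflection step looks like in NumPy terms (an added sketch under the 3-D arrangement described above, with the G matrices stacked along a third axis; the function name is ours):

    import numpy as np

    def apply_reflection(A):
        """Apply one compound Householder step to the stacked m x n x G array A (in place).

        For each slice A[:, :, k] the reflection H = I - h h^T / b, with b = h^T h / 2,
        is chosen to annihilate the subdiagonal of the first column.
        """
        m, n, G = A.shape
        H = A[:, 0, :].copy()                  # the G pivot columns (m x G)
        S = np.sqrt((H * H).sum(axis=0))       # their Euclidean norms
        H[0, :] += np.sign(A[0, 0, :]) * S     # h = a_1 + sign(a_11) * ||a_1|| * e_1
        b = (H * H).sum(axis=0) / 2.0          # b = h^T h / 2, one value per matrix
        Z = np.einsum('mk,mnk->nk', H, A) / b  # z^T = h^T A_k / b for every k at once
        A -= np.einsum('mk,nk->mnk', H, Z)     # rank-one updates A_k - h z^T
        return A

Algorithm 1.10 realizes the same computation with the F-PLUS replication and reduction primitives.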

Algorithm 1.10 The Householder algorithm.
1: def Househ_QRD(A, m, n, G) =
2:   for i = 1, ..., n do
3:     apply transform(A_{i:,i:,:}, m − i + 1, n − i + 1, G)
4:   end for
5: end def
6: def transform(A, m, n, G) =
7:   H := A:,1,:
8:   S := sqrt(sum(H * H, 1))
9:   where (H_{1,:} ...

[...]

... where A ∈ ℝ^{m×n} (m > n) is the exogenous data matrix, y ∈ ℝ^m is the response vector and ε ∈ ℝ^m is the noise vector with zero mean and dispersion matrix σ²I_m. The least squares estimator of the parameter vector x ∈ ℝ^n,

argmin_x ε^T ε = argmin_x ‖Ax − y‖²,        (2.2)

has an infinite number of solutions when A does not have full rank. However, a unique minimum 2-norm estimator of x, say x̂, can be computed.

Let the rank of A be given by k (k ≤ n). The solution of (2.2) is computed in two stages. In the first stage the coefficient matrix A is reduced to a lower trapezoidal form; in the second stage the lower trapezoid is triangularized. The orthogonal decompositions of the first and second stages are given, respectively, by

Q^T A Π = ( 0 )  m − k
          ( L̃ )  k        (2.3)

and

(L_1  L_2) P = ( 0  L ),        (2.4)

where L̃ ≡ (L_1  L_2) and the column blocks of (2.4) have n − k and k columns,


where Q ∈ ℜ^{m×m} and P ∈ ℜ^{n×n} are orthogonal, L and L_2 are lower-triangular and non-singular, and Π ∈ ℜ^{n×n} is a permutation matrix. That is,

                     n−k  k
    Q^T A Π P  =  (   0    0 )  m−k                                        (2.5)
                  (   0    L )  k

The orthogonal decompositions (2.3) and (2.5) are called the QL decomposition (QLD) and the complete QLD, respectively. The minimum 2-norm best linear unbiased estimator of x is given by

    x̂ = Π P_2 L^{−1} Q_2^T y,

where P_2 comprises the last k columns of P and Q_2 the last k columns of Q.
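A minimal NumPy/SciPy sketch of this two-stage computation (ours; it uses the more common QR with column pivoting and an RQ factorization in place of the QLD, and a caller-supplied tolerance tau for the rank decision):

    import numpy as np
    from scipy.linalg import qr, rq

    def min_norm_solution(A, y, tau=1e-10):
        m, n = A.shape
        Q, R, piv = qr(A, pivoting=True)           # first stage: A[:, piv] = Q R
        k = int(np.sum(np.abs(np.diag(R)) > tau))  # numerical rank decision
        T, Z = rq(R[:k, :], mode='economic')       # second stage: R[:k, :] = T Z
        w = np.linalg.solve(T, Q[:, :k].T @ y)     # triangular solve
        x = np.empty(n)
        x[piv] = Z.T @ w                           # minimum 2-norm solution
        return x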

Numerous methods have been proposed for computing the orthogonal factorizations (2.3) and (2.4), on both serial computers and MIMD parallel systems [16, 18, 51, 93].

Algorithms are designed, implemented and analyzed for computing the complete QLD on the CPP DAP 510 massively parallel SIMD computer (abbreviated to DAP) [70]. The algorithms employ Householder reflections and Givens plane rotations. Algorithms are also proposed for reconstructing the orthogonal matrices involved in the decompositions when the data which define the orthogonal transformations are stored in the annihilated parts of the coefficient matrix A. The implementation and execution time models of all algorithms on the DAP are considered in detail. All of the algorithms were implemented on the 1024-processor DAP using double precision arithmetic. The timing models are expressed in msec.

2 THE QLD OF THE COEFFICIENT MATRIX

The computation of the QLD (2.3) using Householder reflections with column pivoting is considered. This method is also used when A is of full column rank but ill-conditioned. Let the elementary permutation matrix I_n^{(i,μ_i)} denote the identity n × n matrix I_n with columns n − i + 1 and μ_i interchanged, and let

    Q_i^H = I_m − h^{(i)} h^{(i)T} / b_i                                   (2.6)

denote an m × m Householder matrix which annihilates the first m − i elements of A_{:,n−i+1} (the pivot column) when it premultiplies A.


The matrices Q and Π in (2.3) are defined by

    Q^T = ∏_{i=1}^k Q_{k−i+1}^H = Q_k^H Q_{k−1}^H ... Q_1^H

and

    Π = ∏_{i=1}^k I_n^{(i,μ_i)} = I_n^{(1,μ_1)} I_n^{(2,μ_2)} ... I_n^{(k,μ_k)}.

To describe briefly the process of computing the QLD (2.3) let, at the ith (0 ≤ i ≤ k) step,

    A^{(i)} = Q_i^H ... Q_1^H A I_n^{(1,μ_1)} ... I_n^{(i,μ_i)}
                   n−i        i
            = (  Ã^{(i)}      0       )  m−i
              (  B^{(i)}   L̃^{(i)}   )  i

where L̃^{(i)} is non-singular and lower-triangular with its diagonal elements in increasing order of magnitude. The value of μ_{i+1} is the index of the column of Ã^{(i)} with maximum Euclidean norm. The criterion used to decide that rank(A) = k is ||Ã^{(k)}_{:,μ_{k+1}}||_2 < τ, where Ã^{(k)}_{:,μ_{k+1}} is the μ_{k+1}th column of the (m−k) × (n−k) block Ã^{(k)}, and τ is an absolute tolerance parameter whose value depends on the scaling of A [51, 63, 93]. The value of τ is assumed to be given.

The permutation matrix Π can be stored and computed using one of the two n-element integer vectors ξ and ζ. A permutation I_n^{(i,μ_i)} is equivalent to swapping first the elements ξ_{n−i+1} and ξ_{μ_i} of the ξ vector and then swapping the elements n − i + 1 and μ_i of the ζ vector, where, initially, ξ_i = ζ_i = i (i = 1, ..., n).
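A small sketch (ours; the printed names of the two vectors are illegible in this copy, so p and inv below are placeholders) of recording Π by swaps on an index vector and its inverse rather than by matrix products:

    import numpy as np

    def swap_columns(p, inv, a, b):
        # record the interchange of columns a and b (0-based indices)
        p[a], p[b] = p[b], p[a]
        inv[p[a]], inv[p[b]] = a, b
        return p, inv

    n = 6
    p = np.arange(n)                         # initially p[i] = inv[i] = i
    inv = np.arange(n)
    p, inv = swap_columns(p, inv, n - 1, 2)  # e.g. I_n^{(1,mu_1)} with mu_1 = 3
    Pi = np.eye(n)[:, p]                     # A @ Pi is the same as A[:, p]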

2.1 SIMD IMPLEMENTATION

The QLD (2.3) has been computed on the DAP under the assumption that n = N ES and m = M ES, where ES is the edge size of the processor array. [...]

[...] and T_0(n,n) = 2n − 3. (3.22) is found empirically to be minimized for p_i having values closer to n. For simplicity let λ be an integer such that k = λn and ⌈log₂ λ⌉ = ⌈log₂(λ + 1)⌉. This implies that (3.22) is minimized if, for all i, p_i = n, and T_1(n,k,λ,p) is simplified to

    T_2(n,λ) = 2n − 3 + n⌈log₂(λ + 1)⌉.

Figure 3.4 illustrates the computation of (3.6a) using the bitonic algorithm, where n = 6, k = 18 and p_i = 6 (i = 1,2,3). The bold frames show the partition


    (D^T  R^T)^T = (D_1^T  D_2^T  D_3^T  R^T)^T.

Initially the QRDs of D_1, D_2 and D_3 are computed simultaneously and, at stages i = 1,2, the updating is completed by computing (3.21), where g = 2.
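A minimal sketch of this merging pattern (ours, assuming 2^g − 1 dense blocks so that, together with R, the number of triangular factors halves at every stage; np.linalg.qr stands in for the book's Householder and Givens kernels):

    import numpy as np

    def bitonic_update(R, D_blocks):
        # R: n x n upper triangular; D_blocks: list of 2**g - 1 dense blocks.
        # First triangularize each block (cf. (3.20)), then merge the 2**g
        # triangular factors pairwise in g stages (cf. (3.21)).
        tris = [R] + [np.linalg.qr(D, mode='r') for D in D_blocks]
        while len(tris) > 1:
            tris = [np.linalg.qr(np.vstack(pair), mode='r')
                    for pair in zip(tris[0::2], tris[1::2])]
        return tris[0]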

Figure 3.4. The bitonic algorithm, where n = 6, k = 18 and p_1 = p_2 = p_3 = 6. (Three panels show the annihilation steps: compute (3.20); compute (3.21) for i = 1; compute (3.21) for i = 2.)

The number of CDGRs applied to update the QRD using the UGSs is given by

    T_3(n,k) = n + k − 1.

Ignoring additive constants, it follows that T_3(n,λn)/T_2(n,λ) ≈ (λ + 1)/(2 + ⌈log₂(λ + 1)⌉). This indicates the efficiency of the bitonic algorithm for computing (3.6a) for λ > 2, compared with that when using the UGSs.

The second parallel strategy for solving the updating problem is a slight modification of the Greedy annihilation scheme in [30, 103]. Taking as before n = 6 and k = 18, Fig. 3.5 indicates the order in which the elements are annihilated. Observing that the elements in the diagonal of R are annihilated


by successive rotations, it follows that at most k + n − 1 CDGRs are required to compute (3.6a). An approximation to the number of CDGRs required to compute (3.6a) when n is fixed and k approaches infinity is given by

    T_4(n,k) = log₂ k + (n − 1) log₂ log₂ k.

The derivation of this approximation has been given in the context of computing the QRD and it is also found empirically to be valid for computing (3.6a) [103]. In general, for k ≫ n, the Greedy sequence requires fewer CDGRs than the bitonic method, while for small k (compared with n) the UGSs and Greedy sequence require the same number of CDGRs. Table 3.5 shows the number of CDGRs required to compute (3.6a) using the UGSs, the bitonic method and the Greedy sequence for some n and λ (k = λn and k ≫ n).

Figure 3.5. The Greedy sequence for computing (3.6a), where n = 6 and k = 18.

As regards implementation, the efficiency of the Greedy method is expected to be reduced significantly by the organizational overheads, so that the bitonic method is to be preferred [30, 103]. The simultaneous computations performed at each stage of the bitonic method make it suitable for distributed-memory architectures [36]. Each processing unit will perform the same matrix computations without requiring any inter-processor communications. The simultaneous QRD of the matrices D_1, ..., D_{2^g−1} on a SIMD system has been considered within the context of the SURE model estimation [84].


Table 3.5. Number of CDGRs required to compute the factorization (3.6a).

     n    λ   k = λn   UGSs   bitonic   Greedy
    15    5       75     89        72       43
    15   10      150    164        87       47
    15   20      300    314       102       50
    15   40      600    614       117       54
    30    5      150    179       147       89
    30   10      300    329       177       96
    30   20      600    629       207      102
    30   40     1200   1229       237      107
    60    5      300    359       297      187
    60   10      600    659       357      198
    60   20     1200   1259       417      208
    60   40     2400   2459       477      217
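The table is reproduced by the closed-form counts (our transcription: k + n − 1 for the UGSs, T_2 for the bitonic method, and the Greedy column agreeing with ⌊T_4⌋):

    import math

    def t_ugs(n, k):
        return k + n - 1                      # UGSs column

    def t_bitonic(n, lam):
        return 2 * n - 3 + n * math.ceil(math.log2(lam + 1))   # T_2(n, lambda)

    def t_greedy(n, k):
        return math.floor(math.log2(k) + (n - 1) * math.log2(math.log2(k)))  # T_4

    for n in (15, 30, 60):
        for lam in (5, 10, 20, 40):
            k = lam * n
            print(n, lam, k, t_ugs(n, k), t_bitonic(n, lam), t_greedy(n, k))
    # the first line printed is: 15 5 75 89 72 43, matching Table 3.5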

In this case the performance of the Householder algorithm was found to be superior to that of the Givens algorithm (see Chapter 1). The simultaneous factorizations (3.21) have been implemented on the MasPar within a 3-D framework, using Givens rotations and Householder reflections. The Householder algorithm applies the reflections H^{(1,j)}, ..., H^{(n,j)}, where H^{(l,j)} annihilates the non-zero elements of the lth column of D̃_j^{(i−1)} using the lth row of R̃_j^{(i−1)} as a pivot row (l = 1, ..., n).
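The following NumPy sketch (ours) shows this reflection pattern for a single dense block D and triangular R, using the chapter-1 form H = I − hh^T/b with b = h^T h/2; the sign choice and the zero-column guard are our additions:

    import numpy as np

    def householder_block_update(R, D):
        # the l-th reflection zeroes column l of D, pivoting on row l of R
        R, D = R.copy(), D.copy()
        n = R.shape[0]
        for l in range(n):
            v = np.concatenate(([R[l, l]], D[:, l]))
            alpha = -np.copysign(np.linalg.norm(v), v[0])
            h = v.copy()
            h[0] -= alpha                    # Householder vector
            b = h @ h / 2.0
            if b == 0.0:                     # column of D already zero
                continue
            z = (h[0] * R[l, l:] + h[1:] @ D[:, l:]) / b
            R[l, l:] -= h[0] * z             # pivot row update
            D[:, l:] -= np.outer(h[1:], z)   # rank-one update of the block
        return R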

Table 3.6. Times (in seconds) for computing the orthogonal factorization (3.6a).

      n   g   bitonic Householder   Householder   bitonic Givens    UGS-1
     64   2                  0.84          0.23             1.29     1.45
     64   3                  1.55          0.35             2.34     2.46
     64   5                  4.83          0.82             7.97     9.04*
    192   2                  4.15          1.78             8.98     9.77
    192   3                  7.78          2.98            18.21    19.45
    192   5                 27.12         10.27            69.96    76.96*
    320   2                 10.15          5.51            27.96    32.41
    320   3                 19.69         10.05            58.78    67.51*
    320   5                 72.12         37.48           236.13   278.20*

    * Estimated times.

Table 3.6 shows the execution times of the various algorithms for computing (3.6a) on the 8192-processor MasPar using single precision arithmetic. Clearly the bitonic algorithm based on Householder transformations performs better than the bitonic algorithm based on CDGRs. However, the straightforward


data-parallel implementation of the Householder algorithm is found to be the fastest of all. The degradation in the performance of the bitonic algorithm is due mainly to the large number of simultaneous matrix computations which are performed serially in the 2-D array MasPar processor [84]. The bitonic algorithm based on CDGRs performs better than the direct implementation of UGS-1 because of the initial triangularization of the submatrices D_1, ..., D_{2^g−1} using Householder transformations.

2.3 UPDATING WITH A MATRIX HAVING A BLOCK LOWER-TRIANGULAR STRUCTURE

Computational and numerical methods for deriving the estimators of structural equations models require the updating of a lower-triangular matrix with a matrix having a block lower-triangular structure. Within this context the updating problem can be expressed as the computation of the orthogonal factorization

    P̃^T ( Ã^{(1)} )  =  ( 0 )   E − e_1
         ( A^{(1)} )     ( L̃ )   (G−1)K − E + e_G                          (3.23)

where

    Ã^{(1)} =  ( Ã^{(1)}_{2,1}        0          ...        0           )  e_2
               ( Ã^{(1)}_{3,1}   Ã^{(1)}_{3,2}   ...        0           )  e_3
               (      ...             ...                   ...         )   .
               ( Ã^{(1)}_{G,1}   Ã^{(1)}_{G,2}   ...  Ã^{(1)}_{G,G−1}   )  e_G

with column blocks of widths K − e_1, K − e_2, ..., K − e_{G−1},

    A^{(1)} =  ( L_1                  0            ...      0       )  K − e_1
               ( A^{(1)}_{2,1}        L_2          ...      0       )  K − e_2
               (      ...             ...                   ...     )    .
               ( A^{(1)}_{G−1,1}   A^{(1)}_{G−1,2} ...   L_{G−1}    )  K − e_{G−1}

with the same column partitioning, the L_i (i = 1, ..., G − 1) are lower triangular, and E = Σ_{i=1}^G e_i.

The factorization (3.23) can be computed in G − 1 stages, where each stage annihilates a block-subdiagonal, with the first stage annihilating the main


block-diagonal. At the ith (i = 1, ..., G − 1) stage the orthogonal factorizations

    P_{i,j}^T ( L̃_j^{(i)}       )  =  ( L̃_j^{(i+1)} )                    (3.24)
              ( Ã^{(i)}_{i+j,j} )     (     0       )

are computed simultaneously for j = 1, ..., G − i, where the L̃_j^{(i+1)} matrix is lower triangular (L̃_j^{(1)} = L_j) and P_{i,j} is a (K − e_j + e_{i+j}) × (K − e_j + e_{i+j}) orthogonal matrix.

It follows that the triangular matrix L̃ in (3.23) is given by

    L̃ = ( L̃_1^{(G)}            0             ...        0          )
         ( A^{(G−1)}_{2,1}    L̃_2^{(G−1)}     ...        0          )
         (      ...               ...                    ...        )
         ( A^{(2)}_{G−1,1}    A^{(2)}_{G−1,2}  ...  L̃_{G−1}^{(2)}   )

Therefore, if T_{D1}(e,K,i,j) denotes the number of CDGRs required to compute the factorization (3.24) using this method (hereafter called the diagonally-based method), then the total number of CDGRs needed to compute (3.23) is given by

    T_D(e,K,G) = Σ_{i=1}^{G−1} max_{j=1,...,G−i} T_{D1}(e,K,i,j),          (3.25)

where e = (e_1, ..., e_G). Figure 3.6 shows the annihilation process for computing the factorization (3.23), where G = 5 and each submatrix is labelled by the stage i (i = 1, ..., G − 1) at which it is eliminated.

Figure 3.6. Computing the factorization (3.23) using the diagonally-based method, where G = 5. (Stages 1-4 each annihilate one block-subdiagonal.)
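A one-kernel sketch of (3.24) (ours; it realises the QL-type factorization through a QR of the column-reversed stack rather than the book's Givens or Householder sequences; the G − i factorizations of a stage are mutually independent and can run concurrently):

    import numpy as np

    def merge_lower(Lt, Ab):
        # P^T [Lt; Ab] = [Lt_new; 0] with Lt, Lt_new lower triangular
        S = np.vstack((Lt, Ab))
        Ru = np.linalg.qr(S[:, ::-1], mode='r')   # upper-triangular factor
        return Ru[::-1, ::-1]                     # reversed back: lower triangular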

Figure 3.7 illustrates various annihilation schemes for computing the factorization (3.24) by showing only the zeroed matrix Ã^{(i)}_{i+j,j} and the lower-triangular L̃_j^{(i)} matrix, where e_{i+j} = 12 and K − e_j = 4. The annihilation schemes


are equivalent to those of block-updating the QRD, the only difference being that an upper-triangular matrix is replaced by a lower-triangular matrix [69, 75, 76, 81]. These annihilation schemes can be employed to annihilate different submatrices of Ã^{(1)}; that is, at step i (i = 1, ..., G − 1) of the factorization (3.23) the submatrices Ã^{(i)}_{i+1,1}, ..., Ã^{(i)}_{G,G−i} can be zeroed without using the same annihilation scheme. Assuming that only the UGS-2 or UGS-3 schemes are employed to annihilate each submatrix, the number of CDGRs given by (3.25) is

    T_{D1}(e,K,i,j) = K − e_j + e_{i+j} − 1.

Hence, the total number of CDGRs applied to compute the factorization (3.23) is given by

    T_D^{UGS}(e,K,G) = Σ_{i=1}^{G−1} max_{j=1,...,G−i} (K − e_j + e_{i+j} − 1)
                     = (G−1)(K−1) + Σ_{i=1}^{G−1} max_{j=1,...,G−i} (e_{i+j} − e_j).   (3.26)

Figure 3.7. Parallel strategies for computing the factorization (3.24). (Five panels: the UGS-2, UGS-3, bitonic, Greedy and Householder annihilation sequences.)

The factorization (3.23) is illustrated in Fig. 3.8 without showing the lower-triangular matrix A^{(1)}, where each submatrix of Ã^{(1)} is annihilated using only the UGS-2 or Greedy schemes, K = 10, G = 4 and e = (2,3,6,8). This particular example shows that both the schemes require the application of the same


number of CDGRs to compute the factorization. However, for problems where the number of rows far exceeds the number of columns in each submatrix, the Greedy method will require fewer steps than the other schemes.

Figure 3.8. Computing the factorization (3.23), where K = 10, G = 4 and e = (2,3,6,8). (Two panels: using only the UGS-2 scheme, and using only the Greedy scheme.)

The intrinsically independent annihilation of the submatrices in a block-subdiagonal of Ã^{(1)} makes this factorization strategy well suited for distributed-memory systems since it does not involve any inter-processor communication.


However, the diagonally-based method has the drawback that the computational complexity at stage i (i = 1, ..., G − 1) is dominated by the maximum number of CDGRs required to annihilate the submatrices Ã^{(i)}_{i+1,1}, ..., Ã^{(i)}_{G,G−i}. An alternative approach (called the column-based method) which removes this drawback is to start annihilating simultaneously the submatrices Ã_1, ..., Ã_{G−1}, where

    Ã_j = ( Ã^{(1)}_{j+1,j} )
          ( Ã^{(1)}_{j+2,j} )
          (       ...       )
          ( Ã^{(1)}_{G,j}   ),      j = 1, ..., G − 1.                     (3.27)

Consider the case of using the UGS-2 scheme. Initially UGS-2 is applied to annihilate the matrix Ã^{(1)} under the assumption that it is dense. As a result the steps within the zero submatrices are eliminated and the remaining steps are adjusted so that the sequence starts from step 1. Figure 3.9 shows the derivation of this sequence using the same problem dimensions as in Fig. 3.8. Generally, for ρ_1 = 1, ρ_j = ρ_{j−1} + 2e_j − K (1 < j < G) and η = min(ρ_1, ..., ρ_{G−1}), the annihilation of the submatrix Ã_i starts at step

    s_i = ρ_i − η + 1,      i = 1, ..., G − 1.

The number of CDGRs needed to compute the factorization (3.23) is given by

    T_C^{UGS}(e,K,G,η) = E + K − 2e_1 − η.                                 (3.28)

Comparison of T_D^{UGS}(e,K,G) and T_C^{UGS}(e,K,G,η) shows that, when the UGSs are used, the diagonally-based method never performs better than the column-based method. Both methods need the same number of steps in the exceptional case where G = 2.
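For the running example the two counts can be checked directly (our transcription of (3.26) and (3.28); e is 0-based in the code):

    def t_diag(e, K):                         # T_D via (3.26)
        G = len(e)
        return sum(max(K - e[j] + e[i + j] - 1 for j in range(G - i))
                   for i in range(1, G))

    def t_col(e, K):                          # T_C via (3.28)
        G = len(e)
        rho = [1]
        for j in range(1, G - 1):
            rho.append(rho[-1] + 2 * e[j] - K)
        return sum(e) + K - 2 * e[0] - min(rho)

    e, K = (2, 3, 6, 8), 10
    print(t_diag(e, K), t_col(e, K))          # 41 and 28: cf. Figs 3.8 and 3.9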

The column-based method employing the Greedy scheme is illustrated in Fig. 3.10. The first sequence is the result of directly applying the Greedy scheme on the Ã^{(i)}_{i+1,1}, ..., Ã^{(i)}_{G,G−i} submatrices. Let the columns of each submatrix be numbered from right to left, that is, in reverse order.

The number of elements annihilated by the qth (q > 0) CDGR in the jth (j = 0, ..., K − e_i) column of the ith submatrix Ã_i is given by

    r_j^{(i,q)} = ⌊(a_j^{(i,q)} + 1)/2⌋,

where a_j^{(i,q)} is defined as

    a_j^{(i,q)} = 0,                                                                   if j > q and j > K − e_i;
                = e_{i+1},                                                             if q = j = 1;
                = a_j^{(i,q−1)} + r_{j−1}^{(i,q−1)} − r_j^{(i,q−1)} + r_{K−e_{i−1}}^{(i−1,q−1)},   if j = 1 and q > 1;
                = a_j^{(i,q−1)} + r_{j−1}^{(i,q−1)} − r_j^{(i,q−1)},                   otherwise.


Figure 3.9. The column-based method using the UGS-2 scheme. (Two panels: the UGS-2 sequence for a dense Ã^{(1)}, and the Modified UGS-2 sequence derived from it.)

The sequence terminates at step q if, for all i and j, r_j^{(i,q)} = 0.

The second sequence in Fig. 3.10, called Modified Greedy, is generated from the application of the Greedy algorithm in [69] by employing the same technique used for deriving the column-based sequence from the UGS-2 scheme. Notice, however, that the second Greedy sequence does not correspond to, and is not as efficient as, the former sequence, which applies the Greedy method directly.


Figure 3.10. The column-based method using the Greedy scheme (first sequence) and the Modified Greedy scheme (second sequence).